Hi Everyone.
I think every one of us has, at least once, written a parser for some text format (CSV, XML, HTML, etc.), whether to learn the format, to learn how a language approaches the task, or just for fun.
I have done that many times, but I never had enough time or patience to bring the result to production-ready quality or to look for a better implementation. Nevertheless, I gained experience and have no regrets about it.
Here is my way of implementing a parser with Span, along with some tricks.
Everything demonstrated below was used to implement a reader for the 3D OBJ format.
ReadOnlySpan<T>
One of the best examples of Span in use is the JSON parser in .NET Core, System.Text.Json, which is built on ReadOnlySpan<byte>, but I want to look a little further than that.
First of all, we can use two types of ReadOnlySpan for parsing files: ReadOnlySpan<char> and ReadOnlySpan<byte>. Let's work out what they have in common when it comes to analyzing text.
// System.MemoryExtensions
int IndexOf<T>(this ReadOnlySpan<T> span, [NullableAttribute(1)] T value) where T : IEquatable<T>;
bool StartsWith<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>;
ReadOnlySpan<T> Trim /* TrimEnd | TrimStart */(this ReadOnlySpan<T> span, T trimElement) where T : IEquatable<T>;

// System.ReadOnlySpan
ReadOnlySpan<T> Slice(int start, int length);
These are not all of the available methods, but we will focus only on them. Their semantics are very similar to those of string, which has IndexOf/StartsWith/Trim, with Slice taking the place of Substring, so the approach to analyzing text is the same. First, we find some symbol with IndexOf, make a Slice of the necessary part, analyze it, and then repeat the process.
It looks like the following:
var LF = new[] { '\n' };
while (!span.IsEmpty)
{
    var endLine = span.IndexOf(LF);
    if (endLine == -1)
    {
        // always check, because it can be the end of the span
        endLine = span.Length;
    }
    var line = span.Slice(0, endLine);
    // analyze the line
    ...
    // step past the line feed itself; otherwise the next IndexOf returns 0
    // and the loop never advances
    span = span.Slice(Math.Min(endLine + 1, span.Length));
}
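To make the find/slice/repeat loop concrete, here is a minimal self-contained sketch. The class and method names (LineSplitter, CountNonEmptyLines) are hypothetical and only serve the illustration; the handling of '\r' is my assumption about input with Windows line endings.

```csharp
using System;

public static class LineSplitter
{
    // Hypothetical helper: counts the non-empty lines of a text span.
    public static int CountNonEmptyLines(ReadOnlySpan<char> span)
    {
        var count = 0;
        while (!span.IsEmpty)
        {
            var endLine = span.IndexOf('\n');
            // No line feed left: the rest of the span is the last line.
            var line = endLine == -1 ? span : span.Slice(0, endLine);
            // Assumption: drop a trailing '\r' so "\r\n" endings work too.
            line = line.TrimEnd('\r');
            if (!line.IsEmpty)
            {
                count++;
            }
            // Step past the line feed; stop when this was the last line.
            span = endLine == -1 ? ReadOnlySpan<char>.Empty : span.Slice(endLine + 1);
        }
        return count;
    }
}
```

Calling LineSplitter.CountNonEmptyLines(text.AsSpan()) walks the text without allocating a string per line.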
What about the differences? They do exist: as we move on, we will find more of them in how a specific value is extracted.
ReadOnlySpan<char>
This type is easier to use and debug: because it holds chars, we can see the exact characters being analyzed, and converting to a string is just a call to ToString().
using (var reader = new StreamReader(File.OpenRead(path)))
{
    // getting Span from string
    var span = reader.ReadToEnd().AsSpan();
    // getting string from Span
    var str = span.ToString();
}
Moreover, the framework has overloads of the parsing methods on the simple types. This is very good because it lets us avoid converting a span to a string every time.
public static Single Parse(ReadOnlySpan<char> s, ... );
public static Int32 Parse(ReadOnlySpan<char> s, ... );
public static Double Parse(ReadOnlySpan<char> s, ... );
// the same for TryParse
...
This is an example of parsing a vertex, which has three float values for the X/Y/Z coordinates, for example "-0.0085 524.0146 32.0143":
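// separator between the values (this field is not shown in the original snippet)
const char spaceChar = ' ';

void SplitVertex(ReadOnlySpan<char> span, float[] val, int count = 3)
{
    var index = 0;
    while (index < count)
    {
        var end = span.IndexOf(spaceChar);
        if (end == -1)
        {
            // no more separators: the rest of the span is the last value
            end = span.Length;
        }
        var part = span.Slice(0, end).Trim();
        val[index] = float.Parse(part, NumberStyles.Float, CultureInfo.InvariantCulture);
        index++;
        // Trim removes the leading separator, so the next value starts at index 0
        span = span.Slice(end, span.Length - end).Trim();
    }
}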
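As a usage sketch, the same splitting idea can be applied to a whole OBJ line. The names ObjLineSketch and TryReadVertex are hypothetical and not taken from the original reader; the sketch relies on OBJ vertex records starting with "v " and uses TryParse so that a malformed line returns false instead of throwing.

```csharp
using System;
using System.Globalization;

public static class ObjLineSketch
{
    // Hypothetical helper, not part of the original reader.
    public static bool TryReadVertex(ReadOnlySpan<char> line, float[] val)
    {
        // Geometric vertex records start with "v "; "vn"/"vt" lines are rejected.
        if (!line.StartsWith("v ".AsSpan(), StringComparison.Ordinal))
        {
            return false;
        }
        var span = line.Slice(2).Trim();
        for (var index = 0; index < val.Length; index++)
        {
            var end = span.IndexOf(' ');
            if (end == -1)
            {
                end = span.Length;
            }
            // An empty or malformed part makes TryParse return false.
            if (!float.TryParse(span.Slice(0, end), NumberStyles.Float,
                    CultureInfo.InvariantCulture, out val[index]))
            {
                return false;
            }
            span = span.Slice(end).Trim();
        }
        return true;
    }
}
```

With var v = new float[3], the call ObjLineSketch.TryReadVertex("v -0.0085 524.0146 32.0143".AsSpan(), v) fills v with the three coordinates.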
I hope everything is clear so far!
ReadOnlySpan<byte>
The fastest way to get a byte span from a file is to use unsafe code and MemoryMappedFile.
using (var mm = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (var vs = mm.CreateViewStream())
using (var mmv = vs.SafeMemoryMappedViewHandle)
{
    unsafe
    {
        byte* ptrMemMap = (byte*)0;
        mmv.AcquirePointer(ref ptrMemMap);
        try
        {
            // the int cast limits the span to views smaller than 2 GB
            var bytes = new ReadOnlySpan<byte>(ptrMemMap, (int)mmv.ByteLength);
            // parse bytes here, while the pointer is still acquired
        }
        finally
        {
            // every AcquirePointer must be paired with a ReleasePointer
            mmv.ReleasePointer();
        }
    }
}
A class containing the code above is available here: MemoryMappedFileReader.
For byte spans, the framework also provides parsing overloads for the simple types, in System.Buffers.Text.Utf8Parser, like this:
public static bool TryParse(ReadOnlySpan<byte> source, out float value, out int bytesConsumed, char standardFormat = '\0');
As you can see from the class name, the parser works with UTF-8 bytes, and we must guarantee that the input really is UTF-8 encoded.
Here is an example of parsing the same vertex as before:
static readonly byte space = Convert.ToByte(' ');

void SplitVertex(ReadOnlySpan<byte> span, float[] val, int count = 3)
{
    var index = 0;
    while (index < count)
    {
        var end = span.IndexOf(space);
        if (end == -1)
        {
            end = span.Length;
        }
        var part = span.Slice(0, end).Trim(space);
        if (!Utf8Parser.TryParse(part, out float value, out var _))
        {
            throw new Exception("Can't read float");
        }
        val[index] = value;
        index++;
        span = span.Slice(end, span.Length - end).Trim(space);
    }
}
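A possible variation on the byte version, offered only as a sketch: Utf8Parser.TryParse reports how many bytes it consumed, so the span can be advanced by that amount instead of searching for the separator with IndexOf. The names ByteVertexSketch and TryParseVertex are hypothetical, and this is not how the original reader does it.

```csharp
using System;
using System.Buffers.Text;

public static class ByteVertexSketch
{
    // Hypothetical helper: parses val.Length floats separated by spaces.
    public static bool TryParseVertex(ReadOnlySpan<byte> span, float[] val)
    {
        for (var index = 0; index < val.Length; index++)
        {
            // Skip the separators in front of the next value.
            span = span.TrimStart((byte)' ');
            // The parser reads the leading number and reports its length in bytes.
            if (!Utf8Parser.TryParse(span, out float value, out int bytesConsumed))
            {
                return false;
            }
            val[index] = value;
            // Advance by what the parser consumed instead of calling IndexOf.
            span = span.Slice(bytesConsumed);
        }
        return true;
    }
}
```

The trade-off is that the values are no longer isolated before parsing, so any stricter validation of what follows each number has to be added separately.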
There are not many differences from the ReadOnlySpan<char> version.
Next: a fast way to convert bytes to a string is also unsafe and not so obvious.
readonly Encoding utf8 = Encoding.UTF8;

unsafe string GetString(in ReadOnlySpan<byte> span)
{
    fixed (byte* buffer = &MemoryMarshal.GetReference(span))
    {
        var charCount = utf8.GetCharCount(buffer, span.Length);
        // stackalloc memory is already fixed, so no second fixed statement is needed;
        // keep the span small, because a large stackalloc can overflow the stack
        char* chars = stackalloc char[charCount];
        var count = utf8.GetChars(buffer, span.Length, chars, charCount);
        return new string(chars, 0, count);
    }
}
This method converts the whole span to a string; if you need to convert only part of it, change span.Length to the number of bytes to convert. Don't forget to use the correct Encoding for the conversion as well.
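For comparison, here is a safe alternative sketch that avoids pointers altogether. It assumes a runtime where Encoding.GetString has a ReadOnlySpan<byte> overload (.NET Core 2.1 or later); Utf8Text and its byteCount parameter are hypothetical names for illustration, and I have not measured how it compares in speed with the unsafe version.

```csharp
using System;
using System.Text;

public static class Utf8Text
{
    // Hypothetical helper: converts the first byteCount bytes of the span to a string.
    public static string GetString(ReadOnlySpan<byte> span, int byteCount)
    {
        // Slice selects the part to convert; GetString decodes it as UTF-8.
        return Encoding.UTF8.GetString(span.Slice(0, byteCount));
    }
}
```

Passing span.Length as byteCount converts the whole span, which mirrors the behavior described above.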
Conclusion
We have covered the minimum amount of code needed to implement a parser. The main goal of using Span should be efficient memory usage, but we gain performance as well. Speed matters as much as memory usage for any reader or parser, and together these are good reasons to choose Span for your applications.
Some tips:
If we use the simple types byte or char with the TryParse/Parse methods instead of ReadOnlySpan<char> and ReadOnlySpan<byte> (as I did in the examples above), there will be an implicit cast to ReadOnlySpan. The cast is cheap: in a simple test with StartsWith I saw a difference only at the level of ticks. Of course, you can define all constants as ReadOnlySpan to avoid the cast, but I am not sure that will make you happier or significantly increase performance.
Slice is very cheap; just use it.
IndexOf is a very expensive operation! Avoid using it for symbols that you already know are absent from a big span. Try to use it for symbols that definitely exist, or Slice the span and invoke IndexOf on small parts of it.
String has an overload of IndexOf with startIndex, but Span does not. In this case, we should Slice the whole span into pieces and analyze them independently (see the SplitVertex example).
If you use a byte span, avoid unnecessary conversions to string.
A plus of using bytes is that the methods in Utf8Parser are faster than the overloads on the simple types for chars.
ReadOnlySpan<char> - working with chars looks more natural, without tricks and unsafe code, but it has one bottleneck: the whole file must be read, and it is not possible just to wrap a stream. Using chars for big files does not seem very useful, but if we already have a string or the file is small, we should use it; there is no benefit in converting it to bytes.
ReadOnlySpan<byte> - here is our winner! To be honest, there is no competition: working with bytes is much faster for sure.
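Coming back to the tip about the missing startIndex overload, the Slice-based workaround can be wrapped in a small helper. This is only a sketch, and SpanSearch.IndexOf is a hypothetical name, not a framework method.

```csharp
using System;

public static class SpanSearch
{
    // Hypothetical helper: IndexOf that starts searching at startIndex.
    public static int IndexOf(ReadOnlySpan<byte> span, byte value, int startIndex)
    {
        // Search only the tail of the span.
        var found = span.Slice(startIndex).IndexOf(value);
        // Translate the position back into the coordinates of the original span.
        return found == -1 ? -1 : found + startIndex;
    }
}
```

Because Slice only creates a new view over the same memory, the helper adds no copying on top of the search itself.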
That is all I wanted to tell you.
Here is the source code of the OBJ reader, Utf8ByteOBJParser, where I have applied everything described above.
And of course, thanks for reading.
Have a nice day!