Detect encoding from ReadOnlySpan<byte> #204
Thanks for the PR!
This is supported in .NET 8? So we could use …
Note: I will remove .NET 6 support first (#205). Update: done.

Close/reopen for new merge commit

    private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
    {
    #if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
        stream.Write(buffer);
    #else
        byte[] rent = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            buffer.CopyTo(rent);
            stream.Write(rent, 0, buffer.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rent);
        }
    #endif
    }

I also updated the System.Memory package to resolve a version conflict with the System.Memory version pulled in by Microsoft.SourceLink.GitHub.

    private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
    {
    #if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER

This won't work unless we also target netstandard2.1?
But after checking in depth, I don't think we should target netstandard2.1. We target netstandard2.0 mainly for .NET Framework (which does not support .NET Standard 2.1), and .NET 10 uses the .NET 8 assembly. So my proposal: change this to NET8_0_OR_GREATER.
    - #if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
    + #if NET8_0_OR_GREATER
Pull request overview
This PR introduces a ReadOnlySpan<byte>-based encoding detection API to avoid forcing callers to materialize byte[], and refactors internal probing/analyzer code to operate on spans (using slicing) instead of offset/len parameters.
Changes:
- Add CharsetDetector.DetectFromBytes(ReadOnlySpan<byte>) and forward existing byte[] overloads to it.
- Update probers/analyzers to accept ReadOnlySpan<byte> and use slicing rather than offset-based indexing.
- Add System.Memory for netstandard2.0 and add a unit test for the new overload.
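
The forwarding pattern in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the UTF-unknown code: `SpanForwardingSketch` and `CountHighBytes` are hypothetical names, and counting bytes at or above 0x80 merely stands in for the real detection work.

```csharp
using System;

// Hypothetical stand-in for a detector; names are illustrative only.
public static class SpanForwardingSketch
{
    // Span-based core: all of the actual work happens here.
    public static int CountHighBytes(ReadOnlySpan<byte> bytes)
    {
        int count = 0;
        foreach (byte b in bytes)
        {
            if (b >= 0x80) count++;
        }
        return count;
    }

    // Whole-array overload: wraps the array in a span and forwards.
    public static int CountHighBytes(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        return CountHighBytes(new ReadOnlySpan<byte>(bytes));
    }

    // offset/len overload: the slice replaces the two extra parameters.
    public static int CountHighBytes(byte[] bytes, int offset, int len)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        return CountHighBytes(bytes.AsSpan(offset, len));
    }
}
```

With this shape, callers that already hold a span skip the array entirely, while existing byte[] callers keep compiling against the old signatures.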
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/CharsetDetectorTest.cs | Adds coverage for detecting encoding from a ReadOnlySpan<byte>. |
| src/UTF-unknown.csproj | Adds System.Memory dependency for netstandard2.0 span support. |
| src/CharsetDetector.cs | Adds ReadOnlySpan<byte> overload and refactors feeding/probing to span-based flow. |
| src/Core/Probers/CharsetProber.cs | Switches prober API and filtering helpers to ReadOnlySpan<byte>; adds span-to-stream helper. |
| src/Core/Probers/SingleByteCharSetProber.cs | Updates HandleData to span-based iteration. |
| src/Core/Probers/SBCSGroupProber.cs | Updates prober pipeline to accept spans and span-based filtering. |
| src/Core/Probers/MBCSGroupProber.cs | Updates HandleData to span input and span forwarding to sub-probers. |
| src/Core/Probers/Latin1Prober.cs | Updates filtering callsites to span-based helpers. |
| src/Core/Probers/HebrewProber.cs | Updates HandleData signature and loop to span indexing. |
| src/Core/Probers/EscCharsetProber.cs | Updates HandleData signature and loop to span indexing. |
| src/Core/Probers/MultiByte/UTF8Prober.cs | Updates HandleData signature and loop to span indexing. |
| src/Core/Probers/MultiByte/Korean/EUCKRProber.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Korean/CP949Prober.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Japanese/SJISProber.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Japanese/EUCJPProber.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Chinese/GB18030Prober.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Chinese/EUCTWProber.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Probers/MultiByte/Chinese/Big5Prober.cs | Updates HandleData signature and replaces offset math with slicing. |
| src/Core/Analyzers/CharDistributionAnalyser.cs | Refactors distribution analysis API to consume spans. |
| src/Core/Analyzers/MultiByte/Korean/EUCKRDistributionAnalyser.cs | Updates GetOrder to span-based indexing. |
| src/Core/Analyzers/MultiByte/Japanese/JapaneseContextAnalyser.cs | Refactors context analysis to slice spans when examining characters. |
| src/Core/Analyzers/MultiByte/Japanese/SJISContextAnalyser.cs | Updates GetOrder implementations to span-based indexing. |
| src/Core/Analyzers/MultiByte/Japanese/EUCJPContextAnalyser.cs | Updates GetOrder implementations to span-based indexing. |
| src/Core/Analyzers/MultiByte/Japanese/SJISDistributionAnalyser.cs | Updates GetOrder to span-based indexing. |
| src/Core/Analyzers/MultiByte/Japanese/EUCJPDistributionAnalyser.cs | Updates GetOrder to span-based indexing. |
| src/Core/Analyzers/MultiByte/Chinese/GB18030DistributionAnalyser.cs | Updates GetOrder to span-based indexing. |
| src/Core/Analyzers/MultiByte/Chinese/EUCTWDistributionAnalyser.cs | Updates GetOrder to span-based indexing. |
| src/Core/Analyzers/MultiByte/Chinese/BIG5DistributionAnalyser.cs | Updates GetOrder to span-based indexing. |

    if (buf.Length > 0)
        _gotData = true;

Sounds like a good suggestion.
We need a unit test to validate this issue/fix.

Add an overload that receives ReadOnlySpan<byte> instead of byte[], so callers can detect the encoding of a Span<T> or ReadOnlySpan<T> without copying to a byte[].

The existing byte[] overloads forward to it. The other methods invoked from DetectFromBytes now take ReadOnlySpan<byte> and use slicing instead of offset/len.

This also affects some related methods, such as CharsetDetector.Feed and CharsetProber.HandleData.

Most of the changes are just signature updates and slicing instead of passing an offset to methods.

As an implementation note, since .NET Standard 2.0 does not have a MemoryStream.Write(ReadOnlySpan<byte>) method, the data is copied into an array buffer and then written to the stream. This may reduce performance slightly, but I think it is the best approach without using unsafe blocks or reflection.

Also, this may break code outside of UTF-unknown that overrides CharsetDetector.Feed or derives from CharsetProber, but I believe that migrating to the new signature should not be that hard.
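
To picture what "slicing instead of passing an offset" means for a derived class that has to migrate, here is a small sketch. It assumes nothing about the real prober or analyser code: `SlicingSketch`, `OrderFromArray`, `OrderFromSpan`, and `SumOrders` are hypothetical names, and combining two bytes into an integer is only a placeholder for a real lookup.

```csharp
using System;

// Hypothetical before/after shapes; not taken from UTF-unknown.
public static class SlicingSketch
{
    // Before: the callee gets the whole array and must add the offset itself.
    public static int OrderFromArray(byte[] buf, int offset)
    {
        return (buf[offset] << 8) | buf[offset + 1];
    }

    // After: the callee gets a span that already starts at the character.
    public static int OrderFromSpan(ReadOnlySpan<byte> chr)
    {
        return (chr[0] << 8) | chr[1];
    }

    // Caller side: walk the input two bytes at a time and pass each pair as a slice.
    public static int SumOrders(ReadOnlySpan<byte> buf)
    {
        int sum = 0;
        for (int i = 0; i + 1 < buf.Length; i += 2)
        {
            sum += OrderFromSpan(buf.Slice(i, 2));
        }
        return sum;
    }
}
```

In a migration of this kind the method body mostly loses its offset arithmetic, and the bounds of each slice are checked once at the Slice call.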