Skip to content

Detect encoding from ReadOnlySpan<byte>#204

Open
harnel-tngn wants to merge 6 commits into
CharsetDetector:masterfrom
harnel-tngn:detect-from-readonlyspan
Open

Detect encoding from ReadOnlySpan<byte>#204
harnel-tngn wants to merge 6 commits into
CharsetDetector:masterfrom
harnel-tngn:detect-from-readonlyspan

Conversation

@harnel-tngn

Copy link
Copy Markdown

Add an overload that receives ReadOnlySpan<byte> instead of byte[], so callers can detect the encoding of a Span<T> or ReadOnlySpan<T> without copying to a byte[]:

public class CharsetDetector
{
    public static DetectionResult DetectFromBytes(ReadOnlySpan<byte> bytes);
}

The existing byte[] overloads forward to it. The other methods invoked from DetectFromBytes now take ReadOnlySpan<byte> and use slicing instead of offset/len.

This also affects some related methods, such as CharsetDetector.Feed, CharsetProber.HandleData.

Most of the changes are just signature updates and slicing instead of passing an offset to methods.

As an implementation note, since .NET Standard 2.0 does not have a MemoryStream.Write(ReadOnlySpan<byte>) method, the data is copied into an array buffer and then written to the stream. This may reduce performance slightly, but I think it is the best approach without using unsafe blocks or reflections.

Also this may break some codes outside of UTF-unknown that overload CharsetDetector.Feed or derived class of CharsetProber, but I believe that migrating to new signature should not be that hard.

@harnel-tngn harnel-tngn changed the title Detect from readonlyspan Detect encoding from ReadOnlySpan<byte> Jun 29, 2026
@304NotModified

304NotModified commented Jun 29, 2026

Copy link
Copy Markdown
Member

Thanks for the PR!

As an implementation note, since .NET Standard 2.0 does not have a MemoryStream.Write(ReadOnlySpan<byte>) method

This is supported in .NET 8? So we could use #IF NET8_0_OR_GREATER. We could also target .NET Standard 2.1 (not instead of .NET Standard 2.0)

Note, I will remove .NET 6 support first (#205) - update, done

@304NotModified 304NotModified added this to the 2.7 milestone Jun 29, 2026
@304NotModified

Copy link
Copy Markdown
Member

Close/reopen for new merge commit

@harnel-tngn

Copy link
Copy Markdown
Author

MemoryStream.Write(ReadOnlySpan<byte>) is supported from .NET Core 2.1. Here is a link to the MSDN document.

CharsetProber.WriteSpanToStream already uses MemoryStream.Write when the target framework is .NET Standard 2.1 / .NET Core 2.1 or newer. If we bump the target framework to .NET Standard 2.1, we can just remove the CharsetProber.WriteSpanToStream method and call MemoryStream.Write directly.

    private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
    {
#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
        stream.Write(buffer);
#else
        byte[] rent = ArrayPool<byte>.Shared.Rent(buffer.Length);

        try
        {
            buffer.CopyTo(rent);

            stream.Write(rent, 0, buffer.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rent);
        }
#endif
    }

I also updated System.Memory package to resolve a version conflict between System.Memory from Microsoft.SourceLink.GitHub.


private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
{
#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work unless we also target netstandard2.1?

But after checking in depth, I don't think we should go for net netstandard2.1. We target netstandard 2.0 mainly for .NET Framework (not possible with 2.1) and .NET 10 uses the .NET 8 assembly. So proposal, change this to NET8_0_OR_GREATER

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
#if NET8_0_OR_GREATER

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a ReadOnlySpan<byte>-based encoding detection API to avoid forcing callers to materialize byte[], and refactors internal probing/analyzer code to operate on spans (using slicing) instead of offset/len parameters.

Changes:

  • Add CharsetDetector.DetectFromBytes(ReadOnlySpan<byte>) and forward existing byte[] overloads to it.
  • Update probers/analyzers to accept ReadOnlySpan<byte> and use slicing rather than offset-based indexing.
  • Add System.Memory for netstandard2.0 and add a unit test for the new overload.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/CharsetDetectorTest.cs Adds coverage for detecting encoding from a ReadOnlySpan<byte>.
src/UTF-unknown.csproj Adds System.Memory dependency for netstandard2.0 span support.
src/CharsetDetector.cs Adds ReadOnlySpan<byte> overload and refactors feeding/probing to span-based flow.
src/Core/Probers/CharsetProber.cs Switches prober API and filtering helpers to ReadOnlySpan<byte>; adds span-to-stream helper.
src/Core/Probers/SingleByteCharSetProber.cs Updates HandleData to span-based iteration.
src/Core/Probers/SBCSGroupProber.cs Updates prober pipeline to accept spans and span-based filtering.
src/Core/Probers/MBCSGroupProber.cs Updates HandleData to span input and span forwarding to sub-probers.
src/Core/Probers/Latin1Prober.cs Updates filtering callsites to span-based helpers.
src/Core/Probers/HebrewProber.cs Updates HandleData signature and loop to span indexing.
src/Core/Probers/EscCharsetProber.cs Updates HandleData signature and loop to span indexing.
src/Core/Probers/MultiByte/UTF8Prober.cs Updates HandleData signature and loop to span indexing.
src/Core/Probers/MultiByte/Korean/EUCKRProber.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Korean/CP949Prober.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Japanese/SJISProber.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Japanese/EUCJPProber.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/GB18030Prober.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/EUCTWProber.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/Big5Prober.cs Updates HandleData signature and replaces offset math with slicing.
src/Core/Analyzers/CharDistributionAnalyser.cs Refactors distribution analysis API to consume spans.
src/Core/Analyzers/MultiByte/Korean/EUCKRDistributionAnalyser.cs Updates GetOrder to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/JapaneseContextAnalyser.cs Refactors context analysis to slice spans when examining characters.
src/Core/Analyzers/MultiByte/Japanese/SJISContextAnalyser.cs Updates GetOrder implementations to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/EUCJPContextAnalyser.cs Updates GetOrder implementations to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/SJISDistributionAnalyser.cs Updates GetOrder to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/EUCJPDistributionAnalyser.cs Updates GetOrder to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/GB18030DistributionAnalyser.cs Updates GetOrder to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/EUCTWDistributionAnalyser.cs Updates GetOrder to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/BIG5DistributionAnalyser.cs Updates GetOrder to span-based indexing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/CharsetDetector.cs
Comment on lines +421 to 422
if (buf.Length > 0)
_gotData = true;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds as a good suggestion.

We need a unittest to validate this issue/fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants