Detect encoding from ReadOnlySpan<byte> by harnel-tngn · Pull Request #204 · CharsetDetector/UTF-unknown

harnel-tngn · 2026-06-29T08:53:13Z

Add an overload that receives ReadOnlySpan<byte> instead of byte[], so callers can detect the encoding of a Span<T> or ReadOnlySpan<T> without copying to a byte[]:

public class CharsetDetector
{
    public static DetectionResult DetectFromBytes(ReadOnlySpan<byte> bytes);
}

The existing byte[] overloads forward to it. The other methods invoked from DetectFromBytes now take ReadOnlySpan<byte> and use slicing instead of offset/len.
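
As a rough illustration of the forwarding pattern described above (not the actual UTF-unknown code), the array-based overloads can become thin wrappers over a single span-based core. SpanForwardingSketch and StartsWithUtf8Bom are hypothetical names used only for this sketch; the BOM check merely stands in for the real detection logic.

```csharp
using System;

// Hypothetical stand-in, not the UTF-unknown implementation.
public static class SpanForwardingSketch
{
    // Span-based core: every overload ends up here.
    public static bool StartsWithUtf8Bom(ReadOnlySpan<byte> bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF
            && bytes[1] == 0xBB
            && bytes[2] == 0xBF;
    }

    // Whole-array overload: wraps the array in a span without copying.
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        return StartsWithUtf8Bom(new ReadOnlySpan<byte>(bytes));
    }

    // offset/len overload: the window becomes a slice of the array.
    public static bool StartsWithUtf8Bom(byte[] bytes, int offset, int len)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        return StartsWithUtf8Bom(new ReadOnlySpan<byte>(bytes, offset, len));
    }
}
```

Callers holding a Span<byte> can pass it directly, because Span<T> converts implicitly to ReadOnlySpan<T>.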

This also affects some related methods, such as CharsetDetector.Feed and CharsetProber.HandleData.

Most of the changes are just signature updates and slicing instead of passing an offset to methods.

As an implementation note, since .NET Standard 2.0 does not have a MemoryStream.Write(ReadOnlySpan<byte>) method, the data is copied into an array buffer and then written to the stream. This may reduce performance slightly, but I think it is the best approach that avoids unsafe blocks and reflection.

This may also break code outside UTF-unknown that overrides CharsetDetector.Feed or derives from CharsetProber, but I believe that migrating to the new signatures should not be hard.
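
To give a feel for that migration, here is a minimal before/after sketch. It assumes the new HandleData takes a single ReadOnlySpan<byte>; the base classes below are hypothetical stand-ins, and the return type is simplified to an int count rather than the real probing state.

```csharp
using System;

// Hypothetical stand-ins; the real CharsetProber has more members.
public abstract class OldStyleProber
{
    public abstract int HandleData(byte[] buf, int offset, int len);
}

public abstract class SpanStyleProber
{
    public abstract int HandleData(ReadOnlySpan<byte> buf);
}

// Before: the loop is bounded by offset and offset + len.
public sealed class OldHighByteCounter : OldStyleProber
{
    public override int HandleData(byte[] buf, int offset, int len)
    {
        int count = 0;
        int max = offset + len;
        for (int i = offset; i < max; i++)
        {
            if ((buf[i] & 0x80) != 0) count++;
        }
        return count;
    }
}

// After: the span already describes the window, so indexing starts at 0.
public sealed class SpanHighByteCounter : SpanStyleProber
{
    public override int HandleData(ReadOnlySpan<byte> buf)
    {
        int count = 0;
        for (int i = 0; i < buf.Length; i++)
        {
            if ((buf[i] & 0x80) != 0) count++;
        }
        return count;
    }
}
```

A caller that still holds a byte[] with an offset and length can bridge to the new shape with buf.AsSpan(offset, len).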

304NotModified · 2026-06-29T19:49:37Z

Thanks for the PR!

As an implementation note, since .NET Standard 2.0 does not have a MemoryStream.Write(ReadOnlySpan<byte>) method

This is supported in .NET 8, right? Then we could use #if NET8_0_OR_GREATER. We could also target .NET Standard 2.1 (in addition to .NET Standard 2.0, not instead of it).

Note: I will remove .NET 6 support first (#205). Update: done.

304NotModified · 2026-06-29T20:03:39Z

Close/reopen for new merge commit

harnel-tngn · 2026-06-30T05:32:40Z

MemoryStream.Write(ReadOnlySpan<byte>) has been supported since .NET Core 2.1. Here is a link to the MSDN document.

CharsetProber.WriteSpanToStream already uses MemoryStream.Write when the target framework is .NET Standard 2.1 / .NET Core 2.1 or newer. If we bump the target framework to .NET Standard 2.1, we can just remove the CharsetProber.WriteSpanToStream method and call MemoryStream.Write directly.

    private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
    {
#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
        stream.Write(buffer);
#else
        byte[] rent = ArrayPool<byte>.Shared.Rent(buffer.Length);

        try
        {
            buffer.CopyTo(rent);

            stream.Write(rent, 0, buffer.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rent);
        }
#endif
    }

I also updated the System.Memory package to resolve a version conflict with the System.Memory version pulled in by Microsoft.SourceLink.GitHub.

304NotModified · 2026-07-03T09:23:07Z

+
+    private static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
+    {
+#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER


This won't work unless we also target netstandard2.1?

But after checking in depth, I don't think we should go for netstandard2.1. We target netstandard2.0 mainly for .NET Framework (which is not possible with 2.1), and .NET 10 uses the .NET 8 assembly. So my proposal: change this to NET8_0_OR_GREATER.

Suggested change

-#if NETSTANDARD2_1_OR_GREATER || NETCOREAPP2_1_OR_GREATER
+#if NET8_0_OR_GREATER
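
With that guard applied, the helper might look roughly like the sketch below. The wrapper class name StreamSpanWriter is an assumption made so the snippet compiles on its own, and the fallback branch is only compiled for targets older than .NET 8 (such as netstandard2.0 with the System.Memory package).

```csharp
using System;
using System.Buffers;
using System.IO;

// Hypothetical wrapper class; in the PR the helper lives in CharsetProber.
public static class StreamSpanWriter
{
    public static void WriteSpanToStream(MemoryStream stream, ReadOnlySpan<byte> buffer)
    {
#if NET8_0_OR_GREATER
        // Span overload is available on the .NET 8 target.
        stream.Write(buffer);
#else
        // Fallback: copy into a pooled array, then use the array overload.
        byte[] rented = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            buffer.CopyTo(rented);
            // The rented array may be larger than requested, so pass the real length.
            stream.Write(rented, 0, buffer.Length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
#endif
    }
}
```

Both branches write exactly buffer.Length bytes, so callers should see the same stream contents on every target framework.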

Copilot

Pull request overview

This PR introduces a ReadOnlySpan<byte>-based encoding detection API to avoid forcing callers to materialize byte[], and refactors internal probing/analyzer code to operate on spans (using slicing) instead of offset/len parameters.

Changes:

Add CharsetDetector.DetectFromBytes(ReadOnlySpan<byte>) and forward existing byte[] overloads to it.
Update probers/analyzers to accept ReadOnlySpan<byte> and use slicing rather than offset-based indexing.
Add System.Memory for netstandard2.0 and add a unit test for the new overload.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

File	Description
tests/CharsetDetectorTest.cs	Adds coverage for detecting encoding from a `ReadOnlySpan<byte>`.
src/UTF-unknown.csproj	Adds `System.Memory` dependency for `netstandard2.0` span support.
src/CharsetDetector.cs	Adds `ReadOnlySpan<byte>` overload and refactors feeding/probing to span-based flow.
src/Core/Probers/CharsetProber.cs	Switches prober API and filtering helpers to `ReadOnlySpan<byte>`; adds span-to-stream helper.
src/Core/Probers/SingleByteCharSetProber.cs	Updates `HandleData` to span-based iteration.
src/Core/Probers/SBCSGroupProber.cs	Updates prober pipeline to accept spans and span-based filtering.
src/Core/Probers/MBCSGroupProber.cs	Updates `HandleData` to span input and span forwarding to sub-probers.
src/Core/Probers/Latin1Prober.cs	Updates filtering callsites to span-based helpers.
src/Core/Probers/HebrewProber.cs	Updates `HandleData` signature and loop to span indexing.
src/Core/Probers/EscCharsetProber.cs	Updates `HandleData` signature and loop to span indexing.
src/Core/Probers/MultiByte/UTF8Prober.cs	Updates `HandleData` signature and loop to span indexing.
src/Core/Probers/MultiByte/Korean/EUCKRProber.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Korean/CP949Prober.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Japanese/SJISProber.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Japanese/EUCJPProber.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/GB18030Prober.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/EUCTWProber.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Probers/MultiByte/Chinese/Big5Prober.cs	Updates `HandleData` signature and replaces offset math with slicing.
src/Core/Analyzers/CharDistributionAnalyser.cs	Refactors distribution analysis API to consume spans.
src/Core/Analyzers/MultiByte/Korean/EUCKRDistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/JapaneseContextAnalyser.cs	Refactors context analysis to slice spans when examining characters.
src/Core/Analyzers/MultiByte/Japanese/SJISContextAnalyser.cs	Updates `GetOrder` implementations to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/EUCJPContextAnalyser.cs	Updates `GetOrder` implementations to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/SJISDistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.
src/Core/Analyzers/MultiByte/Japanese/EUCJPDistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/GB18030DistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/EUCTWDistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.
src/Core/Analyzers/MultiByte/Chinese/BIG5DistributionAnalyser.cs	Updates `GetOrder` to span-based indexing.

304NotModified · 2026-07-03T09:31:31Z

+        if (buf.Length > 0)
            _gotData = true;


Sounds like a good suggestion.

We need a unit test to validate this issue/fix.

harnel-tngn added 4 commits June 29, 2026 16:58

Support detecting encoding from ReadOnlySpan<byte>

c46d4d5

Remove offset parameter from HandleOneChar and GetOrder

0c9e607

Add test for DetectFromBytes(ReadOnlySpan<byte>)

b02f712

Remove obsolete offset/len XML doc param tags

b070eb6

harnel-tngn changed the title from "Detect from readonlyspan" to "Detect encoding from ReadOnlySpan<byte>" Jun 29, 2026

304NotModified added this to the 2.7 milestone Jun 29, 2026

304NotModified closed this Jun 29, 2026

304NotModified reopened this Jun 29, 2026

304NotModified added the feature label Jun 29, 2026

304NotModified and others added 2 commits June 30, 2026 06:21

Merge branch 'master' into detect-from-readonlyspan

82d1fbe

Update System.Memory to 4.6.5 for .NET Standard 2.0

28d360a

304NotModified reviewed Jul 3, 2026

304NotModified requested a review from Copilot July 3, 2026 09:23

Copilot started reviewing on behalf of 304NotModified July 3, 2026 09:24

Copilot AI reviewed Jul 3, 2026

Copilot AI mentioned this pull request Jul 3, 2026

Add DetectFromBytes(ReadOnlySpan<byte>) overload with comprehensive tests #211

Draft

harnel-tngn wants to merge 6 commits into CharsetDetector:master from harnel-tngn:detect-from-readonlyspan


3 participants
