fix: unify schema across batches in JSONStreamDatasource to handle null → concrete type evolution by fengrui-z · Pull Request #972 · datajuicer/data-juicer

fengrui-z · 2026-04-29T08:27:33Z

Summary

Fixes #936

JSONStreamDatasource._read_stream locks the schema from the first batch and reuses it for all subsequent batches. When an
early batch infers a nested field as null (e.g. meta.url = null) and a later batch introduces a concrete type (e.g.
string), the forced cast from string to null fails with ArrowInvalid.

This is a correctness bug in DJ's custom JSON streaming ingestion path. Ray's native ray.data.read_json handles the same input correctly.

Root Cause

# Before: first batch locks schema, all subsequent batches forced to it
table = pyarrow.Table.from_batches([batch], schema=schema)
if schema is None:
    schema = table.schema  # locked forever

Fix

Remove the first-batch schema lock — create table without forced schema
Use pyarrow.unify_schemas to merge schemas across batches, allowing null → concrete type promotion
After unification, cast the batch to the unified schema for consistency

  # After: schema evolves across batches
  table = pyarrow.Table.from_batches([batch])
  if schema is None:
      schema = table.schema
  else:
      unified = pyarrow.unify_schemas([schema, table.schema])
      if not unified.equals(schema):
          schema = unified
      table = pyarrow.Table.from_batches([batch], schema=schema)

unify_schemas internally delegates to Arrow C++ UnifyTypes, which promotes null to the concrete type and recursively handles nested structs.

Test Plan

Verify the minimal repro from #936 passes with this fix
Verify ray.data.read_json and read_json_stream produce consistent results on mixed-null JSONL
Verify no regression on JSONL files with uniform schema
Verify no regression on JSONL files with nested structs

See #936 for the minimal reproduction script.

…ll → concrete type evolution

The previous implementation locked the schema from the first batch and reused it for all subsequent batches via `Table.from_batches([batch], schema=schema)`. When an early batch inferred a nested field as `null` (e.g. `meta.url = null`) and a later batch introduced a concrete type (e.g. `string`), the cast from `string` to `null` would fail with ArrowInvalid.

This fix removes the first-batch schema lock and instead uses `pyarrow.unify_schemas` to merge schemas across batches, allowing `null` types to be promoted to concrete types as new data is read.

Fixes datajuicer#936

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the _read_stream method in ray_dataset.py to support schema unification when reading batches from a stream. This allows the system to handle batches with varying but compatible schemas. A review comment suggests refactoring the implementation to reduce code duplication by consolidating the pyarrow.Table creation after the final schema has been determined.

…atajuicer#936)

Ray's internal concat requires all tables from a single file to have identical schemas. The streaming reader cannot guarantee this when schema evolution occurs (null → concrete type triggers ArrowInvalid at block boundaries).

Approach:
- Buffer all batches from the streaming reader, unify schema, then yield with consistent schema. This is necessary because Ray's _combine_tables fails on struct child type mismatch (null vs string).
- When PyArrow's reader throws ArrowInvalid ("changed from"), fall back to paj.read_json() which handles schema inference across all rows in a single pass.

Performance note: buffering adds O(file_size) memory within the read task, but Ray already materializes the full generator output per task before passing downstream, so latency impact is negligible.

Fixes datajuicer#936

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fengrui-z · 2026-06-08T04:01:16Z

Investigated the performance concern. Key findings:

Buffering is unavoidable — Ray's internal _combine_tables → _align_struct_fields requires all tables from a single file to have identical schemas. Yielding tables with mixed schemas (e.g., meta.url: null then meta.url: string) causes ArrowInvalid: Struct child array #0 does not match type field: null vs string.
Actual performance impact is minimal — Ray materializes the full generator output per read task before passing downstream, so latency is unaffected. The memory cost (holding all batches simultaneously) is equivalent to the paj.read_json() fallback path.
Current fix: Buffer batches + unify_schemas for the happy path; fall back to paj.read_json(path) when PyArrow's streaming reader throws ArrowInvalid("changed from") at block boundaries (the #936 case, where null → concrete type evolution can't be handled mid-stream).

cyruszhang

Thanks for tackling #936. I reproduced the PyArrow behavior locally: open_json fails on the null -> string transition while read_json succeeds, so the correctness issue here is real.

Two things I think we should address before merging:

The fallback reopens with paj.read_json(path, ...). _read_stream receives path after Ray has normalized it against the configured filesystem, so it is not always a standalone local path/URI. When a caller passes a filesystem, this bypasses the already-open stream/filesystem and can fail. I verified this with a pyarrow.fs.SubTreeFileSystem: the absolute-path case succeeds, while the filesystem-relative path case fails with FileNotFoundError: ... schema_evolution.jsonl. The fallback should reopen through the same filesystem/input source or otherwise preserve filesystem support; please add a filesystem-backed regression test for this path.
batches = [] now buffers every record batch for every file, even when the schema never evolves. That changes read_json_stream from streaming to whole-file memory use. Ray's ReadTask yields blocks from the iterator and its datasource contract explicitly allows a single large file to return multiple blocks to avoid OOM, so this can regress large JSONL workloads. If full-file inference is unavoidable for evolving schemas, I think we should make that behavior explicit and/or limit it to the fallback path rather than silently applying it to all reads.

cyruszhang · 2026-06-16T19:26:35Z

One more, independent of the buffering discussion: the unify_schemas happy path is currently broken.

Table.from_batches([batch], schema=unified) does not cast — it requires the batch schema to already equal the target. The moment a batch's schema differs from the unified one, it raises ArrowInvalid: Schema at index 0 was different. Note that message does not contain "changed from", so it escalates straight to the ValueError instead of recovering.

import pyarrow as pa
b1 = pa.RecordBatch.from_pylist([{"meta": {"url": None}}]) # struct<url: null>
b2 = pa.RecordBatch.from_pylist([{"meta": {"url": "x"}}]) # struct<url: string>
unified = pa.unify_schemas([b1.schema, b2.schema]) # struct<url: string>

pa.Table.from_batches([b1], schema=unified) # ArrowInvalid: Schema at index 0 was different
pa.Table.from_batches([b1]).cast(unified) # OK ✅ (pyarrow 23.0.1)

So the for batch in batches: yield Table.from_batches([batch], schema=schema) loop fails on any batch whose schema isn't already the unified one. #936 never reaches it (it errors earlier inside read_next_batch, via the "changed from" fallback), so this branch is effectively untested dead code today.

Fix is Table.from_batches([batch]).cast(schema).

- Fix happy path casting bug: use .cast(schema) instead of forcing schema in from_batches
- Fix memory regression: streaming fast path without buffering, only buffer on schema evolution
- Fix filesystem fallback: reopen file through same filesystem abstraction instead of using path directly
- Add SubTreeFileSystem regression test to verify filesystem-backed fallback works correctly

fengrui-z requested review from Dludora and yxdyc April 29, 2026 08:27

fengrui-z had a problem deploying to Testing April 29, 2026 08:27 — with GitHub Actions Failure

gemini-code-assist Bot reviewed Apr 29, 2026

Comment thread data_juicer/core/data/ray_dataset.py Outdated

test: add regression test for JSONStreamDatasource schema evolution (d…

523f594

…atajuicer#936)

fengrui-z had a problem deploying to Testing April 29, 2026 08:34 — with GitHub Actions Failure

style: apply black formatting

4258403

fengrui-z had a problem deploying to Testing April 29, 2026 08:43 — with GitHub Actions Failure

refactor: simplify schema unification logic per Gemini review

9e248e2

fengrui-z had a problem deploying to Testing April 29, 2026 09:10 — with GitHub Actions Failure

style: merge f-strings for black compliance

132f727

fengrui-z had a problem deploying to Testing April 29, 2026 09:21 — with GitHub Actions Failure

fengrui-z marked this pull request as ready for review April 29, 2026 09:24

fengrui-z temporarily deployed to Testing June 5, 2026 03:33 — with GitHub Actions Inactive

fengrui-z had a problem deploying to Testing June 5, 2026 03:33 — with GitHub Actions Failure

fengrui-z requested a review from cmgzn June 5, 2026 03:36

fengrui-z temporarily deployed to Testing June 5, 2026 06:49 — with GitHub Actions Inactive

fengrui-z had a problem deploying to Testing June 5, 2026 06:49 — with GitHub Actions Failure

fengrui-z temporarily deployed to Testing June 5, 2026 07:12 — with GitHub Actions Inactive

fengrui-z had a problem deploying to Testing June 5, 2026 07:12 — with GitHub Actions Failure

fengrui-z temporarily deployed to Testing June 5, 2026 07:24 — with GitHub Actions Inactive

cmgzn requested a review from cyruszhang June 5, 2026 08:55

fengrui-z force-pushed the fix/json-stream-schema-lock branch from da275b1 to a0e3c6d Compare June 8, 2026 03:30

fengrui-z had a problem deploying to Testing June 8, 2026 03:30 — with GitHub Actions Failure

fengrui-z temporarily deployed to Testing June 8, 2026 03:30 — with GitHub Actions Inactive

fengrui-z force-pushed the fix/json-stream-schema-lock branch from a0e3c6d to 38a27c1 Compare June 8, 2026 03:54

fengrui-z temporarily deployed to Testing June 8, 2026 03:54 — with GitHub Actions Inactive

cyruszhang reviewed Jun 16, 2026

fengrui-z requested a deployment to Testing June 17, 2026 07:44 — with GitHub Actions Waiting

fengrui-z wants to merge 7 commits into datajuicer:main from fengrui-z:fix/json-stream-schema-lock
