refactor ops processing with output_feature_hints method by cmgzn · Pull Request #986 · datajuicer/data-juicer

cmgzn · 2026-05-26T06:30:15Z

Summary

Add output_feature_hints for HuggingFace-backed NestedDataset.map.

Some operators write nested list fields whose first writer batch can be all empty
lists. HuggingFace Datasets may infer those values as list<null>, then fail
when later batches return concrete nested values such as list<list<int64>> or
list<list<float32>>.
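The failure mode can be illustrated with a toy inference function. This is only a sketch in plain Python: `infer_type` is a hypothetical stand-in, not Arrow's or HuggingFace's actual inference code, and the type strings merely imitate Arrow's notation.

```python
# Toy model of value-based type inference; NOT Arrow's real algorithm.
def infer_type(value):
    """Return an Arrow-like type string for a nested Python value."""
    if isinstance(value, list):
        if not value:
            return "list<null>"  # an empty list gives no element type to inspect
        return f"list<{infer_type(value[0])}>"
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    return "null" if value is None else type(value).__name__


first_batch = [[], []]          # every row in the first batch is an empty list
later_batch = [[[1, 2, 3, 4]]]  # a later row holds one nested list of ints

schema_from_first = infer_type(first_batch[0])
schema_from_later = infer_type(later_batch[0])
mismatch = schema_from_first != schema_from_later
```

In this toy model the first batch fixes the column as `list<null>`, which no longer agrees with the `list<list<int64>>` implied by the later batch.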

This PR lets operators provide partial output feature hints. NestedDataset.map
merges those hints into the current dataset features and forwards the merged
schema to Dataset.map(features=...), so the Arrow writer does not infer
ambiguous empty-list fields as null.
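A minimal sketch of the merge step, assuming features can be modelled as nested dicts with type strings as leaves. The real code works on HuggingFace feature objects, and `merge_feature_hints` is a hypothetical name, not the PR's own helper. Recursion happens only when both sides are plain dicts, so a hint replaces an incompatible existing entry instead of being merged into it.

```python
import copy


def merge_feature_hints(current, hints):
    """Return a new feature dict where `hints` extend or override `current`."""
    merged = copy.deepcopy(current)
    for name, hint in hints.items():
        existing = merged.get(name)
        if type(existing) is dict and type(hint) is dict:
            # Both sides are struct-like: merge field by field.
            merged[name] = merge_feature_hints(existing, hint)
        else:
            # New field or incompatible kinds: the hint wins outright.
            merged[name] = copy.deepcopy(hint)
    return merged


# Stand-in schema: strings play the role of concrete feature types.
current = {"text": "string", "meta": {"source": "string", "bboxes": "list<null>"}}
hints = {"meta": {"bboxes": "list<list<float32>>"}}
merged = merge_feature_hints(current, hints)
```

Only the hinted field changes; `text` and `meta.source` keep their existing types, and `current` itself is left untouched.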

Changes

- Add NestedDataset.map(..., output_feature_hints=...).
- Add OP.output_feature_hints(input_features) as the operator-level schema hint hook.
- Route mapper/filter/deduplicator/aggregator map calls through the hint-aware map helper.
- Add feature hints for imgdiff difference area/caption mappers.
- Add developer guide notes in English and Chinese for when operator authors should declare output feature hints.
- Add a regression test for an empty nested list in the first map batch followed by a non-empty nested list.
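The operator-level hook described in the list above could look roughly like the following. Everything here is an illustrative assumption: `BaseOp`, `BoxMapper`, `PassthroughMapper`, and `features_for_map` are hypothetical names, type strings stand in for real feature objects, and the shallow merge is a simplification.

```python
class BaseOp:
    """Stand-in for an operator base class (hypothetical, not the PR's code)."""

    def output_feature_hints(self, input_features):
        # Default: no hints, so the caller keeps plain schema inference.
        return None

    def process(self, sample):
        raise NotImplementedError


class BoxMapper(BaseOp):
    """Hypothetical mapper that writes a nested list field named `bboxes`."""

    def output_feature_hints(self, input_features):
        return {"bboxes": "list<list<float32>>"}

    def process(self, sample):
        return {**sample, "bboxes": []}  # natural empty value, no sentinel


class PassthroughMapper(BaseOp):
    """Hypothetical mapper that declares no hints."""

    def process(self, sample):
        return sample


def features_for_map(op, input_features):
    """Return the schema to forward to map, or None to keep inference."""
    hints = op.output_feature_hints(input_features)
    if not hints:
        return None
    return {**input_features, **hints}  # shallow merge, for brevity only


hinted = features_for_map(BoxMapper(), {"text": "string"})
unhinted = features_for_map(PassthroughMapper(), {"text": "string"})
```

Returning `None` for operators without hints keeps the path opt-in: only operators that override the hook cause an explicit schema to be forwarded.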

Why

This moves schema disambiguation from ad-hoc return-value shaping inside operators
to an explicit dataset-level mechanism.

Operators can return natural empty values such as [], while still telling
HuggingFace/Arrow the intended concrete output type before map cache batches are
written.

Behavior and Compatibility

This is opt-in. Operators that do not implement output_feature_hints() keep the
previous HuggingFace inference behavior.

For operators that do provide hints, HuggingFace Datasets will cast mapped values
to the declared features. This is intentional, but it means the declared feature
type must match the actual returned values. For example, bbox coordinates
declared as float32 are stored with float32 precision, even when the operator
computes them as 64-bit Python floats.
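The precision effect can be reproduced with the standard library alone. This sketch round-trips a Python float through IEEE 754 single precision; it does not call HuggingFace Datasets, and `as_float32` is a hypothetical helper name.

```python
import struct


def as_float32(x):
    """Round-trip a 64-bit Python float through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


coord = 0.1                 # not exactly representable in binary floating point
stored = as_float32(coord)  # value after narrowing to 32 bits
changed = stored != coord
error = abs(stored - coord)
```

Values such as 0.5 that are exactly representable in 32 bits survive unchanged, while values such as 0.1 come back slightly different, so exact equality checks against the operator's original floats may fail after the cast.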

There is also a subtle distinction between preserving old schema workarounds and
changing output semantics. Some operators used sentinel values such as
zero-filled boxes to avoid empty-list schema inference. Replacing those sentinels
with [] is semantically cleaner, but changes exported data from a one-box zero
sentinel to an actually empty list. Downstream code should handle both forms
before such sentinels are removed.
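One way for downstream code to accept both forms during the transition is to normalize on read. This is a sketch under stated assumptions: `normalize_boxes` is a hypothetical helper, and it assumes that a lone all-zero box is always the legacy sentinel and never a real detection.

```python
def normalize_boxes(boxes):
    """Map both the legacy zero-filled sentinel and [] to an empty list."""
    is_sentinel = (
        len(boxes) == 1
        and len(boxes[0]) > 0
        and all(coord == 0 for coord in boxes[0])
    )
    if is_sentinel:
        return []
    return list(boxes)


legacy = normalize_boxes([[0, 0, 0, 0]])     # old export with a zero sentinel
current = normalize_boxes([])                # new export with a real empty list
real = normalize_boxes([[10, 20, 30, 40]])   # genuine box is passed through
```

An all-zero box that appears alongside other boxes is kept, since only the single-box case matches the sentinel pattern described above.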

Tests

python -m unittest tests.core.data.test_dj_dataset.TestNestedDataset.test_map_output_feature_hints_allow_empty_nested_list_first_batch
python -m py_compile data_juicer/core/data/dj_dataset.py data_juicer/ops/base_op.py data_juicer/ops/mapper/imgdiff_difference_caption_generator_mapper.py data_juicer/ops/mapper/imgdiff_difference_area_generator_mapper.py
git diff --check

gemini-code-assist

Code Review

This pull request introduces a mechanism to declare partial output feature hints (output_feature_hints) for operators, resolving schema inference issues in HuggingFace when early batches contain empty lists or ambiguous types. The feedback points out a critical issue in _merge_feature_dicts where recursively merging incompatible types (such as a Sequence and a Struct, which both inherit from Mapping) can corrupt the feature structure, and provides a code suggestion to ensure type compatibility before merging.

refactor: add output_feature_hints method and related functionality for ops processing

f5753d5

cmgzn requested a deployment to Testing May 26, 2026 06:30 — with GitHub Actions Waiting

gemini-code-assist Bot reviewed May 26, 2026

Comment thread data_juicer/core/data/dj_dataset.py

refactor: replace Sequence with List for consistency across ops and docs

dbf8639

cmgzn requested a deployment to Testing May 26, 2026 07:08 — with GitHub Actions Waiting

cmgzn wants to merge 2 commits into datajuicer:main from cmgzn:codex/output-feature-hints
