refactor ops processing with output_feature_hints method #986

Open
cmgzn wants to merge 2 commits into datajuicer:main from cmgzn:codex/output-feature-hints

@cmgzn cmgzn commented May 26, 2026

Collaborator

Summary

Add output_feature_hints for HuggingFace-backed NestedDataset.map.

Some operators write nested list fields whose first writer batch can be all empty
lists. HuggingFace Datasets may infer those values as list<null>, then fail
when later batches return concrete nested values such as list<list<int64>> or
list<list<float32>>.
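
To make the failure mode concrete, here is a toy model of per-batch type inference. It is only a sketch over plain Python values: `infer_type`, `unify`, and `infer_column_type` are hypothetical helpers written for this illustration, not Arrow or HuggingFace code, and the type names are plain strings.

```python
from functools import reduce


def unify(a, b):
    """Combine two toy type names, treating "null" as "unknown so far"."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if a.startswith("list<") and b.startswith("list<"):
        # Strip the "list<" prefix and ">" suffix, then unify the item types.
        return f"list<{unify(a[5:-1], b[5:-1])}>"
    raise TypeError(f"cannot unify {a} and {b}")


def infer_type(value):
    """Infer a toy type name for one Python value (assumption: Arrow-like rules)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # An empty list has no items to inspect, so its item type stays "null".
        item = reduce(unify, (infer_type(v) for v in value), "null")
        return f"list<{item}>"
    raise TypeError(f"unsupported value: {value!r}")


def infer_column_type(rows):
    """Infer one type for a whole batch of row values."""
    return reduce(unify, (infer_type(row) for row in rows), "null")


first_batch = [[], []]               # every row is an empty list
later_batch = [[[1, 2, 3, 4]], []]   # one row carries a concrete nested value

first_type = infer_column_type(first_batch)
later_type = infer_column_type(later_batch)
print(first_type)  # list<null>
print(later_type)  # list<list<int64>>
```

A writer that fixes the column type from the first batch alone ends up with `list<null>` and has no way to know that later batches need `list<list<int64>>`.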

This PR lets operators provide partial output feature hints. NestedDataset.map
merges those hints into the current dataset features and forwards the merged
schema to Dataset.map(features=...), so the Arrow writer does not infer
ambiguous empty-list fields as null.
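
As a rough illustration of the merge step, the sketch below overlays partial hints on an existing schema. It assumes plain nested dicts with string type names as a stand-in for HuggingFace feature objects; `merge_feature_hints` and the field names `meta` and `bboxes` are hypothetical and are not taken from the PR's implementation.

```python
from copy import deepcopy


def merge_feature_hints(current, hints):
    """Return a new schema: `current` with `hints` overlaid.

    Nested dicts are merged key by key; any other hinted value
    replaces whatever the current schema holds for that field.
    """
    merged = deepcopy(current)
    for name, hint in (hints or {}).items():
        if isinstance(merged.get(name), dict) and isinstance(hint, dict):
            merged[name] = merge_feature_hints(merged[name], hint)
        else:
            merged[name] = deepcopy(hint)
    return merged


# Hypothetical schema and hints, using string type names for readability.
current = {"text": "string", "meta": {"lang": "string"}}
hints = {"meta": {"bboxes": "list<list<float32>>"}}

merged = merge_feature_hints(current, hints)
print(merged)
```

Because the hints are partial, untouched fields such as `text` and `meta.lang` keep their existing types, and only the hinted field gains a concrete type.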

Changes

  • Add NestedDataset.map(..., output_feature_hints=...).
  • Add OP.output_feature_hints(input_features) as the operator-level schema hint hook.
  • Route mapper/filter/deduplicator/aggregator map calls through the hint-aware map helper.
  • Add feature hints for imgdiff difference area/caption mappers.
  • Add developer guide notes in English and Chinese for when operator authors should declare output feature hints.
  • Add a regression test for an empty nested list in the first map batch followed by a non-empty nested list.
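
The hook and the hint-aware map helper listed above might fit together roughly as follows. This is a simplified sketch, not the code in this PR: `BaseOp`, `BoxMapper`, and `resolve_map_features` are hypothetical names, schemas are plain dicts with string type names, and the merge is shallow for brevity.

```python
class BaseOp:
    """Hypothetical operator base class with an opt-in schema hook."""

    def output_feature_hints(self, input_features):
        # Default: no hints, so the caller falls back to type inference.
        return None


class BoxMapper(BaseOp):
    """Hypothetical mapper that writes a nested list field named `bboxes`."""

    def output_feature_hints(self, input_features):
        # Declare only the fields this operator adds or changes.
        return {"bboxes": "list<list<float32>>"}


def resolve_map_features(op, input_features):
    """Return the schema to pass to map, or None to keep inference."""
    hints = op.output_feature_hints(input_features)
    if not hints:
        return None
    # Shallow overlay: hinted fields win, all other fields are kept.
    return {**input_features, **hints}


features = {"text": "string"}
print(resolve_map_features(BaseOp(), features))     # None
print(resolve_map_features(BoxMapper(), features))
```

Returning `None` for operators without hints is one way to keep the change opt-in: the map call then runs without an explicit schema, as before.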

Why

This moves schema disambiguation from ad-hoc return-value shaping inside operators
to an explicit dataset-level mechanism.

Operators can return natural empty values such as [], while still telling
HuggingFace/Arrow the intended concrete output type before map cache batches are
written.

Behavior and Compatibility

This is opt-in. Operators that do not implement output_feature_hints() keep the
previous HuggingFace inference behavior.

For operators that do provide hints, HuggingFace Datasets will cast mapped values
to the declared features. This is intentional, but it means the declared feature
type must match the actual returned values. For example, bbox coordinates
declared as float32 are stored with float32 precision, so values computed as
Python floats (float64) can lose precision when written.

There is also a subtle distinction between preserving old schema workarounds and
changing output semantics. Some operators used sentinel values such as
zero-filled boxes to avoid empty-list schema inference. Replacing those sentinels
with [] is semantically cleaner, but changes exported data from a one-box zero
sentinel to an actually empty list. Downstream code should handle both forms
before such sentinels are removed.
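
A downstream consumer could accept both forms with a small normalization step. The sketch below assumes the sentinel is a single all-zero box of four coordinates; that shape, and the helper name `normalize_boxes`, are assumptions for illustration and should be checked against the operator in question.

```python
def normalize_boxes(boxes, sentinel=(0, 0, 0, 0)):
    """Map both "no boxes" encodings to an empty list.

    Accepts None, an empty list, or the assumed one-box zero sentinel,
    and returns real boxes unchanged (as plain lists).
    """
    if boxes is None:
        return []
    if len(boxes) == 1 and tuple(boxes[0]) == tuple(sentinel):
        return []
    return [list(box) for box in boxes]


print(normalize_boxes([]))                          # []
print(normalize_boxes([[0.0, 0.0, 0.0, 0.0]]))      # []
print(normalize_boxes([[1.0, 2.0, 3.0, 4.0]]))      # [[1.0, 2.0, 3.0, 4.0]]
```

One caveat of this approach is that a genuine single box with all-zero coordinates cannot be told apart from the sentinel, which is part of why the empty list is the cleaner representation.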

Tests

  • python -m unittest tests.core.data.test_dj_dataset.TestNestedDataset.test_map_output_feature_hints_allow_empty_nested_list_first_batch
  • python -m py_compile data_juicer/core/data/dj_dataset.py data_juicer/ops/base_op.py data_juicer/ops/mapper/imgdiff_difference_caption_generator_mapper.py data_juicer/ops/mapper/imgdiff_difference_area_generator_mapper.py
  • git diff --check

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to declare partial output feature hints (output_feature_hints) for operators, resolving schema inference issues in HuggingFace when early batches contain empty lists or ambiguous types. The feedback points out a critical issue in _merge_feature_dicts where recursively merging incompatible types (such as a Sequence and a Struct, which both inherit from Mapping) can corrupt the feature structure, and provides a code suggestion to ensure type compatibility before merging.
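
The reviewer's suggested code is not shown here, so the following is only a guess at the kind of guard being described: recurse into two mapping-like features only when they are the same kind, and otherwise let the hint replace the existing value. `StructSpec` and `SequenceSpec` are hypothetical stand-ins, not HuggingFace classes.

```python
from collections.abc import Mapping


class StructSpec(dict):
    """Hypothetical struct-like feature: field name -> feature."""


class SequenceSpec(dict):
    """Hypothetical sequence-like feature that also behaves as a mapping."""


def merge_compatible(base, hint):
    """Merge `hint` into `base`, recursing only when both are the same kind."""
    same_kind = (
        isinstance(base, Mapping)
        and isinstance(hint, Mapping)
        and type(base) is type(hint)
    )
    if not same_kind:
        # Incompatible kinds: take the hint whole rather than mixing keys.
        return hint
    merged = type(base)(base)
    for key, value in hint.items():
        merged[key] = merge_compatible(base[key], value) if key in base else value
    return merged


struct_feature = StructSpec(x="int64")
sequence_hint = SequenceSpec(feature="float32")

replaced = merge_compatible(struct_feature, sequence_hint)
combined = merge_compatible(StructSpec(a="int64"), StructSpec(b="string"))
print(type(replaced).__name__)  # SequenceSpec
print(dict(combined))           # {'a': 'int64', 'b': 'string'}
```

Without the `type(base) is type(hint)` check, the struct's keys and the sequence's keys would be blended into one mapping that is neither a valid struct nor a valid sequence.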

Comment thread data_juicer/core/data/dj_dataset.py