Fix Ray deduplicator shared state by macroguo-ghy · Pull Request #978 · datajuicer/data-juicer

macroguo-ghy · 2026-05-14T04:12:06Z

Summary

Share Ray deduplicator backend state across map_batches tasks for a single execution.
Prepare Ray actor-backed dedup sets before serializing the operator into Ray tasks without recreating existing actor handles.
Materialize Ray basic deduplicator stats for all stateful backends, including Redis, before later actions can re-run the lazy stats stage.
Add regression coverage for document deduplication across Ray blocks, repeated executions, Redis materialization signaling, and actor handle reuse.
Fix Ray test helper conversion by using RayDataset.to_list() instead of iterating RayDataset directly.
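The first two summary points can be illustrated without Ray. The sketch below is a toy model only: `copy.deepcopy` stands in for Ray serializing the operator into each task, and the module-level `_SHARED_SETS` dictionary stands in for an actor-backed dedup set. All class and function names here are hypothetical and are not taken from the PR's code.

```python
import copy


class LocalDedup:
    """Operator that owns its seen-set, so each task copy gets a private set."""

    def __init__(self):
        self.seen = set()

    def filter_block(self, block):
        kept = []
        for text in block:
            if text not in self.seen:
                self.seen.add(text)
                kept.append(text)
        return kept


# Stand-in for actor state that lives outside any single task (assumption).
_SHARED_SETS = {}


class SharedHandle:
    """Handle that copies as a name only, so every copy points at one set."""

    def __init__(self, name):
        self.name = name
        _SHARED_SETS.setdefault(name, set())

    def add_if_new(self, key):
        seen = _SHARED_SETS[self.name]
        if key in seen:
            return False
        seen.add(key)
        return True


class SharedDedup:
    """Operator whose handle is prepared once, before it is copied to tasks."""

    def __init__(self, handle):
        self.handle = handle

    def filter_block(self, block):
        return [text for text in block if self.handle.add_if_new(text)]


def run_tasks(op, blocks):
    """Process each block with an independent copy of the operator."""
    out = []
    for block in blocks:
        task_op = copy.deepcopy(op)  # mimics per-task deserialization
        out.extend(task_op.filter_block(block))
    return out


blocks = [["a", "b"], ["b", "c"], ["a", "c"]]
local_result = run_tasks(LocalDedup(), blocks)
shared_result = run_tasks(SharedDedup(SharedHandle("run-1")), blocks)
print(local_result)   # duplicates survive across blocks
print(shared_result)  # each document is kept once
```

With per-task state, duplicates are only removed inside a single block. With a handle prepared up front, all copies consult the same set, which is the behavior the summary describes for a single execution.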

Fixes #971.

Validation

python3 -m pytest tests/ops/deduplicator/test_ray_document_deduplicator.py -q

Result: 9 passed, 10 warnings in 69.43s.

gemini-code-assist

Code Review

This pull request introduces a mechanism to handle stateful operators within Ray datasets by allowing operators to trigger dataset materialization after execution. This change specifically addresses potential issues in deduplication where Ray's lazy re-execution could lead to incorrect results due to persistent state in actors or external backends. The feedback highlights that the RedisBackend should also trigger this materialization to prevent similar state conflicts and suggests refactoring the actor initialization logic to eliminate code duplication.
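The interaction between lazy re-execution and persistent state described in this review can be sketched with a small stand-in. `LazyDataset` below is a hypothetical toy class, not Ray's `Dataset`; it only mimics the idea that each read re-runs the stage chain unless the result has been materialized.

```python
class LazyDataset:
    """Toy lazy dataset: every read re-runs the whole stage chain."""

    def __init__(self, rows, stages=()):
        self._rows = list(rows)
        self._stages = tuple(stages)

    def map_batches(self, fn):
        return LazyDataset(self._rows, self._stages + (fn,))

    def take_all(self):
        rows = list(self._rows)
        for fn in self._stages:
            rows = fn(rows)
        return rows

    def materialize(self):
        # Run the stages once and pin the output as plain rows.
        return LazyDataset(self.take_all())


def make_dedup_stage():
    seen = set()  # persists across re-executions, like an actor or Redis

    def stage(rows):
        kept = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                kept.append(row)
        return kept

    return stage


rows = ["x", "y", "x", "z"]

lazy = LazyDataset(rows).map_batches(make_dedup_stage())
first_read = lazy.take_all()
second_read = lazy.take_all()  # stage re-runs against already-populated state

pinned = LazyDataset(rows).map_batches(make_dedup_stage()).materialize()
pinned_reads = (pinned.take_all(), pinned.take_all())
```

In this model the second lazy read returns nothing, because every row is already recorded as seen, while the materialized dataset returns the same deduplicated rows on every read.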

fengrui-z · 2026-05-29T07:53:16Z

Strength

Correct direction. Eagerly creating dedup actors on the driver and letting Ray pickle the handles to every worker is the canonical pattern for shared stateful operators — the fix targets the actual root cause.

Risks

Loss of lazy autoscale. Actor count is now locked in at planning time based on current cluster_resources(), breaking ActorBackend's original deferred-creation design.
materialize() buffers the full post-dedup dataset into the object store. Previously the output streamed, so this is a real memory cost on large datasets and worth noting in the PR description.
Silently overrides ray_execution_mode='actor' for dedup ops. Downgrades to task mode without a log line; users may wonder why their config "didn't take effect."
Hook is informal. Underscore-prefixed and getattr-probed instead of declared on Filter. Consider promoting to a formal Filter.prepare_for_ray_map_batches() -> bool API for future reuse.
Repeated-read regression test depends on Ray's scheduling and didn't reproduce the original failure locally. A unit test that directly asserts materialize() was called would be more robust across Ray versions.
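The last risk suggests asserting the materialize() call directly rather than relying on scheduling. One possible shape for such a test, using `unittest.mock`, is sketched below. `apply_filter` and the `needs_materialize` attribute are hypothetical stand-ins for whatever runner and signal the PR actually uses.

```python
from unittest import mock


def apply_filter(dataset, op):
    """Hypothetical runner: apply the op, then materialize if the op asks."""
    result = dataset.map_batches(op.process_batch)
    if getattr(op, "needs_materialize", False):
        result = result.materialize()
    return result


def check_materialize_called():
    dataset = mock.MagicMock(name="dataset")
    op = mock.MagicMock(name="dedup_op")
    op.needs_materialize = True

    out = apply_filter(dataset, op)

    mapped = dataset.map_batches.return_value
    mapped.materialize.assert_called_once_with()
    return out is mapped.materialize.return_value


def check_materialize_skipped():
    dataset = mock.MagicMock(name="dataset")
    op = mock.MagicMock(name="stateless_op")
    op.needs_materialize = False

    apply_filter(dataset, op)

    dataset.map_batches.return_value.materialize.assert_not_called()
    return True
```

Because the dataset is a mock, the check does not depend on how any particular Ray version schedules or caches blocks.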

Recommendation

Approve and merge after noting the materialize() memory cost and lost lazy-autoscale in the PR description. Optionally: promote the hook to a formal Filter API.
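A rough sketch of what promoting the hook to a declared method might look like follows. The `Filter` class here is a stand-in, not data-juicer's real base class, and the meaning given to the boolean return value (True when the op holds shared state) is an assumption, since the review only names the signature.

```python
class Filter:
    """Stand-in base class (assumption: not data-juicer's real Filter)."""

    def prepare_for_ray_map_batches(self) -> bool:
        # Default: stateless op, nothing to prepare.
        return False


class StatefulDedup(Filter):
    """Hypothetical dedup op that prepares shared state exactly once."""

    def __init__(self):
        self.backend = None
        self.prepare_calls = 0

    def prepare_for_ray_map_batches(self) -> bool:
        self.prepare_calls += 1
        if self.backend is None:
            # Placeholder for creating actor handles on the driver.
            self.backend = object()
        return True


def plan(op: Filter) -> str:
    """Hypothetical planner step that calls the hook instead of getattr-probing."""
    stateful = op.prepare_for_ray_map_batches()
    return "materialize-after" if stateful else "stream"
```

Declaring the method on the base class gives every op a default and lets the planner call it unconditionally. Calling the hook twice on `StatefulDedup` leaves the existing `backend` in place, which mirrors the "without recreating existing actor handles" requirement from the summary.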

When a dedup operator has ray_execution_mode='actor' but gets downgraded to task mode to preserve shared dedup state, emit an info log so users understand why their config was overridden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
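The behavior in the commit message above could look roughly like the following. This is a sketch under assumptions: it uses the standard `logging` module, and `resolve_execution_mode` and `shares_dedup_state` are hypothetical names, not identifiers from the PR.

```python
import logging

logger = logging.getLogger("ray_dedup_sketch")


def resolve_execution_mode(requested_mode: str, shares_dedup_state: bool) -> str:
    """Return the mode to run in, logging when the request is overridden."""
    if requested_mode == "actor" and shares_dedup_state:
        logger.info(
            "ray_execution_mode='actor' was requested, but this dedup op "
            "runs in task mode to preserve shared dedup state."
        )
        return "task"
    return requested_mode


mode_for_dedup = resolve_execution_mode("actor", shares_dedup_state=True)
mode_for_other = resolve_execution_mode("actor", shares_dedup_state=False)
```

Logging at the point of the override keeps the message next to the decision, so a user who sets 'actor' can see why a different mode was used.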

macroguo-ghy added 2 commits May 14, 2026 10:58

fix(ray): share dedup state per execution

09a756f

Fix Ray document deduplicator test dataset conversion

14899c8

macroguo-ghy had a problem deploying to Testing May 14, 2026 04:12 — with GitHub Actions Failure

macroguo-ghy changed the title ~~[codex] Fix Ray deduplicator shared state~~ Fix Ray deduplicator shared state May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated

Comment thread data_juicer/ops/deduplicator/ray_basic_deduplicator.py Outdated

Address Ray deduplicator review comments

5dce715

macroguo-ghy had a problem deploying to Testing May 14, 2026 06:18 — with GitHub Actions Failure

macroguo-ghy temporarily deployed to Testing May 14, 2026 06:18 — with GitHub Actions Inactive

macroguo-ghy marked this pull request as ready for review May 17, 2026 17:28

cmgzn requested review from Dludora and fengrui-z May 22, 2026 07:39

fengrui-z had a problem deploying to Testing June 8, 2026 06:49 — with GitHub Actions Error

fengrui-z force-pushed the codex/fix-ray-dedup-state-971 branch from d8781ae to c551ad0 Compare June 8, 2026 06:55

fengrui-z had a problem deploying to Testing June 8, 2026 06:56 — with GitHub Actions Failure

fengrui-z temporarily deployed to Testing June 8, 2026 06:56 — with GitHub Actions Inactive

macroguo-ghy wants to merge 4 commits into datajuicer:main from macroguo-ghy:codex/fix-ray-dedup-state-971


2 participants
