Fix Ray deduplicator shared state #978
Code Review
This pull request introduces a mechanism to handle stateful operators within Ray datasets by allowing operators to trigger dataset materialization after execution. This change specifically addresses potential issues in deduplication where Ray's lazy re-execution could lead to incorrect results due to persistent state in actors or external backends. The feedback highlights that the RedisBackend should also trigger this materialization to prevent similar state conflicts and suggests refactoring the actor initialization logic to eliminate code duplication.
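The re-execution hazard described in this review can be illustrated without Ray. The following is a minimal sketch, not the PR's implementation: `LazyDataset` and `StatefulDedup` are hypothetical stand-ins for a lazily evaluated dataset and a deduplication operator that keeps state between calls. It assumes only that each read of a lazy dataset re-runs the pipeline, which is the behavior the review attributes to Ray's lazy re-execution.

```python
from typing import Callable, Iterable, Iterator, List


class LazyDataset:
    """Hypothetical toy dataset: the pipeline re-runs on every read."""

    def __init__(self, source: Iterable[dict], keep: Callable[[dict], bool]):
        self._source = list(source)
        self._keep = keep

    def __iter__(self) -> Iterator[dict]:
        # Each iteration re-executes the filter, as a lazy plan would.
        return (row for row in self._source if self._keep(row))

    def materialize(self) -> List[dict]:
        # Run the pipeline exactly once and pin the result in memory.
        return list(self)


class StatefulDedup:
    """Hypothetical dedup filter that remembers keys across calls."""

    def __init__(self) -> None:
        self.seen: set = set()

    def __call__(self, row: dict) -> bool:
        key = row["text"]
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


rows = [{"text": "a"}, {"text": "b"}, {"text": "a"}]

lazy = LazyDataset(rows, StatefulDedup())
first_read = list(lazy)   # correct: duplicates removed
second_read = list(lazy)  # re-execution: every key is already marked as seen

# Materializing once after the stateful step pins the correct result.
pinned = LazyDataset(rows, StatefulDedup()).materialize()
```

In this sketch the second read comes back empty because the operator's state survived the first execution. Pinning the output immediately after the stateful step means later reads reuse the stored rows instead of running the operator again.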
Strength
Risks
Recommendation
Approve and merge after noting the |
When a dedup operator has ray_execution_mode='actor' but gets downgraded to task mode to preserve shared dedup state, emit an info log so users understand why their config was overridden.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
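The downgrade-with-log behavior in this commit message could look roughly like the sketch below. The function name `resolve_execution_mode`, its parameters, the logger name, and the log wording are all assumptions made for illustration. Only the `ray_execution_mode='actor'` setting and the downgrade to task mode come from the commit message.

```python
import logging

logger = logging.getLogger("dedup_mode_sketch")  # hypothetical logger name


def resolve_execution_mode(
    op_name: str, requested_mode: str, shares_dedup_state: bool
) -> str:
    """Return the mode to run with, logging when the request is overridden."""
    if requested_mode == "actor" and shares_dedup_state:
        # Tell the user why the configured mode is not being honored.
        logger.info(
            "Operator %s: ray_execution_mode='actor' overridden to 'task' "
            "to preserve shared dedup state.",
            op_name,
        )
        return "task"
    return requested_mode
```

Logging at info level rather than warning matches the commit's intent: the override is expected behavior, but it should still be visible to anyone wondering why their setting had no effect.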
d8781ae to c551ad0
Summary
map_batches tasks for a single execution.
RayDataset.to_list() instead of iterating RayDataset directly.
Fixes #971.
Validation
Result:
9 passed, 10 warnings in 69.43s.