[Data] [Core] [3/n] Wire BlockRefCounter into Operators #64191

Open
rayhhome wants to merge 3 commits into ray-project:master from rayhhome:block-ref-counter-wiring

Conversation

@rayhhome

Contributor

Description

Wires BlockRefCounter into every operator and the streaming executor. ResourceManager gains the counter as an attribute in this PR, but _estimate_object_store_memory_usage is deliberately left unchanged so that the scheduling-visible change stays isolated.

Implementation

physical_operator.py. PhysicalOperator.start() now accepts block_ref_counter (this replaces the set_block_ref_counter() setter used in the prototype, per review feedback). DataOpTask.on_data_ready calls on_block_produced once both the block ref and its metadata are available.
base_physical_operator.py. AllToAllOperator.all_inputs_done skips on_block_produced for output refs that are unchanged from the input, avoiding double-counting when bulk_fn forwards refs unchanged (e.g. randomize_blocks).
streaming_executor_state.py / streaming_executor.py. Operators are no longer started inside build_streaming_topology. The executor starts them after ResourceManager is constructed so the shared counter can be passed in. block_ref_counter.clear() is called at shutdown after queues are drained.
resource_manager.py. Constructs BlockRefCounter() and exposes it via a block_ref_counter property.
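
The double-counting guard described for AllToAllOperator can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `ToyCounter`, `account_new_outputs`, the string refs, and the `(ref, size)` tuples are all hypothetical stand-ins, and only the method name `on_block_produced` comes from the description above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class ToyCounter:
    """Hypothetical stand-in for BlockRefCounter: bytes tracked per producer."""

    def __init__(self) -> None:
        self.bytes_by_producer: Dict[str, int] = defaultdict(int)

    def on_block_produced(self, ref: str, size_bytes: int, producer_id: str) -> None:
        self.bytes_by_producer[producer_id] += size_bytes


def account_new_outputs(
    counter: ToyCounter,
    input_refs: Iterable[str],
    outputs: List[Tuple[str, int]],
    producer_id: str,
) -> None:
    """Register only the outputs that were not already present in the inputs."""
    # Snapshot the inputs first; a ref forwarded unchanged is assumed to have
    # been counted already by whichever operator produced it.
    forwarded = set(input_refs)
    for ref, size_bytes in outputs:
        if ref in forwarded:
            continue  # Unchanged ref: skip it to avoid double-counting.
        counter.on_block_produced(ref, size_bytes, producer_id)


counter = ToyCounter()
# "a" and "b" are forwarded (as a reordering step would do); only "c" is new.
account_new_outputs(counter, ["a", "b"], [("b", 10), ("a", 10), ("c", 5)], "shuffle")
print(dict(counter.bytes_by_producer))  # {'shuffle': 5}
```

The set lookup keeps the check linear in the number of blocks, which matters little for small plans but avoids a quadratic scan when an all-to-all step forwards many refs.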

Tests

  • test_operators.py: updated to remove set_block_ref_counter calls; adds op.start(ExecutionOptions()) where missing.
  • test_streaming_executor.py: _make_data_op_task helper supplies default block_ref_counter and producer_id so existing DataOpTask tests keep working without changes to each call site.

Related issues

Depends on #64157 (BlockRefCounter implementation).
Related to #63601 (prototype), #63074 (previous manual BlockRefCounter).

rayhhome added 2 commits June 17, 2026 15:14
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@rayhhome rayhhome requested a review from a team as a code owner June 17, 2026 22:40
Copilot AI review requested due to automatic review settings June 17, 2026 22:40
@rayhhome rayhhome self-assigned this Jun 17, 2026
@rayhhome rayhhome added the core (Issues that should be addressed in Ray Core) and data (Ray Data-related issues) labels Jun 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a centralized BlockRefCounter to track object-store memory usage per operator via Ray Core callbacks, integrating it across various physical operators and the streaming executor. The feedback focuses on enhancing the robustness of this counter: first, by replacing the set-based tracking of registered IDs with a dictionary to handle duplicate registrations and prevent memory leaks; second, by skipping tracking for zero-sized blocks to reduce overhead; third, by initializing the counter to a default instance in PhysicalOperator to avoid potential AttributeErrors in tests; and finally, by adding proper type annotations to the overridden start methods across all operator subclasses to ensure type safety.


Comment thread python/ray/data/_internal/execution/block_ref_counter.py
Comment thread python/ray/data/_internal/execution/block_ref_counter.py
Comment thread python/ray/data/_internal/execution/block_ref_counter.py
Comment thread python/ray/data/_internal/execution/interfaces/physical_operator.py
Comment thread python/ray/data/_internal/execution/interfaces/physical_operator.py Outdated
Comment thread python/ray/data/_internal/execution/operators/map_operator.py Outdated
Comment thread python/ray/data/_internal/execution/operators/hash_shuffle.py Outdated
Comment thread python/ray/data/_internal/execution/operators/input_data_buffer.py Outdated
Comment thread python/ray/data/_internal/execution/operators/output_splitter.py Outdated
Comment thread python/ray/data/_internal/execution/operators/union_operator.py Outdated

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a centralized BlockRefCounter to track per-operator object store memory via Ray Core out-of-scope callbacks, wiring it through operator startup and the streaming executor lifecycle.

Changes:

  • Add BlockRefCounter implementation + comprehensive unit/integration tests and Bazel target.
  • Plumb a shared counter from ResourceManager into PhysicalOperator.start() and into DataOpTask so blocks are accounted once metadata is available.
  • Avoid double-counting for AllToAll bulk transforms that forward input refs unchanged; clear accounting on executor shutdown after queues are drained.
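
The "accounted once metadata is available" behaviour can be illustrated with a small sketch. It assumes, purely for illustration, that a task's output stream alternates a block ref with that block's metadata; `ToyMetadata`, `ToyDataTask`, and the plain dict used for accounting are hypothetical and are not the PR's DataOpTask.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Union


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for block metadata; only the size matters here."""

    size_bytes: int


class ToyDataTask:
    """Holds a block ref until its metadata arrives, then accounts for it."""

    def __init__(
        self,
        stream: Iterable[Union[str, ToyMetadata]],
        bytes_by_producer: Dict[str, int],
        producer_id: str,
    ) -> None:
        self._stream = iter(stream)
        self._bytes_by_producer = bytes_by_producer
        self._producer_id = producer_id
        self.pending_ref: Optional[str] = None

    def on_data_ready(self) -> None:
        for item in self._stream:
            if isinstance(item, ToyMetadata):
                if self.pending_ref is None:
                    raise ValueError("metadata arrived without a block ref")
                # Both halves of the pair are present: the size is now known.
                current = self._bytes_by_producer.get(self._producer_id, 0)
                self._bytes_by_producer[self._producer_id] = current + item.size_bytes
                self.pending_ref = None
            else:
                # A ref alone carries no size yet, so accounting is deferred.
                self.pending_ref = item


usage: Dict[str, int] = {}
task = ToyDataTask(["ref-1", ToyMetadata(100), "ref-2"], usage, "map")
task.on_data_ready()
print(usage, task.pending_ref)  # {'map': 100} ref-2
```

In this sketch "ref-2" stays pending and uncounted because its metadata has not arrived, which is the property the summary above describes.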

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/ray/data/tests/test_streaming_executor.py | Adds a helper to construct DataOpTask with default BlockRefCounter/producer_id for existing tests. |
| python/ray/data/tests/test_operators.py | Updates operator tests to call start() explicitly where required. |
| python/ray/data/tests/test_block_ref_counter.py | New test suite for BlockRefCounter accounting, clear(), thread-safety, and lifecycle integration. |
| python/ray/data/BUILD.bazel | Registers the new test_block_ref_counter Bazel py_test. |
| python/ray/data/_internal/execution/streaming_executor.py | Starts all operators with a shared counter and clears the counter at shutdown after draining queues. |
| python/ray/data/_internal/execution/streaming_executor_state.py | Stops starting operators during topology construction (executor does it after ResourceManager exists). |
| python/ray/data/_internal/execution/resource_manager.py | Constructs and exposes an executor-wide BlockRefCounter. |
| python/ray/data/_internal/execution/operators/union_operator.py | Updates start() signature to accept/forward block_ref_counter. |
| python/ray/data/_internal/execution/operators/output_splitter.py | Updates start() signature to accept/forward block_ref_counter. |
| python/ray/data/_internal/execution/operators/map_operator.py | Updates start() signature; plumbs counter + producer id into DataOpTask. |
| python/ray/data/_internal/execution/operators/input_data_buffer.py | Updates start() signature to accept/forward block_ref_counter. |
| python/ray/data/_internal/execution/operators/hash_shuffle.py | Updates start() signature; plumbs counter + producer id into finalization DataOpTask. |
| python/ray/data/_internal/execution/operators/base_physical_operator.py | Prevents double-counting in AllToAll by skipping forwarded input refs. |
| python/ray/data/_internal/execution/operators/actor_pool_map_operator.py | Updates start() signature to accept/forward block_ref_counter. |
| python/ray/data/_internal/execution/interfaces/physical_operator.py | Extends PhysicalOperator.start() and DataOpTask to accept a shared counter + producer id; DataOpTask accounts blocks when metadata is ready. |
| python/ray/data/_internal/execution/block_ref_counter.py | Adds the BlockRefCounter implementation using Ray Core out-of-scope callbacks. |


Comment thread python/ray/data/_internal/execution/interfaces/physical_operator.py Outdated
Comment thread python/ray/data/_internal/execution/interfaces/physical_operator.py
Comment thread python/ray/data/_internal/execution/block_ref_counter.py
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

@bveeramani bveeramani left a comment

Member

LGTM

Comment on lines +138 to +139
block_ref_counter: Optional[BlockRefCounter],
producer_id: str,

Member

An alternative implementation could be to make callers responsible for updating the reference counter with output_ready_callback rather than making it DataOpTask's responsibility.

Advantage is that it would reduce the amount we pass through block_ref_counter and simplify the DataOpTask interface, though I imagine it might also introduce a moderate amount of duplication.

Will defer to you about what's cleaner
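
For comparison, the callback-based alternative might look roughly like the sketch below. Everything here is a hypothetical stand-in (`ToyTask`, `make_task`, the dict-based accounting); only the idea of an `output_ready_callback` comes from the comment above. The task never sees the counter, and the caller's callback does the accounting.

```python
from typing import Callable, Dict, List


class ToyTask:
    """Hypothetical task that knows nothing about the counter."""

    def __init__(self, output_ready_callback: Callable[[str, int], None]) -> None:
        self._output_ready_callback = output_ready_callback

    def on_data_ready(self, ref: str, size_bytes: int) -> None:
        # The task only reports the output; accounting is the caller's job.
        self._output_ready_callback(ref, size_bytes)


def make_task(
    bytes_by_producer: Dict[str, int], producer_id: str, outputs: List[str]
) -> ToyTask:
    """Caller-side wiring: the closure captures the counter and producer id."""

    def _on_output(ref: str, size_bytes: int) -> None:
        bytes_by_producer[producer_id] = bytes_by_producer.get(producer_id, 0) + size_bytes
        outputs.append(ref)

    return ToyTask(_on_output)


usage: Dict[str, int] = {}
outputs: List[str] = []
task = make_task(usage, "map", outputs)
task.on_data_ready("ref-1", 64)
task.on_data_ready("ref-2", 36)
print(usage, outputs)  # {'map': 100} ['ref-1', 'ref-2']
```

The trade-off named in the comment shows up directly: the task's constructor shrinks to one argument, but every call site that builds a task has to repeat the accounting closure.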

self,
task_index: int,
streaming_gen: ObjectRefGenerator,
block_ref_counter: Optional[BlockRefCounter],

Member

When can this ever be None? We only create DataOpTasks once execution has already started, so I feel like we shouldn't allow it

def start(
    self,
    options: ExecutionOptions,
    block_ref_counter: Optional[BlockRefCounter] = None,

Member

Similar comment -- can this ever actually be None? We control when start gets called, and we make it so that start always gets called with a non-None value.

Making this just block_ref_counter: BlockRefCounter would make the code easier to reason about

Comment on lines +174 to +178
@property
def block_ref_counter(self) -> BlockRefCounter:
    """The centralized block reference counter for this executor."""
    return self._block_ref_counter

Member

We can probably simplify the dataflow here.

Currently, it's like:

  1. Executor creates ResourceManager
  2. ResourceManager constructs counter
  3. Executor gets counter from ResourceManager
  4. Executor passes counter from ResourceManager to Operators

I think it'd be clearer as

  1. Executor constructs counter
  2. Executor passes counter to ResourceManager as constructor dependency
  3. Executor passes counter to operators in start
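
The proposed three-step flow can be sketched as plain constructor injection. The classes below are hypothetical toys rather than the real executor, ResourceManager, or operators, and `estimated_usage` is an invented method used only to show that the manager can read the counter without exposing it.

```python
from typing import List


class ToyCounter:
    """Hypothetical shared counter; tracks a single running total."""

    def __init__(self) -> None:
        self.total_bytes = 0

    def on_block_produced(self, size_bytes: int) -> None:
        self.total_bytes += size_bytes


class ToyResourceManager:
    def __init__(self, counter: ToyCounter) -> None:
        # The counter is a constructor dependency; no public property is needed.
        self._counter = counter

    def estimated_usage(self) -> int:
        return self._counter.total_bytes


class ToyOperator:
    def start(self, counter: ToyCounter) -> None:
        self._counter = counter

    def produce(self, size_bytes: int) -> None:
        self._counter.on_block_produced(size_bytes)


class ToyExecutor:
    def __init__(self, operators: List[ToyOperator]) -> None:
        counter = ToyCounter()                               # 1. construct the counter
        self.resource_manager = ToyResourceManager(counter)  # 2. inject into the manager
        for op in operators:
            op.start(counter)                                # 3. pass to operators in start


ops = [ToyOperator(), ToyOperator()]
executor = ToyExecutor(ops)
ops[0].produce(10)
ops[1].produce(5)
print(executor.resource_manager.estimated_usage())  # 15
```

Because the executor owns the counter, the manager and the operators both receive it from one place, and nothing has to read it back out of the manager.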

Member

If we do this, we can also avoid expanding the ResourceManager interface with the block_ref_counter property

Comment on lines +1397 to +1398
kwargs.setdefault("block_ref_counter", BlockRefCounter())
kwargs.setdefault("producer_id", "test_op")

Member

Here and below -- I think the tests would be clearer if we just inlined these parameters and didn't introduce the _make_data_op_task layer of indirection

# outputs (e.g., map outputs for map-reduce).
output_buffer, self._stats = self._bulk_fn(self._input_buffer.to_list(), ctx)

# Snapshot input refs before calling bulk_fn. Some bulk_fns (e.g.

Member

Would it make sense/be simpler if we made on_block_produced idempotent?
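
One way an idempotent `on_block_produced` could work is sketched below. This is an assumption about a possible design, not the PR's implementation: `IdempotentToyCounter`, `on_block_released`, and the string refs are hypothetical. Registering a ref that is already tracked becomes a no-op, so a forwarded ref stays attributed to its original producer.

```python
from collections import defaultdict
from typing import Dict, Tuple


class IdempotentToyCounter:
    """Hypothetical counter in which repeated registration is a no-op."""

    def __init__(self) -> None:
        # ref -> (producer_id, size_bytes); a dict so release can undo exactly.
        self._entries: Dict[str, Tuple[str, int]] = {}
        self.bytes_by_producer: Dict[str, int] = defaultdict(int)

    def on_block_produced(self, ref: str, size_bytes: int, producer_id: str) -> bool:
        if ref in self._entries:
            return False  # Already tracked: keep the original attribution.
        self._entries[ref] = (producer_id, size_bytes)
        self.bytes_by_producer[producer_id] += size_bytes
        return True

    def on_block_released(self, ref: str) -> None:
        entry = self._entries.pop(ref, None)
        if entry is not None:
            producer_id, size_bytes = entry
            self.bytes_by_producer[producer_id] -= size_bytes


counter = IdempotentToyCounter()
counter.on_block_produced("a", 10, "read")
counter.on_block_produced("a", 10, "shuffle")  # Forwarded ref: ignored.
counter.on_block_produced("c", 5, "shuffle")
print(dict(counter.bytes_by_producer))  # {'read': 10, 'shuffle': 5}
```

With this shape the caller would no longer need to snapshot input refs, at the cost of one dictionary lookup per registration.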
