[Data] New Catalog Abstraction and Unity Catalog Read Consolidation #64193

Open
kyuds wants to merge 20 commits into master from kyuds/uc-consolidation

@kyuds kyuds commented Jun 17, 2026

Member

Description

We have several read_* APIs for different sources such as Parquet, Delta, and Iceberg. At the same time, users also rely on catalog providers such as Unity Catalog.

For Unity Catalog in particular, we have a separate function, read_unity_catalog, which performs catalog authentication and then delegates to the reader for the appropriate data format (e.g., Parquet). This indirection is confusing, and to pass arguments to the delegated reader, users have to supply a dict in reader_kwargs, which has no defined schema.

This PR introduces a new Catalog contract and a UnityCatalog implementation that users can pass into read_* functions and authenticate right away.

This also provides a basis for moving authentication from the Dataset construction phase to the execution phase, which could in turn allow Ray workers to retry upon credential expiration.

@PublicAPI(stability="alpha")
class Catalog(ABC):
    """A directory service that resolves a table name to a readable source."""

    @abstractmethod
    def resolve(self, table: str, *, reader: ReaderFormat) -> ResolvedSource:
        """Resolve ``table`` for the given ``reader``."""
        ...
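
To make the contract concrete, here is a minimal, self-contained sketch of how an implementation might satisfy it and how a reader could defer resolution until read time. It does not import Ray. `ReaderFormat`, `ResolvedSource` and their fields, `InMemoryCatalog`, and `read_table` are hypothetical stand-ins invented for illustration, since the description above shows only the `resolve` signature; the actual types in this PR may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


class ReaderFormat(Enum):
    """Hypothetical stand-in listing the formats named in this PR."""

    PARQUET = "parquet"
    DELTA = "delta"
    ICEBERG = "iceberg"


@dataclass(frozen=True)
class ResolvedSource:
    """Hypothetical stand-in: a storage path plus credentials to read it."""

    path: str
    credentials: Dict[str, str] = field(default_factory=dict)


class Catalog(ABC):
    """Same shape as the contract above, minus the Ray annotation."""

    @abstractmethod
    def resolve(self, table: str, *, reader: ReaderFormat) -> ResolvedSource:
        ...


class InMemoryCatalog(Catalog):
    """Toy catalog backed by a dict; issues a fresh fake token per call."""

    def __init__(self, tables: Dict[str, Tuple[ReaderFormat, str]]):
        self._tables = dict(tables)
        self.resolve_calls = 0

    def resolve(self, table: str, *, reader: ReaderFormat) -> ResolvedSource:
        if table not in self._tables:
            raise ValueError(f"Unknown table: {table!r}")
        table_format, path = self._tables[table]
        if table_format is not reader:
            raise ValueError(
                f"{table!r} is stored as {table_format.value}, "
                f"not {reader.value}"
            )
        self.resolve_calls += 1
        token = f"token-{self.resolve_calls}"
        return ResolvedSource(path=path, credentials={"token": token})


def read_table(
    catalog: Catalog,
    table: str,
    *,
    reader: ReaderFormat,
    open_fn: Callable[[ResolvedSource], T],
    max_attempts: int = 2,
) -> T:
    """Resolve at read time and re-resolve if the credentials were rejected.

    PermissionError stands in for whatever "credentials expired" error the
    real storage client would raise (an assumption of this sketch).
    """
    for attempt in range(1, max_attempts + 1):
        source = catalog.resolve(table, reader=reader)
        try:
            return open_fn(source)
        except PermissionError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

Because `read_table` calls `resolve` inside the retry loop, each attempt gets fresh credentials, which is the property that moving authentication to the execution phase is meant to enable.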

TODO:

  • Add documentation

Tests (will mark complete once they pass)

  • Reading Parquet data with Unity Catalog works
  • Reading Delta data with Unity Catalog works
  • Reading Iceberg data with Unity Catalog works
  • read_unity_catalog still works, and data format inference also works

Related issues

N/A

Additional information

Note:

  • The diff mainly ports code over from removed files.
  • The read_unity_catalog API is deprecated.

Signed-off-by: Daniel Shin <kyuds@anyscale.com>
@kyuds kyuds requested a review from a team as a code owner June 17, 2026 23:02

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new Catalog connector API (Catalog, UnityCatalog) to invert the dependency between Ray Data readers and the authentication layer, allowing readers like read_parquet, read_delta, and read_iceberg to accept an optional catalog argument for credential and path resolution. The legacy read_unity_catalog function is deprecated and refactored into a shim. The review feedback suggests several important improvements: tracking and cleaning up active temporary GCP credential files to prevent resource leaks and accumulating atexit handlers, and adding defensive type checks in read_parquet and read_delta to ensure that the input path is a single string table identifier when a catalog is specified.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/data/catalog.py Outdated
Comment thread python/ray/data/read_api.py
Comment thread python/ray/data/read_api.py
Comment thread python/ray/data/catalog.py Outdated
Comment thread python/ray/data/read_api.py Outdated
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
@kyuds kyuds added the go label (add ONLY when ready to merge, run all tests) Jun 17, 2026
Comment thread python/ray/data/read_api.py
Comment thread python/ray/data/read_api.py
kyuds added 9 commits June 17, 2026 16:13
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Comment thread python/ray/data/catalog.py
kyuds added 2 commits June 17, 2026 17:15
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
.
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Comment thread python/ray/data/catalog.py Outdated
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
@ray-gardener ray-gardener Bot added the data label (Ray Data-related issues) Jun 18, 2026
Comment thread python/ray/data/_internal/datasource/uc_datasource.py
kyuds added 3 commits June 18, 2026 17:15
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ef239dc.

Comment thread python/ray/data/read_api.py
kyuds added 2 commits June 18, 2026 18:02
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Signed-off-by: Daniel Shin <kyuds@anyscale.com>

Labels

  • data: Ray Data-related issues
  • go: add ONLY when ready to merge, run all tests

1 participant