[Data] New Catalog Abstraction and Unity Catalog Read Consolidation #64193
kyuds wants to merge 20 commits
Signed-off-by: Daniel Shin <kyuds@anyscale.com>
Code Review
This pull request introduces a new Catalog connector API (Catalog, UnityCatalog) to invert the dependency between Ray Data readers and the authentication layer, allowing readers like read_parquet, read_delta, and read_iceberg to accept an optional catalog argument for credential and path resolution. The legacy read_unity_catalog function is deprecated and refactored into a shim. The review feedback suggests several important improvements: tracking and cleaning up active temporary GCP credential files to prevent resource leaks and accumulating atexit handlers, and adding defensive type checks in read_parquet and read_delta to ensure that the input path is a single string table identifier when a catalog is specified.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ef239dc.

Description
We have several read_* APIs for different sources like Parquet, Delta, and Iceberg. At the same time, users also use catalog providers like Unity Catalog.

For Unity Catalog in particular, we have a separate function called read_unity_catalog, which performs catalog authentication and then offloads to the reader for the appropriate data format (e.g., Parquet). This indirection is a bit confusing, and to pass arguments to the offloaded function, users have to pass a dict in reader_kwargs, which has no defined schema.

This PR introduces a new Catalog contract and a UnityCatalog implementation that users can pass into read_* functions to authenticate right away.

This also provides a basis for moving authentication from the Dataset construction phase to the execution phase, which could also allow Ray workers to retry upon credential expiration.
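To make the inverted dependency concrete, here is a minimal, self-contained sketch. It does not use Ray and is not the PR's API: `ResolvedTable`, `Catalog.resolve`, `InMemoryCatalog`, and `plan_parquet_read` are hypothetical names chosen for illustration. It assumes a catalog maps a table identifier to a storage path, a data format, and credentials, and that a reader given a catalog accepts only a single string identifier.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class ResolvedTable:
    """Hypothetical resolution result: where the data lives and how to reach it."""
    path: str
    data_format: str
    credentials: Dict[str, str] = field(default_factory=dict)


class Catalog(ABC):
    """Hypothetical contract: turn a table identifier into a path plus credentials."""

    @abstractmethod
    def resolve(self, table: str) -> ResolvedTable:
        ...


class InMemoryCatalog(Catalog):
    """Toy stand-in for a real catalog such as Unity Catalog (no network calls)."""

    def __init__(self, tables: Dict[str, ResolvedTable]):
        self._tables = tables

    def resolve(self, table: str) -> ResolvedTable:
        if table not in self._tables:
            raise ValueError(f"Unknown table: {table!r}")
        return self._tables[table]


def plan_parquet_read(
    paths: Union[str, List[str]], *, catalog: Optional[Catalog] = None
) -> dict:
    """Hypothetical reader front end: report what would be read, without reading."""
    if catalog is None:
        # Without a catalog, paths are ordinary file paths.
        path_list = [paths] if isinstance(paths, str) else list(paths)
        return {"paths": path_list, "credentials": {}}
    # With a catalog, the input must be one table identifier.
    if not isinstance(paths, str):
        raise TypeError("With a catalog, pass a single string table identifier.")
    resolved = catalog.resolve(paths)
    if resolved.data_format != "parquet":
        raise ValueError(f"Table {paths!r} is {resolved.data_format}, not parquet.")
    return {"paths": [resolved.path], "credentials": dict(resolved.credentials)}
```

In this shape the reader asks the catalog for a path and credentials instead of a catalog-specific function choosing the reader, so reader arguments stay ordinary keyword arguments rather than an untyped dict.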
TODO:
Test (will mark complete if pass)
read_unity_catalog still works and data format inference also works
Related issues
N/A
Additional information
Note:
read_unity_catalog api.