Skip to content

Added check for anon on File read methods#1783

Open
ilongin wants to merge 11 commits into
mainfrom
ilongin/1778-annon-propagation
Open

Added check for anon on File read methods#1783
ilongin wants to merge 11 commits into
mainfrom
ilongin/1778-annon-propagation

Conversation

@ilongin

@ilongin ilongin commented May 26, 2026

Copy link
Copy Markdown
Contributor

Closes #1778.

When a File read (open_object / get_file_info) hits PermissionError on a cloud client that has credentials but no access to the target bucket, automatically retry with anon=True and cache the result per (protocol, bucket) so subsequent calls go straight to anonymous mode.

  • Adds _anon_fallback decorator and _ANON_BUCKETS class-level cache on Client
  • Client.fs honors the cache so cached buckets are created anonymously from the start
  • Decorates open_object and get_file_info on the base Client and on GCSClient (which overrides both)
  • Only marks the bucket as anon-needed after the retry actually succeeds; explicit anon=True skips the retry

@ilongin ilongin linked an issue May 26, 2026 that may be closed by this pull request
@ilongin ilongin marked this pull request as draft May 26, 2026 01:17
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 26, 2026

Copy link
Copy Markdown

Deploying datachain with  Cloudflare Pages  Cloudflare Pages

Latest commit: 572e755
Status: ✅  Deploy successful!
Preview URL: https://724da94b.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1778-annon-propagati.datachain-2g6.pages.dev

View logs

@codecov

codecov Bot commented May 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.54167% with 11 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/datachain/client/fsspec.py 88.17% 7 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

@ilongin ilongin marked this pull request as ready for review May 26, 2026 09:34
Comment thread tests/unit/test_client.py Outdated
Comment thread src/datachain/client/fsspec.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a per-call anonymous-access fallback to Client: when get_file_info/open_object hits PermissionError, the client transparently rebuilds its filesystem with anon=True, retries once, and on success caches the (protocol, bucket) pair so subsequent reads (including future fs construction) start anonymously. This is aimed at making public-bucket reads survive .save() boundaries (issue #1778) where session-scoped anon detection is lost in fresh processes.

Changes:

  • New _anon_fallback decorator and Client._ANON_BUCKETS class-level cache, with _bucket_needs_anon / _mark_bucket_anon helpers and Client.fs honoring the cache.
  • Decorates base Client.get_file_info / open_object and the GCS overrides of the same.
  • Adds unit tests covering retry success, retry failure, cached-bucket reuse, explicit anon=True skip, and open_object retry.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/datachain/client/fsspec.py Adds the _anon_fallback decorator, the _ANON_BUCKETS cache and helpers, decorates base read methods, and routes fs construction through the cache.
src/datachain/client/gcs.py Imports and applies _anon_fallback to the GCS overrides of get_file_info and open_object.
tests/unit/test_client.py Unit tests for the fallback behavior on a GCSClient, plus a fixture that clears the shared _ANON_BUCKETS cache.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/datachain/client/fsspec.py Outdated
Comment thread src/datachain/client/fsspec.py Outdated
Comment thread src/datachain/client/fsspec.py Outdated
Comment thread src/datachain/client/fsspec.py Outdated

@classmethod
def _bucket_needs_anon(cls, name: str) -> bool:
return (cls.protocol, name) in cls._ANON_BUCKETS

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can there be a distinction by prefix - a prefix inside a bucket allows anon access, other prefix doesn't?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, this can be edge case.
The problem is that implementing prefix based cache is much more complex than current one and has some open questions. Issues:

  • need to extract the path from every fsspec method call (different methods take path in different positions
  • need to pick a prefix granularity (per-segment? whole dir? configurable?) - any choice is a heuristic but maybe we can take latest directory
  • need longest-prefix matching on lookup
  • need two fs instances alive at once (one auth, one anon) or pay the cost of rebuilding on each call ~100-150 lines of code + new tests for a scenario most users don't hit (mixed-access bucket with scoped creds)

My suggestion is to just don't use cache when creds are explicitlty set and that's it.

Comment thread src/datachain/client/fsspec.py Outdated
Comment thread tests/unit/test_client.py Outdated
error: str | None = None


class AnonFallbackFS:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, way to complicated for the task we are trying to solve :( really don't like the complexity and amount of code just to deal with anon that is needed only for demos / examples :(

(don't have a better solution in mind though)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure what is the less complex approach. I started adding decorator on Client methods, honestly that maybe seem less complex than this "generic" approach where more code is needed....

Comment thread src/datachain/client/fsspec.py Outdated
# True = anon retry succeeded, use anon from the start.
# False = anon retry also failed, don't bother retrying again.
# Missing key = no decision yet.
_ANON_BUCKETS: ClassVar[dict[tuple[str, str], bool]] = {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a constant

I think _ANON_FALLBACK is also not a constant

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed _ANON_BUCKETS to _anon_buckets - it's a runtime-mutated dict so lowercase makes sense.

Kept _ANON_FALLBACK uppercase - it's set once per subclass at class definition and never reassigned, same pattern as CREDENTIAL_KEYS / FS_CLASS / PREFIX / protocol..

Comment thread tests/unit/test_client.py
)


def test_anon_fallback_no_error_no_retry(monkeypatch, _clear_anon_cache):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to use real FSs (existing mocks_, not monkey patches

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests use MagicMock because they need to control exactly when PermissionError is raised (first call fails, retry succeeds) and assert create_fs was called with anon=True. The cloud_server fixtures don't easily simulate per-call PermissionError

Added a separate integration test test_anon_fallback_proxy_integration in tests/func/test_storage_auto_anon.py that uses the real cloud_server fixture - proves the proxy correctly delegates ops to a real cloud emulator end-to-end. The actual anon-retry path stays in unit tests.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread src/datachain/client/fsspec.py
Comment thread src/datachain/client/fsspec.py Outdated
Comment thread src/datachain/client/fsspec.py
@ilongin ilongin requested a review from shcheklein June 1, 2026 13:24
Comment on lines +110 to +114
def __class__(self):
# Pretend to be the underlying fsspec class so isinstance checks
# (e.g. pyarrow's ``isinstance(fs, AbstractFileSystem)``) pass.
return type(self._inner_fs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this isinstance trick? feels hacky to pretend we're the underlying fs class

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we just return class name instead of the type itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trick is needed because external code (pyarrow, fsspec internals) does isinstance(fs, AbstractFileSystem) and we can't patch those checks. Subclassing isn't clean either - the concrete class is dynamic per-client (S3FileSystem / GCSFileSystem / ...).

On returning the class name instead of the type - __class__ has to be a class object; returning a string would break isinstance entirely.

Comment on lines +147 to +201
def _wrap_sync(self, name, attr):
def call(*args, **kwargs):
if name in ("open", "_open") and self._is_write_open(args, kwargs):
return attr(*args, **kwargs)
try:
return attr(*args, **kwargs)
except PermissionError:
if not self._can_retry():
raise
saved = self._inner_fs
self._swap_to_anon()
keep_anon = False
try:
result = getattr(self._inner_fs, name)(*args, **kwargs)
keep_anon = True
if self._should_use_anon_cache():
self._client._mark_bucket_anon(self._client.name, True)
return result
except PermissionError:
if self._should_use_anon_cache():
self._client._mark_bucket_anon(self._client.name, False)
raise
finally:
if not keep_anon:
self._inner_fs = saved

return call

def _wrap_async(self, name, attr):
async def call(*args, **kwargs):
if name in ("open", "_open") and self._is_write_open(args, kwargs):
return await attr(*args, **kwargs)
try:
return await attr(*args, **kwargs)
except PermissionError:
if not self._can_retry():
raise
saved = self._inner_fs
self._swap_to_anon()
keep_anon = False
try:
result = await getattr(self._inner_fs, name)(*args, **kwargs)
keep_anon = True
if self._should_use_anon_cache():
self._client._mark_bucket_anon(self._client.name, True)
return result
except PermissionError:
if self._should_use_anon_cache():
self._client._mark_bucket_anon(self._client.name, False)
raise
finally:
if not keep_anon:
self._inner_fs = saved

return call

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we abstract some logic? Most part looks similar in both sync and asynchronous one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplication is structural (Python won't let await be conditional, so sync and async need separate function bodies). Extracting helpers just moves the duplication into helpers without removing it, and the wrappers become indirection that hurts readability. I'd rather keep both inline and add a comment noting the deliberate parallel.

@ilongin ilongin requested a review from amritghimire June 8, 2026 14:16

@shcheklein shcheklein left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit too much for the task tbh.

What are the alternatives for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

anon does not propagate across .save() boundaries

4 participants