Added check for anon on File read methods#1783
Conversation
Deploying datachain with
|
| Latest commit: |
572e755
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://724da94b.datachain-2g6.pages.dev |
| Branch Preview URL: | https://ilongin-1778-annon-propagati.datachain-2g6.pages.dev |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Adds a per-call anonymous-access fallback to Client: when get_file_info/open_object hits PermissionError, the client transparently rebuilds its filesystem with anon=True, retries once, and on success caches the (protocol, bucket) pair so subsequent reads (including future fs construction) start anonymously. This is aimed at making public-bucket reads survive .save() boundaries (issue #1778) where session-scoped anon detection is lost in fresh processes.
Changes:
- New
_anon_fallbackdecorator andClient._ANON_BUCKETSclass-level cache, with_bucket_needs_anon/_mark_bucket_anonhelpers andClient.fshonoring the cache. - Decorates base
Client.get_file_info/open_objectand the GCS overrides of the same. - Adds unit tests covering retry success, retry failure, cached-bucket reuse, explicit
anon=Trueskip, andopen_objectretry.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/datachain/client/fsspec.py | Adds the _anon_fallback decorator, the _ANON_BUCKETS cache and helpers, decorates base read methods, and routes fs construction through the cache. |
| src/datachain/client/gcs.py | Imports and applies _anon_fallback to the GCS overrides of get_file_info and open_object. |
| tests/unit/test_client.py | Unit tests for the fallback behavior on a GCSClient, plus a fixture that clears the shared _ANON_BUCKETS cache. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| @classmethod | ||
| def _bucket_needs_anon(cls, name: str) -> bool: | ||
| return (cls.protocol, name) in cls._ANON_BUCKETS |
There was a problem hiding this comment.
can there be a distinction by prefix - a prefix inside a bucket allows anon access, other prefix doesn't?
There was a problem hiding this comment.
True, this can be edge case.
The problem is that implementing prefix based cache is much more complex than current one and has some open questions. Issues:
- need to extract the path from every fsspec method call (different methods take path in different positions
- need to pick a prefix granularity (per-segment? whole dir? configurable?) - any choice is a heuristic but maybe we can take latest directory
- need longest-prefix matching on lookup
- need two fs instances alive at once (one auth, one anon) or pay the cost of rebuilding on each call ~100-150 lines of code + new tests for a scenario most users don't hit (mixed-access bucket with scoped creds)
My suggestion is to just don't use cache when creds are explicitlty set and that's it.
| error: str | None = None | ||
|
|
||
|
|
||
| class AnonFallbackFS: |
There was a problem hiding this comment.
ah, way to complicated for the task we are trying to solve :( really don't like the complexity and amount of code just to deal with anon that is needed only for demos / examples :(
(don't have a better solution in mind though)
There was a problem hiding this comment.
I'm also not sure what is the less complex approach. I started adding decorator on Client methods, honestly that maybe seem less complex than this "generic" approach where more code is needed....
| # True = anon retry succeeded, use anon from the start. | ||
| # False = anon retry also failed, don't bother retrying again. | ||
| # Missing key = no decision yet. | ||
| _ANON_BUCKETS: ClassVar[dict[tuple[str, str], bool]] = {} |
There was a problem hiding this comment.
this is not a constant
I think _ANON_FALLBACK is also not a constant
There was a problem hiding this comment.
Renamed _ANON_BUCKETS to _anon_buckets - it's a runtime-mutated dict so lowercase makes sense.
Kept _ANON_FALLBACK uppercase - it's set once per subclass at class definition and never reassigned, same pattern as CREDENTIAL_KEYS / FS_CLASS / PREFIX / protocol..
| ) | ||
|
|
||
|
|
||
| def test_anon_fallback_no_error_no_retry(monkeypatch, _clear_anon_cache): |
There was a problem hiding this comment.
we want to use real FSs (existing mocks_, not monkey patches
There was a problem hiding this comment.
The tests use MagicMock because they need to control exactly when PermissionError is raised (first call fails, retry succeeds) and assert create_fs was called with anon=True. The cloud_server fixtures don't easily simulate per-call PermissionError
Added a separate integration test test_anon_fallback_proxy_integration in tests/func/test_storage_auto_anon.py that uses the real cloud_server fixture - proves the proxy correctly delegates ops to a real cloud emulator end-to-end. The actual anon-retry path stays in unit tests.
| def __class__(self): | ||
| # Pretend to be the underlying fsspec class so isinstance checks | ||
| # (e.g. pyarrow's ``isinstance(fs, AbstractFileSystem)``) pass. | ||
| return type(self._inner_fs) | ||
|
|
There was a problem hiding this comment.
do we need this isinstance trick? feels hacky to pretend we're the underlying fs class
There was a problem hiding this comment.
Also, can we just return class name instead of the type itself?
There was a problem hiding this comment.
The trick is needed because external code (pyarrow, fsspec internals) does isinstance(fs, AbstractFileSystem) and we can't patch those checks. Subclassing isn't clean either - the concrete class is dynamic per-client (S3FileSystem / GCSFileSystem / ...).
On returning the class name instead of the type - __class__ has to be a class object; returning a string would break isinstance entirely.
| def _wrap_sync(self, name, attr): | ||
| def call(*args, **kwargs): | ||
| if name in ("open", "_open") and self._is_write_open(args, kwargs): | ||
| return attr(*args, **kwargs) | ||
| try: | ||
| return attr(*args, **kwargs) | ||
| except PermissionError: | ||
| if not self._can_retry(): | ||
| raise | ||
| saved = self._inner_fs | ||
| self._swap_to_anon() | ||
| keep_anon = False | ||
| try: | ||
| result = getattr(self._inner_fs, name)(*args, **kwargs) | ||
| keep_anon = True | ||
| if self._should_use_anon_cache(): | ||
| self._client._mark_bucket_anon(self._client.name, True) | ||
| return result | ||
| except PermissionError: | ||
| if self._should_use_anon_cache(): | ||
| self._client._mark_bucket_anon(self._client.name, False) | ||
| raise | ||
| finally: | ||
| if not keep_anon: | ||
| self._inner_fs = saved | ||
|
|
||
| return call | ||
|
|
||
| def _wrap_async(self, name, attr): | ||
| async def call(*args, **kwargs): | ||
| if name in ("open", "_open") and self._is_write_open(args, kwargs): | ||
| return await attr(*args, **kwargs) | ||
| try: | ||
| return await attr(*args, **kwargs) | ||
| except PermissionError: | ||
| if not self._can_retry(): | ||
| raise | ||
| saved = self._inner_fs | ||
| self._swap_to_anon() | ||
| keep_anon = False | ||
| try: | ||
| result = await getattr(self._inner_fs, name)(*args, **kwargs) | ||
| keep_anon = True | ||
| if self._should_use_anon_cache(): | ||
| self._client._mark_bucket_anon(self._client.name, True) | ||
| return result | ||
| except PermissionError: | ||
| if self._should_use_anon_cache(): | ||
| self._client._mark_bucket_anon(self._client.name, False) | ||
| raise | ||
| finally: | ||
| if not keep_anon: | ||
| self._inner_fs = saved | ||
|
|
||
| return call |
There was a problem hiding this comment.
Can we abstract some logic? Most part looks similar in both sync and asynchronous one?
There was a problem hiding this comment.
The duplication is structural (Python won't let await be conditional, so sync and async need separate function bodies). Extracting helpers just moves the duplication into helpers without removing it, and the wrappers become indirection that hurts readability. I'd rather keep both inline and add a comment noting the deliberate parallel.
shcheklein
left a comment
There was a problem hiding this comment.
I think this is a bit too much for the task tbh.
What are the alternatives for this?
Closes #1778.
When a
Fileread (open_object/get_file_info) hitsPermissionErroron a cloud client that has credentials but no access to the target bucket, automatically retry withanon=Trueand cache the result per(protocol, bucket)so subsequent calls go straight to anonymous mode._anon_fallbackdecorator and_ANON_BUCKETSclass-level cache onClientClient.fshonors the cache so cached buckets are created anonymously from the startopen_objectandget_file_infoon the baseClientand onGCSClient(which overrides both)anon=Trueskips the retry