feat(data-collection): create DataCollection option in client #6702

Open
ericapisani wants to merge 16 commits into master from ep/db-spec-experiement-foundation-dict

8b9dedb
@sentry/warden / warden: find-bugs completed Jul 2, 2026 in 0s

8 issues

find-bugs: Found 8 issues (1 medium, 7 low)

Medium

`_http_headers_from_value` silently accepts string values, causing either silent misconfiguration or a crash - `tests/test_data_collection.py:34-37`

Passing a string like `"off"` as the `http_headers` value silently falls back to deny-list header collection instead of raising an error, because `_http_headers_from_value` uses Python's `in` operator on the raw value, which performs substring membership on strings rather than dict-key lookup. Any string that contains `"request"` or `"response"` as a substring (e.g. `"response"` itself) will crash with `TypeError: string indices must be integers`.

Also found at:

  • sentry_sdk/_types.py:184-187
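
The two failure modes described above can be reproduced in isolation. The sketch below is a minimal stand-in, not the SDK's code: `http_headers_from_value` and its defaults are hypothetical and only mimic a parser that applies `in` and indexing to the raw value without checking its type.

```python
# Hypothetical defaults, chosen only for illustration.
DEFAULTS = {"request": "deny_list", "response": "deny_list"}


def http_headers_from_value(val):
    """Hypothetical stand-in for an unguarded per-direction parser."""
    resolved = dict(DEFAULTS)
    for side in ("request", "response"):
        # On a dict this is a key lookup; on a str it is a substring test.
        if side in val:
            # On a str this is str[str], which raises TypeError.
            resolved[side] = val[side]
    return resolved


# A dict behaves as intended.
print(http_headers_from_value({"request": "off"}))

# "off" contains neither substring, so the defaults are returned silently.
print(http_headers_from_value("off"))

# "response" contains itself as a substring, so indexing is attempted.
try:
    http_headers_from_value("response")
except TypeError as exc:
    # The exact message wording varies between Python versions.
    print("TypeError:", exc)
```

An `isinstance(val, dict)` check before the loop would turn both cases into one explicit configuration error.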

Low

README example meant to reduce PII silently enables gen_ai data collection - `README.md:52-55`

The README example comment `data_collection={"user_info": False, "http_bodies": []}`, described as disabling "sending user data and HTTP request/response bodies," actually shifts several categories from their safe implicit defaults to the more permissive explicit spec defaults. In particular, providing any explicit `data_collection` dict routes resolution through `_resolve_explicit`, where `gen_ai` defaults to `{"inputs": True, "outputs": True}` and `cookies`/`query_params` default to `deny_list` (collect-with-filter) rather than `off`. So a user who uncomments this line to protect PII would, with a gen_ai integration in use, begin sending AI inputs/outputs to Sentry, which is the opposite of the comment's intent. This is a documentation/UX footgun; the underlying explicit-vs-implicit default divergence is intentional and spec-driven.

`_kvcb_from_value` silently splits a string `terms` value into individual characters - `sentry_sdk/client.py:25`

In `sentry_sdk/data_collection.py` line 182, `behaviour["terms"] = list(terms)` has no type guard: passing `"terms": "session_id"` yields `['s','e','s','s','i','o','n','_','i','d']`, making the deny/allow list completely non-functional with no error raised.

Also found at:

  • sentry_sdk/client.py:26
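
The character-splitting behaviour is plain Python and easy to confirm. The sketch below also shows one possible guard; `normalize_terms` is a hypothetical helper name, not something taken from the PR.

```python
# list() iterates a string character by character.
print(list("session_id"))


def normalize_terms(terms):
    """Hypothetical guard: reject a bare string before copying into a list."""
    if isinstance(terms, (str, bytes)):
        raise TypeError(
            "terms must be a list or tuple of strings, "
            f"got a single {type(terms).__name__}: {terms!r}"
        )
    return list(terms)


print(normalize_terms(["session_id", "csrf_token"]))

try:
    normalize_terms("session_id")
except TypeError as exc:
    print("rejected:", exc)
```

Whether a bare string should be rejected or wrapped into a one-element list is a design choice for the PR author; rejecting keeps misconfiguration visible.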

Docstring claims `graphql.document` stays `True` regardless of `send_default_pii`, but the implementation gates it on `send_default_pii` - `sentry_sdk/client.py:25`

In `sentry_sdk/data_collection.py`, the `_map_from_send_default_pii` docstring (line 93) states that `graphql.document` stays `True` while only `graphql.variables` and `database.query_params` follow `send_default_pii`. However, line 110 sets `"graphql": {"document": send_default_pii, "variables": send_default_pii}`, so with `send_default_pii=False` the document is also suppressed. This directly contradicts the docstring and will mislead callers about the actual behavior.

Also found at:

  • sentry_sdk/data_collection.py:110

Deprecation warning for `send_default_pii` uses wrong stacklevel and points to SDK internals - `sentry_sdk/client.py:77`

In `data_collection.py`'s `resolve_data_collection`, the `warnings.warn(..., stacklevel=2)` for the deprecated `send_default_pii` option points to `client.py:_get_options` (the direct caller), not the user's `sentry_sdk.init()` call. As a result the `DeprecationWarning` appears to originate from internal SDK code rather than the developer's own code, defeating the purpose of the warning. Tracing the call chain, user code is reached at `stacklevel=5`.

Also found at:

  • sentry_sdk/client.py:353
  • sentry_sdk/data_collection.py:252
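
How `stacklevel` selects the reported frame can be seen with a toy call chain. The functions below are hypothetical stand-ins that share names with the ones mentioned above but are not the SDK's code; the chain is shorter than the real one, so user code sits at `stacklevel=4` here rather than 5.

```python
import warnings


def resolve(stacklevel):
    warnings.warn(
        "send_default_pii is deprecated", DeprecationWarning, stacklevel=stacklevel
    )


def get_options(stacklevel):
    resolve(stacklevel)  # stacklevel=2 is attributed to this line


def init(stacklevel):
    get_options(stacklevel)  # stacklevel=3 is attributed to this line


def reported_line(stacklevel):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        init(stacklevel)  # stands in for the user's init() call (stacklevel=4)
    return caught[0].lineno


# stacklevel=2 reports the line inside get_options, an "internal" frame.
print("stacklevel=2 ->", reported_line(2))
# stacklevel=4 reports the line that called init(), i.e. the "user" frame.
print("stacklevel=4 ->", reported_line(4))
```

A hard-coded depth breaks whenever the internal call chain changes, so any fix along these lines needs a test that pins the reported location.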

`BaseClient.data_collection` returns a shared mutable dict by reference - `sentry_sdk/client.py:439-441`

The property returns `_DISABLED_DATA_COLLECTION_CONFIG` directly, a module-level mutable dict. Any caller that writes to the returned dict (e.g. `client.data_collection["user_info"] = True`) will silently corrupt global state for every subsequent `BaseClient`/`NonRecordingClient` instance.

`_kvcb_from_value` raises confusing `AttributeError` on non-dict nested collection values instead of a clear validation error - `sentry_sdk/data_collection.py:186`

`_resolve_explicit` validates that the top-level `data_collection` option is a dict (raising a clear `TypeError` in `resolve_data_collection`) and validates collection modes (raising a clear `ValueError` in `_kvcb_from_value`). However, the per-field parsers assume nested values are dicts and never guard against other types. `_kvcb_from_value` calls `val.get("mode", "deny_list")` directly, so passing a non-dict such as `{"cookies": "off"}` or `{"http_headers": {"request": "off"}}` raises an opaque `AttributeError: 'str' object has no attribute 'get'` rather than a descriptive configuration error. This is an input-validation inconsistency, not a security or crash-in-production issue: it only affects misconfigured SDK options at init time. Note that `{"http_headers": "off"}` (a string in place of the whole dict) does NOT error: `_http_headers_from_value` uses `"request" in val`, which on a string performs a substring search that is `False`, so it silently falls back to defaults; this fallback is intentional and covered by `test_http_headers_collection_defaults`.
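
One way to get a descriptive error is a small type guard applied before any `.get` call. The helper below is a hypothetical sketch (`require_dict` is not a name from the PR), and the message format is only a suggestion.

```python
def require_dict(field, value):
    """Hypothetical guard: fail with the offending field name and value."""
    if not isinstance(value, dict):
        raise TypeError(
            f"data_collection[{field!r}] must be a dict, "
            f"got {type(value).__name__}: {value!r}"
        )
    return value


# Without a guard, the failure names neither the option nor the value.
try:
    "off".get("mode", "deny_list")
except AttributeError as exc:
    print("opaque:", exc)

# With the guard, the error points at the misconfigured field.
try:
    require_dict("cookies", "off").get("mode", "deny_list")
except TypeError as exc:
    print("descriptive:", exc)

print(require_dict("cookies", {"mode": "off"}).get("mode", "deny_list"))
```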

`resolve_data_collection` forwards `None` `include_local_variables`/`include_source_context` unnormalized, diverging from spotlight path - `sentry_sdk/scope.py:90`

In `resolve_data_collection` (`sentry_sdk/data_collection.py`), `include_local_variables = options.get("include_local_variables", True)` and `include_source_context = options.get("include_source_context", True)` only apply the `True` default when the key is absent. If a user calls `sentry_sdk.init(include_local_variables=None)` (legal per the `Optional[bool]` type), `None` is returned and forwarded unchanged to `_map_from_send_default_pii`/`_resolve_explicit`, producing `stack_frame_variables: None` in the resolved `DataCollection` dict, which violates the TypedDict's `stack_frame_variables: bool` contract. The spotlight re-derive path in `client.py` normalizes this with `is not False`, so identical init args yield inconsistent `data_collection` depending on spotlight mode. Fix by normalizing to booleans: `options.get("include_local_variables") is not False` (and similarly for `include_source_context`).
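
The difference between the two lookups shows up only when the key is present with value `None`. The sketch below uses plain dicts as stand-ins for the SDK options; the function names are hypothetical.

```python
def with_default(options, key):
    # dict.get applies the default only when the key is absent.
    return options.get(key, True)


def normalized(options, key):
    # Treats both a missing key and an explicit None as True.
    return options.get(key) is not False


for opts in ({}, {"include_local_variables": None},
             {"include_local_variables": False},
             {"include_local_variables": True}):
    print(
        opts,
        with_default(opts, "include_local_variables"),
        normalized(opts, "include_local_variables"),
    )
```

Only the explicit `None` row differs: the default-based lookup returns `None`, while the normalized form always returns a `bool`.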


⏱ 21m 5s · 5.8M in / 217.2k out · $8.12

Annotations

Check warning on line 37 in tests/test_data_collection.py

@sentry-warden sentry-warden / warden: find-bugs

`_http_headers_from_value` silently accepts string values, causing either silent misconfiguration or a crash

Passing a string like `"off"` as the `http_headers` value silently falls back to deny-list header collection instead of raising an error, because `_http_headers_from_value` uses Python's `in` operator on the raw value — which performs substring membership on strings rather than dict-key lookup. Any string that contains `"request"` or `"response"` as a substring (e.g. `"response"` itself) will crash with `TypeError: string indices must be integers`.

Check warning on line 187 in sentry_sdk/_types.py

@sentry-warden sentry-warden / warden: find-bugs

[YHH-3SM] `_http_headers_from_value` silently accepts string values, causing either silent misconfiguration or a crash (additional location)

Passing a string like `"off"` as the `http_headers` value silently falls back to deny-list header collection instead of raising an error, because `_http_headers_from_value` uses Python's `in` operator on the raw value — which performs substring membership on strings rather than dict-key lookup. Any string that contains `"request"` or `"response"` as a substring (e.g. `"response"` itself) will crash with `TypeError: string indices must be integers`.