Stop BigQueryStreamingBufferEmptySensor reporting empty too early #69118
Open
anxkhn wants to merge 3 commits into
BigQuery's streamingBuffer table metadata is eventually consistent: for several seconds after a streaming insert the rows are in the buffer but the metadata still reads absent. Deciding "empty" from a single absent reading therefore false-passes during that window, letting a downstream DML task hit the very "affects rows in the streaming buffer" error the sensor exists to prevent. Require empty_confirmations consecutive empty readings (default 2), one poll interval apart, before reporting empty. This spans the consistency window while still terminating for a genuinely empty table or when the buffer flushes between two polls, which a non-empty -> empty transition check would not.
BigQuery's streamingBuffer table metadata is eventually consistent: for several seconds after a streaming insert the rows are physically in the buffer but the metadata still reads absent. BigQueryStreamingBufferEmptySensor (added in #66652) decided "buffer empty" from a single absent reading, so during that window it reported empty too early and a downstream DML task (UPDATE / DELETE / MERGE) hit the very "... would affect rows in the streaming buffer" error the sensor exists to prevent.
This makes "empty" trustworthy by requiring empty_confirmations consecutive empty readings (default 2), each one poke_interval apart, before the sensor succeeds. A non-empty reading resets the counter. The same rule is applied in both the sync poke() path and the deferrable BigQueryStreamingBufferEmptyTrigger.

Why consecutive confirmations instead of a non-empty -> empty transition check (as discussed in the issue thread): a transition check polls until timeout when the buffer flushes between two polls, and never fires for a table that is genuinely empty (no prior streaming insert). A bounded consecutive-empty count spans the consistency window while still terminating in both of those cases.
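
The consecutive-confirmation rule can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the class EmptyConfirmationCounter and the helper polls_until_empty are hypothetical names invented for illustration, and each boolean in the simulated readings stands in for whether one poll saw the streamingBuffer metadata.

```python
from typing import Iterable, Optional


class EmptyConfirmationCounter:
    """Hypothetical helper (not the PR's code): counts consecutive empty polls."""

    def __init__(self, empty_confirmations: int = 2) -> None:
        # Mirrors the ">= 1" validation mentioned in the description.
        if empty_confirmations < 1:
            raise ValueError("empty_confirmations must be >= 1")
        self.empty_confirmations = empty_confirmations
        self._consecutive_empty = 0

    def observe(self, buffer_present: bool) -> bool:
        """Record one poll; return True once enough consecutive empties are seen."""
        if buffer_present:
            # A non-empty reading resets the counter.
            self._consecutive_empty = 0
            return False
        self._consecutive_empty += 1
        return self._consecutive_empty >= self.empty_confirmations


def polls_until_empty(
    readings: Iterable[bool], empty_confirmations: int = 2
) -> Optional[int]:
    """Return the 1-based poll at which 'empty' is reported, or None if never."""
    counter = EmptyConfirmationCounter(empty_confirmations)
    for poll_number, buffer_present in enumerate(readings, start=1):
        if counter.observe(buffer_present):
            return poll_number
    return None


# Simulated polls: True means the streamingBuffer metadata was present.
stale_then_real = [False, True, False, False]  # first "absent" is a stale reading
genuinely_empty = [False, False]               # no prior streaming insert

print(polls_until_empty(stale_then_real))                         # 4
print(polls_until_empty(genuinely_empty))                         # 2
print(polls_until_empty(stale_then_real, empty_confirmations=1))  # 1
```

With empty_confirmations=1 the sketch reproduces the old single-reading behaviour and passes on the stale first poll. With the default of 2 it waits until the buffer has been seen absent on two polls in a row, and it still terminates after two polls for a table that never had a streaming insert.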
The default of 2 forces one full poke_interval to elapse between the two empty readings, comfortably covering the ~10-12s metadata-lag window, while remaining configurable (validated >= 1).

The obsolete metadata-wait workaround in the system test (example_bigquery_streaming_buffer_sensor.py) is removed, as #66963 requested once the sensor handles this itself.
closes: #66963
Was generative AI tooling used to co-author this PR?
Generated-by: OpenCode (Claude Opus 4.8) following the guidelines