Stop BigQueryStreamingBufferEmptySensor reporting empty too early #69118

Open: anxkhn wants to merge 3 commits into apache:main from anxkhn:loop/airflow__001

anxkhn commented Jun 29, 2026

BigQuery's streamingBuffer table metadata is eventually consistent: for
several seconds after a streaming insert the rows are physically in the buffer
but the metadata still reads absent. BigQueryStreamingBufferEmptySensor
(added in #66652) decided "buffer empty" from a single absent reading, so during
that window it reported empty too early and a downstream DML task (UPDATE /
DELETE / MERGE) hit the very "... would affect rows in the streaming buffer"
error the sensor exists to prevent.

This makes "empty" trustworthy by requiring empty_confirmations consecutive
empty readings (default 2), each one poke_interval apart, before the sensor
succeeds. A non-empty reading resets the counter. The same rule is applied in
both the sync poke() path and the deferrable BigQueryStreamingBufferEmptyTrigger.
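
The counting rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name `ConsecutiveEmptyCounter` and the method `observe` are hypothetical, and only the `empty_confirmations` parameter name and its `>= 1` validation come from the description.

```python
class ConsecutiveEmptyCounter:
    """Hypothetical helper illustrating the rule; not the provider's implementation."""

    def __init__(self, empty_confirmations: int = 2) -> None:
        # The description states the value is validated to be >= 1.
        if empty_confirmations < 1:
            raise ValueError("empty_confirmations must be >= 1")
        self.empty_confirmations = empty_confirmations
        self._consecutive_empty = 0

    def observe(self, buffer_is_empty: bool) -> bool:
        """Record one reading; return True once enough consecutive empties were seen."""
        if buffer_is_empty:
            self._consecutive_empty += 1
        else:
            # A non-empty reading resets the counter.
            self._consecutive_empty = 0
        return self._consecutive_empty >= self.empty_confirmations


# One call per poke: empty, non-empty, empty, empty.
counter = ConsecutiveEmptyCounter(empty_confirmations=2)
results = [counter.observe(reading) for reading in (True, False, True, True)]
# Only the final reading completes two consecutive empties.
```

Keeping the rule in one small stateful object is one way the sync path and the trigger could share identical behavior; whether the PR factors it this way is not stated.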

Why consecutive confirmations instead of a non-empty -> empty transition check
(as discussed in the issue thread): a transition check polls until timeout when
the buffer flushes between two polls, and never fires for a table that is
genuinely empty (no prior streaming insert). A bounded consecutive-empty count
spans the consistency window while still terminating in both of those cases.
The default of 2 forces one full poke_interval to elapse between the two
empty readings, comfortably covering the ~10-12s metadata-lag window, while
remaining configurable (validated >= 1).
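
To make the comparison concrete, here is a small simulation of both strategies over a fixed sequence of readings (`True` meaning the metadata reads empty). The function names are hypothetical and the reading sequences are invented for illustration; nothing here is taken from the PR's code.

```python
from typing import Optional, Sequence


def first_success_transition(readings: Sequence[bool]) -> Optional[int]:
    """Succeed on the first non-empty -> empty transition (1-based poll number)."""
    seen_non_empty = False
    for poll, is_empty in enumerate(readings, start=1):
        if not is_empty:
            seen_non_empty = True
        elif seen_non_empty:
            return poll
    return None  # never fired within the observed polls


def first_success_consecutive(
    readings: Sequence[bool], empty_confirmations: int = 2
) -> Optional[int]:
    """Succeed after `empty_confirmations` consecutive empty readings."""
    streak = 0
    for poll, is_empty in enumerate(readings, start=1):
        streak = streak + 1 if is_empty else 0
        if streak >= empty_confirmations:
            return poll
    return None


# Genuinely empty table, or a buffer that flushed between two polls:
# every reading is empty, so no transition is ever observed.
always_empty = [True] * 5
# Metadata lag: the first reading is a stale "empty", then the buffer shows up.
lagged = [True, False, False, True, True]

transition_on_empty = first_success_transition(always_empty)    # None
consecutive_on_empty = first_success_consecutive(always_empty)  # 2
single_on_lagged = first_success_consecutive(lagged, 1)         # 1 (false pass)
consecutive_on_lagged = first_success_consecutive(lagged)       # 5
```

Under this model, n confirmations span (n - 1) × poke_interval of wall-clock time, so the rule only covers the lag window when poke_interval is longer than the lag.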

The obsolete metadata-wait workaround in the system test
(example_bigquery_streaming_buffer_sensor.py) is removed, as #66963 requested
once the sensor handles this itself.

closes: #66963


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: OpenCode (Claude Opus 4.8) following the guidelines

BigQuery's streamingBuffer table metadata is eventually consistent: for
several seconds after a streaming insert the rows are in the buffer but
the metadata still reads absent. Deciding "empty" from a single absent
reading therefore false-passes during that window, letting a downstream
DML task hit the very "affects rows in the streaming buffer" error the
sensor exists to prevent.

Require empty_confirmations consecutive empty readings (default 2), one
poll interval apart, before reporting empty. This spans the consistency
window while still terminating for a genuinely empty table or when the
buffer flushes between two polls, which a non-empty -> empty transition
check would not.
anxkhn requested a review from shahar1 as a code owner June 29, 2026 08:28
boring-cyborg Bot added the area:providers and provider:google labels Jun 29, 2026
boring-cyborg Bot commented Jun 29, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:

  • Mailing List: dev@airflow.apache.org
  • Slack: https://s.apache.org/airflow-slack

Labels

area:providers, provider:google

Development

Successfully merging this pull request may close these issues.

BigQueryStreamingBufferEmptySensor can falsely report an empty streaming buffer (metadata lag)
