[POC] Let TriggerDagRunOperator own its execution via a new accessor #69135
Draft
1fanwang wants to merge 2 commits into
Airflow already exposes the dag-run poll half to task code as ti.get_dagrun_state(); the trigger half had no first-class accessor and was only reachable through the DagRunTriggerException side channel. Adding the symmetric ti.trigger_dag_run() routes a trigger through the same execution-API endpoint and scoped token the task runner already uses, so an operator can own its trigger-and-wait execution directly.
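A minimal sketch of what owning trigger-and-wait through the two accessors could look like. ti.get_dagrun_state() is the existing accessor named above; the arguments of ti.trigger_dag_run() are not shown in this description, so the signature used here is an assumption. FakeTaskInstance and trigger_and_wait are hypothetical stand-ins so the sketch runs without Airflow.

```python
import time


class FakeTaskInstance:
    """Hypothetical stand-in for the task instance; not the Task SDK class."""

    def __init__(self, states):
        self._states = list(states)  # states returned by successive polls
        self.triggered = []  # (dag_id, run_id) pairs submitted so far

    def trigger_dag_run(self, dag_id, run_id):
        # Assumed signature: the real accessor's arguments may differ.
        self.triggered.append((dag_id, run_id))

    def get_dagrun_state(self, dag_id, run_id):
        # Return the next scripted state; repeat the last one once exhausted.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


def trigger_and_wait(ti, dag_id, run_id, poke_interval=0.0,
                     allowed=("success",), failed=("failed",)):
    """Submit the run, then poll until it reaches a terminal state."""
    ti.trigger_dag_run(dag_id, run_id)
    while True:
        state = ti.get_dagrun_state(dag_id, run_id)
        if state in failed:
            raise RuntimeError(f"{dag_id} run {run_id} ended in state {state}")
        if state in allowed:
            return state
        time.sleep(poke_interval)


ti = FakeTaskInstance(["queued", "running", "success"])
final_state = trigger_and_wait(ti, "child", "manual__1")
```

Both halves go through the same object, so the submit and the wait loop sit in one function instead of being split between the operator and the runner.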
132d681 to
35cc421
Compare
…ble reconnect
On Airflow 3, TriggerDagRunOperator.execute() raised DagRunTriggerException and the task runner did the trigger and the wait loop, so the synchronous wait-and-reconnect contract was duplicated between the operator and the runner and could drift. With the new ti.trigger_dag_run() accessor the operator does the submit and poll itself and reuses ResumableJobMixin directly, keeping that contract in one place.
The opt-in durable flag persists the triggered run id before polling so a worker crash mid-wait reconnects to the in-flight run on retry instead of triggering a duplicate. Deferrable still needs the triggerer handoff, so it keeps the exception path; Airflow < 3.3 and Airflow 2 are unchanged.
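The durable reconnect described in this commit message can be sketched roughly as follows. This is not the ResumableJobMixin implementation: state_store is a plain dict standing in for persisted task state, the wait loop is reduced to a single poll, and run_durable, FakeRunner, and WorkerCrash are hypothetical names used only for illustration.

```python
class WorkerCrash(Exception):
    """Simulates the worker process dying mid-wait."""


class FakeRunner:
    """Hypothetical backend that records triggers and scripts poll results."""

    def __init__(self):
        self.triggered = []
        self.crash_on_first_poll = True

    def trigger(self, run_id):
        self.triggered.append(run_id)

    def poll(self, run_id):
        if self.crash_on_first_poll:
            self.crash_on_first_poll = False
            raise WorkerCrash(run_id)
        return "success"


def run_durable(runner, state_store, new_run_id):
    """Trigger once, persist the run id before polling, reconnect on retry."""
    run_id = state_store.get("triggered_run_id")
    if run_id is None:
        run_id = new_run_id
        runner.trigger(run_id)
        # Persist before the wait so a crash after this point is recoverable.
        state_store["triggered_run_id"] = run_id
    return run_id, runner.poll(run_id)


runner = FakeRunner()
store = {}
try:
    run_durable(runner, store, "manual__attempt1")
except WorkerCrash:
    pass  # first attempt dies mid-wait; the store already holds the run id
result = run_durable(runner, store, "manual__attempt2")
```

On the second call the stored run id is found, so the retry polls the run from the first attempt and no second trigger is recorded.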
35cc421 to
ea1d794
Compare
POC for the [DISCUSS] Let TriggerDagRunOperator own its execution logic thread on the dev list, opened to help the discussion assess the accessor approach, not as a merge-ready change (see Back-compat & scope below).
Why
On Airflow 3, TriggerDagRunOperator.execute() raises DagRunTriggerException and the task runner does the trigger and the synchronous wait loop. So the trigger-and-wait contract (and any crash recovery) lives in two places, the operator and the runner, and can drift. The poll half is already a first-class Task SDK accessor (ti.get_dagrun_state()); only the trigger half was missing, which is why it had to go through the exception side channel.
What
- ti.trigger_dag_run(), the counterpart to ti.get_dagrun_state(). It hits the same execution-API endpoint and scoped token the runner already uses, so no new authz surface.
- TriggerDagRunOperator does the submit and poll itself and subclasses ResumableJobMixin directly, so the durability contract lives in one place.
- durable flag (default False): on a synchronous wait_for_completion, the triggered run id is persisted before polling, so a worker crash mid-wait reconnects to the in-flight run on retry instead of triggering a duplicate.
Deferrable keeps the DagRunTriggerException path (it still needs the triggerer handoff), and Airflow < 3.3 / Airflow 2 fall back to that path too.
Tests
E2E — live, Airflow 3.4 standalone (LocalExecutor + Postgres)
Two DAGs: a parent TriggerDagRunOperator(wait_for_completion=True, durable=True, poke_interval=3) triggers a child whose only task sleeps 45s.
1. Clean path (no crash): proves the operator owns its execution via the accessors; the runner's DagRunTriggerException path is never taken.
Parent task log shows the operator's own poll loop (trigger_dagrun.py), and the side channel is unused:
2. Worker SIGKILL mid-wait (durable reconnect): the parent's worker process is kill -9ed while it polls; the scheduler retries the task.
SIGKILL the parent worker pid
On the retry the operator reconnects to the same run id instead of triggering a duplicate:
Raw state progression — crash run
Back-compat & scope
TriggerDagRunOperator is in the standard provider, which still supports Airflow 2.11 and 3.0–3.3; the accessor and ResumableJobMixin are 3.3+. So the operator owns its execution on 3.3+ and falls back to the existing DagRunTriggerException path on 3.0–3.2 and the Airflow-2 path on 2.11. That is net-new duplication while those versions are supported, and durability (which needs task_state_store) applies on 3.3+ only. This is the 3.3+ end-state, not a drop-in for the full supported range. The interim-vs-end-state trade-off (and a provider min_version policy) is under discussion on the dev list; the lower-duplication interim that keeps one runner path is #68952 / #68955 (share the mixin's contract rather than have the operator own execution).
Refs
This is the accessor-based approach discussed on the [DISCUSS] thread, one option alongside the hand-rolled-in-the-runner durability in #68936 and the share-the-core refactor in #68952 / #68955. Which direction to take is still under discussion.
Was generative AI tooling used to co-author this PR?