Reuse the resumable core for TriggerDagRunOperator's durable wait by 1fanwang · Pull Request #68955 · apache/airflow

1fanwang · 2026-06-24T20:22:19Z

Stacked PR. Builds on #68936 (TriggerDagRunOperator durable) and #68952 (extract
resume_or_submit). The incremental change here is the last commit — the runner refactor;
the earlier commits are the two parent PRs and will drop out of the diff once they merge.

Why

#68936 added a crash-safe durable wait to TriggerDagRunOperator, but it hand-rolled the
persist-and-reconnect logic in the task runner — a duplicate of what ResumableJobMixin already
implements. #68952 lifted that logic into a reusable resume_or_submit core. This PR makes the
runner consume that core, so the durability primitive has one implementation instead of two.

What

_handle_trigger_dag_run's durable path now drives resume_or_submit through four runner callbacks:
submit (send TriggerDagRun, raising on DagRunAlreadyExists), get_status (GetDagRunState),
poll (the wait loop, raising on a failed state so the retry policy still fires), and get_result
(the run-id XCom). The hand-rolled _evaluate_prior_triggered_run and the inline persist-and-decide
logic are deleted. No behaviour change.

This shows that the #68952 extraction generalises: operators that fit ResumableJobMixin directly
(Spark, Livy) and one whose wait lives in the runner (TriggerDagRunOperator) now share the same
durability core.
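
For readers who have not seen #68952, the callback contract described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name resume_or_submit_sketch, its signature, the plain mapping used as a state store, and the status strings are all assumptions. Only the four callback names, the external_id_key value, and the "Reconnecting to existing job" log message come from the description. The real poll raises on a failed child run; the fake one here does not.

```python
from __future__ import annotations

import logging
from typing import Any, Callable, MutableMapping

log = logging.getLogger(__name__)

# Hypothetical status names; the real core's vocabulary may differ.
RUNNING, SUCCEEDED, FAILED = "running", "success", "failed"


def resume_or_submit_sketch(
    *,
    state: MutableMapping[str, str],
    external_id_key: str,
    submit: Callable[[], str],
    get_status: Callable[[str], str],
    poll: Callable[[str], None],
    get_result: Callable[[str], Any],
) -> Any:
    """Reconnect to a previously recorded job if there is one, else submit."""
    external_id = state.get(external_id_key)
    if external_id is not None:
        status = get_status(external_id)
        if status == SUCCEEDED:
            # The job finished while no worker was watching it.
            return get_result(external_id)
        if status == RUNNING:
            log.info("Reconnecting to existing job")
            poll(external_id)
            return get_result(external_id)
        # Any other status (assumed: failed) falls through to a new submit.
    external_id = submit()
    # Persist before polling so a crash during the wait leaves a record.
    state[external_id_key] = external_id
    poll(external_id)
    return get_result(external_id)


# Usage: a retry that finds a persisted run id reconnects and never resubmits.
submitted: list[str] = []
statuses = {"manual__demo": RUNNING}


def fake_submit() -> str:
    submitted.append("manual__new")
    return "manual__new"


result = resume_or_submit_sketch(
    state={"triggered_dag_run_id": "manual__demo"},
    external_id_key="triggered_dag_run_id",
    submit=fake_submit,
    get_status=lambda run_id: statuses.get(run_id, FAILED),
    poll=lambda run_id: statuses.__setitem__(run_id, SUCCEEDED),
    get_result=lambda run_id: run_id,
)
```

In this sketch the runner only supplies the four bindings; the decision about whether to reconnect, return early, or submit lives in one place.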

Tests

The full TriggerDagRunOperator runner suite passes unchanged — the durable reconnect tests now
exercise the resume_or_submit path, and the non-durable / deferrable / conflict tests are
untouched. That unchanged suite is the behaviour-preservation proof.

End-to-end (live, Breeze, on this PR's code)

Same crash scenario as #68936, re-run on the refactored code: a parent
TriggerDagRunOperator(durable=True, wait_for_completion=True) triggers a child Dag; the parent's
worker is SIGKILLed mid-wait; the scheduler retries. Attempt 2 reconnects through the shared
core — the log line is resume_or_submit's own "Reconnecting to existing job", which proves the
durable path now runs the framework primitive rather than the deleted hand-rolled logic.

Raw

attempt=2.log:
  "Reconnecting to existing job"  external_id_key=triggered_dag_run_id
     external_id=manual__2026-...T20:18:21Z  status=running

parent run: success   (trigger_child succeeded on try_number=2)
child dag runs since the trigger:
  manual__2026-...T20:18:21Z  success   <- single run, reconnected
  => 1 run from one parent task, no duplicate

Risk

Behaviour-preserving refactor. The durable path's persist + three-state reconnect is now the shared
core (covered by #68952's tests); the runner only supplies the bindings.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

Related, from the dev-list thread "Let TriggerDagRunOperator own its execution logic" — three ways to make the synchronous wait crash-safe:

Add durable option to TriggerDagRunOperator to reconnect on retry #68936: durability hand-rolled in the task runner (the original).
Make ResumableJobMixin's reconnect core reusable outside execute() #68952 / Reuse the resumable core for TriggerDagRunOperator's durable wait #68955: pull ResumableJobMixin's core out so the runner drives the wait through one shared contract; works on 2.11 and 3.0-3.3.
[POC] Let TriggerDagRunOperator own its execution via a new accessor #69135: a ti.trigger_dag_run() accessor so the operator owns its execution and uses the mixin directly; needs 3.3+ and falls back on older versions.

Still open: which way to go, and whether to revisit a min_version floor for providers.

With wait_for_completion the trigger-and-wait runs in the task runner. A worker crash while polling makes the retry recompute a fresh run_id and trigger a duplicate child run (or fail with DagRunAlreadyExists), even though the run the first attempt started is healthy and still running. The opt-in durable flag persists the triggered run_id to task_state_store before polling, so the retry reconnects to the in-flight run instead of resubmitting.
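
The failure mode and the fix can be illustrated with a small simulation. Everything here is a stand-in and an assumption made for illustration: a set plays the role of the child Dag's runs, a dict plays the role of task_state_store, and attempt, WorkerCrashed, and run_with_one_crash are hypothetical names, not Airflow APIs. The DagRunAlreadyExists variant of the failure is left out for brevity.

```python
from __future__ import annotations

import itertools
from typing import Iterator


class WorkerCrashed(Exception):
    """Stand-in for a worker being killed while it polls the child run."""


def attempt(
    child_runs: set[str],
    store: dict[str, str],
    run_ids: Iterator[int],
    *,
    durable: bool,
    crash: bool,
) -> str:
    """One task attempt: trigger (or reconnect to) a child run, then 'wait'."""
    run_id = store.get("triggered_dag_run_id") if durable else None
    if run_id is None:
        # Without a persisted id, every attempt computes a fresh run_id.
        run_id = f"manual__{next(run_ids)}"
        child_runs.add(run_id)
        if durable:
            # Persist before polling so the retry can find the run.
            store["triggered_dag_run_id"] = run_id
    if crash:
        raise WorkerCrashed(run_id)
    return run_id


def run_with_one_crash(*, durable: bool) -> set[str]:
    """Crash on the first attempt, succeed on the retry, return child runs."""
    child_runs: set[str] = set()
    store: dict[str, str] = {}
    run_ids = itertools.count(1)
    try:
        attempt(child_runs, store, run_ids, durable=durable, crash=True)
    except WorkerCrashed:
        pass  # the scheduler would retry the task here
    attempt(child_runs, store, run_ids, durable=durable, crash=False)
    return child_runs


non_durable_runs = run_with_one_crash(durable=False)
durable_runs = run_with_one_crash(durable=True)
```

Under these assumptions the non-durable path ends with two child runs and the durable path with one, which mirrors the "1 run from one parent task, no duplicate" outcome in the end-to-end log.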

TriggerDagRunOperator's durable wait lives in the task runner (it raises DagRunTriggerException and is polled there), not in execute(), so it cannot use ResumableJobMixin and re-implements the persist-and-reconnect logic by hand. Lift the mixin's core into a standalone resume_or_submit() so the same implementation can be driven from the runner now and the triggerer later, instead of duplicated per integration point.

The durable synchronous wait previously hand-rolled the persist-and-reconnect logic in the task runner, duplicating what ResumableJobMixin already implements. Drive the shared resume_or_submit core with runner callbacks instead, so the durability primitive has one implementation across the operator (mixin) and the runner.

potiuk · 2026-06-25T10:26:36Z

@1fanwang — the check-newsfragment-pr-number check is failing: the newsfragment file needs to match this PR's number. Rename it to airflow-core/newsfragments/68955.<type>.rst (matching the PR number) and push.

See the PR quality criteria.

Automated first-pass triage note drafted by an AI-assisted tool; it may get things wrong. Once addressed, a real Apache Airflow maintainer takes the next look.

Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting

1fanwang added 4 commits June 24, 2026 11:43

Add newsfragment for TriggerDagRunOperator durable flag

ec3c791

Narrow task_state_store and coerce stored run id for mypy

c333423

Make the durable TriggerDagRunOperator newsfragment a single line

001febf

1fanwang requested review from amoghrajesh, ashb and kaxil as code owners June 24, 2026 20:22

boring-cyborg Bot added area:providers area:task-sdk provider:standard labels Jun 24, 2026

1fanwang added 2 commits June 24, 2026 14:01

1fanwang force-pushed the trigger-uses-core branch from 806c55a to 3f5c29c Compare June 24, 2026 21:02

This was referenced Jun 29, 2026

[POC] Let TriggerDagRunOperator own its execution via a new accessor #69135

Open

Add durable option to TriggerDagRunOperator to reconnect on retry #68936

Open

Make ResumableJobMixin's reconnect core reusable outside execute() #68952

Open

Remove newsfragment from stacked TriggerDagRunOperator PR

7551f45

The newsfragment number (68936) does not match this PR (68955), so check-newsfragment-pr-number fails. The durable-flag note belongs with the PR that lands the feature, not this stacked change.

Reuse the resumable core for TriggerDagRunOperator's durable wait #68955
1fanwang wants to merge 7 commits into apache:main from 1fanwang:trigger-uses-core
