
Reuse the resumable core for TriggerDagRunOperator's durable wait#68955

Open
1fanwang wants to merge 7 commits into
apache:mainfrom
1fanwang:trigger-uses-core

@1fanwang 1fanwang commented Jun 24, 2026

Contributor

Stacked PR. Builds on #68936 (TriggerDagRunOperator durable) and #68952 (extract
resume_or_submit). The incremental change here is the last commit — the runner refactor;
the earlier commits are the two parent PRs and will drop out of the diff once they merge.

Why

#68936 added a crash-safe durable wait to TriggerDagRunOperator, but it hand-rolled the
persist-and-reconnect logic in the task runner — a duplicate of what ResumableJobMixin already
implements. #68952 lifted that logic into a reusable resume_or_submit core. This PR makes the
runner consume that core, so the durability primitive has one implementation instead of two.

What

_handle_trigger_dag_run's durable path now drives resume_or_submit through runner callbacks —
submit (send TriggerDagRun, raising on DagRunAlreadyExists), get_status (GetDagRunState),
poll (the wait loop, raising on a failed state so the retry policy still fires), get_result
(the run-id XCom). The hand-rolled _evaluate_prior_triggered_run and the inline persist/decision
are deleted. No behaviour change.
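
As a rough illustration of the callback shape described above, the core can be pictured as a function that consults the persisted id and then decides whether to return, reconnect, or submit. This is a minimal sketch, not the signature from #68952: the function name `resume_or_submit_sketch`, the dict-backed `store`, the status strings, and the exact three-way decision are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical status values; the real core may use different names.
RUNNING, SUCCESS, FAILED = "running", "success", "failed"


def resume_or_submit_sketch(
    store: dict,
    key: str,
    submit: Callable[[], str],
    get_status: Callable[[str], Optional[str]],
    poll: Callable[[str], None],
    get_result: Callable[[str], str],
) -> str:
    """Illustrative stand-in for the shared core, not Airflow's actual API."""
    external_id = store.get(key)
    status = get_status(external_id) if external_id is not None else None

    if status == SUCCESS:
        # The prior attempt's job already finished: skip polling entirely.
        return get_result(external_id)
    if status != RUNNING:
        # No prior job, or it failed or vanished: submit a new one and
        # persist its id before polling so a crash cannot lose it.
        external_id = submit()
        store[key] = external_id
    # Either reconnecting to a running job or waiting on the fresh one.
    poll(external_id)
    return get_result(external_id)
```

Under this reading, the runner would only supply the four callables (for example, `submit` wrapping the TriggerDagRun request and `poll` wrapping the wait loop), while the decision logic stays in one place.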

This is the proof that the #68952 extraction generalises: operators that fit ResumableJobMixin
directly (Spark, Livy) and an operator whose wait lives in the runner (TriggerDagRun) now share
the same durability core.

Tests

The full TriggerDagRunOperator runner suite passes unchanged — the durable reconnect tests now
exercise the resume_or_submit path, and the non-durable / deferrable / conflict tests are
untouched. That unchanged suite is the behaviour-preservation proof.

End-to-end (live, Breeze, on this PR's code)

Same crash scenario as #68936, re-run on the refactored code: a parent
TriggerDagRunOperator(durable=True, wait_for_completion=True) triggers a child Dag; the parent's
worker is SIGKILLed mid-wait; the scheduler retries. Attempt 2 reconnects through the shared
core: the log line is resume_or_submit's own "Reconnecting to existing job", which proves the
durable path now runs the framework primitive rather than the deleted hand-rolled logic.

attempt=2.log:
  "Reconnecting to existing job"  external_id_key=triggered_dag_run_id
     external_id=manual__2026-...T20:18:21Z  status=running

parent run: success   (trigger_child succeeded on try_number=2)
child dag runs since the trigger:
  manual__2026-...T20:18:21Z  success   <- single run, reconnected
  => 1 run from one parent task, no duplicate

Risk

Behaviour-preserving refactor. The durable path's persist + three-state reconnect is now the shared
core (covered by #68952's tests); the runner only supplies the bindings.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)


Related, from the dev-list thread "Let TriggerDagRunOperator own its execution logic" — three ways to make the synchronous wait crash-safe:

Which way to go is still open, as is whether to revisit a min_version floor for providers.

1fanwang added 4 commits June 24, 2026 11:43
With wait_for_completion the trigger-and-wait runs in the task runner. A worker
crash while polling makes the retry recompute a fresh run_id and trigger a
duplicate child run (or fail with DagRunAlreadyExists), even though the run the
first attempt started is healthy and still running. The opt-in durable flag
persists the triggered run_id to task_state_store before polling, so the retry
reconnects to the in-flight run instead of resubmitting.
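
The failure mode in this commit message can be reproduced with a toy simulation. Everything below is a stand-in: `FakeChildDag`, `WorkerKilled`, `trigger_and_wait`, and the plain dict used as the state store are hypothetical names for illustration, and only the key `triggered_dag_run_id` is taken from the log excerpt in this PR.

```python
class WorkerKilled(Exception):
    """Stands in for a SIGKILL of the worker mid-wait."""


class FakeChildDag:
    """Toy stand-in for the scheduler's view of the child Dag's runs."""

    def __init__(self):
        self.runs = {}

    def trigger(self, run_id):
        if run_id in self.runs:
            # Placeholder for the real DagRunAlreadyExists error.
            raise RuntimeError("DagRunAlreadyExists")
        self.runs[run_id] = "running"


def trigger_and_wait(child, store, try_number, durable, crash_while_polling=False):
    run_id = store.get("triggered_dag_run_id") if durable else None
    if run_id is None:
        # Without a persisted id, each attempt computes a fresh run_id.
        run_id = f"manual__try{try_number}"
        child.trigger(run_id)
        if durable:
            # Persist before polling so a crash cannot lose the id.
            store["triggered_dag_run_id"] = run_id
    if crash_while_polling:
        raise WorkerKilled
    child.runs[run_id] = "success"  # the wait loop observes completion
    return run_id


def run_with_one_crash(durable):
    """Attempt 1 dies while polling, attempt 2 retries; return the child run count."""
    child, store = FakeChildDag(), {}
    try:
        trigger_and_wait(child, store, 1, durable, crash_while_polling=True)
    except WorkerKilled:
        pass
    trigger_and_wait(child, store, 2, durable)
    return len(child.runs)
```

In this toy model, `run_with_one_crash(durable=False)` leaves two child runs behind, while `run_with_one_crash(durable=True)` leaves one, which mirrors the "1 run from one parent task, no duplicate" result in the end-to-end log.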
1fanwang added 2 commits June 24, 2026 14:01
TriggerDagRunOperator's durable wait lives in the task runner (it raises
DagRunTriggerException and is polled there), not in execute(), so it cannot use
ResumableJobMixin and re-implements the persist-and-reconnect logic by hand. Lift the
mixin's core into a standalone resume_or_submit() so the same implementation can be
driven from the runner now and the triggerer later, instead of duplicated per
integration point.

The durable synchronous wait previously hand-rolled the persist-and-reconnect logic
in the task runner, duplicating what ResumableJobMixin already implements. Drive the
shared resume_or_submit core with runner callbacks instead, so the durability primitive
has one implementation across the operator (mixin) and the runner.
@1fanwang 1fanwang force-pushed the trigger-uses-core branch from 806c55a to 3f5c29c Compare June 24, 2026 21:02
potiuk commented Jun 25, 2026

Member

@1fanwang — the check-newsfragment-pr-number check is failing: the newsfragment file needs to match this PR's number. Rename it to airflow-core/newsfragments/68955.<type>.rst (matching the PR number) and push.

See the PR quality criteria.

Automated first-pass triage note drafted by an AI-assisted tool — may get things wrong; once addressed, a real Apache Airflow maintainer takes the next look. (why automated)


Drafted-by: Claude Code (Opus 4.8); reviewed by @potiuk before posting

The newsfragment number (68936) does not match this PR (68955), so
check-newsfragment-pr-number fails. The durable-flag note belongs with the
PR that lands the feature, not this stacked change.