Skip to content

[nexus] Don't re-fetch a disk after committing attach/detach (fixes #10695)#10706

Open
MavenRain wants to merge 1 commit into
oxidecomputer:mainfrom
MavenRain:fix-10695-attach-disk-atomicity
Open

[nexus] Don't re-fetch a disk after committing attach/detach (fixes #10695)#10706
MavenRain wants to merge 1 commit into
oxidecomputer:mainfrom
MavenRain:fix-10695-attach-disk-atomicity

Conversation

@MavenRain

Copy link
Copy Markdown

Fixes #10695.

Problem

Running many concurrent instance-creates (30 instances × 11 disks each on london) exhausts the qorb DB connection pool. Some instance-create sagas then get permanently stuck mid-unwind, leaving instances wedged in starting. The saga log shows a forward failure at attach_disk_output ("all claims are used") and a permanent undo failure at a local_storage_disk create-record undo ("disk cannot be deleted in state "attached"").

Root cause

DataStore::instance_attach_disk commits the disk→Attached transition via an atomic CTE and then does a second, non-transactional disk_get re-fetch (a fresh pool claim, acquired while still holding the CTE's connection) only to build the richer datastore::Disk return value. Under pool exhaustion that re-fetch fails after the attach has committed.

steno runs a forward action once and never runs the undo of a failed node, so a disk left Attached by an attach whose re-fetch failed has no detach scheduled. During unwind its create-record undo then hits project_delete_disk_no_auth (which only permits Detached/Faulted/Creating) and fails permanently . . . steno stops the unwind at the first undo failure (saga_stuck), wedging the instance. instance_detach_disk has the mirror problem and is used as a saga undo action, where a post-commit re-fetch failure turns a successful detach into a permanent undo failure.

Change

Make the committing CTE the last fallible operation in both methods:

  • instance_attach_disk / instance_detach_disk now return the model::Disk the CTE already produces (no post-commit re-fetch).
  • The two API callers (Nexus::instance_attach_disk / instance_detach_disk) do the datastore::Disk upgrade via disk_get, where a post-commit failure is recoverable (client retries; both ops are idempotent).
  • The instance-create saga discards the returned disk, so the re-fetch was pure waste there.

After this, the disk-attach/detach saga actions can only fail before they commit, so an overloaded instance-create fails cleanly and fully unwinds instead of leaking an Attached disk and wedging.

Scope / non-goals

This removes the specific latent atomicity defect. It does not make instance-create robust to sustained pool exhaustion in general (a detach undo can still fail to acquire a connection and stop an unwind) . . . that is an honest DB-unavailable failure better addressed at the pool/admission layer, and is out of scope here.

Testing

Existing datastore coverage in sled.rs exercises attach (success + attach-blocked-while-starting) and detach; those callers discard the return value and are unaffected by the type change. I was not able to build the full workspace in my environment; CI should validate compilation. A saga fault-injection test that fails the attach action after its DB commit and asserts a clean, leak-free unwind would be a good follow-up.

Related

An automated audit found the same commit-then-fallible shape in several other sagas (see the linked issue comment) . . . notably instance_create::sic_create_network_interface (orphaned NIC), snapshot_create::ssc_attach_disk_to_pantry, and vpc_create::svc_create_gateway. Those are left for follow-ups; the underlying invariant is: a saga forward action must not perform a fallible operation after committing state.

…xidecomputer#10695)

`DataStore::instance_attach_disk` and `DataStore::instance_detach_disk`
each ran an atomic CTE that commits the disk's state transition, and then
performed a second, non-transactional `disk_get` re-fetch, acquiring an
additional pool connection while still holding the first, purely to
return the richer `datastore::Disk` type.

Under connection-pool exhaustion that post-commit re-fetch fails with
"Backends exist, and appear online, but all claims are used" *after* the
attach/detach has already committed.  In a saga this is fatal: steno gives
a forward action a single attempt and never runs the undo of a *failed*
node, so a disk left `Attached` by an attach whose re-fetch failed has no
detach scheduled.  During unwind its create-record undo then fails
permanently ("disk cannot be deleted in state \"attached\""), which wedges
the instance-create unwind and leaves the instance stuck in `starting`
(oxidecomputer#10695).  `instance_detach_disk` has the mirror problem on the saga *undo*
path: a re-fetch failure after the detach commits turns a successful
detach into a permanent undo failure.

Make the committing CTE the last fallible operation in both methods by
returning the `model::Disk` the CTE already produces, and move the
`datastore::Disk` upgrade to the two API callers in `app/instance.rs`,
where a post-commit failure is recoverable (the client simply retries;
attach and detach are idempotent).  The instance-create saga discards the
returned disk entirely, so the re-fetch was pure waste on that path.

Signed-off-by: Onyeka Obi <softwareengineerasaservant@isurvivable.cv>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Instances stuck in starting after instance create overload

1 participant