Add MGS API for getting Host Panic/Host Boot Failure messages by jamesmunns · Pull Request #495 · oxidecomputer/management-gateway-service

jamesmunns · 2026-05-19T14:44:12Z

This is a component of oxidecomputer/hubris#2518 ~~and is probably not totally ready for feedback yet~~.

This is a part of implementing oxidecomputer/hubris#2504.

Protocol side of oxidecomputer/hubris#2504 and oxidecomputer/hubris#2518

jamesmunns · 2026-07-01T18:04:44Z

Should be able to move this out of WIP tomorrow, progress update here: oxidecomputer/hubris#2504 (comment).

jamesmunns · 2026-07-02T11:48:29Z

This has been tested as working (with the pending hubris changes): oxidecomputer/hubris#2504 (comment), and should now be ready for review.

jgallagher · 2026-07-02T14:17:05Z

    GetPmbusStatus(PowerRailName),
+
+    /// Request for Host Panic Payload
+    GetHostPanicPayload {


With apologies for how tedious this is: since we're adding to the protocol, can we bump the version and add tests that ensure we don't break compatibility in future changes? See #482 for a simple one, or #392 for a somewhat more involved one.

Also it looks like #497 should have had this also 😬. We could either PR a version bump + tests for that separately, then come back to this PR, or roll both changes into one version bump?

jgallagher · 2026-07-02T14:25:48Z

+                    )
+                    .await?;
+
+                // TODO: If either of these change, it would mean that the host


I think the TODOs and having to choose a buffer length are signs that sp.get_host_panic_payload() is too low-level for what we'd like from the gateway-sp-comms interface. It's the common interface between faux-mgs and real MGS, and ideally those behave more-or-less identically. If we've got logic here that's specifically documented as "the control plane should do this differently", that's not true anymore.

Many of the gateway-sp-comms methods are simple 1-to-1 wrappers around MGS <-> SP messages, but there are several that wrap multiple messages to present a nicer interface to the control plane; e.g., get_component_details() sends multiple requests necessary to collect all the information (similar to this interface I think), or start_update(), which drives an entire update.

From reading the faux-mgs code here, I think the interface I'd propose is sp.get_host_panic_payload() -> Result<Vec<u8>, _> (no arguments, return the entire panic message), which handles picking a reasonable buffer size and embeds this while loop. If we go this route we don't want these assertions though - probably reasonable to return a specific error for "host panic'd while we were reading the panic message" and then let the caller retry?

I think that's reasonable! I can push down the "get all" logic one layer so that gateway-sp-comms handles more of this autonomously.

Just checking for the return type:

Result<Option<Vec<u8>>, _> is cool, you don't want the "sequence number" to be surfaced up? (it is a rough indicator of "how many times has the system panicked/bootfailed since the last SP reboot")

You want Vec<u8>, instead of String? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).

Oh, bootfail also has a u8 "failure reason", I'm guessing we do want to pass that up as well? (panic doesn't have this).

Ah, yes, sorry, definitely include more than just the message itself if there's more relevant information. I'd include both the sequence number and the failure reason (and any other bits if I missed those).

You want Vec, instead of String? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).

I had the same internal discussion before posting that. 😅 I could see reasonable arguments for:

Vec<u8> (fewer error cases to handle in gateway-sp-comms, always reports the exact data from the host)

String via from_utf8 (fails if we get corrupt / bad data, which seems not great)

String via from_utf8_lossy (doesn't fail, but is lossy)

I don't think I'd pick "String via from_utf8". If we really expect these to always be UTF8, I'd be fine with "String via from_utf8_lossy", since then at least we'll always get as much of a printable thing as there is.

Okay, I am back to report that I am a liar! Panic and Bootfail data are in fact not strings, but instead a specific format, for which a parser is provided in ipcc-rs/ipcc-data, and was defined by RFD0316.

We don't already seem to depend on ipcc-rs here yet. I think there are probably two reasonable-ish options:

Just return the Vec<u8>, allow the control plane to depend on both gateway-sp-comms and ipcc-data.

Add a dep on ipcc-data, and return an ipcc_data::HSSPanic instead of a Vec<u8>.

I'd probably lean towards the former, just because it makes dep versioning less messy to defer integration, but it does still leave this API as "some assembly required".

Yeah, Vec<u8> seems fine - MGS can be "just a transport" in this case, and it's up to the control plane and host to coordinate on how to interpret it.

That said, we probably want to add ipcc-data as a dep to faux-mgs, so it can do similar interpretation? Otherwise it's stuck with just a blob?

Yup, it seems much more reasonable to add to faux-mgs than gateway-sp-comms, imo.

jgallagher · 2026-07-02T14:44:32Z

+                // The SP can only store one host panic at a time, so there's
+                // no way to retrieve an older panic after it has been
+                // overwritten.
+                assert_eq!(seqno, res.seqno);


This comment is not super relevant given my suggestion above will get rid of this, but just in general: I don't think assert! is ever correct in this context - it should be reserved for programmer error / invariant violation, and I think we'd treat any report of a faux-mgs panic as a bug in faux-mgs.

This is much more of an operational error akin to an I/O error reading a file or somesuch. As noted above I think we don't need to necessarily go all the way to a retry-loop-with-a-cap; it's fine to return an error (again akin to bailing out on an I/O error partway through reading a file).

Gotcha! It's definitely not something I'd do in a library, but it seemed exceptional enough to be acceptable for a binary tool. This will go away once it gets consumed by the lib API, but I will keep that in mind for the future!

Yeah, gotcha. I think some missing context here is that while this is a binary tool, in practice it has been wrapped up by other binary tools and is therefore used as a quasi library. This is kind of an unfortunate sequence, but does make the "probably shouldn't panic" more reasonable - I think faux-mgs is executed by other tools much, much more often than it's executed directly by a human.

jamesmunns added 4 commits May 19, 2026 12:43

Add host failure protocol messages

0c30aab

Protocol side of oxidecomputer/hubris#2504 and oxidecomputer/hubris#2518

Make it pub

086e518

Make some fields pub, reorg

470fcde

Update error variants to match IPC errors

7b4d6b5

jamesmunns requested a review from hawkw May 19, 2026 14:44

jamesmunns mentioned this pull request May 19, 2026

Obtain Host Panic/Boot Failure Messages oxidecomputer/hubris#2518

Open

jamesmunns changed the title ~~WIP: Add MGS API for getting Host Panic/Host Boot Failure messagesd~~ WIP: Add MGS API for getting Host Panic/Host Boot Failure messages May 19, 2026

jamesmunns mentioned this pull request Jun 1, 2026

Add NullStr type #496

Merged

jamesmunns added 4 commits June 26, 2026 15:47

Merge branch 'main' into james/host-fails

9954c9d

Adjust host-panic messages

4dfe88d

Update for bootfail

574f952

Correct pagination logic

0646bfc

jamesmunns mentioned this pull request Jul 1, 2026

control-plane-agent: Add interface(s) for retrieving Host Panic/Boot Failure messages oxidecomputer/hubris#2504

Open

Add some missing docs, index -> seqno

2f8d7dd

jamesmunns changed the title ~~WIP: Add MGS API for getting Host Panic/Host Boot Failure messages~~ Add MGS API for getting Host Panic/Host Boot Failure messages Jul 2, 2026

jamesmunns marked this pull request as ready for review July 2, 2026 11:47

jamesmunns added 2 commits July 2, 2026 13:59

Fix some faux-mgs TODOs/debugging

b70d3a5

One more index -> seqno

b93969b

jgallagher reviewed Jul 2, 2026

View reviewed changes

Uh oh!

Conversation

jamesmunns commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jamesmunns commented Jul 1, 2026

Uh oh!

jamesmunns commented Jul 2, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jamesmunns commented May 19, 2026 •

edited

Loading