Skip to content

Add MGS API for getting Host Panic/Host Boot Failure messages#495

Open
jamesmunns wants to merge 11 commits into
mainfrom
james/host-fails
Open

Add MGS API for getting Host Panic/Host Boot Failure messages#495
jamesmunns wants to merge 11 commits into
mainfrom
james/host-fails

Conversation

@jamesmunns

@jamesmunns jamesmunns commented May 19, 2026

Copy link
Copy Markdown
Contributor

This is a component of oxidecomputer/hubris#2518 and is probably not totally ready for feedback yet.

This is a part of implementing oxidecomputer/hubris#2504.

@jamesmunns jamesmunns requested a review from hawkw May 19, 2026 14:44
@jamesmunns jamesmunns changed the title WIP: Add MGS API for getting Host Panic/Host Boot Failure messagesd WIP: Add MGS API for getting Host Panic/Host Boot Failure messages May 19, 2026
@jamesmunns jamesmunns mentioned this pull request Jun 1, 2026
@jamesmunns

Copy link
Copy Markdown
Contributor Author

Should be able to move this out of WIP tomorrow, progress update here: oxidecomputer/hubris#2504 (comment).

@jamesmunns jamesmunns changed the title WIP: Add MGS API for getting Host Panic/Host Boot Failure messages Add MGS API for getting Host Panic/Host Boot Failure messages Jul 2, 2026
@jamesmunns jamesmunns marked this pull request as ready for review July 2, 2026 11:47
@jamesmunns

Copy link
Copy Markdown
Contributor Author

This has been tested as working (with the pending hubris changes): oxidecomputer/hubris#2504 (comment), and should now be ready for review.

GetPmbusStatus(PowerRailName),

/// Request for Host Panic Payload
GetHostPanicPayload {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With apologies for how tedious this is: since we're adding to the protocol, can we bump the version and add tests that ensure we don't break compatibility in future changes? See #482 for a simple one, or #392 for a somewhat more involved one.

Also it looks like #497 should have had this also 😬. We could either PR a version bump + tests for that separately, then come back to this PR, or roll both changes into one version bump?

Comment thread faux-mgs/src/main.rs
)
.await?;

// TODO: If either of these change, it would mean that the host

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TODOs and having to choose a buffer length are signs that sp.get_host_panic_payload() is too low-level for what we'd like from the gateway-sp-comms interface. It's the common interface between faux-mgs and real MGS, and ideally those behave more-or-less identically. If we've got logic here that's specifically documented as "the control plane should do this differently", that's not true anymore.

Many of the gateway-sp-comms methods are simple 1-to-1 wrappers around MGS <-> SP messages, but there are several that wrap multiple messages to present a nicer interface to the control plane; e.g., get_component_details() sends multiple requests necessary to collect all the information (similar to this interface I think), or start_update(), which drives an entire update.

From reading the faux-mgs code here, I think the interface I'd propose is sp.get_host_panic_payload() -> Result<Vec<u8>, _> (no arguments, return the entire panic message), which handles picking a reasonable buffer size and embeds this while loop. If we go this route we don't want these assertions though - probably reasonable to return a specific error for "host panic'd while we were reading the panic message" and then let the caller retry?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable! I can push down the "get all" logic one layer so that gateway-sp-comms handles more of this autonomously.

Just checking for the return type:

  • Result<Option<Vec<u8>>, _> is cool, you don't want the "sequence number" to be surfaced up? (it is a rough indicator of "how many times has the system panicked/bootfailed since the last SP reboot")
  • You want Vec<u8>, instead of String? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, bootfail also has a u8 "failure reason", I'm guessing we do want to pass that up as well? (panic doesn't have this).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, sorry, definitely include more than just the message itself if there's more relevant information. I'd include both the sequence number and the failure reason (and any other bits if I missed those).

You want Vec, instead of String? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).

I had the same internal discussion before posting that. 😅 I could see reasonable arguments for:

  • Vec<u8> (fewer error cases to handle in gateway-sp-comms, always reports the exact data from the host)
  • String via from_utf8 (fails if we get corrupt / bad data, which seems not great)
  • String via from_utf8_lossy (doesn't fail, but is lossy)

I don't think I'd pick "String via from_utf8". If we really expect these to always be UTF8, I'd be fine with "String via from_utf8_lossy", since then at least we'll always get as much of a printable thing as there is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I am back to report that I am a liar! Panic and Bootfail data are in fact not strings, but instead a specific format, for which a parser is provided in ipcc-rs/ipcc-data, and was defined by RFD0316.

We don't already seem to depend on ipcc-rs here yet. I think there are probably two reasonable-ish options:

  • Just return the Vec<u8>, allow the control plane to depend on both gateway-sp-comms and ipcc-data.
  • Add a dep on ipcc-data, and return an ipcc_data::HSSPanic instead of a Vec<u8>.

I'd probably lean towards the former, just because it makes dep versioning less messy to defer integration, but it does still leave this API as "some assembly required".

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Vec<u8> seems fine - MGS can be "just a transport" in this case, and it's up to the control plane and host to coordinate on how to interpret it.

That said, we probably want to add ipcc-data as a dep to faux-mgs, so it can do similar interpretation? Otherwise it's stuck with just a blob?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it seems much more reasonable to add to faux-mgs than gateway-sp-comms, imo.

Comment thread faux-mgs/src/main.rs
// The SP can only store one host panic at a time, so there's
// no way to retrieve an older panic after it has been
// overwritten.
assert_eq!(seqno, res.seqno);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not super relevant given my suggestion above will get rid of this, but just in general: I don't think assert! is ever correct in this context - it should be reserved for programmer error / invariant violation, and I think we'd treat any report of a faux-mgs panic as a bug in faux-mgs.

This is much more of an operational error akin to an I/O error reading a file or somesuch. As noted above I think we don't need to necessarily go all the way to a retry-loop-with-a-cap; it's fine to return an error (again akin to bailing out on an I/O error partway through reading a file).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! It's definitely not something I'd do in a library, but it seemed exceptional enough to be acceptable for a binary tool. This will go away once it gets consumed by the lib API, but I will keep that in mind for the future!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, gotcha. I think some missing context here is that while this is a binary tool, in practice it has been wrapped up by other binary tools and is therefore used as a quasi library. This is kind of an unfortunate sequence, but does make the "probably shouldn't panic" more reasonable - I think faux-mgs is executed by other tools much, much more often than it's executed directly by a human.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants