Add MGS API for getting Host Panic/Host Boot Failure messages#495
Add MGS API for getting Host Panic/Host Boot Failure messages#495jamesmunns wants to merge 11 commits into
Conversation
|
Should be able to move this out of WIP tomorrow, progress update here: oxidecomputer/hubris#2504 (comment). |
|
This has been tested as working (with the pending |
| GetPmbusStatus(PowerRailName), | ||
|
|
||
| /// Request for Host Panic Payload | ||
| GetHostPanicPayload { |
There was a problem hiding this comment.
With apologies for how tedious this is: since we're adding to the protocol, can we bump the version and add tests that ensure we don't break compatibility in future changes? See #482 for a simple one, or #392 for a somewhat more involved one.
Also it looks like #497 should have had this also 😬. We could either PR a version bump + tests for that separately, then come back to this PR, or roll both changes into one version bump?
| ) | ||
| .await?; | ||
|
|
||
| // TODO: If either of these change, it would mean that the host |
There was a problem hiding this comment.
I think the TODOs and having to choose a buffer length are signs that sp.get_host_panic_payload() is too low-level for what we'd like from the gateway-sp-comms interface. It's the common interface between faux-mgs and real MGS, and ideally those behave more-or-less identically. If we've got logic here that's specifically documented as "the control plane should do this differently", that's not true anymore.
Many of the gateway-sp-comms methods are simple 1-to-1 wrappers around MGS <-> SP messages, but there are several that wrap multiple messages to present a nicer interface to the control plane; e.g., get_component_details() sends multiple requests necessary to collect all the information (similar to this interface I think), or start_update(), which drives an entire update.
From reading the faux-mgs code here, I think the interface I'd propose is sp.get_host_panic_payload() -> Result<Vec<u8>, _> (no arguments, return the entire panic message), which handles picking a reasonable buffer size and embeds this while loop. If we go this route we don't want these assertions though - probably reasonable to return a specific error for "host panic'd while we were reading the panic message" and then let the caller retry?
There was a problem hiding this comment.
I think that's reasonable! I can push down the "get all" logic one layer so that gateway-sp-comms handles more of this autonomously.
Just checking for the return type:
Result<Option<Vec<u8>>, _>is cool, you don't want the "sequence number" to be surfaced up? (it is a rough indicator of "how many times has the system panicked/bootfailed since the last SP reboot")- You want
Vec<u8>, instead ofString? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).
There was a problem hiding this comment.
Oh, bootfail also has a u8 "failure reason", I'm guessing we do want to pass that up as well? (panic doesn't have this).
There was a problem hiding this comment.
Ah, yes, sorry, definitely include more than just the message itself if there's more relevant information. I'd include both the sequence number and the failure reason (and any other bits if I missed those).
You want Vec, instead of String? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).
I had the same internal discussion before posting that. 😅 I could see reasonable arguments for:
Vec<u8>(fewer error cases to handle ingateway-sp-comms, always reports the exact data from the host)Stringviafrom_utf8(fails if we get corrupt / bad data, which seems not great)Stringviafrom_utf8_lossy(doesn't fail, but is lossy)
I don't think I'd pick "String via from_utf8". If we really expect these to always be UTF8, I'd be fine with "String via from_utf8_lossy", since then at least we'll always get as much of a printable thing as there is.
There was a problem hiding this comment.
Okay, I am back to report that I am a liar! Panic and Bootfail data are in fact not strings, but instead a specific format, for which a parser is provided in ipcc-rs/ipcc-data, and was defined by RFD0316.
We don't already seem to depend on ipcc-rs here yet. I think there are probably two reasonable-ish options:
- Just return the
Vec<u8>, allow the control plane to depend on bothgateway-sp-commsandipcc-data. - Add a dep on
ipcc-data, and return anipcc_data::HSSPanicinstead of aVec<u8>.
I'd probably lean towards the former, just because it makes dep versioning less messy to defer integration, but it does still leave this API as "some assembly required".
There was a problem hiding this comment.
Yeah, Vec<u8> seems fine - MGS can be "just a transport" in this case, and it's up to the control plane and host to coordinate on how to interpret it.
That said, we probably want to add ipcc-data as a dep to faux-mgs, so it can do similar interpretation? Otherwise it's stuck with just a blob?
There was a problem hiding this comment.
Yup, it seems much more reasonable to add to faux-mgs than gateway-sp-comms, imo.
| // The SP can only store one host panic at a time, so there's | ||
| // no way to retrieve an older panic after it has been | ||
| // overwritten. | ||
| assert_eq!(seqno, res.seqno); |
There was a problem hiding this comment.
This comment is not super relevant given my suggestion above will get rid of this, but just in general: I don't think assert! is ever correct in this context - it should be reserved for programmer error / invariant violation, and I think we'd treat any report of a faux-mgs panic as a bug in faux-mgs.
This is much more of an operational error akin to an I/O error reading a file or somesuch. As noted above I think we don't need to necessarily go all the way to a retry-loop-with-a-cap; it's fine to return an error (again akin to bailing out on an I/O error partway through reading a file).
There was a problem hiding this comment.
Gotcha! It's definitely not something I'd do in a library, but it seemed exceptional enough to be acceptable for a binary tool. This will go away once it gets consumed by the lib API, but I will keep that in mind for the future!
There was a problem hiding this comment.
Yeah, gotcha. I think some missing context here is that while this is a binary tool, in practice it has been wrapped up by other binary tools and is therefore used as a quasi library. This is kind of an unfortunate sequence, but does make the "probably shouldn't panic" more reasonable - I think faux-mgs is executed by other tools much, much more often than it's executed directly by a human.
This is a component of oxidecomputer/hubris#2518
and is probably not totally ready for feedback yet.This is a part of implementing oxidecomputer/hubris#2504.