-
Notifications
You must be signed in to change notification settings - Fork 3
Add MGS API for getting Host Panic/Host Boot Failure messages #495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
0c30aab
086e518
470fcde
7b4d6b5
9954c9d
4dfe88d
574f952
0646bfc
2f8d7dd
b70d3a5
b93969b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,7 @@ use gateway_messages::ApobComponentActionResponse; | |
| use gateway_messages::ComponentAction; | ||
| use gateway_messages::ComponentActionResponse; | ||
| use gateway_messages::EcdsaSha2Nistp256Challenge; | ||
| use gateway_messages::HostInfoRequest; | ||
| use gateway_messages::IgnitionCommand; | ||
| use gateway_messages::LedComponentAction; | ||
| use gateway_messages::MonorailComponentAction; | ||
|
|
@@ -542,6 +543,10 @@ enum Command { | |
| #[clap(value_parser = parse_power_rail_name)] | ||
| rail: PowerRailName, | ||
| }, | ||
| /// Get the Host Panic Payload | ||
| GetHostPanic, | ||
| /// Get the Host Boot Failure message | ||
| GetBootFail, | ||
| } | ||
|
|
||
| #[derive(Subcommand, Debug, Clone)] | ||
|
|
@@ -2524,6 +2529,141 @@ async fn run_command( | |
|
|
||
| Ok(Output::Lines(lines)) | ||
| } | ||
| Command::GetHostPanic => { | ||
| // Get the first segment, if any | ||
| // | ||
| // TODO: buffer sizing? We have a normal sized UDP frame (15xx | ||
| // bytes?), with some overhead, 1k might be reasonable here, but | ||
| // there's probably not that much speed benefit to halving the | ||
| // number of frames sent as this isn't done in a "hot" loop. | ||
| let res = sp.get_host_panic_payload(None, 512).await?; | ||
|
|
||
| let mut total = res.contents; | ||
| let ttl_bytes = res.total_len; | ||
| let seqno = res.seqno; | ||
|
|
||
| // Truncate the payload (which potentially contains *more* bytes | ||
| // than were actually used!) if the total message bytes fit into | ||
| // a single frame. This is a no-op if ttl_bytes > total.len(). | ||
| total.truncate(ttl_bytes); | ||
|
|
||
| // Request the entire contents, one chunk at a time | ||
| while total.len() < ttl_bytes { | ||
| // Get the NEXT chunk of data, after the part(s) that we already | ||
| // have received. | ||
| let res = sp | ||
| .get_host_panic_payload( | ||
| Some(HostInfoRequest { | ||
| offset: total.len() as u32, | ||
| seqno, | ||
| }), | ||
| 512, | ||
| ) | ||
| .await?; | ||
|
|
||
| // TODO: If either of these change, it would mean that the host | ||
| // panicked RIGHT as we were asking about it. The simpler route | ||
| // is to just panic, if we wanted to be really fancy we could | ||
| // put this whole match arm in an outer loop and gracefully | ||
| // retry. If you are taking this impl for real control plane | ||
| // things, consider doing that, maybe with some upper bound of | ||
| // retries! | ||
| // | ||
| // The SP can only store one host panic at a time, so there's | ||
| // no way to retrieve an older panic after it has been | ||
| // overwritten. | ||
| assert_eq!(seqno, res.seqno); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This comment is not super relevant given my suggestion above will get rid of this, but just in general: I don't think This is much more of an operational error akin to an I/O error reading a file or somesuch. As noted above I think we don't need to necessarily go all the way to a retry-loop-with-a-cap; it's fine to return an error (again akin to bailing out on an I/O error partway through reading a file).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Gotcha! It's definitely not something I'd do in a library, but it seemed exceptional enough to be acceptable for a binary tool. This will go away once it gets consumed by the lib API, but I will keep that in mind for the future!
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, gotcha. I think some missing context here is that while this is a binary tool, in practice it has been wrapped up by other binary tools and is therefore used as a quasi library. This is kind of an unfortunate sequence, but does make the "probably shouldn't panic" more reasonable - I think faux-mgs is executed by other tools much, much more often than it's executed directly by a human. |
||
| assert_eq!(ttl_bytes, res.total_len); | ||
| total.extend_from_slice(&res.contents); | ||
| } | ||
|
|
||
| // Again, truncate `total`, to handle any extra bytes in the last | ||
| // received frame. | ||
| total.truncate(ttl_bytes); | ||
|
|
||
| let mut out = vec![]; | ||
| if let Ok(text) = std::str::from_utf8(&total) { | ||
| out.push("Panic Text:".to_string()); | ||
| // TODO: Is this necessary? Just push as one to_string? | ||
| out.extend(text.lines().map(str::to_string)); | ||
| Ok(Output::Lines(out)) | ||
| } else { | ||
| out.push( | ||
| "Panic Text was not a valid UTF-8 string.".to_string(), | ||
| ); | ||
| out.push("Panic Text bytes (hex):".to_string()); | ||
| out.push(format!("{total:02X?}")); | ||
| Ok(Output::Lines(out)) | ||
| } | ||
| } | ||
| Command::GetBootFail => { | ||
| // Get the first segment, if any | ||
| // | ||
| // TODO: buffer sizing? We have a normal sized UDP frame (15xx | ||
| // bytes?), with some overhead, 1k might be reasonable here, but | ||
| // there's probably not that much speed benefit to halving the | ||
| // number of frames sent as this isn't done in a "hot" loop. | ||
| let res = sp.get_host_bootfail_payload(None, 512).await?; | ||
|
|
||
| let mut total = res.contents; | ||
| let ttl_bytes = res.total_len; | ||
| let seqno = res.seqno; | ||
|
|
||
| // Truncate the payload (which potentially contains *more* bytes | ||
| // than were actually used!) if the total message bytes fit into | ||
| // a single frame. This is a no-op if ttl_bytes > total.len(). | ||
| total.truncate(ttl_bytes); | ||
|
|
||
| // Request the entire contents, one chunk at a time | ||
| while total.len() < ttl_bytes { | ||
| // Get the NEXT chunk of data, after the part(s) that we already | ||
| // have received. | ||
| let res = sp | ||
| .get_host_bootfail_payload( | ||
| Some(HostInfoRequest { | ||
| offset: total.len() as u32, | ||
| seqno, | ||
| }), | ||
| 512, | ||
| ) | ||
| .await?; | ||
|
|
||
| // TODO: If either of these change, it would mean that the host | ||
| // failed RIGHT as we were asking about it. The simpler route | ||
| // is to just panic, if we wanted to be really fancy we could | ||
| // put this whole match arm in an outer loop and gracefully | ||
| // retry. If you are taking this impl for real control plane | ||
| // things, consider doing that, maybe with some upper bound of | ||
| // retries! | ||
| // | ||
| // The SP can only store one bootfail at a time, so there's | ||
| // no way to retrieve an older panic after it has been | ||
| // overwritten. | ||
| assert_eq!(seqno, res.seqno); | ||
| assert_eq!(ttl_bytes, res.total_len); | ||
| total.extend_from_slice(&res.contents); | ||
| } | ||
|
|
||
| // Again, truncate `total`, to handle any extra bytes in the last | ||
| // received frame. | ||
| total.truncate(ttl_bytes); | ||
|
|
||
| let mut out = vec![]; | ||
| if let Ok(text) = std::str::from_utf8(&total) { | ||
| out.push("Boot Failure Text:".to_string()); | ||
| // TODO: Is this necessary? Just push as one to_string? | ||
| out.extend(text.lines().map(str::to_string)); | ||
| Ok(Output::Lines(out)) | ||
| } else { | ||
| out.push( | ||
| "Boot Failure Text was not a valid UTF-8 string." | ||
| .to_string(), | ||
| ); | ||
| out.push("Boot Failure Text bytes (hex):".to_string()); | ||
| out.push(format!("{total:02X?}")); | ||
| Ok(Output::Lines(out)) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -243,8 +243,41 @@ pub enum MgsRequest { | |
| /// different than the current active slot (see `ComponentGetActiveSlot`). | ||
| ComponentGetPersistentSlot(SpComponent), | ||
|
|
||
| /// Request the STATUS registers of a PMBus device, indexed by power rail name. | ||
| /// Request the STATUS registers of a PMBus device, indexed by power rail | ||
| /// name. | ||
| GetPmbusStatus(PowerRailName), | ||
|
|
||
| /// Request for Host Panic Payload | ||
| GetHostPanicPayload { | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With apologies for how tedious this is: since we're adding to the protocol, can we bump the version and add tests that ensure we don't break compatibility in future changes? See #482 for a simple one, or #392 for a somewhat more involved one. Also it looks like #497 should have had this also 😬. We could either PR a version bump + tests for that separately, then come back to this PR, or roll both changes into one version bump? |
||
| // If Some: must include a valid offset, and the index of the current Host Panic. | ||
| // If None: Will always be an offset of 0, and the index of the current Host Panic | ||
| // (if any) will be included in the response | ||
| request: Option<HostInfoRequest>, | ||
| // The maximum size, in bytes, of the fragment to be returned | ||
| len: u32, | ||
| }, | ||
|
|
||
| /// Request for Host Boot Failure Payload | ||
| GetHostBootfailPayload { | ||
| // If Some: must include a valid offset, and the index of the current Host Bootfail. | ||
| // If None: Will always be an offset of 0, and the index of the current Host Bootfail | ||
| // (if any) will be included in the response | ||
| request: Option<HostInfoRequest>, | ||
| // The maximum size, in bytes, of the fragment to be returned | ||
| len: u32, | ||
| }, | ||
| } | ||
|
|
||
| /// Request information for Host Bootfail or Host Panic data | ||
| #[derive( | ||
| Debug, Clone, Copy, PartialEq, Eq, Deserialize, Serialize, SerializedSize, | ||
| )] | ||
| pub struct HostInfoRequest { | ||
| /// The offset, in bytes, from the start of the data | ||
| pub offset: u32, | ||
| /// The specific data sequence number that uniquely identifies the requested | ||
| /// data | ||
| pub seqno: u32, | ||
| } | ||
|
|
||
| #[derive( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the TODOs and having to choose a buffer length are signs that
sp.get_host_panic_payload()is too low-level for what we'd like from thegateway-sp-commsinterface. It's the common interface between faux-mgs and real MGS, and ideally those behave more-or-less identically. If we've got logic here that's specifically documented as "the control plane should do this differently", that's not true anymore.Many of the
gateway-sp-commsmethods are simple 1-to-1 wrappers around MGS <-> SP messages, but there are several that wrap multiple messages to present a nicer interface to the control plane; e.g.,get_component_details()sends multiple requests necessary to collect all the information (similar to this interface I think), orstart_update(), which drives an entire update.From reading the faux-mgs code here, I think the interface I'd propose is
sp.get_host_panic_payload() -> Result<Vec<u8>, _>(no arguments, return the entire panic message), which handles picking a reasonable buffer size and embeds this while loop. If we go this route we don't want these assertions though - probably reasonable to return a specific error for "host panic'd while we were reading the panic message" and then let the caller retry?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's reasonable! I can push down the "get all" logic one layer so that
gateway-sp-commshandles more of this autonomously.Just checking for the return type:
Result<Option<Vec<u8>>, _>is cool, you don't want the "sequence number" to be surfaced up? (it is a rough indicator of "how many times has the system panicked/bootfailed since the last SP reboot")Vec<u8>, instead ofString? The data should always be UTF-8, but we can let the caller decide what to do if we got a payload but it was corrupted (or the host gave us bad UTF-8 data).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh,
bootfailalso has au8"failure reason", I'm guessing we do want to pass that up as well? (panic doesn't have this).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, sorry, definitely include more than just the message itself if there's more relevant information. I'd include both the sequence number and the failure reason (and any other bits if I missed those).
I had the same internal discussion before posting that. 😅 I could see reasonable arguments for:
Vec<u8>(fewer error cases to handle ingateway-sp-comms, always reports the exact data from the host)Stringviafrom_utf8(fails if we get corrupt / bad data, which seems not great)Stringviafrom_utf8_lossy(doesn't fail, but is lossy)I don't think I'd pick "
Stringviafrom_utf8". If we really expect these to always be UTF8, I'd be fine with "Stringviafrom_utf8_lossy", since then at least we'll always get as much of a printable thing as there is.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I am back to report that I am a liar! Panic and Bootfail data are in fact not strings, but instead a specific format, for which a parser is provided in
ipcc-rs/ipcc-data, and was defined by RFD0316.We don't already seem to depend on
ipcc-rshere yet. I think there are probably two reasonable-ish options:Vec<u8>, allow the control plane to depend on bothgateway-sp-commsandipcc-data.ipcc-data, and return anipcc_data::HSSPanicinstead of aVec<u8>.I'd probably lean towards the former, just because it makes dep versioning less messy to defer integration, but it does still leave this API as "some assembly required".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah,
Vec<u8>seems fine - MGS can be "just a transport" in this case, and it's up to the control plane and host to coordinate on how to interpret it.That said, we probably want to add
ipcc-dataas a dep tofaux-mgs, so it can do similar interpretation? Otherwise it's stuck with just a blob?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it seems much more reasonable to add to
faux-mgsthangateway-sp-comms, imo.