feat(gpu): install NVIDIA GPU Operator via foundational software flow#348
feat(gpu): install NVIDIA GPU Operator via foundational software flow#348viniciusdc wants to merge 1 commit into
Conversation
Documents that provider/cluster-conditional foundational software installs imperatively via provider-driven Helm rather than as ArgoCD apps, while unconditional software stays in GitOps. Refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for this case, which is what makes the GPU operator's Deploy-time install the intended pattern rather than a deviation. Records the target shared helmInstaller + per-provider ConditionalCharts design; the implementation lands in #349, with #348 (GPU operator) and #352 (autoscaler) as the first instances and MetalLB flagged for migration.
d646045 to
370b96f
Compare
dcmcand
left a comment
There was a problem hiding this comment.
Thanks for this - the architecture is clean and consistent with the autoscaler/LBC Helm-in-Deploy pattern, the OTel and status.Send conventions all hold, and the helm-values test guarding driver/toolkit/MOFED against a future chart bump is a nice touch. Requesting changes for two correctness issues plus the merge state.
Two blockers (inline):
isGPUInstanceTypematchesg4ad, which is an AMD GPU, not NVIDIA - the operator would install and advertise nothing.reconcileGPUOperatorfails the whole deploy on a transient live-node-check error, on the CPU-only path that runs on every deploy.Destroyalready handles this best-effort;Deployshould too.
Also, the PR is currently not mergeable: it conflicts with main in provider.go and config.go because #352 (autoscaling) merged and restructured the same Deploy/Destroy regions. A rebase onto main is needed. The bonus is that reconcileGPUOperator can then mirror the merged installClusterAutoscaler call site, including a gate so CPU-only clusters skip the extra API round-trips (see the inline question).
A few non-blocking questions/nits are inline (chart-version v prefix vs NGC, upgrade-path AddRepo, table-driven test). I verified the gpu_operator config-block removal is non-breaking (main never carried one) and that isGPUInstanceType correctly excludes graviton / inf2 / trn1 (just not g4ad).
Design-of-record ADR-0006 (#361) and the shared interface in #349 noted, no concerns there.
| if family == "" || (family[0] != 'g' && family[0] != 'p') { | ||
| return false | ||
| } | ||
| return strings.ContainsAny(family, "0123456789") |
There was a problem hiding this comment.
Blocker: this returns true for g4ad.*, but g4ad instances carry an AMD Radeon Pro V520, not an NVIDIA GPU. Installing the NVIDIA operator there deploys a device plugin that finds nothing and never advertises nvidia.com/gpu - a cluster that looks set up but silently isn't. The g+digit rule can't distinguish g4ad from the NVIDIA g4dn. Please exclude g4ad explicitly, and add it to the negative list in TestIsGPUInstanceType (which currently omits it, so the gap isn't caught).
There was a problem hiding this comment.
Good catch. Fixed — isGPUInstanceType now excludes g4ad explicitly (it is the only AMD GPU family on EC2), and g4ad.xlarge is in the negative list of the now table-driven TestIsGPUInstanceType.
| live, err := clusterHasGPUNodes(ctx, kubeconfigBytes) | ||
| if err != nil { | ||
| span.RecordError(err) | ||
| return fmt.Errorf("failed to inspect cluster for GPU nodes: %w", err) |
There was a problem hiding this comment.
Blocker: this propagates a live-node-check error as a hard Deploy failure, and it sits on the CPU-only path that runs on every deploy. A transient Kubernetes API error here aborts an otherwise-successful deploy after tf.Apply, whereas Destroy treats the same situation as best-effort.
Suggest making the live check advisory: on error, warn via status.Send and fall through with want = false. uninstallGPUOperator is already a safe no-op when the release is absent, so a flaky list call can't fail the deploy for a component that has no cloud resources.
There was a problem hiding this comment.
Fixed — the live-node check is now advisory. On a list error it status.Sends a warning and falls through with want=false instead of aborting the deploy; install/uninstall errors still surface as real failures. Config gpu: true short-circuits the check entirely, so clusters that declare GPU never pay the round-trip, and CPU-only clusters do a single advisory node-list that can no longer fail the deploy. If you would rather drop it to a pure config gate (zero round-trips on CPU-only, at the cost of not catching undeclared GPU nodes), say the word.
| // gpuOperatorChartVersion pins the upstream NVIDIA gpu-operator Helm chart. | ||
| // v26.3.2 is the latest published version in the nvidia repo (chart version | ||
| // tracks appVersion). Bump as new releases land. | ||
| gpuOperatorChartVersion = "v26.3.2" |
There was a problem hiding this comment.
Two small things:
- The
vprefix (v26.3.2) differs from the LBC'sdefaultLBCChartVersion = "3.2.1". Helm chart versions are usually bare semver (thevis a git-tag convention). Can you confirm Helm'sLocateChartagainst the NGC registry accepts thev? If NGC expects26.3.2, this fails with an opaque "chart version not found." Matching the LBC format would be safer. - The comment says the chart version "tracks appVersion" - that isn't generally true for gpu-operator across releases. Pointing at the NGC catalog/releases page would be more durable bump guidance.
There was a problem hiding this comment.
Confirmed against the NGC index: gpu-operator publishes its chart as version: "v26.3.2", with the v — so the prefix is required for LocateChart here, unlike the bare-semver LBC chart. Kept v26.3.2 and rewrote the comment to say exactly that and point at the NGC catalog for bump guidance, dropping the "tracks appVersion" line (you are right that it does not generally hold).
| } | ||
|
|
||
| // upgradeGPUOperator upgrades an existing gpu-operator release. | ||
| func upgradeGPUOperator(ctx context.Context, actionConfig *action.Configuration) error { |
There was a problem hiding this comment.
Question (also applies to upgradeAWSLoadBalancerController, so not a regression): the upgrade path doesn't call helm.AddRepo before LocateChart. Within one deploy it's fine since install runs AddRepo first, but on a fresh machine where the operator already exists on-cluster, the upgrade path runs without the NGC repo in the local Helm cache and LocateChart can fail. Worth confirming a mitigation or adding AddRepo here.
There was a problem hiding this comment.
In our call path it is covered: reconcileGPUOperator → installGPUOperator runs helm.AddRepo before the history-check that branches to upgrade, so the repo is in the local cache by the time upgradeGPUOperator → LocateChart runs. upgradeGPUOperator is never called on its own. Left as-is to mirror the LBC structure you flagged; happy to hoist AddRepo above the branch if you would prefer it explicit in both.
|
|
||
| import "testing" | ||
|
|
||
| func TestIsGPUInstanceType(t *testing.T) { |
There was a problem hiding this comment.
Stylistic (repo convention is table-driven tests, per CLAUDE.md): converting this to a []struct{name, input, want} table with t.Run would match TestHasGPUNodeGroups and make it easy to add the g4ad negative case from the blocker above.
There was a problem hiding this comment.
Done — converted to a name/input/want table with t.Run, matching TestHasGPUNodeGroups, and added the g4ad negative case plus trn1/graviton/empty.
…cile Installs the NVIDIA GPU Operator on AWS clusters that have GPU nodes so nvidia.com/gpu is advertised, without each software pack shipping its own operator app. Follows the cluster-autoscaler / LBC imperative-Helm-in-Deploy pattern rather than a GitOps ArgoCD app (decision recorded in ADR-0006, #361). pkg/provider/aws/gpu_operator.go: - installGPUOperator / upgradeGPUOperator / uninstallGPUOperator via Helm (history-check, then install or upgrade). Values are the AWS NVIDIA-AMI defaults: driver and toolkit off, device plugin on, MOFED off (EFA safety). - reconcileGPUOperator: install when the cluster has or is configured to have GPU nodes, uninstall otherwise. gpu: true node groups are the primary signal; for clusters that don't declare them an advisory live-node check (by instance-type label) catches undeclared nodes and never fails the deploy. - isGPUInstanceType: g/p families with a digit, excluding g4ad (AMD, not NVIDIA). Deploy reconciles before the GitOps stage; Destroy uninstalls before tofu destroy, best-effort like Longhorn. Chart pinned to v26.3.2 (NGC lists it v-prefixed). No IAM needed; the operator only touches in-cluster resources.
370b96f to
ae92bd2
Compare
|
Thanks for the thorough review @dcmcand. Pushed the changes (rebased onto main, single commit now):
Replies on each thread. ADR-0006 (the Helm-vs-GitOps decision) is split out to #361 for its own discussion. |
dcmcand
left a comment
There was a problem hiding this comment.
Code review (Go). Strong implementation overall - GPU family detection, the union-of-signals reconcile logic, Destroy/Longhorn symmetry, and OTel instrumentation are all well done and faithfully follow the LBC/Longhorn pattern. Inline comments below cover 2 blockers, 1 question, and a few stylistic notes. (Architecture/ADR findings are being handled separately.)
| return err | ||
| } | ||
|
|
||
| release, err := client.Run(loadedChart, gpuOperatorHelmValues()) |
There was a problem hiding this comment.
[Blocker] client.Run discards ctx - use RunWithContext.
In Helm v3.19.4, Install.Run creates context.Background() internally and ignores the caller's context. A Ctrl-C or upstream timeout during the ~10-minute install window won't be propagated to Helm: the install keeps running in the background after the CLI has already returned an error or exited.
release, err := client.RunWithContext(ctx, loadedChart, gpuOperatorHelmValues())(The LBC/autoscaler siblings have the same issue but aren't touched by this PR, so just fixing the new code is consistent with project norms.)
| return err | ||
| } | ||
|
|
||
| release, err := client.Run(gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues()) |
There was a problem hiding this comment.
[Blocker] Same ctx-dropping issue on the upgrade path. Upgrade.Run also builds its own context.Background() in this Helm version. Switch to client.RunWithContext(ctx, gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues()).
Note: Uninstall.Run (line 330) has no RunWithContext equivalent in v3.19.4, so leave that one as-is.
| live, err := clusterHasGPUNodes(ctx, kubeconfigBytes) | ||
| if err != nil { | ||
| status.Send(ctx, status.NewUpdate(status.LevelWarning, | ||
| fmt.Sprintf("Skipping GPU node check — could not list nodes: %v", err)). |
There was a problem hiding this comment.
[Blocker] Em dash in user-visible status message. This project bans em dashes (U+2014) in output. Replace with a colon or hyphen, e.g. "Skipping GPU node check: could not list nodes: %v".
| kubeconfigBytes, kErr := p.GetKubeconfig(ctx, projectName, clusterConfig) | ||
| switch { | ||
| case kErr != nil: | ||
| status.Send(ctx, status.NewUpdate(status.LevelWarning, fmt.Sprintf("Skipping GPU Operator uninstall — kubeconfig unavailable: %v", kErr)). |
There was a problem hiding this comment.
[Blocker] Em dash in user-visible status message. Same as the gpu_operator.go message - replace — with a colon or hyphen: "Skipping GPU Operator uninstall: kubeconfig unavailable: %v". (The two existing Longhorn messages just above have the same pre-existing issue, but the new message should be clean.)
| return fmt.Errorf("failed to get kubeconfig for GPU Operator reconcile: %w", err) | ||
| } | ||
|
|
||
| if err := reconcileGPUOperator(ctx, kubeconfigBytes, awsCfg); err != nil { |
There was a problem hiding this comment.
[Question] Unconditional reconcile on every deploy. This block runs for every deploy, including clusters with no GPU. For those, it fetches a kubeconfig (cached, cheap), makes a real clusterHasGPUNodes node-list API call, then a no-op uninstall. LBC and cluster-autoscaler are each guarded by a config flag instead.
Was the always-on reconcile intentional - i.e. a true desired-state loop that removes the operator when GPU nodes disappear? If the node-list call on non-GPU clusters is a concern, a if cfg.hasGPUNodeGroups() short-circuit before the live check would skip it, but that would miss undeclared GPU nodes. Just confirming intent so we know whether the live-node path is permanent or transitional.
| return fmt.Errorf("failed to upgrade GPU Operator: %w", err) | ||
| } | ||
|
|
||
| span.SetAttributes( |
There was a problem hiding this comment.
[Stylistic] Upgrade span is missing chart_version. installGPUOperator sets attribute.String("chart_version", gpuOperatorChartVersion) on its span (line 179) but the upgrade path doesn't. Upgrade is the hot path on subsequent deploys for a GPU cluster, so anyone debugging "why didn't the chart version change" won't see it in upgrade traces. Add the same attribute here for parity.
|
|
||
| // loadGPUOperatorChart locates and loads the gpu-operator Helm chart. | ||
| func loadGPUOperatorChart(chartPathOptions action.ChartPathOptions) (*chart.Chart, error) { | ||
| chartPath, err := chartPathOptions.LocateChart(gpuOperatorChartName, cli.New()) |
There was a problem hiding this comment.
[Stylistic / tracking only] cli.New() reads HELM_* env vars (HELM_NAMESPACE, HELM_REPOSITORY_CACHE, etc.), so the effective Helm settings depend on the operator's shell environment in a way that isn't obvious from the signature. Consistent with the LBC/autoscaler siblings, so not a regression - just worth tracking as a source of subtle behavior differences when the #349 shared Helm infrastructure lands.
| } | ||
| } | ||
|
|
||
| func TestHasGPUNodeGroups(t *testing.T) { |
There was a problem hiding this comment.
[Stylistic] Add edge cases to TestHasGPUNodeGroups. Ranging over a nil/empty map is safe and returns false, but an explicit "empty node groups" case (Config{}) documents that nil-safe behavior as intentional and guards against a future refactor to a length check. A case with only a GPU: true group (no non-GPU group) would also round out coverage.
| } | ||
| } | ||
|
|
||
| func TestGPUOperatorHelmValues(t *testing.T) { |
There was a problem hiding this comment.
[Stylistic] TestGPUOperatorHelmValues reads as a flat sequential test; project convention is table-driven. The checked values (driver.enabled, toolkit.enabled, devicePlugin.enabled, MOFED_ENABLED) map cleanly to a table of {path, expected} rows, which also makes adding future coverage (gfd.enabled, dcgm.enabled) easier without restructuring.
…361) Documents that provider/cluster-conditional foundational software installs imperatively via provider-driven Helm rather than as ArgoCD apps, while unconditional software stays in GitOps. Refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for this case, which is what makes the GPU operator's Deploy-time install the intended pattern rather than a deviation. Records the target shared helmInstaller + per-provider ConditionalCharts design; the implementation lands in #349, with #348 (GPU operator) and #352 (autoscaler) as the first instances and MetalLB flagged for migration. Co-authored-by: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>
closes #232
Installs the NVIDIA GPU Operator so GPU node groups actually get
nvidia.com/gpuadvertised, instead of every software pack having to ship its own operator app.This started as an ArgoCD app and I moved it to an imperative Helm install that the provider reconciles, the same way the cluster-autoscaler (#352) and the AWS load balancer controller already work. The operator wants ordered, idempotent lifecycle we drive ourselves (install when GPU nodes exist, remove when they don't, clean up on destroy), and that fits Helm-in-
Deploya lot better than a manifest the GitOps layer renders. It also meant I could drop the shared config struct entirely: gating is now just whether the cluster has a GPU signal, not a config block.What this does
pkg/provider/aws/gpu_operator.go:installGPUOperator/upgradeGPUOperator/uninstallGPUOperatorover Helm (history-check, then install or upgrade). Values are the AWS defaults: driver and toolkit off (the AL2023 NVIDIA AMI already ships them), device plugin on, andMOFED_ENABLED=falseso it doesn't grab theuverbsdevices and fight the EFA plugin.reconcileGPUOperator: installs when the cluster has (or is configured to have) GPU nodes, uninstalls otherwise. "Has GPU nodes" is the union of node groups taggedgpu: true(which already selects the NVIDIA AMI) and live nodes on a GPU instance type. The config side covers a GPU group scaled to zero; the live check covers nodes the config didn't declare. I read the instance-type label on purpose rather thannvidia.com/gpu, since that resource only shows up once the operator is already running.isGPUInstanceType: g/p families with a digit, so it catchesgr6but notgraviton.Deployreconciles before the GitOps/argo stage sonvidia.com/gpuis there by the time GPU workloads sync;Destroyuninstalls beforetofu destroy, best-effort like Longhorn. The operator has no cloud resources of its own so it can't actually block teardown, but removing it first keeps things symmetric.No config block and no shared struct anymore: removed
pkg/provider/gpu.go, theInfraSettings.GPUOperatorfield, both providers'gpu_operatorYAML +ResolveGPUOperator, the ArgoCD template, and the writer skip/prune plumbing.Notes
Deployrather than as an ArgoCD app, and refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for that case. The sharedhelmInstaller+ per-providerConditionalChartsinterface the ADR describes lands in Define a consistent pattern for conditional foundational apps (CFAs) #349; this PR (and Support autoscaling in AWS provider #352) are the first instances.v26.3.2(a const ingpu_operator.go), the latest publishedgpu-operatorchart in the nvidia repo. No IAM/Pod Identity needed here, the operator only touches in-cluster resources.isGPUInstanceType,hasGPUNodeGroups, and the Helm values. The install/reconcile paths need a live cluster + Helm, so I didn't unit-test those here.Device-management docs behind the AWS defaults: EKS.