Skip to content

feat(gpu): install NVIDIA GPU Operator via foundational software flow#348

Open
viniciusdc wants to merge 1 commit into
mainfrom
feat/gpu-operator-foundational
Open

feat(gpu): install NVIDIA GPU Operator via foundational software flow#348
viniciusdc wants to merge 1 commit into
mainfrom
feat/gpu-operator-foundational

Conversation

@viniciusdc

@viniciusdc viniciusdc commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

closes #232

Installs the NVIDIA GPU Operator so GPU node groups actually get nvidia.com/gpu advertised, instead of every software pack having to ship its own operator app.

This started as an ArgoCD app and I moved it to an imperative Helm install that the provider reconciles, the same way the cluster-autoscaler (#352) and the AWS load balancer controller already work. The operator wants ordered, idempotent lifecycle we drive ourselves (install when GPU nodes exist, remove when they don't, clean up on destroy), and that fits Helm-in-Deploy a lot better than a manifest the GitOps layer renders. It also meant I could drop the shared config struct entirely: gating is now just whether the cluster has a GPU signal, not a config block.

What this does

  • pkg/provider/aws/gpu_operator.go: installGPUOperator / upgradeGPUOperator / uninstallGPUOperator over Helm (history-check, then install or upgrade). Values are the AWS defaults: driver and toolkit off (the AL2023 NVIDIA AMI already ships them), device plugin on, and MOFED_ENABLED=false so it doesn't grab the uverbs devices and fight the EFA plugin.
  • reconcileGPUOperator: installs when the cluster has (or is configured to have) GPU nodes, uninstalls otherwise. "Has GPU nodes" is the union of node groups tagged gpu: true (which already selects the NVIDIA AMI) and live nodes on a GPU instance type. The config side covers a GPU group scaled to zero; the live check covers nodes the config didn't declare. I read the instance-type label on purpose rather than nvidia.com/gpu, since that resource only shows up once the operator is already running.
  • isGPUInstanceType: g/p families with a digit, so it catches gr6 but not graviton.
  • Wiring: Deploy reconciles before the GitOps/argo stage so nvidia.com/gpu is there by the time GPU workloads sync; Destroy uninstalls before tofu destroy, best-effort like Longhorn. The operator has no cloud resources of its own so it can't actually block teardown, but removing it first keeps things symmetric.

No config block and no shared struct anymore: removed pkg/provider/gpu.go, the InfraSettings.GPUOperator field, both providers' gpu_operator YAML + ResolveGPUOperator, the ArgoCD template, and the writer skip/prune plumbing.

Notes

Device-management docs behind the AWS defaults: EKS.

@viniciusdc viniciusdc marked this pull request as ready for review June 4, 2026 13:42
viniciusdc added a commit that referenced this pull request Jun 5, 2026
Documents that provider/cluster-conditional foundational software installs
imperatively via provider-driven Helm rather than as ArgoCD apps, while
unconditional software stays in GitOps. Refines the blanket "GitOps for
software" rule in ADR-0001 / AGENTS.md for this case, which is what makes
the GPU operator's Deploy-time install the intended pattern rather than a
deviation.

Records the target shared helmInstaller + per-provider ConditionalCharts
design; the implementation lands in #349, with #348 (GPU operator) and #352
(autoscaler) as the first instances and MetalLB flagged for migration.
@viniciusdc viniciusdc force-pushed the feat/gpu-operator-foundational branch from d646045 to 370b96f Compare June 5, 2026 01:23

@dcmcand dcmcand left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this - the architecture is clean and consistent with the autoscaler/LBC Helm-in-Deploy pattern, the OTel and status.Send conventions all hold, and the helm-values test guarding driver/toolkit/MOFED against a future chart bump is a nice touch. Requesting changes for two correctness issues plus the merge state.

Two blockers (inline):

  1. isGPUInstanceType matches g4ad, which is an AMD GPU, not NVIDIA - the operator would install and advertise nothing.
  2. reconcileGPUOperator fails the whole deploy on a transient live-node-check error, on the CPU-only path that runs on every deploy. Destroy already handles this best-effort; Deploy should too.

Also, the PR is currently not mergeable: it conflicts with main in provider.go and config.go because #352 (autoscaling) merged and restructured the same Deploy/Destroy regions. A rebase onto main is needed. The bonus is that reconcileGPUOperator can then mirror the merged installClusterAutoscaler call site, including a gate so CPU-only clusters skip the extra API round-trips (see the inline question).

A few non-blocking questions/nits are inline (chart-version v prefix vs NGC, upgrade-path AddRepo, table-driven test). I verified the gpu_operator config-block removal is non-breaking (main never carried one) and that isGPUInstanceType correctly excludes graviton / inf2 / trn1 (just not g4ad).

Design-of-record ADR-0006 (#361) and the shared interface in #349 noted, no concerns there.

if family == "" || (family[0] != 'g' && family[0] != 'p') {
return false
}
return strings.ContainsAny(family, "0123456789")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: this returns true for g4ad.*, but g4ad instances carry an AMD Radeon Pro V520, not an NVIDIA GPU. Installing the NVIDIA operator there deploys a device plugin that finds nothing and never advertises nvidia.com/gpu - a cluster that looks set up but silently isn't. The g+digit rule can't distinguish g4ad from the NVIDIA g4dn. Please exclude g4ad explicitly, and add it to the negative list in TestIsGPUInstanceType (which currently omits it, so the gap isn't caught).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed — isGPUInstanceType now excludes g4ad explicitly (it is the only AMD GPU family on EC2), and g4ad.xlarge is in the negative list of the now table-driven TestIsGPUInstanceType.

Comment thread pkg/provider/aws/gpu_operator.go Outdated
live, err := clusterHasGPUNodes(ctx, kubeconfigBytes)
if err != nil {
span.RecordError(err)
return fmt.Errorf("failed to inspect cluster for GPU nodes: %w", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: this propagates a live-node-check error as a hard Deploy failure, and it sits on the CPU-only path that runs on every deploy. A transient Kubernetes API error here aborts an otherwise-successful deploy after tf.Apply, whereas Destroy treats the same situation as best-effort.

Suggest making the live check advisory: on error, warn via status.Send and fall through with want = false. uninstallGPUOperator is already a safe no-op when the release is absent, so a flaky list call can't fail the deploy for a component that has no cloud resources.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the live-node check is now advisory. On a list error it status.Sends a warning and falls through with want=false instead of aborting the deploy; install/uninstall errors still surface as real failures. Config gpu: true short-circuits the check entirely, so clusters that declare GPU never pay the round-trip, and CPU-only clusters do a single advisory node-list that can no longer fail the deploy. If you would rather drop it to a pure config gate (zero round-trips on CPU-only, at the cost of not catching undeclared GPU nodes), say the word.

// gpuOperatorChartVersion pins the upstream NVIDIA gpu-operator Helm chart.
// v26.3.2 is the latest published version in the nvidia repo (chart version
// tracks appVersion). Bump as new releases land.
gpuOperatorChartVersion = "v26.3.2"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small things:

  • The v prefix (v26.3.2) differs from the LBC's defaultLBCChartVersion = "3.2.1". Helm chart versions are usually bare semver (the v is a git-tag convention). Can you confirm Helm's LocateChart against the NGC registry accepts the v? If NGC expects 26.3.2, this fails with an opaque "chart version not found." Matching the LBC format would be safer.
  • The comment says the chart version "tracks appVersion" - that isn't generally true for gpu-operator across releases. Pointing at the NGC catalog/releases page would be more durable bump guidance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed against the NGC index: gpu-operator publishes its chart as version: "v26.3.2", with the v — so the prefix is required for LocateChart here, unlike the bare-semver LBC chart. Kept v26.3.2 and rewrote the comment to say exactly that and point at the NGC catalog for bump guidance, dropping the "tracks appVersion" line (you are right that it does not generally hold).

}

// upgradeGPUOperator upgrades an existing gpu-operator release.
func upgradeGPUOperator(ctx context.Context, actionConfig *action.Configuration) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question (also applies to upgradeAWSLoadBalancerController, so not a regression): the upgrade path doesn't call helm.AddRepo before LocateChart. Within one deploy it's fine since install runs AddRepo first, but on a fresh machine where the operator already exists on-cluster, the upgrade path runs without the NGC repo in the local Helm cache and LocateChart can fail. Worth confirming a mitigation or adding AddRepo here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our call path it is covered: reconcileGPUOperator → installGPUOperator runs helm.AddRepo before the history-check that branches to upgrade, so the repo is in the local cache by the time upgradeGPUOperator → LocateChart runs. upgradeGPUOperator is never called on its own. Left as-is to mirror the LBC structure you flagged; happy to hoist AddRepo above the branch if you would prefer it explicit in both.


import "testing"

func TestIsGPUInstanceType(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic (repo convention is table-driven tests, per CLAUDE.md): converting this to a []struct{name, input, want} table with t.Run would match TestHasGPUNodeGroups and make it easy to add the g4ad negative case from the blocker above.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — converted to a name/input/want table with t.Run, matching TestHasGPUNodeGroups, and added the g4ad negative case plus trn1/graviton/empty.

…cile

Installs the NVIDIA GPU Operator on AWS clusters that have GPU nodes so
nvidia.com/gpu is advertised, without each software pack shipping its own
operator app. Follows the cluster-autoscaler / LBC imperative-Helm-in-Deploy
pattern rather than a GitOps ArgoCD app (decision recorded in ADR-0006, #361).

pkg/provider/aws/gpu_operator.go:
- installGPUOperator / upgradeGPUOperator / uninstallGPUOperator via Helm
  (history-check, then install or upgrade). Values are the AWS NVIDIA-AMI
  defaults: driver and toolkit off, device plugin on, MOFED off (EFA safety).
- reconcileGPUOperator: install when the cluster has or is configured to have
  GPU nodes, uninstall otherwise. gpu: true node groups are the primary signal;
  for clusters that don't declare them an advisory live-node check (by
  instance-type label) catches undeclared nodes and never fails the deploy.
- isGPUInstanceType: g/p families with a digit, excluding g4ad (AMD, not NVIDIA).

Deploy reconciles before the GitOps stage; Destroy uninstalls before tofu
destroy, best-effort like Longhorn. Chart pinned to v26.3.2 (NGC lists it
v-prefixed). No IAM needed; the operator only touches in-cluster resources.
@viniciusdc viniciusdc force-pushed the feat/gpu-operator-foundational branch from 370b96f to ae92bd2 Compare June 5, 2026 12:26
@viniciusdc

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @dcmcand. Pushed the changes (rebased onto main, single commit now):

  • g4ad blockerisGPUInstanceType excludes g4ad; added to the table-driven test negatives.
  • Live-node-check blocker — the check is advisory now (warn + want=false on error, never aborts the deploy); install/uninstall errors still surface. gpu: true short-circuits it so GPU clusters skip the round-trip.
  • Rebase — onto main past Support autoscaling in AWS provider #352; reconcileGPUOperator sits right after the autoscaler install in Deploy, uninstall before tofu destroy in Destroy.
  • Chart v prefix — confirmed correct against the NGC index (chart is published v-prefixed); comment rewritten.
  • AddRepo — covered by the call path (install AddRepos before the history-check branch); details inline.
  • Table-driven test — done.

Replies on each thread. ADR-0006 (the Helm-vs-GitOps decision) is split out to #361 for its own discussion.

@viniciusdc viniciusdc requested a review from dcmcand June 5, 2026 12:28

@dcmcand dcmcand left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review (Go). Strong implementation overall - GPU family detection, the union-of-signals reconcile logic, Destroy/Longhorn symmetry, and OTel instrumentation are all well done and faithfully follow the LBC/Longhorn pattern. Inline comments below cover 2 blockers, 1 question, and a few stylistic notes. (Architecture/ADR findings are being handled separately.)

return err
}

release, err := client.Run(loadedChart, gpuOperatorHelmValues())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] client.Run discards ctx - use RunWithContext.

In Helm v3.19.4, Install.Run creates context.Background() internally and ignores the caller's context. A Ctrl-C or upstream timeout during the ~10-minute install window won't be propagated to Helm: the install keeps running in the background after the CLI has already returned an error or exited.

release, err := client.RunWithContext(ctx, loadedChart, gpuOperatorHelmValues())

(The LBC/autoscaler siblings have the same issue but aren't touched by this PR, so just fixing the new code is consistent with project norms.)

return err
}

release, err := client.Run(gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] Same ctx-dropping issue on the upgrade path. Upgrade.Run also builds its own context.Background() in this Helm version. Switch to client.RunWithContext(ctx, gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues()).

Note: Uninstall.Run (line 330) has no RunWithContext equivalent in v3.19.4, so leave that one as-is.

live, err := clusterHasGPUNodes(ctx, kubeconfigBytes)
if err != nil {
status.Send(ctx, status.NewUpdate(status.LevelWarning,
fmt.Sprintf("Skipping GPU node check — could not list nodes: %v", err)).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] Em dash in user-visible status message. This project bans em dashes (U+2014) in output. Replace with a colon or hyphen, e.g. "Skipping GPU node check: could not list nodes: %v".

kubeconfigBytes, kErr := p.GetKubeconfig(ctx, projectName, clusterConfig)
switch {
case kErr != nil:
status.Send(ctx, status.NewUpdate(status.LevelWarning, fmt.Sprintf("Skipping GPU Operator uninstall — kubeconfig unavailable: %v", kErr)).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker] Em dash in user-visible status message. Same as the gpu_operator.go message - replace with a colon or hyphen: "Skipping GPU Operator uninstall: kubeconfig unavailable: %v". (The two existing Longhorn messages just above have the same pre-existing issue, but the new message should be clean.)

return fmt.Errorf("failed to get kubeconfig for GPU Operator reconcile: %w", err)
}

if err := reconcileGPUOperator(ctx, kubeconfigBytes, awsCfg); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Unconditional reconcile on every deploy. This block runs for every deploy, including clusters with no GPU. For those, it fetches a kubeconfig (cached, cheap), makes a real clusterHasGPUNodes node-list API call, then a no-op uninstall. LBC and cluster-autoscaler are each guarded by a config flag instead.

Was the always-on reconcile intentional - i.e. a true desired-state loop that removes the operator when GPU nodes disappear? If the node-list call on non-GPU clusters is a concern, a if cfg.hasGPUNodeGroups() short-circuit before the live check would skip it, but that would miss undeclared GPU nodes. Just confirming intent so we know whether the live-node path is permanent or transitional.

return fmt.Errorf("failed to upgrade GPU Operator: %w", err)
}

span.SetAttributes(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Stylistic] Upgrade span is missing chart_version. installGPUOperator sets attribute.String("chart_version", gpuOperatorChartVersion) on its span (line 179) but the upgrade path doesn't. Upgrade is the hot path on subsequent deploys for a GPU cluster, so anyone debugging "why didn't the chart version change" won't see it in upgrade traces. Add the same attribute here for parity.


// loadGPUOperatorChart locates and loads the gpu-operator Helm chart.
func loadGPUOperatorChart(chartPathOptions action.ChartPathOptions) (*chart.Chart, error) {
chartPath, err := chartPathOptions.LocateChart(gpuOperatorChartName, cli.New())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Stylistic / tracking only] cli.New() reads HELM_* env vars (HELM_NAMESPACE, HELM_REPOSITORY_CACHE, etc.), so the effective Helm settings depend on the operator's shell environment in a way that isn't obvious from the signature. Consistent with the LBC/autoscaler siblings, so not a regression - just worth tracking as a source of subtle behavior differences when the #349 shared Helm infrastructure lands.

}
}

func TestHasGPUNodeGroups(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Stylistic] Add edge cases to TestHasGPUNodeGroups. Ranging over a nil/empty map is safe and returns false, but an explicit "empty node groups" case (Config{}) documents that nil-safe behavior as intentional and guards against a future refactor to a length check. A case with only a GPU: true group (no non-GPU group) would also round out coverage.

}
}

func TestGPUOperatorHelmValues(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Stylistic] TestGPUOperatorHelmValues reads as a flat sequential test; project convention is table-driven. The checked values (driver.enabled, toolkit.enabled, devicePlugin.enabled, MOFED_ENABLED) map cleanly to a table of {path, expected} rows, which also makes adding future coverage (gfd.enabled, dcgm.enabled) easier without restructuring.

dcmcand added a commit that referenced this pull request Jun 9, 2026
…361)

Documents that provider/cluster-conditional foundational software installs
imperatively via provider-driven Helm rather than as ArgoCD apps, while
unconditional software stays in GitOps. Refines the blanket "GitOps for
software" rule in ADR-0001 / AGENTS.md for this case, which is what makes
the GPU operator's Deploy-time install the intended pattern rather than a
deviation.

Records the target shared helmInstaller + per-provider ConditionalCharts
design; the implementation lands in #349, with #348 (GPU operator) and #352
(autoscaler) as the first instances and MetalLB flagged for migration.

Co-authored-by: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Install NVIDIA GPU Operator via foundational software flow

2 participants