feat(gpu): install NVIDIA GPU Operator via foundational software flow by viniciusdc · Pull Request #348 · nebari-dev/nebari-infrastructure-core

viniciusdc · 2026-06-02T17:00:13Z

closes #232

Installs the NVIDIA GPU Operator so GPU node groups actually get nvidia.com/gpu advertised, instead of every software pack having to ship its own operator app.

This started as an ArgoCD app and I moved it to an imperative Helm install that the provider reconciles, the same way the cluster-autoscaler (#352) and the AWS load balancer controller already work. The operator wants ordered, idempotent lifecycle we drive ourselves (install when GPU nodes exist, remove when they don't, clean up on destroy), and that fits Helm-in-Deploy a lot better than a manifest the GitOps layer renders. It also meant I could drop the shared config struct entirely: gating is now just whether the cluster has a GPU signal, not a config block.

What this does

pkg/provider/aws/gpu_operator.go: installGPUOperator / upgradeGPUOperator / uninstallGPUOperator over Helm (history-check, then install or upgrade). Values are the AWS defaults: driver and toolkit off (the AL2023 NVIDIA AMI already ships them), device plugin on, and MOFED_ENABLED=false so it doesn't grab the uverbs devices and fight the EFA plugin.
reconcileGPUOperator: installs when the cluster has (or is configured to have) GPU nodes, uninstalls otherwise. "Has GPU nodes" is the union of node groups tagged gpu: true (which already selects the NVIDIA AMI) and live nodes on a GPU instance type. The config side covers a GPU group scaled to zero; the live check covers nodes the config didn't declare. I read the instance-type label on purpose rather than nvidia.com/gpu, since that resource only shows up once the operator is already running.
isGPUInstanceType: g/p families with a digit, so it catches gr6 but not graviton.
Wiring: Deploy reconciles before the GitOps/argo stage so nvidia.com/gpu is there by the time GPU workloads sync; Destroy uninstalls before tofu destroy, best-effort like Longhorn. The operator has no cloud resources of its own so it can't actually block teardown, but removing it first keeps things symmetric.

No config block and no shared struct anymore: removed pkg/provider/gpu.go, the InfraSettings.GPUOperator field, both providers' gpu_operator YAML + ResolveGPUOperator, the ArgoCD template, and the writer skip/prune plumbing.

Notes

Design of record: ADR-0006, proposed in docs(adr): ADR-0006 — conditional foundational software via provider-driven Helm #361. It records why provider/cluster-conditional foundational software installs via provider-driven Helm-in-Deploy rather than as an ArgoCD app, and refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for that case. The shared helmInstaller + per-provider ConditionalCharts interface the ADR describes lands in Define a consistent pattern for conditional foundational apps (CFAs) #349; this PR (and Support autoscaling in AWS provider #352) are the first instances.
Chart pinned to v26.3.2 (a const in gpu_operator.go), the latest published gpu-operator chart in the nvidia repo. No IAM/Pod Identity needed here, the operator only touches in-cluster resources.
AWS only for now. I pulled the Azure wiring back out since it leaned on the shared struct. It comes back when we make the operator multi-cloud, and IMO that belongs in the broader "move the conditional foundational apps onto a Helm + per-provider interface" refactor (Define a consistent pattern for conditional foundational apps (CFAs) #349) rather than living per-provider forever.
Tests cover isGPUInstanceType, hasGPUNodeGroups, and the Helm values. The install/reconcile paths need a live cluster + Helm, so I didn't unit-test those here.

Device-management docs behind the AWS defaults: EKS.

Documents that provider/cluster-conditional foundational software installs imperatively via provider-driven Helm rather than as ArgoCD apps, while unconditional software stays in GitOps. Refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for this case, which is what makes the GPU operator's Deploy-time install the intended pattern rather than a deviation. Records the target shared helmInstaller + per-provider ConditionalCharts design; the implementation lands in #349, with #348 (GPU operator) and #352 (autoscaler) as the first instances and MetalLB flagged for migration.

dcmcand

Thanks for this - the architecture is clean and consistent with the autoscaler/LBC Helm-in-Deploy pattern, the OTel and status.Send conventions all hold, and the helm-values test guarding driver/toolkit/MOFED against a future chart bump is a nice touch. Requesting changes for two correctness issues plus the merge state.

Two blockers (inline):

isGPUInstanceType matches g4ad, which is an AMD GPU, not NVIDIA - the operator would install and advertise nothing.
reconcileGPUOperator fails the whole deploy on a transient live-node-check error, on the CPU-only path that runs on every deploy. Destroy already handles this best-effort; Deploy should too.

Also, the PR is currently not mergeable: it conflicts with main in provider.go and config.go because #352 (autoscaling) merged and restructured the same Deploy/Destroy regions. A rebase onto main is needed. The bonus is that reconcileGPUOperator can then mirror the merged installClusterAutoscaler call site, including a gate so CPU-only clusters skip the extra API round-trips (see the inline question).

A few non-blocking questions/nits are inline (chart-version v prefix vs NGC, upgrade-path AddRepo, table-driven test). I verified the gpu_operator config-block removal is non-breaking (main never carried one) and that isGPUInstanceType correctly excludes graviton / inf2 / trn1 (just not g4ad).

Design-of-record ADR-0006 (#361) and the shared interface in #349 noted, no concerns there.

dcmcand · 2026-06-05T11:30:21Z

+	if family == "" || (family[0] != 'g' && family[0] != 'p') {
+		return false
+	}
+	return strings.ContainsAny(family, "0123456789")


Blocker: this returns true for g4ad.*, but g4ad instances carry an AMD Radeon Pro V520, not an NVIDIA GPU. Installing the NVIDIA operator there deploys a device plugin that finds nothing and never advertises nvidia.com/gpu - a cluster that looks set up but silently isn't. The g+digit rule can't distinguish g4ad from the NVIDIA g4dn. Please exclude g4ad explicitly, and add it to the negative list in TestIsGPUInstanceType (which currently omits it, so the gap isn't caught).

Good catch. Fixed — isGPUInstanceType now excludes g4ad explicitly (it is the only AMD GPU family on EC2), and g4ad.xlarge is in the negative list of the now table-driven TestIsGPUInstanceType.

dcmcand · 2026-06-05T11:30:21Z

+		live, err := clusterHasGPUNodes(ctx, kubeconfigBytes)
+		if err != nil {
+			span.RecordError(err)
+			return fmt.Errorf("failed to inspect cluster for GPU nodes: %w", err)


Blocker: this propagates a live-node-check error as a hard Deploy failure, and it sits on the CPU-only path that runs on every deploy. A transient Kubernetes API error here aborts an otherwise-successful deploy after tf.Apply, whereas Destroy treats the same situation as best-effort.

Suggest making the live check advisory: on error, warn via status.Send and fall through with want = false. uninstallGPUOperator is already a safe no-op when the release is absent, so a flaky list call can't fail the deploy for a component that has no cloud resources.

Fixed — the live-node check is now advisory. On a list error it status.Sends a warning and falls through with want=false instead of aborting the deploy; install/uninstall errors still surface as real failures. Config gpu: true short-circuits the check entirely, so clusters that declare GPU never pay the round-trip, and CPU-only clusters do a single advisory node-list that can no longer fail the deploy. If you would rather drop it to a pure config gate (zero round-trips on CPU-only, at the cost of not catching undeclared GPU nodes), say the word.

dcmcand · 2026-06-05T11:30:21Z

+	// gpuOperatorChartVersion pins the upstream NVIDIA gpu-operator Helm chart.
+	// v26.3.2 is the latest published version in the nvidia repo (chart version
+	// tracks appVersion). Bump as new releases land.
+	gpuOperatorChartVersion   = "v26.3.2"


Two small things:

The v prefix (v26.3.2) differs from the LBC's defaultLBCChartVersion = "3.2.1". Helm chart versions are usually bare semver (the v is a git-tag convention). Can you confirm Helm's LocateChart against the NGC registry accepts the v? If NGC expects 26.3.2, this fails with an opaque "chart version not found." Matching the LBC format would be safer.

The comment says the chart version "tracks appVersion" - that isn't generally true for gpu-operator across releases. Pointing at the NGC catalog/releases page would be more durable bump guidance.

Confirmed against the NGC index: gpu-operator publishes its chart as version: "v26.3.2", with the v — so the prefix is required for LocateChart here, unlike the bare-semver LBC chart. Kept v26.3.2 and rewrote the comment to say exactly that and point at the NGC catalog for bump guidance, dropping the "tracks appVersion" line (you are right that it does not generally hold).

dcmcand · 2026-06-05T11:30:21Z

+}
+
+// upgradeGPUOperator upgrades an existing gpu-operator release.
+func upgradeGPUOperator(ctx context.Context, actionConfig *action.Configuration) error {


Question (also applies to upgradeAWSLoadBalancerController, so not a regression): the upgrade path doesn't call helm.AddRepo before LocateChart. Within one deploy it's fine since install runs AddRepo first, but on a fresh machine where the operator already exists on-cluster, the upgrade path runs without the NGC repo in the local Helm cache and LocateChart can fail. Worth confirming a mitigation or adding AddRepo here.

In our call path it is covered: reconcileGPUOperator → installGPUOperator runs helm.AddRepo before the history-check that branches to upgrade, so the repo is in the local cache by the time upgradeGPUOperator → LocateChart runs. upgradeGPUOperator is never called on its own. Left as-is to mirror the LBC structure you flagged; happy to hoist AddRepo above the branch if you would prefer it explicit in both.

dcmcand · 2026-06-05T11:30:21Z

+
+import "testing"
+
+func TestIsGPUInstanceType(t *testing.T) {


Stylistic (repo convention is table-driven tests, per CLAUDE.md): converting this to a []struct{name, input, want} table with t.Run would match TestHasGPUNodeGroups and make it easy to add the g4ad negative case from the blocker above.

Done — converted to a name/input/want table with t.Run, matching TestHasGPUNodeGroups, and added the g4ad negative case plus trn1/graviton/empty.

…cile Installs the NVIDIA GPU Operator on AWS clusters that have GPU nodes so nvidia.com/gpu is advertised, without each software pack shipping its own operator app. Follows the cluster-autoscaler / LBC imperative-Helm-in-Deploy pattern rather than a GitOps ArgoCD app (decision recorded in ADR-0006, #361). pkg/provider/aws/gpu_operator.go: - installGPUOperator / upgradeGPUOperator / uninstallGPUOperator via Helm (history-check, then install or upgrade). Values are the AWS NVIDIA-AMI defaults: driver and toolkit off, device plugin on, MOFED off (EFA safety). - reconcileGPUOperator: install when the cluster has or is configured to have GPU nodes, uninstall otherwise. gpu: true node groups are the primary signal; for clusters that don't declare them an advisory live-node check (by instance-type label) catches undeclared nodes and never fails the deploy. - isGPUInstanceType: g/p families with a digit, excluding g4ad (AMD, not NVIDIA). Deploy reconciles before the GitOps stage; Destroy uninstalls before tofu destroy, best-effort like Longhorn. Chart pinned to v26.3.2 (NGC lists it v-prefixed). No IAM needed; the operator only touches in-cluster resources.

viniciusdc · 2026-06-05T12:28:30Z

Thanks for the thorough review @dcmcand. Pushed the changes (rebased onto main, single commit now):

g4ad blocker — isGPUInstanceType excludes g4ad; added to the table-driven test negatives.
Live-node-check blocker — the check is advisory now (warn + want=false on error, never aborts the deploy); install/uninstall errors still surface. gpu: true short-circuits it so GPU clusters skip the round-trip.
Rebase — onto main past Support autoscaling in AWS provider #352; reconcileGPUOperator sits right after the autoscaler install in Deploy, uninstall before tofu destroy in Destroy.
Chart v prefix — confirmed correct against the NGC index (chart is published v-prefixed); comment rewritten.
AddRepo — covered by the call path (install AddRepos before the history-check branch); details inline.
Table-driven test — done.

Replies on each thread. ADR-0006 (the Helm-vs-GitOps decision) is split out to #361 for its own discussion.

dcmcand

Code review (Go). Strong implementation overall - GPU family detection, the union-of-signals reconcile logic, Destroy/Longhorn symmetry, and OTel instrumentation are all well done and faithfully follow the LBC/Longhorn pattern. Inline comments below cover 2 blockers, 1 question, and a few stylistic notes. (Architecture/ADR findings are being handled separately.)

dcmcand · 2026-06-09T12:10:07Z

+		return err
+	}
+
+	release, err := client.Run(loadedChart, gpuOperatorHelmValues())


[Blocker] client.Run discards ctx - use RunWithContext.

In Helm v3.19.4, Install.Run creates context.Background() internally and ignores the caller's context. A Ctrl-C or upstream timeout during the ~10-minute install window won't be propagated to Helm: the install keeps running in the background after the CLI has already returned an error or exited.

release, err := client.RunWithContext(ctx, loadedChart, gpuOperatorHelmValues())

(The LBC/autoscaler siblings have the same issue but aren't touched by this PR, so just fixing the new code is consistent with project norms.)

dcmcand · 2026-06-09T12:10:07Z

+		return err
+	}
+
+	release, err := client.Run(gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues())


[Blocker] Same ctx-dropping issue on the upgrade path. Upgrade.Run also builds its own context.Background() in this Helm version. Switch to client.RunWithContext(ctx, gpuOperatorReleaseName, loadedChart, gpuOperatorHelmValues()).

Note: Uninstall.Run (line 330) has no RunWithContext equivalent in v3.19.4, so leave that one as-is.

dcmcand · 2026-06-09T12:10:07Z

+		live, err := clusterHasGPUNodes(ctx, kubeconfigBytes)
+		if err != nil {
+			status.Send(ctx, status.NewUpdate(status.LevelWarning,
+				fmt.Sprintf("Skipping GPU node check — could not list nodes: %v", err)).


[Blocker] Em dash in user-visible status message. This project bans em dashes (U+2014) in output. Replace with a colon or hyphen, e.g. "Skipping GPU node check: could not list nodes: %v".

dcmcand · 2026-06-09T12:10:07Z

+		kubeconfigBytes, kErr := p.GetKubeconfig(ctx, projectName, clusterConfig)
+		switch {
+		case kErr != nil:
+			status.Send(ctx, status.NewUpdate(status.LevelWarning, fmt.Sprintf("Skipping GPU Operator uninstall — kubeconfig unavailable: %v", kErr)).


[Blocker] Em dash in user-visible status message. Same as the gpu_operator.go message - replace — with a colon or hyphen: "Skipping GPU Operator uninstall: kubeconfig unavailable: %v". (The two existing Longhorn messages just above have the same pre-existing issue, but the new message should be clean.)

dcmcand · 2026-06-09T12:10:07Z

+			return fmt.Errorf("failed to get kubeconfig for GPU Operator reconcile: %w", err)
+		}
+
+		if err := reconcileGPUOperator(ctx, kubeconfigBytes, awsCfg); err != nil {


[Question] Unconditional reconcile on every deploy. This block runs for every deploy, including clusters with no GPU. For those, it fetches a kubeconfig (cached, cheap), makes a real clusterHasGPUNodes node-list API call, then a no-op uninstall. LBC and cluster-autoscaler are each guarded by a config flag instead.

Was the always-on reconcile intentional - i.e. a true desired-state loop that removes the operator when GPU nodes disappear? If the node-list call on non-GPU clusters is a concern, a if cfg.hasGPUNodeGroups() short-circuit before the live check would skip it, but that would miss undeclared GPU nodes. Just confirming intent so we know whether the live-node path is permanent or transitional.

dcmcand · 2026-06-09T12:10:07Z

+		return fmt.Errorf("failed to upgrade GPU Operator: %w", err)
+	}
+
+	span.SetAttributes(


[Stylistic] Upgrade span is missing chart_version. installGPUOperator sets attribute.String("chart_version", gpuOperatorChartVersion) on its span (line 179) but the upgrade path doesn't. Upgrade is the hot path on subsequent deploys for a GPU cluster, so anyone debugging "why didn't the chart version change" won't see it in upgrade traces. Add the same attribute here for parity.

dcmcand · 2026-06-09T12:10:08Z

+
+// loadGPUOperatorChart locates and loads the gpu-operator Helm chart.
+func loadGPUOperatorChart(chartPathOptions action.ChartPathOptions) (*chart.Chart, error) {
+	chartPath, err := chartPathOptions.LocateChart(gpuOperatorChartName, cli.New())


[Stylistic / tracking only] cli.New() reads HELM_* env vars (HELM_NAMESPACE, HELM_REPOSITORY_CACHE, etc.), so the effective Helm settings depend on the operator's shell environment in a way that isn't obvious from the signature. Consistent with the LBC/autoscaler siblings, so not a regression - just worth tracking as a source of subtle behavior differences when the #349 shared Helm infrastructure lands.

dcmcand · 2026-06-09T12:10:08Z

+	}
+}
+
+func TestHasGPUNodeGroups(t *testing.T) {


[Stylistic] Add edge cases to TestHasGPUNodeGroups. Ranging over a nil/empty map is safe and returns false, but an explicit "empty node groups" case (Config{}) documents that nil-safe behavior as intentional and guards against a future refactor to a length check. A case with only a GPU: true group (no non-GPU group) would also round out coverage.

dcmcand · 2026-06-09T12:10:08Z

+	}
+}
+
+func TestGPUOperatorHelmValues(t *testing.T) {


[Stylistic] TestGPUOperatorHelmValues reads as a flat sequential test; project convention is table-driven. The checked values (driver.enabled, toolkit.enabled, devicePlugin.enabled, MOFED_ENABLED) map cleanly to a table of {path, expected} rows, which also makes adding future coverage (gfd.enabled, dcgm.enabled) easier without restructuring.

…361) Documents that provider/cluster-conditional foundational software installs imperatively via provider-driven Helm rather than as ArgoCD apps, while unconditional software stays in GitOps. Refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for this case, which is what makes the GPU operator's Deploy-time install the intended pattern rather than a deviation. Records the target shared helmInstaller + per-provider ConditionalCharts design; the implementation lands in #349, with #348 (GPU operator) and #352 (autoscaler) as the first instances and MetalLB flagged for migration. Co-authored-by: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>

viniciusdc mentioned this pull request Jun 2, 2026

Define a consistent pattern for conditional foundational apps (CFAs) #349

Open

viniciusdc marked this pull request as ready for review June 4, 2026 13:42

viniciusdc requested review from Adam-D-Lewis, aktech, dcmcand, dharhas, marcelovilla and tylerpotts as code owners June 4, 2026 13:42

viniciusdc force-pushed the feat/gpu-operator-foundational branch from d646045 to 370b96f Compare June 5, 2026 01:23

viniciusdc mentioned this pull request Jun 5, 2026

docs(adr): ADR-0006 — conditional foundational software via provider-driven Helm #361

Merged

dcmcand requested changes Jun 5, 2026

View reviewed changes

viniciusdc force-pushed the feat/gpu-operator-foundational branch from 370b96f to ae92bd2 Compare June 5, 2026 12:26

viniciusdc requested a review from dcmcand June 5, 2026 12:28

dcmcand reviewed Jun 9, 2026

View reviewed changes

This was referenced Jun 19, 2026

Longhorn dedicated_nodes: reconcile allowScheduling on newly-joined nodes via a controller #371

Open

Node groups follow the latest EKS AMI, so any nic deploy rolls all of them #410

Open

Uh oh!

Conversation

viniciusdc commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

Notes

Uh oh!

dcmcand left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viniciusdc commented Jun 5, 2026

Uh oh!

dcmcand left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

viniciusdc commented Jun 2, 2026 •

edited

Loading