docs(adr): ADR-0006 — conditional foundational software via provider-driven Helm #361
Merged
Documents that provider/cluster-conditional foundational software installs imperatively via provider-driven Helm rather than as ArgoCD apps, while unconditional software stays in GitOps. Refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md for this case, which is what makes the GPU operator's Deploy-time install the intended pattern rather than a deviation. Records the target shared helmInstaller + per-provider ConditionalCharts design; the implementation lands in #349, with #348 (GPU operator) and #352 (autoscaler) as the first instances and MetalLB flagged for migration.
This was referenced Jun 5, 2026
viniciusdc added a commit that referenced this pull request Jun 5, 2026
…cile

Installs the NVIDIA GPU Operator on AWS clusters that have GPU nodes so nvidia.com/gpu is advertised, without each software pack shipping its own operator app. Follows the cluster-autoscaler / LBC imperative-Helm-in-Deploy pattern rather than a GitOps ArgoCD app (decision recorded in ADR-0006, #361).

pkg/provider/aws/gpu_operator.go:
- installGPUOperator / upgradeGPUOperator / uninstallGPUOperator via Helm (history-check, then install or upgrade). Values are the AWS NVIDIA-AMI defaults: driver and toolkit off, device plugin on, MOFED off (EFA safety).
- reconcileGPUOperator: install when the cluster has or is configured to have GPU nodes, uninstall otherwise. gpu: true node groups are the primary signal; for clusters that don't declare them an advisory live-node check (by instance-type label) catches undeclared nodes and never fails the deploy.
- isGPUInstanceType: g/p families with a digit, excluding g4ad (AMD, not NVIDIA).

Deploy reconciles before the GitOps stage; Destroy uninstalls before tofu destroy, best-effort like Longhorn. Chart pinned to v26.3.2 (NGC lists it v-prefixed). No IAM needed; the operator only touches in-cluster resources.
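The instance-type rule in the commit message (g/p families with a digit, excluding g4ad) can be illustrated roughly as follows. This is a minimal sketch written from that one-line description, not the code in pkg/provider/aws/gpu_operator.go; the parsing approach and the sample instance types in main are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isGPUInstanceType is a hypothetical reconstruction of the rule described in
// the commit message, not the actual implementation. It reports whether an EC2
// instance type such as "g5.xlarge" belongs to a g or p family whose second
// character is a digit, excluding g4ad (AMD GPUs, not NVIDIA).
func isGPUInstanceType(instanceType string) bool {
	// The family is the part before the first dot, e.g. "p4d" in "p4d.24xlarge".
	family, _, found := strings.Cut(strings.ToLower(instanceType), ".")
	if !found || len(family) < 2 {
		return false
	}
	if family == "g4ad" {
		return false
	}
	if family[0] != 'g' && family[0] != 'p' {
		return false
	}
	return unicode.IsDigit(rune(family[1]))
}

func main() {
	for _, t := range []string{"g5.xlarge", "p4d.24xlarge", "g4ad.xlarge", "m5.large"} {
		fmt.Printf("%s -> %v\n", t, isGPUInstanceType(t))
	}
}
```

Under these assumptions g5.xlarge and p4d.24xlarge are treated as GPU types, while g4ad.xlarge and m5.large are not.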
…onal-software-helm # Conflicts: # docs/adr/README.md
dcmcand approved these changes Jun 9, 2026
Splitting this out of #348 so the decision gets its own discussion thread and footprint, per our convention that ADRs land on their own PRs.
What this records
We decided in an internal sync that provider/cluster-conditional foundational software installs via provider-driven Helm in Deploy()/Destroy(), not as ArgoCD apps. Unconditional foundational software (cert-manager, Keycloak, Envoy Gateway, PostgreSQL, the Nebari operator) stays in GitOps. ADR-0006 writes that down and refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md #5 for the conditional case.

The motivation: the install/remove decision and the values for conditional software depend on provider + live-cluster state the provider already computes (GPU nodes present, existing-vs-cloud, region, VPC ID, node-group-derived toggles). Gating that through the GitOps render layer leaks provider knowledge into pkg/argocd and InfraSettings and doesn't scale; several components also need ordered teardown relative to tofu destroy (Longhorn's CSI finalizers). The Helm logic is already hand-rolled and duplicated across cluster_autoscaler.go, aws_load_balancer_controller.go, and gpu_operator.go.

Target design (implementation tracked in #349)

A shared helmInstaller (kubeconfig + chart + values + lifecycle) plus a per-provider ConditionalCharts lister, with a reconcile loop that installs the listed set and uninstalls managed releases no longer listed. The GPU operator (#348) and the cluster-autoscaler (#352) are the first hand-rolled instances to fold in; MetalLB migrates off the writer-skip mechanism.

This PR is the decision + design only (one markdown file + the ADR index row). The interface implementation is #349. Discussion welcome here.
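To make the target design easier to discuss, here is a rough sketch of the reconcile loop it describes: install everything the provider lists, then uninstall managed releases that are no longer listed. Only the names helmInstaller and ConditionalCharts come from this PR; the Chart type, the method set, the in-memory fake, and the release names are assumptions for illustration, and the real interface in #349 may look quite different.

```go
package main

import (
	"fmt"
	"sort"
)

// Chart is a hypothetical description of one conditional release.
type Chart struct {
	Release string
	Version string
}

// helmInstaller is a hypothetical shape for the shared installer; the method
// names are assumptions, not the interface from #349.
type helmInstaller interface {
	ManagedReleases() ([]string, error)
	InstallOrUpgrade(c Chart) error
	Uninstall(release string) error
}

// reconcile installs or upgrades every chart in desired (what a per-provider
// ConditionalCharts lister would return), then uninstalls any managed release
// that is no longer in that set.
func reconcile(h helmInstaller, desired []Chart) error {
	want := make(map[string]bool, len(desired))
	for _, c := range desired {
		want[c.Release] = true
		if err := h.InstallOrUpgrade(c); err != nil {
			return fmt.Errorf("install %s: %w", c.Release, err)
		}
	}
	managed, err := h.ManagedReleases()
	if err != nil {
		return fmt.Errorf("list managed releases: %w", err)
	}
	for _, r := range managed {
		if want[r] {
			continue
		}
		if err := h.Uninstall(r); err != nil {
			return fmt.Errorf("uninstall %s: %w", r, err)
		}
	}
	return nil
}

// fakeInstaller is an in-memory stand-in so the sketch runs without a cluster.
type fakeInstaller struct{ releases map[string]string }

func (f *fakeInstaller) ManagedReleases() ([]string, error) {
	names := make([]string, 0, len(f.releases))
	for n := range f.releases {
		names = append(names, n)
	}
	sort.Strings(names)
	return names, nil
}

func (f *fakeInstaller) InstallOrUpgrade(c Chart) error {
	f.releases[c.Release] = c.Version
	return nil
}

func (f *fakeInstaller) Uninstall(release string) error {
	delete(f.releases, release)
	return nil
}

func main() {
	// A previously installed release that the provider no longer lists.
	f := &fakeInstaller{releases: map[string]string{"example-old-release": "0.0.1"}}
	desired := []Chart{{Release: "gpu-operator", Version: "v26.3.2"}}
	if err := reconcile(f, desired); err != nil {
		panic(err)
	}
	names, _ := f.ManagedReleases()
	fmt.Println(names) // prints [gpu-operator]
}
```

In this sketch the uninstall pass only considers releases the installer reports as managed, which is one way to keep the loop from touching Helm releases it did not create.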