docs(adr): ADR-0006 — conditional foundational software via provider-driven Helm#361

Merged
dcmcand merged 2 commits into main from docs/adr-0006-conditional-software-helm
Jun 9, 2026

Conversation

@viniciusdc

Contributor

Splitting this out of #348 so the decision gets its own discussion thread and footprint, per our convention that ADRs land on their own PRs.

What this records

We decided in an internal sync that provider/cluster-conditional foundational software is installed through provider-driven Helm in Deploy()/Destroy(), not as ArgoCD apps. Unconditional foundational software (cert-manager, Keycloak, Envoy Gateway, PostgreSQL, the Nebari operator) stays in GitOps. ADR-0006 writes that down and refines the blanket "GitOps for software" rule in ADR-0001 / AGENTS.md #5 for the conditional case.

The motivation: for conditional software, both the install/remove decision and the chart values depend on provider and live-cluster state that the provider already computes (GPU nodes present, existing-vs-cloud, region, VPC ID, node-group-derived toggles). Gating that through the GitOps render layer leaks provider knowledge into pkg/argocd and InfraSettings and doesn't scale. Several components also need their teardown ordered relative to tofu destroy (for example, Longhorn's CSI finalizers). The Helm logic is already hand-rolled and duplicated across cluster_autoscaler.go, aws_load_balancer_controller.go, and gpu_operator.go.

Target design (implementation tracked in #349)

A shared helmInstaller (kubeconfig + chart + values + lifecycle) plus a per-provider ConditionalCharts lister, with a reconcile loop that installs the listed set and uninstalls managed releases no longer listed. The GPU operator (#348) and the cluster-autoscaler (#352) are the first hand-rolled instances to fold in; MetalLB migrates off the writer-skip mechanism.

This PR is the decision + design only (one markdown file + the ADR index row). The interface implementation is #349. Discussion welcome here.

Documents that provider/cluster-conditional foundational software installs
imperatively via provider-driven Helm rather than as ArgoCD apps, while
unconditional software stays in GitOps. Refines the blanket "GitOps for
software" rule in ADR-0001 / AGENTS.md for this case, which is what makes
the GPU operator's Deploy-time install the intended pattern rather than a
deviation.

Records the target shared helmInstaller + per-provider ConditionalCharts
design; the implementation lands in #349, with #348 (GPU operator) and #352
(autoscaler) as the first instances and MetalLB flagged for migration.
viniciusdc added a commit that referenced this pull request Jun 5, 2026
…cile

Installs the NVIDIA GPU Operator on AWS clusters that have GPU nodes so
nvidia.com/gpu is advertised, without each software pack shipping its own
operator app. Follows the cluster-autoscaler / LBC imperative-Helm-in-Deploy
pattern rather than a GitOps ArgoCD app (decision recorded in ADR-0006, #361).

pkg/provider/aws/gpu_operator.go:
- installGPUOperator / upgradeGPUOperator / uninstallGPUOperator via Helm
  (history-check, then install or upgrade). Values are the AWS NVIDIA-AMI
  defaults: driver and toolkit off, device plugin on, MOFED off (EFA safety).
- reconcileGPUOperator: install when the cluster has or is configured to have
  GPU nodes, uninstall otherwise. gpu: true node groups are the primary signal;
  for clusters that don't declare them an advisory live-node check (by
  instance-type label) catches undeclared nodes and never fails the deploy.
- isGPUInstanceType: g/p families with a digit, excluding g4ad (AMD, not NVIDIA).
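
The instance-type rule and the install decision described above could look roughly like the sketch below. It assumes "g/p families with a digit" means the family name (the part before the dot) starts with `g` or `p` followed immediately by a digit; the `NodeGroup` type and `wantGPUOperator` function are hypothetical names, and the code in gpu_operator.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isGPUInstanceType reports whether an EC2 instance type looks like an
// NVIDIA GPU type: family starts with 'g' or 'p' followed by a digit,
// excluding g4ad (AMD GPUs). This reading of the rule is an assumption.
func isGPUInstanceType(instanceType string) bool {
	dot := strings.IndexByte(instanceType, '.')
	if dot < 2 {
		return false // no family/size separator, or family too short
	}
	family := instanceType[:dot]
	if family == "g4ad" {
		return false
	}
	if family[0] != 'g' && family[0] != 'p' {
		return false
	}
	return family[1] >= '0' && family[1] <= '9'
}

// NodeGroup is a hypothetical stand-in for a configured node group.
type NodeGroup struct {
	Name string
	GPU  bool
}

// wantGPUOperator returns true when any node group declares gpu: true
// (the primary signal) or when any live node's instance type is a GPU
// type. The live list is advisory: a caller that fails to list nodes
// passes nil, so the lookup can never fail the deploy.
func wantGPUOperator(groups []NodeGroup, liveInstanceTypes []string) bool {
	for _, g := range groups {
		if g.GPU {
			return true
		}
	}
	for _, t := range liveInstanceTypes {
		if isGPUInstanceType(t) {
			return true
		}
	}
	return false
}

func main() {
	for _, t := range []string{"g5.xlarge", "p4d.24xlarge", "g4ad.xlarge", "m5.large"} {
		fmt.Println(t, isGPUInstanceType(t))
	}
	// No declared GPU groups, but an undeclared GPU node is live.
	fmt.Println(wantGPUOperator([]NodeGroup{{Name: "general"}}, []string{"g5.xlarge"}))
}
```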

Deploy reconciles before the GitOps stage; Destroy uninstalls before tofu
destroy, best-effort like Longhorn. Chart pinned to v26.3.2 (NGC lists it
v-prefixed). No IAM needed; the operator only touches in-cluster resources.
dcmcand previously approved these changes Jun 9, 2026

@dcmcand dcmcand left a comment

Contributor

Looks Good

…onal-software-helm

# Conflicts:
#	docs/adr/README.md
@dcmcand dcmcand merged commit 0a9bd1c into main Jun 9, 2026
2 checks passed
@dcmcand dcmcand deleted the docs/adr-0006-conditional-software-helm branch June 9, 2026 12:48