From f5ddeff67dd4fc7c6c44e7b2637b5695e172da20 Mon Sep 17 00:00:00 2001
From: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>
Date: Tue, 12 May 2026 20:01:03 +0200
Subject: [PATCH 1/3] docs(design-doc): audit and rewrite design docs against current code

Most files under docs/design-doc/ had drifted substantially from the
codebase: invented CLI commands, wrong package layouts, fictional code
samples, wrong YAML config schemas, and a foundational software stack
(LGTM) presented as deployed when only the OpenTelemetry Collector ships
today.

This commit rewrites the heavily-drifted docs from scratch against
current code (verified against pkg/, cmd/nic/, examples/, Makefile,
.github/workflows/) and applies surgical fixes elsewhere.

Highlights:

- Acknowledge per-provider tool choice: AWS uses OpenTofu, Hetzner uses
  the hetzner-k3s binary, local uses Kind, existing is a no-op. The
  Provider interface is the contract.
- Cross-reference ADR-0004 (out-of-tree provider plugins) where relevant.
- Fix the config schema reference to match the real cluster.: / dns.:
  discriminator pattern. Remove fictional top-level provider:, version:,
  kubernetes:, tls:, foundational_software:, images:, features: blocks.
- Document Hetzner and existing providers (previously missing).
- Mark GCP/Azure providers as stubs (not deployable today).
- Replace fictional CLI commands (nic plan / status / state / unlock /
  init / stack / marketplace / health) with the real surface (deploy,
  destroy, validate, kubeconfig, version).
- Replace DynamoDB-locked S3 backend with the real native lockfile
  configuration (use_lockfile = true).
- Reframe nebari-operator as out-of-tree (lives at
  github.com/nebari-dev/nebari-operator); NIC just deploys it. Correct
  CRD name throughout (NebariApp, not NebariApplication / NicApp).
- Realign the testing strategy and milestones with what CI actually runs
  and what is actually shipped vs planned.

Closes #300
---
 docs/design-doc/appendix/14-open-questions.md |   96 +-
 .../appendix/15-future-enhancements.md        |    7 +
 .../appendix/16-configuration-reference.md    | 1555 +++--------------
 docs/design-doc/appendix/17-appendix.md       |   96 +-
 .../architecture/01-introduction.md           |  132 +-
 .../architecture/02-system-overview.md        |  212 ++-
 .../architecture/03-goals-and-non-goals.md    |   70 +-
 .../architecture/04-key-decisions.md          |  287 +--
 .../architecture/05-state-management.md       |  357 +---
 .../06-opentofu-module-architecture.md        |  727 +-------
 .../implementation/07-configuration-design.md |  289 ++-
 .../08-terraform-exec-integration.md          |  674 +------
 .../09-dns-provider-architecture.md           |   34 +-
 .../10-foundational-software.md               |  434 +----
 .../implementation/11-nebari-operator.md      |  514 +-----
 docs/design-doc/nic-summary.md                |    6 +-
 .../operations/12-testing-strategy.md         |  648 +------
 docs/design-doc/operations/13-milestones.md   |  227 ++-
 .../operations/longhorn-node-maintenance.md   |    2 +-
 19 files changed, 1415 insertions(+), 4952 deletions(-)

diff --git a/docs/design-doc/appendix/14-open-questions.md b/docs/design-doc/appendix/14-open-questions.md
index 423de068..db2572f8 100644
--- a/docs/design-doc/appendix/14-open-questions.md
+++ b/docs/design-doc/appendix/14-open-questions.md
@@ -1,85 +1,45 @@
 # Open Questions
-### 13.1 Technical Questions
+Numbering note: this file is the chapter immediately following the operations section. Anchor links in older docs may reference these by "13.x" - that's stale; the file's section numbers below are the canonical ones.
-1. **Resolved:** Using OpenTofu with terraform-exec orchestration and standard Terraform state management
-2. **Multi-Cluster:** How to manage multiple clusters in one state file? (Options: separate states, or cluster array in state)
-3. **Custom Kubernetes Distributions:** Support for k0s, k3d, RKE2? (v1: No, v2: Maybe)
-4. **Helm Chart Storage:** Where to store foundational software Helm charts? (OCI registry? Git?)
-5. **Operator HA:** Should operator run in HA mode (multiple replicas)? (Recommendation: Yes, with leader election)
+## 14.1 Technical Questions
-### 13.2 Configuration Questions
+1. **Resolved**: OpenTofu via `pkg/tofu` (`terraform-exec` wrapper) is used by the AWS provider. Other providers do not use tofu (Hetzner uses `hetzner-k3s` directly; local uses Kind; existing is a no-op). See [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md) for the proposed out-of-tree plugin direction that formalizes this.
+2. **Multi-Cluster**: How to manage multiple clusters from one NIC invocation? Today: one cluster per `nic deploy` invocation. Still open.
+3. **Custom Kubernetes Distributions**: Support for k0s, k3d, RKE2? Today: Kind for local, k3s for Hetzner, EKS for AWS. RKE2/k0s remain open.
+4. **Helm Chart Storage**: Foundational charts live in `pkg/argocd/templates/apps/` as ArgoCD `Application` manifests that reference upstream Helm repositories. OCI mirroring for offline installs is still open.
+5. **Operator HA**: Should the Nebari Operator run HA with leader election? Owned upstream at [`nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator).
-6. **Config Validation:** Schema validation via JSON Schema or custom Go validation? (Recommendation: Custom Go + JSON Schema for IDE support)
-7. **Config Inheritance:** Support for base + overlay configs? (Recommendation: No for MVP, Yes in future versions via `extends` field)
-8. **Secrets Management:** How to handle secrets in config (Keycloak admin password, etc.)? (Options: external secrets operator, sealed secrets, cloud secrets manager)
+## 14.2 Configuration Questions
-### 13.3 Deployment Questions
+6. **Config Validation**: Today: custom Go validation in `pkg/config/config.go` (`NebariConfig.Validate`). JSON Schema export for IDE support remains open.
+7. **Config Inheritance** (`extends`): Not implemented. See [`15-future-enhancements.md`](15-future-enhancements.md).
+8. **Secrets Management**: **Resolved for MVP**: env vars via `.env` (loaded by `godotenv` in `cmd/nic/main.go`). Git auth uses env-var indirection (`ssh_key_env` / `token_env`). External Secrets Operator / Sealed Secrets / cloud secrets managers remain open as longer-term options.
-9. **Rollback Strategy:** Should `nic rollback` be a command? (Recommendation: Yes, Phase 2)
-10. **Blue/Green Deployments:** Support for blue/green cluster deployments? (Recommendation: Future)
-11. **Canary Deployments:** For foundational software updates? (Recommendation: Future)
+## 14.3 Deployment Questions
-### 13.4 Integration Questions
+9. **Rollback Strategy**: Should `nic rollback` exist? Still open. Today: re-apply a previous config.
+10. **Blue/Green Cluster Deployments**: Future.
+11. **Canary Deployments for foundational software updates**: Future (depends on ArgoCD's own progressive sync features).
-12. **CI/CD Integration:** Should NIC provide GitHub Actions / GitLab CI templates? (Recommendation: Yes, Phase 2)
-13. **Monitoring Integration:** Should NIC phone home telemetry (opt-in)? (Recommendation: Phase 2, opt-in only)
-14. **Marketplace Integration:** Package as AWS Marketplace / GCP Marketplace offering? (Recommendation: Future)
+## 14.4 Integration Questions
-### 13.5 Platform Automation Questions
+12. **CI/CD Templates**: Should NIC ship GitHub Actions / GitLab CI templates? Still open; the `git_repository:` consumption side is shipped, but template generation is not.
+13. **Phone-Home Telemetry**: Should NIC emit opt-in usage telemetry? Still open.
+14. **Marketplace Integration**: AWS/GCP Marketplace listings? Future.
-15. **Git Repository Provisioning:** Should NIC automatically provision Git repositories and setup CI/CD workflows for infrastructure changes?
+## 14.5 Platform Automation Questions
-    - **Use Case:** `nic init` creates GitHub repo, adds config.yaml, sets up GitHub Actions/GitLab CI for automated infrastructure updates
-    - **Providers:** GitHub, GitLab, Gitea (self-hosted)
-    - **Features:** Branch protection, PR-based workflow, automated validation, auto-apply on merge
-    - **Recommendation:** Phase 2, start with GitHub integration
+15. **Git Repository Provisioning**: NIC **consumes** an existing GitOps repo today (`pkg/git`, `git_repository:` config). The **provisioning** side (auto-create the repo on GitHub/GitLab/Gitea, configure protections, etc.) is still open. See `15-future-enhancements.md` §2.
-16. **CI/CD Workflow Generation:** Should NIC auto-generate and manage CI/CD pipelines for infrastructure automation?
-    - **Workflows:**
-      - PR validation: `nic validate` + `nic plan` on every PR
-      - Auto-deploy: `nic deploy` on merge to main (with approval gates)
-      - Scheduled drift detection: Daily `nic status` to detect manual changes
-      - Automated testing: Integration tests before deployment
-    - **Customization:** Template-based with user overrides
-    - **Recommendation:** Phase 2, essential for GitOps workflow
+16. **CI/CD Workflow Generation**: Auto-generate validation/deploy/drift workflows. Still open.
-### 13.6 Application Stack Questions
+## 14.6 Application Stack Questions
-17. **Software Stack Specification:** Should NIC support declarative specifications for complete software stacks (databases, message queues, caching, etc.) deployable on top of foundational software?
+17. **Software Stack Specification**: Declarative specs for full platform stacks (databases, queues, apps). Still open. Today: user packs install themselves via ArgoCD using `NebariApp` CRs from the upstream operator.
+18. **Full Stack in One Repo**: Still open. The GitOps repo layout is owned by NIC for the foundational set today; users overlay their own applications.
+19. **Stack Templates & Marketplace**: Still open. The "Software Pack" concept exists in the broader Nebari ecosystem; a curated marketplace is future work.
-    - **Use Case:** Define entire platform + applications in single config.yaml
-    - **Example Stacks:**
-      - Data Science: PostgreSQL + Redis + MinIO + JupyterHub + Dask
-      - ML Platform: MLflow + Kubeflow + Model Registry + Feature Store
-      - Web Platform: PostgreSQL + Redis + RabbitMQ + Object Storage
-    - **Integration:** Via Helm chart repositories, ArgoCD ApplicationSets
-    - **Recommendation:** Phase 2, using Helm chart catalogs and pre-defined stack templates
+## 14.7 Provider Plugin Architecture
-18. **Full Stack in One Repo:** Should users be able to define foundational software + application stacks + configuration in a single repository?
-
-    - **Structure:**
-
-      ```
-      nebari-deployment/
-      ├── config.yaml # Platform + stacks
-      ├── stacks/
-      │   ├── postgresql-values.yaml # DB config
-      │   ├── jupyterhub-values.yaml # App config
-      │   └── dask-values.yaml # Compute config
-      ├── policies/ # OPA policies
-      └── .github/workflows/ # Auto-generated CI/CD
-      ```
-
-    - **Benefits:** Single source of truth, version controlled, auditable, reproducible
-    - **Recommendation:** Phase 2, core feature for platform teams
-
-19. **Stack Templates & Marketplace:** Should NIC provide pre-built stack templates (data science, ML, web app) and a marketplace for community stacks?
-    - **Built-in Templates:**
-      - nebari-data-science-stack
-      - nebari-ml-platform-stack
-      - nebari-web-platform-stack
-    - **Community Marketplace:** GitHub-based registry of vetted stack configurations
-    - **Recommendation:** Phase 2 for templates, Future for marketplace
-
----
+20. **Out-of-Tree Provider Plugins** ([ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md), Proposed): Open questions from the ADR include scope of plugin kinds, relationship to Nebari stages, credential model, validation without install, trust/signing, and migration of existing in-tree providers. These are tracked in the ADR rather than duplicated here.
diff --git a/docs/design-doc/appendix/15-future-enhancements.md b/docs/design-doc/appendix/15-future-enhancements.md
index 58f8654e..c856b808 100644
--- a/docs/design-doc/appendix/15-future-enhancements.md
+++ b/docs/design-doc/appendix/15-future-enhancements.md
@@ -1,5 +1,12 @@
 # Future Enhancements
+> **Status**: this document describes future/aspirational features. Config snippets here use a hypothetical schema and **do not** match the current NIC config format. See [`16-configuration-reference.md`](16-configuration-reference.md) for the current schema (`cluster.:` / `dns.:` discriminator pattern; no top-level `provider:` field; no `version:`, `kubernetes:`, `node_pools:`, `tls:`, `foundational_software:`, `images:`, or `features:` blocks). CLI commands like `nic plan`, `nic status`, `nic state`, `nic unlock`, `nic init`, `nic stack`, `nic marketplace` do not exist today; only `deploy`, `destroy`, `validate`, `kubeconfig`, and `version` are implemented.
+>
+> Some items below have shipped in part:
+>
+> - **§2 Git Repository Provisioning & CI/CD**: the **consumption** side is done (`pkg/git`, `git_repository:` config block, env-var auth, `file://` local repos). The **provisioning** side (`nic init` creating a new repo, auto-generated workflows) is still future work.
+> - **Secrets management** via `.env` + env-var indirection is shipped for MVP (see [`14-open-questions.md`](14-open-questions.md) §14.2).
+
 This document provides detailed specifications for future enhancements planned for NIC.
 ## 1. Configuration Overlays for Multi-Environment Support
diff --git a/docs/design-doc/appendix/16-configuration-reference.md b/docs/design-doc/appendix/16-configuration-reference.md
index abf56f4b..6c3da86b 100644
--- a/docs/design-doc/appendix/16-configuration-reference.md
+++ b/docs/design-doc/appendix/16-configuration-reference.md
@@ -1,1387 +1,334 @@
 # Configuration Reference
-This document provides a complete reference for all configuration options in NIC, based on the actual struct definitions
-in `pkg/config/config.go`.
+This is the authoritative reference for `nebari-config.yaml`. Field-level source of truth is the Go code; this document is updated as code changes. Ground-truth file references are inline.
 ## Table of Contents
-1. [Global Configuration](#global-configuration)
-2. [AWS Provider Configuration](#aws-provider-configuration)
-3. [GCP Provider Configuration](#gcp-provider-configuration)
-4. [Azure Provider Configuration](#azure-provider-configuration)
-5. [Hetzner Provider Configuration](#hetzner-provider-configuration)
-6. [Local Provider Configuration](#local-provider-configuration)
-7. [DNS Provider Configuration](#dns-provider-configuration)
-8. [Complete Examples](#complete-examples)
+1. [Top-Level Schema](#1-top-level-schema)
+2. [Cluster Providers](#2-cluster-providers)
+   1. [`cluster.aws`](#21-clusteraws-amazon-eks)
+   2. [`cluster.hetzner`](#22-clusterhetzner-hetzner-cloud-k3s)
+   3. [`cluster.local`](#23-clusterlocal-kind-for-development)
+   4. [`cluster.existing`](#24-clusterexisting-adopt-a-pre-provisioned-cluster)
+   5. [`cluster.gcp` / `cluster.azure`](#25-clustergcp--clusterazure-stubs)
+3. [DNS Providers](#3-dns-providers)
+4. [Certificate](#4-certificate)
+5. [Git Repository](#5-git-repository)
+6. [Environment Variables](#6-environment-variables)
+---
+## 1. Top-Level Schema
-## Global Configuration
-
-These fields apply to all providers and are defined in `NebariConfig` (pkg/config/config.go:4-22).
-
-```yaml
-# REQUIRED: Unique name for your Nebari deployment
-# Used for resource naming and tagging
-project_name: my-nebari
-
-# REQUIRED: Cloud provider to use
-# Valid values: aws, gcp, azure, hetzner, local
-provider: aws
-
-# OPTIONAL: Domain name for your Nebari deployment
-# Required if you want to enable TLS/HTTPS access
-# Example: nebari.example.com
-domain: nebari.example.com
-
-# OPTIONAL: DNS provider configuration
-# The provider name is the key, its config is the value
-# Only one DNS provider can be configured at a time
-# See "DNS Provider Configuration" section for details
-dns:
-  cloudflare:
-    zone_name: example.com
-```
-
-**Field Descriptions:**
-
-- **project_name** (string, required): Unique identifier for your Nebari deployment. Used in resource naming and tagging
-  across all cloud resources.
-- **provider** (string, required): Cloud provider to deploy infrastructure on. Must be one of: `aws`, `gcp`, `azure`,
-  `hetzner`, `local`.
-- **domain** (string, optional): Fully qualified domain name for accessing Nebari services. Required for TLS/Let's
-  Encrypt integration.
-- **dns** (object, optional): DNS provider configuration. The provider name is the key (e.g., `cloudflare`), and its
-  config is the value. Only one provider can be configured. See DNS Provider Configuration section.
-
-
-
-## AWS Provider Configuration
-
-AWS-specific configuration defined in `AWSConfig` (pkg/config/config.go:24-39).
+Defined by `NebariConfig` in `pkg/config/config.go`:
 ```yaml
-provider: aws
-
-amazon_web_services:
-  # REQUIRED: AWS region to deploy infrastructure
-  # Example: us-west-2, us-east-1, eu-west-1
-  region: us-west-2
-
-  # REQUIRED: Kubernetes version for EKS cluster
-  # Example: "1.28", "1.29", "1.30"
-  # Must be a string (quoted) to preserve minor version
-  kubernetes_version: "1.28"
-
-  # OPTIONAL: List of availability zones to use
-  # If not specified, AWS will automatically select zones in the region
-  # Must be valid AZs within the specified region
-  availability_zones:
-    - us-west-2a
-    - us-west-2b
-    - us-west-2c
-
-  # OPTIONAL: CIDR block for VPC creation
-  # Default: AWS default VPC CIDR
-  # Must be a valid RFC 1918 private network range
-  vpc_cidr_block: "10.10.0.0/16"
-
-  # OPTIONAL: Enable private EKS API endpoint (accessible from within VPC only)
-  # Default: false
-  endpoint_private_access: false
-
-  # OPTIONAL: Enable public EKS API endpoint (accessible from internet)
-  # Default: false
-  endpoint_public_access: true
-
-  # OPTIONAL: ARN of KMS key for EKS secrets encryption
-  # If specified, Kubernetes secrets will be encrypted with this key
-  # Example: arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012
-  eks_kms_arn: ""
-
-  # OPTIONAL: Use existing subnet IDs instead of creating new VPC
-  # Provide list of subnet IDs that span multiple AZs
-  # If specified, VPC creation is skipped
-  existing_subnet_ids:
-    - subnet-12345678
-    - subnet-87654321
-
-  # OPTIONAL: Use existing security group ID
-  # If specified, this security group will be used for EKS cluster
-  existing_security_group_id: sg-12345678
-
-  # OPTIONAL: IAM permissions boundary ARN
-  # Applied to all IAM roles created by NIC
-  # Required in enterprise environments with mandatory permission boundaries
-  # Example: arn:aws:iam::123456789012:policy/PermissionsBoundary
-  permissions_boundary: ""
-
-  # OPTIONAL: AWS resource tags
-  # Applied to all AWS resources created by NIC
-  # Useful for cost allocation, compliance, and organization
-  tags:
-    Environment: production
-    Project: nebari
-    ManagedBy: nic
-    CostCenter: engineering
-
-  # REQUIRED: Node groups (worker node pools) configuration
-  # At least one node group is required for a functional cluster
-  # Map of node group name to configuration
-  node_groups:
-    # General purpose node group (typically required)
-    general:
-      # REQUIRED: EC2 instance type
-      # Example: m5.2xlarge, m6i.4xlarge, c5.xlarge
-      # Choose based on workload requirements (CPU, memory, network)
-      instance: m5.2xlarge
-
-      # OPTIONAL: Minimum number of nodes in this group
-      # Default: 0
-      # Autoscaler will not scale below this number
-      min_nodes: 1
-
-      # OPTIONAL: Maximum number of nodes in this group
-      # Default: 1
-      # Autoscaler will not scale above this number
-      max_nodes: 5
-
-      # OPTIONAL: Kubernetes taints for this node group
-      # Prevents pods from being scheduled unless they have matching tolerations
-      # Useful for dedicated workloads (GPU, high-memory, etc.)
-      taints:
-        - key: workload
-          value: general
-          effect: NoSchedule # NoSchedule, PreferNoSchedule, or NoExecute
-
-      # OPTIONAL: Enable GPU support for this node group
-      # Default: false
-      # Set to true for GPU instance types (p3, p4, g4, g5)
-      gpu: false
-
-      # OPTIONAL: Deploy nodes in a single subnet only
-      # Default: false
-      # Set to true if node group should not span multiple AZs
-      single_subnet: false
-
-      # OPTIONAL: IAM permissions boundary for this node group's IAM role
-      # Overrides the global permissions_boundary for this specific node group
-      permissions_boundary: ""
-
-      # OPTIONAL: Use EC2 Spot instances for cost savings
-      # Default: false
-      # Spot instances are cheaper but can be interrupted
-      # Not recommended for critical workloads
-      spot: false
-
-    # User workload node group example
-    user:
-      instance: m5.xlarge
-      min_nodes: 0
-      max_nodes: 10
-      taints: []
-
-    # GPU node group example
-    gpu:
-      instance: g5.2xlarge
-      min_nodes: 0
-      max_nodes: 5
-      gpu: true
-      spot: false
-      taints:
-        - key: nvidia.com/gpu
-          value: "true"
-          effect: NoSchedule
-
-    # Spot instance node group example
-    spot-workers:
-      instance: m5.4xlarge
-      min_nodes: 0
-      max_nodes: 20
-      spot: true
-      taints:
-        - key: workload
-          value: spot
-          effect: NoSchedule
-```
-
-**AWS Environment Variables (Secrets):**
-
-NIC requires AWS credentials via environment variables or IAM roles:
-
-```bash
-# Option 1: Long-term credentials (development only)
-AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
-AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+project_name: my-nebari  # required, [a-zA-Z0-9][a-zA-Z0-9_-]*
+domain: nebari.example.com  # optional, but needed for routable services
-# Option 2: Temporary credentials (recommended)
-AWS_ACCESS_KEY_ID=ASIAIOSFODNN7EXAMPLE
-AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-AWS_SESSION_TOKEN=FwoGZXIvYXdzEBYaDJH...
+cluster:  # required, exactly one provider
+  :
+    ...
-# Option 3: AWS Profile (recommended for local development)
-AWS_PROFILE=nebari-admin
+dns:  # optional, exactly one provider
+  :
+    ...
-# Option 4: IAM Role (recommended for CI/CD)
-# No environment variables needed - uses instance/pod IAM role
-```
-
-
-
-## GCP Provider Configuration
+git_repository:  # required on cloud providers; optional on local
+  url: ...
+  ...
-GCP-specific configuration defined in `GCPConfig` (pkg/config/config.go:41-57).
-
-```yaml
-provider: gcp
-
-google_cloud_platform:
-  # REQUIRED: GCP project ID where resources will be created
-  # Example: my-project-123456
-  # Must be an existing GCP project with billing enabled
-  project: my-gcp-project-id
-
-  # REQUIRED: GCP region to deploy infrastructure
-  # Example: us-central1, us-east1, europe-west1
-  region: us-central1
-
-  # REQUIRED: Kubernetes version for GKE cluster
-  # Example: "1.28", "1.29", "1.30"
-  # Must be a string (quoted) to preserve minor version
-  kubernetes_version: "1.28"
-
-  # OPTIONAL: List of availability zones (GCP zones) to use
-  # If not specified, GCP will automatically select zones in the region
-  # Must be valid zones within the specified region
-  # Example: us-central1-a, us-central1-b, us-central1-c
-  availability_zones:
-    - us-central1-a
-    - us-central1-b
-    - us-central1-c
-
-  # OPTIONAL: GKE release channel for automatic version management
-  # Valid values: "RAPID", "REGULAR", "STABLE", "UNSPECIFIED"
-  # Default: "REGULAR"
-  # - RAPID: Bleeding edge, frequent updates
-  # - REGULAR: Balanced updates (recommended)
-  # - STABLE: Conservative updates, well-tested
-  # - UNSPECIFIED: Manual version management
-  release_channel: "REGULAR"
-
-  # OPTIONAL: GKE networking mode
-  # Valid values: "ROUTE", "VPC_NATIVE"
-  # Default: "VPC_NATIVE"
-  # - ROUTE: Routes-based networking (legacy)
-  # - VPC_NATIVE: IP aliasing, recommended for new clusters
-  networking_mode: "VPC_NATIVE"
-
-  # OPTIONAL: VPC network to use for cluster
-  # Default: "default"
-  # Can specify existing VPC network name
-  network: "default"
-
-  # OPTIONAL: VPC subnetwork to use for cluster
-  # Required if using custom VPC network
-  # Example: projects/my-project/regions/us-central1/subnetworks/my-subnet
-  subnetwork: ""
-
-  # OPTIONAL: IP allocation policy for VPC-native clusters
-  # Defines secondary IP ranges for pods and services
-  # Only applies when networking_mode is "VPC_NATIVE"
-  ip_allocation_policy:
-    cluster_secondary_range_name: gke-pods
-    services_secondary_range_name: gke-services
-    cluster_ipv4_cidr_block: "10.4.0.0/14"
-    services_ipv4_cidr_block: "10.0.32.0/20"
-
-  # OPTIONAL: Master authorized networks configuration
-  # Restricts access to GKE control plane
-  # Map of CIDR name to CIDR block
-  master_authorized_networks_config:
-    office-network: "203.0.113.0/24"
-    vpn-network: "198.51.100.0/24"
-
-  # OPTIONAL: Private cluster configuration
-  # Enables GKE private cluster mode
-  private_cluster_config:
-    enable_private_nodes: true
-    enable_private_endpoint: false
-    master_ipv4_cidr_block: "172.16.0.0/28"
-
-  # OPTIONAL: GCP network tags (labels)
-  # Applied to all GCE instances (nodes)
-  # Used for firewall rules and organization
-  # Note: GCP uses tags as strings, not key-value pairs
-  tags:
-    - production
-    - nebari
-    - data-science
-
-  # REQUIRED: Node groups (node pools) configuration
-  # At least one node group is required for a functional cluster
-  # Map of node group name to configuration
-  node_groups:
-    # General purpose node pool
-    general:
-      # REQUIRED: GCE machine type
-      # Example: n1-standard-8, n2-standard-16, e2-standard-8
-      # Choose based on workload requirements (CPU, memory)
-      instance: e2-standard-8
-
-      # OPTIONAL: Minimum number of nodes per zone
-      # Default: 0
-      # Autoscaler will not scale below this number (per zone)
-      min_nodes: 1
-
-      # OPTIONAL: Maximum number of nodes per zone
-      # Default: 1
-      # Autoscaler will not scale above this number (per zone)
-      max_nodes: 5
-
-      # OPTIONAL: Kubernetes taints for this node pool
-      # Prevents pods from being scheduled unless they have matching tolerations
-      taints:
-        - key: workload
-          value: general
-          effect: NoSchedule # NoSchedule, PreferNoSchedule, or NoExecute
-
-      # OPTIONAL: Use preemptible VMs for cost savings
-      # Default: false
-      # Preemptible VMs are cheaper but can be terminated at any time
-      # Not recommended for critical workloads
-      preemptible: false
-
-      # OPTIONAL: Kubernetes labels for this node pool
-      # Applied to all nodes in this pool
-      # Used for node affinity and pod scheduling
-      labels:
-        workload: general
-        environment: production
-
-      # OPTIONAL: GPU configuration for this node pool
-      # Required for GPU workloads (TensorFlow, PyTorch, etc.)
-      # Must use GPU-enabled machine types (n1-standard-* with GPUs)
-      guest_accelerators:
-        - name: nvidia-tesla-t4 # GPU type
-          count: 1 # Number of GPUs per node
-
-    # User workload node pool example
-    user:
-      instance: e2-standard-4
-      min_nodes: 0
-      max_nodes: 10
-      labels:
-        workload: user
-
-    # GPU node pool example
-    gpu:
-      instance: n1-standard-8
-      min_nodes: 0
-      max_nodes: 5
-      labels:
-        workload: gpu
-        nvidia.com/gpu: "true"
-      taints:
-        - key: nvidia.com/gpu
-          value: "true"
-          effect: NoSchedule
-      guest_accelerators:
-        - name: nvidia-tesla-v100
-          count: 2
-
-    # Preemptible node pool example
-    preemptible-workers:
-      instance: n2-standard-16
-      min_nodes: 0
-      max_nodes: 20
-      preemptible: true
-      labels:
-        workload: batch
-        preemptible: "true"
-      taints:
-        - key: preemptible
-          value: "true"
-          effect: NoSchedule
+certificate:  # optional, defaults to selfsigned
+  type: ...
 ```
-**GCP Environment Variables (Secrets):**
-
-NIC requires GCP credentials via environment variables or service account:
-
-```bash
-# Option 1: Service account key file (development only)
-GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
+Anti-pattern: there is no top-level `provider:`, `version:`, `name:`, `kubernetes:`, `node_pools:`, `tls:`, `foundational_software:`, `images:`, or `features:` field. If older documentation shows those, it is out of date.
-# Option 2: Service account key JSON (CI/CD)
-GOOGLE_CREDENTIALS='{"type":"service_account","project_id":"my-project",...}'
+| Field | Type | Required | Source |
+|-------|------|----------|--------|
+| `project_name` | string | ✅ | `NebariConfig.ProjectName` |
+| `domain` | string | optional | `NebariConfig.Domain` |
+| `cluster` | map | ✅ | `NebariConfig.Cluster` (`ClusterConfig`) |
+| `dns` | map | optional | `NebariConfig.DNS` (`DNSConfig`) |
+| `git_repository` | object | conditional | `NebariConfig.GitRepository` (`git.Config`) |
+| `certificate` | object | optional | `NebariConfig.Certificate` (`CertificateConfig`) |
-# Option 3: Workload Identity (recommended for GKE)
-# No environment variables needed - uses pod service account
+---
-# GCP Project ID (optional, can be in config)
-GOOGLE_PROJECT=my-gcp-project-id
-```
+## 2. Cluster Providers
+`cluster:` takes exactly one key, the provider name. The shape of the nested object is provider-specific.
+Valid provider names (registered in `cmd/nic/main.go`): `aws`, `hetzner`, `local`, `existing`, `gcp`, `azure`.
-## Azure Provider Configuration
+### 2.1 `cluster.aws` (Amazon EKS)
-Azure-specific configuration defined in `AzureConfig` (pkg/config/config.go:59-76).
+Source: `pkg/provider/aws/config.go`. Status: **implemented**.
 ```yaml
-provider: azure
-
-azure:
-  # REQUIRED: Azure region to deploy infrastructure
-  # Example: eastus, westus2, westeurope
-  region: eastus
-
-  # OPTIONAL: Kubernetes version for AKS cluster
-  # Example: "1.28", "1.29", "1.30"
-  # Must be a string (quoted) to preserve minor version
-  # If not specified, uses latest available version in region
-  kubernetes_version: "1.28"
-
-  # REQUIRED: Storage account name postfix
-  # Used to create unique storage account names
-  # Must be lowercase alphanumeric, 3-24 characters
-  # Final name:
-  storage_account_postfix: "nebari"
-
-  # OPTIONAL: Resource group name for all resources
-  # If not specified, NIC will create: -
-  # Must be unique within your Azure subscription
-  resource_group_name: "nebari-resources"
-
-  # OPTIONAL: Node resource group name
-  # Separate resource group for AKS node resources (VMs, disks, NICs)
-  # If not specified, Azure creates: MC___
-  node_resource_group_name: "nebari-node-resources"
-
-  # OPTIONAL: VNet subnet ID for AKS cluster
-  # Use existing subnet instead of creating new VNet
-  # Example: /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}
-  vnet_subnet_id: ""
-
-  # OPTIONAL: Enable private cluster mode
-  # Default: false
-  # If true, AKS API server is not publicly accessible
-  # Requires VPN or ExpressRoute for management access
-  private_cluster_enabled: false
-
-  # OPTIONAL: Maximum pods per node
-  # Default: 30 (Azure default)
-  # Maximum: 250
-  # Affects IP address requirements in subnet
-  max_pods: 30
-
-  # OPTIONAL: Enable Azure Workload Identity
-  # Default: false
-  # Allows pods to authenticate to Azure services using managed identities
-  # Recommended for secure access to Azure resources
-  workload_identity_enabled: true
-
-  # OPTIONAL: Enable Azure Policy for Kubernetes
-  # Default: false
-  # Enables policy-based governance for AKS cluster
-  # Useful for compliance and security enforcement
-  azure_policy_enabled: false
-
-  # OPTIONAL: Azure resource tags
-  # Applied to all Azure resources created by NIC
-  # Useful for cost allocation, compliance, and organization
-  tags:
-    Environment: production
-    Project: nebari
-    ManagedBy: nic
-    CostCenter: engineering
-
-  # OPTIONAL: Network profile configuration
-  # Defines networking settings for AKS cluster
-  network_profile:
-    network_plugin: azure # azure (Azure CNI) or kubenet
-    network_policy: azure # azure, calico, or none
-    service_cidr: "10.0.0.0/16"
-    dns_service_ip: "10.0.0.10"
-    docker_bridge_cidr: "172.17.0.1/16"
-
-  # OPTIONAL: Authorized IP ranges for API server access
-  # Only these IPs can access the AKS API server
-  # Empty list = allow all (not recommended for production)
-  authorized_ip_ranges:
-    - "203.0.113.0/24"
-    - "198.51.100.0/24"
-
-  # REQUIRED: Node groups (node pools) configuration
-  # At least one node group is required for a functional cluster
-  # Map of node group name to configuration
-  node_groups:
-    # General purpose node pool (system pool)
-    general:
-      # REQUIRED: Azure VM size
-      # Example: Standard_D8_v3, Standard_D16s_v3, Standard_E8s_v3
-      # Choose based on workload requirements (CPU, memory)
-      instance: Standard_D8_v3
-
-      # OPTIONAL: Minimum number of nodes in this pool
-      # Default: 0
-      # Autoscaler will not scale below this number
-      # First node pool (system pool) should have min_nodes >= 1
-      min_nodes: 1
-
-      # OPTIONAL: Maximum number of nodes in this pool
-      # Default: 1
-      # Autoscaler will not scale above this number
-      max_nodes: 5
-
-      # OPTIONAL: Kubernetes taints for this node pool
-      # Prevents pods from being scheduled unless they have matching tolerations
-      taints:
-        - key: CriticalAddonsOnly
-          value: "true"
-          effect: NoSchedule # NoSchedule, PreferNoSchedule, or NoExecute
-
-    # User workload node pool example
-    user:
-      instance: Standard_D4_v3
-      min_nodes: 0
-      max_nodes: 10
-
-    # High-memory node pool example
-    highmem:
-      instance: Standard_E16s_v3
-      min_nodes: 0
-      max_nodes: 5
-      taints:
-        - key: workload
-          value: highmem
-          effect: NoSchedule
-
-    # GPU node pool example (requires GPU-enabled VM sizes)
-    gpu:
-      instance: Standard_NC6s_v3
-      min_nodes: 0
-      max_nodes: 3
-      taints:
-        - key: sku
-          value: gpu
-          effect: NoSchedule
-```
-
-**Azure Environment Variables (Secrets):**
-
-NIC requires Azure credentials via environment variables or managed identity:
-
-```bash
-# Option 1: Service Principal (recommended for automation)
-AZURE_CLIENT_ID=12345678-1234-1234-1234-123456789012
-AZURE_CLIENT_SECRET=your-client-secret
-AZURE_TENANT_ID=87654321-4321-4321-4321-210987654321
-AZURE_SUBSCRIPTION_ID=11111111-1111-1111-1111-111111111111
-
-# Option 2: Managed Identity (recommended for Azure VMs/AKS)
-# No environment variables needed - uses VM/pod managed identity
-
-# Option 3: Azure CLI authentication (development only)
-# Run: az login
-# NIC will use credentials from Azure CLI
-```
-
-
-
-## Hetzner Provider Configuration
-
-Hetzner Cloud provider configuration defined in `Config` (pkg/provider/hetzner/config.go). Provisions k3s clusters on
-Hetzner Cloud using the hetzner-k3s CLI tool.
+cluster:
+  aws:
+    region: us-west-2  # required
+    kubernetes_version: "1.34"  # required (string)
+    availability_zones:  # optional (defaults to []; module picks)
+      - us-west-2a
+      - us-west-2b
+    vpc_cidr_block: "10.10.0.0/16"  # optional, default: "10.0.0.0/16"
+    endpoint_private_access: true
+    endpoint_public_access: true
+
+    # Optional: adopt existing VPC infrastructure
+    # existing_vpc_id: vpc-...
+    # existing_private_subnet_ids: [subnet-..., subnet-...]
+    # existing_security_group_id: sg-...
+
+    # Optional: pin to existing IAM roles
+    # existing_cluster_role_arn: arn:aws:iam::...
+    # existing_node_role_arn: arn:aws:iam::...
+    # permissions_boundary: arn:aws:iam::...:policy/...
+
+    # Optional: EKS KMS key + log types
+    # eks_kms_arn: arn:aws:kms:...
+ enabled_log_types: ["api", "audit"] + + node_groups: # map keyed by node-group name + user: + instance: m7i.xlarge + min_nodes: 1 + max_nodes: 5 + # ami_type: AL2023_x86_64_STANDARD # defaults to AL2023 STANDARD + # gpu: true # uses AL2023_x86_64_NVIDIA AMI + # spot: true + # disk_size: 100 + # labels: + # workload: user + # taints: + # - key: nebari.example/dedicated + # value: user + # effect: NO_SCHEDULE # NO_SCHEDULE, NO_EXECUTE, PREFER_NO_SCHEDULE + + tags: # optional map[string]string + Environment: development + + # Optional: AWS Load Balancer Controller (default: enabled) + # aws_load_balancer_controller: + # enabled: true + # chart_version: "3.2.1" + # destroy_timeout: 5m + + # Optional: EFS shared storage + efs: + enabled: true + performance_mode: generalPurpose # generalPurpose | maxIO + throughput_mode: bursting # bursting | provisioned | elastic + encrypted: true + # provisioned_throughput_mibps: 100 # required if throughput_mode is provisioned + # kms_key_arn: arn:aws:kms:... + # storage_class_name: efs-sc + + # Optional: Longhorn distributed storage (default: enabled when nil) + # longhorn: + # enabled: true + # replica_count: 2 + # dedicated_nodes: false + # node_selector: { workload: storage } +``` + +Fields not in `aws.NodeGroup`: `single_subnet`, per-node-group `permissions_boundary`. If you see them in older docs, they are not real. + +State backend: S3 with `use_lockfile = true`, bucket auto-created per [§5.2 of State Management](../architecture/05-state-management.md). No DynamoDB. + +### 2.2 `cluster.hetzner` (Hetzner Cloud k3s) + +Source: `pkg/provider/hetzner/config.go`. Status: **implemented**. Backed by the `hetzner-k3s` binary - **not** OpenTofu. 
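The node-group rules for this provider (exactly one group marked `master: true`, and an odd control-plane count for k3s HA) can be sketched as a small validation pass. This is an illustration only: the `NodeGroup` type and `validateNodeGroups` helper are hypothetical names, not copied from `pkg/provider/hetzner`, and treating an even master count as a hard error is an assumption (the real provider may only warn).

```go
package main

import (
	"fmt"
	"sort"
)

// NodeGroup is a hypothetical, trimmed-down mirror of the YAML keys
// instance_type, count, and master. It is not the real provider struct.
type NodeGroup struct {
	InstanceType string
	Count        int
	Master       bool
}

// validateNodeGroups checks that exactly one group is the control plane and
// that its node count is odd. Assumption: an even count is rejected outright.
func validateNodeGroups(groups map[string]NodeGroup) error {
	var masters []string
	for name, g := range groups {
		if g.Master {
			masters = append(masters, name)
		}
	}
	sort.Strings(masters) // map iteration order is random; sort for stable messages
	if len(masters) != 1 {
		return fmt.Errorf("exactly one node group must set master: true, found %d %v", len(masters), masters)
	}
	if c := groups[masters[0]].Count; c < 1 || c%2 == 0 {
		return fmt.Errorf("master group %q: count %d should be odd (1, 3, 5)", masters[0], c)
	}
	return nil
}

func main() {
	groups := map[string]NodeGroup{
		"master":  {InstanceType: "cpx31", Count: 1, Master: true},
		"workers": {InstanceType: "cpx31", Count: 2},
	}
	fmt.Println("valid:", validateNodeGroups(groups) == nil)
}
```

Worker groups are left unconstrained here; only the control-plane group is checked.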
```yaml -provider: hetzner - -hetzner_cloud: - # REQUIRED: Hetzner datacenter location - # Examples: ash (Ashburn), fsn1 (Falkenstein), nbg1 (Nuremberg), hel1 (Helsinki) - location: ash - - # REQUIRED: Kubernetes version for the k3s cluster - # Short form ("1.32", "1.32.0") is resolved to the latest k3s release via GitHub API - # Explicit form ("v1.32.0+k3s1") is used as-is (useful for air-gapped or pinned scenarios) - kubernetes_version: "1.32" - - # OPTIONAL: Allow application pods on control-plane nodes - # Default: true (enables single-node clusters and better utilization of small instances) - # Set to false for production clusters where you want dedicated masters that - # only run etcd and the Kubernetes API server. When false, at least one - # non-master node group is required. - schedule_workloads_on_masters: true - - # REQUIRED: Node groups - at least one group must have master: true - # Uses the same map[string]NodeGroup pattern as AWS, GCP, and Azure providers. - # Exactly one group must be marked as the master (k3s control plane). 
- node_groups: - # Control plane node group - exactly one group must have master: true - master: - # REQUIRED: Hetzner server type - # Examples: cpx11, cpx21, cpx31, cpx41, cpx51, cx22, cax11 (ARM) - instance_type: cpx31 - - # REQUIRED: Number of control-plane nodes - # Must be odd (1, 3, 5) for k3s HA with embedded etcd - count: 1 - - # REQUIRED for one group: marks this as the k3s control plane - master: true - - # Worker node groups (zero or more) - workers: - instance_type: cpx31 - count: 2 - - # OPTIONAL: Override location for this worker group - # Only valid for worker (non-master) groups - # location: fsn1 - - # OPTIONAL: Autoscaling configuration - # autoscaling: - # enabled: true - # min_instances: 1 - # max_instances: 10 - - # OPTIONAL: Provide your own SSH keys instead of auto-generated ones - # If omitted, NIC generates an ed25519 key pair in ~/.cache/nic/hetzner-k3s/ssh/ - # ssh: - # public_key_path: "~/.ssh/id_ed25519.pub" - # private_key_path: "~/.ssh/id_ed25519" - - # OPTIONAL: Restrict SSH and Kubernetes API access - # Defaults to 0.0.0.0/0 (open to all) if omitted - restrict these in production - # network: - # ssh_allowed_cidrs: - # - 203.0.113.0/24 - # api_allowed_cidrs: - # - 203.0.113.0/24 -``` - -**Hetzner Environment Variables (Secrets):** - -```bash -# REQUIRED: Hetzner Cloud API token -# Create at: https://console.hetzner.cloud/ -> Project -> Security -> API Tokens -# Needs Read & Write permissions -HETZNER_TOKEN=your-hetzner-api-token -``` - -**Accessing the cluster after deploy:** - -The kubeconfig is written to `~/.cache/nic/hetzner-k3s//kubeconfig`: - -```bash -export KUBECONFIG=~/.cache/nic/hetzner-k3s/my-nebari/kubeconfig -kubectl get nodes -``` - -**SSH access to nodes:** - -NIC auto-generates an ed25519 key pair in `~/.cache/nic/hetzner-k3s/ssh/` (or uses your custom keys if configured via -`ssh:` in the config). 
To SSH into a node: - -```bash -# Get node IPs -kubectl get nodes -o wide - -# SSH as root using the auto-generated key -ssh -i ~/.cache/nic/hetzner-k3s/ssh/hetzner_ed25519 root@ -``` - -**Important: SSH key portability** - -Unlike managed Kubernetes providers (EKS, GKE, AKS) where authentication is handled by cloud IAM, Hetzner uses -hetzner-k3s which provisions clusters over SSH. The SSH key pair used during `nic deploy` is required for all subsequent -cluster operations (redeploy, destroy, scale) from any machine. - -If you auto-generate keys (the default), they are stored in `~/.cache/nic/hetzner-k3s/ssh/`. To manage the cluster from -a different computer, you must copy these files: - -```bash -# On the original machine, copy both files: -~/.cache/nic/hetzner-k3s/ssh/hetzner_ed25519 -~/.cache/nic/hetzner-k3s/ssh/hetzner_ed25519.pub -``` - -Alternatively, use your own SSH keys by configuring the `ssh:` block in your config, so the same key is available on all -machines without manual copying. - -**Key differences from managed Kubernetes providers (AWS/GCP/Azure):** - -- Uses k3s instead of a managed Kubernetes service (EKS/GKE/AKS) -- Requires exactly one `master: true` node group for the k3s control plane -- Master count must be odd (1, 3, 5) for etcd quorum -- `schedule_workloads_on_masters` controls whether app pods run on masters (defaults to true) -- Worker groups can override the top-level location for multi-region deployments -- SSH and API access CIDRs default to 0.0.0.0/0 if not restricted - - - -## Local Provider Configuration - -Local K3s provider configuration defined in `LocalConfig` (pkg/config/config.go:78-83). +cluster: + hetzner: + location: ash # required: Hetzner location (ash, fsn1, nbg1, ...) + kubernetes_version: "1.32" # required: "1.32", "1.32.0", or "v1.32.0+k3s1" + + # Optional: whether application pods may be scheduled on control-plane nodes. + # Default: true (single-node clusters and small instances work better). 
+ # Set to false for production with dedicated masters. + # schedule_workloads_on_masters: false + + # Optional: preserve CSI volumes through destroy. + # When true, deploy labels volumes persist=true and destroy skips them. + # persist_data: false + + node_groups: # map keyed by node-group name; exactly one must have master: true + master: + instance_type: cpx31 + count: 1 # for k3s HA, count should be 1, 3, or 5 (odd) + master: true + workers: + instance_type: cpx31 + count: 2 + # autoscaling: + # enabled: true + # min_instances: 2 + # max_instances: 6 + + # Optional: provide your own SSH keys (else NIC generates ed25519 keys in ~/.cache/nic/hetzner-k3s/ssh/) + # ssh: + # public_key_path: ~/.ssh/id_ed25519.pub + # private_key_path: ~/.ssh/id_ed25519 + + # Optional: restrict SSH and API CIDRs (defaults to 0.0.0.0/0; NIC warns at validate time) + # network: + # ssh_allowed_cidrs: [203.0.113.0/24] + # api_allowed_cidrs: [203.0.113.0/24] +``` + +The Hetzner provider requires the `HCLOUD_TOKEN` environment variable. + +### 2.3 `cluster.local` (Kind for development) + +Source: `pkg/provider/local/config.go`. Status: **implemented as a stub**. The local provider does not create the cluster itself; `make localkind-up` does. The provider is a thin adapter that runs the bootstrap (ArgoCD + foundational apps) against the Kind cluster. 
```yaml -provider: local - -local: - # OPTIONAL: Kubernetes context to use from kubeconfig - # Default: current context from ~/.kube/config - # Use to specify which cluster to deploy to when you have multiple contexts - kube_context: "k3d-nebari-local" - - # OPTIONAL: Node selectors for workload placement - # Map of workload type to Kubernetes node selector labels - # Used to target specific nodes in the local cluster - # Useful when running multi-node K3s/K3d/Kind clusters - node_selectors: - # General workloads node selector - general: - kubernetes.io/os: linux - node-role.kubernetes.io/worker: "true" - - # User workloads node selector - user: - kubernetes.io/os: linux - workload: user - - # Worker/batch workloads node selector - worker: - kubernetes.io/os: linux - workload: batch - - # GPU workloads node selector (if you have GPU nodes locally) - gpu: - kubernetes.io/os: linux - nvidia.com/gpu: "true" -``` - -**Local Provider Notes:** - -- **Purpose**: Deploy Nebari to existing local Kubernetes cluster (K3s, K3d, Kind, Minikube, Docker Desktop) -- **No cloud credentials required**: Uses local kubeconfig for authentication -- **No infrastructure provisioning**: Assumes cluster already exists -- **Node selectors only**: No node group creation, only workload placement control -- **Development/testing use case**: Not recommended for production deployments +cluster: + local: + kube_context: "kind-nebari-local" # context name from kubeconfig + # storage_class: standard # default: "standard"; use "local-path" for k3s + # https_port: 443 # override e.g. 
8443 if 443 is in use -**Local Provider Environment Variables:** + # MetalLB defaults to enabled with pool 192.168.1.100-192.168.1.110 + # metallb: + # enabled: false # disable for k3s (ships with ServiceLB) + # address_pool: 172.18.255.100-172.18.255.110 -```bash -# OPTIONAL: Custom kubeconfig location -KUBECONFIG=/path/to/custom/kubeconfig - -# If not set, uses default: ~/.kube/config + # Optional: per-node-group selectors used by software packs + # node_selectors: + # general: + # kubernetes.io/os: linux + # user: + # kubernetes.io/os: linux ``` +The local provider sets `InfraSettings.SupportsLocalGitOps = true`, which lets NIC auto-create `/tmp/nebari-gitops-` when `git_repository:` is not specified. +### 2.4 `cluster.existing` (adopt a pre-provisioned cluster) -## DNS Provider Configuration - -DNS provider configuration for managing DNS records and Let's Encrypt integration. - -### Cloudflare DNS Provider - -Cloudflare DNS provider defined in `cloudflare.Config` (pkg/dnsprovider/cloudflare/config.go:5-8). +Source: `pkg/provider/existing/config.go`. Status: **implemented**. No provisioning happens; NIC just runs the bootstrap against whatever cluster the kubeconfig points at. ```yaml -dns: - cloudflare: - # REQUIRED: Cloudflare zone name (your domain) - # This is the domain you manage in Cloudflare - # Example: example.com, mycompany.com - # NIC will create DNS records under this zone - zone_name: example.com -``` +cluster: + existing: + # Path to the kubeconfig file. May be absolute or relative; tilde is NOT expanded. + # When empty: falls back to $KUBECONFIG env var, then $HOME/.kube/config. + kubeconfig: path/to/kubeconfig -**Cloudflare Environment Variables (Secrets):** + # Required: context name within that kubeconfig. 
+ context: "arn:aws:eks:us-west-2:123456789012:cluster/my-nebari" -Cloudflare API credentials must be provided via environment variables: + # Optional: default StorageClass for foundational PVCs (default: "standard") + storage_class: gp2 -```bash -# REQUIRED: Cloudflare API Token -# Create at: https://dash.cloudflare.com/profile/api-tokens -# Required permissions: Zone:Read, DNS:Edit -CLOUDFLARE_API_TOKEN=your-cloudflare-api-token + # Optional: annotations applied to the Envoy Gateway LoadBalancer Service + # load_balancer_annotations: + # load-balancer.hetzner.cloud/location: ash ``` -**How to Create Cloudflare API Token:** - -1. Go to https://dash.cloudflare.com/profile/api-tokens -2. Click "Create Token" -3. Use "Edit zone DNS" template or create custom token -4. Permissions required: - - Zone / DNS / Edit - - Zone / Zone / Read -5. Zone Resources: Include / Specific zone / your-domain.com -6. Copy token and add to `.env` file: `CLOUDFLARE_API_TOKEN=...` +### 2.5 `cluster.gcp` / `cluster.azure` (stubs) -**DNS Provider Integration:** +Sources: `pkg/provider/gcp/config.go`, `pkg/provider/azure/config.go`. Status: **registered but not implemented**. The struct fields exist for forward compatibility; calling `Validate`, `Deploy`, `Destroy`, or `GetKubeconfig` on these providers returns "not yet implemented" today. -When a `dns` block is configured, NIC will: -- On deploy: create root domain and wildcard (`*.domain`) DNS records pointing to the load balancer endpoint -- On destroy: remove those DNS records before tearing down infrastructure -- DNS errors are treated as warnings and never block deploy or destroy +The GCP struct accepts: `project`, `region`, `kubernetes_version`, `availability_zones`, `release_channel`, `node_groups` (map), `tags`, `networking_mode`, `network`, `subnetwork`, `ip_allocation_policy`, `master_authorized_networks_config`, `private_cluster_config`. 
-**Known limitation:** If you change the `domain` field and redeploy, records for the old domain are not automatically -removed. You must manually delete them from Cloudflare. See -[DNS Provider Architecture](../implementation/09-dns-provider-architecture.md#orphaned-records-on-domain-change) for -details. +The Azure struct accepts: `region`, `kubernetes_version`, `storage_account_postfix`, `authorized_ip_ranges`, `resource_group_name`, `node_resource_group_name`, `node_groups` (map), `vnet_subnet_id`, `private_cluster_enabled`, `tags`, `network_profile`, `max_pods`, `workload_identity_enabled`, `azure_policy_enabled`. +See [`examples/gcp-config.yaml`](../../../examples/gcp-config.yaml) and [`examples/azure-config.yaml`](../../../examples/azure-config.yaml) for schemas. Don't try to deploy with them. +--- -## Complete Examples +## 3. DNS Providers -### Minimal AWS Configuration - -```yaml -# Minimal production-ready AWS deployment -project_name: nebari-prod -provider: aws -domain: nebari.example.com - -amazon_web_services: - region: us-west-2 - kubernetes_version: "1.28" - - node_groups: - general: - instance: m5.2xlarge - min_nodes: 3 - max_nodes: 10 -``` - -### Full-Featured AWS Configuration - -```yaml -# Production AWS deployment with all common options -project_name: nebari-production -provider: aws -domain: nebari.company.com - -amazon_web_services: - region: us-east-1 - kubernetes_version: "1.29" - availability_zones: - - us-east-1a - - us-east-1b - - us-east-1c - vpc_cidr_block: "10.100.0.0/16" - endpoint_private_access: true - endpoint_public_access: true - permissions_boundary: "arn:aws:iam::123456789012:policy/DepartmentBoundary" - - tags: - Environment: production - Project: nebari - Team: data-science - CostCenter: engineering - ManagedBy: nic - - node_groups: - general: - instance: m6i.4xlarge - min_nodes: 3 - max_nodes: 10 - taints: - - key: CriticalAddonsOnly - value: "true" - effect: NoSchedule - - user: - instance: m6i.2xlarge - min_nodes: 2 - 
max_nodes: 50 - - worker: - instance: c6i.8xlarge - min_nodes: 0 - max_nodes: 20 - taints: - - key: workload - value: batch - effect: NoSchedule - - gpu: - instance: g5.2xlarge - min_nodes: 0 - max_nodes: 10 - gpu: true - taints: - - key: nvidia.com/gpu - value: "true" - effect: NoSchedule - - spot: - instance: m6i.8xlarge - min_nodes: 0 - max_nodes: 30 - spot: true - taints: - - key: spot - value: "true" - effect: NoSchedule - -dns: - cloudflare: - zone_name: company.com -``` +`dns:` takes exactly one key. The shape is provider-specific. -### Minimal GCP Configuration +Valid provider names: `cloudflare` (the only DNS provider implemented today). -```yaml -# Minimal production-ready GCP deployment -project_name: nebari-prod -provider: gcp -domain: nebari.example.com - -google_cloud_platform: - project: my-gcp-project - region: us-central1 - kubernetes_version: "1.28" - - node_groups: - general: - instance: e2-standard-8 - min_nodes: 3 - max_nodes: 10 -``` +### 3.1 `dns.cloudflare` -### Full-Featured GCP Configuration +Source: `pkg/dnsprovider/cloudflare/config.go`. 
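Because `zone_name` has to be the Cloudflare zone that hosts the deployment's `domain`, the two values are related by a suffix match on a dot boundary. A minimal sketch of such a check follows. The helper name `domainInZone`, the case-insensitive comparison, the tolerance for a trailing dot, and the acceptance of `domain == zone_name` are all assumptions; the actual check in `pkg/dnsprovider/cloudflare` may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// domainInZone reports whether domain equals zone or is a subdomain of it.
// The "." prefix on the suffix keeps "notexample.com" from matching the
// zone "example.com". (Hypothetical helper, not NIC's implementation.)
func domainInZone(domain, zone string) bool {
	d := strings.ToLower(strings.TrimSuffix(domain, "."))
	z := strings.ToLower(strings.TrimSuffix(zone, "."))
	if d == "" || z == "" {
		return false
	}
	return d == z || strings.HasSuffix(d, "."+z)
}

func main() {
	for _, d := range []string{"nebari.example.com", "example.com", "notexample.com"} {
		fmt.Printf("%-20s in example.com: %v\n", d, domainInZone(d, "example.com"))
	}
}
```

A plain `strings.HasSuffix(domain, zone)` without the dot would accept look-alike domains, which is why the separator matters.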
```yaml -# Production GCP deployment with all common options -project_name: nebari-production -provider: gcp -domain: nebari.company.com - -google_cloud_platform: - project: company-nebari-prod - region: us-central1 - kubernetes_version: "1.29" - availability_zones: - - us-central1-a - - us-central1-b - - us-central1-c - release_channel: "REGULAR" - networking_mode: "VPC_NATIVE" - network: "nebari-network" - - ip_allocation_policy: - cluster_secondary_range_name: gke-pods - services_secondary_range_name: gke-services - cluster_ipv4_cidr_block: "10.4.0.0/14" - services_ipv4_cidr_block: "10.0.32.0/20" - - master_authorized_networks_config: - office: "203.0.113.0/24" - vpn: "198.51.100.0/24" - - private_cluster_config: - enable_private_nodes: true - enable_private_endpoint: false - master_ipv4_cidr_block: "172.16.0.0/28" - - tags: - - production - - nebari - - data-science - - node_groups: - general: - instance: n2-standard-8 - min_nodes: 3 - max_nodes: 10 - labels: - workload: system - taints: - - key: CriticalAddonsOnly - value: "true" - effect: NoSchedule - - user: - instance: n2-standard-4 - min_nodes: 2 - max_nodes: 50 - labels: - workload: user - - worker: - instance: c2-standard-16 - min_nodes: 0 - max_nodes: 20 - labels: - workload: batch - taints: - - key: workload - value: batch - effect: NoSchedule - - gpu: - instance: n1-standard-8 - min_nodes: 0 - max_nodes: 10 - labels: - workload: gpu - guest_accelerators: - - name: nvidia-tesla-t4 - count: 1 - taints: - - key: nvidia.com/gpu - value: "true" - effect: NoSchedule - - preemptible: - instance: n2-standard-16 - min_nodes: 0 - max_nodes: 30 - preemptible: true - labels: - workload: preemptible - taints: - - key: preemptible - value: "true" - effect: NoSchedule - dns: cloudflare: - zone_name: company.com + zone_name: example.com # the Cloudflare zone hosting `domain` ``` -### Minimal Azure Configuration - -```yaml -# Minimal production-ready Azure deployment -project_name: nebari-prod -provider: azure 
-domain: nebari.example.com - -azure: - region: eastus - kubernetes_version: "1.28" - storage_account_postfix: "nbri" - - node_groups: - general: - instance: Standard_D8_v3 - min_nodes: 3 - max_nodes: 10 -``` +Behavior: -### Full-Featured Azure Configuration +- On deploy, NIC waits for the Envoy Gateway LB to receive a hostname or IP and then creates a root record and a wildcard record (`*.`) in the zone. Record type is A for IPs, CNAME for hostnames. +- On destroy, both records are removed. Idempotent. +- Failures are non-blocking: deploy/destroy continue with a warning. -```yaml -# Production Azure deployment with all common options -project_name: nebari-production -provider: azure -domain: nebari.company.com - -azure: - region: eastus - kubernetes_version: "1.29" - storage_account_postfix: "nbriprod" - resource_group_name: "nebari-prod-rg" - node_resource_group_name: "nebari-prod-nodes-rg" - private_cluster_enabled: false - max_pods: 50 - workload_identity_enabled: true - azure_policy_enabled: true - - authorized_ip_ranges: - - "203.0.113.0/24" # Office network - - "198.51.100.0/24" # VPN network - - network_profile: - network_plugin: azure - network_policy: azure - service_cidr: "10.0.0.0/16" - dns_service_ip: "10.0.0.10" - docker_bridge_cidr: "172.17.0.1/16" - - tags: - Environment: production - Project: nebari - Team: data-science - CostCenter: engineering - ManagedBy: nic - - node_groups: - general: - instance: Standard_D8s_v3 - min_nodes: 3 - max_nodes: 10 - taints: - - key: CriticalAddonsOnly - value: "true" - effect: NoSchedule - - user: - instance: Standard_D4s_v3 - min_nodes: 2 - max_nodes: 50 - - worker: - instance: Standard_F16s_v2 - min_nodes: 0 - max_nodes: 20 - taints: - - key: workload - value: batch - effect: NoSchedule - - highmem: - instance: Standard_E16s_v3 - min_nodes: 0 - max_nodes: 10 - taints: - - key: workload - value: highmem - effect: NoSchedule - - gpu: - instance: Standard_NC6s_v3 - min_nodes: 0 - max_nodes: 5 - taints: - - key: sku 
- value: gpu - effect: NoSchedule +Credential: `CLOUDFLARE_API_TOKEN` env var, with Zone:Read and DNS:Edit permissions on the zone. `zone_name` must be a suffix of `domain` (suffix check with a dot separator). -dns: - cloudflare: - zone_name: company.com -``` +Future DNS providers (Route53, Azure DNS, Google Cloud DNS) will follow the same shape and the same `DNSProvider` interface defined in `pkg/dnsprovider/provider.go`. -### Minimal Hetzner Configuration +--- -```yaml -# Single-node Hetzner cluster (dev/testing) -project_name: nebari-dev -provider: hetzner -domain: nebari.example.com - -hetzner_cloud: - location: ash - kubernetes_version: "1.32" - node_groups: - master: - instance_type: cpx31 - count: 1 - master: true -``` +## 4. Certificate -### Production Hetzner Configuration +Source: `pkg/config/config.go` (`CertificateConfig`, `ACMEConfig`). ```yaml -# Multi-node Hetzner cluster with dedicated masters -project_name: nebari-prod -provider: hetzner -domain: nebari.example.com - -hetzner_cloud: - location: fsn1 - kubernetes_version: "1.32" - schedule_workloads_on_masters: false - node_groups: - master: - instance_type: cpx31 - count: 3 - master: true - general: - instance_type: cpx41 - count: 3 - gpu: - instance_type: ccx33 - count: 1 - autoscaling: - enabled: true - min_instances: 0 - max_instances: 5 - network: - ssh_allowed_cidrs: - - 203.0.113.0/24 - api_allowed_cidrs: - - 203.0.113.0/24 - -dns: - cloudflare: - zone_name: example.com +certificate: + type: letsencrypt # "selfsigned" (default) | "letsencrypt" + acme: # required when type: letsencrypt + email: admin@example.com + # server: https://acme-staging-v02.api.letsencrypt.org/directory # use staging for testing ``` -### Local Development Configuration - -```yaml -# Local K3d/Kind cluster for development -project_name: nebari-dev -provider: local -domain: nebari.local - -local: - kube_context: "k3d-nebari-local" - node_selectors: - general: - kubernetes.io/os: linux - user: - kubernetes.io/os: linux - 
worker: - kubernetes.io/os: linux -``` +When omitted, NIC behaves as if `type: selfsigned` was set. `selfsigned` is appropriate for local clusters, internal environments, and `existing` clusters where cert lifecycle is handled out-of-band. `letsencrypt` requires a publicly-routable `domain` (and typically a DNS provider). -### Multi-Environment Setup (Separate Files) +--- -**base-production.yaml** (production baseline): -```yaml -project_name: nebari-prod -provider: aws -domain: nebari.company.com - -amazon_web_services: - region: us-east-1 - kubernetes_version: "1.29" - vpc_cidr_block: "10.100.0.0/16" - - tags: - Environment: production - ManagedBy: nic - - node_groups: - general: - instance: m6i.4xlarge - min_nodes: 3 - max_nodes: 10 -``` +## 5. Git Repository -**staging.yaml** (smaller staging environment): -```yaml -project_name: nebari-staging -provider: aws -domain: staging.nebari.company.com - -amazon_web_services: - region: us-west-2 - kubernetes_version: "1.29" - vpc_cidr_block: "10.200.0.0/16" - - tags: - Environment: staging - ManagedBy: nic - - node_groups: - general: - instance: m5.xlarge - min_nodes: 1 - max_nodes: 3 -``` +Source: `pkg/git/config.go` (`Config`, `AuthConfig`). 
-**development.yaml** (minimal dev environment): ```yaml -project_name: nebari-dev -provider: local -domain: nebari.local - -local: - kube_context: "k3d-nebari-dev" -``` - - - -## Configuration Validation - -Use `nic validate` to check your configuration before deployment: - -```bash -# Validate configuration file -nic validate -f config.yaml - -# Example output: -# ✅ Configuration valid -# -# Summary: -# Provider: AWS (us-west-2) -# Project: nebari-prod -# Domain: nebari.example.com -# DNS: cloudflare -# Node Groups: 4 (general, user, worker, gpu) -``` - - - -## Environment Variables Reference - -### AWS Provider -```bash -AWS_ACCESS_KEY_ID= -AWS_SECRET_ACCESS_KEY= -AWS_SESSION_TOKEN= # Optional, for temporary credentials -AWS_PROFILE= # Optional, use named profile -AWS_REGION= # Optional, overrides config -``` - -### GCP Provider -```bash -GOOGLE_APPLICATION_CREDENTIALS= -GOOGLE_CREDENTIALS= # Alternative to file path -GOOGLE_PROJECT= # Optional, overrides config -``` +git_repository: + url: "git@github.com:my-org/my-gitops-repo.git" # SSH, HTTPS, or file:// path + branch: main # default: "main" + path: "clusters/my-nebari" # optional subdirectory -### Azure Provider -```bash -AZURE_CLIENT_ID= -AZURE_CLIENT_SECRET= -AZURE_TENANT_ID= -AZURE_SUBSCRIPTION_ID= -``` - -### Hetzner Provider -```bash -HETZNER_TOKEN= # Required, Hetzner Cloud API token -``` - -### Cloudflare DNS Provider -```bash -CLOUDFLARE_API_TOKEN= # Recommended -# OR (legacy) -CLOUDFLARE_API_KEY= -CLOUDFLARE_EMAIL= -``` + auth: # NIC's write credentials + ssh_key_env: GIT_SSH_PRIVATE_KEY # name of env var holding the PEM-encoded key + # OR for HTTPS: + # token_env: GIT_TOKEN -### Local Provider -```bash -KUBECONFIG= # Optional, default: ~/.kube/config + # Optional: separate read-only credentials for ArgoCD (falls back to `auth` when unset) + # argocd_auth: + # ssh_key_env: ARGOCD_SSH_KEY ``` +Notes: +- `file://` URLs are valid. 
Combined with `InfraSettings.SupportsLocalGitOps = true` (currently only the local provider), this enables a zero-credential GitOps workflow for development. +- When `git_repository:` is omitted on a provider that supports local GitOps, NIC auto-creates `/tmp/nebari-gitops-` and points ArgoCD at it. +- When `git_repository:` is omitted on a provider that does **not** support local GitOps (e.g., AWS), the deploy continues but the GitOps bootstrap is skipped. +- The CLI scrubs the `auth:` and `argocd_auth:` blocks from any copy of the config it writes into the repo (`scrubSensitiveFields` in `cmd/nic/deploy.go`). -## Best Practices - -### Security -1. **Never commit secrets to git**: Use `.env` file (gitignored) or CI/CD secret management -2. **Use restrictive CIDR blocks**: Limit `authorized_ip_ranges` to known networks -3. **Enable private clusters**: Set `private_cluster_enabled: true` for production when possible -4. **Use permissions boundaries**: Apply `permissions_boundary` in enterprise environments -5. **Rotate credentials regularly**: Update API tokens and service account keys periodically - -### High Availability -1. **Multi-AZ deployment**: Specify multiple `availability_zones` (minimum 3 for production) -2. **Adequate min_nodes**: Set `min_nodes >= 3` for general node group in production -3. **Node group redundancy**: Use multiple node groups for different workload types - -### Cost Optimization -1. **Use spot/preemptible for batch workloads**: Save 60-90% on compute costs -2. **Right-size instances**: Start small, scale up based on actual usage -3. **Set appropriate max_nodes**: Prevent runaway scaling costs -4. **Use tags for cost allocation**: Track spending by team, project, environment +--- -### Scalability -1. **Autoscaling ranges**: Set `min_nodes` for baseline, `max_nodes` for peak capacity -2. **Use taints for specialized workloads**: Ensure GPU/high-memory nodes only used when needed -3. 
**Monitor node utilization**: Adjust instance types and scaling limits based on metrics +## 6. Environment Variables +Loaded by `godotenv` from `.env` (gitignored) at startup. Used for credentials and runtime options. +| Variable | Used by | Purpose | +|----------|---------|---------| +| `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_REGION` | AWS provider | Standard AWS SDK credentials | +| `HCLOUD_TOKEN` | Hetzner provider | Hetzner Cloud API token | +| `CLOUDFLARE_API_TOKEN` | Cloudflare DNS | Zone:Read + DNS:Edit on the configured zone | +| `GIT_SSH_PRIVATE_KEY` (or whatever you point `git_repository.auth.ssh_key_env` at) | `pkg/git` | SSH private key in PEM form | +| `GIT_TOKEN` (or whatever you point `git_repository.auth.token_env` at) | `pkg/git` | Personal access token for HTTPS git URLs | +| `KUBECONFIG` | `existing` provider, `nic kubeconfig` | Kubeconfig path (used when `cluster.existing.kubeconfig` is empty) | +| `OTEL_EXPORTER` | `pkg/telemetry` | `console` (default), `otlp`, `both`, `none` | +| `OTEL_ENDPOINT` | `pkg/telemetry` | OTLP endpoint (default: `localhost:4317`) | -**Last Updated**: 2026-03-27 -**NIC Version**: v0.1.0 -**Source**: Generated from pkg/config/config.go and pkg/dnsprovider/*/config.go +`.env.example` in the repo root lists the variables NIC looks at; copy to `.env` and fill in the values you need. 
diff --git a/docs/design-doc/appendix/17-appendix.md b/docs/design-doc/appendix/17-appendix.md index e4320eb4..79d91e36 100644 --- a/docs/design-doc/appendix/17-appendix.md +++ b/docs/design-doc/appendix/17-appendix.md @@ -1,73 +1,89 @@ # Appendix -### 14.1 Glossary +## 17.1 Glossary | Term | Definition | | ----------------- | ----------------------------------------------------- | | **NIC** | Nebari Infrastructure Core - this project | -| **LGTM** | Loki, Grafana, Tempo, Mimir - observability stack | +| **LGTM** | Loki, Grafana, Tempo, Mimir - observability stack (planned; not yet deployed by NIC) | | **CRD** | Custom Resource Definition - Kubernetes API extension | | **HTTPRoute** | Kubernetes Gateway API resource for HTTP routing | | **OIDC** | OpenID Connect - authentication protocol | | **OTLP** | OpenTelemetry Protocol - telemetry data format | | **ArgoCD** | GitOps continuous deployment tool | | **cert-manager** | Kubernetes certificate management | -| **Envoy Gateway** | Modern ingress controller using Gateway API | +| **Envoy Gateway** | Kubernetes Gateway API implementation | | **Keycloak** | Open-source identity and access management | +| **NebariApp** | CRD reconciled by the Nebari Operator (developed out-of-tree at `nebari-dev/nebari-operator`) | +| **InfraSettings** | Provider-shaped capability struct returned by `provider.InfraSettings(cfg)`; the seam that lets CLI / `pkg/argocd` avoid branching on provider name | -### 14.2 Decision Log +## 17.2 Decision Log -| Date | Decision | Rationale | -| ---------- | ----------------------------------------- | -------------------------------------------- | -| 2025-01-30 | Clean break from old Nebari | 7 years of lessons, avoid legacy complexity | -| 2025-01-30 | Use OpenTofu with terraform-exec | Battle-tested modules, community ecosystem | -| 2025-01-30 | Deploy foundational software via ArgoCD | GitOps best practices, dependency management | -| 2025-01-30 | Build Nebari Operator for app integration | Automate 
repetitive auth/o11y/routing tasks | -| 2025-01-30 | Use LGTM stack for observability | Industry standard, proven at scale | -| 2025-01-30 | Use Envoy Gateway for ingress | Future-proof, Gateway API, advanced features | +| Date | Decision | Rationale | +| ---- | -------- | --------- | +| 2025-01-30 | Clean break from old Nebari | Seven years of lessons; avoid legacy complexity | +| 2025-01-30 | OpenTofu for the AWS provider via `terraform-exec` | Battle-tested EKS module, broad ecosystem familiarity | +| 2025-01-30 | Deploy foundational software via ArgoCD | GitOps best practices, dependency management via sync waves | +| 2025-01-30 | Deploy the Nebari Operator (developed out-of-tree) for app integration | Automate auth/routing for `NebariApp` CRs; keep NIC focused on infrastructure | +| 2025-01-30 | Envoy Gateway for ingress | Future-proof, Kubernetes Gateway API | +| 2026-?? | `provider.InfraSettings` for provider-shaped capabilities | Avoid `switch` on provider name in CLI/library code; new providers don't require edits elsewhere | +| 2026-?? | Hetzner provider via `hetzner-k3s` binary (no tofu) | The `Provider` interface is the contract; each provider picks the right tool | +| 2026-04-15 | [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md): Out-of-tree provider plugins | Smaller core binary, supported path for private (e.g., ASCOT DNS) integrations | -### 14.3 Success Criteria +The specific commit dates for the 2026 entries can be reconstructed from git history; the entries above are placeholders for the decisions themselves. -**v1.0 Success Criteria:** +## 17.3 Success Criteria -1. ✅ Deploy production Kubernetes on AWS, GCP, Azure, Local -2. ✅ All 9 foundational components deploy via ArgoCD -3. ✅ Nebari Operator automates app integration (auth, o11y, routing) -4. ✅ NIC fully instrumented with OpenTelemetry -5. ✅ Documentation complete (user guides, API reference) -6. ✅ All provider tests passing -7. 
✅ Performance: AWS cluster deployment <20 minutes +**Current alpha-line success (today's bar):** -**User Success Criteria:** +- ✅ AWS and Hetzner cluster providers functional +- ✅ Local Kind workflow via `make localkind-up` +- ✅ `existing` provider for adopting clusters NIC didn't provision +- ✅ Foundational stack syncing via ArgoCD: cert-manager, Envoy Gateway, Keycloak (+ postgresql), MetalLB (conditional), OpenTelemetry Collector, Nebari Operator, Nebari Landing Page +- ✅ NIC instrumented with OpenTelemetry (with documented exemptions; operation-granularity wrappers on `TerraformExecutor` are tracked as outstanding work) +- ✅ Unit tests + lint + race + coverage in CI -- ✅ User can deploy platform with one command: `nic deploy` -- ✅ User can register app with one CRD: `NebariApplication` -- ✅ User gets auth, o11y, routing automatically (no manual config) -- ✅ User can access Grafana dashboards immediately -- ✅ User can troubleshoot via traces/logs/metrics +**v1.0 success (planned):** -### 14.4 Risks and Mitigations +- ⏳ GCP and Azure providers functional (or replaced by out-of-tree plugins per ADR-0004) +- ⏳ LGTM observability backend deployed by NIC +- ⏳ Documented upgrade paths between releases +- ⏳ End-to-end test coverage across providers +- ⏳ AWS cluster deploy under 20 minutes from a fresh account -| Risk | Impact | Mitigation | -| ------------------------------- | -------- | -------------------------------------------------------------------------------- | -| **Cloud API changes** | High | Pin SDK versions, comprehensive integration tests, monitor API deprecations | -| **Kubernetes version skew** | Medium | Test against multiple K8s versions (N, N-1, N-2), document supported versions | -| **ArgoCD application failures** | High | Health checks, retry logic, rollback capability, manual override option | -| **State corruption** | Critical | Atomic writes, backups before writes, state versioning, validation before save | -| **Certificate expiration** | Medium 
| cert-manager auto-renewal, monitoring alerts, runbook for manual renewal | -| **Keycloak downtime** | High | HA deployment (2+ replicas), external database, backup/restore procedures | -| **Operator bugs** | Medium | Thorough testing, dry-run mode, status reporting, manual CRD delete escape hatch | +**User success criteria:** -### 14.5 References +- ✅ One command to deploy: `nic deploy -f config.yaml` +- ✅ One CR per app to register with the platform: `NebariApp` +- ✅ Auth and routing wired automatically by the operator +- ⏳ Grafana dashboards immediately available (depends on LGTM) +- ⏳ End-to-end troubleshooting via traces/logs/metrics (depends on LGTM) + +## 17.4 Risks and Mitigations + +| Risk | Impact | Mitigation | +| ---- | ------ | ---------- | +| Cloud API changes | High | Pinned SDK versions; integration tests against LocalStack; monitor API deprecations | +| Kubernetes version skew | Medium | Test against N, N-1, N-2; document supported versions per provider | +| ArgoCD application failures | High | Sync waves enforce ordering; ArgoCD self-heal handles drift; manual `argocd app sync` as override | +| State corruption (AWS) | Critical | S3 versioning enabled; native lockfile-based locking; validation at parse time | +| Certificate expiration | Medium | cert-manager auto-renewal; alerts via OpenTelemetry Collector (backend pending) | +| Keycloak downtime | High | Configurable replica count; external Postgres backing store; backup/restore is roadmap | +| Operator bugs | Medium | Operator is out-of-tree at `nebari-dev/nebari-operator` with its own test surface; NIC pins a known-good version | +| Stuck S3 lockfile after Ctrl-C | Medium | Known issue [#63](https://github.com/nebari-dev/nebari-infrastructure-core/issues/63); `nic unlock` tracked at [#64](https://github.com/nebari-dev/nebari-infrastructure-core/issues/64) | + +## 17.5 References - [Kubernetes Gateway API](https://gateway-api.sigs.k8s.io/) - [OpenTelemetry Go 
SDK](https://opentelemetry.io/docs/languages/go/) - [ArgoCD Documentation](https://argo-cd.readthedocs.io/) - [Keycloak Documentation](https://www.keycloak.org/documentation) -- [Grafana LGTM Stack](https://grafana.com/oss/) - [cert-manager Documentation](https://cert-manager.io/docs/) - [Envoy Gateway](https://gateway.envoyproxy.io/) -- [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) +- [`nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator) - out-of-tree operator that reconciles `NebariApp` CRs +- `nebari-dev/eks-cluster/aws` v0.4.0 (OpenTofu Registry) - upstream Terraform module used by NIC's AWS provider; see `pkg/provider/aws/templates/main.tf` +- [`hetzner-k3s`](https://github.com/vitobotta/hetzner-k3s) - binary used by NIC's Hetzner provider +- [ADR-0004: Out-of-Tree Provider Plugin Architecture](../../adr/0004-out-of-tree-provider-plugins.md) --- diff --git a/docs/design-doc/architecture/01-introduction.md b/docs/design-doc/architecture/01-introduction.md index bfde3a64..3ac59bf5 100644 --- a/docs/design-doc/architecture/01-introduction.md +++ b/docs/design-doc/architecture/01-introduction.md @@ -2,109 +2,113 @@ ### 1.1 Purpose -This document describes the architectural design for Nebari Infrastructure Core (NIC) v2.0, a clean-break redesign that applies seven years of lessons learned from developing and deploying Nebari. NIC is a standalone command line tool that provides opinionated Kubernetes deployments with a complete foundational software stack across AWS, GCP, Azure, and on-premises environments. +This document describes the architectural design for Nebari Infrastructure Core (NIC), a clean-break redesign that applies seven years of lessons learned from developing and deploying Nebari. NIC is a standalone command-line tool that provisions Kubernetes clusters and bootstraps an opinionated foundational software stack on top of them. ### 1.2 Core Design Principles -1. 
**Opinionated by Default**: Best practices from 7 years of production Nebari deployments -2. **Complete Platform**: Kubernetes + foundational software (auth, o11y, routing, GitOps) -3. **Declarative Infrastructure**: Declare desired state, OpenTofu reconciles to match -4. **OpenTofu Modules**: Leverage battle-tested Terraform/OpenTofu modules for infrastructure -5. **terraform-exec Orchestration**: Go CLI controls OpenTofu via terraform-exec library -6. **Standard State Management**: Terraform state files with remote backends (S3, GCS, Azure Blob) -7. **Multi-Cloud Consistency**: Common platform experience across all providers -8. **Observability-First**: OpenTelemetry instrumentation and LGTM stack built-in -9. **Application-Centric**: Nebari Operator automates app registration with auth, o11y, routing -10. **GitOps Native**: ArgoCD for all foundational software deployment +1. **Opinionated by Default**: Best practices from seven years of production Nebari deployments +2. **Complete Platform**: Kubernetes plus foundational software (auth, routing, GitOps, certs) +3. **Provider Abstraction**: Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS, native CLI/SDK for Hetzner, Kind for local dev. The `provider.Provider` interface is the contract, not a single IaC tool. +4. **Declarative Infrastructure**: Declare desired state in `nebari-config.yaml`; the configured provider reconciles to match +5. **GitOps Native**: ArgoCD is the deployment mechanism for all foundational software +6. **Standard State Management (where applicable)**: AWS uses Terraform state in S3 with native lockfile-based locking; non-tofu providers manage state in tool-specific ways +7. **Observability-First**: OpenTelemetry instrumentation in library code, structured `slog` logging in the CLI layer +8. 
**Application-Centric**: The Nebari Operator (deployed by NIC, developed out-of-tree) automates app registration with auth and routing ### 1.3 What NIC Provides -**Complete Platform Stack:** +**Layered Platform Stack:** ``` ┌─────────────────────────────────────────────────────────────┐ -│ Application Layer (Managed by Nebari Operator) │ -│ - User applications registered via nebari-application CRD │ -│ - Auto-configured auth, o11y, routing │ +│ Application Layer (Managed by Nebari Operator) │ +│ - User applications registered via NebariApp CRD │ +│ - Auto-configured auth and routing │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ Foundational Software (Deployed by ArgoCD) │ -│ ├── Keycloak (Authentication & Authorization) │ -│ ├── LGTM Stack (Loki, Grafana, Tempo, Mimir) │ -│ ├── OpenTelemetry Collector (Metrics, Logs, Traces) │ -│ ├── cert-manager (TLS Certificate Management) │ -│ ├── Envoy Gateway (Ingress & API Gateway) │ -│ └── ArgoCD (GitOps Continuous Deployment) │ +│ Foundational Software (Deployed by ArgoCD) │ +│ ├── cert-manager + cluster-issuers (TLS automation) │ +│ ├── Envoy Gateway + gateway-config + httproutes (ingress) │ +│ ├── Keycloak + postgresql (authentication) │ +│ ├── MetalLB + metallb-config (LB, local/bare-metal only) │ +│ ├── OpenTelemetry Collector (telemetry pipeline) │ +│ ├── Nebari Operator (NebariApp reconciler, out-of-tree) │ +│ └── Nebari Landing Page (service catalog UI) │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ Kubernetes Cluster (Deployed by NIC) │ -│ - Production-ready configuration │ -│ - Multi-zone, highly available │ -│ - Observability & security best practices │ +│ Kubernetes Cluster (Provisioned by NIC) │ +│ - Provider-specific configuration │ +│ - Multi-AZ on cloud providers; single-node on local │ +│ - StorageClass and LB integration per provider 
│ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ Cloud Infrastructure (Provisioned by NIC via OpenTofu) │ -│ - VPC/Networking │ -│ - Managed Kubernetes (EKS/GKE/AKS/K3s) │ -│ - Node pools & auto-scaling │ -│ - Storage, security, IAM │ +│ Cluster Provider (per-provider backing tool) │ +│ - AWS: OpenTofu (EKS via nebari-dev/eks-cluster module) │ +│ - Hetzner: hetzner-k3s binary │ +│ - Local: Kind (driven by `make localkind-up`) │ +│ - Existing: no-op adapter for pre-provisioned clusters │ +│ - GCP, Azure: stubs, not yet implemented │ └─────────────────────────────────────────────────────────────┘ ``` +A full LGTM (Loki / Grafana / Tempo / Mimir) observability backend is **not** currently deployed by NIC. Only the OpenTelemetry Collector is shipped today; building out the backend is on the roadmap (see [`13-milestones.md`](../operations/13-milestones.md)). + ### 1.4 Scope **In Scope:** -- Cloud infrastructure provisioning via OpenTofu modules (VPC, managed K8s, node pools, storage, IAM) -- Kubernetes cluster deployment (production-ready configuration) -- Foundational software deployment (Keycloak, LGTM, cert-manager, Envoy, ArgoCD) -- Nebari Kubernetes Operator (nebari-application CRD) -- Supported platforms: AWS (EKS), GCP (GKE), Azure (AKS), On-Prem (K3s) -- Configuration via declarative YAML -- Terraform state management with remote backends -- OpenTelemetry instrumentation throughout -- Structured logging via slog +- Cloud infrastructure provisioning, by the configured cluster provider's chosen tool +- Kubernetes cluster deployment (production-ready configuration on cloud providers) +- Foundational software deployment via ArgoCD from a generated GitOps repository +- A DNS provider abstraction (currently with a Cloudflare implementation) +- Configuration via declarative YAML (`nebari-config.yaml`) +- OpenTelemetry instrumentation in library code +- Structured logging via `slog` at 
the CLI layer +- The Nebari Operator is deployed by NIC but developed in a separate repository (`github.com/nebari-dev/nebari-operator`) **Out of Scope:** -- Application deployment (handled by users via ArgoCD or kubectl) +- Application deployment beyond foundational software (handled by users via ArgoCD or kubectl) - Legacy Nebari compatibility (clean break) -- Custom cloud SDK implementations (using OpenTofu/Terraform ecosystem) - Managed database services (users provision separately) -- CI/CD pipelines (beyond ArgoCD for foundational software) +- General-purpose CI/CD pipelines (beyond ArgoCD for foundational software) +- Implementing the Nebari Operator itself (lives out-of-tree) -### 1.5 Lessons Learned from 7 Years of Nebari +### 1.5 Lessons Learned from Seven Years of Nebari **What We're Keeping:** -- ✅ Opinionated platform approach (reduces decision fatigue) -- ✅ Multi-cloud support (AWS, GCP, Azure, Local) -- ✅ Declarative configuration (infrastructure as code) -- ✅ Authentication-first design (Keycloak integration) -- ✅ Observability focus (monitoring from day one) +- Opinionated platform approach (reduces decision fatigue) +- Multi-cluster-provider support (AWS, Hetzner, local Kind today; GCP and Azure planned) +- Declarative configuration (infrastructure as code) +- Authentication-first design (Keycloak integration) +- Observability focus (telemetry instrumentation from day one) **What We're Changing:** -- ❌ **Custom Terraform wrappers** → ✅ terraform-exec orchestration with Go CLI -- ❌ **Staged deployment fragmentation** → ✅ Unified deployment -- ❌ **Manual app integration** → ✅ Operator-automated registration -- ❌ **Scattered observability** → ✅ Unified LGTM stack + OpenTelemetry -- ❌ **Ad-hoc ingress** → ✅ Envoy Gateway with consistent API -- ❌ **Implicit dependencies** → ✅ ArgoCD dependency graph +| Old Nebari | NIC | +|------------|-----| +| Custom Terraform wrappers | terraform-exec orchestration, per-provider tool choice | +| Staged deployment 
fragmentation | Unified deployment via `nic deploy` | +| Manual app integration | Operator-automated `NebariApp` registration | +| Ad-hoc ingress | Envoy Gateway with Kubernetes Gateway API | +| Implicit dependencies | ArgoCD app-of-apps with sync waves | **Key Insights Applied:** -| Insight | Design Impact | -| ---------------------------------------- | ----------------------------------------------------- | -| **Users want "batteries included"** | Foundational software deployed by default | -| **Auth integration is tedious** | Operator automates OAuth client creation | -| **Observability is an afterthought** | LGTM stack + OpenTelemetry built-in | -| **Certificate management is painful** | cert-manager + automated ingress TLS | -| **Terraform modules are battle-tested** | Leverage community OpenTofu/Terraform modules | -| **Multi-cloud drift is real** | Provider abstraction enforces consistency | -| **GitOps reduces deployment errors** | ArgoCD for all foundational components | +| Insight | Design Impact | +| ------- | ------------- | +| Users want "batteries included" | Foundational software deployed by default | +| Auth integration is tedious | Operator automates OAuth client creation | +| Observability is an afterthought | OpenTelemetry Collector deployed by default (backend pending) | +| Certificate management is painful | cert-manager plus automated route TLS | +| Battle-tested modules beat hand-rolled IaC | AWS uses the upstream `nebari-dev/eks-cluster` module | +| Multi-cloud drift is real | The `Provider` interface enforces the consistent contract | +| GitOps reduces deployment errors | ArgoCD for all foundational components | + +For the rationale behind per-provider tool choice and the planned out-of-tree plugin direction, see [ADR-0004: Out-of-Tree Provider Plugin Architecture](../../adr/0004-out-of-tree-provider-plugins.md). 
--- diff --git a/docs/design-doc/architecture/02-system-overview.md b/docs/design-doc/architecture/02-system-overview.md index 88962dc9..5eaf824c 100644 --- a/docs/design-doc/architecture/02-system-overview.md +++ b/docs/design-doc/architecture/02-system-overview.md @@ -6,131 +6,171 @@ ``` ┌─────────────────────────────────────────────────────────────┐ -│ 1. User defines config.yaml │ -│ - Cloud provider (aws/gcp/azure/local) │ -│ - Cluster size and node pools │ -│ - Foundational software configuration │ -│ - Domain and TLS settings │ +│ 1. User defines nebari-config.yaml │ +│ - cluster.<provider>: ... (aws | hetzner | │ +│ local | existing) │ +│ - dns.<provider>: ... (optional, cloudflare) │ +│ - git_repository: ... (optional on local) │ +│ - certificate: ... (selfsigned | letsencrypt) │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 2. NIC CLI parses config and plans deployment │ -│ $ nic deploy -f config.yaml │ +│ 2. NIC CLI parses config and dispatches to a provider │ +│ $ nic deploy -f config.yaml │ +│ - cmd/nic parses YAML into pkg/config.NebariConfig │ +│ - Looks up the provider from pkg/registry.Registry │ +│ - Calls provider.Deploy(ctx, projectName, cluster, opts) │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 3. Cloud Infrastructure Provisioning (OpenTofu) │ -│ - Go CLI invokes terraform-exec │ -│ - OpenTofu executes HCL modules │ -│ ├── VPC/Network (subnets, security groups, NAT) │ -│ ├── Managed Kubernetes (EKS/GKE/AKS/K3s) │ -│ ├── Node Pools (general, compute, gpu) │ -│ ├── Storage (EFS/Filestore/Azure Files) │ -│ └── IAM (service accounts, roles, policies) │ +│ 3. Cluster Provisioning (provider-specific) │ +│ - AWS: pkg/tofu.Setup → OpenTofu init/plan/apply │ +│ using embedded templates that call the │ +│ upstream nebari-dev/eks-cluster Terraform │ +│ module.
State lives in S3 with native │ +│ lockfile-based locking. │ +│ - Hetzner: shells out to the hetzner-k3s binary against │ +│ the Hetzner Cloud API. No tofu involved. │ +│ - Local: stub - user runs `make localkind-up`, which │ +│ creates a Kind cluster and then invokes nic. │ +│ - Existing: no-op; uses kubeconfig + context from config.│ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 4. Kubernetes Bootstrap (via OpenTofu kubernetes provider) │ -│ ├── Namespaces (nebari-system, monitoring, ingress) │ -│ ├── Storage Classes (persistent volumes) │ -│ ├── RBAC (cluster roles, service accounts) │ -│ ├── Network Policies (namespace isolation) │ -│ └── Priority Classes (workload prioritization) │ +│ 4. GitOps Bootstrap (pkg/argocd, pkg/git) │ +│ - Renders ArgoCD app manifests into a Git repository │ +│ (remote or file://) configured via git_repository │ +│ - For providers with InfraSettings.SupportsLocalGitOps= │ +│ true (local/Kind), auto-creates a local repo if none │ +│ is configured │ +│ - Commits and pushes (or commits locally for file://) │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 5. ArgoCD Deployment (Helm via OpenTofu) │ -│ - Installed in nebari-system namespace │ -│ - Configured with foundational-software repo │ -│ - Sets up app-of-apps pattern │ +│ 5. ArgoCD Install (pkg/argocd, pkg/helm) │ +│ - NIC installs ArgoCD via the embedded Helm Go SDK │ +│ (helm.sh/helm/v3/pkg/action), not via a Terraform │ +│ helm_release resource │ +│ - Configures Keycloak OIDC for SSO │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 6. 
Foundational Software (ArgoCD Applications) │ -│ ├── cert-manager (TLS automation) │ -│ ├── Envoy Gateway (ingress controller) │ -│ ├── OpenTelemetry Collector (telemetry pipeline) │ -│ ├── Mimir (metrics storage) │ -│ ├── Loki (log aggregation) │ -│ ├── Tempo (trace storage) │ -│ ├── Grafana (visualization) │ -│ └── Keycloak (authentication) │ +│ 6. Foundational Services (ArgoCD Applications) │ +│ Manifests live under pkg/argocd/templates/apps/ and are │ +│ rendered into the GitOps repo. ArgoCD then syncs them │ +│ via a root app-of-apps: │ +│ ├── cert-manager + cluster-issuers + certificates │ +│ ├── Envoy Gateway + gateway-config + httproutes │ +│ ├── postgresql + Keycloak │ +│ ├── metallb + metallb-config (only when needed) │ +│ ├── opentelemetry-collector │ +│ ├── nebari-operator (kustomized from upstream repo) │ +│ └── nebari-landingpage │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 7. Nebari Operator Deployment │ -│ - Installed via ArgoCD │ -│ - Watches nebari-application CRD │ -│ - Registers apps with Keycloak, Envoy, o11y │ +│ 7. DNS + Endpoint Surfacing (optional) │ +│ - pkg/endpoint watches the Envoy Gateway Service for an │ +│ assigned load-balancer hostname or IP │ +│ - If dns.<provider> is configured (Cloudflare today), │ +│ records are provisioned automatically │ +│ - Otherwise the CLI prints exact A/CNAME instructions │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ -│ 8. Platform Ready │ -│ ✅ Kubernetes cluster running │ -│ ✅ Foundational software operational │ -│ ✅ Auth, o11y, routing configured │ -│ ✅ Users can deploy applications │ +│ 8.
Platform Ready │ +│ - Kubernetes cluster running (or adopted) │ +│ - Foundational software syncing via ArgoCD │ +│ - Auth and routing configured │ +│ - Users can install NebariApp software packs │ └─────────────────────────────────────────────────────────────┘ ``` ### 2.2 Component Breakdown -**NIC CLI (`cmd/nic`):** +The actual repository layout is captured in [`AGENTS.md`](../../../AGENTS.md). Key packages: -- Command-line interface for platform management -- Commands: `deploy`, `destroy`, `status`, `validate`, `plan` -- Orchestrates OpenTofu via terraform-exec library -- OpenTelemetry tracing for all operations -- Structured logging via slog +**`cmd/nic/` (CLI)** -**terraform-exec Wrapper (`pkg/tofu`):** +- Cobra-based commands: `deploy`, `destroy`, `validate`, `kubeconfig`, `version`. There is no `status` or `plan` subcommand today. +- Reads `.env` via `godotenv` and initializes OpenTelemetry via `pkg/telemetry`. +- Owns the `slog` JSON logger. Library code (under `pkg/`) does not log. +- Owns the status-channel handler (`cmd/nic/status_handler.go`); see Section 2.4. -- Programmatic control of OpenTofu execution -- Init, Plan, Apply, Destroy, Output methods -- Working directory and state management -- OpenTelemetry instrumented +**`pkg/provider/` (Cluster providers)** -**OpenTofu Modules (`terraform/modules`):** +- `pkg/provider/provider.go` defines the `Provider` interface (`Name`, `Validate`, `Deploy`, `Destroy`, `GetKubeconfig`, `Summary`, `InfraSettings`) and the `InfraSettings` capability struct (`StorageClass`, `NeedsMetalLB`, `LoadBalancerAnnotations`, `MetalLBAddressPool`, `KeycloakBasePath`, `HTTPSPort`, `EFSStorageClass`, `SupportsLocalGitOps`). +- One sub-package per cluster provider: `aws/`, `hetzner/`, `local/`, `existing/`, plus `gcp/` and `azure/` stubs (registered but their methods return "not yet implemented"). +- AWS-specific Terraform templates live under `pkg/provider/aws/templates/` and are embedded into the binary via `go:embed`. 
-- `aws/` - VPC, EKS, EFS modules -- `gcp/` - VPC, GKE, Filestore modules -- `azure/` - VNet, AKS, Azure Files modules -- `local/` - K3s module -- `kubernetes/` - Bootstrap resources -- `argocd/` - ArgoCD and foundational apps +**`pkg/dnsprovider/` (DNS providers)** -**Kubernetes Management (via OpenTofu kubernetes provider):** +- `pkg/dnsprovider/provider.go` defines the `DNSProvider` interface (`Name`, `ProvisionRecords`, `DestroyRecords`). +- `pkg/dnsprovider/cloudflare/` is the only implementation today. -- Bootstrap resources (namespaces, RBAC, storage classes) -- ArgoCD installation via Helm provider -- Foundational software ArgoCD applications +**`pkg/registry/` (Unified provider registry)** -**Foundational Software (deployed by ArgoCD):** +- `registry.Registry` holds two `ProviderList` instances: `ClusterProviders` (a `ProviderList[provider.Provider]`) and `DNSProviders` (a `ProviderList[dnsprovider.DNSProvider]`). +- All providers are registered explicitly in `cmd/nic/main.go` `init()`. No blank imports or `init()` magic. -- ArgoCD application definitions -- Configuration templates for each component -- Health checks and readiness gates -- Dependency ordering (cert-manager first, then Envoy, etc.) +**`pkg/tofu/` (terraform-exec wrapper)** -**Nebari Operator (`pkg/operator`):** +- Single file `pkg/tofu/tofu.go` defines `TerraformExecutor`, which embeds `*tfexec.Terraform` plus a temp working dir and an `afero.Fs`. +- `Setup(ctx, templates fs.FS, tfvars any)` extracts embedded templates, downloads the OpenTofu binary via `tofudl` with caching at `~/.cache/nic/tofu/`, sets `TF_PLUGIN_CACHE_DIR`, writes `terraform.tfvars.json`, and returns the executor. +- `Init`, `Plan`, `Apply`, `Destroy` call the `*JSON` variants of `tfexec` and stream output through the status channel (Section 2.4). `Output` uses the standard tfexec entry point. 
-- Kubernetes operator built with controller-runtime -- Reconciles nebari-application CRD -- Integrates with Keycloak, Envoy Gateway, Grafana -- Automatic OAuth client creation, route configuration, dashboard provisioning +**`pkg/config/` (Config parsing)** + +- `pkg/config/config.go` defines `NebariConfig` with fields `ProjectName`, `Domain`, `Cluster *ClusterConfig`, `DNS *DNSConfig`, `GitRepository *git.Config`, `Certificate *CertificateConfig`. +- `ClusterConfig` and `DNSConfig` both use the discriminator pattern: a single inline map keyed by provider name. Provider-specific config is opaque to the config package and is decoded by the provider itself. + +**`pkg/argocd/` (ArgoCD orchestration)** + +- Installs ArgoCD via the embedded Helm Go SDK (`pkg/helm`), not via a Terraform `helm_release`. +- Renders the foundational app-of-apps from templates under `pkg/argocd/templates/apps/` and `pkg/argocd/templates/manifests/`. Apps include cert-manager, cluster-issuers, certificates, envoy-gateway, gateway-config, httproutes, keycloak, postgresql, metallb, metallb-config, opentelemetry-collector, nebari-landingpage, nebari-operator, and the root app. +- The nebari-operator app references the upstream repository (`github.com/nebari-dev/nebari-operator`) via Kustomize; the operator's source code does not live in this repo. + +**`pkg/dns`/`pkg/endpoint`/`pkg/git`/`pkg/helm`/`pkg/kubeconfig`/`pkg/status`/`pkg/telemetry`** + +- `pkg/endpoint` waits for the Envoy Gateway `Service` to receive an LB hostname or IP, so the CLI can either provision DNS or print manual instructions. +- `pkg/git` clones, commits, and pushes the GitOps repo (including `file://` local paths). +- `pkg/helm` is a thin wrapper around `helm.sh/helm/v3/pkg/action` used by `pkg/argocd`. +- `pkg/status` is the in-process status channel used to surface user-visible progress from library code without violating the "no `slog` in `pkg/`" rule. 
+- `pkg/telemetry` wires up the OpenTelemetry tracer provider, with exporters selected via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`). ### 2.3 Why This Architecture? -| Design Choice | Rationale | -| ------------------------------------ | ----------------------------------------------------------------------- | -| **OpenTofu vs Custom SDKs** | Battle-tested modules, community ecosystem, familiar to teams | -| **terraform-exec Orchestration** | Programmatic control, OpenTelemetry instrumentation, Go integration | -| **Terraform State vs Stateless** | Standard tooling, team collaboration, ecosystem compatibility | -| **ArgoCD for Foundational Software** | GitOps best practices, dependency management, declarative updates | -| **Operator for App Registration** | Automates repetitive tasks, reduces human error, consistent integration | -| **LGTM Stack vs Custom** | Industry-standard, proven at scale, unified Grafana Labs ecosystem | -| **Envoy Gateway vs Others** | Kubernetes Gateway API, future-proof, advanced routing features | -| **Helm for ArgoCD Only** | Minimize Helm usage, ArgoCD handles rest via manifests | -| **OpenTelemetry Built-In** | Observability from day one, vendor-neutral, industry standard | +| Design Choice | Rationale | +| ------------- | --------- | +| `Provider` interface as the contract (not "Terraform everywhere") | Honest about reality: only AWS uses OpenTofu; Hetzner uses its own CLI; `local` is a Kind stub. See [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md). | +| terraform-exec for AWS | Programmatic control, JSON output for status streaming, broad ecosystem familiarity. | +| Terraform state in S3 (AWS) | Industry-standard, well-supported tooling, and native lockfile-based locking (no DynamoDB table required). | +| ArgoCD for foundational software | GitOps best practices, declarative dependency management via sync waves, self-healing. 
| +| Embedded Helm SDK for the ArgoCD install itself | Bootstraps the GitOps controller without requiring an out-of-band Helm CLI. After ArgoCD is up, everything else is GitOps. | +| Out-of-tree Nebari Operator | The operator is its own product with its own release cadence. NIC just deploys it. | +| `InfraSettings` for provider-shaped capabilities | CLI code never switches on provider name. Providers expose capabilities (e.g., `NeedsMetalLB`, `StorageClass`, `SupportsLocalGitOps`) and the rest of the system consumes them. | +| OpenTelemetry in library code, `slog` in CLI | Library code is reusable across CLI commands and (eventually) plugins. CLI is the only layer that emits human-facing logs. | + +### 2.4 The Status Channel: pkg → cmd Seam + +Library code under `pkg/` is forbidden from calling `slog`. User-visible progress instead flows through the status channel attached to `ctx`: + +``` +pkg/* (e.g., pkg/tofu, pkg/argocd) + │ + │ status.Update via status.NewWriter or status.Send + ▼ +ctx-attached chan status.Update + │ + ▼ +cmd/nic/status_handler.go + │ translates each Update into slog records + ▼ +JSON logs on stderr +``` + +This decouples library code from any specific logging backend and keeps long-running subprocesses (e.g., `tofu apply -json`) streaming live progress without requiring the producer to enumerate every interesting field. + +`pkg/status` and the byte/line-level helpers inside `pkg/tofu` (`streamThroughStatus`, `jsonLineMapper`, `mapStatusLevel`) are intentionally exempt from per-function OpenTelemetry instrumentation: spans at that granularity would dwarf the operations they describe. --- diff --git a/docs/design-doc/architecture/03-goals-and-non-goals.md b/docs/design-doc/architecture/03-goals-and-non-goals.md index 2e0610ba..43e90750 100644 --- a/docs/design-doc/architecture/03-goals-and-non-goals.md +++ b/docs/design-doc/architecture/03-goals-and-non-goals.md @@ -2,44 +2,54 @@ ### 3.1 Primary Goals -**Phase 1 Goals (MVP):** -1. 
✅ Deploy production-ready Kubernetes on AWS, GCP, Azure, Local -2. ✅ Deploy all foundational software via ArgoCD -3. ✅ Nebari Operator with basic nebari-application CRD support -4. ✅ Working auth (Keycloak), o11y (LGTM), routing (Envoy) -5. ✅ OpenTofu-based infrastructure provisioning with standard state management -6. ✅ OpenTelemetry instrumentation throughout NIC -7. ✅ Comprehensive documentation and examples - -**Phase 2 Goals (Iteration):** -1. Advanced Keycloak integration (SAML, LDAP federation) -2. Custom Grafana dashboards for NIC-deployed clusters -3. Automated backup and restore for foundational software -4. Multi-cluster support (deploy multiple clusters) -5. Cost optimization features (spot instances, autoscaling) -6. Compliance profiles (HIPAA, SOC2, PCI-DSS) -7. **Git repository provisioning** (GitHub/GitLab) with auto-generated CI/CD workflows -8. **Software stack specification** - Deploy complete stacks (databases, caching, apps) alongside foundational software -9. **Full-stack-in-one-repo** - Define platform + applications + config in single version-controlled repository -10. **Stack templates** - Pre-built configurations for common use cases (data science, ML platform, web apps) +Status icons reflect current state, not original ambition: + +- ✅ shipped +- 🟡 partially shipped +- ⏳ planned + +**Phase 1 (MVP):** + +1. ✅ Deploy production Kubernetes on **AWS (EKS)** and **Hetzner (k3s)**; ✅ local Kind clusters for development; ✅ `existing` provider to adopt a pre-provisioned cluster; ⏳ GCP and Azure providers (currently registered stubs) +2. ✅ Deploy foundational software via ArgoCD: cert-manager, cluster-issuers, certificates, Envoy Gateway, gateway-config, httproutes, postgresql, Keycloak, MetalLB (where needed), OpenTelemetry Collector, nebari-operator, nebari-landingpage +3. ✅ Nebari Operator deployed as a foundational app (operator source lives in [`nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator)) +4. 
✅ Working **auth** (Keycloak with OIDC SSO into ArgoCD) and **routing** (Envoy Gateway with Kubernetes Gateway API). 🟡 **Observability**: the OpenTelemetry Collector ships, but a full LGTM backend (Loki / Grafana / Tempo / Mimir) does not - that work is deferred. +5. ✅ Configuration-driven cluster provisioning, with per-provider backing tools (OpenTofu for AWS, hetzner-k3s for Hetzner, Kind for local, kubeconfig adoption for existing). State management is provider-specific; AWS uses S3 with native lockfile-based locking. +6. 🟡 OpenTelemetry instrumentation in library code (CLAUDE.md documents exemptions for `pkg/status` and byte/line-level helpers inside `pkg/tofu`; operation-granularity wrappers on `TerraformExecutor` are tracked as outstanding work) +7. ✅ A documented `Provider` interface and `InfraSettings` capability struct so adding a new cluster provider does not require changes to CLI or `pkg/argocd` + +**Phase 2 (Iteration):** + +1. ⏳ Full LGTM observability backend (Loki / Grafana / Tempo / Mimir) +2. ⏳ Advanced Keycloak integration (SAML, LDAP federation) +3. ⏳ Custom Grafana dashboards for NIC-deployed clusters +4. ⏳ Automated backup and restore for foundational software +5. ⏳ Multi-cluster support (deploy multiple clusters from one CLI) +6. ⏳ Cost optimization features (spot instances, autoscaling policies) +7. ⏳ Compliance profiles (HIPAA, SOC2, PCI-DSS) +8. ⏳ Auto-provisioning of Git repositories and CI workflows (consumption of an existing repo is already supported via `git_repository:`) +9. ⏳ Software pack specification - declare full stacks (databases, caching, apps) alongside foundational software +10. ⏳ Stack templates for common use cases (data science, ML platform, web apps) +11. ⏳ Out-of-tree provider plugins as described in [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md) **Future Goals:** -1. Service mesh integration (Istio/Linkerd) -2. Advanced security (OPA/Gatekeeper policies) -3. Edge deployment support -4. 
Hybrid cloud networking -5. AI/ML workload optimizations (GPU pools, model serving) + +1. ⏳ Service-mesh integration (Istio/Linkerd) +2. ⏳ Advanced security (OPA/Gatekeeper policies) +3. ⏳ Edge deployment support +4. ⏳ Hybrid cloud networking +5. ⏳ AI/ML workload optimizations (GPU pools, model serving) ### 3.2 Explicit Non-Goals **Not Doing:** + - ❌ Backward compatibility with old Nebari (clean break) -- ❌ Supporting Terraform-based deployments -- ❌ Managed database services (RDS/CloudSQL/etc.) -- ❌ Application deployment (beyond foundational software) +- ❌ Managed database services (RDS, CloudSQL, etc.) +- ❌ User application deployment (beyond foundational software). Apps install themselves via ArgoCD with `NebariApp` CRs. - ❌ Windows node pools (Linux only) -- ❌ Bare-metal Kubernetes (except K3s) -- ❌ Custom Kubernetes distributions (stick to EKS/GKE/AKS/K3s) +- ❌ Custom Kubernetes distributions. The supported distributions are EKS (AWS), k3s (Hetzner via hetzner-k3s), Kind (local dev), and any pre-existing CNCF-conformant cluster (via the `existing` provider). - ❌ Non-standard authentication (only Keycloak) +- ❌ Forcing every provider through OpenTofu. The `Provider` interface is the contract; the backing tool is provider-specific. --- diff --git a/docs/design-doc/architecture/04-key-decisions.md b/docs/design-doc/architecture/04-key-decisions.md index 457a562f..be0c7b2f 100644 --- a/docs/design-doc/architecture/04-key-decisions.md +++ b/docs/design-doc/architecture/04-key-decisions.md @@ -2,267 +2,172 @@ ### 4.1 Decision: Unified Deployment (Not Staged) -**Context:** Old Nebari had 6+ stages (terraform-state, infrastructure, kubernetes-initialize, ingress, keycloak, etc.) +**Context:** Old Nebari had six or more stages (terraform-state, infrastructure, kubernetes-initialize, ingress, keycloak, etc.). -**Decision:** NIC deploys everything in one unified workflow. +**Decision:** NIC deploys everything from `nic deploy -f config.yaml` in one workflow. 
**Rationale:** -- Eliminates stage dependency complexity +- Eliminates stage-dependency complexity - Faster deployment (parallel operations where possible) -- Easier to reason about (one command: `nic deploy`) +- Easier to reason about (one command) - Clearer error messages (no inter-stage state issues) -- ArgoCD handles application-level dependencies +- ArgoCD handles application-level dependencies via sync waves -**Alternatives Considered:** +### 4.2 Decision: Per-Provider Backing Tools (Not "OpenTofu Everywhere") -| Alternative | Rejected Because | -|-------------|------------------| -| Keep staged approach | Complexity, state management issues, slower | -| Makefile-based orchestration | Not portable, hard to debug, limited error handling | -| Ansible playbooks | YAML hell, imperative, no true state management | +**Context:** Different cluster providers have different idiomatic tooling. EKS has excellent Terraform support. Hetzner Cloud has a purpose-built tool (`hetzner-k3s`) that handles bootstrap better than tofu would. Kind is configured via a CLI flag and a YAML file. -### 4.2 Decision: OpenTofu/Terraform Modules for Infrastructure +**Decision:** The `provider.Provider` interface is the abstraction. Each provider implementation chooses the right backing tool for its environment: -**Context:** Need to provision cloud infrastructure (VPC, EKS/GKE/AKS, storage) reliably. +- **AWS:** OpenTofu, via the `terraform-exec` Go library, running the upstream `nebari-dev/eks-cluster` registry module +- **Hetzner:** the `hetzner-k3s` binary, talking directly to the Hetzner Cloud API +- **Local:** Kind, driven by `make localkind-up`. The local provider itself is a thin adapter; the CLI is responsible for cluster creation. +- **Existing:** no IaC at all - the provider reads `kubeconfig` and `context` from config and adopts the cluster -**Decision:** Use OpenTofu/Terraform modules orchestrated via the terraform-exec Go library. 
+This direction is documented in [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md), which proposes formalizing the abstraction as out-of-tree gRPC plugins so that private and org-specific providers (e.g., OpenTeams' internal ASCOT DNS provider) have a supported integration path that isn't "fork NIC." **Rationale:** -- **Battle-tested modules**: Leverage existing, proven Terraform modules -- **Community ecosystem**: Access to thousands of maintained modules -- **Familiar to teams**: Most infrastructure engineers know Terraform/HCL -- **Standard tooling**: Works with terraform-docs, tfsec, Atlantis, etc. -- **Faster development**: Reuse modules instead of writing SDK code from scratch -- **Standard state format**: Terraform state is well-understood and tooling-rich +- Honest about reality: not every provider fits a Terraform module +- Each provider can use its strongest available tool +- The `Provider` interface, not Terraform, is the contract +- Future-proof for out-of-tree plugins (ADR-0004) -**How It Works:** +### 4.3 Decision: terraform-exec for the AWS Provider -``` -Every `nic deploy` run: -1. Parse config.yaml (desired state) -2. Convert config to Terraform variables -3. Run terraform-exec: init, plan, apply -4. OpenTofu provisions infrastructure via provider plugins -5. State file updated in remote backend -6. Go CLI waits for cluster readiness -7. Deploy foundational software via ArgoCD -``` - -**Trade-offs:** - -- **External dependency**: Requires OpenTofu/Terraform binary installed -- **State management**: Must configure and manage state backends -- **Debugging layers**: Errors pass through Go → terraform-exec → OpenTofu → Cloud API - -See [State Management](05-state-management.md) for state backend configuration. - -### 4.3 Decision: terraform-exec for Go Orchestration - -**Context:** How to invoke OpenTofu from the Go CLI? - -**Decision:** Use HashiCorp's terraform-exec library for programmatic OpenTofu control. 
- -**Rationale:** +**Context:** How to invoke OpenTofu from Go for the AWS provider. -- Official Go library for Terraform/OpenTofu execution -- Type-safe interface for init, plan, apply, destroy -- Structured output parsing (JSON plan output) -- Well-maintained and documented -- Supports both Terraform and OpenTofu binaries +**Decision:** Use HashiCorp's `terraform-exec` library wrapped by `pkg/tofu.TerraformExecutor`. -**Implementation Pattern:** +**Implementation Pattern (real shape from `pkg/tofu/tofu.go`):** ```go -// terraform-exec wrapper with OpenTelemetry instrumentation -func (e *Executor) Apply(ctx context.Context, varFiles []string) error { - ctx, span := tracer.Start(ctx, "Executor.Apply") - defer span.End() - - slog.InfoContext(ctx, "applying infrastructure changes") - - var opts []tfexec.ApplyOption - for _, vf := range varFiles { - opts = append(opts, tfexec.VarFile(vf)) - } - - if err := e.tf.Apply(ctx, opts...); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform apply: %w", err) - } +type TerraformExecutor struct { + *tfexec.Terraform + workingDir string + appFs afero.Fs +} - return nil +func (te *TerraformExecutor) Apply(ctx context.Context, opts ...tfexec.ApplyOption) error { + ctx = signalSafeContext(ctx) + return te.streamThroughStatus(ctx, func(w io.Writer) error { + return te.ApplyJSON(ctx, w, opts...) + }) } ``` -See [Terraform-Exec Integration](../implementation/08-terraform-exec-integration.md) for complete implementation. +Two things to note: -### 4.4 Decision: Terraform State with Remote Backends +1. The wrapper calls `ApplyJSON`/`PlanJSON`/`InitJSON`/`DestroyJSON` and streams output through the **status channel** attached to `ctx` (see [System Overview §2.4](02-system-overview.md#24-the-status-channel-pkg--cmd-seam)). Library code does not call `slog` - that translation happens in `cmd/nic/status_handler.go`. +2. 
`Setup(ctx, templates fs.FS, tfvars any)` (also in `pkg/tofu/tofu.go`) handles binary acquisition via `tofudl` with caching at `~/.cache/nic/tofu/`, extraction of embedded templates, and `terraform.tfvars.json` writing. Callers do not look up tofu in `PATH`. -**Context:** Need to track infrastructure state across deployments. +See [Terraform-Exec Integration](../implementation/08-terraform-exec-integration.md). -**Decision:** Use standard Terraform state files stored in remote backends (S3, GCS, Azure Blob). +### 4.4 Decision: Terraform State in S3 with Native Lockfile-Based Locking (AWS) -**Rationale:** +**Context:** The AWS provider needs to track infrastructure state across deployments. -- **Standard format**: Terraform state is industry-standard -- **Team collaboration**: Remote backends support concurrent access with locking -- **State versioning**: Backends like S3 support versioning for recovery -- **Drift detection**: `terraform plan` compares state with actual infrastructure -- **Ecosystem integration**: Works with Atlantis, Terraform Cloud, etc. +**Decision:** Standard Terraform S3 backend with `use_lockfile = true`. No DynamoDB table is involved. -**State Backend Configuration:** +**State Backend Configuration (real `pkg/provider/aws/templates/backend.tf`):** ```hcl -# AWS Backend (S3 + DynamoDB for locking) terraform { backend "s3" { - bucket = "nebari-prod-terraform-state" - key = "nic/terraform.tfstate" - region = "us-west-2" - encrypt = true - dynamodb_table = "nebari-prod-terraform-locks" + encrypt = true + use_lockfile = true } } ``` -**Trade-offs:** +Bucket and key are populated dynamically at `tofu init` time. The bucket name is deterministic: `nic-tfstate---<8-hex-of-account-id-hash>`. NIC auto-creates the bucket (`pkg/provider/aws/state.go:ensureStateBucket`) with versioning and public-access-block enabled. 
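The deterministic naming described above can be sketched as follows. This is a minimal illustration, not the code in `pkg/provider/aws/state.go`: the helper name `stateBucketName` is hypothetical, the two middle segments of the name (elided in the text above) are assumed to be a project name and a region, and the suffix is assumed to be the first eight hex characters of a SHA-256 digest of the account ID. The 63-character check reflects the S3 limit on bucket name length.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maxBucketNameLen is the maximum length S3 allows for a bucket name.
const maxBucketNameLen = 63

// stateBucketName is a hypothetical helper that derives a deterministic
// state bucket name. The segment order (project, region, hash suffix) is an
// assumption made for this sketch.
func stateBucketName(project, region, accountID string) (string, error) {
	// Hash the account ID so it is not embedded in the bucket name directly.
	sum := sha256.Sum256([]byte(accountID))
	suffix := hex.EncodeToString(sum[:])[:8]

	name := fmt.Sprintf("nic-tfstate-%s-%s-%s", project, region, suffix)
	if len(name) > maxBucketNameLen {
		return "", fmt.Errorf("bucket name %q is %d characters, limit is %d",
			name, len(name), maxBucketNameLen)
	}
	return name, nil
}

func main() {
	// The account ID below is a placeholder, not a real account.
	name, err := stateBucketName("my-nebari", "us-west-2", "123456789012")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

Because the name depends only on its inputs, repeated runs for the same project, region, and account resolve to the same bucket, which is what lets the bucket be created on first deploy and found again on later ones without any user-supplied backend settings.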
+ +**Non-AWS providers manage state in tool-specific ways:** -- **Setup required**: Must create and configure state backend resources -- **State drift risk**: State file can diverge from actual infrastructure -- **Sensitive data**: State files contain credentials and must be secured -- **Locking complexity**: Lock conflicts require manual resolution +- Hetzner: `hetzner-k3s` writes a cluster state file; the location is configured via the tool's own settings +- Local: Kind manages its own cluster lifecycle +- Existing: no NIC-owned state -See [State Management](05-state-management.md) for complete backend configuration. +See [State Management](05-state-management.md). ### 4.5 Decision: ArgoCD for Foundational Software -**Context:** How to deploy and manage foundational software (Keycloak, LGTM, etc.)? +**Context:** How to deploy and manage foundational software (cert-manager, Envoy Gateway, Keycloak, etc.). -**Decision:** Deploy ArgoCD first via Helm, then use ArgoCD applications for all other foundational software. +**Decision:** NIC installs ArgoCD first via the **embedded Helm Go SDK** (`helm.sh/helm/v3/pkg/action`, wrapped in `pkg/helm`), then renders ArgoCD `Application` manifests into a Git repository for ArgoCD to sync. **Rationale:** -- GitOps best practices (declarative, version-controlled) -- Automatic sync and health checks -- Dependency management (app-of-apps pattern) -- Self-healing (detects and fixes drift) -- Rollback capability -- Clear audit trail (Git history) +- GitOps best practices: declarative, version-controlled, self-healing +- Sync waves manage cross-app dependencies (cert-manager before things that need certs, etc.) +- ArgoCD itself can be installed without a separately-installed Helm CLI -**Deployment Order:** +**Deployment Order (real apps under `pkg/argocd/templates/apps/`):** ``` -1. ArgoCD (Helm chart via Terraform helm provider) +1. ArgoCD (installed by NIC via the Helm Go SDK) ↓ -2. 
ArgoCD Applications (via Terraform kubernetes provider) - ├── cert-manager (first, for TLS) - ├── Envoy Gateway (depends on cert-manager) - ├── OpenTelemetry Collector - ├── Mimir, Loki, Tempo (parallel) - ├── Grafana (depends on Mimir/Loki/Tempo) - ├── Keycloak (depends on Envoy for ingress) - └── Nebari Operator (last, depends on all) +2. App-of-apps root.yaml, then individual apps via sync waves: + ├── cert-manager + cluster-issuers + certificates + ├── Envoy Gateway + gateway-config + httproutes + ├── postgresql + Keycloak + ├── metallb + metallb-config (only when InfraSettings.NeedsMetalLB) + ├── opentelemetry-collector + ├── nebari-operator (Kustomized from nebari-dev/nebari-operator) + └── nebari-landingpage ``` -### 4.6 Decision: Nebari Kubernetes Operator +A full LGTM stack (Loki / Grafana / Tempo / Mimir) is not deployed today; that is roadmap work. -**Context:** Applications need to integrate with auth, o11y, and routing. +### 4.6 Decision: Nebari Operator Is Out-of-Tree -**Decision:** Build a Kubernetes operator that watches `nebari-application` CRDs and automates integration. +**Context:** Applications integrate with auth and routing via a `NebariApp` CRD. -**Rationale:** +**Decision:** The Nebari Operator is its own product, developed in [`nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator). NIC deploys it as a foundational ArgoCD application via Kustomize (`pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml`). 
-- Reduces manual configuration (no more copy-paste YAML) -- Consistent integration across all apps -- Self-service for developers -- Automatic updates when foundational software changes -- Native Kubernetes workflow - -**Example CRD Usage:** - -```yaml -apiVersion: nebari.dev/v1alpha1 -kind: NebariApplication -metadata: - name: jupyter-hub - namespace: jupyter -spec: - displayName: "JupyterHub" - routing: - domain: jupyter.example.com - enableTLS: true - paths: - - path: / - service: jupyterhub - port: 8000 - authentication: - enabled: true - allowedGroups: - - data-scientists - - admins - observability: - metrics: - enabled: true - port: 9090 - path: /metrics - logs: - enabled: true - traces: - enabled: true - dashboards: - - name: "JupyterHub Overview" - source: "https://..." -``` - -**Operator Actions:** - -1. Creates Keycloak OAuth2 client -2. Configures Envoy Gateway HTTPRoute -3. Provisions cert-manager Certificate -4. Creates Grafana Dashboard ConfigMap -5. Configures OpenTelemetry ServiceMonitor -6. Updates status with URLs and credentials +**Rationale:** -### 4.7 Decision: OpenTelemetry Throughout +- The operator has its own release cadence and CRD schema +- NIC is an infrastructure tool; the operator is an application-integration tool +- Keeps NIC's surface area focused on cluster provisioning and bootstrap -**Context:** Need comprehensive observability for NIC itself. +NIC passes `InfraSettings.KeycloakBasePath` and `InfraSettings.HTTPSPort` into the operator's Kustomize patch so it routes correctly per provider. NIC does not implement the reconciliation logic; that lives upstream. -**Decision:** Instrument all NIC code with OpenTelemetry (traces, metrics, logs). +### 4.7 Decision: OpenTelemetry in Library Code, slog in the CLI -**Rationale:** +**Context:** Need observability for NIC itself, without coupling library code to a specific logging backend. 
-- Debugging deployment issues -- Performance monitoring -- Vendor-neutral (can export to any backend) -- Unified observability story (NIC uses same stack it deploys) -- Compliance with industry standards +**Decision:** -**Implementation:** +- All new functions in `pkg/` are wrapped in OpenTelemetry trace spans, with the documented exemptions in [`CLAUDE.md`](../../../CLAUDE.md) (e.g., per-line writers in `pkg/status` and byte/line helpers in `pkg/tofu`). +- Library code never calls `slog`. User-visible progress goes through the status channel; `cmd/nic/status_handler.go` is the only translator into structured logs. +- Exporters are configurable via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`) and `OTEL_ENDPOINT`. -- Every Go function wrapped in trace span -- Structured logging via slog with trace context -- Custom metrics for resource counts, deployment time, errors -- Export to deployed LGTM stack +**Pattern:** -### 4.8 Decision: Go CLI with Embedded Modules +```go +func SomeFunction(ctx context.Context, ...) error { + tracer := otel.Tracer("nebari-infrastructure-core") + ctx, span := tracer.Start(ctx, "package.FunctionName") + defer span.End() -**Context:** How to package and distribute NIC? + span.SetAttributes(attribute.String("key", value)) -**Decision:** Single Go binary with OpenTofu modules embedded or cloned from git. + if err != nil { + span.RecordError(err) + return err + } + return nil +} +``` -**Rationale:** +### 4.8 Decision: Single Go Binary with Embedded Templates (Today); Out-of-Tree Plugins (Tomorrow) -- Easy installation (go install or download binary) -- Modules version-locked with NIC release -- Predictable behavior across environments -- No separate module download step +**Context:** How to package and distribute NIC. -**Module Delivery Options:** +**Decision (today):** Single Go binary. AWS templates are embedded via `go:embed` from `pkg/provider/aws/templates/`. 
OpenTofu itself is downloaded on first use into `~/.cache/nic/tofu/` and reused thereafter. -1. **Embedded** (default): Modules embedded in binary via Go embed -2. **Git clone**: Modules cloned from versioned git tag -3. **Local path**: Modules from local filesystem (development) +**Decision (planned, [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md)):** Move providers (cluster, DNS, cert, git, software) to out-of-tree gRPC plugins discovered at runtime. The current in-tree layout is the bootstrap target; the plugin architecture is the long-term direction. --- diff --git a/docs/design-doc/architecture/05-state-management.md b/docs/design-doc/architecture/05-state-management.md index 0e87f519..4e535c9b 100644 --- a/docs/design-doc/architecture/05-state-management.md +++ b/docs/design-doc/architecture/05-state-management.md @@ -1,345 +1,122 @@ -# State Management with Terraform State +# State Management ## 5.1 Overview -NIC uses **Terraform state files** to track infrastructure. The state file records what resources Terraform has created and their current configuration, enabling drift detection and safe updates. +State management in NIC is **provider-specific**. There is no single state mechanism that spans all cluster providers. -## 5.2 Terraform State Backends +| Provider | State Mechanism | +|----------|-----------------| +| AWS | OpenTofu state in S3, with native lockfile-based locking | +| Hetzner | `hetzner-k3s` writes a cluster state file managed by that tool | +| Local (Kind) | Kind manages its own cluster lifecycle; no NIC-owned state | +| Existing | No state; NIC adopts a cluster by `kubeconfig`/`context` | -NIC generates backend configuration based on the cloud provider and user preferences. +This document focuses on the **AWS provider**, which is the only provider that uses OpenTofu today. 
-### AWS (S3 + DynamoDB for Locking) +## 5.2 AWS State Backend + +The AWS provider uses the standard Terraform S3 backend with native lockfile-based locking (introduced in OpenTofu/Terraform 1.10): ```hcl +# pkg/provider/aws/templates/backend.tf terraform { backend "s3" { - bucket = "nebari-prod-terraform-state" - key = "nic/terraform.tfstate" - region = "us-west-2" - encrypt = true - dynamodb_table = "nebari-prod-terraform-locks" + encrypt = true + use_lockfile = true } } ``` -**Setup Requirements**: -1. Create S3 bucket for state storage -2. Enable versioning on S3 bucket -3. Create DynamoDB table for locking (primary key: `LockID` string) -4. Configure appropriate IAM permissions +Bucket and key are not hard-coded; they are populated via `-backend-config` flags at `tofu init` time from values computed in `pkg/provider/aws/state.go`. -### GCP (Cloud Storage) +### Bucket Naming -```hcl -terraform { - backend "gcs" { - bucket = "nebari-prod-terraform-state" - prefix = "nic" - } -} -``` - -**Setup Requirements**: -1. Create Cloud Storage bucket -2. Enable object versioning -3. Configure appropriate IAM permissions -4. Locking handled automatically by GCS +The bucket name is deterministic and not user-configurable today: -### Azure (Blob Storage) - -```hcl -terraform { - backend "azurerm" { - storage_account_name = "nebaristate" - container_name = "tfstate" - key = "nic/terraform.tfstate" - } -} ``` - -**Setup Requirements**: -1. Create Azure Storage Account -2. Create blob container -3. Configure appropriate RBAC permissions -4. Locking handled via blob lease mechanism - -### Local (Development Only) - -```hcl -terraform { - backend "local" { - path = "terraform.tfstate" - } -} +nic-tfstate---<8-hex-chars-of-sha256(account_id)> ``` -**Not Recommended for Production**: -- No team collaboration support -- No state locking across multiple users -- State file lost if local machine fails +For example, `nic-tfstate-my-nebari-us-west-2-1a2b3c4d`. 
The account ID is hashed rather than embedded directly. The total length is checked against the 63-character S3 bucket name limit; project names that would overflow it return an error. -## 5.3 State Locking +The state object key is `/terraform.tfstate`. -Terraform handles locking automatically to prevent concurrent modifications: +### Bucket Lifecycle -| Backend | Locking Mechanism | Configuration Required | -|---------|------------------|----------------------| -| **S3** | DynamoDB table | Create table with `LockID` primary key | -| **GCS** | Object generation metadata | None (automatic) | -| **Azure Blob** | Blob lease | None (automatic) | -| **Local** | File locking | None (automatic) | +NIC creates the bucket automatically on first deploy (`ensureStateBucket` in `pkg/provider/aws/state.go`) with: -### Lock Behavior +- Versioning enabled +- Public access fully blocked (`PutPublicAccessBlock`) +- SSE enabled at the backend level (`encrypt = true` in `backend.tf`) -When `nic deploy` runs: -1. Terraform acquires lock before `terraform plan` -2. Lock prevents concurrent modifications -3. Lock released after `terraform apply` completes -4. If NIC crashes, lock auto-expires (configurable timeout) +On `nic destroy`, the bucket and all object versions are deleted (`destroyStateBucket`). The bucket lifecycle is owned by NIC; there is no separate "setup" step the user runs first. -### Lock Conflicts +### Locking -```bash -$ nic deploy -f config.yaml +`use_lockfile = true` makes OpenTofu acquire the state lock by writing a `.tflock` object to S3 next to the state file. This replaces the older pattern of using a DynamoDB table for locks. NIC does **not** create or manage a DynamoDB table; if you see references to one anywhere, that is a documentation bug. 
-Error: Error acquiring the state lock +Lock conflicts surface as an error from `tofu apply` like: +``` +Error: Error acquiring the state lock Lock Info: - ID: a1b2c3d4-5678-90ef-ghij-klmnopqrstuv - Path: nebari-prod-terraform-state/nic/terraform.tfstate + ID: ... + Path: //terraform.tfstate Operation: OperationTypeApply - Who: user@hostname - Version: 1.6.0 - Created: 2025-01-14 15:30:00 UTC - Info: - -Another operation is currently holding the state lock. -If you're sure no other operation is running, you can force unlock: - nic unlock -f config.yaml -``` - -## 5.4 Drift Detection - -Drift detection compares state file with actual cloud infrastructure via `terraform plan`: - -```go -func (p *TofuProvider) DetectDrift(ctx context.Context) (*DriftReport, error) { - ctx, span := tracer.Start(ctx, "TofuProvider.DetectDrift") - defer span.End() - - slog.InfoContext(ctx, "detecting infrastructure drift") - - // terraform plan compares state file with actual cloud state - hasChanges, err := p.executor.Plan(ctx, []string{"terraform.tfvars"}) - if err != nil { - return nil, fmt.Errorf("running terraform plan: %w", err) - } - - if !hasChanges { - slog.InfoContext(ctx, "no drift detected") - return &DriftReport{DriftsDetected: 0}, nil - } - - // terraform-exec provides structured plan output - plan, err := p.executor.ShowPlanFile(ctx, "tfplan") - if err != nil { - return nil, fmt.Errorf("parsing plan: %w", err) - } - - // Parse changes into drift report - drifts := []Drift{} - for _, change := range plan.ResourceChanges { - if change.Change.Actions.Delete() || change.Change.Actions.Update() { - drifts = append(drifts, Drift{ - Resource: change.Address, - Type: change.Type, - Action: change.Change.Actions.String(), - }) - } - } - - slog.WarnContext(ctx, "infrastructure drift detected", "drift_count", len(drifts)) - - return &DriftReport{ - DriftsDetected: len(drifts), - Drifts: drifts, - }, nil -} -``` - -### Drift Scenarios - -**Scenario 1: Resource Deleted Outside 
Terraform** -``` -Plan: 1 to add, 0 to change, 0 to destroy - - # aws_eks_node_group.workers will be created - + resource "aws_eks_node_group" "workers" { - + arn = (known after apply) - + cluster_name = "nebari-prod" - ... - } -``` - -**Scenario 2: Resource Modified Outside Terraform** -``` -Plan: 0 to add, 1 to change, 0 to destroy - - # aws_eks_node_group.workers will be updated in-place - ~ resource "aws_eks_node_group" "workers" { - ~ desired_size = 3 -> 5 - ... - } -``` - -**Scenario 3: No Drift** ``` -No changes. Your infrastructure matches the configuration. -``` - -## 5.5 State Operations -NIC exposes Terraform state commands: +Today there is no `nic unlock` command; recovery from a stuck lock requires manual intervention via `tofu force-unlock` or by deleting the `.tflock` S3 object. Adding `nic unlock` is tracked in [issue #64](https://github.com/nebari-dev/nebari-infrastructure-core/issues/64); Ctrl-C-leaves-state-locked is tracked in [issue #63](https://github.com/nebari-dev/nebari-infrastructure-core/issues/63). -```bash -# List resources in state -nic state list +## 5.3 Drift Detection -# Show specific resource -nic state show aws_eks_cluster.main - -# Remove resource from state (doesn't destroy infrastructure) -nic state rm aws_eks_node_group.old_pool - -# Move resource to different address -nic state mv aws_eks_node_group.workers aws_eks_node_group.renamed -``` - -## 5.6 State Migration - -When changing backend configuration: +Drift detection is exposed via `--dry-run`: ```bash -# Old backend configuration -terraform { - backend "local" { - path = "terraform.tfstate" - } -} - -# New backend configuration -terraform { - backend "s3" { - bucket = "nebari-prod-terraform-state" - key = "nic/terraform.tfstate" - } -} - -# NIC handles migration -$ nic deploy -f config.yaml - -Detected backend configuration change. -Migrating state from local to s3... - -Terraform will perform the following actions: - - Copying state from "local" backend to "s3" backend. 
- -Do you want to copy existing state to the new backend? - Enter a value: yes - -Successfully migrated state to new backend. +nic deploy -f config.yaml --dry-run ``` -## 5.7 State File Security - -### Encryption at Rest +Under the hood, this calls `Provider.Deploy(ctx, ..., DeployOptions{DryRun: true})`. The AWS provider implementation runs `tofu plan` and streams structured plan output through the status channel; the CLI translates it into a human-readable summary. -- **S3**: Enable bucket encryption (SSE-S3 or SSE-KMS) -- **GCS**: Enable bucket encryption (Google-managed or customer-managed keys) -- **Azure Blob**: Enable storage account encryption +There is no separate `nic status`, `nic plan`, or `nic state` subcommand today. Drift information is communicated through `--dry-run`. -### Access Control +## 5.4 State File Security -- **S3**: IAM policies restricting bucket access -- **GCS**: IAM policies for Cloud Storage -- **Azure Blob**: RBAC for storage account access +Terraform state files contain sensitive material (cluster credentials, certificate authority data, etc.). NIC mitigates this via: -### Sensitive Data in State +- **Encryption at rest**: SSE enabled (`encrypt = true`) on the S3 backend +- **Public-access block**: NIC sets `BlockPublicAcls`, `BlockPublicPolicy`, `IgnorePublicAcls`, and `RestrictPublicBuckets` on the state bucket +- **IAM**: bucket access is controlled by the IAM identity NIC runs under; restrict it to the smallest set of principals that need to operate the cluster +- **Versioning**: enabled by default so accidental state corruption can be recovered -Terraform state files contain **sensitive data**: -- Kubernetes cluster credentials -- Database passwords -- API keys -- Certificate private keys +**Operator best practices:** -**Best Practices**: 1. Never commit state files to version control -2. Use encrypted remote backends -3. Restrict state file access via IAM/RBAC -4. Enable state file versioning for recovery -5. 
Regularly rotate credentials stored in state - -## 5.8 State Backend Setup - -NIC can automatically create state backend resources: - -```bash -# Initialize state backend (creates S3 bucket, DynamoDB table, etc.) -nic init-backend -f config.yaml +2. Restrict state-bucket access to a small operator group +3. Rotate credentials (e.g., Keycloak admin password) after they appear in plan output -Creating state backend resources... - - S3 Bucket: nebari-prod-terraform-state - - DynamoDB Table: nebari-prod-terraform-locks +## 5.5 Working Directory -State backend initialized successfully. -``` - -Or users can create resources manually and configure in config.yaml: - -```yaml -# config.yaml -project_name: nebari-prod -provider: aws - -state_backend: - type: s3 - bucket: my-existing-state-bucket - key: nebari/terraform.tfstate - region: us-west-2 - dynamodb_table: my-existing-lock-table -``` +`pkg/tofu.Setup` creates a fresh temporary working directory for each NIC invocation: -## 5.9 Working Directory Management - -OpenTofu requires a working directory with state and modules: - -``` -.nic/ -├── terraform/ # Working directory -│ ├── .terraform/ # Terraform plugins and modules -│ ├── terraform.tfstate # State file (if using local backend) -│ ├── vars.json # Generated from config.yaml -│ └── backend.tf # Generated backend configuration -``` +1. Allocates a temp directory via `afero.TempDir(appFs, "", "nic-tofu")` +2. Walks the embedded `templates/` filesystem and copies each file into the working dir +3. Downloads (or reuses, from `~/.cache/nic/tofu/`) the OpenTofu binary and writes it into the working dir +4. Sets `TF_PLUGIN_CACHE_DIR` to `~/.cache/nic/tofu/plugins` so provider plugins are reused across runs +5. Marshals provider-supplied tfvars to `terraform.tfvars.json` in the working dir +6. Returns a `TerraformExecutor` whose `Cleanup()` method removes the working dir -The Go CLI manages this working directory lifecycle: -1. Create working directory if not exists -2. 
Copy Terraform modules from embedded FS or git clone -3. Generate `vars.json` from `config.yaml` -4. Generate `backend.tf` from config -5. Run `tofu init`, `tofu plan`, `tofu apply` -6. Cleanup temporary files +There is no `.nic/` directory in the user's home or project root; everything is ephemeral except the binary and plugin caches. ---- +For dry-run scenarios where the remote state bucket might not yet exist, `WriteBackendOverride()` writes a `backend_override.tf.json` that overrides the backend with a local backend for that single run. -## Summary +## 5.6 Future Work -NIC uses standard Terraform state management with remote backends, providing: +The following are known gaps and tracked in GitHub issues: -- **Team collaboration**: State locking prevents concurrent modifications -- **Drift detection**: `terraform plan` compares state with actual infrastructure -- **State versioning**: Remote backends support versioning for recovery -- **Ecosystem compatibility**: Works with Atlantis, Terraform Cloud, and other tools +- **`nic unlock` command** ([#64](https://github.com/nebari-dev/nebari-infrastructure-core/issues/64)) - graceful recovery from stuck S3 lockfiles +- **Ctrl-C cleanup during destroy** ([#63](https://github.com/nebari-dev/nebari-infrastructure-core/issues/63)) - currently can leave state locked +- **Redundant tofu init / module downloads** ([#241](https://github.com/nebari-dev/nebari-infrastructure-core/issues/241)) - module downloads are repeated unnecessarily because the working dir is ephemeral +- **`nic state` subcommands** - currently the only way to manipulate state is via the bundled tofu binary directly; first-class state subcommands are not implemented +- **GCP / Azure backend support** - blocked on those providers being implemented at all -See [Terraform-Exec Integration](../implementation/08-terraform-exec-integration.md) for how the Go CLI manages state operations. 
+See also [Terraform-Exec Integration](../implementation/08-terraform-exec-integration.md). diff --git a/docs/design-doc/implementation/06-opentofu-module-architecture.md b/docs/design-doc/implementation/06-opentofu-module-architecture.md index 74b98f4b..67acdba8 100644 --- a/docs/design-doc/implementation/06-opentofu-module-architecture.md +++ b/docs/design-doc/implementation/06-opentofu-module-architecture.md @@ -1,686 +1,113 @@ # OpenTofu Module Architecture -## 6.1 Overview +## 6.1 Scope -NIC uses **OpenTofu/Terraform modules** for infrastructure provisioning. The Go CLI orchestrates OpenTofu execution via the terraform-exec library, while actual infrastructure provisioning is handled by HCL-based modules that leverage the Terraform provider ecosystem. +This document describes how OpenTofu is used inside NIC. **OpenTofu is not used by every provider.** Only the AWS cluster provider uses tofu today; Hetzner shells out to the `hetzner-k3s` binary, the local provider relies on Kind (driven from the Makefile), and the `existing` provider does not provision any infrastructure at all. -## 6.2 Module Structure +For the contract between CLI and provider implementations - which is what actually defines NIC's architecture - see the `Provider` interface in `pkg/provider/provider.go` and [System Overview](../architecture/02-system-overview.md). -### Repository Layout +## 6.2 Repository Layout (Real) -``` -nebari-infrastructure-core/ -├── cmd/nic/ # Go CLI (orchestration layer) -├── pkg/ -│ ├── tofu/ # terraform-exec wrapper with OpenTelemetry -│ ├── config/ # Parse config.yaml -│ ├── kubernetes/ # K8s health checks -│ └── telemetry/ # OpenTelemetry setup -├── terraform/ # OpenTofu/Terraform modules -│ ├── main.tf # Root module -│ ├── variables.tf # Input variables from config.yaml -│ ├── outputs.tf # Outputs (kubeconfig, URLs, etc.) 
-│ ├── backend.tf.tmpl # State backend configuration template -│ ├── providers.tf # Provider configurations -│ └── modules/ -│ ├── aws/ # AWS-specific modules -│ │ ├── vpc/ -│ │ ├── eks/ -│ │ └── efs/ -│ ├── gcp/ # GCP-specific modules -│ │ ├── vpc/ -│ │ ├── gke/ -│ │ └── filestore/ -│ ├── azure/ # Azure-specific modules -│ │ ├── vnet/ -│ │ ├── aks/ -│ │ └── azure-files/ -│ ├── local/ # Local K3s module -│ │ └── k3s/ -│ ├── kubernetes/ # K8s bootstrap (namespaces, RBAC, etc.) -│ ├── argocd/ # ArgoCD Helm deployment -│ └── foundational-apps/ # ArgoCD Applications -└── go.mod -``` - -### How It Works - -1. **User runs**: `nic deploy -f config.yaml` -2. **Go CLI (`cmd/nic`)**: - - Parses `config.yaml` into Go structs - - Converts config to Terraform variables JSON - - Invokes terraform-exec to run `tofu init`, `tofu plan`, `tofu apply` -3. **OpenTofu**: - - Reads variables JSON - - Executes `terraform/main.tf` root module - - Provisions cloud infrastructure via provider plugins - - Updates state file in configured backend -4. 
**Go CLI resumes**: - - Waits for Kubernetes cluster readiness - - Waits for ArgoCD and foundational software - - Reports deployment success - -## 6.3 Root Module Design - -The root module (`terraform/main.tf`) contains conditional logic to provision the correct cloud resources based on the `provider` variable: - -```hcl -terraform { - required_version = ">= 1.6" - - required_providers { - aws = { - source = "hashicorp/aws" - version = "~> 5.0" - } - google = { - source = "hashicorp/google" - version = "~> 5.0" - } - azurerm = { - source = "hashicorp/azurerm" - version = "~> 3.0" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "~> 2.25" - } - helm = { - source = "hashicorp/helm" - version = "~> 2.12" - } - } -} - -# Provider selection based on var.provider -locals { - is_aws = var.provider == "aws" - is_gcp = var.provider == "gcp" - is_azure = var.provider == "azure" - is_local = var.provider == "local" -} - -# AWS Infrastructure (only created if provider=aws) -module "aws_vpc" { - count = local.is_aws ? 1 : 0 - source = "./modules/aws/vpc" - - name = var.cluster_name - cidr = var.aws_vpc_cidr - availability_zones = var.aws_availability_zones - tags = var.tags -} - -module "aws_eks" { - count = local.is_aws ? 1 : 0 - source = "./modules/aws/eks" - - cluster_name = var.cluster_name - kubernetes_version = var.kubernetes_version - vpc_id = module.aws_vpc[0].vpc_id - subnet_ids = module.aws_vpc[0].private_subnet_ids - node_pools = var.node_pools - tags = var.tags -} - -# GCP Infrastructure (only created if provider=gcp) -module "gcp_vpc" { - count = local.is_gcp ? 1 : 0 - source = "./modules/gcp/vpc" - - name = var.cluster_name - region = var.region - project = var.gcp_project_id -} - -module "gcp_gke" { - count = local.is_gcp ? 
1 : 0 - source = "./modules/gcp/gke" - - cluster_name = var.cluster_name - kubernetes_version = var.kubernetes_version - region = var.region - project = var.gcp_project_id - network = module.gcp_vpc[0].network_name - subnetwork = module.gcp_vpc[0].subnetwork_name - node_pools = var.node_pools -} - -# Azure, Local, Kubernetes bootstrap, ArgoCD modules... -# (Similar pattern for other providers) -``` - -## 6.4 AWS EKS Module Example - -Shows how infrastructure is defined in HCL: - -**terraform/modules/aws/eks/main.tf:** - -```hcl -terraform { - required_providers { - aws = { - source = "hashicorp/aws" - version = "~> 5.0" - } - } -} - -# EKS Cluster IAM Role -resource "aws_iam_role" "cluster" { - name = "${var.cluster_name}-cluster-role" - - assume_role_policy = jsonencode({ - Version = "2012-10-17" - Statement = [{ - Action = "sts:AssumeRole" - Effect = "Allow" - Principal = { - Service = "eks.amazonaws.com" - } - }] - }) - - tags = var.tags -} +There is **no root-level `terraform/` directory**. 
AWS-specific templates live inside the AWS provider package: -resource "aws_iam_role_policy_attachment" "cluster_AmazonEKSClusterPolicy" { - policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy" - role = aws_iam_role.cluster.name -} - -# EKS Cluster -resource "aws_eks_cluster" "main" { - name = var.cluster_name - version = var.kubernetes_version - role_arn = aws_iam_role.cluster.arn - - vpc_config { - subnet_ids = var.subnet_ids - endpoint_private_access = true - endpoint_public_access = true - } - - enabled_cluster_log_types = [ - "api", - "audit", - "authenticator", - "controllerManager", - "scheduler" - ] - - tags = var.tags - - depends_on = [ - aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy - ] -} - -# Node Groups -resource "aws_eks_node_group" "node_pools" { - for_each = { for np in var.node_pools : np.name => np } - - cluster_name = aws_eks_cluster.main.name - node_group_name = each.value.name - node_role_arn = aws_iam_role.node_group.arn - subnet_ids = var.subnet_ids - - instance_types = [each.value.instance_type] - - scaling_config { - desired_size = each.value.min_size - min_size = each.value.min_size - max_size = each.value.max_size - } - - labels = each.value.labels - - dynamic "taint" { - for_each = each.value.taints - content { - key = taint.value.key - value = taint.value.value - effect = taint.value.effect - } - } - - tags = merge(var.tags, { - "NodePool" = each.value.name - }) -} ``` - -**terraform/modules/aws/eks/outputs.tf:** - -```hcl -output "cluster_id" { - description = "EKS cluster ID" - value = aws_eks_cluster.main.id -} - -output "cluster_endpoint" { - description = "EKS cluster endpoint" - value = aws_eks_cluster.main.endpoint -} - -output "cluster_ca_certificate" { - description = "EKS cluster CA certificate" - value = base64decode(aws_eks_cluster.main.certificate_authority[0].data) - sensitive = true -} - -output "cluster_oidc_issuer_url" { - description = "OIDC issuer URL for IRSA" - value = 
aws_eks_cluster.main.identity[0].oidc[0].issuer -} +pkg/provider/aws/ +├── config.go # AWSConfig struct (yaml/json tags) +├── provider.go # Implements provider.Provider +├── state.go # S3 state-bucket lifecycle (ensure / destroy) +├── longhorn.go # Longhorn storage installation +├── lbc.go # AWS Load Balancer Controller +├── tofu.go # Builds tfvars and invokes pkg/tofu.Setup +└── templates/ # Embedded via go:embed + ├── main.tf # Calls upstream nebari-dev/eks-cluster module + ├── variables.tf # tfvars input schema + ├── outputs.tf # Cluster name, endpoint, OIDC issuer, etc. + ├── provider.tf # AWS provider config + └── backend.tf # S3 backend with use_lockfile = true ``` -## 6.5 Community Module Integration - -**Major Advantage**: NIC can leverage battle-tested community modules instead of writing everything from scratch. - -### Example: Using terraform-aws-modules/eks +Other cluster providers do not use OpenTofu and therefore have no `templates/` directory: -```hcl -# terraform/modules/aws/eks/main.tf (using community module) -module "eks" { - source = "terraform-aws-modules/eks/aws" - version = "~> 20.0" - - cluster_name = var.cluster_name - cluster_version = var.kubernetes_version - - vpc_id = var.vpc_id - subnet_ids = var.subnet_ids - - # EKS Managed Node Groups - eks_managed_node_groups = { - for np in var.node_pools : - np.name => { - min_size = np.min_size - max_size = np.max_size - desired_size = np.min_size - instance_types = [np.instance_type] - labels = np.labels - taints = np.taints - } - } - - # Enable IRSA - enable_irsa = true - - # Cluster addons - cluster_addons = { - coredns = { - most_recent = true - } - kube-proxy = { - most_recent = true - } - vpc-cni = { - most_recent = true - } - aws-ebs-csi-driver = { - most_recent = true - } - } - - tags = var.tags -} ``` - -### Benefits of Community Modules - -- **Less code**: Reuse proven modules instead of writing 100s of lines of HCL -- **Best practices**: Community modules encode AWS/GCP/Azure best 
practices -- **Faster development**: Don't reinvent the wheel for common patterns -- **Battle-tested**: Modules used by thousands of companies, bugs are found quickly -- **Maintained**: Active community maintenance and updates - -### Trade-offs of Community Modules - -- **Less control**: Module abstractions may hide details you want to configure -- **Dependency**: Reliant on module maintainer to fix bugs and add features -- **Version tracking**: Must monitor and update module versions -- **Learning curve**: Need to understand module's abstraction layer - -## 6.6 Kubernetes Bootstrap Module - -Handles post-cluster setup that's identical across providers: - -**terraform/modules/kubernetes/main.tf:** - -```hcl -terraform { - required_providers { - kubernetes = { - source = "hashicorp/kubernetes" - version = "~> 2.25" - } - } -} - -# Namespaces for foundational software -resource "kubernetes_namespace_v1" "namespaces" { - for_each = toset(var.namespaces) - - metadata { - name = each.value - labels = { - "managed-by" = "nic" - "nic.nebari.dev/namespace" = "true" - } - } -} - -# Storage Classes (example for AWS) -resource "kubernetes_storage_class_v1" "gp3" { - count = var.storage_class_gp3_enabled ? 
1 : 0 - - metadata { - name = "gp3" - annotations = { - "storageclass.kubernetes.io/is-default-class" = "true" - } - } - - storage_provisioner = "ebs.csi.aws.com" - reclaim_policy = "Delete" - volume_binding_mode = "WaitForFirstConsumer" - - parameters = { - type = "gp3" - encrypted = "true" - } -} - -# Priority Classes -resource "kubernetes_priority_class_v1" "high_priority" { - metadata { - name = "high-priority" - } - - value = 1000 - global_default = false - description = "High priority for critical workloads" -} - -# Network Policies (deny all by default, allow within namespace) -resource "kubernetes_network_policy_v1" "deny_all" { - for_each = toset(var.namespaces) - - metadata { - name = "deny-all" - namespace = each.value - } - - spec { - pod_selector {} - policy_types = ["Ingress", "Egress"] - } - - depends_on = [kubernetes_namespace_v1.namespaces] -} - -resource "kubernetes_network_policy_v1" "allow_same_namespace" { - for_each = toset(var.namespaces) - - metadata { - name = "allow-same-namespace" - namespace = each.value - } - - spec { - pod_selector {} - policy_types = ["Ingress", "Egress"] - - ingress { - from { - pod_selector {} - } - } - - egress { - to { - pod_selector {} - } - } - } - - depends_on = [kubernetes_namespace_v1.namespaces] -} +pkg/provider/hetzner/ # Wraps the hetzner-k3s binary +pkg/provider/local/ # Kind stub (Makefile creates the cluster) +pkg/provider/existing/ # Adopts an existing kubeconfig +pkg/provider/gcp/ # Stub: returns "not yet implemented" +pkg/provider/azure/ # Stub: returns "not yet implemented" ``` -## 6.7 ArgoCD Deployment Module +## 6.3 AWS Root Module -ArgoCD is deployed via Helm, then used to deploy all other foundational software: - -**terraform/modules/argocd/main.tf:** +The AWS root module is intentionally thin. 
It is a single Terraform file that calls the upstream community module `nebari-dev/eks-cluster/aws`: ```hcl -terraform { - required_providers { - helm = { - source = "hashicorp/helm" - version = "~> 2.12" - } - kubernetes = { - source = "hashicorp/kubernetes" - version = "~> 2.25" - } - } -} - -resource "kubernetes_namespace_v1" "argocd" { - metadata { - name = "argocd" - labels = { - "managed-by" = "nic" - } - } -} - -resource "helm_release" "argocd" { - name = "argocd" - repository = "https://argoproj.github.io/argo-helm" - chart = "argo-cd" - version = var.argocd_version - namespace = kubernetes_namespace_v1.argocd.metadata[0].name - - values = [ - yamlencode({ - server = { - ingress = { - enabled = true - hosts = ["argocd.${var.domain}"] - } - } - configs = { - cm = { - "admin.enabled" = "false" - "oidc.config" = var.oidc_config - } - } - }) - ] - - depends_on = [kubernetes_namespace_v1.argocd] +# pkg/provider/aws/templates/main.tf +module "eks_cluster" { + source = "nebari-dev/eks-cluster/aws" + version = "0.4.0" + + project_name = var.project_name + tags = var.tags + availability_zones = var.availability_zones + create_vpc = var.create_vpc + vpc_cidr_block = var.vpc_cidr_block + kubernetes_version = var.kubernetes_version + endpoint_private_access = var.endpoint_private_access + endpoint_public_access = var.endpoint_public_access + node_groups = var.node_groups + efs_enabled = var.efs_enabled + efs_performance_mode = var.efs_performance_mode + efs_throughput_mode = var.efs_throughput_mode + efs_encrypted = var.efs_encrypted + # ... 
see real templates/main.tf for the full set } - -# ArgoCD Applications for foundational software -resource "kubernetes_manifest" "argocd_app_cert_manager" { - manifest = { - apiVersion = "argoproj.io/v1alpha1" - kind = "Application" - metadata = { - name = "cert-manager" - namespace = "argocd" - } - spec = { - project = "default" - source = { - repoURL = var.foundational_software_repo - targetRevision = var.foundational_software_version - path = "cert-manager" - } - destination = { - server = "https://kubernetes.default.svc" - namespace = "cert-manager" - } - syncPolicy = { - automated = { - prune = true - selfHeal = true - } - } - } - } - - depends_on = [helm_release.argocd] -} - -# Additional ArgoCD Applications for LGTM stack, Keycloak, Envoy Gateway, etc. -# (Similar pattern for each foundational software component) ``` -## 6.8 Variables and Configuration - -**terraform/variables.tf:** - -```hcl -variable "provider" { - description = "Cloud provider: aws, gcp, azure, or local" - type = string - - validation { - condition = contains(["aws", "gcp", "azure", "local"], var.provider) - error_message = "Provider must be one of: aws, gcp, azure, local" - } -} - -variable "cluster_name" { - description = "Name of the Kubernetes cluster" - type = string -} - -variable "region" { - description = "Cloud provider region" - type = string -} - -variable "kubernetes_version" { - description = "Kubernetes version" - type = string - default = "1.29" -} - -variable "node_pools" { - description = "Node pool configurations" - type = list(object({ - name = string - instance_type = string - min_size = number - max_size = number - labels = optional(map(string), {}) - taints = optional(list(object({ - key = string - value = string - effect = string - })), []) - })) -} +The variables file (`templates/variables.tf`) declares the input schema (project name, availability zones, VPC, EKS version, node groups, EFS settings, etc.) 
and the outputs file (`templates/outputs.tf`) exposes the cluster name, API endpoint, certificate authority data, OIDC issuer URL, OIDC provider ARN, VPC ID, private subnet IDs, and EFS file system ID. -variable "domain" { - description = "Base domain for the cluster" - type = string -} +The backend is configured for S3 with native lockfile-based locking (`use_lockfile = true`). Bucket name and key are supplied via `-backend-config` at `tofu init` time. See [State Management](../architecture/05-state-management.md) for bucket naming and lifecycle. -variable "tags" { - description = "Tags to apply to all resources" - type = map(string) - default = {} -} +## 6.4 Embedding and Extraction -# AWS-specific variables -variable "aws_vpc_cidr" { - description = "CIDR block for AWS VPC" - type = string - default = "10.0.0.0/16" -} - -variable "aws_availability_zones" { - description = "Availability zones for AWS" - type = list(string) - default = [] -} - -# GCP-specific variables -variable "gcp_project_id" { - description = "GCP project ID" - type = string - default = "" -} +Templates are embedded into the NIC binary via Go's `embed.FS` declared inside `pkg/provider/aws/`. At deploy time, the AWS provider: -# Azure-specific variables -variable "azure_resource_group" { - description = "Azure resource group name" - type = string - default = "" -} -``` +1. Constructs a `map[string]any` of tfvars from the parsed AWS config (region, project name, node groups, EFS settings, etc.). +2. Calls `pkg/tofu.Setup(ctx, templatesFS, tfvars)`, which extracts the embedded files into a fresh temp directory, downloads the OpenTofu binary if not cached, writes `terraform.tfvars.json`, and returns a `TerraformExecutor`. +3. Calls `Init`, `Plan` (or `Apply`/`Destroy`) and lets `pkg/tofu` stream JSON output through the status channel. -## 6.9 Outputs +There is no in-tree EKS HCL, no in-tree node group resources, and no in-tree IAM HCL beyond what the upstream module provides. 
The intent is to leverage a battle-tested community module instead of maintaining a parallel implementation. -**terraform/outputs.tf:** +## 6.5 Non-AWS Providers (Brief) -```hcl -output "kubeconfig" { - description = "Kubeconfig for cluster access" - value = coalesce( - try(module.aws_eks[0].kubeconfig, null), - try(module.gcp_gke[0].kubeconfig, null), - try(module.azure_aks[0].kubeconfig, null), - try(module.local_k3s[0].kubeconfig, null) - ) - sensitive = true -} +For completeness, the other providers do not have any `.tf` files: -output "cluster_endpoint" { - description = "Kubernetes API server endpoint" - value = coalesce( - try(module.aws_eks[0].cluster_endpoint, null), - try(module.gcp_gke[0].cluster_endpoint, null), - try(module.azure_aks[0].cluster_endpoint, null), - try(module.local_k3s[0].cluster_endpoint, null) - ) -} +| Provider | What it actually does | +|----------|----------------------| +| Hetzner | Generates a `hetzner-k3s` config file, invokes the binary, parses its output | +| Local | The provider itself is a thin adapter; the cluster is created by `make localkind-up` (Kind), and NIC's job is the bootstrap that follows | +| Existing | Reads `kubeconfig` and `context` from config; performs no provisioning | +| GCP, Azure | Registered, but every method currently returns "not yet implemented" | -output "argocd_url" { - description = "ArgoCD dashboard URL" - value = "https://argocd.${var.domain}" -} +If and when GCP/Azure are implemented, each provider package will decide independently whether to use OpenTofu (e.g., with the upstream `terraform-google-modules/kubernetes-engine` module for GKE) or another mechanism. The `Provider` interface is the boundary, not Terraform. 
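
As a rough illustration of "the `Provider` interface is the boundary", the sketch below shows a caller that reads capabilities from a provider instead of branching on its name. Everything here is a simplified stand-in: `sketchProvider`, `sketchSettings`, the two example implementations, and the concrete values they return are hypothetical and are not the real definitions in `pkg/provider/provider.go`; only the field names `StorageClass` and `NeedsMetalLB` are taken from the docs.

```go
package main

import "fmt"

// sketchSettings is a hypothetical, trimmed-down stand-in for the
// capabilities a provider reports; the real struct has more fields.
type sketchSettings struct {
	StorageClass string
	NeedsMetalLB bool
}

// sketchProvider is a hypothetical stand-in for the Provider interface.
// The real interface also covers deploy, destroy, validation, and so on.
type sketchProvider interface {
	Name() string
	InfraSettings() sketchSettings
}

// cloudSketch and kindSketch are illustrative implementations; the values
// they return are assumptions chosen only to make the example concrete.
type cloudSketch struct{}

func (cloudSketch) Name() string { return "cloud-example" }
func (cloudSketch) InfraSettings() sketchSettings {
	return sketchSettings{StorageClass: "example-block", NeedsMetalLB: false}
}

type kindSketch struct{}

func (kindSketch) Name() string { return "kind-example" }
func (kindSketch) InfraSettings() sketchSettings {
	return sketchSettings{StorageClass: "standard", NeedsMetalLB: true}
}

// bootstrapSteps decides what to install purely from reported capabilities.
// Note that it never inspects p.Name() to make a decision.
func bootstrapSteps(p sketchProvider) []string {
	s := p.InfraSettings()
	steps := []string{"use storage class " + s.StorageClass}
	if s.NeedsMetalLB {
		steps = append(steps, "install MetalLB")
	}
	return steps
}

func main() {
	for _, p := range []sketchProvider{cloudSketch{}, kindSketch{}} {
		fmt.Println(p.Name(), bootstrapSteps(p))
	}
}
```

The point of the shape is that adding a new provider means adding a new implementation, not editing `bootstrapSteps`.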
-output "grafana_url" { - description = "Grafana dashboard URL" - value = "https://grafana.${var.domain}" -} +## 6.6 Adding a New Terraform-Backed Provider -output "keycloak_url" { - description = "Keycloak admin console URL" - value = "https://keycloak.${var.domain}" -} -``` +The pattern, if you choose tofu for a new provider: ---- +1. Create `pkg/provider//` with `config.go`, `provider.go`, and `tofu.go`. +2. Add a `templates/` directory inside the package with `main.tf`, `variables.tf`, `outputs.tf`, `provider.tf`, and (optionally) `backend.tf`. Embed it via `go:embed`. +3. Implement the `provider.Provider` interface. `Deploy` should build a tfvars map, call `pkg/tofu.Setup`, and invoke `Init`/`Plan`/`Apply` (or `Plan` only when `DeployOptions.DryRun` is true). +4. Implement `InfraSettings(cfg)` to return provider-shaped capabilities (`StorageClass`, `NeedsMetalLB`, `LoadBalancerAnnotations`, `KeycloakBasePath`, `HTTPSPort`, etc.). Do not add `switch` statements on provider name elsewhere in the codebase. +5. Register the provider in `cmd/nic/main.go` via `reg.ClusterProviders.Register(ctx, "", New())`. +6. Add an example config under `examples/` and validate against `pkg/config`. -## Summary +## 6.7 Anti-Patterns to Avoid -The OpenTofu module architecture provides: +These came up during the previous design-doc audit and are not how NIC actually works: -- **Modular design**: Separate modules for each cloud provider and component -- **Community leverage**: Ability to use battle-tested Terraform modules -- **Standard tooling**: Compatible with terraform-docs, tfsec, Atlantis, etc. -- **Familiar patterns**: HCL syntax known to most infrastructure engineers -- **Declarative infrastructure**: Terraform's plan/apply workflow for safe changes +- **Root `terraform/` directory with modules per provider.** Each provider owns its templates, embedded in the package. 
+- **A single root tofu module with `local.is_aws / is_gcp / is_azure` conditionals.** There is no single module; each provider has its own. +- **OpenTofu installing ArgoCD via `helm_release`.** NIC installs ArgoCD via the embedded Helm Go SDK (`pkg/helm`). +- **OpenTofu applying ArgoCD `Application` manifests via the Terraform kubernetes provider.** ArgoCD manifests are rendered into a Git repository by `pkg/argocd` and synced by ArgoCD. -See [Terraform-Exec Integration](08-terraform-exec-integration.md) for how the Go CLI orchestrates these modules. +See [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md) for the planned direction (out-of-tree gRPC plugins per provider kind), which makes the per-provider tool choice even more explicit. diff --git a/docs/design-doc/implementation/07-configuration-design.md b/docs/design-doc/implementation/07-configuration-design.md index d9dba47a..2c989962 100644 --- a/docs/design-doc/implementation/07-configuration-design.md +++ b/docs/design-doc/implementation/07-configuration-design.md @@ -1,167 +1,152 @@ # Configuration Design -### 7.1 Configuration Philosophy +## 7.1 Principles -**New Clean Configuration Format:** -- Not constrained by old config.yaml -- Optimized for new architecture -- Clear separation of concerns -- Validation at parse time +NIC's configuration philosophy: -### 7.2 Configuration Structure +1. **Single config file**: One `nebari-config.yaml` is the source of truth for a deployment. +2. **Discriminator pattern for providers**: `cluster.:` and `dns.:` use the provider name as the map key, with provider-specific config underneath. The `config` package never imports a provider package; per-provider decoding happens inside each provider. +3. **No secrets in config**: Credentials live in environment variables (typically loaded from `.env`). Config files are safe to check into a GitOps repo. +4. 
**Validate at parse time**: `NebariConfig.Validate(opts)` checks required fields and provider-name validity before any infrastructure call. +5. **Provider capabilities flow through `InfraSettings`**: Code outside `cmd/nic` and a provider's own package never branches on provider name; capabilities like `NeedsMetalLB` or `StorageClass` are read from `provider.InfraSettings(cfg)`. + +## 7.2 Top-Level Schema + +`NebariConfig` in `pkg/config/config.go`: + +```go +type NebariConfig struct { + ProjectName string `yaml:"project_name"` // required + Domain string `yaml:"domain,omitempty"` + Cluster *ClusterConfig `yaml:"cluster,omitempty"` // required + DNS *DNSConfig `yaml:"dns,omitempty"` // optional + GitRepository *git.Config `yaml:"git_repository,omitempty"` + Certificate *CertificateConfig `yaml:"certificate,omitempty"` +} +``` + +The corresponding minimal YAML: -**Example: `config.yaml`** ```yaml -version: "2025.1.0" -name: nebari-prod - -provider: - type: aws # aws | gcp | azure | local - region: us-west-2 - - # Provider-specific configuration - aws: - account_id: "123456789012" - vpc: - cidr: "10.0.0.0/16" - availability_zones: 3 - -kubernetes: - version: "1.29" - - node_pools: - - name: general - instance_type: m6i.2xlarge - min_size: 3 - max_size: 10 - labels: - workload: general - taints: [] - - - name: compute - instance_type: m6i.8xlarge - min_size: 0 - max_size: 20 - labels: - workload: compute - taints: - - key: compute - value: "true" - effect: NoSchedule - - - name: gpu - instance_type: g5.2xlarge - min_size: 0 - max_size: 5 - labels: - workload: gpu - nvidia.com/gpu: "true" - taints: - - key: nvidia.com/gpu - value: "true" - effect: NoSchedule - -domain: nebari.example.com - -tls: - enabled: true - letsencrypt: - enabled: true - email: admin@example.com - # Or bring your own cert: - # certificate_secret: custom-tls-cert - -foundational_software: - argocd: - enabled: true - version: "2.10.0" - repo_url: 
"https://github.com/nebari-dev/nebari-foundational-software" - - cert_manager: - enabled: true - version: "1.14.0" - - envoy_gateway: - enabled: true - version: "1.0.0" - - keycloak: - enabled: true - version: "23.0.0" - admin_username: admin - # admin_password generated and stored in secret - themes: - - nebari-theme - - observability: - enabled: true - - grafana: - version: "10.3.0" - admin_username: admin - - loki: - version: "2.9.0" - retention_days: 30 - storage_size: 100Gi - - mimir: - version: "2.11.0" - retention_days: 90 - storage_size: 500Gi - - tempo: - version: "2.3.0" - retention_days: 14 - storage_size: 100Gi - - opentelemetry: - version: "0.95.0" - # Endpoints exported by default to LGTM stack - - nebari_operator: - enabled: true - version: "1.0.0" - -# Optional: override default images -images: - registry: ghcr.io/nebari-dev - pull_policy: IfNotPresent - -# Optional: enable features -features: - auto_upgrade: false - backup: true - monitoring_alerts: true +project_name: my-nebari # required, [a-zA-Z0-9][a-zA-Z0-9_-]* +domain: nebari.example.com # optional, but needed for routable services + +cluster: # required, exactly one provider + aws: { ... } + +dns: # optional, exactly one provider + cloudflare: { ... } + +git_repository: { ... } # optional on local provider; required for cloud providers +certificate: { ... } # optional, defaults to selfsigned ``` -### 7.3 Configuration Validation +There is **no** top-level `provider:` field, **no** top-level `version:` field, **no** top-level `name:` field (use `project_name`), and **no** top-level `kubernetes:`, `node_pools:`, `tls:`, `foundational_software:`, `images:`, or `features:` blocks. If you find documentation that claims otherwise, it is out of date. -**Validation Stages:** -1. **Schema validation**: YAML structure matches schema -2. **Provider validation**: Provider-specific settings valid -3. **Version compatibility**: Kubernetes version supported by provider -4. 
**Resource limits**: Instance types valid for region -5. **Dependency checks**: e.g., TLS requires domain +## 7.3 Cluster Provider Block -**CLI Validation:** -```bash -$ nic validate -f config.yaml -✅ Configuration valid - -Summary: - Provider: AWS (us-west-2) - Kubernetes: 1.29 - Node pools: 3 (general, compute, gpu) - Domain: nebari.example.com - TLS: Let's Encrypt - Foundational software: 9 components enabled +```go +type ClusterConfig struct { + Providers map[string]any `yaml:",inline"` +} +``` + +Exactly one key under `cluster:`. Valid provider names (from `cmd/nic/main.go` registration): `aws`, `gcp`, `azure`, `local`, `hetzner`, `existing`. GCP and Azure are registered but their methods return "not yet implemented". + +The inline map captures the provider name as the key and an opaque `any` as the value. The provider implementation is responsible for decoding the `any` into its own typed config (e.g., `pkg/provider/aws/config.go:Config` for AWS). + +## 7.4 DNS Provider Block + +Same shape as `cluster`: + +```go +type DNSConfig struct { + Providers map[string]any `yaml:",inline"` +} +``` + +Valid provider names today: `cloudflare`. The DNS provider implementation owns the schema for its config. See [09-dns-provider-architecture.md](09-dns-provider-architecture.md). + +## 7.5 Certificate Block + +```go +type CertificateConfig struct { + Type string `yaml:"type,omitempty"` // "selfsigned" or "letsencrypt" + ACME *ACMEConfig `yaml:"acme,omitempty"` +} + +type ACMEConfig struct { + Email string `yaml:"email"` + Server string `yaml:"server,omitempty"` // staging URL for testing +} ``` -### 7.4 Multi-Environment Support +`selfsigned` is the default and is appropriate for local and internal deployments. `letsencrypt` requires `acme.email` (and a publicly-routable `domain` configured via the DNS provider). -**MVP Approach:** Use separate configuration files per environment (dev.yaml, staging.yaml, production.yaml). 
+## 7.6 Git Repository Block -**Future Enhancement:** Config overlays with base/override pattern (see docs/appendix/15-future-enhancements.md). +```go +// from pkg/git +type Config struct { + URL string `yaml:"url"` // git@..., https://..., or file://... + Branch string `yaml:"branch,omitempty"` // default: main + Path string `yaml:"path,omitempty"` // subdirectory for this cluster + Auth AuthConfig `yaml:"auth,omitempty"` + ArgocdAuth AuthConfig `yaml:"argocd_auth,omitempty"` // optional read-only +} + +type AuthConfig struct { + SSHKeyEnv string `yaml:"ssh_key_env,omitempty"` + TokenEnv string `yaml:"token_env,omitempty"` +} +``` + +The git repository is where NIC renders ArgoCD `Application` manifests during deploy. ArgoCD then syncs from it. + +- **Local file:// repos** are valid (and the default for local Kind clusters that have `InfraSettings.SupportsLocalGitOps = true`). The local provider's auto-bootstrap creates `/tmp/nebari-gitops-` if no `git_repository:` block is provided. +- **Cloud providers** require an explicit `git_repository:` block; cluster nodes cannot see the dev machine's filesystem, so a remote (SSH or HTTPS) repo is required. +- Credentials are referenced by env-var name, never inlined. The CLI scrubs the `auth:` and `argocd_auth:` blocks from any copy of the config it writes into the GitOps repo. + +## 7.7 Example Configs + +Authoritative examples live under [`examples/`](../../../examples/) in the repo. 
Highlights: + +- [`examples/aws-config.yaml`](../../../examples/aws-config.yaml) - EKS with EFS and remote GitOps repo +- [`examples/hetzner-config.yaml`](../../../examples/hetzner-config.yaml) - Hetzner k3s with `node_groups.master` and `node_groups.workers` +- [`examples/local-config.yaml`](../../../examples/local-config.yaml) - Kind cluster with optional MetalLB and `file://` GitOps repo +- [`examples/existing-config.yaml`](../../../examples/existing-config.yaml) - Adopt an existing kubeconfig +- [`examples/gcp-config.yaml`](../../../examples/gcp-config.yaml), [`examples/azure-config.yaml`](../../../examples/azure-config.yaml) - schema for the stub providers (not deployable today) + +The full per-provider field reference lives in [`16-configuration-reference.md`](../appendix/16-configuration-reference.md). + +## 7.8 Validation + +`NebariConfig.Validate(opts ValidateOptions)` runs at parse time. `ValidateOptions` carries the set of valid cluster and DNS provider names, supplied by the caller (typically `cmd/nic` looking up names from `pkg/registry`). The config package itself doesn't know which provider names are valid, which keeps it decoupled from provider implementations. + +Validation enforces: + +- `project_name` is set and matches `^[a-zA-Z0-9][a-zA-Z0-9_-]*$` +- `cluster:` is present with exactly one provider key matching `opts.ClusterProviders` +- `dns:`, if present, has exactly one provider key matching `opts.DNSProviders` +- `git_repository:`, if present, validates per `pkg/git.Config.Validate()` + +Provider-specific validation (e.g., that `cluster.aws.region` is set, that node groups are non-empty) lives inside the provider's own `Validate(ctx, projectName, clusterConfig)` method. + +## 7.9 Auto-Discovery + +If `nic deploy` is invoked without `-f`, the CLI auto-discovers a config file in the working directory. See `cmd/nic/config_discovery.go` for the search order. Explicit `-f path/to/config.yaml` always wins. 
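
As a rough illustration of the top-level rules listed in 7.8, the sketch below checks a project name against the quoted pattern and enforces the exactly-one-provider-key rule. The helper names `validateProjectName` and `singleProviderKey` are hypothetical and are not taken from the config package; the real `NebariConfig.Validate` may structure these checks differently.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// projectNameRE is the pattern quoted in section 7.8.
var projectNameRE = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_-]*$`)

// validateProjectName is a hypothetical helper, not the real NIC function.
func validateProjectName(name string) error {
	if name == "" {
		return fmt.Errorf("project_name is required")
	}
	if !projectNameRE.MatchString(name) {
		return fmt.Errorf("project_name %q must match %s", name, projectNameRE)
	}
	return nil
}

// singleProviderKey enforces "exactly one provider key, and it must be in
// the caller-supplied list of valid names". The valid list plays the role
// that ValidateOptions plays in the text; this signature is an assumption.
func singleProviderKey(block string, providers map[string]any, valid []string) (string, error) {
	if len(providers) != 1 {
		return "", fmt.Errorf("%s: expected exactly one provider, got %d", block, len(providers))
	}
	for name := range providers {
		for _, v := range valid {
			if name == v {
				return name, nil
			}
		}
		sorted := append([]string(nil), valid...)
		sort.Strings(sorted)
		return "", fmt.Errorf("%s: unknown provider %q (valid: %v)", block, name, sorted)
	}
	return "", nil // unreachable: the map holds exactly one key
}

func main() {
	fmt.Println(validateProjectName("my-nebari"))
	fmt.Println(validateProjectName("-bad"))

	cluster := map[string]any{"aws": map[string]any{"region": "us-west-2"}}
	name, err := singleProviderKey("cluster", cluster, []string{"aws", "local"})
	fmt.Println(name, err)
}
```

Passing the valid names in as an argument, rather than importing a provider registry, reflects the decoupling that 7.8 describes.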
+ +## 7.10 Secrets + +Secrets are never written into the config file. The expected pattern: + +```bash +# .env (gitignored; loaded automatically by godotenv in main.go) +AWS_ACCESS_KEY_ID=... +AWS_SECRET_ACCESS_KEY=... +HCLOUD_TOKEN=... +CLOUDFLARE_API_TOKEN=... +GIT_SSH_PRIVATE_KEY=... +``` ---- +The `git_repository.auth.ssh_key_env` / `token_env` fields point at env-var names, not at the values. This keeps the config file safe to commit and lets the same file be used across operator machines with different credentials. diff --git a/docs/design-doc/implementation/08-terraform-exec-integration.md b/docs/design-doc/implementation/08-terraform-exec-integration.md index bdf750b0..953511d0 100644 --- a/docs/design-doc/implementation/08-terraform-exec-integration.md +++ b/docs/design-doc/implementation/08-terraform-exec-integration.md @@ -1,644 +1,156 @@ -# Terraform-Exec Integration +# terraform-exec Integration -## 8.1 Overview +## 8.1 Scope -NIC uses the `hashicorp/terraform-exec` library to orchestrate OpenTofu execution from Go. The Go CLI doesn't make direct cloud API calls; instead, it manages the OpenTofu lifecycle (init, plan, apply, destroy) and processes outputs. +NIC uses HashiCorp's `terraform-exec` library to orchestrate OpenTofu execution **from the AWS provider**. Other cluster providers (Hetzner, local, existing) do not use terraform-exec. The wrapper for this integration lives in `pkg/tofu/`. -## 8.2 Execution Flow +This document describes the wrapper, the Setup helper, and how AWS-provider code uses it. The AWS-side code that calls into `pkg/tofu` is in `pkg/provider/aws/tofu.go`. 
+ +## 8.2 Package Layout ``` -User → NIC CLI → terraform-exec → OpenTofu Binary → Terraform Provider → Cloud API → Infrastructure +pkg/tofu/ +├── tofu.go # TerraformExecutor type, Setup, Init/Plan/Apply/Destroy/Output, downloader +├── log.go # JSON line mapper for status streaming +├── version.go # Pinned OpenTofu version +├── context_default.go # Non-Linux signal handling +└── context_linux.go # Linux-specific signal handling (PR_SET_PDEATHSIG) ``` -1. User runs `nic deploy -f config.yaml` -2. Go CLI parses config and generates Terraform variables -3. terraform-exec invokes OpenTofu init, plan, apply -4. OpenTofu uses provider plugins to call cloud APIs -5. State file updated in remote backend -6. Go CLI retrieves outputs and waits for readiness - -## 8.3 Wrapper Package Design - -The `pkg/tofu` package wraps terraform-exec with OpenTelemetry instrumentation: +There is no `executor.go`, `workspace.go`, or `outputs.go`. The entire wrapper is in `tofu.go`. -**pkg/tofu/executor.go:** +## 8.3 The Wrapper Type ```go -package tofu - -import ( - "context" - "fmt" - "os" - "path/filepath" - - "github.com/hashicorp/terraform-exec/tfexec" - "log/slog" - "go.opentelemetry.io/otel" - "go.opentelemetry.io/otel/attribute" -) - -var tracer = otel.Tracer("github.com/nebari-dev/nic/pkg/tofu") - -type Executor struct { +// pkg/tofu/tofu.go +type TerraformExecutor struct { + *tfexec.Terraform workingDir string - tofuPath string - tf *tfexec.Terraform -} - -// NewExecutor creates a new OpenTofu executor -func NewExecutor(workingDir string, tofuPath string) (*Executor, error) { - tf, err := tfexec.NewTerraform(workingDir, tofuPath) - if err != nil { - return nil, fmt.Errorf("creating terraform executor: %w", err) - } - - return &Executor{ - workingDir: workingDir, - tofuPath: tofuPath, - tf: tf, - }, nil -} - -// Init initializes the Terraform working directory -func (e *Executor) Init(ctx context.Context) error { - ctx, span := tracer.Start(ctx, "Executor.Init") - defer span.End() - 
- span.SetAttributes( - attribute.String("working_dir", e.workingDir), - ) - - slog.InfoContext(ctx, "initializing OpenTofu", "working_dir", e.workingDir) - - if err := e.tf.Init(ctx, tfexec.Upgrade(true)); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform init: %w", err) - } - - slog.InfoContext(ctx, "OpenTofu initialized successfully") - return nil -} - -// Plan generates an execution plan -func (e *Executor) Plan(ctx context.Context, varFiles []string) (bool, error) { - ctx, span := tracer.Start(ctx, "Executor.Plan") - defer span.End() - - span.SetAttributes( - attribute.StringSlice("var_files", varFiles), - ) - - slog.InfoContext(ctx, "planning infrastructure changes") - - var opts []tfexec.PlanOption - for _, vf := range varFiles { - opts = append(opts, tfexec.VarFile(vf)) - } - - hasChanges, err := e.tf.Plan(ctx, opts...) - if err != nil { - span.RecordError(err) - return false, fmt.Errorf("terraform plan: %w", err) - } - - span.SetAttributes( - attribute.Bool("has_changes", hasChanges), - ) - - if hasChanges { - slog.InfoContext(ctx, "infrastructure changes detected") - } else { - slog.InfoContext(ctx, "no infrastructure changes needed") - } - - return hasChanges, nil -} - -// Apply applies the Terraform configuration -func (e *Executor) Apply(ctx context.Context, varFiles []string) error { - ctx, span := tracer.Start(ctx, "Executor.Apply") - defer span.End() - - span.SetAttributes( - attribute.StringSlice("var_files", varFiles), - ) - - slog.InfoContext(ctx, "applying infrastructure changes") - - var opts []tfexec.ApplyOption - for _, vf := range varFiles { - opts = append(opts, tfexec.VarFile(vf)) - } - - if err := e.tf.Apply(ctx, opts...); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform apply: %w", err) - } - - slog.InfoContext(ctx, "infrastructure applied successfully") - return nil -} - -// Destroy destroys the Terraform-managed infrastructure -func (e *Executor) Destroy(ctx context.Context, varFiles []string) error 
{ - ctx, span := tracer.Start(ctx, "Executor.Destroy") - defer span.End() - - slog.InfoContext(ctx, "destroying infrastructure") - - var opts []tfexec.DestroyOption - for _, vf := range varFiles { - opts = append(opts, tfexec.VarFile(vf)) - } - - if err := e.tf.Destroy(ctx, opts...); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform destroy: %w", err) - } - - slog.InfoContext(ctx, "infrastructure destroyed successfully") - return nil -} - -// Output retrieves Terraform outputs -func (e *Executor) Output(ctx context.Context) (map[string]tfexec.OutputMeta, error) { - ctx, span := tracer.Start(ctx, "Executor.Output") - defer span.End() - - slog.InfoContext(ctx, "retrieving Terraform outputs") - - outputs, err := e.tf.Output(ctx) - if err != nil { - span.RecordError(err) - return nil, fmt.Errorf("terraform output: %w", err) - } - - span.SetAttributes( - attribute.Int("output_count", len(outputs)), - ) - - return outputs, nil -} - -// Show retrieves the current state -func (e *Executor) Show(ctx context.Context) (*tfjson.State, error) { - ctx, span := tracer.Start(ctx, "Executor.Show") - defer span.End() - - slog.InfoContext(ctx, "retrieving Terraform state") - - state, err := e.tf.Show(ctx) - if err != nil { - span.RecordError(err) - return nil, fmt.Errorf("terraform show: %w", err) - } - - return state, nil + appFs afero.Fs } ``` -## 8.4 Deploy Command Integration +`TerraformExecutor` embeds `*tfexec.Terraform` so callers get the full upstream API for free. 
The wrapper adds: -**cmd/nic/deploy.go:** +- The temp working directory it created +- An `afero.Fs` for testable filesystem access +- A `Cleanup()` method that removes the working dir + +The exported methods that NIC actually calls are wrapped to stream JSON output through the status channel attached to `ctx`: ```go -package main - -import ( - "context" - "encoding/json" - "fmt" - "os" - "os/exec" - "path/filepath" - - "github.com/spf13/cobra" - "go.opentelemetry.io/otel" - "go.opentelemetry.io/otel/attribute" - - "github.com/nebari-dev/nic/pkg/config" - "github.com/nebari-dev/nic/pkg/tofu" -) - -var tracer = otel.Tracer("github.com/nebari-dev/nic") - -var deployCmd = &cobra.Command{ - Use: "deploy", - Short: "Deploy Nebari infrastructure", - RunE: runDeploy, +func (te *TerraformExecutor) Init(ctx context.Context, opts ...tfexec.InitOption) error { + ctx = signalSafeContext(ctx) + return te.streamThroughStatus(ctx, func(w io.Writer) error { + return te.InitJSON(ctx, w, opts...) + }) } +``` -func runDeploy(cmd *cobra.Command, args []string) error { - ctx := cmd.Context() - ctx, span := tracer.Start(ctx, "deploy") - defer span.End() - - configFile, _ := cmd.Flags().GetString("config") - span.SetAttributes(attribute.String("config_file", configFile)) - - // Step 1: Parse configuration - cfg, err := config.ParseFile(configFile) - if err != nil { - span.RecordError(err) - return fmt.Errorf("parsing config: %w", err) - } - - span.SetAttributes( - attribute.String("provider", cfg.Provider), - attribute.String("project_name", cfg.ProjectName), - ) - - // Step 2: Convert config to Terraform variables - varsFile, err := generateTerraformVars(ctx, cfg) - if err != nil { - span.RecordError(err) - return fmt.Errorf("generating terraform vars: %w", err) - } - defer os.Remove(varsFile) - - // Step 3: Locate OpenTofu binary - tofuPath, err := findOpenTofuBinary() - if err != nil { - span.RecordError(err) - return fmt.Errorf("finding opentofu binary: %w", err) - } +`Plan`, `Apply`, 
and `Destroy` follow the same pattern, calling `PlanJSON`, `ApplyJSON`, and `DestroyJSON` respectively. `Output` does not stream because its caller wants the parsed `map[string]tfexec.OutputMeta` directly. - span.SetAttributes(attribute.String("tofu_path", tofuPath)) +`streamThroughStatus` creates a stdout writer that maps each JSON line to a `status.Update` (`jsonLineMapper`) and a stderr writer that maps each raw line to an error-level `Update`. Both writers are flushed after the operation completes to drain any partial trailing line. - // Step 4: Create tofu executor - workingDir := filepath.Join("terraform") - executor, err := tofu.NewExecutor(workingDir, tofuPath) - if err != nil { - span.RecordError(err) - return fmt.Errorf("creating tofu executor: %w", err) - } +### Why JSON streaming? - // Step 5: Initialize Terraform - if err := executor.Init(ctx); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform init: %w", err) - } +OpenTofu's `-json` mode emits one structured event per line, with `@level`, `@message`, plus structured fields per event type (apply progress, plan summary, diagnostics, etc.). Streaming those through the status channel lets the CLI render live progress without parsing OpenTofu's human-readable output. The full event payload is attached to each `status.Update` via `Update.Metadata[status.MetadataKeyPayload]` so downstream handlers can pick out any field they want. - // Step 6: Plan infrastructure changes - hasChanges, err := executor.Plan(ctx, []string{varsFile}) - if err != nil { - span.RecordError(err) - return fmt.Errorf("terraform plan: %w", err) - } +### Logging policy - if !hasChanges { - fmt.Println("No infrastructure changes needed") - return nil - } +`pkg/tofu` does not call `slog`. That's intentional and required: per [`CLAUDE.md`](../../../CLAUDE.md), library code never logs. Translation into log records happens in `cmd/nic/status_handler.go`. 
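
The line-to-update mapping described under "Why JSON streaming?" could be sketched roughly as follows. This is not the code in `pkg/tofu/log.go`: the `update` type stands in for `status.Update`, whose fields are not shown in this document, and `mapJSONLine` is a hypothetical analogue of `jsonLineMapper`. The fallback for non-JSON lines is an assumption as well.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// update is a stand-in for status.Update (assumed shape, for illustration).
type update struct {
	Level   string
	Message string
	Payload map[string]any // the full decoded event, kept for downstream handlers
}

// mapJSONLine decodes one line of `-json` output into an update.
// It reads the "@level" and "@message" fields named in the text and keeps
// the whole event as the payload.
func mapJSONLine(line []byte) update {
	var payload map[string]any
	if err := json.Unmarshal(line, &payload); err != nil {
		// Assumption: surface non-JSON lines as plain info messages
		// rather than dropping them.
		return update{Level: "info", Message: string(line)}
	}
	level, _ := payload["@level"].(string)
	msg, _ := payload["@message"].(string)
	if level == "" {
		level = "info"
	}
	return update{Level: level, Message: msg, Payload: payload}
}

func main() {
	line := []byte(`{"@level":"info","@message":"aws_vpc.main: Creating...","type":"apply_start"}`)
	u := mapJSONLine(line)
	fmt.Printf("%s: %s (type=%v)\n", u.Level, u.Message, u.Payload["type"])
}
```

Keeping the decoded event alongside the extracted level and message mirrors the idea of attaching the full payload to each update, so a handler can read event-specific fields without re-parsing the line.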
- // Step 7: Apply infrastructure changes - if err := executor.Apply(ctx, []string{varsFile}); err != nil { - span.RecordError(err) - return fmt.Errorf("terraform apply: %w", err) - } +## 8.4 Setup - // Step 8: Retrieve outputs (kubeconfig, URLs, etc.) - outputs, err := executor.Output(ctx) - if err != nil { - span.RecordError(err) - return fmt.Errorf("terraform output: %w", err) - } +`Setup` is the entry point that providers actually call: - // Step 9: Wait for Kubernetes cluster readiness - kubeconfig := outputs["kubeconfig"].Value.(string) - if err := waitForClusterReady(ctx, kubeconfig); err != nil { - span.RecordError(err) - return fmt.Errorf("waiting for cluster: %w", err) - } +```go +func Setup(ctx context.Context, templates fs.FS, tfvars any) (*TerraformExecutor, error) +``` - // Step 10: Wait for ArgoCD and foundational software - if err := waitForFoundationalSoftware(ctx, kubeconfig); err != nil { - span.RecordError(err) - return fmt.Errorf("waiting for foundational software: %w", err) - } +It does the following: - fmt.Println("Nebari deployed successfully") - fmt.Printf(" Domain: %s\n", cfg.Domain) - fmt.Printf(" ArgoCD: https://argocd.%s\n", cfg.Domain) - fmt.Printf(" Grafana: https://grafana.%s\n", cfg.Domain) - fmt.Printf(" Keycloak: https://keycloak.%s\n", cfg.Domain) +1. Allocates a fresh temp working directory via `afero.TempDir`. +2. Walks the `templates` filesystem (an `embed.FS` from the calling provider) and copies each file into the working dir. +3. Ensures `~/.cache/nic/tofu/` exists and uses it as the OpenTofu download cache. +4. Downloads the OpenTofu binary (version pinned in `pkg/tofu/version.go`) via the `tofudl` library, with `MirrorConfig` that caches both API responses and artifacts indefinitely. Writes the executable into the working dir to avoid version-mismatch races between concurrent NIC invocations. +5. Sets `TF_PLUGIN_CACHE_DIR` to `~/.cache/nic/tofu/plugins` so provider plugins are reused across runs. +6. 
Marshals `tfvars` to `terraform.tfvars.json` in the working dir. +7. Constructs `tfexec.NewTerraform(workingDir, execPath)` and returns the wrapped `TerraformExecutor`. - return nil -} +If any step fails, the temp dir and the (empty) cache directories are cleaned up. The caller is responsible for `defer executor.Cleanup()` once Setup succeeds. -// generateTerraformVars converts NebariConfig to Terraform variables JSON -func generateTerraformVars(ctx context.Context, cfg *config.NebariConfig) (string, error) { - ctx, span := tracer.Start(ctx, "generateTerraformVars") - defer span.End() +There is **no** `findOpenTofuBinary()` helper and no `PATH` lookup for a `tofu` or `terraform` binary. The binary is always the version NIC pinned and downloaded. - // Convert Go config struct to Terraform variables map - vars := map[string]any{ - "provider": cfg.Provider, - "cluster_name": cfg.ProjectName, - "domain": cfg.Domain, - "region": getRegion(cfg), - "kubernetes_version": cfg.KubernetesVersion, - "node_pools": convertNodePools(cfg), - "tags": getTags(cfg), - } +## 8.5 AWS Provider Usage - // Add provider-specific variables (extracted from cfg.ProviderConfig map) - switch cfg.Provider { - case "aws": - if awsCfg := cfg.ProviderConfig["amazon_web_services"]; awsCfg != nil { - vars["aws_vpc_cidr"] = awsCfg.(map[string]any)["vpc_cidr"] - vars["aws_availability_zones"] = awsCfg.(map[string]any)["availability_zones"] - } - case "gcp": - if gcpCfg := cfg.ProviderConfig["google_cloud_platform"]; gcpCfg != nil { - vars["gcp_project_id"] = gcpCfg.(map[string]any)["project_id"] - } - case "azure": - if azureCfg := cfg.ProviderConfig["azure"]; azureCfg != nil { - vars["azure_resource_group"] = azureCfg.(map[string]any)["resource_group"] - } - } +The AWS provider's `Deploy` and `Destroy` methods are the primary callers. 
The shape (simplified, with telemetry omitted): - // Write to temporary file - tmpFile, err := os.CreateTemp("", "nic-vars-*.json") - if err != nil { - return "", fmt.Errorf("creating temp file: %w", err) - } - defer tmpFile.Close() +```go +// pkg/provider/aws/tofu.go (illustrative) +func (p *Provider) Deploy(ctx context.Context, projectName string, cluster *config.ClusterConfig, opts provider.DeployOptions) error { + awsCfg, err := decodeConfig(cluster) + if err != nil { return err } - encoder := json.NewEncoder(tmpFile) - encoder.SetIndent("", " ") - if err := encoder.Encode(vars); err != nil { - return "", fmt.Errorf("encoding vars: %w", err) + if err := ensureStateBucket(ctx, s3Client, awsCfg.Region, bucketName); err != nil { + return err } - return tmpFile.Name(), nil -} + tfvars := buildTfvars(projectName, awsCfg) + te, err := tofu.Setup(ctx, templatesFS, tfvars) + if err != nil { return err } + defer te.Cleanup() -// findOpenTofuBinary locates the tofu or terraform binary -func findOpenTofuBinary() (string, error) { - // Try tofu first (OpenTofu) - if path, err := exec.LookPath("tofu"); err == nil { - return path, nil - } + if err := te.Init(ctx, + tfexec.BackendConfig(fmt.Sprintf("bucket=%s", bucketName)), + tfexec.BackendConfig(fmt.Sprintf("key=%s", stateKey(projectName))), + tfexec.BackendConfig(fmt.Sprintf("region=%s", awsCfg.Region)), + ); err != nil { return err } - // Fall back to terraform - if path, err := exec.LookPath("terraform"); err == nil { - return path, nil + if opts.DryRun { + _, err := te.Plan(ctx) + return err } - - return "", fmt.Errorf("neither tofu nor terraform binary found in PATH") + return te.Apply(ctx) } ``` -## 8.5 Error Handling +Key points the previous version of this doc got wrong: -**Terraform errors are wrapped with context:** +- The CLI does **not** call a function like `generateTerraformVars(cfg)` itself; each provider owns its own tfvars construction. 
+- There is no `cfg.Provider` or `cfg.ProviderConfig` field on `NebariConfig`. The provider name is `cfg.Cluster.ProviderName()`; the typed config comes from decoding `cfg.Cluster.ProviderConfig()` inside the provider package. +- There is no `findOpenTofuBinary()`; see Setup above. -```go -// Direct error from terraform-exec -err := executor.Apply(ctx, varFiles) -// Error: terraform apply: creating EKS Cluster: ValidationException: Invalid instance type - -// The Go CLI can provide additional context -if err != nil { - slog.ErrorContext(ctx, "deployment failed", - "error", err, - "provider", cfg.Provider, - "cluster", cfg.ProjectName, - ) - return fmt.Errorf("deploying %s: %w", cfg.ProjectName, err) -} -``` +## 8.6 Backend Override (Dry-Run) -**Error types and handling:** +For `--dry-run` runs against a fresh AWS account where the state bucket might not yet exist, `pkg/tofu` exposes: ```go -// Check for specific Terraform error types -import "github.com/hashicorp/terraform-exec/tfexec" - -if exitErr, ok := err.(*tfexec.ErrTerraformNotFound); ok { - return fmt.Errorf("OpenTofu/Terraform not installed: %w", exitErr) -} - -if lockErr, ok := err.(*tfexec.ErrStateLocked); ok { - return fmt.Errorf("state locked by another process: %w", lockErr) -} +func (te *TerraformExecutor) WriteBackendOverride() error ``` -## 8.6 Working Directory Management - -The Go CLI manages the Terraform working directory: - -```go -// pkg/tofu/workspace.go -package tofu +This writes `backend_override.tf.json` into the working dir with a `terraform.backend.local` block, which OpenTofu uses to override the configured S3 backend for this single run. The AWS provider only triggers this in dry-run mode. 
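
For readers unfamiliar with override files, the sketch below shows what writing a `backend_override.tf.json` containing a `terraform.backend.local` block might look like. It is illustrative only: `localBackendOverride` and `writeLocalBackendOverride` are hypothetical names, and the sketch writes through the `os` package for brevity, whereas the real `WriteBackendOverride` method is not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// localBackendOverride builds the JSON-syntax equivalent of:
//
//	terraform { backend "local" {} }
func localBackendOverride() map[string]any {
	return map[string]any{
		"terraform": map[string]any{
			"backend": map[string]any{
				"local": map[string]any{},
			},
		},
	}
}

// writeLocalBackendOverride is a hypothetical stand-in for the method
// described above. It writes the override file into workingDir and
// returns the file's path.
func writeLocalBackendOverride(workingDir string) (string, error) {
	data, err := json.MarshalIndent(localBackendOverride(), "", "  ")
	if err != nil {
		return "", fmt.Errorf("encoding backend override: %w", err)
	}
	path := filepath.Join(workingDir, "backend_override.tf.json")
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return "", fmt.Errorf("writing backend override: %w", err)
	}
	return path, nil
}

func main() {
	dir, err := os.MkdirTemp("", "nic-override-")
	if err != nil {
		fmt.Println("temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	path, err := writeLocalBackendOverride(dir)
	if err != nil {
		fmt.Println(err)
		return
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(data))
}
```

Because the file name ends in `_override.tf.json`, the block is merged over the configured backend for that run only, so the embedded templates stay unchanged.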
-import ( - "embed" - "io/fs" - "os" - "path/filepath" -) +## 8.7 Signal Handling -//go:embed modules/* -var embeddedModules embed.FS - -type Workspace struct { - baseDir string -} - -// NewWorkspace creates or opens a workspace -func NewWorkspace(baseDir string) (*Workspace, error) { - ws := &Workspace{baseDir: baseDir} - - // Create .nic/terraform directory - tfDir := filepath.Join(baseDir, ".nic", "terraform") - if err := os.MkdirAll(tfDir, 0755); err != nil { - return nil, fmt.Errorf("creating workspace: %w", err) - } - - // Extract embedded modules if not present - if err := ws.extractModules(tfDir); err != nil { - return nil, fmt.Errorf("extracting modules: %w", err) - } - - return ws, nil -} +Long-running tofu operations need to survive Ctrl-C in a controlled way. `signalSafeContext(ctx)` returns a derived context whose cancellation is propagated to the tofu child process via SIGTERM, then SIGKILL after a grace period. On Linux, `pkg/tofu/context_linux.go` also sets `PR_SET_PDEATHSIG` so a crashed NIC process doesn't orphan its tofu child. -// extractModules copies embedded Terraform modules to workspace -func (ws *Workspace) extractModules(destDir string) error { - return fs.WalkDir(embeddedModules, "modules", func(path string, d fs.DirEntry, err error) error { - if err != nil { - return err - } +There is a known cleanup gap during destroy ([#63](https://github.com/nebari-dev/nebari-infrastructure-core/issues/63)): Ctrl-C while `tofu destroy` is mid-flight can leave the S3 state lockfile in place. 
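
The SIGTERM-then-SIGKILL behavior described in 8.7 can be approximated with the standard library alone. The sketch below is not the implementation of `signalSafeContext`, which this document does not show; it only demonstrates the general technique using `os/exec` (Go 1.20 or later), on a Unix-like system. `runWithGrace` and `cancelAfter` are hypothetical names, and `sleep` stands in for the tofu child process.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runWithGrace runs a command whose cancellation sends SIGTERM first and
// falls back to a forced kill if the child is still alive after grace.
func runWithGrace(ctx context.Context, grace time.Duration, name string, args ...string) error {
	cmd := exec.CommandContext(ctx, name, args...)
	// Replace the default behavior (immediate kill) with a polite SIGTERM.
	cmd.Cancel = func() error {
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// After the context is done, os/exec waits this long before killing
	// the process outright.
	cmd.WaitDelay = grace
	return cmd.Run()
}

// cancelAfter starts the command, cancels its context after delay, and
// reports how long the whole run took along with the resulting error.
func cancelAfter(delay, grace time.Duration, name string, args ...string) (time.Duration, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	timer := time.AfterFunc(delay, cancel)
	defer timer.Stop()

	start := time.Now()
	err := runWithGrace(ctx, grace, name, args...)
	return time.Since(start), err
}

func main() {
	elapsed, err := cancelAfter(200*time.Millisecond, 2*time.Second, "sleep", "30")
	fmt.Printf("stopped after %s: %v\n", elapsed.Round(100*time.Millisecond), err)
}
```

A child that handles SIGTERM gets up to the grace period to release locks and flush state before it is killed, which is the property that matters for the destroy-time lockfile gap mentioned above.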
- destPath := filepath.Join(destDir, path) +## 8.8 OpenTelemetry Instrumentation Status - if d.IsDir() { - return os.MkdirAll(destPath, 0755) - } - - content, err := embeddedModules.ReadFile(path) - if err != nil { - return err - } - - return os.WriteFile(destPath, content, 0644) - }) -} - -// GenerateBackendConfig creates backend.tf from config -func (ws *Workspace) GenerateBackendConfig(cfg *config.NebariConfig) error { - tmpl := ` -terraform { - backend "{{.Type}}" { - {{- if eq .Type "s3" }} - bucket = "{{.Bucket}}" - key = "{{.Key}}" - region = "{{.Region}}" - encrypt = true - dynamodb_table = "{{.DynamoDBTable}}" - {{- end }} - {{- if eq .Type "gcs" }} - bucket = "{{.Bucket}}" - prefix = "{{.Prefix}}" - {{- end }} - {{- if eq .Type "azurerm" }} - storage_account_name = "{{.StorageAccount}}" - container_name = "{{.Container}}" - key = "{{.Key}}" - {{- end }} - } -} -` - // ... template execution - return nil -} -``` - -## 8.7 Output Processing - -**Retrieving and using Terraform outputs:** +`TerraformExecutor`'s operation-granularity methods (`Init`, `Plan`, `Apply`, `Destroy`, `Output`) are **not yet** wrapped in their own spans. This is acknowledged as outstanding work in `CLAUDE.md`. 
When that lands, each method will look like: ```go -// pkg/tofu/outputs.go -package tofu - -import ( - "encoding/json" - "fmt" -) - -type DeploymentOutputs struct { - Kubeconfig string `json:"kubeconfig"` - ClusterEndpoint string `json:"cluster_endpoint"` - ArgocdURL string `json:"argocd_url"` - GrafanaURL string `json:"grafana_url"` - KeycloakURL string `json:"keycloak_url"` -} - -func (e *Executor) GetDeploymentOutputs(ctx context.Context) (*DeploymentOutputs, error) { - ctx, span := tracer.Start(ctx, "GetDeploymentOutputs") +func (te *TerraformExecutor) Apply(ctx context.Context, opts ...tfexec.ApplyOption) error { + tracer := otel.Tracer("nebari-infrastructure-core") + ctx, span := tracer.Start(ctx, "tofu.Apply") defer span.End() - - rawOutputs, err := e.Output(ctx) - if err != nil { - return nil, fmt.Errorf("getting outputs: %w", err) - } - - outputs := &DeploymentOutputs{} - - if kc, ok := rawOutputs["kubeconfig"]; ok { - outputs.Kubeconfig = string(kc.Value) - } - - if ep, ok := rawOutputs["cluster_endpoint"]; ok { - outputs.ClusterEndpoint = string(ep.Value) - } - - if argocd, ok := rawOutputs["argocd_url"]; ok { - outputs.ArgocdURL = string(argocd.Value) - } - - if grafana, ok := rawOutputs["grafana_url"]; ok { - outputs.GrafanaURL = string(grafana.Value) - } - - if keycloak, ok := rawOutputs["keycloak_url"]; ok { - outputs.KeycloakURL = string(keycloak.Value) - } - - return outputs, nil + // ... existing body ... 
} ``` -## 8.8 State Operations - -**Exposing state commands via CLI:** - -```go -// cmd/nic/state.go -var stateCmd = &cobra.Command{ - Use: "state", - Short: "Terraform state operations", -} - -var stateListCmd = &cobra.Command{ - Use: "list", - Short: "List resources in state", - RunE: func(cmd *cobra.Command, args []string) error { - executor, err := getExecutor(cmd) - if err != nil { - return err - } - - state, err := executor.Show(cmd.Context()) - if err != nil { - return err - } - - for _, resource := range state.Values.RootModule.Resources { - fmt.Printf("%s.%s\n", resource.Type, resource.Name) - } - - return nil - }, -} - -var stateShowCmd = &cobra.Command{ - Use: "show [address]", - Short: "Show a resource in state", - Args: cobra.ExactArgs(1), - RunE: func(cmd *cobra.Command, args []string) error { - executor, err := getExecutor(cmd) - if err != nil { - return err - } - - // Use terraform state show command - output, err := executor.StateShow(cmd.Context(), args[0]) - if err != nil { - return err - } - - fmt.Println(output) - return nil - }, -} -``` - ---- - -## Summary - -The terraform-exec integration provides: +The byte/line helpers (`streamThroughStatus`, `jsonLineMapper`, `mapStatusLevel`) and the `pkg/status` writers themselves are intentionally exempt: spans at that granularity would dwarf the operations they describe. -- **Programmatic control**: Go CLI orchestrates OpenTofu without shell scripts -- **OpenTelemetry instrumentation**: Full tracing of Terraform operations -- **Error handling**: Structured error types for better debugging -- **Output processing**: Type-safe access to Terraform outputs -- **State management**: CLI commands for state operations +## 8.9 Not Implemented -See [State Management](../architecture/05-state-management.md) for backend configuration and [OpenTofu Module Architecture](06-opentofu-module-architecture.md) for module design. 
+There is no `nic state` subcommand, no `nic plan` subcommand, no `nic unlock`, no `nic init-backend`, and no `nic status` subcommand. Several of those have open issues ([#64](https://github.com/nebari-dev/nebari-infrastructure-core/issues/64) for unlock). Users who need direct state manipulation today must invoke the tofu binary themselves; the bundled cache makes the same version available at `~/.cache/nic/tofu/`. diff --git a/docs/design-doc/implementation/09-dns-provider-architecture.md b/docs/design-doc/implementation/09-dns-provider-architecture.md index 175b2cba..53e8ed0f 100644 --- a/docs/design-doc/implementation/09-dns-provider-architecture.md +++ b/docs/design-doc/implementation/09-dns-provider-architecture.md @@ -83,24 +83,29 @@ The real implementation (`sdkClient`) wraps the `cloudflare-go/v4` SDK. Tests in ### Registry Pattern -DNS providers use a separate registry from cloud providers: +Cluster and DNS providers share a single `registry.Registry`, which holds two `ProviderList` instances (one per provider category). Registration is explicit in `cmd/nic/main.go`: ```go // cmd/nic/main.go -var ( - registry *provider.Registry // Cloud providers - dnsRegistry *dnsprovider.Registry // DNS providers (separate) -) +var reg *registry.Registry + +func init() { + reg = registry.NewRegistry() -func main() { + // Cluster providers + _ = reg.ClusterProviders.Register(ctx, "aws", aws.NewProvider()) + _ = reg.ClusterProviders.Register(ctx, "hetzner", hetzner.NewProvider()) // ... - dnsRegistry = dnsprovider.NewRegistry() - if err := dnsRegistry.Register(ctx, "cloudflare", cloudflare.NewProvider()); err != nil { + + // DNS providers + if err := reg.DNSProviders.Register(ctx, "cloudflare", cloudflare.NewProvider()); err != nil { log.Fatalf("Failed to register Cloudflare DNS provider: %v", err) } } ``` +`registry.Registry`, defined in `pkg/registry/registry.go`, is the single point of registration for all provider categories. 
The two `ProviderList` fields are typed (`ProviderList[provider.Provider]` and `ProviderList[dnsprovider.DNSProvider]`) so misuse is caught at compile time. + ## Configuration ### YAML Configuration @@ -108,13 +113,14 @@ func main() { ```yaml # nebari-config.yaml project_name: my-nebari -provider: aws domain: nebari.example.com -# Cloud provider config... -amazon_web_services: - region: us-west-2 - # ... +# Cluster provider config (single discriminator key) +cluster: + aws: + region: us-west-2 + kubernetes_version: "1.34" + # ... # DNS configuration (optional) dns: @@ -313,7 +319,7 @@ require ( ) ``` -Note: Future providers (AWS Route53, Azure DNS, Google Cloud DNS) may be managed via OpenTofu modules rather than direct SDK calls. +Note: Future providers (AWS Route53, Azure DNS, Google Cloud DNS) will likely be implemented via their native Go SDKs to keep behavior consistent with Cloudflare (idempotent, stateless, instrumented). See [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md) for the planned out-of-tree plugin path that will let private DNS integrations (e.g., OpenTeams' ASCOT) live outside this repo. ## Related Documentation diff --git a/docs/design-doc/implementation/10-foundational-software.md b/docs/design-doc/implementation/10-foundational-software.md index f9a9d5f6..f1e84189 100644 --- a/docs/design-doc/implementation/10-foundational-software.md +++ b/docs/design-doc/implementation/10-foundational-software.md @@ -1,387 +1,115 @@ # Foundational Software Stack -### 9.1 Stack Overview +## 10.1 Overview -**Complete LGTM + Platform Stack:** +NIC deploys an opinionated set of foundational platform services on every cluster. 
After the cluster provider finishes provisioning Kubernetes, NIC: -| Component | Purpose | Why This Tool | -| --------------------------- | ------------------------------ | --------------------------------------------------------------- | -| **ArgoCD** | GitOps continuous deployment | Industry standard, dependency management, self-healing | -| **cert-manager** | TLS certificate automation | Let's Encrypt integration, automatic renewal, cloud DNS solvers | -| **Envoy Gateway** | Ingress & API gateway | Kubernetes Gateway API, future-proof, advanced routing | -| **Keycloak** | Authentication & authorization | Open source, OIDC/SAML, user federation, battle-tested | -| **OpenTelemetry Collector** | Telemetry aggregation | Vendor-neutral, metrics/logs/traces, industry standard | -| **Mimir** | Metrics storage | Prometheus-compatible, horizontally scalable, cost-effective | -| **Loki** | Log aggregation | LogQL, integrates with Grafana, low-cost storage | -| **Tempo** | Distributed tracing | OpenTelemetry native, Grafana integration, scalable | -| **Grafana** | Visualization | Unified dashboards, alerting, LGTM native support | +1. Installs ArgoCD into the `argocd` namespace via the embedded Helm Go SDK (`pkg/helm`). +2. Renders ArgoCD `Application` manifests for the rest of the stack into a Git repository (`pkg/argocd`). +3. Lets ArgoCD sync the stack via the `root.yaml` app-of-apps and sync waves. -### 9.2 Deployment Architecture +The stack is intentionally small. A full LGTM observability backend (Loki / Grafana / Tempo / Mimir) is **not** deployed today; only an OpenTelemetry Collector is shipped. Adding the rest is roadmap work. 
-**ArgoCD App-of-Apps Pattern:** +## 10.2 Components (Actual) -``` -ArgoCD (Deployed by NIC via Helm) - ├── App: cert-manager (Priority: 1) - ├── App: envoy-gateway (Priority: 2, depends: cert-manager) - ├── App: opentelemetry-collector (Priority: 3) - ├── App: mimir (Priority: 4, depends: opentelemetry-collector) - ├── App: loki (Priority: 4, depends: opentelemetry-collector) - ├── App: tempo (Priority: 4, depends: opentelemetry-collector) - ├── App: grafana (Priority: 5, depends: mimir, loki, tempo) - ├── App: keycloak (Priority: 6, depends: envoy-gateway, grafana) - └── App: nebari-operator (Priority: 7, depends: keycloak, grafana, envoy-gateway) -``` - -**Repository Structure:** - -``` -nebari-foundational-software/ -├── argocd-apps/ -│ ├── cert-manager.yaml -│ ├── envoy-gateway.yaml -│ ├── opentelemetry-collector.yaml -│ ├── mimir.yaml -│ ├── loki.yaml -│ ├── tempo.yaml -│ ├── grafana.yaml -│ ├── keycloak.yaml -│ └── nebari-operator.yaml -├── cert-manager/ -│ ├── kustomization.yaml -│ ├── cluster-issuer-letsencrypt.yaml -│ └── ... -├── envoy-gateway/ -│ ├── kustomization.yaml -│ ├── gateway-class.yaml -│ └── ... -├── keycloak/ -│ ├── kustomization.yaml -│ ├── deployment.yaml -│ ├── service.yaml -│ ├── ingress.yaml -│ └── ... -├── observability/ -│ ├── opentelemetry/ -│ │ ├── collector-config.yaml -│ │ └── ... -│ ├── mimir/ -│ │ ├── values.yaml -│ │ └── ... -│ ├── loki/ -│ ├── tempo/ -│ └── grafana/ -└── operator/ - ├── crd.yaml - ├── deployment.yaml - └── ... 
-``` - -### 9.3 Component Details - -#### 9.3.1 ArgoCD - -**Installation Method:** Helm chart via NIC -**Namespace:** `nebari-system` - -```go -func (d *Deployer) installArgoCD(ctx context.Context) error { - ctx, span := tracer.Start(ctx, "installArgoCD") - defer span.End() - - // Install ArgoCD via Helm - helmChart := HelmChart{ - Name: "argo-cd", - Repo: "https://argoproj.github.io/argo-helm", - Chart: "argo-cd", - Version: "5.51.0", - Namespace: "nebari-system", - Values: map[string]interface{}{ - "server": map[string]interface{}{ - "ingress": map[string]interface{}{ - "enabled": true, - "hosts": []string{"argocd." + d.config.Domain}, - "tls": true, - }, - }, - "configs": map[string]interface{}{ - "repositories": map[string]interface{}{ - "nebari-foundational": map[string]interface{}{ - "url": d.config.FoundationalSoftware.ArgoCD.RepoURL, - "type": "git", - }, - }, - }, - }, - } - - if err := d.helm.Install(ctx, helmChart); err != nil { - return fmt.Errorf("installing ArgoCD: %w", err) - } - - // Wait for ArgoCD to be ready - if err := d.waitForArgoCD(ctx); err != nil { - return fmt.Errorf("waiting for ArgoCD: %w", err) - } - - slog.InfoContext(ctx, "ArgoCD installed successfully") - return nil -} -``` - -**Post-Installation:** - -- Create ArgoCD Applications for foundational software -- Configure SSO with Keycloak (after Keycloak deploys) -- Set up RBAC (admin group from Keycloak) - -#### 9.3.2 cert-manager - -**Purpose:** Automated TLS certificate management - -**Features:** - -- Let's Encrypt integration (HTTP-01 and DNS-01 challenges) -- Automatic certificate renewal -- Wildcard certificate support -- Cloud DNS solver support (Route53, Cloud DNS, Azure DNS, Cloudflare) +The authoritative app set is the YAML under `pkg/argocd/templates/apps/`: -**Example ClusterIssuer:** +| Component | App manifest | Purpose | +|-----------|--------------|---------| +| **cert-manager** | `cert-manager.yaml` | TLS certificate automation | +| **cluster-issuers** | 
`cluster-issuers.yaml` | `ClusterIssuer` resources (selfsigned and/or Let's Encrypt) | +| **certificates** | `certificates.yaml` | Initial `Certificate` resources for foundational hostnames | +| **Envoy Gateway** | `envoy-gateway.yaml` | Kubernetes Gateway API implementation | +| **gateway-config** | `gateway-config.yaml` | `Gateway` and listener configuration | +| **httproutes** | `httproutes.yaml` | Initial `HTTPRoute` resources for foundational services | +| **postgresql** | `postgresql.yaml` | Backing database for Keycloak | +| **Keycloak** | `keycloak.yaml` | OIDC identity provider (Codecentric keycloakx chart - context path `/auth`) | +| **MetalLB** | `metallb.yaml` | Bare-metal `LoadBalancer` implementation (only when `InfraSettings.NeedsMetalLB` is true) | +| **metallb-config** | `metallb-config.yaml` | `IPAddressPool` and `L2Advertisement` for MetalLB | +| **OpenTelemetry Collector** | `opentelemetry-collector.yaml` | Telemetry pipeline (no backend deployed yet) | +| **Nebari Operator** | `nebari-operator.yaml` | Reconciles `NebariApp` CRs; source lives in `nebari-dev/nebari-operator` | +| **Nebari Landing Page** | `nebari-landingpage.yaml` | React/Go service catalog UI | +| **root** | `root.yaml` | App-of-apps entry point that owns all of the above | -```yaml -apiVersion: cert-manager.io/v1 -kind: ClusterIssuer -metadata: - name: letsencrypt-prod -spec: - acme: - server: https://acme-v02.api.letsencrypt.org/directory - email: admin@example.com - privateKeySecretRef: - name: letsencrypt-prod-key - solvers: - - dns01: - route53: # For AWS - region: us-west-2 -``` - -#### 9.3.3 Envoy Gateway +Apps not yet shipped (referenced in older docs as if shipped): Grafana, Loki, Mimir, Tempo, Promtail. These are roadmap items. -**Purpose:** Modern ingress controller using Kubernetes Gateway API +## 10.3 GitOps Layout -**Features:** +NIC does not pull these manifests from a separate `nebari-foundational-software` repo. 
The templates live inside this repo (`pkg/argocd/templates/`) and are rendered at deploy time into the user-configured GitOps repository (`git_repository.url`). ArgoCD's source-of-truth for the deployed stack is therefore the user's own repo, which makes everything inspectable, auditable, and overridable. -- Gateway API (v1 stable) -- Advanced routing (header-based, weight-based) -- TLS termination (via cert-manager) -- Rate limiting, JWT validation -- OpenTelemetry tracing +Sketch of what `pkg/argocd` writes into the GitOps repo at the `git_repository.path` subdirectory: -**Example Gateway:** - -```yaml -apiVersion: gateway.networking.k8s.io/v1 -kind: Gateway -metadata: - name: nebari-gateway - namespace: envoy-gateway-system -spec: - gatewayClassName: envoy - listeners: - - name: https - protocol: HTTPS - port: 443 - tls: - mode: Terminate - certificateRefs: - - name: wildcard-tls - namespace: envoy-gateway-system - hostname: "*.nebari.example.com" ``` - -#### 9.3.4 Keycloak - -**Purpose:** Centralized authentication and authorization - -**Features:** - -- OAuth2 / OIDC provider -- User federation (LDAP, Active Directory) -- Social login (Google, GitHub, etc.) 
-- Multi-factor authentication -- User self-service (password reset, profile management) - -**Deployment:** - -- High-availability mode (2+ replicas) -- PostgreSQL database for persistence -- Ingress: `https://auth.nebari.example.com` - -**Integration:** - -- ArgoCD SSO -- Grafana SSO -- Nebari Operator (OAuth client creation for apps) - -#### 9.3.5 OpenTelemetry Collector - -**Purpose:** Centralized telemetry collection - -**Features:** - -- Receives metrics, logs, traces -- Protocol support: OTLP, Prometheus, Jaeger, Zipkin -- Processing pipelines (filtering, sampling, batching) -- Export to LGTM stack - -**Example Configuration:** - -```yaml -receivers: - otlp: - protocols: - grpc: - endpoint: 0.0.0.0:4317 - http: - endpoint: 0.0.0.0:4318 - prometheus: - config: - scrape_configs: - - job_name: "kubernetes-pods" - # Scrape pods with prometheus.io/scrape annotation - -processors: - batch: - timeout: 10s - send_batch_size: 1024 - -exporters: - prometheusremotewrite: - endpoint: http://mimir.monitoring.svc:9009/api/v1/push - loki: - endpoint: http://loki.monitoring.svc:3100/loki/api/v1/push - otlp/tempo: - endpoint: tempo.monitoring.svc:4317 - -service: - pipelines: - metrics: - receivers: [otlp, prometheus] - processors: [batch] - exporters: [prometheusremotewrite] - logs: - receivers: [otlp] - processors: [batch] - exporters: [loki] - traces: - receivers: [otlp] - processors: [batch] - exporters: [otlp/tempo] +// +├── root.yaml # App-of-apps root +├── nic-config.yaml # Scrubbed copy of nebari-config.yaml +├── .nic-bootstrapped # Marker file +└── manifests/ + ├── cert-manager/ # Application + (optional) values + ├── cluster-issuers/ + ├── certificates/ + ├── envoy-gateway/ + ├── gateway-config/ + ├── httproutes/ + ├── postgresql/ + ├── keycloak/ + ├── metallb/ # Skipped when not needed + ├── metallb-config/ # Skipped when not needed + ├── opentelemetry-collector/ + ├── nebari-operator/ # Kustomize patch over upstream operator + └── nebari-landingpage/ ``` -#### 
9.3.6 Mimir (Metrics) - -**Purpose:** Scalable Prometheus-compatible metrics storage +The exact file layout depends on each template; the templates are owned by `pkg/argocd/templates/`. -**Features:** +## 10.4 ArgoCD Bootstrap -- Horizontally scalable -- Long-term storage (object storage: S3/GCS/Azure Blob) -- Prometheus-compatible query API -- Multi-tenancy support -- Compaction and downsampling +ArgoCD is installed in the `argocd` namespace by `pkg/argocd/install.go` via the embedded Helm Go SDK. It is configured with: -**Storage:** +- Keycloak OIDC for SSO (client secret generated by `cmd/nic/deploy.go` and passed into both the ArgoCD Helm values and the Keycloak realm-setup job) +- Read credentials for the GitOps repo (from `git_repository.argocd_auth`, falling back to `git_repository.auth`) +- `repoURL` and `path` from `cfg.GitRepository` -- Short-term: In-cluster (PersistentVolumes) -- Long-term: Cloud object storage (90 days retention) +After ArgoCD comes up, `pkg/argocd/bootstrap.go:ApplyRootAppOfApps` applies the root `Application` directly to the cluster via client-go. Everything else syncs from there. -#### 9.3.7 Loki (Logs) +## 10.5 InfraSettings Drives Conditional Deployment -**Purpose:** Scalable log aggregation +The Provider interface returns `InfraSettings` (see `pkg/provider/provider.go`), and the foundational layer reads from it instead of branching on provider name: -**Features:** +- **`NeedsMetalLB`** - if false, the MetalLB apps are skipped entirely +- **`MetalLBAddressPool`** - feeds `metallb-config`'s `IPAddressPool` +- **`StorageClass`** - default `StorageClass` name for foundational PVCs (postgresql, etc.) +- **`KeycloakBasePath`** - `/auth` for the Codecentric keycloakx chart; empty for upstream/Bitnami Keycloak +- **`HTTPSPort`** - Gateway HTTPS listener port (`443` normalized from `0`; can be overridden e.g. 
for local-dev on `8443`) +- **`LoadBalancerAnnotations`** - applied to the Gateway's provisioned `LoadBalancer` Service +- **`EFSStorageClass`** - name of the EFS-backed `StorageClass` if available (AWS-only) +- **`SupportsLocalGitOps`** - whether `file://` GitOps repos are acceptable (`local` only) -- LogQL query language (similar to PromQL) -- Label-based indexing (cost-effective) -- Cloud object storage for logs -- Grafana native integration +Adding a new provider-shaped capability means adding a field to `InfraSettings` and populating it in each provider's `InfraSettings(cfg)`. There must be no `switch cfg.Cluster.ProviderName()` in `pkg/argocd` or `cmd/nic`. -**Collection:** +## 10.6 Sync Waves -- Promtail DaemonSet (node logs) -- OpenTelemetry Collector (application logs) +Cross-app dependencies are expressed via ArgoCD sync waves on each `Application`. The general ordering: -#### 9.3.8 Tempo (Traces) +1. cert-manager (CRDs and webhooks need to be available before anything else issues certs) +2. cluster-issuers + certificates (initial issuers and the cert-manager `Certificate` resources foundational services depend on) +3. MetalLB + metallb-config (only when needed; before `LoadBalancer` Services) +4. Envoy Gateway + gateway-config + httproutes +5. postgresql + Keycloak +6. opentelemetry-collector +7. nebari-operator +8. nebari-landingpage -**Purpose:** Distributed tracing backend +Exact wave numbers live in the individual template files under `pkg/argocd/templates/apps/`. -**Features:** +## 10.7 Health and Readiness -- OpenTelemetry native -- TraceQL query language -- Object storage for traces -- Integration with Grafana and Loki +Foundational software health is observed via ArgoCD's own sync/health status, not a hardcoded list in NIC. NIC's `deploy` command does not block waiting for every component; it prints follow-up instructions (how to reach ArgoCD, how to reach Keycloak) and exits. 
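The conditional deployment in 10.5 and the ordering in 10.6 can be sketched together. This is a minimal illustration, not NIC's code: the one-field `InfraSettings`, the `appGroups` table, and `EnabledApps` are hypothetical names, and the grouping only mirrors the general ordering listed in 10.6 (real wave numbers live in the template files).

```go
package main

import "fmt"

// InfraSettings carries only the field this sketch needs. It is a
// hypothetical stand-in; the real struct in pkg/provider has more fields.
type InfraSettings struct {
	NeedsMetalLB bool
}

// appGroups follows the general ordering from section 10.6. The grouping is
// illustrative and does not encode real sync-wave numbers.
var appGroups = [][]string{
	{"cert-manager"},
	{"cluster-issuers", "certificates"},
	{"metallb", "metallb-config"},
	{"envoy-gateway", "gateway-config", "httproutes"},
	{"postgresql", "keycloak"},
	{"opentelemetry-collector"},
	{"nebari-operator"},
	{"nebari-landingpage"},
}

// EnabledApps returns app names in sync order. It decides on a capability
// flag (NeedsMetalLB), never on the provider's name.
func EnabledApps(s InfraSettings) []string {
	var out []string
	for _, group := range appGroups {
		for _, app := range group {
			isMetalLB := app == "metallb" || app == "metallb-config"
			if isMetalLB && !s.NeedsMetalLB {
				continue // skipped entirely when the provider does not need it
			}
			out = append(out, app)
		}
	}
	return out
}

func main() {
	fmt.Println(EnabledApps(InfraSettings{NeedsMetalLB: false}))
	fmt.Println(EnabledApps(InfraSettings{NeedsMetalLB: true}))
}
```

Because the decision hangs on a capability flag, a new provider only has to populate the flag correctly. No code in the foundational layer needs to learn the provider's name.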
Users who want to wait for full health can watch ArgoCD's UI or run `kubectl wait` against the relevant Applications. -**Use Cases:** +A first-class `nic status` / health-check subcommand does not exist today; that work is tracked but not started. -- Request tracing across microservices -- Performance debugging -- Dependency visualization +## 10.8 Versions -#### 9.3.9 Grafana - -**Purpose:** Unified visualization and alerting - -**Features:** - -- Dashboards for metrics, logs, traces -- Alerting with multiple notification channels -- Data source management (Mimir, Loki, Tempo) -- SSO via Keycloak -- Dashboard provisioning via ConfigMaps - -**Pre-configured Dashboards:** - -- Kubernetes cluster overview -- Node resources -- Pod resources -- Foundational software health -- NIC deployment metrics - -### 9.4 Health Checks and Readiness - -**NIC Health Check Loop:** - -```go -func (d *Deployer) waitForFoundationalSoftware(ctx context.Context) error { - ctx, span := tracer.Start(ctx, "waitForFoundationalSoftware") - defer span.End() - - components := []Component{ - {Name: "cert-manager", Namespace: "cert-manager"}, - {Name: "envoy-gateway", Namespace: "envoy-gateway-system"}, - {Name: "opentelemetry-collector", Namespace: "monitoring"}, - {Name: "mimir", Namespace: "monitoring"}, - {Name: "loki", Namespace: "monitoring"}, - {Name: "tempo", Namespace: "monitoring"}, - {Name: "grafana", Namespace: "monitoring"}, - {Name: "keycloak", Namespace: "nebari-system"}, - {Name: "nebari-operator", Namespace: "nebari-system"}, - } - - for _, component := range components { - slog.InfoContext(ctx, "waiting for component", "name", component.Name) - - if err := d.waitForDeployment(ctx, component.Name, component.Namespace, 10*time.Minute); err != nil { - return fmt.Errorf("waiting for %s: %w", component.Name, err) - } - - slog.InfoContext(ctx, "component ready", "name", component.Name) - } - - return nil -} -``` +Component versions are pinned in the individual template YAML files under 
`pkg/argocd/templates/apps/`. Search those files for `targetRevision:` and `version:` fields. The nebari-operator version is pinned in `pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml`. ---- +Bumping a foundational version is a config change inside the template file plus an `argocd app sync` on the deployed cluster. diff --git a/docs/design-doc/implementation/11-nebari-operator.md b/docs/design-doc/implementation/11-nebari-operator.md index 7de9445d..e6aeb822 100644 --- a/docs/design-doc/implementation/11-nebari-operator.md +++ b/docs/design-doc/implementation/11-nebari-operator.md @@ -1,485 +1,99 @@ -# Nebari Kubernetes Operator +# Nebari Operator -### 10.1 Operator Purpose +## 11.1 Scope (and What This Document Is Not) -**Problem:** Applications need to integrate with auth, o11y, and routing - currently manual and error-prone. +The Nebari Operator is **not implemented in this repository**. It lives in its own project at [`github.com/nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator) with its own release cadence, CRD schema, and reconciliation logic. -**Solution:** Kubernetes operator that watches `NebariApplication` CRDs and automates: -- OAuth2 client creation in Keycloak -- HTTPRoute configuration in Envoy Gateway -- TLS certificate provisioning via cert-manager -- Grafana dashboard provisioning -- OpenTelemetry ServiceMonitor creation +NIC's only role with respect to the operator is to **deploy it as a foundational ArgoCD application** so that user-installed software packs can rely on its CRDs being present. 
-### 10.2 NebariApplication CRD +This document describes: -**CRD Definition:** -```yaml -apiVersion: apiextensions.k8s.io/v1 -kind: CustomResourceDefinition -metadata: - name: nebariapplications.nebari.dev -spec: - group: nebari.dev - versions: - - name: v1alpha1 - served: true - storage: true - schema: - openAPIV3Schema: - type: object - properties: - spec: - type: object - required: [displayName, routing] - properties: - displayName: - type: string - description: "Human-readable application name" +1. How NIC deploys the operator +2. The contract NIC depends on the operator providing (the `NebariApp` CRD) +3. The provider-shaped capabilities NIC passes into the operator - routing: - type: object - required: [domain, paths] - properties: - domain: - type: string - description: "Application domain (e.g., jupyter.example.com)" - enableTLS: - type: boolean - default: true - paths: - type: array - items: - type: object - required: [path, service, port] - properties: - path: - type: string - service: - type: string - port: - type: integer +For the operator's CRD schema, reconciliation rules, controller code, and release notes, see the upstream repository. 
- authentication: - type: object - properties: - enabled: - type: boolean - default: true - allowedGroups: - type: array - items: - type: string - allowedUsers: - type: array - items: - type: string - publicPaths: - type: array - description: "Paths that don't require auth" - items: - type: string +## 11.2 How NIC Deploys the Operator - observability: - type: object - properties: - metrics: - type: object - properties: - enabled: - type: boolean - default: true - port: - type: integer - path: - type: string - default: "/metrics" - logs: - type: object - properties: - enabled: - type: boolean - default: true - traces: - type: object - properties: - enabled: - type: boolean - default: true - dashboards: - type: array - items: - type: object - properties: - name: - type: string - source: - type: string - description: "URL to dashboard JSON or ConfigMap reference" +The operator is deployed as a foundational ArgoCD application from `pkg/argocd/templates/apps/nebari-operator.yaml`. The actual manifests are pulled from the upstream `nebari-operator` repository via Kustomize, with NIC-specific patches layered on top: - status: - type: object - properties: - phase: - type: string - enum: [Pending, Provisioning, Ready, Error] - url: - type: string - description: "Public URL of the application" - keycloakClientID: - type: string - description: "OAuth2 client ID in Keycloak" - conditions: - type: array - items: - type: object - properties: - type: - type: string - status: - type: string - lastTransitionTime: - type: string - format: date-time - reason: - type: string - message: - type: string ``` - -### 10.3 Example Usage - -**Deploy JupyterHub with Full Integration:** -```yaml -apiVersion: nebari.dev/v1alpha1 -kind: NebariApplication -metadata: - name: jupyterhub - namespace: jupyter -spec: - displayName: "JupyterHub" - - routing: - domain: jupyter.nebari.example.com - enableTLS: true - paths: - - path: / - service: jupyterhub - port: 8000 - - authentication: - enabled: true 
- allowedGroups: - - data-scientists - - admins - publicPaths: - - /hub/health # Health check endpoint - - observability: - metrics: - enabled: true - port: 9090 - path: /metrics - logs: - enabled: true - traces: - enabled: true - dashboards: - - name: "JupyterHub Overview" - source: "https://raw.githubusercontent.com/jupyterhub/grafana-dashboards/main/jupyterhub.json" - - name: "JupyterHub User Activity" - source: "configmap://jupyter/jupyterhub-dashboard" +pkg/argocd/templates/manifests/nebari-operator/ +└── kustomization.yaml # Points at github.com/nebari-dev/nebari-operator + # at a pinned ref (e.g. v0.1.0-alpha.19), with + # patches for ingress hostname, Keycloak base path, + # and HTTPS port ``` -**Operator Creates:** +The operator runs in its own namespace and watches for `NebariApp` CRs across the cluster. -1. **Keycloak OAuth2 Client:** -```json -{ - "clientId": "jupyterhub-jupyter", - "name": "JupyterHub", - "redirectUris": [ - "https://jupyter.nebari.example.com/hub/oauth_callback" - ], - "webOrigins": [ - "https://jupyter.nebari.example.com" - ], - "protocol": "openid-connect", - "publicClient": false, - "directAccessGrantsEnabled": false, - "serviceAccountsEnabled": false, - "authorizationServicesEnabled": false -} -``` - -2. **Envoy Gateway HTTPRoute:** -```yaml -apiVersion: gateway.networking.k8s.io/v1 -kind: HTTPRoute -metadata: - name: jupyterhub - namespace: jupyter -spec: - parentRefs: - - name: nebari-gateway - namespace: envoy-gateway-system - hostnames: - - jupyter.nebari.example.com - rules: - - matches: - - path: - type: PathPrefix - value: / - backendRefs: - - name: jupyterhub - port: 8000 - filters: - - type: ExtensionRef - extensionRef: - group: gateway.envoyproxy.io - kind: SecurityPolicy - name: jupyterhub-oauth -``` +## 11.3 The `NebariApp` CRD -3. 
**cert-manager Certificate:** -```yaml -apiVersion: cert-manager.io/v1 -kind: Certificate -metadata: - name: jupyterhub-tls - namespace: jupyter -spec: - secretName: jupyterhub-tls - issuerRef: - name: letsencrypt-prod - kind: ClusterIssuer - dnsNames: - - jupyter.nebari.example.com -``` +The CRD shape is owned by the upstream operator. The relevant fields, at a high level (consult the upstream repo for the authoritative schema): -4. **OpenTelemetry ServiceMonitor:** ```yaml -apiVersion: v1 -kind: Service +apiVersion: nebari.dev/v1 +kind: NebariApp metadata: - name: jupyterhub-metrics + name: jupyter-hub namespace: jupyter - labels: - app: jupyterhub - annotations: - prometheus.io/scrape: "true" - prometheus.io/port: "9090" - prometheus.io/path: "/metrics" spec: - selector: - app: jupyterhub - ports: - - name: metrics - port: 9090 - targetPort: 9090 -``` - -5. **Grafana Dashboard ConfigMap:** -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: jupyterhub-dashboard - namespace: monitoring - labels: - grafana_dashboard: "1" -data: - jupyterhub.json: | - { - "dashboard": { - "title": "JupyterHub Overview", - "panels": [ ... ] - } - } -``` - -6. **Status Update:** -```yaml -status: - phase: Ready - url: https://jupyter.nebari.example.com - keycloakClientID: jupyterhub-jupyter - conditions: - - type: RoutingConfigured - status: "True" - lastTransitionTime: "2025-01-30T12:00:00Z" - - type: AuthenticationConfigured - status: "True" - lastTransitionTime: "2025-01-30T12:01:00Z" - - type: ObservabilityConfigured - status: "True" - lastTransitionTime: "2025-01-30T12:02:00Z" - - type: Ready - status: "True" - lastTransitionTime: "2025-01-30T12:02:00Z" - reason: AllComponentsReady - message: "Application is fully configured and accessible" + hostname: jupyter.example.com + routing: + routes: + - path: / + backend: + name: jupyterhub + port: 8000 + publicRoutes: [] # Paths that should bypass OIDC + tls: { ... 
} + auth: + enforceAtGateway: true # If true, operator creates a SecurityPolicy + landingPage: + displayName: "JupyterHub" + icon: "..." ``` -### 10.4 Operator Implementation - -**Controller Logic:** -```go -package operator - -import ( - "context" - "fmt" - - nebaridevv1alpha1 "github.com/nebari-dev/nic/api/v1alpha1" - ctrl "sigs.k8s.io/controller-runtime" - "sigs.k8s.io/controller-runtime/pkg/client" - "sigs.k8s.io/controller-runtime/pkg/log" -) - -type NebariApplicationReconciler struct { - client.Client - KeycloakClient *keycloak.Client - EnvoyClient *envoy.Client - GrafanaClient *grafana.Client -} - -func (r *NebariApplicationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { - ctx, span := tracer.Start(ctx, "Reconcile") - defer span.End() +Critically: - log := log.FromContext(ctx) +- **`spec.routing.routes`** drives the main `HTTPRoute` that the operator creates. The operator's `SecurityPolicy` targets this main route when `auth.enforceAtGateway` is true. +- **`spec.routing.publicRoutes`** drives a *second*, separate `HTTPRoute` that is intentionally not protected by the SecurityPolicy. +- **`auth.enforceAtGateway`** is orthogonal to `publicRoutes`. The operator creates the SecurityPolicy if and only if `enforceAtGateway` is true (or unset, since it defaults to true). +- **Cert and landing page** depend on `spec.hostname` (for the cert) and `spec.landingPage` + `spec.hostname` (for the landing page entry), independent of any `routes` block. - // Fetch NebariApplication - var app nebaridevv1alpha1.NebariApplication - if err := r.Get(ctx, req.NamespacedName, &app); err != nil { - return ctrl.Result{}, client.IgnoreNotFound(err) - } +Operators of Nebari clusters and software-pack authors should treat the upstream operator's docs as authoritative. 
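The rules above can be modeled as a small decision function. The Go sketch below is illustrative only: `AppSpec`, `Plan`, and `PlanFor` are hypothetical names and not the upstream operator's types, and it assumes that an empty `routes` or `publicRoutes` list produces no corresponding `HTTPRoute`.

```go
package main

import "fmt"

// AppSpec is a hypothetical, trimmed-down stand-in for the NebariApp spec.
// It is NOT the upstream operator's Go type.
type AppSpec struct {
	Routes           []string // paths served by the main HTTPRoute
	PublicRoutes     []string // paths that should bypass OIDC
	EnforceAtGateway *bool    // nil means "unset", which defaults to true
}

// Plan lists the resources this sketch expects the operator to create.
type Plan struct {
	MainRoute      bool // HTTPRoute targeted by the SecurityPolicy
	PublicRoute    bool // second HTTPRoute, left unprotected
	SecurityPolicy bool
}

// PlanFor applies the rules from the text: the SecurityPolicy depends only
// on enforceAtGateway, never on whether publicRoutes is set.
func PlanFor(s AppSpec) Plan {
	enforce := s.EnforceAtGateway == nil || *s.EnforceAtGateway
	return Plan{
		MainRoute:      len(s.Routes) > 0,       // assumption: empty list, no route
		PublicRoute:    len(s.PublicRoutes) > 0, // assumption: empty list, no route
		SecurityPolicy: enforce,
	}
}

func main() {
	disabled := false
	fmt.Printf("%+v\n", PlanFor(AppSpec{Routes: []string{"/"}}))
	fmt.Printf("%+v\n", PlanFor(AppSpec{
		Routes:           []string{"/"},
		PublicRoutes:     []string{"/hub/health"},
		EnforceAtGateway: &disabled,
	}))
}
```

Using a `*bool` rather than a plain `bool` is what lets the sketch distinguish "unset" from an explicit `false`, which matters when the default is true.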
- // Update status to Provisioning - app.Status.Phase = "Provisioning" - if err := r.Status().Update(ctx, &app); err != nil { - return ctrl.Result{}, err - } +## 11.4 Provider-Shaped Inputs from NIC - // 1. Configure routing (Envoy HTTPRoute + cert-manager Certificate) - if err := r.configureRouting(ctx, &app); err != nil { - log.Error(err, "failed to configure routing") - return ctrl.Result{}, err - } - r.updateCondition(&app, "RoutingConfigured", "True", "RoutingReady", "Routing configured successfully") +The operator's manifests need a small number of cluster-shaped values to route correctly. NIC supplies these via Kustomize patches sourced from `provider.InfraSettings(cfg)`: - // 2. Configure authentication (Keycloak OAuth client) - if app.Spec.Authentication.Enabled { - clientID, err := r.configureAuthentication(ctx, &app) - if err != nil { - log.Error(err, "failed to configure authentication") - return ctrl.Result{}, err - } - app.Status.KeycloakClientID = clientID - r.updateCondition(&app, "AuthenticationConfigured", "True", "AuthReady", "OAuth client created") - } +| `InfraSettings` field | Operator use | +|------------------------|--------------| +| `KeycloakBasePath` | Path prefix the operator uses when constructing OIDC issuer URLs (`/auth` for the keycloakx chart used today; empty for upstream/Bitnami) | +| `HTTPSPort` | Port to use when constructing user-facing URLs in the operator's status output and landing-page registration | - // 3. Configure observability (metrics, dashboards) - if err := r.configureObservability(ctx, &app); err != nil { - log.Error(err, "failed to configure observability") - return ctrl.Result{}, err - } - r.updateCondition(&app, "ObservabilityConfigured", "True", "ObservabilityReady", "Observability configured") +The operator does not see any other parts of `NebariConfig`. In particular, it does not know which cluster provider is in use. - // 4. 
Update final status - app.Status.Phase = "Ready" - app.Status.URL = fmt.Sprintf("https://%s", app.Spec.Routing.Domain) - r.updateCondition(&app, "Ready", "True", "AllComponentsReady", "Application fully configured") +## 11.5 NIC's Responsibilities (Summary) - if err := r.Status().Update(ctx, &app); err != nil { - return ctrl.Result{}, err - } +- Pin a known-good operator release in `pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml` +- Render the operator's ArgoCD Application into the GitOps repo with the correct sync wave (after Keycloak, cert-manager, and Envoy Gateway are ready) +- Pass `InfraSettings.KeycloakBasePath` and `InfraSettings.HTTPSPort` into the operator manifests via Kustomize patches - log.Info("reconciliation complete", "app", app.Name, "url", app.Status.URL) - return ctrl.Result{}, nil -} +That's it. NIC does not reconcile `NebariApp` CRs, does not implement the operator's controller, and does not ship any `api/v1alpha1/` package. If you find documentation that says otherwise, it is out of date. 
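To make the two `InfraSettings` inputs from 11.4 concrete, the following Go sketch shows how they could shape an OIDC issuer URL. It is an assumption-laden illustration, not the operator's code: `InfraInputs`, `BaseURL`, and `IssuerURL` are hypothetical, as are the hostname and the realm name `nebari`. The `/realms/<realm>` suffix is Keycloak's standard issuer path.

```go
package main

import (
	"fmt"
	"strings"
)

// InfraInputs holds the two provider-shaped values named in the table.
// The struct is illustrative, not NIC's actual type.
type InfraInputs struct {
	KeycloakBasePath string // "/auth" for the keycloakx chart, "" otherwise
	HTTPSPort        int    // 0 is treated as 443
}

// BaseURL builds the user-facing https URL for a hostname, omitting the
// port when it is the HTTPS default.
func BaseURL(host string, in InfraInputs) string {
	port := in.HTTPSPort
	if port == 0 {
		port = 443
	}
	if port == 443 {
		return "https://" + host
	}
	return fmt.Sprintf("https://%s:%d", host, port)
}

// IssuerURL appends the Keycloak base path and the realm path.
func IssuerURL(host, realm string, in InfraInputs) string {
	base := strings.TrimSuffix(in.KeycloakBasePath, "/")
	return BaseURL(host, in) + base + "/realms/" + realm
}

func main() {
	// https://keycloak.example.com/auth/realms/nebari
	fmt.Println(IssuerURL("keycloak.example.com", "nebari", InfraInputs{KeycloakBasePath: "/auth"}))
	// https://keycloak.example.com:8443/realms/nebari
	fmt.Println(IssuerURL("keycloak.example.com", "nebari", InfraInputs{HTTPSPort: 8443}))
}
```

Getting either value wrong yields an issuer string that does not match what Keycloak advertises, which is why both are passed in explicitly rather than inferred from the provider.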
-func (r *NebariApplicationReconciler) configureRouting(ctx context.Context, app *nebaridevv1alpha1.NebariApplication) error { - ctx, span := tracer.Start(ctx, "configureRouting") - defer span.End() - - // Create cert-manager Certificate - if app.Spec.Routing.EnableTLS { - if err := r.createCertificate(ctx, app); err != nil { - return fmt.Errorf("creating certificate: %w", err) - } - } - - // Create Envoy HTTPRoute - if err := r.createHTTPRoute(ctx, app); err != nil { - return fmt.Errorf("creating HTTPRoute: %w", err) - } - - return nil -} - -func (r *NebariApplicationReconciler) configureAuthentication(ctx context.Context, app *nebaridevv1alpha1.NebariApplication) (string, error) { - ctx, span := tracer.Start(ctx, "configureAuthentication") - defer span.End() - - redirectURI := fmt.Sprintf("https://%s/oauth_callback", app.Spec.Routing.Domain) - - clientID, clientSecret, err := r.KeycloakClient.CreateOAuthClient(ctx, keycloak.OAuthClientRequest{ - Name: app.Spec.DisplayName, - RedirectURIs: []string{redirectURI}, - AllowedGroups: app.Spec.Authentication.AllowedGroups, - }) - - if err != nil { - return "", fmt.Errorf("creating Keycloak client: %w", err) - } - - // Store client secret in Kubernetes Secret - if err := r.createOAuthSecret(ctx, app, clientID, clientSecret); err != nil { - return "", fmt.Errorf("creating OAuth secret: %w", err) - } - - return clientID, nil -} - -func (r *NebariApplicationReconciler) configureObservability(ctx context.Context, app *nebaridevv1alpha1.NebariApplication) error { - ctx, span := tracer.Start(ctx, "configureObservability") - defer span.End() - - // Create ServiceMonitor for metrics - if app.Spec.Observability.Metrics.Enabled { - if err := r.createServiceMonitor(ctx, app); err != nil { - return fmt.Errorf("creating ServiceMonitor: %w", err) - } - } - - // Provision Grafana dashboards - for _, dashboard := range app.Spec.Observability.Dashboards { - if err := r.provisionDashboard(ctx, app, dashboard); err != nil { - return 
fmt.Errorf("provisioning dashboard %s: %w", dashboard.Name, err) - } - } - - return nil -} -``` +## 11.6 Operator Upgrade Path -### 10.5 Operator Benefits +Bumping the operator version: -**For Users:** -- ✅ One manifest to deploy + integrate application -- ✅ No manual OAuth client creation -- ✅ No manual HTTPRoute configuration -- ✅ No manual dashboard import -- ✅ Automatic TLS certificate provisioning -- ✅ Status updates show integration progress +1. Update the `ref:` in `pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml` to the new upstream tag. +2. Verify the operator's CRD schema hasn't broken NIC's Kustomize patches. +3. Land the change; on next `nic deploy` or `argocd app sync`, the new operator version rolls out. -**For Platform Team:** -- ✅ Consistent integration patterns -- ✅ Centralized configuration management -- ✅ Easier to update (change operator, all apps benefit) -- ✅ Self-documenting (CRD schema is API contract) -- ✅ Audit trail (Git history of CRDs) +## 11.7 References ---- +- Upstream operator repo: +- ArgoCD app manifest: `pkg/argocd/templates/apps/nebari-operator.yaml` +- Kustomize patches: `pkg/argocd/templates/manifests/nebari-operator/` +- Related discussion of `publicRoutes` + `enforceAtGateway` interaction: [`nebari-operator#118`](https://github.com/nebari-dev/nebari-operator/issues/118) diff --git a/docs/design-doc/nic-summary.md b/docs/design-doc/nic-summary.md index 08f86810..385aa048 100644 --- a/docs/design-doc/nic-summary.md +++ b/docs/design-doc/nic-summary.md @@ -2,12 +2,12 @@ ## What is Nebari Infrastructure Core? -Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) 
as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC uses a Go CLI powered by OpenTofu/Terraform modules to provision Kubernetes clusters across AWS, GCP, Azure, or bare metal, ensuring consistent infrastructure regardless of hosting environment. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
+Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC is a Go CLI that provisions Kubernetes clusters and bootstraps a foundational software stack via GitOps. Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS (EKS), the `hetzner-k3s` binary for Hetzner, Kind for local development, and an `existing` adapter for pre-provisioned clusters - while GCP and Azure are stubbed and not yet implemented. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.

 ## Advantages Over Nebari Classic

-The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together—meaning you get the full data science stack whether you need it all or not—NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NicApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
+The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together (meaning you get the full data science stack whether you need it all or not), NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NebariApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.

 ## Architecture Philosophy

-NIC embraces a **layered, GitOps-native architecture** where each layer has clear responsibilities and can evolve independently. At the foundation, NIC deploys an opinionated Kubernetes cluster with consistent networking, storage, and security policies. The foundational software layer provides essential platform services. The Nebari Operator watches for `NicApp` resources and automatically handles routing, authentication, and service registration. A dynamic React landing page (backed by a Go API) provides users with a single entry point to discover and access all deployed services. Software Packs sit at the top, registering themselves dynamically when installed—no manual configuration required. This separation of concerns means platform teams can upgrade infrastructure without affecting applications, and application teams can deploy new services without understanding infrastructure details.
+NIC embraces a **layered, GitOps-native architecture** where each layer has clear responsibilities and can evolve independently. At the foundation, NIC provisions an opinionated Kubernetes cluster with consistent networking, storage, and security policies. The foundational software layer provides essential platform services (cert-manager, Envoy Gateway, Keycloak, an OpenTelemetry collector, the Nebari Operator, and a landing page) installed via ArgoCD from a GitOps repository that NIC generates and seeds. The Nebari Operator (developed in the separate [`nebari-operator`](https://github.com/nebari-dev/nebari-operator) repository and deployed by NIC as a foundational app) watches for `NebariApp` resources and automatically handles routing, authentication, and service registration. A dynamic React landing page (backed by a Go API) provides users with a single entry point to discover and access all deployed services. Software Packs sit at the top, registering themselves dynamically when installed - no manual configuration required. This separation of concerns means platform teams can upgrade infrastructure without affecting applications, and application teams can deploy new services without understanding infrastructure details.
diff --git a/docs/design-doc/operations/12-testing-strategy.md b/docs/design-doc/operations/12-testing-strategy.md index 092a8c5f..422bad33 100644 --- a/docs/design-doc/operations/12-testing-strategy.md +++ b/docs/design-doc/operations/12-testing-strategy.md @@ -1,601 +1,135 @@ # Testing Strategy -### 11.1 Testing Levels - -**1. Unit Tests:** - -- Provider implementations -- Configuration parsing -- State management (read/write/lock/unlock) -- Reconciliation logic -- Drift detection -- **Run Frequency:** Every commit (pre-commit hook + CI) - -**2. Integration Tests:** - -- Provider operations against mock cloud APIs -- State backend operations (local, mock S3/GCS/Azure) -- Kubernetes operations against kind clusters -- ArgoCD application deployment -- **Target:** All critical paths covered -- **Run Frequency:** Every PR (CI) - -**3. Provider Tests (Expensive):** - -- Deploy real infrastructure to AWS/GCP/Azure -- Verify Kubernetes cluster functional -- Verify foundational software deployed -- Verify operator functional -- Tear down infrastructure -- **Target:** Nightly or on-demand -- **Run Frequency:** Nightly, release candidates - -**4. 
Black Box Health Tests:** - - Verify health of any deployed cluster (production, staging, dev) -- Test foundational software availability and functionality -- Validate authentication, observability, routing, TLS -- Provider-agnostic: works against AWS, GCP, Azure, Local deployments -- **Target:** Verify cluster health after any deployment -- **Run Frequency:** Release candidates, after deployments, scheduled daily, on-demand - -### 11.2 Critical Test Cases - -**Test Case 1: Fresh Deployment (AWS)** - -```gherkin -Given a valid config.yaml for AWS -When I run `nic deploy -f config.yaml` -Then: - - VPC is created with 3 AZs - - EKS cluster is created (version 32) - - 3 node pools are created (general, compute, gpu) - - ArgoCD is deployed - - All 9 foundational components are deployed - - Nebari operator is deployed - - Kubeconfig is saved - - All URLs are accessible (argocd, grafana, keycloak) -``` +## 12.1 Testing Levels -**Test Case 2: Idempotency** +NIC has two testing levels in use today (unit and integration), plus two (provider and health) that are planned but not yet implemented: -```gherkin -Given a deployed Nebari platform -When I run `nic deploy -f config.yaml` again -Then: - - No infrastructure changes are made - - Command completes in <2 minutes (only queries, no creates) -``` +### Unit tests -**Test Case 3: Add Node Pool** +- **Scope**: Pure Go packages under `pkg/` and `cmd/nic/`. +- **Runner**: `go test ./...` (or `make test` / `make test-unit`). +- **Conventions**: Table-driven tests (per [`CLAUDE.md`](../../../CLAUDE.md)). Interfaces are injected so concrete dependencies (AWS SDK, Helm, k8s client) can be mocked. +- **Where they run**: Every push and PR via `.github/workflows/ci.yml`, with `-race` and coverage.
-```gherkin -Given a deployed Nebari platform -When I add a new node pool to config.yaml -And I run `nic deploy -f config.yaml` -Then: - - New node pool is created - - Existing node pools are unchanged - - Kubernetes cluster detects new nodes -``` +### Integration tests (LocalStack) -**Test Case 4: NebariApplication Integration** - -```gherkin -Given a deployed Nebari platform -When I create a NebariApplication CRD for JupyterHub -Then: - - OAuth client is created in Keycloak - - HTTPRoute is created in Envoy Gateway - - Certificate is provisioned by cert-manager - - ServiceMonitor is created - - Grafana dashboards are provisioned - - Status.URL is set to https://jupyter.example.com - - Status.Phase is "Ready" -``` +- **Scope**: AWS provider's state-bucket lifecycle and tofu invocation, against [LocalStack](https://localstack.cloud/). +- **Runner**: `make test-integration` (testcontainers-managed LocalStack) or `make test-integration-local` (uses `docker-compose.test.yml`). +- **Build tag**: `integration`. Unit-only runs (the default and what CI runs) exclude these via the absence of `-tags=integration`. +- **Where they run**: Locally, on demand. Not currently wired into CI. -**Test Case 5: Drift Detection** - -```gherkin -Given a deployed Nebari platform -When I manually delete a node pool via AWS console -And I run `nic status --check-drift` -Then: - - Drift is detected for node pool - - Report shows expected vs actual state -When I run `nic deploy` -Then: - - Node pool is recreated - - Drift is resolved -``` +### Provider tests (real cloud) -**Test Case 6: Destroy** - -```gherkin -Given a deployed Nebari platform -When I run `nic destroy -f config.yaml` -Then: - - All ArgoCD applications are deleted - - Kubernetes cluster is deleted - - Node pools are deleted - - VPC is deleted - - No cloud resources remain (verified via cloud APIs) -``` +- **Status**: Not yet wired up. 
The intent is a small set of expensive tests that deploy real infrastructure on AWS (and eventually Hetzner) to validate end-to-end provider behavior. These will live behind a separate build tag and run only when explicitly invoked (e.g., for release candidates). -### 11.3 Test Infrastructure +### Health tests (planned) -**Mock Services:** +- **Status**: Not implemented. A future `nic health check` subcommand and a corresponding test harness are planned but no code exists today (no `cmd/nic/health.go`, no `tests/health/`, no scheduled workflow). When referenced elsewhere, treat as roadmap. -- `moto` for AWS API mocking -- `fake-gcs-server` for GCS mocking -- `azurite` for Azure Blob mocking -- `kind` for Kubernetes testing +## 12.2 Test Coverage Targets -**CI/CD Pipeline:** +There are no enforced coverage thresholds in CI today. The Codecov upload in `.github/workflows/ci.yml` is informational only and is `continue-on-error: true`. -```yaml -name: CI +Coverage hygiene is enforced through review: -on: [push, pull_request] +- New code added under `pkg/` should have unit tests, ideally table-driven. +- The interface-driven design (Go functions take interfaces, return concrete types - see [`CLAUDE.md`](../../../CLAUDE.md)) is what makes coverage feasible. -jobs: - unit-tests: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-go@v5 - with: - go-version: "1.22" - - run: go test ./... -v -cover +## 12.3 Test Infrastructure - integration-tests: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: actions/setup-go@v5 - - run: | - kind create cluster - go test ./... 
-tags=integration -v +| Need | Tool | +|------|------| +| AWS API mocking | LocalStack via `docker-compose.test.yml` | +| Kubernetes object mocking | `k8s.io/client-go/kubernetes/fake` | +| Helm SDK mocking | The `Helm` interface in `pkg/helm` with fake implementations | +| Filesystem mocking | `github.com/spf13/afero` (used in `pkg/tofu` and elsewhere) | +| Local cluster for manual testing | Kind via `make localkind-up` | + +GCS and Azure Blob mocking are not in scope while the GCP and Azure providers remain stubs. + +## 12.4 CI Pipeline + +The actual workflow at `.github/workflows/ci.yml`: - provider-tests-aws: +```yaml +jobs: + test: runs-on: ubuntu-latest - if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' steps: - uses: actions/checkout@v4 - uses: actions/setup-go@v5 - - uses: aws-actions/configure-aws-credentials@v4 with: - aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} - aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - aws-region: us-west-2 - - run: go test ./pkg/provider/aws -tags=provider -v -``` - -### 11.4 Black Box Health Tests - -Black box health tests verify the health and functionality of any deployed Nebari cluster without knowledge of the underlying infrastructure provider or deployment method. These tests can be run against production, staging, or development environments to validate cluster health. 
- -**Design Principles:** - -- **Provider-agnostic**: Works against AWS, GCP, Azure, and local deployments -- **Deployment-agnostic**: Works regardless of how cluster was deployed (NIC, manual, other tools) -- **Non-destructive**: Read-only operations, safe to run against production -- **Fast**: Complete suite runs in <5 minutes -- **Actionable**: Clear pass/fail criteria with diagnostic information - -#### Test Suite Structure - -``` -tests/health/ -├── cluster/ # Kubernetes cluster health tests -├── foundational/ # Foundational software health tests -├── integration/ # Cross-component integration tests -├── performance/ # Basic performance/latency tests -└── security/ # Security posture validation -``` - -#### Test Case Categories - -**Category 1: Kubernetes Cluster Health** - -```gherkin -Scenario: Kubernetes API is accessible - Given I have kubeconfig for the cluster - When I query the Kubernetes API server - Then the API responds successfully - And the cluster version is as expected - And all control plane components are healthy - -Scenario: All nodes are ready - When I list all nodes in the cluster - Then all nodes have status "Ready" - And no nodes have memory pressure - And no nodes have disk pressure - And no nodes have PID pressure - -Scenario: Critical system pods are running - When I check pods in kube-system namespace - Then all kube-proxy pods are Running - And all coredns pods are Running - And all CNI pods are Running (if applicable) - -Scenario: Node pools match configuration - When I list all nodes by labels - Then I see nodes for each expected node group/pool - And node counts are within min/max ranges - And nodes have correct taints and labels -``` - -**Category 2: ArgoCD Health** - -```gherkin -Scenario: ArgoCD is accessible - Given the cluster domain is configured - When I access https://argocd. 
- Then I receive a valid HTTPS response - And the TLS certificate is valid - And the ArgoCD login page is displayed - -Scenario: ArgoCD applications are healthy - When I query ArgoCD API for all applications - Then all applications have sync status "Synced" - And all applications have health status "Healthy" - And no applications are in "Degraded" state - And no applications are in "Unknown" state - -Scenario: ArgoCD can access Git repository - When I trigger a manual sync of any application - Then ArgoCD successfully fetches from Git repository - And the sync completes without errors -``` - -**Category 3: Keycloak (Authentication) Health** - -```gherkin -Scenario: Keycloak is accessible - When I access https://keycloak. - Then I receive a valid HTTPS response - And the TLS certificate is valid - And the Keycloak login page is displayed - -Scenario: Keycloak master realm is accessible - When I query Keycloak API for realm information - Then the master realm is available - And the Nebari realm is available (if configured) - -Scenario: OAuth2 endpoints are functional - When I query /.well-known/openid-configuration endpoint - Then I receive valid OpenID Connect metadata - And authorization_endpoint is accessible - And token_endpoint is accessible - And userinfo_endpoint is accessible - -Scenario: Keycloak can issue tokens - Given valid Keycloak credentials - When I request an access token via client credentials flow - Then I receive a valid JWT token - And the token can be verified with JWKS endpoint -``` - -**Category 4: Envoy Gateway (Ingress) Health** - -```gherkin -Scenario: Envoy Gateway is running - When I check the envoy-gateway-system namespace - Then envoy-gateway controller pod is Running - And envoy proxy pods are Running - And all pods are Ready - -Scenario: HTTPRoutes are configured - When I list all HTTPRoute resources - Then all expected routes are present - And all routes have "Accepted" status - And all routes are attached to Gateway - -Scenario: TLS 
certificates are valid - When I access each foundational software endpoint via HTTPS - Then each endpoint has a valid TLS certificate - And certificates are issued by expected CA (Let's Encrypt) - And certificates are not expired - And certificates are not expiring within 30 days - -Scenario: HTTP to HTTPS redirect works - When I access http://. - Then I receive a 301 or 302 redirect - And the redirect location is https://. -``` - -**Category 5: cert-manager Health** - -```gherkin -Scenario: cert-manager is running - When I check the cert-manager namespace - Then cert-manager controller pod is Running - And cert-manager webhook pod is Running - And cert-manager cainjector pod is Running - And all pods are Ready - -Scenario: ClusterIssuers are ready - When I list all ClusterIssuers - Then letsencrypt-prod ClusterIssuer exists - And letsencrypt-staging ClusterIssuer exists (if configured) - And all ClusterIssuers have status "Ready" - -Scenario: Certificates are valid - When I list all Certificate resources - Then all certificates have status "Ready" - And no certificates have status "Failed" - And all certificates are not expired - And all certificates have valid secrets -``` - -**Category 6: Observability Stack (LGTM) Health** - -```gherkin -Scenario: Grafana is accessible - When I access https://grafana. 
- Then I receive a valid HTTPS response - And the Grafana login page is displayed - And I can authenticate via OAuth (Keycloak) - -Scenario: Grafana has data sources configured - Given I am authenticated to Grafana API - When I query /api/datasources - Then Loki data source is configured and healthy - And Mimir (Prometheus) data source is configured and healthy - And Tempo data source is configured and healthy - -Scenario: Loki is ingesting logs - When I query Loki for recent logs - Then I receive log entries from the last 5 minutes - And logs include entries from multiple namespaces - And log ingestion rate is > 0 - -Scenario: Mimir is scraping metrics - When I query Mimir for up metric - Then I receive data points from the last 1 minute - And multiple targets are reporting up=1 - And scrape success rate is > 95% - -Scenario: Tempo is ingesting traces - When I query Tempo for recent traces - Then I receive traces from the last 5 minutes - And traces include spans from multiple services - -Scenario: OpenTelemetry Collector is running - When I check the opentelemetry-collector pods - Then all collector pods are Running - And collector is receiving telemetry data - And collector is exporting to Loki, Mimir, and Tempo -``` - -**Category 7: Nebari Operator Health** - -```gherkin -Scenario: Nebari Operator is running - When I check the nebari-operator-system namespace - Then operator controller pod is Running - And operator webhook pod is Running (if applicable) - And all pods are Ready - -Scenario: CRDs are installed - When I list CustomResourceDefinitions - Then NebariApplication CRD exists - And CRD has expected version and schema - And CRD is established and accepted - -Scenario: Operator can reconcile resources - When I create a test NebariApplication - Then the operator reconciles the resource - And status is updated with progress - And the application becomes Ready - When I delete the test NebariApplication - Then the operator cleans up all created resources -``` 
- -**Category 8: Cross-Component Integration** - -```gherkin -Scenario: OAuth integration works end-to-end - When I access Grafana without authentication - Then I am redirected to Keycloak login - When I authenticate with valid credentials - Then I am redirected back to Grafana - And I am successfully logged in - And my user identity is from Keycloak - -Scenario: Monitoring stack observes all components - When I query Mimir for component metrics - Then I see metrics for ArgoCD - And I see metrics for Keycloak - And I see metrics for Envoy Gateway - And I see metrics for cert-manager - And I see metrics for Nebari Operator - -Scenario: Logs are aggregated from all components - When I query Loki with no namespace filter - Then I see logs from argocd namespace - And I see logs from keycloak namespace - And I see logs from envoy-gateway-system namespace - And I see logs from cert-manager namespace - And I see logs from observability stack namespaces - -Scenario: Distributed tracing captures cross-service requests - When I make a request that spans multiple services - Then I can see the full trace in Tempo - And trace includes spans from ingress (Envoy) - And trace includes spans from application - And trace shows service dependencies -``` + go-version: '1.25.1' + - run: go mod download + - run: go mod verify + - uses: golangci/golangci-lint-action@v9 + with: + version: latest + - run: go test -v -race -coverprofile=coverage.out -covermode=atomic ./... + - uses: codecov/codecov-action@v4 # continue-on-error: true -**Category 9: Performance and Latency** - -```gherkin -Scenario: API response times are acceptable - When I query Kubernetes API - Then response time is < 500ms - When I query Grafana API - Then response time is < 1000ms - When I query Keycloak API - Then response time is < 1000ms - -Scenario: DNS resolution works - When I resolve service..svc.cluster.local - Then DNS resolution succeeds in < 100ms - When I resolve . 
- Then DNS resolution succeeds in < 200ms - -Scenario: Prometheus query performance - When I run a simple PromQL query - Then query executes in < 2 seconds - When I run a complex PromQL query (aggregation) - Then query executes in < 10 seconds + build: + needs: test + steps: + - run: make build + - run: ./nic --help || true ``` -**Category 10: Security Posture** - -```gherkin -Scenario: Network policies are enforced - When I check NetworkPolicy resources - Then network policies exist for sensitive namespaces - And policies restrict inter-namespace traffic appropriately - -Scenario: RBAC is configured - When I check ClusterRoles and Roles - Then service accounts have minimal required permissions - And no service account has cluster-admin unnecessarily - And user access is role-based - -Scenario: Secrets are encrypted - When I check secret encryption configuration - Then secrets are encrypted at rest (cloud provider KMS) - And secrets are not stored in plain text in etcd - -Scenario: Pod security standards are enforced - When I check PodSecurityPolicy or PodSecurity admission - Then restricted or baseline standards are enforced - And privileged containers are only in system namespaces -``` +Highlights: -#### Implementation - -**Test Execution Tool:** - -```bash -# Run all health tests -nic health check --kubeconfig=~/.kube/config - -# Run specific category -nic health check --category=foundational - -# Run against specific domain -nic health check --domain=nebari.example.com - -# Output formats -nic health check --format=json -nic health check --format=junit # For CI/CD integration - -# Example output: -# ✅ Cluster Health (5/5 passed) -# ✅ ArgoCD (3/3 passed) -# ✅ Keycloak (4/4 passed) -# ✅ Envoy Gateway (4/4 passed) -# ✅ cert-manager (3/3 passed) -# ✅ Observability (7/7 passed) -# ✅ Nebari Operator (3/3 passed) -# ✅ Integration (4/4 passed) -# ⚠️ Performance (2/3 passed, 1 warning) -# ✅ Security (4/4 passed) -# -# Overall: 39/40 tests passed (97.5%) -# Duration: 3m 
42s -``` +- Go 1.25.1. +- Unit tests run with `-race` and coverage. +- Lint via the latest `golangci-lint`. +- No integration-test job, no nightly schedule, no kind-based cluster spin-up. -**Test Configuration:** +Other workflows in `.github/workflows/`: -```yaml -# health-test-config.yaml -cluster: - kubeconfig: ~/.kube/config - context: nebari-prod # Optional, uses current context if not specified +- `release.yml` - cuts releases via goreleaser +- `opentofu-lockfile-pr.yml` - keeps tofu lockfiles fresh +- `add-to-project.yaml` - GitHub Projects auto-add -domain: nebari.example.com # Used for HTTPS endpoint checks +## 12.5 Local Development Loop -thresholds: - api_latency_ms: 500 - query_latency_ms: 2000 - certificate_expiry_days: 30 +- `make build` - compile the binary +- `make test` - run unit tests +- `make test-race` - unit tests with `-race` +- `make test-coverage` - unit tests with coverage report +- `make test-integration` / `make test-integration-local` - integration tests against LocalStack +- `make lint` - `golangci-lint run` +- `make check` - `fmt`, `vet`, `lint`, `test` +- `make localkind-up` - end-to-end deploy onto a local Kind cluster (uses `examples/local-config.yaml` by default; pass `LOCAL_CONFIG=...` to override) +- `make localkind-rebuild` - tear down and rebuild the local cluster -skip_tests: - - performance.complex-query # Skip specific tests if needed +The Kind workflow mounts the `file://` GitOps directory into the cluster so the in-cluster ArgoCD can sync from a local filesystem. See `pkg/provider/local` and the relevant Makefile target. 
-authentication: - keycloak: - client_id: health-check-client - client_secret_env: HEALTH_CHECK_CLIENT_SECRET -``` +## 12.6 What "Test Cases" Look Like -**CI/CD Integration:** +A few representative cases: -```yaml -# .github/workflows/health-check.yaml -name: Daily Health Check +**Fresh AWS deploy (manual integration):** -on: - schedule: - - cron: "0 8 * * *" # Daily at 8 AM UTC - workflow_dispatch: +- `nic deploy -f examples/aws-config.yaml` +- Expect: state bucket created, EKS cluster up with `kubernetes_version: "1.34"` and the configured `node_groups`, EFS volume mounted, ArgoCD running in `argocd` namespace, foundational apps syncing. +- Verify: `kubectl get nodes`, `kubectl get applications -n argocd`, the printed Argo CD and Keycloak access instructions. -jobs: - health-check-production: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 +**Local Kind deploy (manual):** - - name: Configure kubectl - run: | - echo "${{ secrets.PROD_KUBECONFIG }}" > /tmp/kubeconfig - export KUBECONFIG=/tmp/kubeconfig +- `make localkind-up` +- Expect: Kind cluster `nebari-local` up, MetalLB syncing, gateway with an IP from the configured pool, foundational apps green. - - name: Run health checks - run: | - nic health check --domain=nebari.company.com --format=junit > health-results.xml +**Dry-run (any provider):** - - name: Upload results - uses: actions/upload-artifact@v4 - with: - name: health-check-results - path: health-results.xml +- `nic deploy -f config.yaml --dry-run` +- Expect: no state mutation, plan output streamed. - - name: Notify on failure - if: failure() - uses: slackapi/slack-github-action@v1 - with: - webhook-url: ${{ secrets.SLACK_WEBHOOK }} - payload: | - { - "text": "🚨 Production health check failed", - "blocks": [ - { - "type": "section", - "text": { - "type": "mrkdwn", - "text": "Production cluster health check failed. Review results in Actions." 
- } - } - ] - } -``` +**Adoption of an existing cluster:** -**Benefits of Black Box Health Testing:** +- `nic deploy -f examples/existing-config.yaml` +- Expect: no infrastructure provisioning, just the GitOps bootstrap + foundational app rollout against the kubeconfig in the config. -1. **Post-Deployment Verification**: Validate cluster is healthy after any deployment -2. **Continuous Monitoring**: Run daily/hourly to catch drift or degradation -3. **Incident Response**: Run on-demand to quickly assess cluster health during incidents -4. **Provider-Agnostic**: Same tests work on AWS, GCP, Azure, and local clusters -5. **Regression Detection**: Catch issues introduced by configuration changes or upgrades -6. **Compliance**: Document cluster health for audits and SLAs -7. **Onboarding**: New team members can validate their dev environment setup -8. **Pre-Production Gates**: Require health checks to pass before promoting to production +## 12.7 Future Work ---- +- Wire integration tests into CI (likely as a separate, slower workflow with a manual trigger). +- Add a `provider-tests` job on a schedule (nightly or weekly) that hits real cloud APIs. +- Implement the `nic health check` subcommand and a paired test harness. +- Add Hetzner-specific integration tests (LocalStack analogue does not exist; may require recorded HTTP fixtures against the Hetzner Cloud API). 
diff --git a/docs/design-doc/operations/13-milestones.md b/docs/design-doc/operations/13-milestones.md index 5b092243..a954652f 100644 --- a/docs/design-doc/operations/13-milestones.md +++ b/docs/design-doc/operations/13-milestones.md @@ -1,120 +1,111 @@ # Timeline and Milestones -### 12.1 Phase 1: Foundation - -**Goals:** - -- Core NIC CLI with provider abstraction -- AWS provider implementation -- Basic testing infrastructure - -**Deliverables:** - -- ✅ NIC CLI (`deploy`, `destroy`, `status`, `validate`) -- ✅ Provider interface and registry -- ✅ AWS provider (EKS, VPC, EFS, node pools) -- ✅ Configuration parsing (config.yaml) -- ✅ Integration tests (kind-based) - -**Milestone:** Deploy Kubernetes cluster on AWS via NIC - -### 12.2 Phase 2: Foundational Software - -**Goals:** - -- ArgoCD deployment via Helm -- Foundational software repository -- LGTM stack deployment -- Keycloak deployment - -**Deliverables:** - -- ✅ ArgoCD installation in NIC -- ✅ Foundational software repo structure -- ✅ ArgoCD applications for all 9 components -- ✅ Health checks and readiness gates -- ✅ cert-manager + Let's Encrypt integration -- ✅ Envoy Gateway + HTTPRoute examples - -**Milestone:** Full platform deployed on AWS with all foundational software - -### 12.3 Phase 3: Nebari Operator - -**Goals:** - -- Kubernetes operator implementation -- NebariApplication CRD -- Integration with Keycloak, Envoy, Grafana - -**Deliverables:** - -- ✅ Operator scaffolding (controller-runtime) -- ✅ NebariApplication CRD v1alpha1 -- ✅ Keycloak OAuth client automation -- ✅ Envoy HTTPRoute automation -- ✅ cert-manager Certificate automation -- ✅ Grafana dashboard provisioning -- ✅ OpenTelemetry ServiceMonitor creation - -**Milestone:** Deploy sample app (JupyterHub) via NebariApplication CRD with full integration - -### 12.4 Phase 4: Multi-Cloud - -**Goals:** - -- GCP, Azure, Local providers -- Provider parity testing -- Cross-provider consistency - -**Deliverables:** - -- ✅ GCP provider (GKE, VPC, 
Filestore) -- ✅ Azure provider (AKS, VNet, Azure Files) -- ✅ Local provider (K3s) -- ✅ Provider parity tests -- ✅ Multi-cloud CI/CD pipelines - -**Milestone:** Deploy platform on all 4 providers (AWS, GCP, Azure, Local) - -### 12.5 Phase 5: Observability & Polish - -**Goals:** - -- OpenTelemetry instrumentation throughout NIC -- Pre-built Grafana dashboards -- Comprehensive documentation - -**Deliverables:** - -- ✅ OpenTelemetry tracing in all NIC functions -- ✅ Custom metrics (deployment time, resource counts) -- ✅ Structured logging via slog -- ✅ Export to deployed LGTM stack -- ✅ Grafana dashboards for NIC operations -- ✅ User documentation (deployment guides, CRD reference) -- ✅ Architecture documentation (this doc!) - -**Milestone:** NIC self-monitoring and production-ready observability - -### 12.6 Phase 6: Hardening & Release - -**Goals:** - -- Security hardening -- Performance optimization -- Comprehensive testing -- v1.0 release - -**Deliverables:** - -- ✅ Security audit (RBAC, secrets management) -- ✅ Performance benchmarks (deployment time targets) -- ✅ End-to-end tests on all providers -- ✅ Disaster recovery testing -- ✅ Documentation review -- ✅ Release notes and migration guides -- ✅ v1.0.0 release - -**Milestone:** NIC v1.0 released to production - ---- +Status legend: + +- ✅ shipped +- 🟡 partially shipped +- ⏳ planned, not started or in progress + +The repo's current release line is `v0.1.0-alpha.*` (see recent tags and `pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml`). v1.0.0 has not shipped. + +## 13.1 Phase 1: Foundation + +**Goals**: A working `Provider` abstraction, a real cluster provisioner, and the cluster-level bits of state and config. 
+ +| Deliverable | Status | +|-------------|--------| +| `Provider` interface and `InfraSettings` capability struct (`pkg/provider/provider.go`) | ✅ | +| Unified provider registry (`pkg/registry.Registry` with `ClusterProviders` + `DNSProviders`) | ✅ | +| AWS cluster provider (EKS via upstream `nebari-dev/eks-cluster` module, EFS, node groups) | ✅ | +| Hetzner cluster provider (via `hetzner-k3s` binary) | ✅ | +| Local cluster provider (Kind stub, driven by `make localkind-up`) | ✅ | +| `existing` cluster provider (adopt a kubeconfig) | ✅ | +| GCP cluster provider | ⏳ (registered as stub) | +| Azure cluster provider | ⏳ (registered as stub) | +| `pkg/tofu` wrapper with streaming JSON output through the status channel | ✅ | +| AWS S3 state backend with `use_lockfile = true` and auto-managed bucket lifecycle | ✅ | +| NIC CLI (`deploy`, `destroy`, `validate`, `kubeconfig`, `version`) | ✅ | +| `status` / `plan` / `state` / `unlock` subcommands | ⏳ | +| Integration tests against LocalStack via `make test-integration-local` | ✅ | +| CI: unit tests + lint + race + coverage upload | ✅ | + +## 13.2 Phase 2: Foundational Software + +**Goals**: GitOps bootstrap and the opinionated platform stack. + +| Deliverable | Status | +|-------------|--------| +| ArgoCD install via embedded Helm Go SDK (`pkg/helm`) | ✅ | +| GitOps repo bootstrap (`pkg/argocd`, `pkg/git`) | ✅ | +| `file://` GitOps repos for local development | ✅ | +| Cert-manager + cluster-issuers + initial Certificates | ✅ | +| Envoy Gateway + gateway-config + httproutes | ✅ | +| PostgreSQL + Keycloak (Codecentric keycloakx chart) | ✅ | +| MetalLB + metallb-config (conditional on `InfraSettings.NeedsMetalLB`) | ✅ | +| OpenTelemetry Collector | ✅ | +| Nebari Landing Page | ✅ | +| Nebari Operator (Kustomized from `nebari-dev/nebari-operator`) | ✅ | +| Full LGTM backend (Loki, Grafana, Tempo, Mimir, Promtail) | ⏳ | + +## 13.3 Phase 3: Operator Integration + +**Goals**: Apps integrate via the `NebariApp` CRD. 
+
+The Nebari Operator is developed out-of-tree at [`nebari-dev/nebari-operator`](https://github.com/nebari-dev/nebari-operator). This phase's status from NIC's perspective:
+
+| Deliverable | Status |
+|-------------|--------|
+| Operator deployed by NIC via ArgoCD + Kustomize | ✅ |
+| Operator version pinned in the Kustomize manifest | ✅ |
+| `InfraSettings.KeycloakBasePath` and `HTTPSPort` propagated via Kustomize patches | ✅ |
+| Operator-side reconciliation of `NebariApp` CRs | upstream-owned |
+| Operator-side `SecurityPolicy` and OIDC plumbing | upstream-owned |
+| Grafana dashboard provisioning | ⏳ (depends on LGTM stack) |
+
+## 13.4 Phase 4: Multi-Cloud Parity
+
+**Goals**: Make the secondary cluster providers real.
+
+| Deliverable | Status |
+|-------------|--------|
+| GCP provider (GKE, VPC, Filestore) | ⏳ |
+| Azure provider (AKS, VNet, Azure Files) | ⏳ |
+| Provider parity tests | ⏳ |
+| Multi-cloud CI workflows | ⏳ |
+
+Note: [ADR-0004](../../adr/0004-out-of-tree-provider-plugins.md) (Proposed, 2026-04-15) re-frames the multi-cloud roadmap. Out-of-tree provider plugins would let GCP, Azure, and third-party providers ship independently. The in-tree path above is the original plan; the plugin path is the proposed direction.
+
+## 13.5 Phase 5: Observability
+
+**Goals**: NIC observes itself, and clusters get a real telemetry backend.
+
+| Deliverable | Status |
+|-------------|--------|
+| OpenTelemetry instrumentation in library code | 🟡 (per `CLAUDE.md` exemptions for `pkg/status` and byte/line helpers in `pkg/tofu`; operation-granularity `TerraformExecutor` wrappers tracked as outstanding work) |
+| Status-channel seam between `pkg/` and `cmd/` (`pkg/status`, `cmd/nic/status_handler.go`) | ✅ |
+| OTLP exporter wiring (`OTEL_EXPORTER=otlp`, `OTEL_ENDPOINT=...`) | ✅ |
+| LGTM backend deployed on cluster | ⏳ |
+| Grafana dashboards for NIC operations | ⏳ |
+
+## 13.6 Phase 6: Production Hardening
+
+**Goals**: GA-readiness items.
+
+| Deliverable | Status |
+|-------------|--------|
+| Documented upgrade paths between alpha releases | ⏳ |
+| Comprehensive end-to-end testing across providers | ⏳ |
+| Backup and restore for foundational software | ⏳ |
+| Compliance profiles (HIPAA, SOC2, PCI-DSS) | ⏳ |
+| v1.0.0 release | ⏳ |
+
+## 13.7 Known Issues Tracked
+
+A few of the open issues that affect the picture above:
+
+- [#63](https://github.com/nebari-dev/nebari-infrastructure-core/issues/63) Ctrl-C during destroy leaves OpenTofu state locked (bug)
+- [#64](https://github.com/nebari-dev/nebari-infrastructure-core/issues/64) Add `nic unlock` command for stuck state locks (enhancement)
+- [#65](https://github.com/nebari-dev/nebari-infrastructure-core/issues/65) MetalLB deployed on AWS (bug; `InfraSettings.NeedsMetalLB` fix)
+- [#66](https://github.com/nebari-dev/nebari-infrastructure-core/issues/66) Pipe OpenTofu output through slog + pretty-print option (enhancement)
+- [#241](https://github.com/nebari-dev/nebari-infrastructure-core/issues/241) Avoid redundant `tofu init` / module downloads during deploy (perf)
+- [#300](https://github.com/nebari-dev/nebari-infrastructure-core/issues/300) Audit and rewrite design docs against current code (this audit)
diff --git a/docs/design-doc/operations/longhorn-node-maintenance.md b/docs/design-doc/operations/longhorn-node-maintenance.md
index 90a0b3be..13690d79 100644
--- a/docs/design-doc/operations/longhorn-node-maintenance.md
+++ b/docs/design-doc/operations/longhorn-node-maintenance.md
@@ -1,6 +1,6 @@
 # Longhorn Node Maintenance
-This document describes how to gracefully drain an EKS node when Longhorn is the storage backend, and what to expect during abrupt node failures.
+This document describes how to gracefully drain an EKS node when Longhorn is the storage backend, and what to expect during abrupt node failures. Longhorn is currently only wired into the AWS provider (see `pkg/provider/aws/longhorn.go`, chart version 1.8.1); these procedures apply to NIC-managed EKS clusters.
 ## Background

From fce61f79229b24b96366db12a57305b5fa1546e5 Mon Sep 17 00:00:00 2001
From: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>
Date: Wed, 13 May 2026 13:11:48 +0200
Subject: [PATCH 2/3] docs(design-doc): fix factual inaccuracies flagged in PR #301 review

Reverse-direction review surfaced six places where the rewritten docs still misrepresented the code:

- nebari-operator kustomization: doc claimed patches for ingress hostname / KeycloakBasePath / HTTPSPort. The actual deployment patch only sets Keycloak integration env vars and the TLS cluster-issuer name; HTTPSPort is consumed by gateway templates, not by the operator. Rewrote 11.2, expanded 11.4 to the seven values actually rendered (with a Source column distinguishing InfraSettings fields from NIC-internal defaults), and updated 11.5 + 04-key-decisions 4.6 to match.
- pkg/git.Config snippet: ArgocdAuth AuthConfig -> ArgoCDAuth *AuthConfig (pointer, capitalized CD); auth tag is not omitempty in real code.
- pkg/provider/aws layout: AWSConfig -> aws.Config; lbc.go -> aws_load_balancer_controller.go; expanded the file list to include efs.go, kubeconfig.go, cleanup.go, k8s.go, version.go for fidelity.
- GCP / Azure stubs: doc claimed methods return "not yet implemented"; they actually emit a "(stub)" status update and return nil. Fixed in both 02-system-overview and 06-opentofu-module-architecture.
- 08-terraform-exec-integration 8.5: AWS Deploy sketch header pointed at tofu.go; the real Deploy lives in provider.go. Updated header and called out the conditionals the snippet omits.
---
 .../architecture/02-system-overview.md | 2 +-
 .../architecture/04-key-decisions.md | 2 +-
 .../06-opentofu-module-architecture.md | 33 +++++++++++--------
 .../implementation/07-configuration-design.md | 10 +++---
 .../08-terraform-exec-integration.md | 4 +--
 .../implementation/11-nebari-operator.md | 30 ++++++++++-------
 6 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/docs/design-doc/architecture/02-system-overview.md b/docs/design-doc/architecture/02-system-overview.md
index 5eaf824c..84ae8070 100644
--- a/docs/design-doc/architecture/02-system-overview.md
+++ b/docs/design-doc/architecture/02-system-overview.md
@@ -100,7 +100,7 @@ The actual repository layout is captured in [`AGENTS.md`](../../../AGENTS.md). K
 **`pkg/provider/` (Cluster providers)**
 - `pkg/provider/provider.go` defines the `Provider` interface (`Name`, `Validate`, `Deploy`, `Destroy`, `GetKubeconfig`, `Summary`, `InfraSettings`) and the `InfraSettings` capability struct (`StorageClass`, `NeedsMetalLB`, `LoadBalancerAnnotations`, `MetalLBAddressPool`, `KeycloakBasePath`, `HTTPSPort`, `EFSStorageClass`, `SupportsLocalGitOps`).
-- One sub-package per cluster provider: `aws/`, `hetzner/`, `local/`, `existing/`, plus `gcp/` and `azure/` stubs (registered but their methods return "not yet implemented").
+- One sub-package per cluster provider: `aws/`, `hetzner/`, `local/`, `existing/`, plus `gcp/` and `azure/` stubs (registered, but their `Validate`/`Deploy`/`Destroy` methods emit a "(stub)" status message and return `nil` rather than provisioning anything).
 - AWS-specific Terraform templates live under `pkg/provider/aws/templates/` and are embedded into the binary via `go:embed`.
 **`pkg/dnsprovider/` (DNS providers)**
diff --git a/docs/design-doc/architecture/04-key-decisions.md b/docs/design-doc/architecture/04-key-decisions.md
index be0c7b2f..a642e658 100644
--- a/docs/design-doc/architecture/04-key-decisions.md
+++ b/docs/design-doc/architecture/04-key-decisions.md
@@ -132,7 +132,7 @@ A full LGTM stack (Loki / Grafana / Tempo / Mimir) is not deployed today; that i
 - NIC is an infrastructure tool; the operator is an application-integration tool
 - Keeps NIC's surface area focused on cluster provisioning and bootstrap
-NIC passes `InfraSettings.KeycloakBasePath` and `InfraSettings.HTTPSPort` into the operator's Kustomize patch so it routes correctly per provider. NIC does not implement the reconciliation logic; that lives upstream.
+NIC renders Keycloak integration env vars (URL, realm, admin secret, issuer context path, external URL) and the TLS cluster-issuer name into the operator's Kustomize patch. See [Nebari Operator §11.4](../implementation/11-nebari-operator.md) for the full list. NIC does not implement the reconciliation logic; that lives upstream.
 ### 4.7 Decision: OpenTelemetry in Library Code, slog in the CLI
diff --git a/docs/design-doc/implementation/06-opentofu-module-architecture.md b/docs/design-doc/implementation/06-opentofu-module-architecture.md
index 67acdba8..11054750 100644
--- a/docs/design-doc/implementation/06-opentofu-module-architecture.md
+++ b/docs/design-doc/implementation/06-opentofu-module-architecture.md
@@ -12,18 +12,23 @@ There is **no root-level `terraform/` directory**. AWS-specific templates live i
 ```
 pkg/provider/aws/
-├── config.go # AWSConfig struct (yaml/json tags)
-├── provider.go # Implements provider.Provider
-├── state.go # S3 state-bucket lifecycle (ensure / destroy)
-├── longhorn.go # Longhorn storage installation
-├── lbc.go # AWS Load Balancer Controller
-├── tofu.go # Builds tfvars and invokes pkg/tofu.Setup
-└── templates/ # Embedded via go:embed
-    ├── main.tf # Calls upstream nebari-dev/eks-cluster module
-    ├── variables.tf # tfvars input schema
-    ├── outputs.tf # Cluster name, endpoint, OIDC issuer, etc.
-    ├── provider.tf # AWS provider config
-    └── backend.tf # S3 backend with use_lockfile = true
+├── config.go # aws.Config struct (yaml/json tags)
+├── provider.go # Implements provider.Provider
+├── state.go # S3 state-bucket lifecycle (ensure / destroy)
+├── tofu.go # Builds tfvars and invokes pkg/tofu.Setup
+├── k8s.go # Shared kube-client construction
+├── kubeconfig.go # GetKubeconfig implementation
+├── efs.go # EFS storage-class wiring
+├── longhorn.go # Longhorn storage installation
+├── aws_load_balancer_controller.go # AWS Load Balancer Controller install
+├── cleanup.go / cleanup_k8s.go # Pre-destroy resource cleanup
+├── version.go # Provider-version probe
+└── templates/ # Embedded via go:embed
+    ├── main.tf # Calls upstream nebari-dev/eks-cluster module
+    ├── variables.tf # tfvars input schema
+    ├── outputs.tf # Cluster name, endpoint, OIDC issuer, etc.
+    ├── provider.tf # AWS provider config
+    └── backend.tf # S3 backend with use_lockfile = true
 ```
 Other cluster providers do not use OpenTofu and therefore have no `templates/` directory:
@@ -32,8 +37,8 @@ Other cluster providers do not use OpenTofu and therefore have no `templates/` d
 pkg/provider/hetzner/ # Wraps the hetzner-k3s binary
 pkg/provider/local/ # Kind stub (Makefile creates the cluster)
 pkg/provider/existing/ # Adopts an existing kubeconfig
-pkg/provider/gcp/ # Stub: returns "not yet implemented"
-pkg/provider/azure/ # Stub: returns "not yet implemented"
+pkg/provider/gcp/ # Stub: emits a "(stub)" status message and returns nil
+pkg/provider/azure/ # Stub: emits a "(stub)" status message and returns nil
 ```
 ## 6.3 AWS Root Module
diff --git a/docs/design-doc/implementation/07-configuration-design.md b/docs/design-doc/implementation/07-configuration-design.md
index 2c989962..b6d4c7c1 100644
--- a/docs/design-doc/implementation/07-configuration-design.md
+++ b/docs/design-doc/implementation/07-configuration-design.md
@@ -88,11 +88,11 @@ type ACMEConfig struct {
 ```go
 // from pkg/git
 type Config struct {
-    URL string `yaml:"url"` // git@..., https://..., or file://...
-    Branch string `yaml:"branch,omitempty"` // default: main
-    Path string `yaml:"path,omitempty"` // subdirectory for this cluster
-    Auth AuthConfig `yaml:"auth,omitempty"`
-    ArgocdAuth AuthConfig `yaml:"argocd_auth,omitempty"` // optional read-only
+    URL string `yaml:"url"` // git@..., https://..., or file://...
+    Branch string `yaml:"branch,omitempty"` // default: main
+    Path string `yaml:"path,omitempty"` // subdirectory for this cluster
+    Auth AuthConfig `yaml:"auth"`
+    ArgoCDAuth *AuthConfig `yaml:"argocd_auth,omitempty"` // optional read-only; falls back to Auth
 }
 type AuthConfig struct {
diff --git a/docs/design-doc/implementation/08-terraform-exec-integration.md b/docs/design-doc/implementation/08-terraform-exec-integration.md
index 953511d0..140b007d 100644
--- a/docs/design-doc/implementation/08-terraform-exec-integration.md
+++ b/docs/design-doc/implementation/08-terraform-exec-integration.md
@@ -83,10 +83,10 @@ There is **no** `findOpenTofuBinary()` in `PATH`. The binary is always the versi
 ## 8.5 AWS Provider Usage
-The AWS provider's `Deploy` and `Destroy` methods are the primary callers. The shape (simplified, with telemetry omitted):
+The AWS provider's `Deploy` and `Destroy` methods are the primary callers. The shape (simplified, with telemetry, dry-run/backend-override handling, and bucket-existence branching omitted - see `pkg/provider/aws/provider.go` for the authoritative version):
 ```go
-// pkg/provider/aws/tofu.go (illustrative)
+// pkg/provider/aws/provider.go (illustrative)
 func (p *Provider) Deploy(ctx context.Context, projectName string, cluster *config.ClusterConfig, opts provider.DeployOptions) error {
     awsCfg, err := decodeConfig(cluster)
     if err != nil { return err }
diff --git a/docs/design-doc/implementation/11-nebari-operator.md b/docs/design-doc/implementation/11-nebari-operator.md
index e6aeb822..cfd26398 100644
--- a/docs/design-doc/implementation/11-nebari-operator.md
+++ b/docs/design-doc/implementation/11-nebari-operator.md
@@ -20,10 +20,13 @@ The operator is deployed as a foundational ArgoCD application from `pkg/argocd/t
 ```
 pkg/argocd/templates/manifests/nebari-operator/
-└── kustomization.yaml # Points at github.com/nebari-dev/nebari-operator
-                       # at a pinned ref (e.g. v0.1.0-alpha.19), with
-                       # patches for ingress hostname, Keycloak base path,
-                       # and HTTPS port
+├── kustomization.yaml # Points at github.com/nebari-dev/nebari-operator
+│                      # at a pinned ref (e.g. v0.1.0-alpha.19) and applies
+│                      # the deployment patch below
+└── deployment-patch.yaml # Sets environment variables on the controller-manager
+                          # container: Keycloak integration (URL, realm, admin
+                          # secret name/namespace, issuer context path, external
+                          # URL) and the TLS cluster-issuer name
 ```
 The operator runs in its own namespace and watches for `NebariApp` CRs across the cluster.
@@ -64,14 +67,19 @@ Critically:
 Operators of Nebari clusters and software-pack authors should treat the upstream operator's docs as authoritative.
-## 11.4 Provider-Shaped Inputs from NIC
+## 11.4 Values Rendered Into the Operator Patch
-The operator's manifests need a small number of cluster-shaped values to route correctly. NIC supplies these via Kustomize patches sourced from `provider.InfraSettings(cfg)`:
+The deployment patch is a Go template rendered by `pkg/argocd` with values that come from a mix of `provider.InfraSettings(cfg)`, `cfg.Domain`, and NIC-internal Keycloak/cert-manager defaults. The fields below correspond to env vars set on the `nebari-operator-controller-manager` container.
-| `InfraSettings` field | Operator use |
-|------------------------|--------------|
-| `KeycloakBasePath` | Path prefix the operator uses when constructing OIDC issuer URLs (`/auth` for the keycloakx chart used today; empty for upstream/Bitnami) |
-| `HTTPSPort` | Port to use when constructing user-facing URLs in the operator's status output and landing-page registration |
+| Template field | Source | Operator use |
+|----------------|--------|--------------|
+| `KeycloakBasePath` | `InfraSettings.KeycloakBasePath` | Path prefix appended to the in-cluster Keycloak URL (`/auth` for the keycloakx chart used today; empty for upstream/Bitnami). Surfaces as `KEYCLOAK_ISSUER_CONTEXT_PATH`. |
+| `Domain` | `cfg.Domain` | Used to compute `KEYCLOAK_EXTERNAL_URL` (`https://keycloak.`). |
+| `KeycloakServiceURL` | NIC default (`http://keycloak-keycloakx-http.keycloak.svc.cluster.local:8080`) | In-cluster URL the operator uses to reach Keycloak. Surfaces as `KEYCLOAK_URL`. |
+| `KeycloakRealm` | NIC default (`nebari`) | Realm the operator talks to. Surfaces as `KEYCLOAK_REALM`. |
+| `KeycloakAdminSecretName` | NIC default | Name of the K8s secret the operator reads for Keycloak admin credentials. Surfaces as `KEYCLOAK_ADMIN_SECRET_NAME`. |
+| `KeycloakNamespace` | NIC default (`keycloak`) | Namespace containing the admin secret. Surfaces as `KEYCLOAK_ADMIN_SECRET_NAMESPACE`. |
+| `CertificateIssuer` | NIC choice (`selfsigned-issuer` or `letsencrypt-issuer`, based on whether `dns.` is set) | cert-manager `ClusterIssuer` name the operator should reference when creating Certificate resources. Surfaces as `TLS_CLUSTER_ISSUER_NAME`. |
 The operator does not see any other parts of `NebariConfig`. In particular, it does not know which cluster provider is in use.
@@ -79,7 +87,7 @@ The operator does not see any other parts of `NebariConfig`. In particular, it d
 - Pin a known-good operator release in `pkg/argocd/templates/manifests/nebari-operator/kustomization.yaml`
 - Render the operator's ArgoCD Application into the GitOps repo with the correct sync wave (after Keycloak, cert-manager, and Envoy Gateway are ready)
-- Pass `InfraSettings.KeycloakBasePath` and `InfraSettings.HTTPSPort` into the operator manifests via Kustomize patches
+- Render `deployment-patch.yaml` with the Keycloak integration env vars and TLS issuer name listed in §11.4
 That's it. NIC does not reconcile `NebariApp` CRs, does not implement the operator's controller, and does not ship any `api/v1alpha1/` package. If you find documentation that says otherwise, it is out of date.
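
The §11.4 mapping from template fields to controller environment variables can be sketched in Go as follows. This is an illustration under stated assumptions, not NIC's rendering code: `patchValues`, `chooseIssuer`, and `operatorEnv` are hypothetical names, only a subset of the fields is covered, and the direction of the issuer choice (Let's Encrypt when a DNS provider block is configured, self-signed otherwise) is an assumption read into the table.

```go
package main

import (
	"fmt"
	"sort"
)

// patchValues holds a subset of the template fields from the §11.4 table.
// The struct itself is hypothetical; only the field names come from the doc.
type patchValues struct {
	KeycloakBasePath   string
	KeycloakServiceURL string
	KeycloakRealm      string
	KeycloakNamespace  string
	CertificateIssuer  string
}

// chooseIssuer picks the ClusterIssuer name. Assumed direction: a configured
// DNS provider selects Let's Encrypt, otherwise the self-signed issuer is used.
func chooseIssuer(dnsConfigured bool) string {
	if dnsConfigured {
		return "letsencrypt-issuer"
	}
	return "selfsigned-issuer"
}

// operatorEnv maps each field to the env var the table says it surfaces as.
func operatorEnv(v patchValues) map[string]string {
	return map[string]string{
		"KEYCLOAK_URL":                    v.KeycloakServiceURL,
		"KEYCLOAK_REALM":                  v.KeycloakRealm,
		"KEYCLOAK_ISSUER_CONTEXT_PATH":    v.KeycloakBasePath,
		"KEYCLOAK_ADMIN_SECRET_NAMESPACE": v.KeycloakNamespace,
		"TLS_CLUSTER_ISSUER_NAME":         v.CertificateIssuer,
	}
}

func main() {
	env := operatorEnv(patchValues{
		KeycloakBasePath:   "/auth",
		KeycloakServiceURL: "http://keycloak-keycloakx-http.keycloak.svc.cluster.local:8080",
		KeycloakRealm:      "nebari",
		KeycloakNamespace:  "keycloak",
		CertificateIssuer:  chooseIssuer(false),
	})

	// Sort the names so the printed order is stable across runs.
	names := make([]string, 0, len(env))
	for name := range env {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s=%s\n", name, env[name])
	}
}
```

Keeping the mapping in one function mirrors the point made in §11.4: the operator only ever sees these rendered values, never the provider or the rest of `NebariConfig`.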
From 1cb355e008e298b71f3461fd9e3f806df2eadcd7 Mon Sep 17 00:00:00 2001
From: Chuck McAndrew <6248903+dcmcand@users.noreply.github.com>
Date: Mon, 25 May 2026 13:08:19 +0200
Subject: [PATCH 3/3] docs(design-doc): address PR #301 review feedback

- OTEL_EXPORTER default: console -> none (pkg/telemetry/telemetry.go:26)
- Bootstrap marker: .nic-bootstrapped -> .bootstrapped (pkg/git/client_impl.go:22)
- NebariApp YAML: apiVersion reconcilers.nebari.dev/v1, service block at spec top, routes use pathPrefix (per upstream nebariapp_types.go)
- Drop "next-generation" qualifier in nic-summary
- Model Serving Pack lists llm-d, not MLflow/KServe/Envoy AI
- Remove aws.NodeGroup negative-space note (docs are the source of truth)
- Drop "under 20 minutes" deploy goal from success criteria
---
 docs/design-doc/appendix/16-configuration-reference.md | 4 +---
 docs/design-doc/appendix/17-appendix.md | 1 -
 docs/design-doc/architecture/02-system-overview.md | 2 +-
 docs/design-doc/architecture/04-key-decisions.md | 2 +-
 .../implementation/10-foundational-software.md | 2 +-
 docs/design-doc/implementation/11-nebari-operator.md | 10 +++++-----
 docs/design-doc/nic-summary.md | 4 ++--
 7 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/docs/design-doc/appendix/16-configuration-reference.md b/docs/design-doc/appendix/16-configuration-reference.md
index 6c3da86b..1e543b45 100644
--- a/docs/design-doc/appendix/16-configuration-reference.md
+++ b/docs/design-doc/appendix/16-configuration-reference.md
@@ -134,8 +134,6 @@ cluster:
 # node_selector: { workload: storage }
 ```
-Fields not in `aws.NodeGroup`: `single_subnet`, per-node-group `permissions_boundary`. If you see them in older docs, they are not real.
-
 State backend: S3 with `use_lockfile = true`, bucket auto-created per [§5.2 of State Management](../architecture/05-state-management.md). No DynamoDB.
 ### 2.2 `cluster.hetzner` (Hetzner Cloud k3s)
@@ -328,7 +326,7 @@ Loaded by `godotenv` from `.env` (gitignored) at startup. Used for credentials a
 | `GIT_SSH_PRIVATE_KEY` (or whatever you point `git_repository.auth.ssh_key_env` at) | `pkg/git` | SSH private key in PEM form |
 | `GIT_TOKEN` (or whatever you point `git_repository.auth.token_env` at) | `pkg/git` | Personal access token for HTTPS git URLs |
 | `KUBECONFIG` | `existing` provider, `nic kubeconfig` | Kubeconfig path (used when `cluster.existing.kubeconfig` is empty) |
-| `OTEL_EXPORTER` | `pkg/telemetry` | `console` (default), `otlp`, `both`, `none` |
+| `OTEL_EXPORTER` | `pkg/telemetry` | `none` (default), `console`, `otlp`, `both` |
 | `OTEL_ENDPOINT` | `pkg/telemetry` | OTLP endpoint (default: `localhost:4317`) |
 `.env.example` in the repo root lists the variables NIC looks at; copy to `.env` and fill in the values you need.
diff --git a/docs/design-doc/appendix/17-appendix.md b/docs/design-doc/appendix/17-appendix.md
index 79d91e36..0549cff8 100644
--- a/docs/design-doc/appendix/17-appendix.md
+++ b/docs/design-doc/appendix/17-appendix.md
@@ -49,7 +49,6 @@ The specific commit dates for the 2026 entries can be reconstructed from git his
 - ⏳ LGTM observability backend deployed by NIC
 - ⏳ Documented upgrade paths between releases
 - ⏳ End-to-end test coverage across providers
-- ⏳ AWS cluster deploy under 20 minutes from a fresh account
 **User success criteria:**
diff --git a/docs/design-doc/architecture/02-system-overview.md b/docs/design-doc/architecture/02-system-overview.md
index 84ae8070..452e05c7 100644
--- a/docs/design-doc/architecture/02-system-overview.md
+++ b/docs/design-doc/architecture/02-system-overview.md
@@ -136,7 +136,7 @@ The actual repository layout is captured in [`AGENTS.md`](../../../AGENTS.md). K
 - `pkg/git` clones, commits, and pushes the GitOps repo (including `file://` local paths).
 - `pkg/helm` is a thin wrapper around `helm.sh/helm/v3/pkg/action` used by `pkg/argocd`.
 - `pkg/status` is the in-process status channel used to surface user-visible progress from library code without violating the "no `slog` in `pkg/`" rule.
-- `pkg/telemetry` wires up the OpenTelemetry tracer provider, with exporters selected via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`).
+- `pkg/telemetry` wires up the OpenTelemetry tracer provider, with exporters selected via `OTEL_EXPORTER` (`none` default, `console`, `otlp`, `both`).
 ### 2.3 Why This Architecture?
diff --git a/docs/design-doc/architecture/04-key-decisions.md b/docs/design-doc/architecture/04-key-decisions.md
index a642e658..2985ee39 100644
--- a/docs/design-doc/architecture/04-key-decisions.md
+++ b/docs/design-doc/architecture/04-key-decisions.md
@@ -142,7 +142,7 @@ NIC renders Keycloak integration env vars (URL, realm, admin secret, issuer cont
 - All new functions in `pkg/` are wrapped in OpenTelemetry trace spans, with the documented exemptions in [`CLAUDE.md`](../../../CLAUDE.md) (e.g., per-line writers in `pkg/status` and byte/line helpers in `pkg/tofu`).
 - Library code never calls `slog`. User-visible progress goes through the status channel; `cmd/nic/status_handler.go` is the only translator into structured logs.
-- Exporters are configurable via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`) and `OTEL_ENDPOINT`.
+- Exporters are configurable via `OTEL_EXPORTER` (`none` default, `console`, `otlp`, `both`) and `OTEL_ENDPOINT`.
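
The exporter selection described above can be sketched as a small Go helper. This is a hedged illustration, not the code in `pkg/telemetry`: `exporterSettings` and `resolveExporter` are hypothetical names, and the case-insensitive matching and the error on unknown values are assumptions. Only the accepted values, the `none` default, and the `localhost:4317` default endpoint come from the docs in this series.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exporterSettings holds the resolved telemetry choices.
type exporterSettings struct {
	Mode     string // one of: none, console, otlp, both
	Endpoint string // OTLP endpoint; only meaningful for otlp and both
}

// resolveExporter is a hypothetical helper that reads OTEL_EXPORTER and
// OTEL_ENDPOINT through getenv, applies the documented defaults, and rejects
// unknown modes. Taking getenv as a parameter keeps it testable.
func resolveExporter(getenv func(string) string) (exporterSettings, error) {
	mode := strings.ToLower(strings.TrimSpace(getenv("OTEL_EXPORTER")))
	if mode == "" {
		mode = "none" // documented default after the console -> none change
	}
	switch mode {
	case "none", "console", "otlp", "both":
		// supported values listed in the configuration reference
	default:
		return exporterSettings{}, fmt.Errorf("unsupported OTEL_EXPORTER %q", mode)
	}

	endpoint := getenv("OTEL_ENDPOINT")
	if endpoint == "" {
		endpoint = "localhost:4317" // documented default endpoint
	}
	return exporterSettings{Mode: mode, Endpoint: endpoint}, nil
}

func main() {
	settings, err := resolveExporter(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("exporter=%s endpoint=%s\n", settings.Mode, settings.Endpoint)
}
```

With `none` as the default, a bare `nic` invocation produces no trace output unless the operator opts in, which is the behavioural change this patch documents.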
 **Pattern:**
diff --git a/docs/design-doc/implementation/10-foundational-software.md b/docs/design-doc/implementation/10-foundational-software.md
index f1e84189..ae47c8c4 100644
--- a/docs/design-doc/implementation/10-foundational-software.md
+++ b/docs/design-doc/implementation/10-foundational-software.md
@@ -43,7 +43,7 @@ Sketch of what `pkg/argocd` writes into the GitOps repo at the `git_repository.p
 //
 ├── root.yaml # App-of-apps root
 ├── nic-config.yaml # Scrubbed copy of nebari-config.yaml
-├── .nic-bootstrapped # Marker file
+├── .bootstrapped # Marker file
 └── manifests/
     ├── cert-manager/ # Application + (optional) values
     ├── cluster-issuers/
diff --git a/docs/design-doc/implementation/11-nebari-operator.md b/docs/design-doc/implementation/11-nebari-operator.md
index cfd26398..2b8dc57d 100644
--- a/docs/design-doc/implementation/11-nebari-operator.md
+++ b/docs/design-doc/implementation/11-nebari-operator.md
@@ -36,19 +36,19 @@ The operator runs in its own namespace and watches for `NebariApp` CRs across th
 The CRD shape is owned by the upstream operator. The relevant fields, at a high level (consult the upstream repo for the authoritative schema):
 ```yaml
-apiVersion: nebari.dev/v1
+apiVersion: reconcilers.nebari.dev/v1
 kind: NebariApp
 metadata:
   name: jupyter-hub
   namespace: jupyter
 spec:
   hostname: jupyter.example.com
+  service:
+    name: jupyterhub
+    port: 8000
   routing:
     routes:
-      - path: /
-        backend:
-          name: jupyterhub
-          port: 8000
+      - pathPrefix: /
     publicRoutes: [] # Paths that should bypass OIDC
     tls: { ... }
   auth:
diff --git a/docs/design-doc/nic-summary.md b/docs/design-doc/nic-summary.md
index 385aa048..6d6a96ad 100644
--- a/docs/design-doc/nic-summary.md
+++ b/docs/design-doc/nic-summary.md
@@ -2,11 +2,11 @@
 ## What is Nebari Infrastructure Core?
-Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure.
Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC is a Go CLI that provisions Kubernetes clusters and bootstraps a foundational software stack via GitOps. Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS (EKS), the `hetzner-k3s` binary for Hetzner, Kind for local development, and an `existing` adapter for pre-provisioned clusters - while GCP and Azure are stubbed and not yet implemented. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
+Nebari Infrastructure Core (NIC) is a platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC is a Go CLI that provisions Kubernetes clusters and bootstraps a foundational software stack via GitOps. Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS (EKS), the `hetzner-k3s` binary for Hetzner, Kind for local development, and an `existing` adapter for pre-provisioned clusters - while GCP and Azure are stubbed and not yet implemented. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
 ## Advantages Over Nebari Classic
-The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together (meaning you get the full data science stack whether you need it all or not), NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NebariApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
+The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together (meaning you get the full data science stack whether you need it all or not), NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NebariApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (llm-d). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
 ## Architecture Philosophy