docs(design-doc): audit and rewrite design docs against current code #301
dcmcand wants to merge 3 commits
Most files under docs/design-doc/ had drifted substantially from the codebase: invented CLI commands, wrong package layouts, fictional code samples, wrong YAML config schemas, and a foundational software stack (LGTM) presented as deployed when only the OpenTelemetry Collector ships today. This commit rewrites the heavily-drifted docs from scratch against current code (verified against pkg/, cmd/nic/, examples/, Makefile, .github/workflows/) and applies surgical fixes elsewhere.

Highlights:

- Acknowledge per-provider tool choice: AWS uses OpenTofu, Hetzner uses the hetzner-k3s binary, local uses Kind, existing is a no-op. The Provider interface is the contract.
- Cross-reference ADR-0004 (out-of-tree provider plugins) where relevant.
- Fix the config schema reference to match the real cluster.<provider>: / dns.<provider>: discriminator pattern. Remove fictional top-level provider:, version:, kubernetes:, tls:, foundational_software:, images:, features: blocks.
- Document Hetzner and existing providers (previously missing).
- Mark GCP/Azure providers as stubs (not deployable today).
- Replace fictional CLI commands (nic plan / status / state / unlock / init / stack / marketplace / health) with the real surface (deploy, destroy, validate, kubeconfig, version).
- Replace DynamoDB-locked S3 backend with the real native lockfile configuration (use_lockfile = true).
- Reframe nebari-operator as out-of-tree (lives at github.com/nebari-dev/nebari-operator); NIC just deploys it. Correct CRD name throughout (NebariApp, not NebariApplication / NicApp).
- Realign the testing strategy and milestones with what CI actually runs and what is actually shipped vs planned.

Closes #300
@khuyentran1401 I updated the docs here; that should hopefully help you out.
Reverse-direction review surfaced six places where the rewritten docs still misrepresented the code:

- nebari-operator kustomization: doc claimed patches for ingress hostname / KeycloakBasePath / HTTPSPort. The actual deployment patch only sets Keycloak integration env vars and the TLS cluster-issuer name; HTTPSPort is consumed by gateway templates, not by the operator. Rewrote 11.2, expanded 11.4 to the seven values actually rendered (with a Source column distinguishing InfraSettings fields from NIC-internal defaults), and updated 11.5 + 04-key-decisions 4.6 to match.
- pkg/git.Config snippet: ArgocdAuth AuthConfig -> ArgoCDAuth *AuthConfig (pointer, capitalized CD); auth tag is not omitempty in real code.
- pkg/provider/aws layout: AWSConfig -> aws.Config; lbc.go -> aws_load_balancer_controller.go; expanded the file list to include efs.go, kubeconfig.go, cleanup.go, k8s.go, version.go for fidelity.
- GCP / Azure stubs: doc claimed methods return "not yet implemented"; they actually emit a "(stub)" status update and return nil. Fixed in both 02-system-overview and 06-opentofu-module-architecture.
- 08-terraform-exec-integration 8.5: AWS Deploy sketch header pointed at tofu.go; the real Deploy lives in provider.go. Updated header and called out the conditionals the snippet omits.
    apiVersion: nebari.dev/v1
    kind: NebariApp
    metadata:
      name: jupyterhub-metrics
      name: jupyter-hub
      namespace: jupyter
      labels:
        app: jupyterhub
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      selector:
        app: jupyterhub
      ports:
        - name: metrics
          port: 9090
          targetPort: 9090
    ```

    5. **Grafana Dashboard ConfigMap:**
    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: jupyterhub-dashboard
      namespace: monitoring
      labels:
        grafana_dashboard: "1"
    data:
      jupyterhub.json: |
        {
          "dashboard": {
            "title": "JupyterHub Overview",
            "panels": [ ... ]
          }
        }
    ```

    6. **Status Update:**
    ```yaml
    status:
      phase: Ready
      url: https://jupyter.nebari.example.com
      keycloakClientID: jupyterhub-jupyter
      conditions:
        - type: RoutingConfigured
          status: "True"
          lastTransitionTime: "2025-01-30T12:00:00Z"
        - type: AuthenticationConfigured
          status: "True"
          lastTransitionTime: "2025-01-30T12:01:00Z"
        - type: ObservabilityConfigured
          status: "True"
          lastTransitionTime: "2025-01-30T12:02:00Z"
        - type: Ready
          status: "True"
          lastTransitionTime: "2025-01-30T12:02:00Z"
          reason: AllComponentsReady
          message: "Application is fully configured and accessible"
      hostname: jupyter.example.com
      routing:
        routes:
          - path: /
            backend:
              name: jupyterhub
              port: 8000
A few fields in this example didn't match the upstream types when I cross-checked:

- `apiVersion` is `reconcilers.nebari.dev/v1` (from `groupversion_info.go`), not `nebari.dev/v1`.
- `routing.routes[].path` is `pathPrefix` (see `nebariapp_types.go`, `RouteMatch`).
- `RouteMatch` has no `backend`; `service` sits at the top of the spec.

The rest of the example matches.
marcelovilla left a comment
Thanks for this PR @dcmcand! I left some comments
    - `pkg/git` clones, commits, and pushes the GitOps repo (including `file://` local paths).
    - `pkg/helm` is a thin wrapper around `helm.sh/helm/v3/pkg/action` used by `pkg/argocd`.
    - `pkg/status` is the in-process status channel used to surface user-visible progress from library code without violating the "no `slog` in `pkg/`" rule.
    - `pkg/telemetry` wires up the OpenTelemetry tracer provider, with exporters selected via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`).
The exporter type is `none` by default, not `console` (`nebari-infrastructure-core/pkg/telemetry/telemetry.go`, lines 26 to 29 in `fce61f7`).
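For illustration, the default being discussed could look like the sketch below. It is not the code in `pkg/telemetry/telemetry.go`: the `exporterKind` type, the `parseExporter` helper, and the choice to reject unknown values with an error are all assumptions made for this example. Only the four accepted values and the `none` default come from the thread.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exporterKind is a hypothetical type for this sketch; the real
// pkg/telemetry code may model exporters differently.
type exporterKind string

const (
	exporterNone    exporterKind = "none"
	exporterConsole exporterKind = "console"
	exporterOTLP    exporterKind = "otlp"
	exporterBoth    exporterKind = "both"
)

// parseExporter maps a raw OTEL_EXPORTER value to an exporterKind.
// An unset or empty value falls back to "none", the default the review
// comment points out. Rejecting unknown values is an assumption.
func parseExporter(raw string) (exporterKind, error) {
	switch v := exporterKind(strings.ToLower(strings.TrimSpace(raw))); v {
	case "":
		return exporterNone, nil
	case exporterNone, exporterConsole, exporterOTLP, exporterBoth:
		return v, nil
	default:
		return "", fmt.Errorf("unsupported OTEL_EXPORTER value %q", raw)
	}
}

func main() {
	kind, err := parseExporter(os.Getenv("OTEL_EXPORTER"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("exporter:", kind)
}
```

With this shape, the docs tables would list `none` as the default and `console` as an opt-in value.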
    **Implementation:**
    - All new functions in `pkg/` are wrapped in OpenTelemetry trace spans, with the documented exemptions in [`CLAUDE.md`](../../../CLAUDE.md) (e.g., per-line writers in `pkg/status` and byte/line helpers in `pkg/tofu`).
    - Library code never calls `slog`. User-visible progress goes through the status channel; `cmd/nic/status_handler.go` is the only translator into structured logs.
    - Exporters are configurable via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`) and `OTEL_ENDPOINT`.
The exporter type is `none` by default, not `console` (`nebari-infrastructure-core/pkg/telemetry/telemetry.go`, lines 26 to 29 in `fce61f7`).
    | `GIT_SSH_PRIVATE_KEY` (or whatever you point `git_repository.auth.ssh_key_env` at) | `pkg/git` | SSH private key in PEM form |
    | `GIT_TOKEN` (or whatever you point `git_repository.auth.token_env` at) | `pkg/git` | Personal access token for HTTPS git URLs |
    | `KUBECONFIG` | `existing` provider, `nic kubeconfig` | Kubeconfig path (used when `cluster.existing.kubeconfig` is empty) |
    | `OTEL_EXPORTER` | `pkg/telemetry` | `console` (default), `otlp`, `both`, `none` |
The exporter type is `none` by default, not `console` (`nebari-infrastructure-core/pkg/telemetry/telemetry.go`, lines 26 to 29 in `fce61f7`).
    <repo>/<path>/
    ├── root.yaml          # App-of-apps root
    ├── nic-config.yaml    # Scrubbed copy of nebari-config.yaml
    ├── .nic-bootstrapped  # Marker file
This file is `.bootstrapped` now.
      ## What is Nebari Infrastructure Core?

    - Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC uses a Go CLI powered by OpenTofu/Terraform modules to provision Kubernetes clusters across AWS, GCP, Azure, or bare metal, ensuring consistent infrastructure regardless of hosting environment. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
    + Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC is a Go CLI that provisions Kubernetes clusters and bootstraps a foundational software stack via GitOps. Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS (EKS), the `hetzner-k3s` binary for Hetzner, Kind for local development, and an `existing` adapter for pre-provisioned clusters - while GCP and Azure are stubbed and not yet implemented. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
Do we want to use the term "next-generation" here? I think we discussed that it's not the most descriptive one.
      ## Advantages Over Nebari Classic

    - The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together—meaning you get the full data science stack whether you need it all or not—NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NicApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
    + The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together (meaning you get the full data science stack whether you need it all or not), NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NebariApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
> Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway).

Should we update this to reflect the fact that the Model Serving Pack uses llm-d instead? As a user, it'd feel misleading to read this and then see that there's no pack providing KServe.
    # node_selector: { workload: storage }
    ```

    Fields not in `aws.NodeGroup`: `single_subnet`, per-node-group `permissions_boundary`. If you see them in older docs, they are not real.
Do we need notes like these? I feel if our documentation is the source of truth, these are not really needed.
    - ⏳ LGTM observability backend deployed by NIC
    - ⏳ Documented upgrade paths between releases
    - ⏳ End-to-end test coverage across providers
    - ⏳ AWS cluster deploy under 20 minutes from a fresh account
Is this really a goal? I feel there's not a lot we can do to reduce the deployment time to under 20 minutes.
- OTEL_EXPORTER default: console -> none (pkg/telemetry/telemetry.go:26)
- Bootstrap marker: .nic-bootstrapped -> .bootstrapped (pkg/git/client_impl.go:22)
- NebariApp YAML: apiVersion reconcilers.nebari.dev/v1, service block at spec top, routes use pathPrefix (per upstream nebariapp_types.go)
- Drop "next-generation" qualifier in nic-summary
- Model Serving Pack lists llm-d, not MLflow/KServe/Envoy AI
- Remove aws.NodeGroup negative-space note (docs are the source of truth)
- Drop "under 20 minutes" deploy goal from success criteria
Summary
The design docs under `docs/design-doc/` had drifted substantially from the codebase. This PR audits every file against current code and rewrites the heavily-drifted ones from scratch; the rest get surgical fixes.

Closes #300.
What changed
Heavily rewritten (substantially wrong before):

- `architecture/02-system-overview.md`, `04-key-decisions.md`, `05-state-management.md`
- `implementation/06-opentofu-module-architecture.md`, `07-configuration-design.md`, `08-terraform-exec-integration.md`, `10-foundational-software.md`, `11-nebari-operator.md`
- `appendix/16-configuration-reference.md`
- `operations/12-testing-strategy.md`, `13-milestones.md`

Surgical edits:

- `nic-summary.md`, `architecture/01-introduction.md`, `architecture/03-goals-and-non-goals.md`, `implementation/09-dns-provider-architecture.md`, `appendix/14-open-questions.md`, `appendix/15-future-enhancements.md`, `appendix/17-appendix.md`, `operations/longhorn-node-maintenance.md`

Net diff: +1415 / -4952 (mostly removing fictional content).
Highlights of what was wrong
- […] `hetzner-k3s` binary. `local` is a Kind stub via `make localkind-up`. `existing` is a no-op. GCP and Azure are stubs. Hetzner and `existing` weren't documented at all. ADR-0004 is now cross-referenced where relevant.
- […] `terraform/modules/{aws,gcp,azure,local,kubernetes,argocd,foundational-apps}/` tree that doesn't exist. AWS templates actually live in `pkg/provider/aws/templates/`. References to `pkg/operator`, `pkg/tofu/executor.go`, `pkg/tofu/workspace.go`, `pkg/tofu/outputs.go`, `pkg/kubernetes/`, `api/v1alpha1/` were removed (none of those exist).
- […] `nic plan`, `nic status`, `nic state list/show/rm/mv`, `nic unlock`, `nic init-backend`, `nic health check`, `nic stack ...`, `nic init`, `nic marketplace` removed. Real verbs documented: `deploy`, `destroy`, `validate`, `kubeconfig`, `version`.
- […] `pkg/argocd/templates/apps/` is now documented correctly.
- […] `provider:` field with sibling `amazon_web_services:` / `google_cloud_platform:` / `azure:` / `hetzner_cloud:` / `local:` keys. The real schema is `cluster.<provider-name>:` with no top-level `provider:` field. Only the Hetzner section had been correct. The reference now matches `pkg/config/config.go` and the per-provider config packages, and adds the missing `certificate:`, `git_repository:`, and `existing` provider sections.
- […] `NicApp` / `NebariApplication` corrected to `NebariApp` throughout. The operator is reframed as an out-of-tree project at `github.com/nebari-dev/nebari-operator`; NIC just deploys it.
- […] (`05-state-management.md`): DynamoDB-based locking replaced with the real S3 native lockfile (`use_lockfile = true`). Fake `state_backend:` config block removed. Fake `DetectDrift` code sample removed.
- […] `nic health check` subsystem removed. CI YAML corrected to match `.github/workflows/ci.yml` (Go 1.25.1, real test command, no fictional integration / scheduled jobs). Mocking libraries corrected (`moto` / `fake-gcs-server` / `azurite` -> LocalStack via `docker-compose.test.yml`). Milestones no longer mark GCP, Azure, Grafana dashboards, multi-cloud CI, or v1.0.0 as shipped; ADR-0004 referenced.

Test plan

- `go build ./...` clean
- `make lint` clean
- `make test` clean (pre-commit hooks ran on commit)

Notes

- […] `appendix/14-open-questions.md` and `appendix/15-future-enhancements.md` have shipped (e.g., git repo consumption via `git_repository:`, `.env`-based secrets). Those are flagged inline rather than removed wholesale, so the future-work appendix stays useful.
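To make the `cluster.<provider-name>:` discriminator from the highlights concrete, here is a small sketch of how a single key under `cluster` can select the provider. The `selectProvider` helper is hypothetical and is not taken from `pkg/config`. The rule that exactly one provider key must be present is an assumption for this example. Of the names used, only `existing` and its `kubeconfig` field appear in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// selectProvider is a hypothetical helper illustrating the discriminator
// pattern: the single key under "cluster" names the provider, and its
// value is that provider's own config block.
func selectProvider(cluster map[string]any) (string, any, error) {
	if len(cluster) != 1 {
		names := make([]string, 0, len(cluster))
		for name := range cluster {
			names = append(names, name)
		}
		sort.Strings(names) // stable error message
		return "", nil, fmt.Errorf("cluster must name exactly one provider, got %d: %v", len(cluster), names)
	}
	for name, cfg := range cluster {
		return name, cfg, nil
	}
	return "", nil, nil // unreachable: len(cluster) == 1 above
}

func main() {
	// Stand-in for the decoded "cluster" block of a config file.
	// There is no top-level "provider:" field; the key is the discriminator.
	cluster := map[string]any{
		"existing": map[string]any{"kubeconfig": "/path/to/kubeconfig"},
	}

	name, cfg, err := selectProvider(cluster)
	if err != nil {
		panic(err)
	}
	fmt.Println("provider:", name, "config:", cfg)
}
```

The point of the pattern is that the provider name and its settings cannot disagree, which a separate top-level `provider:` field would allow.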