Skip to content

docs(design-doc): audit and rewrite design docs against current code#301

Open
dcmcand wants to merge 3 commits into
mainfrom
worktree-design-doc-audit
Open

docs(design-doc): audit and rewrite design docs against current code#301
dcmcand wants to merge 3 commits into
mainfrom
worktree-design-doc-audit

Conversation

@dcmcand

@dcmcand dcmcand commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

The design docs under docs/design-doc/ had drifted substantially from the codebase. This PR audits every file against current code and rewrites the heavily-drifted ones from scratch; the rest get surgical fixes.

Closes #300.

What changed

Heavily rewritten (substantially wrong before):

  • architecture/02-system-overview.md, 04-key-decisions.md, 05-state-management.md
  • implementation/06-opentofu-module-architecture.md, 07-configuration-design.md, 08-terraform-exec-integration.md, 10-foundational-software.md, 11-nebari-operator.md
  • appendix/16-configuration-reference.md
  • operations/12-testing-strategy.md, 13-milestones.md

Surgical edits:

  • nic-summary.md, architecture/01-introduction.md, architecture/03-goals-and-non-goals.md, implementation/09-dns-provider-architecture.md, appendix/14-open-questions.md, appendix/15-future-enhancements.md, appendix/17-appendix.md, operations/longhorn-node-maintenance.md

Net diff: +1415 / -4952 (mostly removing fictional content).

Highlights of what was wrong

  • Provider story: docs assumed a uniform OpenTofu-driven multi-cloud implementation. Reality: only AWS uses tofu. Hetzner uses the hetzner-k3s binary. local is a Kind stub via make localkind-up. existing is a no-op. GCP and Azure are stubs. Hetzner and existing weren't documented at all. ADR-0004 is now cross-referenced where relevant.
  • Package layout: docs referenced a root-level terraform/modules/{aws,gcp,azure,local,kubernetes,argocd,foundational-apps}/ tree that doesn't exist. AWS templates actually live in pkg/provider/aws/templates/. References to pkg/operator, pkg/tofu/executor.go, pkg/tofu/workspace.go, pkg/tofu/outputs.go, pkg/kubernetes/, api/v1alpha1/ were removed (none of those exist).
  • CLI commands: invented nic plan, nic status, nic state list/show/rm/mv, nic unlock, nic init-backend, nic health check, nic stack ..., nic init, nic marketplace removed. Real verbs documented: deploy, destroy, validate, kubeconfig, version.
  • Foundational software: the LGTM stack (Loki / Grafana / Tempo / Mimir) was presented as deployed. It isn't; only opentelemetry-collector ships today. The full app set under pkg/argocd/templates/apps/ is now documented correctly.
  • Config schema: the configuration reference documented a top-level provider: field with sibling amazon_web_services: / google_cloud_platform: / azure: / hetzner_cloud: / local: keys. The real schema is cluster.<provider-name>: with no top-level provider: field. Only the Hetzner section had been correct. The reference now matches pkg/config/config.go and the per-provider config packages, and adds the missing certificate:, git_repository:, and existing provider sections.
  • CRD name: NicApp / NebariApplication corrected to NebariApp throughout. The operator is reframed as an out-of-tree project at github.com/nebari-dev/nebari-operator; NIC just deploys it.
  • State management (05-state-management.md): DynamoDB-based locking replaced with the real S3 native lockfile (use_lockfile = true). Fake state_backend: config block removed. Fake DetectDrift code sample removed.
  • Testing strategy / milestones: invented nic health check subsystem removed. CI YAML corrected to match .github/workflows/ci.yml (Go 1.25.1, real test command, no fictional integration / scheduled jobs). Mocking libraries corrected (moto / fake-gcs-server / azurite -> LocalStack via docker-compose.test.yml). Milestones no longer mark GCP, Azure, Grafana dashboards, multi-cloud CI, or v1.0.0 as shipped; ADR-0004 referenced.

Test plan

  • go build ./... clean
  • make lint clean
  • make test clean (pre-commit hooks ran on commit)
  • Maintainer review: spot-check claims in each rewritten doc against current code

Notes

  • The full audit reports the rewrites are based on are in the conversation that produced this PR; happy to drop them in if useful.
  • Some items in appendix/14-open-questions.md and appendix/15-future-enhancements.md have shipped (e.g., git repo consumption via git_repository:, .env-based secrets). Those are flagged inline rather than removed wholesale, so the future-work appendix stays useful.

Most files under docs/design-doc/ had drifted substantially from the
codebase: invented CLI commands, wrong package layouts, fictional code
samples, wrong YAML config schemas, and a foundational software stack
(LGTM) presented as deployed when only the OpenTelemetry Collector
ships today.

This commit rewrites the heavily-drifted docs from scratch against
current code (verified against pkg/, cmd/nic/, examples/, Makefile,
.github/workflows/) and applies surgical fixes elsewhere.

Highlights:
- Acknowledge per-provider tool choice: AWS uses OpenTofu, Hetzner uses
  the hetzner-k3s binary, local uses Kind, existing is a no-op. The
  Provider interface is the contract.
- Cross-reference ADR-0004 (out-of-tree provider plugins) where
  relevant.
- Fix the config schema reference to match the real
  cluster.<provider>: / dns.<provider>: discriminator pattern. Remove
  fictional top-level provider:, version:, kubernetes:, tls:,
  foundational_software:, images:, features: blocks.
- Document Hetzner and existing providers (previously missing).
- Mark GCP/Azure providers as stubs (not deployable today).
- Replace fictional CLI commands (nic plan / status / state / unlock /
  init / stack / marketplace / health) with the real surface (deploy,
  destroy, validate, kubeconfig, version).
- Replace DynamoDB-locked S3 backend with the real native lockfile
  configuration (use_lockfile = true).
- Reframe nebari-operator as out-of-tree (lives at
  github.com/nebari-dev/nebari-operator); NIC just deploys it. Correct
  CRD name throughout (NebariApp, not NebariApplication / NicApp).
- Realign the testing strategy and milestones with what CI actually
  runs and what is actually shipped vs planned.

Closes #300
@dcmcand

dcmcand commented May 13, 2026

Copy link
Copy Markdown
Contributor Author

@khuyentran1401 I updated the docs here, that should hopefully help you out

Reverse-direction review surfaced six places where the rewritten docs
still misrepresented the code:

- nebari-operator kustomization: doc claimed patches for ingress
  hostname / KeycloakBasePath / HTTPSPort. The actual deployment patch
  only sets Keycloak integration env vars and the TLS cluster-issuer
  name; HTTPSPort is consumed by gateway templates, not by the
  operator. Rewrote 11.2, expanded 11.4 to the seven values actually
  rendered (with a Source column distinguishing InfraSettings fields
  from NIC-internal defaults), and updated 11.5 + 04-key-decisions
  4.6 to match.
- pkg/git.Config snippet: ArgocdAuth AuthConfig -> ArgoCDAuth
  *AuthConfig (pointer, capitalized CD); auth tag is not omitempty in
  real code.
- pkg/provider/aws layout: AWSConfig -> aws.Config; lbc.go ->
  aws_load_balancer_controller.go; expanded the file list to include
  efs.go, kubeconfig.go, cleanup.go, k8s.go, version.go for fidelity.
- GCP / Azure stubs: doc claimed methods return "not yet implemented";
  they actually emit a "(stub)" status update and return nil. Fixed in
  both 02-system-overview and 06-opentofu-module-architecture.
- 08-terraform-exec-integration 8.5: AWS Deploy sketch header pointed
  at tofu.go; the real Deploy lives in provider.go. Updated header and
  called out the conditionals the snippet omits.
Comment on lines +39 to +51
apiVersion: nebari.dev/v1
kind: NebariApp
metadata:
name: jupyterhub-metrics
name: jupyter-hub
namespace: jupyter
labels:
app: jupyterhub
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
selector:
app: jupyterhub
ports:
- name: metrics
port: 9090
targetPort: 9090
```

5. **Grafana Dashboard ConfigMap:**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: jupyterhub-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
jupyterhub.json: |
{
"dashboard": {
"title": "JupyterHub Overview",
"panels": [ ... ]
}
}
```

6. **Status Update:**
```yaml
status:
phase: Ready
url: https://jupyter.nebari.example.com
keycloakClientID: jupyterhub-jupyter
conditions:
- type: RoutingConfigured
status: "True"
lastTransitionTime: "2025-01-30T12:00:00Z"
- type: AuthenticationConfigured
status: "True"
lastTransitionTime: "2025-01-30T12:01:00Z"
- type: ObservabilityConfigured
status: "True"
lastTransitionTime: "2025-01-30T12:02:00Z"
- type: Ready
status: "True"
lastTransitionTime: "2025-01-30T12:02:00Z"
reason: AllComponentsReady
message: "Application is fully configured and accessible"
hostname: jupyter.example.com
routing:
routes:
- path: /
backend:
name: jupyterhub
port: 8000

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few fields in this example didn't match the upstream types when I cross-checked:

  • apiVersion is reconcilers.nebari.dev/v1 (from groupversion_info.go), not nebari.dev/v1.
  • routing.routes[].path is pathPrefix (see nebariapp_types.go RouteMatch).
  • RouteMatch has no backend; service sits at the top of the spec.

The rest of the example matches.

@marcelovilla marcelovilla left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @dcmcand! I left some comments

- `pkg/git` clones, commits, and pushes the GitOps repo (including `file://` local paths).
- `pkg/helm` is a thin wrapper around `helm.sh/helm/v3/pkg/action` used by `pkg/argocd`.
- `pkg/status` is the in-process status channel used to surface user-visible progress from library code without violating the "no `slog` in `pkg/`" rule.
- `pkg/telemetry` wires up the OpenTelemetry tracer provider, with exporters selected via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exporter type is none by default, not console:

exporterType := os.Getenv("OTEL_EXPORTER")
if exporterType == "" {
exporterType = "none"
}

**Implementation:**
- All new functions in `pkg/` are wrapped in OpenTelemetry trace spans, with the documented exemptions in [`CLAUDE.md`](../../../CLAUDE.md) (e.g., per-line writers in `pkg/status` and byte/line helpers in `pkg/tofu`).
- Library code never calls `slog`. User-visible progress goes through the status channel; `cmd/nic/status_handler.go` is the only translator into structured logs.
- Exporters are configurable via `OTEL_EXPORTER` (`console` default, `otlp`, `both`, `none`) and `OTEL_ENDPOINT`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exporter type is none by default, not console:

exporterType := os.Getenv("OTEL_EXPORTER")
if exporterType == "" {
exporterType = "none"
}

| `GIT_SSH_PRIVATE_KEY` (or whatever you point `git_repository.auth.ssh_key_env` at) | `pkg/git` | SSH private key in PEM form |
| `GIT_TOKEN` (or whatever you point `git_repository.auth.token_env` at) | `pkg/git` | Personal access token for HTTPS git URLs |
| `KUBECONFIG` | `existing` provider, `nic kubeconfig` | Kubeconfig path (used when `cluster.existing.kubeconfig` is empty) |
| `OTEL_EXPORTER` | `pkg/telemetry` | `console` (default), `otlp`, `both`, `none` |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exporter type is none by default, not console:

exporterType := os.Getenv("OTEL_EXPORTER")
if exporterType == "" {
exporterType = "none"
}

<repo>/<path>/
├── root.yaml # App-of-apps root
├── nic-config.yaml # Scrubbed copy of nebari-config.yaml
├── .nic-bootstrapped # Marker file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is .bootstrapped now:

bootstrapMarkerFile = ".bootstrapped"

Comment thread docs/design-doc/nic-summary.md Outdated
## What is Nebari Infrastructure Core?

Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC uses a Go CLI powered by OpenTofu/Terraform modules to provision Kubernetes clusters across AWS, GCP, Azure, or bare metal, ensuring consistent infrastructure regardless of hosting environment. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.
Nebari Infrastructure Core (NIC) is a next-generation platform that rethinks how organizations deploy and manage data science infrastructure. Unlike Nebari Classic, which delivers a monolithic, opinionated data science stack (JupyterHub, Dask, conda-store, etc.) as a single deployable unit, NIC takes a fundamentally different approach: it provides a **stable, composable foundation** upon which various workloads can be built and deployed independently. NIC is a Go CLI that provisions Kubernetes clusters and bootstraps a foundational software stack via GitOps. Each cluster provider chooses the right backing tool for its environment - OpenTofu for AWS (EKS), the `hetzner-k3s` binary for Hetzner, Kind for local development, and an `existing` adapter for pre-provisioned clusters - while GCP and Azure are stubbed and not yet implemented. The result is a platform that separates infrastructure concerns from application concerns, enabling teams to evolve each layer independently.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use the next-generation term here? I think we discussed it's not the most descriptive one.

Comment thread docs/design-doc/nic-summary.md Outdated
## Advantages Over Nebari Classic

The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything togethermeaning you get the full data science stack whether you need it all or notNIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NicApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.
The key advantage of NIC is its **composable architecture through Software Packs**. Where Nebari Classic bundles everything together (meaning you get the full data science stack whether you need it all or not), NIC lets you choose exactly what you need. Software Packs are curated collections of open-source tools packaged as ArgoCD applications with a `NebariApp` Custom Resource that enables automatic registration with the platform. Want just JupyterHub and conda-store? Install the Data Science Pack. Need model serving capabilities? Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway). This modular approach means faster deployments, smaller attack surfaces, easier upgrades, and the flexibility to mix-and-match capabilities. Additionally, all services automatically integrate with centralized authentication (Keycloak), routing (Envoy Gateway), and TLS certificates (cert-manager) through the Nebari Operator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the Model Serving Pack (MLflow, KServe, Envoy AI Gateway).

Should we update this to reflect the fact that the model serving pack uses llm-d instead? As a user it'd feel misleading reading this and seeing that there's no pack providing kserve.

# node_selector: { workload: storage }
```

Fields not in `aws.NodeGroup`: `single_subnet`, per-node-group `permissions_boundary`. If you see them in older docs, they are not real.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need notes like these? I feel if our documentation is the source of truth, these are not really needed.

Comment thread docs/design-doc/appendix/17-appendix.md Outdated
- ⏳ LGTM observability backend deployed by NIC
- ⏳ Documented upgrade paths between releases
- ⏳ End-to-end test coverage across providers
- ⏳ AWS cluster deploy under 20 minutes from a fresh account

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a goal? I feel there's not a lot we can do to reduce the deployment time to under 20 minutes

- OTEL_EXPORTER default: console -> none (pkg/telemetry/telemetry.go:26)
- Bootstrap marker: .nic-bootstrapped -> .bootstrapped (pkg/git/client_impl.go:22)
- NebariApp YAML: apiVersion reconcilers.nebari.dev/v1, service block at
  spec top, routes use pathPrefix (per upstream nebariapp_types.go)
- Drop "next-generation" qualifier in nic-summary
- Model Serving Pack lists llm-d, not MLflow/KServe/Envoy AI
- Remove aws.NodeGroup negative-space note (docs are the source of truth)
- Drop "under 20 minutes" deploy goal from success criteria
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Audit and rewrite design docs to match current code

3 participants