Skip to content

Update local provider to deploy a kind cluster#400

Open
marcelovilla wants to merge 11 commits into
mainfrom
feature/deploy-kind-cluster
Open

Update local provider to deploy a kind cluster#400
marcelovilla wants to merge 11 commits into
mainfrom
feature/deploy-kind-cluster

Conversation

@marcelovilla

@marcelovilla marcelovilla commented Jun 17, 2026

Copy link
Copy Markdown
Member

Closes

Closes #399
Also closes #374 indirectly

Description

This PR updates the local provider to deploy a kind cluster on the user's behalf. Previously, the Deploy and Destroy methods were stubs and users had to deploy a cluster on their local machines and pass a kube context in the config file to deploy the foundational software and MetalLB on top of it.

The local provider does not accept a kubecontext in the config anymore, and MetalLB is not optional. Also, the storage class is not exposed anymore either. Users that want to bring their own cluster using other tools like k3d can still use the existing provider for that.

This PR also removes the localkind-up and localkind-down makefile targets as they're not relevant anymore.

How to Test

Deploy the following config file and verify everything works as expected.

# kind-local.yaml
project_name: nic-kind-test
domain: nebari.local
certificate:
  type: selfsigned

cluster:
  local:
    kind: {}
    node_selectors:
      general:
        kubernetes.io/os: linux
      user:
        kubernetes.io/os: linux
      worker:
        kubernetes.io/os: linux
./nic destroy -f kind-local.yaml

Then, destroy it:

./nic destroy -f kind-local.yaml

If you want to test a custom path for the local GitOps repo, you can also test this config:

project_name: nic-kind-custom
domain: nebari.local
certificate:
  type: selfsigned

cluster:
  local:
    kind:
      # A custom gitops path is not mounted automatically (only the default
      # /tmp/nebari-gitops-<project_name> is), so we must mount it ourselves.
      # host_path/container_path must match git_repository.url below exactly.
      extra_mounts:
        - host_path: /tmp/my-gitops-custom
          container_path: /tmp/my-gitops-custom
          read_only: true
    node_selectors:
      general:
        kubernetes.io/os: linux
      user:
        kubernetes.io/os: linux
      worker:
        kubernetes.io/os: linux

# bootstraps the repo into it, and ArgoCD reads it over the mount.
git_repository:
  url: "file:///tmp/my-gitops-custom"
  branch: main

@marcelovilla marcelovilla added status: blocked ⛔️ Another task is blocking this item and removed status: blocked ⛔️ Another task is blocking this item labels Jun 17, 2026
@marcelovilla marcelovilla force-pushed the feature/deploy-kind-cluster branch from a52c49b to 4e3bf93 Compare June 22, 2026 12:35

@dcmcand dcmcand left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work - this turns the local provider from a stub into something that actually stands up a cluster, and it closes out the macOS mount mismatch (#374) by making that whole class of bug impossible rather than just realigning the two paths. I reviewed it in a local worktree: it builds, go vet is clean, and the package tests pass. The existing provider and examples/existing-config.yaml that the docs now point people to both exist, so that redirect is solid.

Leaving this as a comment rather than an approval - there's one open question (the /16 network assumption) I'd like your read on before signing off. Everything else here is either a compliment or a non-blocking nit.

Issue resolution

  • #399: fully addressed - Deploy/Destroy create and delete a real kind cluster, kubeconfig comes straight from kind, and kube_context is gone from the config surface.
  • #374: addressed structurally. Since NIC now both creates and mounts the gitops directory off the same git.DefaultLocalPath(), the two-sources-of-truth disagreement that caused the macOS failure can't come back.

One thing I'd flag that isn't tied to a single line: none of the runtime code added here (Deploy/Destroy/createKindCluster/kindNodeAddressPool/GetKubeconfig) is exercised by any test after this merges - CI runs go test -race ./... only, and this PR also removes the make localkind-up/localkind-down targets that were the one manual way to drive it end to end. The pure logic (deriveAddressPool) is well covered, but the parts that touch the container runtime now have no automated or scripted exercise path. Not a blocker, but worth a conscious decision about whether something should fill that gap.

Comment thread pkg/git/config.go
// function so the deploy orchestrator and providers that mount the directory
// (e.g. the local kind provider) can derive the same path independently
// without threading it between them.
func DefaultLocalPath(projectName string) string {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this a pure shared function is the right call - it's exactly what fixes #374. The deploy orchestrator and the kind provider now derive the same path independently instead of one trusting the other to pass it, so the path can't drift between them. The doc comment explaining why it's pure is a nice touch.

Comment thread pkg/git/config.go
}

// For local file paths, validate directory exists but skip auth validation
// For local file paths, skip auth validation. A missing directory is valid because

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the lifecycle here: validation runs before the directory exists, and destroy runs after it's gone, so requiring it to exist would have been wrong in both directions. Allowing a missing dir (while still rejecting a non-dir) is the correct relaxation.

// No locking: Deploy (writer) and InfraSettings (reader) run sequentially
// on the same goroutine in the deploy flow, and the local provider is not
// used to deploy multiple clusters concurrently.
metalLBPool string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is excellent - it justifies the no-lock decision, the no-keying decision, and the empty-string fallback all in one place. This is the kind of context that survives a future refactor.

One thing it doesn't state, and arguably should: InfraSettings only sees a derived pool because Deploy runs first and writes it on the same instance. That ordering holds today (the registry hands back one shared provider, and the orchestrator calls Deploy before InfraSettings), but if anyone ever reorders those or constructs the provider per-call, this silently reverts to the static default with no error. A one-liner at InfraSettings noting "relies on Deploy having populated this" would harden it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in bb7a948

"testing"
)

func TestDeriveAddressPool(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good table here - the host-bits-masking case and the sub-/24 / IPv6 / non-CIDR error paths are all covered, which is most of where this kind of math goes wrong.

// kind creates its Docker network from Docker's default address pools,
// which are /16. ParseCIDR (in deriveAddressPool) masks the host bits, so
// "<nodeIP>/16" yields the network the pool should live in.
return deriveAddressPool(ipv4 + "/16")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: this hard-assumes the kind Docker network is a /16. If someone has Docker configured with smaller default-address-pools (e.g. /24, which shows up in IP-constrained corporate setups), the node IP gets masked to a /16 that's wider than the real subnet, the derived pool lands outside it, and MetalLB hands out IPs that aren't routable - silently, with no error, so the symptom is just "the gateway never gets a reachable address."

I don't have a clean fix that keeps your runtime-agnostic design: kind's nodes.Node.IP() only exposes the address, not the mask, and reading the real CIDR means inspecting the Docker/podman network directly, which is the coupling you deliberately avoided. So this is a genuine question rather than a suggestion - is the /16 assumption acceptable for the audience here, or is the silent-unroutability case worth guarding against? Happy to defer to your read on how likely a non-/16 pool is in practice.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone has Docker configured with smaller default-address-pools

I did not take this into consideration and I'm not sure how common it would be, but we already provide a way of overriding the address pool for MetalLB. Users (and by that I mean mostly developers as I don't expect a lot of people deploying locally) with a different network configuration in Docker can pass a specific range in their config.

})

for _, m := range kindCfg.ExtraMounts {
if err := os.MkdirAll(m.HostPath, 0o750); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: extra_mounts host paths get auto-created. That's reasonable since kind requires the path to exist, but it means a typo'd host_path silently produces an empty dir instead of erroring - a small surprise if someone expected to be mounting existing data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point and maybe it should not be nic's responsibility to create the directories if they don't exist but rather fail loudly. But we're relying on it creating the directory to mount the default or custom local gitops path for now. I would leave as is for the time being, until we find a better way of dealing with the local gitops path.


// newKindProvider builds a kind cluster provider backed by the detected container runtime
func newKindProvider() (*cluster.Provider, error) {
opt, err := cluster.DetectNodeProvider()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking, project-convention note: the helpers in this file don't take ctx or open trace spans, which the repo's instrumentation convention asks for on new functions. kind's API can't consume a ctx, so this won't change behavior - but the functions could still accept ctx and start spans for trace continuity. Worth a deliberate call rather than an accident.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Addressed in 47ab591

Comment thread pkg/providers/cluster/local/provider.go Outdated
return err
}
if exists {
status.Send(ctx, status.NewUpdate(status.LevelInfo, fmt.Sprintf("Kind cluster %s already exists, reusing it (extra_mounts changes require a recreate)", projectName)).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: the reuse message warns that extra_mounts changes need a recreate, but a changed node_image on an existing cluster is also silently ignored. Might be worth widening the wording so it doesn't read as "only mounts are sticky."

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f77c559

Comment thread pkg/providers/cluster/local/config.go Outdated
AdditionalFields map[string]any `yaml:",inline"`
}

// KindConfig holds optional confg for the deployed kind cluster. It may be

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny typo: "confg" -> "config".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 57889e9

Comment thread examples/local-config.yaml Outdated
# MetalLB configuration (default: enabled with 192.168.1.100-192.168.1.110 pool)
# Disable for k3s which ships with ServiceLB
# MetalLB provides the gateway's LoadBalancer IP. NIC derives the address pool
# from the kind Docker network automatically so the gateway IP is routable. override

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: "override" starts mid-sentence in lowercase here - reads a little oddly.

@marcelovilla marcelovilla Jun 22, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33b10fe

@marcelovilla marcelovilla requested a review from dcmcand June 22, 2026 21:08

@dcmcand dcmcand left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through all the feedback so thoroughly, Marcelo. This is a great piece of work. It turns the local provider from a stub into something that actually stands up a cluster, and it closes #374 by making that whole class of macOS mount bug impossible instead of just realigning the two paths.

I re-reviewed at HEAD 4b07707 in a local worktree: it builds, go vet and golangci-lint are clean, and the package tests pass. Every item from my first pass is addressed or consciously deferred with a clear reason:

  • InfraSettings now documents its dependency on Deploy running first
  • the kind helpers are instrumented with spans
  • the reuse message covers node_image as well as mounts
  • the 90s timeout has a comment explaining why it is separate from DeployOptions.Timeout
  • the confg typo and the example yaml wording are fixed
  • the /16 question is fairly answered by the existing address_pool override

Issue resolution

  • #399: fully resolved. Real cluster create/delete, kubeconfig comes straight from kind, and kube_context is gone from the config surface.
  • #374: resolved structurally. NIC now both creates and mounts the gitops dir off the same git.DefaultLocalPath(), so the two-sources-of-truth disagreement that broke macOS can't come back.

Kudos on the Provider.metalLBPool doc comment in particular. It justifies the no-lock, no-keying, and empty-fallback decisions in one place that will survive a refactor. deriveAddressPool is also nicely covered across the host-bit masking, sub-/24, IPv6, and non-CIDR paths.

One thing still worth a conscious call, not a blocker: the runtime path (Deploy/Destroy/createKindCluster/kindNodeAddressPool/GetKubeconfig) has no automated exercise after merge, and this PR removes the make localkind-* targets that were the one scripted way to drive it end to end. A build-tagged smoke test or a documented manual checklist would fill that gap down the line.

Approving. Thanks again for the careful iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants