feat(longhorn): expose UI through Envoy Gateway with Keycloak OIDC#328
feat(longhorn): expose UI through Envoy Gateway with Keycloak OIDC#328tylerpotts wants to merge 37 commits into
Conversation
…Gateway Adds the design doc for exposing the Longhorn UI at longhorn.<domain> through the existing nebari-gateway, gated by Keycloak OIDC enforced via an Envoy Gateway SecurityPolicy. Mirrors the existing ArgoCD SSO design layer-for-layer; route ships only when both Longhorn and Keycloak are enabled.
Step-by-step TDD plan covering provider InfraSettings changes, foundational secret provisioning, three new manifest templates (HTTPRoute, SecurityPolicy, realm-setup additions), and the gateway certificate dnsName update.
Wire each provider to set InfraSettings.LonghornEnabled correctly: AWS and Hetzner default to true and track their Config.LonghornEnabled() method; Azure, GCP, and local are set to false (not yet wired). Add table-driven TDD tests for AWS and Hetzner covering default-enabled and explicit-disabled paths.
Keycloak is mandatory infrastructure (always provisioned during foundational install), so the spec's Yes/No Keycloak branches were unreachable by design. The implementation has always gated on LonghornEnabled alone; this brings the spec and source comments into alignment with that reality.
Extends the existing OIDC-gated Longhorn UI exposure (PR #328) to require membership in the `longhorn-admins` Keycloak group. Adds `oidc.forwardAccessToken`, a `jwt` provider, and an `authorization` block with Deny-by-default + a single Allow rule on the `groups` claim to the existing longhorn-securitypolicy template. No new resources, no Go changes.
Single-task TDD plan: extend longhorn-securitypolicy.yaml with forwardAccessToken + jwt provider + authorization rule, and extend the existing positive sub-test to pin all new fields. Plus a PR-#328 description update task to drop the group-authz follow-up and add manual acceptance steps.
CI golangci-lint v2.12.2 flagged three goconst violations: - "app.kubernetes.io/part-of" (4 occurrences in foundational.go) - "app.kubernetes.io/managed-by" (4 occurrences in foundational.go) - "us-east-1" (counted across test+non-test in aws/) Extract three constants in foundational.go (LabelKeyPartOf, LabelKeyManagedBy, LabelValueManagedBy) and one in state.go (awsRegionUSEast1). No behavior change.
…0, not 8080
The longhorn-frontend Service exposes port 80 (mapping to container port 8000
via the named target port "http"). The HTTPRoute's backendRefs[].port refers
to the Service port. Live cluster reproduction:
status:
conditions:
- message: 'Failed to process route rule 0 backendRef 0:
TCP Port 8080 not found on Service longhorn-system/longhorn-frontend.'
reason: PortNotFound
type: ResolvedRefs
The realm-setup-job previously did `kcadm create ... || echo "may already exist"` for the group-membership mapper. When Keycloak's default groups client scope already exists (with full.path=true), the create fails and the error is swallowed -- so the live mapper retains full.path=true, which makes the JWT 'groups' claim a list of "/longhorn-admins" rather than "longhorn-admins". The longhorn SecurityPolicy then denies access because its rule looks for the bare name. Fix: look up the existing mapper ID and update its config in place, falling back to create when no mapper exists yet. Same fix applies to the ArgoCD group-claim path (the bug was previously masked because no SecurityPolicy was reading the claim). Live cluster reproduction: - Adding a user to longhorn-admins, logging in, hitting longhorn URL yields "rbac: access denied" from the Envoy SecurityPolicy filter. - Decoding the access token shows groups: ["/longhorn-admins", ...].
Longhorn has no read-only UI mode, so the viewers group was always a no-op. Keeping it around is just confusing -- it suggests a role tier that doesn't exist. Removed from: - realm-setup-job: drop the `kcadm create groups ... name=longhorn-viewers` line. - writer_test: drop the positive assertion; add a negative assertion that the rendered job does NOT reference longhorn-viewers, so a future accidental re-add fails CI. - spec + plan docs: drop references and the "no-op forward compatibility" language. Existing live clusters will keep the group until it's deleted manually; the realm-setup script is purely additive (creates if missing) and never deletes groups.
…tic config The previous approach (kcadm update -s 'config."dotted.key"=value') was non-deterministic on fresh deploys -- live clusters ended up with the default full.path=true config, which made the JWT 'groups' claim look like "/longhorn-admins" and caused Envoy SecurityPolicy rules matching "longhorn-admins" (bare) to deny all requests. Switch to DELETE-then-CREATE: delete Keycloak's auto-created default mapper if present, then create our own with full.path=false. This sets the entire config JSON blob in one atomic shot, removing the kcadm dotted-key escaping ambiguity. Adds a post-create verification that logs ERROR to job logs if the mapper still has full.path != false, so future regressions are visible in `kubectl logs job/keycloak-realm-setup`.
…wns the mapper
Root cause uncovered after several failed iterations:
The nebari-operator's data-science-pack RBAC bootstrap Job
(keycloak_rbac_bootstrap.py, runs in the keycloak namespace as a PostSync
hook of the data-science-pack ArgoCD app) actively reconciles the
realm-shared 'groups' client-scope mapper to full.path=true on every
sync. The operator's script explicitly comments:
"An operator-managed scope may pre-create the mapper with
full.path: false ... [but the operator] reconciles the mapper config
to the chart's desired state on every run."
So no value our realm-setup-job writes to the mapper's full.path field
sticks -- the operator overwrites it within a minute.
Fix: stop fighting the operator. Match the groups claim by its path
form ("/longhorn-admins") in the Envoy SecurityPolicy authorization
rule. This is a one-line change in the manifest and is consistent with
how the JWT actually looks on the wire.
Also reverts the realm-setup-job mapper logic to a best-effort
fire-and-forget create (does nothing if the mapper already exists),
since trying to manage its config is futile.
Test asserts the new path form ("- /longhorn-admins") explicitly so the
intent is preserved across edits.
|
I am looking at this @tylerpotts, thanks |
The merge of main (PR #266, which extracted CLI orchestration into pkg/nic) left cmd/nic/deploy.go with a dangling copy of the old fat orchestration that referenced symbols no longer in the package, breaking the build. It also dropped the Longhorn OIDC client-secret logic this branch adds. Remove the stale block from cmd/nic/deploy.go and re-apply the Longhorn secret generation + foundationalCfg wiring in pkg/nic/deploy.go.
…teway # Conflicts: # pkg/provider/azure/provider.go
dcmcand
left a comment
There was a problem hiding this comment.
Thanks for this - the architecture is sound and the enabled path is solid, well-tested work. Requesting changes on one Blocker around the disabled/teardown path.
Highlights
- Provider consistency is excellent. AWS/Hetzner track
Config.LonghornEnabled()(default-on); Azure/GCP/local hardcodefalsewith clearnot yet wiredcomments. The default-then-refine pattern matches the existingStorageClasshandling exactly, andLonghornEnabledflows throughprovider.InfraSettings->TemplateDatawith no provider-name switches anywhere. Textbook use of the abstraction boundary. - Dual-namespace secret handling is clean (
createLonghornSecrets): writes the OIDC client secret to bothkeycloak(for realm-setup-job) andlonghorn-system(for the SecurityPolicy), with a comment explaining why, and no-ops on empty. generateSecurePasswordis reused, not reinvented - cryptorand.Reader, generated only whenLonghornEnabled.- Strong test coverage: both enabled and disabled branches asserted for the HTTPRoute, SecurityPolicy, realm-setup-job, and gateway cert.
- The group-claim path-form comment documents a real cross-repo contract with nebari-operator instead of silently working around it - good defensive documentation.
- No shell-injection risk in realm-setup-job:
$LONGHORN_CLIENT_SECRETcomes from asecretKeyRefand is quoted.
Blocker
The securitypolicies Application is created unconditionally, but its only manifest is conditionally empty - this diverges from the MetalLB precedent and breaks the Longhorn-disable path.
apps/securitypolicies.yaml has no LonghornEnabled gate and uses allowEmpty: false, while manifests/networking/policies/ contains only longhorn-securitypolicy.yaml, which renders empty when Longhorn is disabled.
The established pattern for "a provider doesn't need this" is MetalLB: writer.go (isMetalLBPath) skips writing both apps/metallb*.yaml and manifests/metallb*, so no Application is ever created. This PR instead leaves the securitypolicies Application pointing at a directory that yields zero resources whenever Longhorn is off (Azure, GCP, local - the primary dev/test path - and AWS/Hetzner with longhorn_enabled: false).
Concrete failure mode: with allowEmpty: false, when a cluster goes from Longhorn-enabled to disabled (config toggle + redeploy), ArgoCD refuses to prune the SecurityPolicy to zero, leaving an orphaned policy and an OutOfSync app that selfHeal cannot reconcile.
Suggested fix: follow the MetalLB precedent - generalize the skip mechanism (e.g. an isLonghornPath check or a small skip-list) so that when !settings.LonghornEnabled the writer skips both apps/securitypolicies.yaml and manifests/networking/policies/longhorn-securitypolicy.yaml. That removes the empty committed file and the orphan-on-disable problem in one move.
One thing I could not verify without a live ArgoCD: the exact sync status of a fresh non-Longhorn deploy (target=0, live=0) - it may show Synced/Healthy harmlessly rather than erroring. Can you confirm what you observed on a local/Azure deploy? Even if the fresh case is benign, the toggle-off orphan case and the divergence from the deliberate MetalLB skip pattern still stand.
Question
deploy.go gating vs Keycloak: longhornClientSecret is generated on infraSettings.LonghornEnabled alone, but createLonghornSecrets only runs inside if foundationalCfg.Keycloak.Enabled. The writer comment says Keycloak is mandatory infrastructure, so LonghornEnabled && !Keycloak.Enabled shouldn't occur - is that invariant guaranteed, or relying on deploy.go always setting Keycloak.Enabled: true? If the latter, a one-line comment at the generation site would prevent a future dangling secretKeyRef.
…ghorn disabled The securitypolicies Application targets manifests/networking/policies, whose only manifest is the Longhorn SecurityPolicy. When Longhorn was disabled, the manifest rendered empty but the Application was still written, producing an app with zero resources (rejected by allowEmpty: false) and leaving the SecurityPolicy un-prunable on an enabled->disabled toggle. Mirror the MetalLB skip pattern: isLonghornOnlyPath skips writing both apps/securitypolicies.yaml and the policies manifests when LonghornEnabled is false, and the now-redundant template guard is removed from longhorn-securitypolicy.yaml.
…teway # Conflicts: # pkg/argocd/writer_test.go
…teway # Conflicts: # pkg/argocd/foundational.go # pkg/argocd/writer.go # pkg/providers/cluster/azure/provider.go # pkg/providers/cluster/gcp/provider.go # pkg/providers/cluster/hetzner/provider.go # pkg/providers/cluster/local/provider.go
Replace the package-level awsRegionUSEast1 constant with an inline literal plus explanatory comment at its single usage site, addressing review feedback that a constant adds indirection for a one-off value.
Summary
Exposes the Longhorn UI at
https://longhorn.<domain>through the existing Envoy Gateway, gated by Keycloak OIDC via an Envoy GatewaySecurityPolicy, and restricted to members of thelonghorn-adminsKeycloak group. Routes / policies / cert SAN / Keycloak client are only provisioned when the provider has Longhorn enabled (AWS / Hetzner default-on; Azure / GCP / Local off).provider.InfraSettings.LonghornEnabledpopulated from existing per-providerLonghornEnabled()config methods. Threaded throughTemplateDatato gate manifest rendering.longhorn-oidc-client-secretinto bothkeycloak(consumed by realm-setup-job to register the OIDC client) andlonghorn-system(consumed by the SecurityPolicy). NoReferenceGrantneeded.LonghornEnabled):manifests/networking/routes/longhorn-httproute.yaml— HTTPRoutelonghorninlonghorn-system→longhorn-frontend:8080, attaches tonebari-gateway/httpsmanifests/networking/policies/longhorn-securitypolicy.yaml—SecurityPolicy longhorn-oidcenforcing Keycloak OIDC (clientIDlonghorn, redirecthttps://longhorn.<domain>/oauth2/callback), plus a JWT provider and anauthorizationrule that Allows only when thegroupsclaim containslonghorn-admins(everything else 403)Application securitypoliciesso the newpolicies/directory is reconciledgateway-certificate.yamladdslonghorn.<domain>todnsNameswhen enabled.realm-setup-job.yamlregisters thelonghornOIDC client, createslonghorn-admins/longhorn-viewersgroups, attachesgroupsscope, and adds the realm admin tolonghorn-admins.Authorization gates the UI to members of the
longhorn-adminsKeycloak group. Non-members get HTTP 403 from the gateway. Thelonghorn-viewersgroup is created but currently a no-op (Longhorn has no read-only mode).Test Plan
Unit tests added (all passing under
make check+go test -race ./...):TestProvider_InfraSettings_LonghornEnabledon AWS and Hetzner (default + explicit-disabled)TestCreateLonghornSecrets— dual-namespace Secret creation; no-op whenClientSecretemptyTestCreateKeycloakAndLonghornSecrets_Together— coexistence with the existing ArgoCD secretTestWriteAllToGit_LonghornHTTPRoute— positive + negative renderTestWriteAllToGit_LonghornSecurityPolicy— positive + negative render; positive asserts forwardAccessToken, jwt provider, authorization Deny-by-default + Allow-longhorn-admins ruleTestWriteAllToGit_GatewayCertIncludesLonghorn— SAN added/omittedTestWriteAllToGit_RealmSetupRegistersLonghornClient— kcadm block + env var renders/omitsOperator-driven acceptance (NOT automated; run after deploy):
domain: <real-domain>+ Longhorn enabled (default for AWS):./nic deploy --config ...kubectl -n envoy-gateway-system get certificate nebari-gateway-cert -o jsonpath='{.spec.dnsNames}'includeslonghorn.<domain>kubectl -n longhorn-system get httproute longhornshowsAccepted=Truekubectl -n longhorn-system get securitypolicy longhorn-oidcshowsAccepted=Truekubectl -n keycloak logs job/keycloak-realm-setup | grep -i longhornshows the new client + groups createdlonghorn-admins) → land on the Longhorn UIlonghorn-admins, sign in → expect HTTP 403 from the gatewaylonghorn-adminsvia the Keycloak UI, sign out and back in → land on the Longhorn UIaws.longhorn.enabled: false, redeploy → HTTPRoute / SecurityPolicy / Keycloak client / cert SAN all absentFollow-ups (out of scope)