Skip to content

feat(longhorn): expose UI through Envoy Gateway with Keycloak OIDC#328

Open
tylerpotts wants to merge 37 commits into
mainfrom
tpotts/longhorn-ui-gateway
Open

feat(longhorn): expose UI through Envoy Gateway with Keycloak OIDC#328
tylerpotts wants to merge 37 commits into
mainfrom
tpotts/longhorn-ui-gateway

Conversation

@tylerpotts

@tylerpotts tylerpotts commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Exposes the Longhorn UI at https://longhorn.<domain> through the existing Envoy Gateway, gated by Keycloak OIDC via an Envoy Gateway SecurityPolicy, and restricted to members of the longhorn-admins Keycloak group. Routes / policies / cert SAN / Keycloak client are only provisioned when the provider has Longhorn enabled (AWS / Hetzner default-on; Azure / GCP / Local off).

  • Per-provider gate: new provider.InfraSettings.LonghornEnabled populated from existing per-provider LonghornEnabled() config methods. Threaded through TemplateData to gate manifest rendering.
  • Dual-namespace OIDC secret: foundational install writes longhorn-oidc-client-secret into both keycloak (consumed by realm-setup-job to register the OIDC client) and longhorn-system (consumed by the SecurityPolicy). No ReferenceGrant needed.
  • New manifests (all conditionally rendered on LonghornEnabled):
    • manifests/networking/routes/longhorn-httproute.yaml — HTTPRoute longhorn in longhorn-systemlonghorn-frontend:8080, attaches to nebari-gateway/https
    • manifests/networking/policies/longhorn-securitypolicy.yamlSecurityPolicy longhorn-oidc enforcing Keycloak OIDC (clientID longhorn, redirect https://longhorn.<domain>/oauth2/callback), plus a JWT provider and an authorization rule that Allows only when the groups claim contains longhorn-admins (everything else 403)
    • New ArgoCD Application securitypolicies so the new policies/ directory is reconciled
  • Cert SAN: gateway-certificate.yaml adds longhorn.<domain> to dnsNames when enabled.
  • Keycloak realm setup: realm-setup-job.yaml registers the longhorn OIDC client, creates longhorn-admins / longhorn-viewers groups, attaches groups scope, and adds the realm admin to longhorn-admins.

Authorization gates the UI to members of the longhorn-admins Keycloak group. Non-members get HTTP 403 from the gateway. The longhorn-viewers group is created but currently a no-op (Longhorn has no read-only mode).

Test Plan

Unit tests added (all passing under make check + go test -race ./...):

  • TestProvider_InfraSettings_LonghornEnabled on AWS and Hetzner (default + explicit-disabled)
  • TestCreateLonghornSecrets — dual-namespace Secret creation; no-op when ClientSecret empty
  • TestCreateKeycloakAndLonghornSecrets_Together — coexistence with the existing ArgoCD secret
  • TestWriteAllToGit_LonghornHTTPRoute — positive + negative render
  • TestWriteAllToGit_LonghornSecurityPolicy — positive + negative render; positive asserts forwardAccessToken, jwt provider, authorization Deny-by-default + Allow-longhorn-admins rule
  • TestWriteAllToGit_GatewayCertIncludesLonghorn — SAN added/omitted
  • TestWriteAllToGit_RealmSetupRegistersLonghornClient — kcadm block + env var renders/omits

Operator-driven acceptance (NOT automated; run after deploy):

  • Deploy with domain: <real-domain> + Longhorn enabled (default for AWS): ./nic deploy --config ...
  • kubectl -n envoy-gateway-system get certificate nebari-gateway-cert -o jsonpath='{.spec.dnsNames}' includes longhorn.<domain>
  • kubectl -n longhorn-system get httproute longhorn shows Accepted=True
  • kubectl -n longhorn-system get securitypolicy longhorn-oidc shows Accepted=True
  • kubectl -n keycloak logs job/keycloak-realm-setup | grep -i longhorn shows the new client + groups created
  • Sign in as the realm admin user (already in longhorn-admins) → land on the Longhorn UI
  • Create a fresh Keycloak user, do NOT add to longhorn-admins, sign in → expect HTTP 403 from the gateway
  • Add that user to longhorn-admins via the Keycloak UI, sign out and back in → land on the Longhorn UI
  • Set aws.longhorn.enabled: false, redeploy → HTTPRoute / SecurityPolicy / Keycloak client / cert SAN all absent

Follow-ups (out of scope)

  • Client-secret rotation on redeploy (same pre-existing divergence risk as today's ArgoCD client)

tylerpotts added 19 commits May 22, 2026 12:17
…Gateway

Adds the design doc for exposing the Longhorn UI at longhorn.<domain>
through the existing nebari-gateway, gated by Keycloak OIDC enforced
via an Envoy Gateway SecurityPolicy. Mirrors the existing ArgoCD SSO
design layer-for-layer; route ships only when both Longhorn and
Keycloak are enabled.
Step-by-step TDD plan covering provider InfraSettings changes,
foundational secret provisioning, three new manifest templates
(HTTPRoute, SecurityPolicy, realm-setup additions), and the
gateway certificate dnsName update.
Wire each provider to set InfraSettings.LonghornEnabled correctly:
AWS and Hetzner default to true and track their Config.LonghornEnabled()
method; Azure, GCP, and local are set to false (not yet wired).
Add table-driven TDD tests for AWS and Hetzner covering default-enabled
and explicit-disabled paths.
Keycloak is mandatory infrastructure (always provisioned during foundational
install), so the spec's Yes/No Keycloak branches were unreachable by design.
The implementation has always gated on LonghornEnabled alone; this brings
the spec and source comments into alignment with that reality.
Extends the existing OIDC-gated Longhorn UI exposure (PR #328) to require
membership in the `longhorn-admins` Keycloak group. Adds `oidc.forwardAccessToken`,
a `jwt` provider, and an `authorization` block with Deny-by-default + a single
Allow rule on the `groups` claim to the existing longhorn-securitypolicy template.
No new resources, no Go changes.
Single-task TDD plan: extend longhorn-securitypolicy.yaml with
forwardAccessToken + jwt provider + authorization rule, and extend
the existing positive sub-test to pin all new fields. Plus a PR-#328
description update task to drop the group-authz follow-up and add
manual acceptance steps.
CI golangci-lint v2.12.2 flagged three goconst violations:
- "app.kubernetes.io/part-of" (4 occurrences in foundational.go)
- "app.kubernetes.io/managed-by" (4 occurrences in foundational.go)
- "us-east-1" (counted across test+non-test in aws/)

Extract three constants in foundational.go (LabelKeyPartOf,
LabelKeyManagedBy, LabelValueManagedBy) and one in state.go
(awsRegionUSEast1). No behavior change.
…0, not 8080

The longhorn-frontend Service exposes port 80 (mapping to container port 8000
via the named target port "http"). The HTTPRoute's backendRefs[].port refers
to the Service port. Live cluster reproduction:

  status:
    conditions:
    - message: 'Failed to process route rule 0 backendRef 0:
        TCP Port 8080 not found on Service longhorn-system/longhorn-frontend.'
      reason: PortNotFound
      type: ResolvedRefs
The realm-setup-job previously did `kcadm create ... || echo "may already
exist"` for the group-membership mapper. When Keycloak's default groups
client scope already exists (with full.path=true), the create fails and
the error is swallowed -- so the live mapper retains full.path=true, which
makes the JWT 'groups' claim a list of "/longhorn-admins" rather than
"longhorn-admins". The longhorn SecurityPolicy then denies access because
its rule looks for the bare name.

Fix: look up the existing mapper ID and update its config in place,
falling back to create when no mapper exists yet. Same fix applies to the
ArgoCD group-claim path (the bug was previously masked because no
SecurityPolicy was reading the claim).

Live cluster reproduction:
- Adding a user to longhorn-admins, logging in, hitting longhorn URL
  yields "rbac: access denied" from the Envoy SecurityPolicy filter.
- Decoding the access token shows groups: ["/longhorn-admins", ...].
Longhorn has no read-only UI mode, so the viewers group was always a
no-op. Keeping it around is just confusing -- it suggests a role tier
that doesn't exist. Removed from:

- realm-setup-job: drop the `kcadm create groups ... name=longhorn-viewers` line.
- writer_test: drop the positive assertion; add a negative assertion that
  the rendered job does NOT reference longhorn-viewers, so a future
  accidental re-add fails CI.
- spec + plan docs: drop references and the "no-op forward compatibility"
  language.

Existing live clusters will keep the group until it's deleted manually;
the realm-setup script is purely additive (creates if missing) and never
deletes groups.
…tic config

The previous approach (kcadm update -s 'config."dotted.key"=value') was
non-deterministic on fresh deploys -- live clusters ended up with the
default full.path=true config, which made the JWT 'groups' claim look
like "/longhorn-admins" and caused Envoy SecurityPolicy rules matching
"longhorn-admins" (bare) to deny all requests.

Switch to DELETE-then-CREATE: delete Keycloak's auto-created default
mapper if present, then create our own with full.path=false. This sets
the entire config JSON blob in one atomic shot, removing the kcadm
dotted-key escaping ambiguity.

Adds a post-create verification that logs ERROR to job logs if the
mapper still has full.path != false, so future regressions are visible
in `kubectl logs job/keycloak-realm-setup`.
…wns the mapper

Root cause uncovered after several failed iterations:

The nebari-operator's data-science-pack RBAC bootstrap Job
(keycloak_rbac_bootstrap.py, runs in the keycloak namespace as a PostSync
hook of the data-science-pack ArgoCD app) actively reconciles the
realm-shared 'groups' client-scope mapper to full.path=true on every
sync. The operator's script explicitly comments:

  "An operator-managed scope may pre-create the mapper with
   full.path: false ... [but the operator] reconciles the mapper config
   to the chart's desired state on every run."

So no value our realm-setup-job writes to the mapper's full.path field
sticks -- the operator overwrites it within a minute.

Fix: stop fighting the operator. Match the groups claim by its path
form ("/longhorn-admins") in the Envoy SecurityPolicy authorization
rule. This is a one-line change in the manifest and is consistent with
how the JWT actually looks on the wire.

Also reverts the realm-setup-job mapper logic to a best-effort
fire-and-forget create (does nothing if the mapper already exists),
since trying to manage its config is futile.

Test asserts the new path form ("- /longhorn-admins") explicitly so the
intent is preserved across edits.
@viniciusdc

Copy link
Copy Markdown
Contributor

I am looking at this @tylerpotts, thanks

The merge of main (PR #266, which extracted CLI orchestration into pkg/nic)
left cmd/nic/deploy.go with a dangling copy of the old fat orchestration that
referenced symbols no longer in the package, breaking the build. It also
dropped the Longhorn OIDC client-secret logic this branch adds.

Remove the stale block from cmd/nic/deploy.go and re-apply the Longhorn secret
generation + foundationalCfg wiring in pkg/nic/deploy.go.
…teway

# Conflicts:
#	pkg/provider/azure/provider.go
@viniciusdc viniciusdc added the status: in review 👀 This PR is currently being reviewed by the team label Jun 1, 2026

@dcmcand dcmcand left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this - the architecture is sound and the enabled path is solid, well-tested work. Requesting changes on one Blocker around the disabled/teardown path.

Highlights

  • Provider consistency is excellent. AWS/Hetzner track Config.LonghornEnabled() (default-on); Azure/GCP/local hardcode false with clear not yet wired comments. The default-then-refine pattern matches the existing StorageClass handling exactly, and LonghornEnabled flows through provider.InfraSettings -> TemplateData with no provider-name switches anywhere. Textbook use of the abstraction boundary.
  • Dual-namespace secret handling is clean (createLonghornSecrets): writes the OIDC client secret to both keycloak (for realm-setup-job) and longhorn-system (for the SecurityPolicy), with a comment explaining why, and no-ops on empty.
  • generateSecurePassword is reused, not reinvented - crypto rand.Reader, generated only when LonghornEnabled.
  • Strong test coverage: both enabled and disabled branches asserted for the HTTPRoute, SecurityPolicy, realm-setup-job, and gateway cert.
  • The group-claim path-form comment documents a real cross-repo contract with nebari-operator instead of silently working around it - good defensive documentation.
  • No shell-injection risk in realm-setup-job: $LONGHORN_CLIENT_SECRET comes from a secretKeyRef and is quoted.

Blocker

The securitypolicies Application is created unconditionally, but its only manifest is conditionally empty - this diverges from the MetalLB precedent and breaks the Longhorn-disable path.

apps/securitypolicies.yaml has no LonghornEnabled gate and uses allowEmpty: false, while manifests/networking/policies/ contains only longhorn-securitypolicy.yaml, which renders empty when Longhorn is disabled.

The established pattern for "a provider doesn't need this" is MetalLB: writer.go (isMetalLBPath) skips writing both apps/metallb*.yaml and manifests/metallb*, so no Application is ever created. This PR instead leaves the securitypolicies Application pointing at a directory that yields zero resources whenever Longhorn is off (Azure, GCP, local - the primary dev/test path - and AWS/Hetzner with longhorn_enabled: false).

Concrete failure mode: with allowEmpty: false, when a cluster goes from Longhorn-enabled to disabled (config toggle + redeploy), ArgoCD refuses to prune the SecurityPolicy to zero, leaving an orphaned policy and an OutOfSync app that selfHeal cannot reconcile.

Suggested fix: follow the MetalLB precedent - generalize the skip mechanism (e.g. an isLonghornPath check or a small skip-list) so that when !settings.LonghornEnabled the writer skips both apps/securitypolicies.yaml and manifests/networking/policies/longhorn-securitypolicy.yaml. That removes the empty committed file and the orphan-on-disable problem in one move.

One thing I could not verify without a live ArgoCD: the exact sync status of a fresh non-Longhorn deploy (target=0, live=0) - it may show Synced/Healthy harmlessly rather than erroring. Can you confirm what you observed on a local/Azure deploy? Even if the fresh case is benign, the toggle-off orphan case and the divergence from the deliberate MetalLB skip pattern still stand.

Question

deploy.go gating vs Keycloak: longhornClientSecret is generated on infraSettings.LonghornEnabled alone, but createLonghornSecrets only runs inside if foundationalCfg.Keycloak.Enabled. The writer comment says Keycloak is mandatory infrastructure, so LonghornEnabled && !Keycloak.Enabled shouldn't occur - is that invariant guaranteed, or relying on deploy.go always setting Keycloak.Enabled: true? If the latter, a one-line comment at the generation site would prevent a future dangling secretKeyRef.

Comment thread pkg/argocd/templates/apps/securitypolicies.yaml
…ghorn disabled

The securitypolicies Application targets manifests/networking/policies,
whose only manifest is the Longhorn SecurityPolicy. When Longhorn was
disabled, the manifest rendered empty but the Application was still
written, producing an app with zero resources (rejected by
allowEmpty: false) and leaving the SecurityPolicy un-prunable on an
enabled->disabled toggle.

Mirror the MetalLB skip pattern: isLonghornOnlyPath skips writing both
apps/securitypolicies.yaml and the policies manifests when
LonghornEnabled is false, and the now-redundant template guard is
removed from longhorn-securitypolicy.yaml.
@tylerpotts tylerpotts requested a review from dcmcand June 9, 2026 22:04
…teway

# Conflicts:
#	pkg/argocd/writer_test.go
…teway

# Conflicts:
#	pkg/argocd/foundational.go
#	pkg/argocd/writer.go
#	pkg/providers/cluster/azure/provider.go
#	pkg/providers/cluster/gcp/provider.go
#	pkg/providers/cluster/hetzner/provider.go
#	pkg/providers/cluster/local/provider.go
Comment thread pkg/providers/cluster/aws/state.go Outdated
Replace the package-level awsRegionUSEast1 constant with an inline
literal plus explanatory comment at its single usage site, addressing
review feedback that a constant adds indirection for a one-off value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: in review 👀 This PR is currently being reviewed by the team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[SPIKE] - Backup Solution for Nebari Components

3 participants