[serve] Recover ingress-router pin-misses via the fallback proxy instead of 503 #64218

Draft
eicherseiji wants to merge 3 commits into ray-project:master from eicherseiji:llm-dp-direct-streaming-warmup

@eicherseiji eicherseiji commented Jun 18, 2026


Why are these changes needed?

When direct streaming is enabled, HAProxy asks the ingress request router to pin a replica, then routes to it by name through its statically reloaded server map. Right after an application becomes RUNNING there is a brief window where the router's in-process replica view runs ahead of HAProxy's config reload, so the router can name a replica HAProxy has not loaded yet. HAProxy rejected that with a 503 unknown_replica_id -- surfacing as flaky test_dp_direct_streaming failures in postmerge and as a real client-facing error during the post-deploy gap.

This PR routes requests that fail with the unknown_replica_id reason to the fallback Serve proxy instead of returning 503. The fallback proxy is always running in HAProxy mode and re-pins via the same consistent-hash router (same ring, same replica) over the Serve handle path, so it serves the request correctly and preserves session affinity.
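
To illustrate why re-pinning through the same ring keeps a session on the same replica, here is a minimal sketch of a consistent-hash ring. It is not Ray Serve's router: the HashRing class, the replica names, and the virtual-node count are all hypothetical, chosen only to show that two callers holding the same ring membership pick the same replica for a given session.

```python
import bisect
import hashlib
from typing import Iterable


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring (illustration only)."""

    def __init__(self, replicas: Iterable[str], vnodes: int = 64) -> None:
        # Each replica owns several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def pin(self, session_id: str) -> str:
        # First ring point clockwise from the session's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(session_id)) % len(self._keys)
        return self._points[idx][1]


# Two independent views of the same membership (assumed replica names).
direct_path = HashRing(["replica-a", "replica-b", "replica-c"])
fallback_path = HashRing(["replica-c", "replica-a", "replica-b"])
```

With these two rings, direct_path.pin("session-42") and fallback_path.pin("session-42") return the same replica name, because the result depends only on the set of replicas and the session key, not on which path performs the lookup.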

Scope: only unknown_replica_id is recovered. router_unreachable, router_non_200, and unparseable_replica_id mean the router itself is broken and still fail loud (503). The Lua keeps arming txn.ingress_request_router_failed, so serve_haproxy_ingress_router_failures{reason} still counts every pin-miss by reason.
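
The scope rule above can be modeled as a small decision function. This is a Python sketch of the behavior described in this PR, not the actual HAProxy configuration or Lua: route_request, RECOVERABLE_REASONS, and the router_failures counter are hypothetical names, and the server map is assumed to be a plain dict from replica name to address.

```python
from collections import Counter
from typing import Dict, Optional

# Only this reason is recovered via the fallback proxy (per the PR description).
RECOVERABLE_REASONS = frozenset({"unknown_replica_id"})

# Hypothetical stand-in for serve_haproxy_ingress_router_failures{reason}.
router_failures: Counter = Counter()


def route_request(
    pinned_replica: str,
    server_map: Dict[str, str],
    router_failure: Optional[str] = None,
) -> str:
    """Return a server address, "fallback_proxy", or "503"."""
    reason = router_failure
    if reason is None and pinned_replica not in server_map:
        # The router named a replica that HAProxy has not loaded yet.
        reason = "unknown_replica_id"
    if reason is None:
        return server_map[pinned_replica]
    # Every pin-miss is still counted by reason, recovered or not.
    router_failures[reason] += 1
    return "fallback_proxy" if reason in RECOVERABLE_REASONS else "503"
```

In this model, a pin on a replica missing from the map goes to the fallback proxy, while a router-side failure such as router_unreachable still yields 503; both increment the counter under their own reason.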

Adds test_pin_miss_falls_back_to_fallback_server (runs against real HAProxy): with the router pinning a replica absent from the server map, the request is served by the fallback (200) and the affinity-breaking primary backend is never selected.

The first commit (a test-side warmup) is reverted in this PR; the fallback fix replaces it.

Related issue number

Surfaced by postmerge https://buildkite.com/ray-project/postmerge/builds/18131

Checks

  • I've signed off every commit (DCO).
  • test_haproxy_api.py (incl. the new test) and test_haproxy_metrics.py pass locally against HAProxy 2.8; lint clean.

… tests

The direct-streaming session-affinity tests fired requests as soon as the
application reached RUNNING. RUNNING only means the controller sees every
replica running; HAProxy reloads its data-plane server map asynchronously
afterward, and the ingress request router learns the running replicas through
a separate long poll that runs ahead of that reload. In that window the router
pins a replica HAProxy has not loaded, and HAProxy returns 503
unknown_replica_id.

test_different_sessions_spread fans requests across all replicas, so it hit
the gap and failed in postmerge; test_session_affinity pins a single replica
and rarely did.

run_app_through_haproxy now warms up until a request has reached every
data-plane replica before returning the URL, closing the race for all three
direct-streaming test files (DP, PD, and router).

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
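
For reference, the warmup described in this commit message (later reverted in this PR) can be sketched roughly as follows. This is not the code from the commit: warm_up_data_plane and its probe callable are hypothetical, and probe is assumed to send one request for the given session and return the serving replica's ID on a 200 response, or None otherwise.

```python
import itertools
import time
from typing import Callable, Optional, Set


def warm_up_data_plane(
    probe: Callable[[str], Optional[str]],
    expected_replicas: Set[str],
    timeout_s: float = 30.0,
    interval_s: float = 0.1,
) -> Set[str]:
    """Probe with fresh session IDs until every expected replica has answered."""
    reached: Set[str] = set()
    sessions = itertools.count()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # A new session ID per probe lets the requests spread across replicas.
        replica_id = probe(f"warmup-session-{next(sessions)}")
        if replica_id is not None:
            reached.add(replica_id)
        if expected_replicas <= reached:
            return reached
        time.sleep(interval_s)
    missing = sorted(expected_replicas - reached)
    raise TimeoutError(f"warmup never reached replicas: {missing}")
```

A caller would pass the set of data-plane replica IDs it expects and only hand out the URL once the function returns; a probe that never succeeds ends in TimeoutError rather than looping forever.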

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request improves the reliability of direct-streaming session-affinity tests by ensuring HAProxy is fully warmed up and routing to all data-plane replicas before proceeding. Feedback suggests wrapping the warmup HTTP requests in a try-except block to handle transient connection or timeout errors during HAProxy startup or reload, preventing premature failures.


Comment on lines +80 to +82
resp = _chat_request(base_url, f"warmup-session-{next(sessions)}")
if resp.status_code == 200:
    reached.add(resp.headers["x-replica-id"])


medium

During HAProxy startup or reload, _chat_request might raise transient connection or timeout errors (e.g., httpx.ConnectError or httpx.HTTPError). If an exception is raised, it will abort the loop of probes immediately, which can cause the wait_for_condition check to fail or run much slower. Wrapping the request in a try-except block to catch httpx.HTTPError and safely checking for the presence of the x-replica-id header makes the warmup process significantly more robust.

Suggested change
- resp = _chat_request(base_url, f"warmup-session-{next(sessions)}")
- if resp.status_code == 200:
-     reached.add(resp.headers["x-replica-id"])
+ try:
+     resp = _chat_request(base_url, f"warmup-session-{next(sessions)}")
+     if resp.status_code == 200 and "x-replica-id" in resp.headers:
+         reached.add(resp.headers["x-replica-id"])
+ except httpx.HTTPError:
+     pass

@eicherseiji eicherseiji self-assigned this Jun 18, 2026
@eicherseiji eicherseiji added the "go" label (add ONLY when ready to merge, run all tests) Jun 18, 2026
…affinity tests"

This reverts commit 284f335.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ead of 503

When direct streaming is enabled, HAProxy asks the ingress request router to
pin a replica, then routes to that replica by name through its statically
reloaded server map. Right after an application becomes RUNNING there is a
brief window where the router's in-process replica view runs ahead of
HAProxy's config reload, so the router can name a replica HAProxy has not
loaded yet. HAProxy rejected that with a 503 `unknown_replica_id`, which
surfaced as flaky failures in the direct-streaming session-affinity tests
(test_dp_direct_streaming, etc.) and as a real client-facing error during the
post-deploy gap.

Recover the request instead of failing it. The fallback Serve proxy is always
running in HAProxy mode and re-pins via the same consistent-hash router (same
ring, same replica) over the Serve handle path, so it serves the request
correctly and preserves session affinity. The frontend now routes the
`unknown_replica_id` reason to the fallback proxy rather than returning 503.

Scope: only `unknown_replica_id` is recovered. `router_unreachable`,
`router_non_200`, and `unparseable_replica_id` mean the router itself is
broken and still fail loud. The Lua keeps arming
`txn.ingress_request_router_failed`, so the
`serve_haproxy_ingress_router_failures{reason}` metric still counts every
pin-miss by reason.

Adds test_pin_miss_falls_back_to_fallback_server: with the router pinning a
replica absent from the server map, the request is served by the fallback
(200) and the affinity-breaking primary backend is never selected.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji changed the title from "[llm] Warm up HAProxy data plane in direct-streaming session-affinity tests" to "[serve] Recover ingress-router pin-misses via the fallback proxy instead of 503" Jun 19, 2026