[serve] Recover ingress-router pin-misses via the fallback proxy instead of 503#64218
[serve] Recover ingress-router pin-misses via the fallback proxy instead of 503#64218eicherseiji wants to merge 3 commits into
Conversation
… tests The direct-streaming session-affinity tests fired requests as soon as the application reached RUNNING. RUNNING only means the controller sees every replica running; HAProxy reloads its data-plane server map asynchronously afterward, and the ingress request router learns the running replicas through a separate long poll that runs ahead of that reload. In that window the router pins a replica HAProxy has not loaded, and HAProxy returns 503 unknown_replica_id. test_different_sessions_spread fans requests across all replicas, so it hit the gap and failed in postmerge; test_session_affinity pins a single replica and rarely did. run_app_through_haproxy now warms up until a request has reached every data-plane replica before returning the URL, closing the race for all three direct-streaming test files (DP, PD, and router). Signed-off-by: Seiji Eicher <seiji@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request improves the reliability of direct-streaming session-affinity tests by ensuring HAProxy is fully warmed up and routing to all data-plane replicas before proceeding. Feedback suggests wrapping the warmup HTTP requests in a try-except block to handle transient connection or timeout errors during HAProxy startup or reload, preventing premature failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| resp = _chat_request(base_url, f"warmup-session-{next(sessions)}") | ||
| if resp.status_code == 200: | ||
| reached.add(resp.headers["x-replica-id"]) |
There was a problem hiding this comment.
During HAProxy startup or reload, _chat_request might raise transient connection or timeout errors (e.g., httpx.ConnectError or httpx.HTTPError). If an exception is raised, it will abort the loop of probes immediately, which can cause the wait_for_condition check to fail or run much slower. Wrapping the request in a try-except block to catch httpx.HTTPError and safely checking for the presence of the x-replica-id header makes the warmup process significantly more robust.
| resp = _chat_request(base_url, f"warmup-session-{next(sessions)}") | |
| if resp.status_code == 200: | |
| reached.add(resp.headers["x-replica-id"]) | |
| try: | |
| resp = _chat_request(base_url, f"warmup-session-{next(sessions)}") | |
| if resp.status_code == 200 and "x-replica-id" in resp.headers: | |
| reached.add(resp.headers["x-replica-id"]) | |
| except httpx.HTTPError: | |
| pass |
…affinity tests" This reverts commit 284f335. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ead of 503
When direct streaming is enabled, HAProxy asks the ingress request router to
pin a replica, then routes to that replica by name through its statically
reloaded server map. Right after an application becomes RUNNING there is a
brief window where the router's in-process replica view runs ahead of
HAProxy's config reload, so the router can name a replica HAProxy has not
loaded yet. HAProxy rejected that with a 503 `unknown_replica_id`, which
surfaced as flaky failures in the direct-streaming session-affinity tests
(test_dp_direct_streaming, etc.) and as a real client-facing error during the
post-deploy gap.
Recover the request instead of failing it. The fallback Serve proxy is always
running in HAProxy mode and re-pins via the same consistent-hash router (same
ring, same replica) over the Serve handle path, so it serves the request
correctly and preserves session affinity. The frontend now routes the
`unknown_replica_id` reason to the fallback proxy rather than returning 503.
Scope: only `unknown_replica_id` is recovered. `router_unreachable`,
`router_non_200`, and `unparseable_replica_id` mean the router itself is
broken and still fail loud. The Lua keeps arming
`txn.ingress_request_router_failed`, so the
`serve_haproxy_ingress_router_failures{reason}` metric still counts every
pin-miss by reason.
Adds test_pin_miss_falls_back_to_fallback_server: with the router pinning a
replica absent from the server map, the request is served by the fallback
(200) and the affinity-breaking primary backend is never selected.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Why are these changes needed?
When direct streaming is enabled, HAProxy asks the ingress request router to pin a replica, then routes to it by name through its statically reloaded server map. Right after an application becomes RUNNING there is a brief window where the router's in-process replica view runs ahead of HAProxy's config reload, so the router can name a replica HAProxy has not loaded yet. HAProxy rejected that with a 503
unknown_replica_id-- surfacing as flakytest_dp_direct_streamingfailures in postmerge and as a real client-facing error during the post-deploy gap.This routes the
unknown_replica_idreason to the fallback Serve proxy instead of returning 503. The fallback proxy is always running in HAProxy mode and re-pins via the same consistent-hash router (same ring, same replica) over the Serve handle path, so it serves the request correctly and preserves session affinity.Scope: only
unknown_replica_idis recovered.router_unreachable,router_non_200, andunparseable_replica_idmean the router itself is broken and still fail loud (503). The Lua keeps armingtxn.ingress_request_router_failed, soserve_haproxy_ingress_router_failures{reason}still counts every pin-miss by reason.Adds
test_pin_miss_falls_back_to_fallback_server(runs against real HAProxy): with the router pinning a replica absent from the server map, the request is served by the fallback (200) and the affinity-breaking primary backend is never selected.The first commit (a test-side warmup) is reverted in this PR; the fallback fix replaces it.
Related issue number
Surfaced by postmerge https://buildkite.com/ray-project/postmerge/builds/18131
Checks
test_haproxy_api.py(incl. the new test) andtest_haproxy_metrics.pypass locally against HAProxy 2.8; lint clean.