
[core] Avoid GCS crash on Redis connection loss in RedisResponseFn #64204

Open
nadongjun wants to merge 2 commits into
ray-project:master from
nadongjun:fix/gcs-redis-reconnect-on-failover

Conversation

@nadongjun nadongjun commented Jun 18, 2026

Contributor

Description

While testing, gcs_server crashed with a SIGSEGV during a Redis connection error (failover). While looking for a fix, I found #48781, which solves the crash by reconnecting to Redis but was closed as stale without being merged. Its main blocker at the time (a dependency cycle in the GCS Redis client) has since been resolved by #49000 and #55655, so this PR ports that reconnect approach onto the current version.

When the Redis connection is lost, an in-flight request retries against the released connection and dereferences a null pointer, crashing the GCS. This PR reconnects to Redis when the connection is dead so requests retry on a healthy connection instead of crashing.
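
To make the reconnect-before-retry idea concrete, here is a minimal sketch. It is not the code in this PR: `FakeConnection`, `FakeRedisContext`, `TryConnect`, and `SendWithReconnect` are hypothetical names, and the sketch assumes a connect call that returns a status instead of aborting.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a live Redis connection; not Ray's real type.
struct FakeConnection {
  std::string endpoint;
};

// Hypothetical owner of the connection. `connection_` is null while disconnected.
class FakeRedisContext {
 public:
  explicit FakeRedisContext(std::string endpoint) : endpoint_(std::move(endpoint)) {}

  // Non-fatal connect attempt (an assumption of this sketch): returns false
  // instead of aborting when the server cannot be reached.
  bool TryConnect(bool server_reachable) {
    if (!server_reachable) {
      connection_.reset();
      return false;
    }
    connection_ = std::make_unique<FakeConnection>(FakeConnection{endpoint_});
    ++connect_count_;
    return true;
  }

  // Simulates the connection being released, e.g. during a failover.
  void DropConnection() { connection_.reset(); }

  bool IsConnected() const { return connection_ != nullptr; }
  int connect_count() const { return connect_count_; }

 private:
  std::string endpoint_;
  std::unique_ptr<FakeConnection> connection_;
  int connect_count_ = 0;
};

// Retry path: re-establish the connection first if it is dead, and only then
// send. Returns true if the request went out on a live connection.
bool SendWithReconnect(FakeRedisContext &ctx, bool server_reachable) {
  if (!ctx.IsConnected() && !ctx.TryConnect(server_reachable)) {
    return false;  // Still disconnected; the caller can retry later.
  }
  assert(ctx.IsConnected());
  return true;  // A real client would issue the Redis command here.
}
```

The key property is that the retry path never touches the connection pointer until it has confirmed, or restored, a live connection.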

Related issues

Additional information

Reproduction:

  1. Run a cluster with GCS fault tolerance backed by an HA Redis.
  2. While the cluster is up, trigger a Redis primary failover, e.g. kill or replace the Redis primary so that a replica is promoted.
  3. Before this PR, gcs_server exits with signal 11 and the head pod restarts; after this PR it reconnects to the new primary and continues.

The exact crash we observed:

2026-06-18 04:54:00,110 ERROR scripts.py:1219 -- Some Ray subprocesses exited unexpectedly:
  gcs_server [exit code=-11 (signal 11)]

[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474: *** SIGSEGV received at time=1781726136 on cpu 0 ***
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474: PC: @     0x55719802efb8  (unknown)  ray::gcs::RedisRequestContext::RedisResponseFn()
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x7f6c34fb6520  (unknown)  (unknown)
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x55719802f1f9       1168  ray::gcs::RedisRequestContext::Run()
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x55719801a748        224  ray::gcs::RedisStoreClient::AsyncCheckHealth()

Update:

I first took the reconnect approach above, but a robust in-place reconnect needs Connect() to be non-fatal. Today it RAY_CHECKs, so a transient failover would still abort gcs_server; making it non-fatal is the larger refactor that stalled #48781. This PR therefore takes the minimal fix instead: null-guard the dereference so that the request fails gracefully instead of crashing, and GCS fault tolerance recovers as it does today. Full reconnect is left as a follow-up (#48781).
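
For readers unfamiliar with the pattern, the null guard can be sketched roughly as follows. This is an illustration only, not the actual diff: `StubConnection`, `StubContext`, `RequestStatus`, and `RetryRequest` are hypothetical names that do not exist in Ray.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the async connection; not Ray's real type.
struct StubConnection {
  int commands_sent = 0;
};

// Hypothetical owner whose connection pointer becomes null once released.
struct StubContext {
  std::unique_ptr<StubConnection> connection = std::make_unique<StubConnection>();
  void Release() { connection.reset(); }
};

enum class RequestStatus { kSent, kDisconnected };

// Retry path with a null guard: check the pointer before dereferencing it and
// report a failure to the caller instead of crashing.
RequestStatus RetryRequest(StubContext &ctx,
                           const std::function<void(RequestStatus)> &on_done) {
  assert(on_done && "a completion callback is required");
  StubConnection *conn = ctx.connection.get();
  if (conn == nullptr) {
    // The connection was released: fail the request gracefully.
    on_done(RequestStatus::kDisconnected);
    return RequestStatus::kDisconnected;
  }
  ++conn->commands_sent;  // A real client would resend the command here.
  on_done(RequestStatus::kSent);
  return RequestStatus::kSent;
}
```

With the guard in place, a request that races with a released connection surfaces as an ordinary error that the existing fault-tolerance path can handle, rather than as a SIGSEGV.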

@nadongjun nadongjun requested a review from a team as a code owner June 18, 2026 06:33
Comment thread src/ray/gcs/store_client/redis_context.h Outdated
Comment thread src/ray/gcs/store_client/redis_context.cc Outdated
Comment thread src/ray/gcs/store_client/redis_context.h Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a reconnection mechanism (Reconnect()) to RedisContext to recover from lost Redis connections (e.g., during failovers) instead of crashing. It stores connection parameters on initial connection and updates RedisRequestContext to reference RedisContext directly. The reviewer identified several critical issues with this implementation, including thread-safety and lifetime issues in async_context() that could lead to data races or use-after-free bugs, a bug where bootstrap addresses are overwritten in Sentinel/Cluster setups, and a potential crash in RunArgvAsync during active reconnections.


Comment thread src/ray/gcs/store_client/redis_context.h
Comment thread src/ray/gcs/store_client/redis_context.h
Comment thread src/ray/gcs/store_client/redis_context.cc
Comment thread src/ray/gcs/store_client/redis_context.cc
Comment thread src/ray/gcs/store_client/redis_context.cc

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 28f4384. Configure here.

Comment thread src/ray/gcs/store_client/redis_context.h Outdated
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun force-pushed the fix/gcs-redis-reconnect-on-failover branch from 28f4384 to bafecfc Compare June 18, 2026 07:24
@nadongjun nadongjun changed the title from "[core] Reconnect to Redis after connection loss instead of crashing GCS" to "[core] Avoid GCS crash on Redis connection loss in RedisResponseFn" Jun 18, 2026
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jun 18, 2026
edoakes commented Jun 18, 2026

Collaborator

@rueian PTAL


Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core)

Development

Successfully merging this pull request may close these issues.

[Ray Serve] GCS Segmentation Fault on failed Redis requests

3 participants