[core] Avoid GCS crash on Redis connection loss in RedisResponseFn by nadongjun · Pull Request #64204 · ray-project/ray

nadongjun · 2026-06-18T06:33:51Z

Description

While testing, gcs_server crashed with a SIGSEGV during a Redis connection error (failover). Looking for a fix, I found #48781, which solves it by reconnecting to Redis, but it was closed as stale without being merged. Its main blocker at the time (a dependency cycle in the GCS Redis client) has since been resolved by #49000 and #55655, so this PR ports that reconnect approach onto the current version.

When the Redis connection is lost, an in-flight request retries against the released connection and dereferences a null pointer, crashing the GCS. This PR reconnects to Redis when the connection is dead so requests retry on a healthy connection instead of crashing.

Related issues

Fixes [Ray Serve] GCS Segmentation Fault on failed Redis requests #53475
Based on [Fix][GCS] Implement reconnection for RedisContext #48781 by @MortalHappiness (closed without merge); credit for the original design.

Additional information

Reproduction:

1. Run a cluster with GCS fault tolerance backed by an HA Redis.
2. While the cluster is up, trigger a Redis primary failover, e.g. kill or replace the Redis primary so that a replica is promoted.
3. Before this PR, gcs_server exits with signal 11 and the head pod restarts; after this PR it reconnects to the new primary and continues.

The exact crash we observed:

2026-06-18 04:54:00,110 ERROR scripts.py:1219 -- Some Ray subprocesses exited unexpectedly:
  gcs_server [exit code=-11 (signal 11)]

[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474: *** SIGSEGV received at time=1781726136 on cpu 0 ***
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474: PC: @     0x55719802efb8  (unknown)  ray::gcs::RedisRequestContext::RedisResponseFn()
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x7f6c34fb6520  (unknown)  (unknown)
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x55719802f1f9       1168  ray::gcs::RedisRequestContext::Run()
[2026-06-18 04:55:36,551 E 42 42] (gcs_server) logging.cc:474:     @     0x55719801a748        224  ray::gcs::RedisStoreClient::AsyncCheckHealth()

Update:

I first took the reconnect approach above, but a robust in-place reconnect needs Connect() to be non-fatal. Today it RAY_CHECKs, so a transient failover still aborts gcs_server; making it non-fatal is the larger refactor that stalled #48781. So this PR instead takes the minimal fix: null-guard the dereference so the request fails gracefully instead of crashing, and GCS fault tolerance recovers as today. Full reconnect is left as a follow-up (#48781).
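To illustrate the shape of the minimal fix, here is a rough sketch of a null-guarded retry path. It is not the code in this PR: `Connection`, `ConnectionHolder`, `Status`, and `RetryRequest` are hypothetical stand-ins, assuming the retry path reads a connection pointer that is reset to null once the connection is released.

```cpp
#include <memory>

// Hypothetical stand-in for a live Redis connection (not a Ray type).
struct Connection {
  int retries_sent = 0;
};

// Hypothetical owner whose pointer becomes null when the connection is released.
struct ConnectionHolder {
  std::unique_ptr<Connection> conn;
};

// Hypothetical result type for the retry path.
enum class Status { kOk, kIOError };

// Retry an in-flight request. Without the null check, a retry issued after
// the connection was released would dereference a null pointer.
Status RetryRequest(ConnectionHolder &holder) {
  Connection *conn = holder.conn.get();
  if (conn == nullptr) {
    // Connection is gone: fail the request instead of crashing the process.
    return Status::kIOError;
  }
  ++conn->retries_sent;
  return Status::kOk;
}
```

With a guard like this, a retry after the connection is released returns an error to the caller rather than raising SIGSEGV, which is the behavior the update describes.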

gemini-code-assist

Code Review

This pull request introduces a reconnection mechanism (Reconnect()) to RedisContext to recover from lost Redis connections (e.g., during failovers) instead of crashing. It stores connection parameters on initial connection and updates RedisRequestContext to reference RedisContext directly. The reviewer identified several critical issues with this implementation, including thread-safety and lifetime issues in async_context() that could lead to data races or use-after-free bugs, a bug where bootstrap addresses are overwritten in Sentinel/Cluster setups, and a potential crash in RunArgvAsync during active reconnections.
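For context on the lifetime concern raised against the earlier reconnect revision, one common way to avoid that class of bug is to hand callers a shared-ownership snapshot taken under a lock. The sketch below only illustrates the idea; `AsyncContext`, `ContextOwner`, `Snapshot`, and `Reconnect` are hypothetical names, not Ray's API and not what this PR implements.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an async Redis context (not a Ray type).
struct AsyncContext {
  int id;
};

class ContextOwner {
 public:
  explicit ContextOwner(int id)
      : ctx_(std::make_shared<AsyncContext>(AsyncContext{id})) {}

  // Return a shared_ptr copy taken under the lock, so the caller keeps the
  // context alive even if Reconnect() swaps it out afterwards.
  std::shared_ptr<AsyncContext> Snapshot() const {
    std::lock_guard<std::mutex> lock(mu_);
    return ctx_;
  }

  // Replace the context; existing snapshots stay valid until released.
  void Reconnect(int new_id) {
    auto fresh = std::make_shared<AsyncContext>(AsyncContext{new_id});
    std::lock_guard<std::mutex> lock(mu_);
    ctx_ = std::move(fresh);
  }

 private:
  mutable std::mutex mu_;
  std::shared_ptr<AsyncContext> ctx_;
};
```

A caller that took a snapshot before a reconnect keeps using the old context safely, while new callers see the replacement.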

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 28f4384.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

edoakes · 2026-06-18T19:31:47Z

@rueian PTAL

nadongjun requested a review from a team as a code owner June 18, 2026 06:33

cursor Bot reviewed Jun 18, 2026

Comment thread src/ray/gcs/store_client/redis_context.h Outdated

Comment thread src/ray/gcs/store_client/redis_context.cc Outdated

Comment thread src/ray/gcs/store_client/redis_context.h Outdated

gemini-code-assist Bot reviewed Jun 18, 2026

Comment thread src/ray/gcs/store_client/redis_context.h

Comment thread src/ray/gcs/store_client/redis_context.h

Comment thread src/ray/gcs/store_client/redis_context.cc

Comment thread src/ray/gcs/store_client/redis_context.cc

Comment thread src/ray/gcs/store_client/redis_context.cc

cursor Bot reviewed Jun 18, 2026

Comment thread src/ray/gcs/store_client/redis_context.h Outdated

[core] Avoid GCS crash on Redis connection loss in RedisResponseFn

bafecfc

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

nadongjun force-pushed the fix/gcs-redis-reconnect-on-failover branch from 28f4384 to bafecfc June 18, 2026 07:24

nadongjun changed the title ~~[core] Reconnect to Redis after connection loss instead of crashing GCS~~ [core] Avoid GCS crash on Redis connection loss in RedisResponseFn Jun 18, 2026

Merge branch 'master' into fix/gcs-redis-reconnect-on-failover

4b6b8a8

ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jun 18, 2026

edoakes assigned rueian Jun 18, 2026

nadongjun wants to merge 2 commits into ray-project:master from nadongjun:fix/gcs-redis-reconnect-on-failover

3 participants