[core] Avoid GCS crash on Redis connection loss in RedisResponseFn#64204
[core] Avoid GCS crash on Redis connection loss in RedisResponseFn#64204nadongjun wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a reconnection mechanism (Reconnect()) to RedisContext to recover from lost Redis connections (e.g., during failovers) instead of crashing. It stores connection parameters on initial connection and updates RedisRequestContext to reference RedisContext directly. The reviewer identified several critical issues with this implementation, including thread-safety and lifetime issues in async_context() that could lead to data races or use-after-free bugs, a bug where bootstrap addresses are overwritten in Sentinel/Cluster setups, and a potential crash in RunArgvAsync during active reconnections.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 28f4384. Configure here.
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
28f4384 to
bafecfc
Compare
|
@rueian PTAL |

Description
While testing,
gcs_servercrashed with a SIGSEGV during a Redis connection error (failover). Looking for a fix I found #48781, which solves it by reconnecting to Redis, but it was closed as stale without being merged. Its main blocker at the time (a dependency cycle in the GCS Redis client) has since been resolved by #49000 and #55655, so this PR ports that reconnect approach onto the current version.When the Redis connection is lost, an in-flight request retries against the released connection and dereferences a null pointer, crashing the GCS. This PR reconnects to Redis when the connection is dead so requests retry on a healthy connection instead of crashing.
Related issues
Additional information
Reproduction:
gcs_serverexits with signal 11 and the head pod restarts; after this PR it reconnects to the new primary and continues.The exact crash we observed:
Update:
I first took the reconnect approach above, but a robust in-place reconnect needs
Connect()to be non-fatal (itRAY_CHECKs today, so a transient failover still abortsgcs_server) - the larger refactor that stalled #48781. So this PR instead takes the minimal fix: null-guard the dereference so the request fails gracefully instead of crashing, and GCS fault tolerance recovers as today. Full reconnect is left as a follow-up (#48781).