HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up. by ArafatKhan2198 · Pull Request #10617 · apache/ozone

ArafatKhan2198 · 2026-06-26T08:26:19Z

What changes were proposed in this pull request?

When an SCM follower restarts in an HA cluster, it used to start talking to datanodes right away, even while it was still catching up on the Ratis log.

That caused problems:

- Datanodes report containers the follower doesn’t know about yet → CONTAINER_NOT_FOUND
- Or the follower tries to update container state and fails → NotLeaderException
- In both cases, replica info gets dropped
- If that SCM later becomes leader, containers can show missing or wrong replicas

The fix:

1. Don’t start the datanode server in HA mode during normal SCM startup.
2. Wait until catch-up is done, then start it from SCMStateMachine.
3. Don’t let followers write container state changes during report handling; only the leader should.
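
The leader-only rule in the last item can be illustrated with a small, self-contained sketch. It is not the code from this patch: `ReportHandlerSketch`, its methods, and the CLOSING → CLOSED transition used as the example write are all hypothetical stand-ins. The point is only that a follower still records the replica, while the state change is skipped unless the node is leader.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical, simplified stand-in for SCM report handling (not Ozone's real classes).
final class ReportHandlerSketch {
  enum State { OPEN, CLOSING, CLOSED }

  private final Map<Long, State> containerStates = new HashMap<>();
  private final Map<Long, Set<String>> replicas = new HashMap<>();
  private final BooleanSupplier isLeader; // assumption: leadership is queried per report

  ReportHandlerSketch(BooleanSupplier isLeader) {
    this.isLeader = isLeader;
  }

  void addContainer(long id, State state) {
    containerStates.put(id, state);
  }

  State stateOf(long id) {
    return containerStates.get(id);
  }

  int replicaCount(long id) {
    Set<String> known = replicas.get(id);
    return known == null ? 0 : known.size();
  }

  /**
   * Records the reporting datanode as a replica holder on every SCM.
   * Returns true only if this call also changed the container state,
   * which the sketch allows on the leader alone.
   */
  boolean onReplicaReport(long id, String datanode, State reported) {
    State current = containerStates.get(id);
    if (current == null) {
      throw new IllegalStateException("CONTAINER_NOT_FOUND: #" + id);
    }
    // Replica bookkeeping is local and safe on followers.
    replicas.computeIfAbsent(id, k -> new HashSet<>()).add(datanode);

    boolean wantsTransition = current == State.CLOSING && reported == State.CLOSED;
    if (wantsTransition && isLeader.getAsBoolean()) {
      // Illustrative state write; a follower skips it instead of failing.
      containerStates.put(id, State.CLOSED);
      return true;
    }
    return false;
  }
}
```

With this shape, a follower that receives a CLOSED replica for a CLOSING container keeps the replica location and leaves the state untouched, so there is no failed write that could cause the report to be dropped.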

Why: Replica locations are rebuilt from datanode reports. Those reports must only be processed after the SCM has replayed all committed Ratis entries.
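
A minimal sketch of the deferred start, under stated assumptions: `CatchUpGate`, `onScmStart`, and `onEntryApplied` are hypothetical names, and the catch-up condition `lastAppliedIndex >= leaderCommit` is inferred from the log line shown in the manual test rather than taken from the patch. The actual change starts the server from SCMStateMachine; this only illustrates the start-once gating idea.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical gate that delays a server start until the Ratis log is replayed.
final class CatchUpGate {
  private final boolean haEnabled;
  private final Runnable startDatanodeServer; // assumption: starting the DN server is a single call
  private final AtomicBoolean started = new AtomicBoolean(false);

  CatchUpGate(boolean haEnabled, Runnable startDatanodeServer) {
    this.haEnabled = haEnabled;
    this.startDatanodeServer = startDatanodeServer;
  }

  /** Normal startup path: start immediately only when HA is off. */
  void onScmStart() {
    if (!haEnabled) {
      startOnce();
    }
  }

  /** Called after an entry is applied; starts the server once the follower has caught up. */
  void onEntryApplied(long lastAppliedIndex, long leaderCommit) {
    if (haEnabled && lastAppliedIndex >= leaderCommit) {
      startOnce();
    }
  }

  boolean isStarted() {
    return started.get();
  }

  private void startOnce() {
    // compareAndSet guarantees the server is started at most once,
    // even if several applied entries satisfy the condition.
    if (started.compareAndSet(false, true)) {
      startDatanodeServer.run();
    }
  }
}
```

In this sketch, `onScmStart()` is a no-op in HA mode, `onEntryApplied(30, 49)` leaves the server stopped, and `onEntryApplied(49, 49)` starts it exactly once.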

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14989

How was this patch tested?

Integration tests

TestSCMFollowerCatchupWithContainerReport:

- testFollowerCatchupAfterContainerClose: close-while-down (HDDS-14989 scenario)
- testFollowerCatchupAfterContainerCreate: create-while-down (CONTAINER_NOT_FOUND scenario)
- testFollowerCatchupOnIdleCluster: idle cluster edge case

Manual test (docker-compose `ozone-ha`)

Environment: hadoop-ozone/dist/target/ozone-2.3.0-SNAPSHOT/compose/ozone-ha

Config: RF=3, 3 datanodes, hdds.container.report.interval=1h, ozone.scm.container.size=1GB

Procedure (same for with/without fix):

1. Start cluster: `OZONE_REPLICATION_FACTOR=3 docker compose up -d --scale datanode=3`
2. Write 50 × 1MB keys to vol1/buck1 (containers 1–3)
3. Stop follower scm3
4. Close containers 1, 2, 3
5. Write 50 × 1MB keys to vol1/buck2 (creates containers 4, 5, 6 while scm3 is down)
6. Restart scm3
7. Transfer SCM leadership to scm3
8. Inspect scm3 logs and `ozone admin container info` for containers 4–6

Without the fix:

06:57:58.622  ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861
06:57:58.837  CONTAINER_NOT_FOUND for Container #4
06:57:58.837  CONTAINER_NOT_FOUND for Container #5
06:57:58.837  CONTAINER_NOT_FOUND for Container #6
(6 errors total — 2 datanodes × 3 containers)

After leadership transfer, containers 4–6 had 1 replica each (expected 3).

With the fix:

07:24:28.377  Follower caught up with leader: lastAppliedIndex=49, leaderCommit=49
07:24:28.378  ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861

CONTAINER_NOT_FOUND on scm3: 0
After leadership transfer, containers 4–6 each had 3 replicas from all datanodes

…ring cluster

Reduce test execution time by ~60% (2m28s → ~1m) via three targeted optimizations:

1. Share one cluster across all three test methods (@BeforeAll/@AfterAll instead of @BeforeEach/@AfterEach). This eliminates 2 of 3 cluster bring-up cycles, which account for ~80-90% of the test duration. Each test uses its own volume/bucket name to avoid collisions on the shared cluster.
2. Lower the datanode heartbeat interval to 1 second in the config. This speeds safe-mode exit at startup and replica re-reporting after the deferred DN-server start, both happening in ~1s instead of default timers.
3. Tighten the waitFor poll interval from 1000ms to 250ms. This allows the test to notice when async conditions (safe-mode exit, leadership transfer) are met sooner, without changing the timeout ceilings.

The test essence is fully preserved: 3 SCMs, 3 DNs, 5-minute report interval, down→mutate→restart→catch-up→promote sequence all unchanged. Only the test harness (cluster setup and polling) was optimized. The regression still catches empty replicas on the unfixed code and confirms full replicas on the fix.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
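
The effect of the tighter poll interval can be sketched with a generic polling helper. This is an illustrative stand-in, not the helper the test uses: `PollSketch.waitFor` and its boolean return value are assumptions made to keep the example self-contained. The timeout ceiling is independent of the poll interval, so lowering the interval only reduces how long a satisfied condition goes unnoticed.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; names and return convention are illustrative.
final class PollSketch {
  private PollSketch() { }

  /**
   * Polls the condition every pollMillis until it holds or timeoutMillis elapses.
   * Returns true if the condition was met, false on timeout or interruption.
   */
  static boolean waitFor(BooleanSupplier condition, long pollMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // ceiling reached; the poll interval does not change this bound
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

For a condition that becomes true at an arbitrary moment, the worst-case detection delay is roughly one poll interval, so going from 1000ms to 250ms trims up to about 750ms from each wait while the timeout stays the same.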

ivandika3 · 2026-06-26T13:05:32Z

If some of the implementation is taken from #10059, don't forget to add @xichen01 as the co-author.

HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up

df1f0f3

ArafatKhan2198 requested a review from sumitagrawl June 26, 2026 08:26

ivandika3 requested a review from xichen01 June 26, 2026 13:03

ArafatKhan2198 wants to merge 2 commits into apache:master from ArafatKhan2198:follower_slow
