
HDDS-14989. Delay follower SCM DN server start until Ratis log catch-up. #10617

Draft
ArafatKhan2198 wants to merge 2 commits into apache:master from ArafatKhan2198:follower_slow


@ArafatKhan2198

Contributor

What changes were proposed in this pull request?

When an SCM follower restarts in an HA cluster, it used to start talking to datanodes right away, even while it was still catching up on the Ratis log.

That caused problems:

  • Datanodes report containers the follower doesn’t know about yet → CONTAINER_NOT_FOUND
  • Or the follower tries to update container state and fails → NotLeaderException
  • In both cases, replica info gets dropped
  • If that SCM later becomes leader, containers can show missing or wrong replicas

The fix:

  1. Don’t start the datanode server in HA mode during normal SCM startup.
  2. Wait until catch-up is done, then start it from SCMStateMachine.
  3. Don’t let followers write container state changes during report handling — only the leader should.

Why: Replica locations are rebuilt from datanode reports. Those reports must only be processed after the SCM has replayed all committed Ratis entries.
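The three steps of the fix can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: `DeferredDatanodeServerGate`, `onApplied`, and `mayWriteContainerState` are hypothetical names, and the catch-up condition (last applied index has reached the leader's commit index) is an assumption based on the log line shown in the manual test.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from the Ozone code base.
final class DeferredDatanodeServerGate {
  private final AtomicBoolean started = new AtomicBoolean(false);
  private final Runnable startDatanodeServer;

  DeferredDatanodeServerGate(Runnable startDatanodeServer) {
    this.startDatanodeServer = startDatanodeServer;
  }

  /**
   * Called after a log entry has been applied. Starts the datanode
   * server at most once, and only after the follower has applied
   * everything the leader has committed (assumed catch-up condition).
   *
   * @return true if the follower is caught up at this point
   */
  boolean onApplied(long lastAppliedIndex, long leaderCommitIndex) {
    if (lastAppliedIndex < leaderCommitIndex) {
      return false; // still replaying committed entries
    }
    if (started.compareAndSet(false, true)) {
      startDatanodeServer.run();
    }
    return true;
  }

  boolean isStarted() {
    return started.get();
  }

  /** Report handling may change container state only on the leader. */
  static boolean mayWriteContainerState(boolean isLeader) {
    return isLeader;
  }
}
```

Using `compareAndSet` keeps the start idempotent, so repeated catch-up notifications do not start the server twice.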

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14989

How was this patch tested?

Integration tests

TestSCMFollowerCatchupWithContainerReport:

  • testFollowerCatchupAfterContainerClose — close-while-down (HDDS-14989 scenario)
  • testFollowerCatchupAfterContainerCreate — create-while-down (CONTAINER_NOT_FOUND scenario)
  • testFollowerCatchupOnIdleCluster — idle cluster edge case

Manual test (docker-compose ozone-ha)

Environment: hadoop-ozone/dist/target/ozone-2.3.0-SNAPSHOT/compose/ozone-ha

Config: RF=3, 3 datanodes, hdds.container.report.interval=1h, ozone.scm.container.size=1GB

Procedure (same for with/without fix):

  1. Start cluster: OZONE_REPLICATION_FACTOR=3 docker compose up -d --scale datanode=3
  2. Write 50 × 1MB keys to vol1/buck1 (containers 1–3)
  3. Stop follower scm3
  4. Close containers 1, 2, 3
  5. Write 50 × 1MB keys to vol1/buck2 (creates containers 4, 5, 6 while scm3 is down)
  6. Restart scm3
  7. Transfer SCM leadership to scm3
  8. Inspect scm3 logs and ozone admin container info for containers 4–6

Without the fix:

06:57:58.622  ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861
06:57:58.837  CONTAINER_NOT_FOUND for Container #4
06:57:58.837  CONTAINER_NOT_FOUND for Container #5
06:57:58.837  CONTAINER_NOT_FOUND for Container #6
(6 errors total — 2 datanodes × 3 containers)

After leadership transfer, containers 4–6 had 1 replica each (expected 3).

With the fix:

07:24:28.377  Follower caught up with leader: lastAppliedIndex=49, leaderCommit=49
07:24:28.378  ScmDatanodeProtocol RPC server ... listening at /0.0.0.0:9861
  • CONTAINER_NOT_FOUND on scm3: 0
  • After leadership transfer, containers 4–6 each had 3 replicas from all datanodes

…ring cluster

Reduce test execution time by ~60% (2m28s → ~1m) via three targeted optimizations:

1. Share one cluster across all three test methods (@BeforeAll/@AfterAll instead
   of @BeforeEach/@AfterEach). This eliminates 2 of 3 cluster bring-up cycles,
   which account for ~80-90% of the test duration. Each test uses its own
   volume/bucket name to avoid collisions on the shared cluster.

2. Lower the datanode heartbeat interval to 1 second in the config. This speeds
   safe-mode exit at startup and replica re-reporting after the deferred
   DN-server start, both happening in ~1s instead of default timers.

3. Tighten the waitFor poll interval from 1000ms to 250ms. This allows the
   test to notice when async conditions (safe-mode exit, leadership transfer)
   are met sooner, without changing the timeout ceilings.
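To illustrate why a shorter poll interval helps without touching the timeout, here is a minimal standalone polling helper. It is a sketch under assumptions: `PollingWait.waitFor` is a hypothetical name and is not the utility the test actually calls. The timeout bounds the total wait, while the poll interval only bounds how long a condition that is already true can go unnoticed.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not the test's real wait utility.
final class PollingWait {
  private PollingWait() { }

  /**
   * Polls the condition until it holds or the timeout elapses.
   *
   * @return the number of times the condition was evaluated
   */
  static int waitFor(BooleanSupplier check, long pollIntervalMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    int polls = 0;
    while (true) {
      polls++;
      if (check.getAsBoolean()) {
        return polls;
      }
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException(
            "Condition not met within " + timeoutMs + " ms");
      }
      // A smaller interval means a satisfied condition is seen sooner;
      // the deadline above is unaffected.
      Thread.sleep(pollIntervalMs);
    }
  }
}
```

With a 1000 ms interval, a condition that becomes true just after a poll is noticed almost a second late; with 250 ms the worst-case delay drops to about a quarter of a second per wait.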

The test essence is fully preserved: 3 SCMs, 3 DNs, 5-minute report interval,
down→mutate→restart→catch-up→promote sequence all unchanged. Only the test
harness (cluster setup and polling) was optimized. The regression still catches
empty replicas on the unfixed code and confirms full replicas on the fix.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@ivandika3 ivandika3 requested a review from xichen01 June 26, 2026 13:03
@ivandika3

Contributor

If some of the implementation is taken from #10059, don't forget to add @xichen01 as a co-author.
