HDDS-15651. Test case for DiskBalancer when markContainerForDelete fails by arunsarin85 · Pull Request #10593 · apache/ozone

arunsarin85 · 2026-06-23T20:43:54Z

What changes were proposed in this pull request?

Test-only PR for HDDS-15651. Adds two unit tests in TestDiskBalancerTask to document the intended DiskBalancer move/cleanup behavior when markContainerForDelete() fails or when lazy deletion fails.

Please describe your PR in detail:
DiskBalancer treats container move and source cleanup as separate phases. Once import and ContainerSet update succeed, the move is reported as success even if marking the old source replica fails. The old replica is queued in pendingDeletionContainers and removed after replica.deletion.delay.

This PR adds tests to lock in that behavior and document a known gap when lazy deletion fails.

Test 1: moveSucceedsWhenMarkContainerForDeleteFails

Simulates markContainerForDelete() failure on the source replica after a successful move.
Verifies the move is still reported as success (success metrics updated, no rollback).
Verifies ContainerSet points to the destination replica.
Verifies the source replica stays on disk temporarily and is queued for lazy deletion.
After the delay, verifies the source replica is removed via cleanupPendingDeletionContainers().
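
The flow that Test 1 exercises can be sketched with a small toy model. This is only an illustration under stated assumptions: `ToyDiskBalancer`, `Pending`, `dirsOnDisk` and the `Runnable markSource` hook are hypothetical stand-ins, not the real Ozone classes or the code in `TestDiskBalancerTask`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are the real Ozone classes.
class ToyDiskBalancer {
  /** A queued source replica and the time at which it may be removed. */
  static final class Pending {
    final String sourceDir;
    final long dueAtMillis;

    Pending(String sourceDir, long dueAtMillis) {
      this.sourceDir = sourceDir;
      this.dueAtMillis = dueAtMillis;
    }
  }

  final Map<Long, String> containerSet = new HashMap<>(); // id -> active replica dir
  final Set<String> dirsOnDisk = new HashSet<>();         // replica dirs that exist
  final Deque<Pending> pendingDeletion = new ArrayDeque<>();
  final long deletionDelayMillis;
  long successCount;
  long failureCount;

  ToyDiskBalancer(long deletionDelayMillis) {
    this.deletionDelayMillis = deletionDelayMillis;
  }

  void addContainer(long id, String dir) {
    containerSet.put(id, dir);
    dirsOnDisk.add(dir);
  }

  /** Move phase first, then a best-effort mark of the source replica. */
  void move(long id, String destDir, long nowMillis, Runnable markSource) {
    String sourceDir = containerSet.get(id);
    dirsOnDisk.add(destDir);       // stands in for copy + import
    containerSet.put(id, destDir); // ContainerSet now points at the destination
    try {
      markSource.run();            // may throw to simulate a mark failure
    } catch (RuntimeException e) {
      // Cleanup-phase failure: the move is not rolled back.
    }
    // The source is queued for lazy deletion whether or not the mark succeeded.
    pendingDeletion.add(new Pending(sourceDir, nowMillis + deletionDelayMillis));
    successCount++;
  }

  /** Removes queued source replicas whose delay has elapsed. */
  void cleanupPendingDeletion(long nowMillis) {
    while (!pendingDeletion.isEmpty()
        && pendingDeletion.peek().dueAtMillis <= nowMillis) {
      dirsOnDisk.remove(pendingDeletion.poll().sourceDir);
    }
  }
}
```

In this model, passing a `markSource` that throws still leaves `successCount` at 1 and the source directory on disk until `cleanupPendingDeletion` is called with a time at or past the delay.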

Test 2: lazyDeletionFailureDoesNotRetry

Runs a successful move and advances the clock past the deletion delay.
Mocks KeyValueContainerUtil.removeContainer() to fail during lazy deletion.
Verifies the source replica remains on disk, the pending queue entry is dropped, and deletion is not retried on a second cleanup attempt.
Documents current behavior when lazy deletion fails (recovery depends on other paths such as DN restart for Ratis).
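
One way the no-retry behavior described for Test 2 could arise is sketched below. It assumes the queue entry is taken off the queue before removal is attempted and is not re-queued on failure; the real implementation may differ. `ToyPendingDeletions`, `Remover` and `Entry` are hypothetical names for illustration, and `Remover` merely stands in for the mocked `KeyValueContainerUtil.removeContainer()` call.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: names below are hypothetical, not Ozone's API.
class ToyPendingDeletions {
  /** Stand-in for the call that physically removes a replica directory. */
  interface Remover {
    void remove(String dir) throws IOException;
  }

  static final class Entry {
    final String dir;
    final long dueAtMillis;

    Entry(String dir, long dueAtMillis) {
      this.dir = dir;
      this.dueAtMillis = dueAtMillis;
    }
  }

  private final Deque<Entry> queue = new ArrayDeque<>();
  int attempts; // number of removal attempts made so far

  void enqueue(String dir, long dueAtMillis) {
    queue.add(new Entry(dir, dueAtMillis));
  }

  int size() {
    return queue.size();
  }

  /** Attempts removal of every due entry exactly once. */
  void cleanup(long nowMillis, Remover remover) {
    while (!queue.isEmpty() && queue.peek().dueAtMillis <= nowMillis) {
      Entry e = queue.poll(); // the entry leaves the queue before removal is tried
      attempts++;
      try {
        remover.remove(e.dir);
      } catch (IOException ex) {
        // The failure is swallowed and the entry is not re-queued,
        // so a later cleanup pass has nothing left to retry.
      }
    }
  }
}
```

With a `Remover` that always throws, the first `cleanup` call empties the queue and a second call makes no further attempt, which matches the gap the test documents.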

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15651

How was this patch tested?

mvn test -pl hadoop-hdds/container-service -am \
  -Dtest=TestDiskBalancerTask#moveSucceedsWhenMarkContainerForDeleteFails,TestDiskBalancerTask#lazyDeletionFailureDoesNotRetry \
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false

Gargi-jais11

Thanks @arunsarin85 for raising the concern. I have left comments below to discuss this.

Gargi-jais11 · 2026-06-25T04:19:44Z

        readLockReleased = true;
        try {
          container.markContainerForDelete();
+          moveSucceeded = true;


By the time markContainerForDelete() runs, the expensive part is already done:

container copied to destination

import completed

ContainerSet updated to the new replica

destination used space incremented

If mark fails and we roll back, we would:

restore ContainerSet to the source replica

delete the destination directory

revert volume accounting

report the move as failed

That means we throw away a valid destination copy and redo the whole move later. For large containers, that is a lot of wasted I/O.
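
As a rough back-of-the-envelope illustration of the wasted I/O: the sketch below only counts bytes copied to the destination disk and ignores metadata and delete costs. `RollbackCost` and its methods are hypothetical, and the 5 GB container size in the usage note is an assumed figure for illustration.

```java
// Back-of-the-envelope only: counts bytes copied to the destination disk.
class RollbackCost {
  /** Copy once, discard it on rollback, then copy again when the move is retried. */
  static long bytesCopiedWithRollbackAndRetry(long containerBytes) {
    return 2 * containerBytes;
  }

  /** Copy once and keep the destination replica. */
  static long bytesCopiedWithoutRollback(long containerBytes) {
    return containerBytes;
  }
}
```

For an assumed 5 GB container, rolling back and retrying copies 10 GB in total, versus 5 GB when the completed move is kept.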

Why is the current behavior acceptable?

The move and cleanup are intentionally separate:

1. Move phase: copy + import + ContainerSet update
2. Cleanup phase: mark/delete source replica (with lazy deletion for in-flight reads)

If phase 1 succeeds, the container is effectively moved. Source cleanup is a follow-up step.

Also, even when mark fails:

the old replica is still queued in pendingDeletionContainers

deleteContainer() → removeContainer() does not require DELETED state

for Ratis, HDDS-9322 cleans up duplicates on DN restart

So this is mostly an operational/accounting issue, not a data-loss issue for Ratis.

cc: @ChenSammi

@arunsarin85
I don’t think we should fail and fully roll back a completed move just because source cleanup failed. That adds heavy work and can make DiskBalancer less effective. Existing lazy deletion plus DN restart already cover most of the cleanup path for Ratis. For EC, the issue can still occur because the PR is not merged yet.

Gargi-jais11 · 2026-06-25T04:25:16Z

+          if (diskBalancerDestDir != null) {
+            try {
+              FileUtils.deleteDirectory(diskBalancerDestDir.toFile());
+            } catch (IOException ex) {
+              LOG.warn("Failed to delete destination replica during rollback for container {}",


By the time markContainerForDelete() runs, copy/import/ContainerSet update are already done. Just deleting the new destination container is not a rollback, because ContainerSet is already pointing to the new container on the destination disk.
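
To make the point concrete, here is a minimal toy sketch (not the actual patch, and not Ozone's API) contrasting deleting only the destination directory with a rollback that also restores the ContainerSet entry. `ToyRollback`, `dirsOnDisk` and `isConsistent` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a map stands in for ContainerSet, a set for directories on disk.
class ToyRollback {
  final Map<Long, String> containerSet = new HashMap<>();
  final Set<String> dirsOnDisk = new HashSet<>();

  /** Deletes only the destination directory and leaves ContainerSet untouched. */
  void deleteDestinationOnly(String destDir) {
    dirsOnDisk.remove(destDir);
  }

  /** Restores the ContainerSet entry to the source, then deletes the destination. */
  void fullRollback(long id, String sourceDir, String destDir) {
    containerSet.put(id, sourceDir);
    dirsOnDisk.remove(destDir);
  }

  /** True when the ContainerSet entry refers to a directory that still exists. */
  boolean isConsistent(long id) {
    return dirsOnDisk.contains(containerSet.get(id));
  }
}
```

Starting from the post-move state (entry pointing at the destination, both directories present), `deleteDestinationOnly` leaves `isConsistent` false, while `fullRollback` leaves it true with the entry back on the source.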

arunsarin85 · 2026-06-25T08:12:02Z

Thanks @Gargi-jais11 for the comments!
As per the design and feature flow explained in the Jira https://issues.apache.org/jira/browse/HDDS-15651,
I have modified this PR to be test-only (added 2 additional tests). I will update the description accordingly.

arunsarin85 marked this pull request as draft June 24, 2026 15:40

Gargi-jais11 requested review from ChenSammi and Gargi-jais11 June 25, 2026 04:04

Gargi-jais11 reviewed Jun 25, 2026

HDDS-15651. Add DiskBalancer tests for mark failure and lazy-delete failure paths.

30d50f6

arunsarin85 force-pushed the HDDS-15651 branch from ff380e7 to 30d50f6 Compare June 25, 2026 08:09

arunsarin85 marked this pull request as ready for review June 25, 2026 08:10

adoroszlai reviewed Jun 25, 2026

Comment thread ...rvice/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerTask.java Outdated

Comment thread ...rvice/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerTask.java Outdated

adoroszlai changed the title ~~HDDS-15651. Roll back DiskBalancer move when markContainerForDelete fails~~ HDDS-15651. Test case for DiskBalancer when markContainerForDelete fails Jun 25, 2026

adoroszlai added the test label Jun 25, 2026

HDDS-15651. Addresses review comment

b8e24d9

adoroszlai requested a review from Gargi-jais11 June 25, 2026 09:43

arunsarin85 wants to merge 2 commits into apache:master from arunsarin85:HDDS-15651
