Skip to content

Allow merge integrity checks to be aborted sooner#16086

Closed
tlrx wants to merge 3 commits into
apache:mainfrom
tlrx:merge-state-abort-checker
Closed

Allow merge integrity checks to be aborted sooner#16086
tlrx wants to merge 3 commits into
apache:mainfrom
tlrx:merge-state-abort-checker

Conversation

@tlrx

@tlrx tlrx commented May 19, 2026

Copy link
Copy Markdown
Contributor

This change introduces a merge AbortChecker that is threaded through MergeState into StoredFieldsReader.checkIntegrity and CodecUtil.checksumEntireFile, allowing long-running integrity checks like described in #13354 to be aborted sooner after a merge is aborted.

It is enabled via IndexWriterConfig.setMergeAbortCheckIntervalBytes, which indicates a number of bytes to read during checksumming before checking if the corresponding merge has been aborted.

Only stored fields are wired up but other files like postings and doc values could benefit from the same treatment in a follow-up. Having MergeState expose the AbortChecker.checkAborted method could also be useful to other merging logic like HNSW graphs maybe.

This change introduces a merge AbortChecker that is threaded through MergeState into StoredFieldsReader.checkIntegrity and CodecUtil.checksumEntireFile, allowing long-running integrity checks like in apache#13354 to be aborted sooner after a merge is aborted. It is enabled via IndexWriterConfig.setMergeAbortCheckIntervalBytes, which indicate a number of bytes to read during checksumming before checking if the corresponding merge has been aborted.

Only stored fields are wired up but other files like postings and doc values could benefit from the same treatment in a follow-up. Having MergeState expose the AbortChecker.checkAborted method can also be useful to other merging logic I think.

@romseygeek romseygeek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tlrx, this looks great. I left a couple of comments.

}
}

final MergePolicy.AbortChecker mergeAbortChecker;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a factory method that takes a OneMerge and a check interval and returns NO_OP if the interval is 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I added MergePolicy.AbortChecker.create() in 227c021.


/** Checks if the merge should be aborted, throwing MergeAbortedException if so. */
public void checkAborted() throws MergeAbortedException {
oneMerge.checkAborted();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw an NPE if it's called on NO_OP which I think is trappy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oups, I pushed a68f87c

@romseygeek

Copy link
Copy Markdown
Contributor

Related: #13354

@tlrx tlrx requested a review from romseygeek May 19, 2026 16:20
private final OneMerge oneMerge;
private final int abortCheckIntervalBytes;

public AbortChecker(OneMerge oneMerge, int abortCheckIntervalBytes) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this private and do all construction via the factory method.

@mikemccand mikemccand left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we need a whole new class/interface, touching so many existing classes, for this. If it's just a "check every N bytes" why not implement that more directly -- store that int/long somewhere and fix checkIntegrity.

I'm also not a fan of only doing stored fields -- other Lucene index files can be ginormous (vectors) -- can we impl for those as well? Or if it really must only be stored fields, let's confess so in the docs for this?

Finally, I'm not convinced we understand the root cause here, so I think we may be fixing a "not actually a problem". At Amazon we've also seen some evidence that merge abort was insanely slow, but somehow never followed up on it / got to root cause. Can we first spend some time back on the issue doing that? For example, is there maybe a bug that's turned off the entire abort checking best effort system? (It is similar to Thread.interrupt best effort checking, but hopefully better!).

It's hard for me to believe checkIntegrity can take minutes on any Lucene index file unless the storage system holding the index is insanely slow. Or, the index is insanely massive. Both are possible :)

*
* @param intervalBytes the interval in bytes, must be positive
*/
public IndexWriterConfig setMergeAbortCheckIntervalBytes(int intervalBytes) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this long? In general if something is measuring bytes let's try to use long -- we do math on such things that may lead to overflow?

@rmuir rmuir May 22, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not need a configurable value at all, just a reasonable one (e.g. 1MB). If someone is doing s3, even setting this to 1 could take "forever" due to network latency. We shouldn't overengineer here.

If we can simplify how we do it (see idea on CodecUtil.java), i'd suggest overloading the implementing function with a parameter, to allow unit tests to use a small value, but it doesn't need to be public.

* Returns the interval in bytes between abort checks during merge integrity verification. See
* {@link IndexWriterConfig#setMergeAbortCheckIntervalBytes(int)}.
*/
public int getMergeAbortCheckIntervalBytes() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long instead?

* <p>The check is performed in {@link #readBytes} since that is what {@link
* ChecksumIndexInput#seek} calls in a loop to compute the checksum.
*/
private static final class AbortableBufferedChecksumIndexInput

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're adding another level of buffering / copying, when someone turns on this new setting? I think? Can we subclass FilterIndexInput instead, so it's simply/only the added accounting (tracking total bytes written) and not more buffering / copying?

protected IndexWriterEventListener eventListener;

/** Interval in bytes between abort checks during merge integrity verification. */
protected volatile int mergeAbortCheckIntervalBytes;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only checkIntegrity that we are newly instrumenting here?

abstract void checkIntegrity() throws IOException;

/** Check the integrity of the index, with periodic merge abort checking. */
public abstract void checkIntegrity(MergePolicy.AbortChecker abortChecker) throws IOException;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are only adding "check every N bytes", can we rename AbortChecker to something more specific? Is this a public API, or will Lucene users only interact with the setter/getter on IWC (taking just int or long)?

@mikemccand

Copy link
Copy Markdown
Member

Also, since things can vary drastically depending on the env, let's use time/seconds as the "check every", not bytes? We can hardwire this to something reasonable that's surely in the noise of overall merge cost.

For merges, IndexWriter already wraps the provided Directory to track which files are created for Codec file-tracking purposes, I think? If so, we could in theory wrap that to also instrument every IndexInput with this "check every N seconds", maybe.

We should be able to make this fix (once we really understand root cause -- do any of our unit tests check for prompt merge aborts?) without having to expose another knob for Lucene users to tune ...

@tlrx

tlrx commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @romseygeek and @mikemccand for your feedback!

Before addressing code review comments, I'll try to reply to the higher-level questions:

I'm not convinced we understand the root cause here, so I think we may be fixing a "not actually a problem". At Amazon we've also seen some evidence that merge abort was insanely slow, but somehow never followed up on it / got to root cause. Can we first spend some time back on the issue doing that?

It's hard for me to believe checkIntegrity can take minutes on any Lucene index file unless the storage system holding the index is insanely slow. Or, the index is insanely massive. Both are possible :)

Yes, the storage systems were indeed insanely slow when it happened. I've seen this in two situations: disks saturated with IO (caused I think by many concurrent force merges running on non top notch SSDs) and storage backed by object storage like S3 with high read latency hiccups, as David describes in #13354 (comment). In those situations, the Lucene index needed to be closed for system maintenance purpose or to be reopened on a more performant machine.

For example, is there maybe a bug that's turned off the entire abort checking best effort system?

I don't think there is a bug in the existing abort checking system itself. The existing merge.checkAborted() calls
work correctly at the points where they are checked (there are 2 or 3 of such calls before and in IndexWriter.mergeMiddle()), but once the merge thread starts the integrity checks on the segment readers to merge, then it is committed to finishing reading all the files associated to all reader to merge before it can observe the abort flag again.

Note that this change adds the checkAborted() method to MergeState, so we could maybe add additional checks on the aborted flag before checking the integrity of a reader. That would be less intrusive and already an improvement in my opinion, since the merge will be dropped later anyway.

do any of our unit tests check for prompt merge aborts?

I found some tests that abort merges but they don't check that the abort happens promptly. Specifically, there is no
test that verifies how "quickly" a merge reacts to the abort signal after it has entered checkIntegrity / checksumEntireFile.

I'm also not a fan of only doing stored fields -- other Lucene index files can be ginormous (vectors) -- can we impl for those as well?

Yes I mentioned in the PR description that other files could benefit from this but wanted to keep the radius of changes low at first.

For vectors, I wonder if more checkAborted() calls could be added too when building large graphs during merges.

For merges, IndexWriter already wraps the provided Directory to track which files are created for Codec file-tracking purposes, I think? If so, we could in theory wrap that to also instrument every IndexInput with this "check every N seconds", maybe.

I explored the MergeScheduler#wrapForMerge path but noticed that many readers were not reopening the files/inputs for integrity checks but instead use clones of already opened inputs, making it hard to wrap the index inputs. The typical pattern seems to be reader -> reader merge instance > clone inputs -> seek to zero (then skip by reading to compute the checksum).

I also explored wrapping the readers (StoredFieldsReader, DocValuesProducer etc) using MergePolicy.OneMerge#wrapForMerge(CodecReader) to override the checkIntegry methods but noticed that it would disable some bulk merges optimizations down the road.

Can we subclass FilterIndexInput instead [...]
since things can vary drastically depending on the env, let's use time/seconds as the "check every", not bytes? We can hardwire this to something reasonable that's surely in the noise of overall merge cost.

If we're only accounting for time spent reading bytes for checksumming purposes without tying it to the OneMerge aborted flag, then I agree we can subclass FilterIndexInput directly in checksumEntireFile.

So I think this leaves us with different options:

  1. Implement a FilterIndexInput in CodecUtil.checksumEntireFile() that periodically checks elapsed time
  2. Use the checkAborted() method added to MergeState in this pull request before each reader.checkIntegrity() call in the various writer/consumer merge methods.
  3. Explore wrapping the merge Directory via MergeScheduler.wrapForMerge to return wrapped IndexInputs that periodically checks checkAborted (may require to reopen files instead of relying on clones)
  4. Continue with the current proposal as-is.

I think we could start with option 2 as a simple first improvement in this PR (doesn't interrupt in-progress checksums but avoids starting new ones on an already-aborted merge) and follow up with option 1 or 3 to interrupt long-running checksums.

Does that sounds reasonable? If so I can rework the PR.

+ footerLength(),
input);
}
in.seek(in.length() - footerLength());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the call that will cause your i/o. I wonder if, instead of wrapping in a subclass, if we could just change it to a small loop, that only seek()s say, 1MB at once, and calls checkabort() after each iteration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rmuir, checking every ~1MB seek makes sense and the public configurable value on IndexWriterConfig would become unnecessary.

We'd still need to wire the OneMerge::checkAborted call down to CodecUtil.checksumEntireFile though, so the checkIntegrity() signature on the abstract reader classes would need to change (or be overloaded) to accept the abort callback, something that @mikemccand was concerned about.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand his point too. this was just a brainstorm to try to minimize it practically. I feel like wrapping all the reads might be overkill: if we could contain the solution here, I think it would make the merge abort a lot better.

Otherwise, after we checksum, for the most part merge is no longer just "reading" but also doing writes at the same time, so the existing checks should work there.

@mikemccand

Copy link
Copy Markdown
Member

For vectors, I wonder if more checkAborted() calls could be added too when building large graphs during merges.

Actually does vectors merging today ever check for aborted, e.g. during the concurrent HNSW graph building, except at the very start/end?

I'm also not a fan of only doing stored fields -- other Lucene index files can be ginormous (vectors) -- can we impl for those as well?

Yes I mentioned in the PR description that other files could benefit from this but wanted to keep the radius of changes low at first.

Got it, +1 for PnP ("progress not perfection") for sure.

I think a lower level approach (wrapping all IndexOutput created during merge, since we already wrap Directory to track which files exist, and checking every N seconds) might apply broadly (vectors, stored fields, massive postings files or doc values) with smaller blast radius on API surface.

For merges, IndexWriter already wraps the provided Directory to track which files are created for Codec file-tracking purposes, I think? If so, we could in theory wrap that to also instrument every IndexInput with this "check every N seconds", maybe.

I explored the MergeScheduler#wrapForMerge path but noticed that many readers were not reopening the files/inputs for integrity checks but instead use clones of already opened inputs, making it hard to wrap the index inputs. The typical pattern seems to be reader -> reader merge instance > clone inputs -> seek to zero (then skip by reading to compute the checksum).

Hmm I see, that is a wrinkle. I was thinking we could wrap the output path, but you're right, checkIntegrity is doing tons of reading (from existing open thingy) and no writing.

I also explored wrapping the readers (StoredFieldsReader, DocValuesProducer etc) using MergePolicy.OneMerge#wrapForMerge(CodecReader) to override the checkIntegry methods but noticed that it would disable some bulk merges optimizations down the road.

That is so tricky -- how did you catch that bulk optos broke? I hope we have tests that get angry?

I think we could start with option 2 as a simple first improvement in this PR (doesn't interrupt in-progress checksums but avoids starting new ones on an already-aborted merge) and follow up with option 1 or 3 to interrupt long-running checksums.

+1 to at least start with #2 -- that's an easy win!

I also ... dare say ... is there any chance Thread.interrupt is usable again? The wildly unexpected "I close your file handle because you interrupted me while reading some bytes from it" side effect of Thread.interrupt is maybe less of a big deal now (memory-mapped segments don't have this problem I think)? Thread.interrupt should work for threads stuck deep in IO for checkIntegrity, except, now you need to know which thread to interrupt.

@mikemccand

Copy link
Copy Markdown
Member

I think during merging we wrap all IndexOutput in order to possible rate-limit the IO (off by default), and to check for aborting every X bytes (on). Maybe the problem here is we only instrumented that write path, not the read path, and checkIntegrity obviously is all about reading. Could we just instrument the read path for abort checking, symmetric to the write path, and hopefully as delegator so we don't add extra buffering/copying on read?

@tlrx

tlrx commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Could we just instrument the read path for abort checking, symmetric to the write path, and hopefully as delegator so we don't add extra buffering/copying on read?

The symmetry with the read path is appealing! But I think it would require wrapping (or reopening) the IndexInputs used by the merge instances through the wrapped merge directory. And those merge instances are obtained via getMergeInstance(), so we'd need a way to pass the merge directory to them... I suspect the required changes would be as invasive as what this PR proposes. Unless you were thinking of a different solution?

Overall I think the simplest path is a checkIntegrity(OneMerge) overload that propagates the abort check down to checksumEntireFile, where it checks every ~1MB seek as Robert suggested.

@mikemccand

Copy link
Copy Markdown
Member

Overall I think the simplest path is a checkIntegrity(OneMerge) overload that propagates the abort check down to checksumEntireFile, where it checks every ~1MB seek as Robert suggested.

+1, that's a delightfully simple solution.

Could we just instrument the read path for abort checking, symmetric to the write path, and hopefully as delegator so we don't add extra buffering/copying on read?

The symmetry with the read path is appealing! But I think it would require wrapping (or reopening) the IndexInputs used by the merge instances through the wrapped merge directory. And those merge instances are obtained via getMergeInstance(), so we'd need a way to pass the merge directory to them... I suspect the required changes would be as invasive as what this PR proposes. Unless you were thinking of a different solution?

Ahh, hrpmph, yes. And these merges will often re-use an already opened pooled reader for each segment, and those were opened well before any merging plans (so no chance to wrap the IndexInputs). I agree, not simple...

@tlrx

tlrx commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @mikemccand for the feedback.

I opened #16264 to add additional merge abort checks before executing file integrity checksums (the point 2 that I suggested in #16086 (comment)). Hopefully it's a less controversial change. It also threads OneMerge down to MergeState so that once it is merged I can follow up with the other proposed change:

Overall I think the simplest path is a checkIntegrity(OneMerge) overload that propagates the abort check down to checksumEntireFile, where it checks every ~1MB seek as Robert suggested.

+1, that's a delightfully simple solution.

@tlrx

tlrx commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Closed in favor of #16281

@tlrx tlrx closed this Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants