Check if merge is aborted during file integrity checksums#16281
Conversation
This pull request follows the discussion on apache#16086. It adds a new `CodecUtil.checksumEntireFile(IndexInput, OneMerge)` that periodically checks whether the merge has been aborted while reading through the file. This avoids spending time checksumming large files when the merge is already cancelled. The check runs every 1 MB. This idea was suggested by Robert (thanks!). It also adds a `checkIntegrity(OneMerge)` to all codec reader/producer base classes, with a default implementation that delegates to the existing `checkIntegrity()`. I think I updated all current codec implementations to propagate the merge to CodecUtil.checksumEntireFile(IndexInput, OneMerge). An alternative change would be to make `checkIntegrity(OneMerge)` abstract and `checkIntegrity()` final, delegating to `checkIntegrity(null)`. That would prevent subclasses from silently ignoring the merge parameter but it would require updating all (backwards and third party) codec implementations. Or just replace `checkIntegrity()` by `checkIntegrity(OneMerge)`. Happy to hear our thoughts on this.
|
@mikemccand @rmuir If you have the time I'd be happy to have your feedback on this, thanks! |
Maybe we can avoid the delegator and just change the signature of the existing functions? Delegators can sometimes cause problems, maybe it is best to avoid them with this one |
| } | ||
|
|
||
| static long checksumEntireFile( | ||
| IndexInput input, MergePolicy.OneMerge merge, long abortCheckInterval) throws IOException { |
There was a problem hiding this comment.
great. maybe at some point this could be generalized into a built-in functional interface/predicate? I do see it as just "passing in an optional progress function".
I feel like the API would be cleaner, but when I looked more, it doesn't seem easy.
for now it seems passing OneMerge directly, like you have it, is probably the reasonable choice. The MergeAbortedException is not "normal" and gets handled in a special way by IndexWriter.
There was a problem hiding this comment.
Thanks for your feedback. Having tried various approach previously, I also think passing OneMerge is the simpler for now, but happy to revisit this if you change your mind.
I pushed 73aa8b9 to change |
|
looks great. I'm just waiting for release activity to settle before merging anything big into main. |
This pull request follows the discussion on #16086.
It adds a new
CodecUtil.checksumEntireFile(IndexInput, OneMerge)that periodically checks whether the merge has been aborted while reading through the file. This avoids spending time checksumming large files when the merge is already cancelled. The check runs every 1 MB. This idea was suggested by Robert (thanks!).It also adds a
checkIntegrity(OneMerge)to all codec reader/producer base classes, with a default implementation that delegates to the existingcheckIntegrity(). I think I updated all current codec implementations to propagate the merge to CodecUtil.checksumEntireFile(IndexInput, OneMerge).An alternative change would be to make
checkIntegrity(OneMerge)abstract andcheckIntegrity()final, delegating tocheckIntegrity(null). That would prevent subclasses from silently ignoring the merge parameter but it would require updating all (backwards and third party) codec implementations. Or just replacecheckIntegrity()bycheckIntegrity(OneMerge). Happy to hear our thoughts on this.Close #13354