perf(util): add FixedBitSet.copyOf() fast paths for SparseLiveDocs and DenseLiveDocs#16282
Conversation
FixedBitSet.copyOf(Bits) already has fast paths for FixedBitSet and FixedBits, but SparseLiveDocs and DenseLiveDocs (introduced in apache#15413) fell through to the O(maxDoc) generic loop. Each type now exposes a package-private toFixedBitSet() method that FixedBitSet.copyOf() delegates to: - DenseLiveDocs stores live docs in a FixedBitSet: clone it directly, O(maxDoc/64). - SparseLiveDocs stores deleted docs in a SparseFixedBitSet: pre-fill the backing long[] with -1L and clear only deleted positions using nextSetBit, O(deletedDocs + maxDoc/64). The hot caller is PendingDeletes.getMutableBits(), which invokes copyOf(liveDocs) on the first delete after a snapshot.
| } else if (bits instanceof DenseLiveDocs denseLiveDocs) { | ||
| return denseLiveDocs.toFixedBitSet(); | ||
| } else if (bits instanceof SparseLiveDocs sparseLiveDocs) { | ||
| return sparseLiveDocs.toFixedBitSet(); |
There was a problem hiding this comment.
Can we have an interface (Have FixedBitSet, DenseLiveDocs, and SparseLiveDocs all implement it) which could be used here instead of multiple if/else if?
There was a problem hiding this comment.
Thanks for the suggestion. The interface would make copyOf() cleaner, but the tricky part is that FixedBitSet itself would also need to implement it (to handle the case after the FixedBits unwrap at the top of the method). That means adding a toFixedBitSet() method to FixedBitSet whose only implementation is return clone(), which feels redundant and a bit odd semantically. Happy to go that route if the consensus is that the cleaner dispatch is worth it, but leaning toward keeping the instanceof chain since it mirrors the existing pattern already in the method for FixedBits/FixedBitSet.
There was a problem hiding this comment.
That said, if the interface only covers DenseLiveDocs and SparseLiveDocs (not FixedBitSet), the semantic oddity goes away. Is that what you had in mind?
There was a problem hiding this comment.
I think either way is fine - if it leads to a reduction of instanceof, not a huge fan of it (if it keeps on branching)
Replace the raw long[] pre-fill approach with FixedBitSet.set(0, maxDoc) followed by result.clear(doc) in the deletion loop. The two approaches are semantically identical: set(0, maxDoc) fills the backing array with -1L and masks off the ghost bits in the last word in one call.
|
This PR will prevent the function from being inlined anymore (I do not know if it is important). Previously it would work with bimorphic inlining. |
I think you're referring to The As a result, the trade-off I see is: an |
|
Please do not answer me with an LLM. Your AI is wrong about this. |
rmuir
left a comment
There was a problem hiding this comment.
looks like a de-optimization with a benchmark that hides it
I wrote the benchmark to compare against the generic loop, which seems to me like a fair baseline. What scenario do you think it is hiding? Also, the previous comment was mine. |
TL;DR
Make
FixedBitSet.copyOf()135x to 193x faster forDenseLiveDocsand 11x to 48x faster forSparseLiveDocs.Summary
FixedBitSet.copyOf(Bits)has fast paths forFixedBitSetandFixedBitsbut not forSparseLiveDocsandDenseLiveDocs, the twoLiveDocstypes introduced in #15413. Both fall through to the generic O(maxDoc) per-bit loop. The hot caller isPendingDeletes.getMutableBits(), which invokesFixedBitSet.copyOf(liveDocs)on the first delete after a reader snapshot. Under write-heavy workloads this cost accumulates across open reader generations.Each type now exposes a package-private
toFixedBitSet()method thatFixedBitSet.copyOf()delegates to, keeping the copy logic next to the data it knows about.DenseLiveDocsstores live docs in aFixedBitSetwith identical semantics, sotoFixedBitSet()clones it at O(maxDoc/64).SparseLiveDocsstores deleted positions in aSparseFixedBitSet, sotoFixedBitSet()allocates aFixedBitSet, callsset(0, maxDoc)to mark all docs live in O(maxDoc/64), then iterates only the deleted positions vianextSetBitclearing each one, for a total of O(maxDoc/64 + deletedDocs).Benchmarks
LiveDocsCopyOfBenchmark,FixedBitSet.copyOf()average time (us/op),-wi 5 -i 7 -f 3. Baseline =mainHEAD; contender = this PR.DenseLiveDocs
SparseLiveDocs
The
SparseLiveDocsspeedup shrinks as the deletion rate grows: the contender always pays O(maxDoc/64) to fill the backing array viaset(0, maxDoc), and on top of that clears one position per deleted document. At low deletion rates the fill dominates and the gap with the O(maxDoc) baseline is large; at higher rates the clearing loop contributes more and the advantage narrows.