
perf(util): add FixedBitSet.copyOf() fast paths for SparseLiveDocs and DenseLiveDocs#16282

Open
salvatorecampagna wants to merge 5 commits into apache:main from salvatorecampagna:perf/fixedbitset-copyof-livedocs-fast-paths

salvatorecampagna commented Jun 22, 2026

Contributor

TL;DR

Make FixedBitSet.copyOf() 135x to 193x faster for DenseLiveDocs and 11x to 48x faster for SparseLiveDocs.

Summary

FixedBitSet.copyOf(Bits) has fast paths for FixedBitSet and FixedBits but not for SparseLiveDocs and DenseLiveDocs, the two LiveDocs types introduced in #15413. Both fall through to the generic O(maxDoc) per-bit loop. The hot caller is PendingDeletes.getMutableBits(), which invokes FixedBitSet.copyOf(liveDocs) on the first delete after a reader snapshot. Under write-heavy workloads this cost accumulates across open reader generations.

Each type now exposes a package-private toFixedBitSet() method that FixedBitSet.copyOf() delegates to, keeping the copy logic next to the data it knows about. DenseLiveDocs stores live docs in a FixedBitSet with identical semantics, so toFixedBitSet() clones it at O(maxDoc/64). SparseLiveDocs stores deleted positions in a SparseFixedBitSet, so toFixedBitSet() allocates a FixedBitSet, calls set(0, maxDoc) to mark all docs live in O(maxDoc/64), then iterates only the deleted positions via nextSetBit, clearing each one, for a total of O(maxDoc/64 + deletedDocs).
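
For readers without the diff at hand, the two copy strategies described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class `LiveDocsCopySketch` and its methods are hypothetical, a plain `long[]` stands in for FixedBitSet's backing words, and an `int[]` of doc IDs stands in for iterating a SparseFixedBitSet with nextSetBit.

```java
import java.util.Arrays;

/** Standalone sketch, not Lucene code: a long[] stands in for FixedBitSet's backing words. */
final class LiveDocsCopySketch {

  /** Number of 64-bit words needed to hold numBits bits. */
  static int wordCount(int numBits) {
    return (numBits + 63) >>> 6;
  }

  /** Dense case: live bits are already stored word-wise, so the copy is an array clone. */
  static long[] copyDense(long[] liveWords) {
    return liveWords.clone();
  }

  /**
   * Sparse case: mark every doc live, then clear only the deleted ones.
   * deletedDocs holds doc IDs in [0, maxDoc); it stands in for nextSetBit iteration.
   */
  static long[] copySparse(int maxDoc, int[] deletedDocs) {
    long[] words = new long[wordCount(maxDoc)];
    Arrays.fill(words, -1L); // all docs live, O(maxDoc/64)
    int tail = maxDoc & 63;
    if (tail != 0) {
      // Mask off the ghost bits at or beyond maxDoc in the last word.
      words[words.length - 1] = (1L << tail) - 1;
    }
    for (int doc : deletedDocs) {
      // For long shifts Java uses only the low 6 bits of the shift count.
      words[doc >>> 6] &= ~(1L << doc);
    }
    return words;
  }

  /** Reads one bit from the word array. */
  static boolean get(long[] words, int index) {
    return (words[index >>> 6] & (1L << index)) != 0;
  }
}
```

The sparse path touches each word once for the fill and then one word per deleted doc, which is where the O(maxDoc/64 + deletedDocs) bound comes from.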

Benchmarks

LiveDocsCopyOfBenchmark, FixedBitSet.copyOf() average time (us/op), -wi 5 -i 7 -f 3. Baseline = main HEAD; contender = this PR.

DenseLiveDocs

| maxDoc | deletion rate | baseline (us) | err% | contender (us) | err% | speedup |
|---|---|---|---|---|---|---|
| 1M | 0.1% | 317.838 +/- 5.686 | 1.8% | 2.362 +/- 0.039 | 1.7% | 135x |
| 10M | 0.1% | 3166.801 +/- 91.964 | 2.9% | 20.797 +/- 1.160 | 5.6% | 152x |
| 100M | 0.1% | 31434.474 +/- 234.936 | 0.7% | 198.186 +/- 11.443 | 5.8% | 159x |
| 1M | 1% | 371.669 +/- 6.345 | 1.7% | 2.302 +/- 0.054 | 2.3% | 162x |
| 10M | 1% | 3766.590 +/- 35.200 | 0.9% | 19.476 +/- 0.684 | 3.5% | 193x |
| 100M | 1% | 37788.094 +/- 191.936 | 0.5% | 203.131 +/- 10.855 | 5.3% | 186x |

SparseLiveDocs

| maxDoc | deletion rate | baseline (us) | err% | contender (us) | err% | speedup |
|---|---|---|---|---|---|---|
| 1M | 0.1% | 518.084 +/- 6.739 | 1.3% | 10.900 +/- 0.096 | 0.9% | 48x |
| 10M | 0.1% | 4976.429 +/- 55.761 | 1.1% | 111.454 +/- 1.495 | 1.3% | 45x |
| 100M | 0.1% | 49412.733 +/- 424.527 | 0.9% | 1152.935 +/- 12.952 | 1.1% | 43x |
| 1M | 1% | 1201.822 +/- 43.661 | 3.6% | 97.187 +/- 1.349 | 1.4% | 12x |
| 10M | 1% | 12062.162 +/- 146.840 | 1.2% | 986.402 +/- 14.975 | 1.5% | 12x |
| 100M | 1% | 119511.553 +/- 688.099 | 0.6% | 10884.173 +/- 159.084 | 1.5% | 11x |

The SparseLiveDocs speedup shrinks as the deletion rate grows: the contender always pays O(maxDoc/64) to fill the backing array via set(0, maxDoc), and on top of that clears one position per deleted document. At low deletion rates the fill dominates and the gap with the O(maxDoc) baseline is large; at higher rates the clearing loop contributes more and the advantage narrows.
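
To make the shape of that trade-off concrete, here is a small back-of-the-envelope count of elementary steps for the 1M-doc rows. It is only a sketch: the class `CopyCostModel` is hypothetical, and it counts steps rather than time. A clear driven by nextSetBit and a word fill do not cost the same, so these counts should not be read as predicted speedups.

```java
/** Hypothetical step counter for the two copy strategies; counts steps, not time. */
final class CopyCostModel {

  /** Generic loop: one Bits.get(i) call per document. */
  static long genericSteps(long maxDoc) {
    return maxDoc;
  }

  /** Sparse fast path: one word write per 64 docs plus one clear per deleted doc. */
  static long sparseFastPathSteps(long maxDoc, long deletedDocs) {
    return (maxDoc + 63) / 64 + deletedDocs;
  }

  public static void main(String[] args) {
    long maxDoc = 1_000_000L;
    // 0.1% and 1% deletion rates, matching the benchmark rows for maxDoc = 1M.
    for (long deleted : new long[] {1_000L, 10_000L}) {
      System.out.printf(
          "maxDoc=%d deleted=%d generic=%d fastPath=%d%n",
          maxDoc, deleted, genericSteps(maxDoc), sparseFastPathSteps(maxDoc, deleted));
    }
  }
}
```

The fill term stays fixed at 15,625 word writes for 1M docs, while the clearing term grows tenfold between the two deletion rates, which is the direction the table shows.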

FixedBitSet.copyOf(Bits) already has fast paths for FixedBitSet and
FixedBits, but SparseLiveDocs and DenseLiveDocs (introduced in apache#15413)
fell through to the O(maxDoc) generic loop.

Each type now exposes a package-private toFixedBitSet() method that
FixedBitSet.copyOf() delegates to:

- DenseLiveDocs stores live docs in a FixedBitSet: clone it directly,
  O(maxDoc/64).
- SparseLiveDocs stores deleted docs in a SparseFixedBitSet: pre-fill
  the backing long[] with -1L and clear only deleted positions using
  nextSetBit, O(deletedDocs + maxDoc/64).

The hot caller is PendingDeletes.getMutableBits(), which invokes
copyOf(liveDocs) on the first delete after a snapshot.
@github-actions github-actions Bot added this to the 10.6.0 milestone Jun 22, 2026
@salvatorecampagna salvatorecampagna marked this pull request as ready for review June 22, 2026 18:08
} else if (bits instanceof DenseLiveDocs denseLiveDocs) {
  return denseLiveDocs.toFixedBitSet();
} else if (bits instanceof SparseLiveDocs sparseLiveDocs) {
  return sparseLiveDocs.toFixedBitSet();

Contributor

Can we have an interface (Have FixedBitSet, DenseLiveDocs, and SparseLiveDocs all implement it) which could be used here instead of multiple if/else if?

Contributor Author

Thanks for the suggestion. The interface would make copyOf() cleaner, but the tricky part is that FixedBitSet itself would also need to implement it (to handle the case after the FixedBits unwrap at the top of the method). That means adding a toFixedBitSet() method to FixedBitSet whose only implementation is return clone(), which feels redundant and a bit odd semantically. Happy to go that route if the consensus is that the cleaner dispatch is worth it, but leaning toward keeping the instanceof chain since it mirrors the existing pattern already in the method for FixedBits/FixedBitSet.

Contributor Author

That said, if the interface only covers DenseLiveDocs and SparseLiveDocs (not FixedBitSet), the semantic oddity goes away. Is that what you had in mind?

shubhamsrkdev Jun 24, 2026

Contributor

I think either way is fine, as long as it reduces the number of instanceof checks. I'm not a huge fan of instanceof if the chain keeps on branching.
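
For reference, the narrower option discussed in this thread (an interface covering only the two LiveDocs types) might look roughly like the sketch below. Everything here is hypothetical and standalone: `BitSetConvertible`, `DenseStandIn`, and `SparseStandIn` are illustration-only names, the nested `Bits` interface is a minimal stand-in, and `java.util.BitSet` replaces FixedBitSet so the example runs without Lucene.

```java
import java.util.BitSet;

/** Standalone sketch of interface-based dispatch; all names are hypothetical. */
final class InterfaceDispatchSketch {

  /** Minimal stand-in for a read-only bit view. */
  interface Bits {
    boolean get(int index);

    int length();
  }

  /** Hypothetical interface implemented only by the live-docs stand-ins. */
  interface BitSetConvertible extends Bits {
    BitSet toBitSet();
  }

  /** Stores live docs directly, so conversion is a clone. */
  static final class DenseStandIn implements BitSetConvertible {
    private final BitSet live;
    private final int maxDoc;

    DenseStandIn(BitSet live, int maxDoc) {
      this.live = live;
      this.maxDoc = maxDoc;
    }

    public boolean get(int index) {
      return live.get(index);
    }

    public int length() {
      return maxDoc;
    }

    public BitSet toBitSet() {
      return (BitSet) live.clone();
    }
  }

  /** Stores deleted docs, so conversion fills all bits and clears the deleted ones. */
  static final class SparseStandIn implements BitSetConvertible {
    private final BitSet deleted;
    private final int maxDoc;

    SparseStandIn(BitSet deleted, int maxDoc) {
      this.deleted = deleted;
      this.maxDoc = maxDoc;
    }

    public boolean get(int index) {
      return !deleted.get(index);
    }

    public int length() {
      return maxDoc;
    }

    public BitSet toBitSet() {
      BitSet result = new BitSet(maxDoc);
      result.set(0, maxDoc);
      for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
        result.clear(doc);
      }
      return result;
    }
  }

  /** One type check replaces the per-type chain; other Bits use the per-bit loop. */
  static BitSet copyOf(Bits bits) {
    if (bits instanceof BitSetConvertible convertible) {
      return convertible.toBitSet();
    }
    BitSet result = new BitSet(bits.length());
    for (int i = 0; i < bits.length(); i++) {
      if (bits.get(i)) {
        result.set(i);
      }
    }
    return result;
  }
}
```

With this shape, adding a third live-docs representation would not add another branch to copyOf(). Whether that is worth a new type in the real code base is the open question in this thread.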

Comment thread lucene/core/src/java/org/apache/lucene/util/SparseLiveDocs.java Outdated
Replace the raw long[] pre-fill approach with FixedBitSet.set(0, maxDoc)
followed by result.clear(doc) in the deletion loop. The two approaches
are semantically identical: set(0, maxDoc) fills the backing array with
-1L and masks off the ghost bits in the last word in one call.
rmuir commented Jun 24, 2026

Member

This PR will prevent the function from being inlined anymore (I do not know if it is important). Previously it would work with bimorphic inlining.

salvatorecampagna commented Jun 24, 2026

Contributor Author

> This PR will prevent the function from being inlined anymore (I do not know if it is important). Previously it would work with bimorphic inlining.

I think you're referring to bits.get(i) in the generic fallback loop (that's the only virtual dispatch I see, am I missing anything else?). Before this PR, SparseLiveDocs and DenseLiveDocs both fell through to that loop, so the JIT saw exactly 2 receiver types at that call site and could apply bimorphic inlining for both get() implementations. After this PR those two types take the fast paths and never reach the loop, so that call site loses its 2-type profile, right?

The instanceof chain itself does not introduce megamorphic dispatch, since after each type guard the type is statically known.

As a result, the trade-off I see is: an O(maxDoc) loop with bimorphic virtual get() calls is replaced by O(maxDoc/64) direct word operations with no virtual dispatch. Is that the concern you had in mind?

rmuir commented Jun 24, 2026

Member

Please do not answer me with an LLM. Your AI is wrong about this.

rmuir left a comment

Member

looks like a de-optimization with a benchmark that hides it

@salvatorecampagna

Contributor Author

> looks like a de-optimization with a benchmark that hides it

I wrote the benchmark to compare against the generic loop, which seems to me like a fair baseline.
The fast path didn't exist before, and that is what the new benchmark measures: main without the fast path versus this PR with the fast path.

What scenario do you think it is hiding?

Also, the previous comment was mine.
