Use Panama Vector API to SIMD-evaluate fixed-cardinality sorted numeric range queries in rangeIntoBitSet()#16283
costin wants to merge 2 commits into
Dense fixed-cardinality sorted numeric values can evaluate range blocks with the vectorization provider when the flattened value layout is raw and contiguous. Keep the optimization gated to layouts that benchmark well and retain scalar fallback behavior for other encodings.
 *
 * @lucene.internal
 */
public interface SortedNumericDocValuesRangeSupport {
I don't think a separate interface for this specific use case looks right.
It differs from the existing DocValuesRangeSupport by a single parameter, i.e. cardinality, which doesn't justify a new abstraction layer. Could we add another method to the existing DocValuesRangeSupport instead? Something like:
default void rangeIntoBitSet(LongValues values,
                             int fromDoc,
                             int toDoc,
                             int cardinality,
                             long minValue,
                             long maxValue,
                             FixedBitSet bitSet,
                             int offset) {
  // default to scalar approach.
}
Let me know what you think.
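For illustration, here is a minimal sketch of what the scalar default behind such a method could look like. It is not the PR's code: `java.util.BitSet` and a flat `long[]` stand in for Lucene's `FixedBitSet` and `LongValues`, the class name is hypothetical, and it assumes that doc `d` owns the values at `d * cardinality` through `d * cardinality + cardinality - 1` and that a matching doc sets bit `doc - offset`.

```java
import java.util.BitSet;

final class ScalarRangeSketch {

    /**
     * Sets bit (doc - offset) for every doc in [fromDoc, toDoc) that has at least
     * one value in [minValue, maxValue]. Assumes a flattened, fixed-cardinality layout.
     */
    static void rangeIntoBitSet(long[] values, int fromDoc, int toDoc, int cardinality,
                                long minValue, long maxValue, BitSet bitSet, int offset) {
        for (int doc = fromDoc; doc < toDoc; doc++) {
            int base = doc * cardinality; // first value owned by this doc
            for (int i = 0; i < cardinality; i++) {
                long v = values[base + i];
                if (v >= minValue && v <= maxValue) {
                    bitSet.set(doc - offset);
                    break; // one matching value is enough for the doc to match
                }
            }
        }
    }

    public static void main(String[] args) {
        // 4 docs with 2 values each: {1,2} {10,20} {5,50} {100,200}
        long[] values = {1, 2, 10, 20, 5, 50, 100, 200};
        BitSet bits = new BitSet(4);
        rangeIntoBitSet(values, 0, 4, 2, 5, 20, bits, 0);
        System.out.println(bits); // docs 1 and 2 match
    }
}
```

Keeping the scalar loop as the default means implementations that cannot vectorize a given layout need no extra code.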
Makes sense. I've removed the interface in favor of a new method, which has the nice benefit of reducing the PR size.
27af116 to 015f370
sgup432 left a comment
I have a few minor comments. Also, I think we should add unit tests that cover scenarios with different cardinality values, such as cardinality > 1 and vectorLen % cardinality != 0, among other cases, assuming existing tests don't already cover them.
See my other comments. I've parameterized testSortedNumericRangeIntoBitSetVaryingCardinality to exercise the other cardinalities {2, 3, 4, 5, 7, 8}, which checks both the SIMD and the scalar fallback paths.
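Which of those cardinalities reach the SIMD path depends on the lane count. A small sketch of that arithmetic, assuming the only dispatch rule is the `vectorLen % cardinality != 0` check mentioned in this thread (the class and method names are hypothetical, and the real code may apply further gating):

```java
final class DocsPerVectorSketch {

    /** Returns how many docs fit in one vector, or 0 when the scalar fallback applies. */
    static int docsPerVector(int lanes, int cardinality) {
        if (cardinality <= 0 || lanes % cardinality != 0) {
            return 0; // covers cardinality > lanes as well, e.g. 4 % 8 == 4
        }
        return lanes / cardinality;
    }

    public static void main(String[] args) {
        int[] cardinalities = {2, 3, 4, 5, 7, 8};
        for (int lanes : new int[] {4, 8}) { // 256-bit and 512-bit vectors of longs
            for (int c : cardinalities) {
                int dpv = docsPerVector(lanes, c);
                System.out.println("lanes=" + lanes + " cardinality=" + c + " -> "
                        + (dpv == 0 ? "scalar" : dpv + " docs/vec"));
            }
        }
    }
}
```

Under this rule, cardinalities 3, 5, and 7 fall back to scalar for both lane counts, so the parameterized set covers each path on either machine.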
When the stored values have fixed cardinality and no encoding transforms (no gcd, delta, table, or block compression), the vectorization provider loads N values into a SIMD vector, performs a broadcast range check (>= min AND <= max), collapses per-lane results into a per-doc mask, and OR-writes matching docs into the bitset in one operation.
Falls back to scalar when vectorLen % cardinality != 0 (e.g. vpd=8 on AVX2 with 4-lane vectors).
Benchmark
SortedNumericDocValuesRangeQueryBenchmark, 1M docs, cardinality=fixed, density=dense, queryShape=plain. Branch vs main, JDK 25.0.3.
AMD EPYC 7R32 (c5a.2xlarge): AVX2, 256-bit (4 longs)
SIMD: vpd=2 (2 docs/vec), vpd=4 (1 doc/vec). vpd=8 falls back to scalar.
Intel Xeon 8375C (c6i.2xlarge): AVX-512, 512-bit (8 longs)
SIMD: vpd=2 (4 docs/vec), vpd=4 (2 docs/vec), vpd=8 (1 doc/vec).
Clustered data shows no change since sequential access is already at L1/L2 cache speed; comparison cost is negligible. Wins appear on random data where per-doc cache misses dominate and SIMD batching amortizes comparison overhead.
Gains scale with docsPerVector: vpd=2 on AVX-512 processes 4 docs per vector (best), vpd=8 on AVX2 falls back to scalar (no gain).
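To make the load, compare, collapse, and OR-write steps concrete, here is a plain-Java emulation of one vector step. It deliberately does not use the Vector API: lanes are simulated with a loop and a `long` bit mask, a `long[]` of 64-bit words stands in for the bitset, and all names are hypothetical. It assumes `lanes % cardinality == 0` and `lanes` of at most 64.

```java
final class LaneCollapseSketch {

    /** Bit i of the result is set when doc (firstDoc + i) has any value in [min, max]. */
    static long perDocMask(long[] values, int firstDoc, int lanes, int cardinality,
                           long min, long max) {
        int base = firstDoc * cardinality;
        // Per-lane range check: one bit per lane, as a vector compare mask would give.
        long laneMask = 0L;
        for (int lane = 0; lane < lanes; lane++) {
            long v = values[base + lane];
            if (v >= min && v <= max) {
                laneMask |= 1L << lane;
            }
        }
        // Collapse each group of `cardinality` adjacent lanes into one doc bit.
        long groupBits = (1L << cardinality) - 1;
        long docMask = 0L;
        for (int d = 0; d < lanes / cardinality; d++) {
            if (((laneMask >>> (d * cardinality)) & groupBits) != 0) {
                docMask |= 1L << d;
            }
        }
        return docMask;
    }

    /** ORs docMask into the word array starting at bit firstBit, spilling into the next word if needed. */
    static void orInto(long[] words, int firstBit, long docMask) {
        int word = firstBit >>> 6;
        int shift = firstBit & 63;
        words[word] |= docMask << shift;
        if (shift != 0 && (docMask >>> (64 - shift)) != 0) {
            words[word + 1] |= docMask >>> (64 - shift);
        }
    }

    static void rangeIntoWords(long[] values, int fromDoc, int toDoc, int lanes, int cardinality,
                               long min, long max, long[] words, int offset) {
        int docsPerVector = lanes / cardinality;
        int doc = fromDoc;
        // Full vectors: docsPerVector docs per step.
        for (; doc + docsPerVector <= toDoc; doc += docsPerVector) {
            long docMask = perDocMask(values, doc, lanes, cardinality, min, max);
            if (docMask != 0) {
                orInto(words, doc - offset, docMask);
            }
        }
        // Tail: remaining docs, one at a time.
        for (; doc < toDoc; doc++) {
            if (perDocMask(values, doc, cardinality, cardinality, min, max) != 0) {
                orInto(words, doc - offset, 1L);
            }
        }
    }

    public static void main(String[] args) {
        // 5 docs, 2 values each; 4 lanes gives 2 docs per vector plus a 1-doc tail.
        long[] values = {1, 2, 10, 20, 5, 50, 100, 200, 7, 8};
        long[] words = new long[1];
        rangeIntoWords(values, 0, 5, 4, 2, 5, 20, words, 0);
        System.out.println(Long.toBinaryString(words[0])); // 10110: docs 1, 2 and 4 match
    }
}
```

The emulation shows why throughput tracks docsPerVector: each step does a fixed amount of compare-and-collapse work but resolves `lanes / cardinality` docs, so smaller cardinalities on wider vectors resolve more docs per step.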