Add bulk BinaryDocValues#binaryValues API #16286
Mirrors NumericDocValues#longValues for binary fields. Default impl: per-doc advanceExact + deepCopyOf fallback. Null for missing docs. Includes CheckIndex validation, AssertingLeafReader assertions, dense+sparse tests.
Dense fixed-length and variable-length binary doc values bypass virtual dispatch by reading directly from the data slice. Includes JMH benchmark comparing bulk vs per-doc.
 * @param valuesOffset first position in {@code values} to write
 */
public void binaryValues(
    int size, int[] docs, int docsOffset, BytesRef[] values, int valuesOffset)
I wonder whether we can use something more compact than BytesRef[] here. Maybe BytesRefArray?
Or maybe let the consumer side choose how to collect the values? We could replace values with an ObjIntConsumer or a custom functional interface.
I know what you mean. However, I've looked at BytesRefArray, FixedLengthBytesRefArray, BytesRefBlockPool, and PagedBytes, and none of them fit: they are all append-only and have no null support (needed for docs without values).
A consumer interface re-introduces per-value virtual dispatch, which is exactly what the bulk API is meant to eliminate, and it prevents the codec from doing contiguous bulk reads.
For dense fields with contiguous doc IDs, the codec now does a single readBytes() into one shared byte[] and all returned BytesRef entries are views into it (one allocation instead of N). This works for both fixed and variable length. Non-contiguous docs fall back to per-value reads.
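To make the shared-buffer idea concrete, here is a minimal sketch for the fixed-length case. It is not the PR's codec code: `View` is a hypothetical stand-in for BytesRef, a plain `byte[]` stands in for the data slice, and the contiguity check assumes the requested doc IDs are strictly increasing.

```java
public class SharedBufferSketch {
  /** Hypothetical stand-in for BytesRef: a (bytes, offset, length) view. */
  public static final class View {
    public final byte[] bytes;
    public final int offset;
    public final int length;

    View(byte[] bytes, int offset, int length) {
      this.bytes = bytes;
      this.offset = offset;
      this.length = length;
    }
  }

  /** Assumes strictly increasing doc IDs, so first/last spacing implies no gaps. */
  public static boolean isContiguous(int[] docs, int docsOffset, int size) {
    return size > 0 && docs[docsOffset + size - 1] - docs[docsOffset] == size - 1;
  }

  /** Reads fixed-length values; 'data' stands in for the codec's data slice. */
  public static void readFixed(
      byte[] data, int length, int size, int[] docs, int docsOffset,
      View[] values, int valuesOffset) {
    if (isContiguous(docs, docsOffset, size)) {
      // One copy into one shared array; every result is a view into it.
      byte[] shared = new byte[size * length];
      System.arraycopy(data, docs[docsOffset] * length, shared, 0, shared.length);
      for (int i = 0; i < size; i++) {
        values[valuesOffset + i] = new View(shared, i * length, length);
      }
    } else {
      // Non-contiguous docs: fall back to one read and one array per value.
      for (int i = 0; i < size; i++) {
        byte[] own = new byte[length];
        System.arraycopy(data, docs[docsOffset + i] * length, own, 0, length);
        values[valuesOffset + i] = new View(own, 0, length);
      }
    }
  }
}
```

One consequence of this design is that the returned views keep the whole shared array alive as long as any one of them is referenced.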
import org.apache.lucene.util.BytesRef;

/** A per-document numeric value. */
/** A per-document binary value. */
Allocate memory only if the backing BytesRef is not contiguous
Add a dedicated method for bulk processing. Mirrors NumericDocValues#longValues for binary fields.
Provides a default implementation that falls back to per-doc access, plus a Lucene90 codec override that reads directly from
the data slice (fixed-length: readBytes(doc * length), variable-length: batch address lookup).
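The default per-doc fallback can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `PerDocBinary` is a hypothetical stand-in for the per-doc reader, `byte[]` replaces BytesRef, and `Arrays.copyOf` plays the role of deepCopyOf.

```java
import java.util.Arrays;

public class BulkFallbackSketch {
  /** Hypothetical stand-in for a per-doc binary reader; not the Lucene class. */
  public interface PerDocBinary {
    boolean advanceExact(int doc);

    /** May return a reused buffer that is only valid until the next advance. */
    byte[] binaryValue();
  }

  /** One advanceExact and one copy per requested doc; null when the doc has no value. */
  public static void binaryValues(
      PerDocBinary dv, int size, int[] docs, int docsOffset,
      byte[][] values, int valuesOffset) {
    for (int i = 0; i < size; i++) {
      if (dv.advanceExact(docs[docsOffset + i])) {
        byte[] v = dv.binaryValue();
        // Copy, because the source buffer is overwritten by the next advance.
        values[valuesOffset + i] = Arrays.copyOf(v, v.length);
      } else {
        values[valuesOffset + i] = null;
      }
    }
  }

  /** Toy sparse source: even docs hold a one-byte value, odd docs hold none. */
  public static PerDocBinary evenDocsOnly() {
    return new PerDocBinary() {
      private final byte[] scratch = new byte[1];

      @Override
      public boolean advanceExact(int doc) {
        scratch[0] = (byte) doc;
        return doc % 2 == 0;
      }

      @Override
      public byte[] binaryValue() {
        return scratch;
      }
    };
  }
}
```

The copy per value is what makes this path safe but allocation-heavy, and it is the cost a codec override can avoid by reading straight from the data slice.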
Benchmark
AMD EPYC 7R32 (c5a.2xlarge, AVX2), JDK 25 Temurin, 1M docs, batchSize=1024
Codec override benefit
binaryValuesBulk on default per-doc impl vs binaryValuesBulk with Lucene90 override:
End-to-end improvement
Per-doc access on main vs bulk API with codec override: