Add bulk BinaryDocValues#binaryValues API by costin · Pull Request #16286 · apache/lucene

costin · 2026-06-23T19:31:52Z

Adds a dedicated method for bulk processing, mirroring NumericDocValues#longValues for binary fields.
It provides a default implementation that falls back to per-doc access, plus a Lucene90 codec override that reads directly from
the data slice (fixed-length: readBytes(doc * length); variable-length: batch address lookup).
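For illustration only, here is a minimal sketch of the two addressing schemes named above. It is not the Lucene90 code: `DenseBinarySliceSketch`, its fields, and the use of `byte[]` in place of `BytesRef` are hypothetical stand-ins, and the data slice is modelled as an in-memory array.

```java
import java.util.Arrays;

/** Hypothetical model of a dense binary doc-values data slice; not Lucene code. */
final class DenseBinarySliceSketch {
  private final byte[] data;      // all values concatenated in doc order
  private final long[] addresses; // variable-length only: value of doc d is [addresses[d], addresses[d + 1])
  private final int fixedLength;  // > 0 for fixed-length fields, -1 for variable-length

  private DenseBinarySliceSketch(byte[] data, long[] addresses, int fixedLength) {
    this.data = data;
    this.addresses = addresses;
    this.fixedLength = fixedLength;
  }

  static DenseBinarySliceSketch fixed(byte[] data, int length) {
    return new DenseBinarySliceSketch(data, null, length);
  }

  static DenseBinarySliceSketch variable(byte[] data, long[] addresses) {
    return new DenseBinarySliceSketch(data, addresses, -1);
  }

  /** Writes the value of docs[docsOffset + i] to values[valuesOffset + i] for 0 <= i < size. */
  void binaryValues(int size, int[] docs, int docsOffset, byte[][] values, int valuesOffset) {
    for (int i = 0; i < size; i++) {
      int doc = docs[docsOffset + i];
      long start;
      long end;
      if (fixedLength > 0) {
        // Fixed-length: the position follows from the doc ID alone.
        start = (long) doc * fixedLength;
        end = start + fixedLength;
      } else {
        // Variable-length: two neighbouring address entries bound the value.
        start = addresses[doc];
        end = addresses[doc + 1];
      }
      values[valuesOffset + i] = Arrays.copyOfRange(data, (int) start, (int) end);
    }
  }
}
```

The parameter order follows the `binaryValues(size, docs, docsOffset, values, valuesOffset)` signature quoted in the review below; everything else is an assumption made to keep the example self-contained.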

Benchmark

AMD EPYC 7R32 (c5a.2xlarge, AVX2), JDK 25 Temurin, 1M docs, batchSize=1024

Codec override benefit

binaryValuesBulk on default per-doc impl vs binaryValuesBulk with Lucene90 override:

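| encoding | valueLength | default (ops/s) | codec override (ops/s) | ratio |
|----------|-------------|-----------------|------------------------|-------|
| fixed    | 8           | 39,821          | 65,940                 | 1.66x |
| fixed    | 32          | 38,643          | 59,785                 | 1.55x |
| fixed    | 128         | 31,571          | 46,174                 | 1.46x |
| variable | 8           | 19,930          | 26,154                 | 1.31x |
| variable | 32          | 15,222          | 17,754                 | 1.17x |
| variable | 128         | 12,676          | 15,243                 | 1.20x |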

End-to-end improvement

Per-doc access on main vs bulk API with codec override:

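| encoding | valueLength | per-doc (ops/s) | bulk+codec (ops/s) | ratio |
|----------|-------------|-----------------|--------------------|-------|
| fixed    | 8           | 48,048          | 65,940             | 1.37x |
| fixed    | 32          | 48,226          | 59,785             | 1.24x |
| fixed    | 128         | 39,443          | 46,174             | 1.17x |
| variable | 8           | 20,641          | 26,154             | 1.27x |
| variable | 32          | 15,968          | 17,754             | 1.11x |
| variable | 128         | 14,060          | 15,243             | 1.08x |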

Mirrors NumericDocValues#longValues for binary fields. Default impl: per-doc advanceExact + deepCopyOf fallback. Null for missing docs. Includes CheckIndex validation, AssertingLeafReader assertions, dense+sparse tests.
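The default fallback described here could look roughly like the sketch below. This is an assumption-laden illustration, not the code in this PR: `PerDocBinary` is a hypothetical stand-in for the per-doc iterator, `byte[]` replaces `BytesRef`, and `clone()` plays the role of `deepCopyOf`.

```java
import java.util.Map;

/** Hypothetical sketch of a per-doc fallback for a bulk read; not Lucene code. */
final class DefaultBulkFallbackSketch {
  /** Stand-in for the per-doc API: position on a doc, then read its value. */
  interface PerDocBinary {
    boolean advanceExact(int doc);

    byte[] binaryValue(); // may hand out a buffer that is reused between calls
  }

  static void binaryValues(
      PerDocBinary in, int size, int[] docs, int docsOffset, byte[][] values, int valuesOffset) {
    for (int i = 0; i < size; i++) {
      if (in.advanceExact(docs[docsOffset + i])) {
        // Copy, because the source may overwrite its buffer on the next call.
        values[valuesOffset + i] = in.binaryValue().clone();
      } else {
        // Docs without a value are reported as null.
        values[valuesOffset + i] = null;
      }
    }
  }

  /** Tiny sparse source backed by a map, used only to exercise the sketch. */
  static PerDocBinary fromMap(Map<Integer, byte[]> byDoc) {
    return new PerDocBinary() {
      private byte[] current;

      @Override
      public boolean advanceExact(int doc) {
        current = byDoc.get(doc);
        return current != null;
      }

      @Override
      public byte[] binaryValue() {
        return current;
      }
    };
  }
}
```

The copy is what makes the returned entries safe to keep after the loop moves on, at the cost of one allocation per value.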

Dense fixed-length and variable-length binary doc values bypass virtual dispatch by reading directly from the data slice. Includes JMH benchmark comparing bulk vs per-doc.

martijnvg · 2026-06-26T12:46:04Z

+   * @param valuesOffset first position in {@code values} to write
+   */
+  public void binaryValues(
+      int size, int[] docs, int docsOffset, BytesRef[] values, int valuesOffset)


I wonder whether we can use something more compact than BytesRef[] here. Maybe BytesRefArray?
Or maybe let the consumer side choose how to collect the values? Maybe we can replace values with an ObjIntConsumer or a custom functional interface.
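One possible reading of the consumer-based suggestion, sketched purely for illustration: the method hands each value to a `java.util.function.ObjIntConsumer` instead of filling an array. The class name, the `valuesByDoc` source, and the use of `byte[]` in place of `BytesRef` are all hypothetical.

```java
import java.util.function.ObjIntConsumer;

/** Hypothetical consumer-based variant of a bulk read; not part of this PR. */
final class ConsumerBulkSketch {
  /** Calls sink.accept(valueOrNull, doc) once per requested doc. */
  static void binaryValues(
      byte[][] valuesByDoc, int size, int[] docs, int docsOffset, ObjIntConsumer<byte[]> sink) {
    for (int i = 0; i < size; i++) {
      int doc = docs[docsOffset + i];
      // One interface call per value: the caller decides how to store it.
      sink.accept(valuesByDoc[doc], doc);
    }
  }
}
```

In this shape the caller controls storage, but every value goes through one `accept` call.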

costin · 2026-06-26

I know what you mean. However, I've looked at BytesRefArray, FixedLengthBytesRefArray, BytesRefBlockPool, and PagedBytes, and none of them seem to fit: they are all append-only and have no null support (needed for docs without values).

A consumer interface re-introduces per-value virtual dispatch, which is exactly what the bulk API eliminates, and it prevents the codec from doing contiguous bulk reads.

For dense fields with contiguous doc IDs, the codec now does a single readBytes() into one shared byte[] and all returned BytesRef entries are views into it (one allocation instead of N). This works for both fixed and variable length. Non-contiguous docs fall back to per-value reads.
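The contiguous fast path described above might be sketched as follows for the fixed-length case. This is a simplified model under stated assumptions, not the codec code: `View` is a hypothetical stand-in for `BytesRef`, the data slice is an in-memory `byte[]`, and `Arrays.copyOfRange` stands in for the single `readBytes()` call.

```java
import java.util.Arrays;

/** Hypothetical sketch of a shared-buffer bulk read for a dense fixed-length field. */
final class ContiguousBulkReadSketch {
  /** Stand-in for BytesRef: an (offset, length) view into a byte[]. */
  record View(byte[] bytes, int offset, int length) {
    byte[] toArray() {
      return Arrays.copyOfRange(bytes, offset, offset + length);
    }
  }

  /** The value of doc d occupies data[d * len, (d + 1) * len). */
  static void binaryValues(
      byte[] data, int len, int size, int[] docs, int docsOffset, View[] values, int valuesOffset) {
    if (size == 0) {
      return;
    }
    // Check that the requested doc IDs form one run: d, d + 1, d + 2, ...
    boolean contiguous = true;
    for (int i = 1; i < size && contiguous; i++) {
      contiguous = docs[docsOffset + i] == docs[docsOffset + i - 1] + 1;
    }
    if (contiguous) {
      int first = docs[docsOffset];
      // One read into one shared byte[]; every entry is a view into it.
      byte[] shared = Arrays.copyOfRange(data, first * len, (first + size) * len);
      for (int i = 0; i < size; i++) {
        values[valuesOffset + i] = new View(shared, i * len, len);
      }
    } else {
      // Non-contiguous docs: fall back to one read (and one byte[]) per value.
      for (int i = 0; i < size; i++) {
        int doc = docs[docsOffset + i];
        byte[] own = Arrays.copyOfRange(data, doc * len, (doc + 1) * len);
        values[valuesOffset + i] = new View(own, 0, len);
      }
    }
  }
}
```

In this sketch the saving is in `byte[]` allocations; a `View` object is still created for every entry.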

martijnvg · 2026-06-26T12:46:13Z

 import org.apache.lucene.util.BytesRef;

-/** A per-document numeric value. */
+/** A per-document binary value. */


Allocate memory only if the backing BytesRef is not contiguous

costin added 2 commits June 23, 2026 20:08

Add bulk BinaryDocValues#binaryValues API

e16dc67

Add Lucene90 codec overrides for binaryValues

5d5ad9b

github-actions Bot added module:core/index module:core/codecs module:test-framework labels Jun 23, 2026

github-actions Bot added this to the 10.6.0 milestone Jun 23, 2026

Update CHANGES.txt

4cde579

martijnvg reviewed Jun 26, 2026

Improve SIMD implementation by reducing allocations

d26fd36

Allocate memory only if the backing BytesRef is not contiguous

costin requested a review from martijnvg June 26, 2026 20:54

costin wants to merge 4 commits into apache:main from costin:lucene/binary-values-bulk-api
