
Add bulk BinaryDocValues#binaryValues API #16286

Open
costin wants to merge 4 commits into apache:main from costin:lucene/binary-values-bulk-api


@costin costin commented Jun 23, 2026

Contributor

Add a dedicated method for bulk reads of binary doc values, mirroring NumericDocValues#longValues.
The default implementation falls back to per-doc access. The Lucene90 codec overrides it to read directly from
the data slice (fixed-length: readBytes(doc * length); variable-length: batch address lookup).
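For readers unfamiliar with the shape of this API, here is a minimal sketch of what a per-doc fallback could look like. It runs without Lucene on the classpath: `PerDocBinary` is a hypothetical stand-in for a binary doc values iterator, plain `byte[]` replaces `BytesRef`, and `evenDocsOnly` is an invented sparse source. Only the parameter order follows the signature in this PR; the body is an assumption, not the actual patch.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a per-document binary iterator; not Lucene's BinaryDocValues. */
interface PerDocBinary {
  /** Positions on {@code doc}; returns true if that doc has a value. */
  boolean advanceExact(int doc);

  /** Value of the current doc; the returned array may be reused between calls. */
  byte[] binaryValue();
}

final class BulkFallbackSketch {

  /** Fills values[valuesOffset + i] for docs[docsOffset + i], or null when the doc has no value. */
  static void binaryValues(
      PerDocBinary dv, int size, int[] docs, int docsOffset, byte[][] values, int valuesOffset) {
    for (int i = 0; i < size; i++) {
      int doc = docs[docsOffset + i];
      if (dv.advanceExact(doc)) {
        byte[] v = dv.binaryValue();
        // Deep copy, because the source may overwrite its buffer on the next call.
        values[valuesOffset + i] = Arrays.copyOf(v, v.length);
      } else {
        values[valuesOffset + i] = null;
      }
    }
  }

  /** Invented sparse source: only even doc IDs have a value (the doc ID as a single byte). */
  static PerDocBinary evenDocsOnly() {
    return new PerDocBinary() {
      private final byte[] scratch = new byte[1];

      @Override
      public boolean advanceExact(int doc) {
        if (doc % 2 != 0) {
          return false;
        }
        scratch[0] = (byte) doc;
        return true;
      }

      @Override
      public byte[] binaryValue() {
        return scratch;
      }
    };
  }

  public static void main(String[] args) {
    int[] docs = {0, 1, 2};
    byte[][] out = new byte[docs.length][];
    binaryValues(evenDocsOnly(), docs.length, docs, 0, out, 0);
    System.out.println(Arrays.deepToString(out));
  }
}
```

The copy per value is what makes this path allocation-heavy, which is the cost a codec-level override can avoid.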

Benchmark

AMD EPYC 7R32 (c5a.2xlarge, AVX2), JDK 25 Temurin, 1M docs, batchSize=1024

Codec override benefit

binaryValuesBulk on default per-doc impl vs binaryValuesBulk with Lucene90 override:

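| encoding | valueLength | default (ops/s) | codec override (ops/s) | ratio |
|----------|-------------|-----------------|------------------------|-------|
| fixed    | 8           | 39,821          | 65,940                 | 1.66x |
| fixed    | 32          | 38,643          | 59,785                 | 1.55x |
| fixed    | 128         | 31,571          | 46,174                 | 1.46x |
| variable | 8           | 19,930          | 26,154                 | 1.31x |
| variable | 32          | 15,222          | 17,754                 | 1.17x |
| variable | 128         | 12,676          | 15,243                 | 1.20x |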

End-to-end improvement

Per-doc access on main vs bulk API with codec override:

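| encoding | valueLength | per-doc (ops/s) | bulk+codec (ops/s) | ratio |
|----------|-------------|-----------------|--------------------|-------|
| fixed    | 8           | 48,048          | 65,940             | 1.37x |
| fixed    | 32          | 48,226          | 59,785             | 1.24x |
| fixed    | 128         | 39,443          | 46,174             | 1.17x |
| variable | 8           | 20,641          | 26,154             | 1.27x |
| variable | 32          | 15,968          | 17,754             | 1.11x |
| variable | 128         | 14,060          | 15,243             | 1.08x |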

costin added 2 commits June 23, 2026 20:08
Mirrors NumericDocValues#longValues for binary fields.
Default impl: per-doc advanceExact + deepCopyOf fallback.
Null for missing docs. Includes CheckIndex validation,
AssertingLeafReader assertions, dense+sparse tests.

Dense fixed-length and variable-length binary doc values
bypass virtual dispatch by reading directly from the data
slice. Includes JMH benchmark comparing bulk vs per-doc.
* @param valuesOffset first position in {@code values} to write
*/
public void binaryValues(
int size, int[] docs, int docsOffset, BytesRef[] values, int valuesOffset)

Contributor


I wonder whether we can use something more compact than BytesRef[] here. Maybe BytesRefArray?
Or maybe let the consumer side choose how to collect the values? We could replace values with an ObjIntConsumer or a custom functional interface.

Contributor Author


I know what you mean. However, I've looked at BytesRefArray, FixedLengthBytesRefArray, BytesRefBlockPool, and PagedBytes, and none seem to fit: they're all append-only with no null support (needed for docs without values).

A consumer interface reintroduces per-value virtual dispatch, which is exactly what the bulk API eliminates, and it prevents the codec from doing contiguous bulk reads.

For dense fields with contiguous doc IDs, the codec now does a single readBytes() into one shared byte[] and all returned BytesRef entries are views into it (one allocation instead of N). This works for both fixed and variable length. Non-contiguous docs fall back to per-value reads.
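To illustrate the single-read idea, here is a rough sketch under stated assumptions. `Slice` is a hypothetical (bytes, offset, length) view standing in for `BytesRef`, the data slice is modelled as a plain `byte[]`, and `addresses[d]`..`addresses[d + 1]` is assumed to delimit doc `d`'s bytes (for fixed-length values this degenerates to `d * length`). It is not the codec code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical view into a shared buffer, standing in for BytesRef. */
final class Slice {
  final byte[] bytes;
  final int offset;
  final int length;

  Slice(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }

  String utf8() {
    return new String(bytes, offset, length, StandardCharsets.UTF_8);
  }
}

final class ContiguousReadSketch {

  /** True if docs[offset .. offset + size) are consecutive doc IDs. */
  static boolean isContiguous(int[] docs, int offset, int size) {
    for (int i = 1; i < size; i++) {
      if (docs[offset + i] != docs[offset + i - 1] + 1) {
        return false;
      }
    }
    return true;
  }

  /** Reads the values of docs firstDoc .. firstDoc + count - 1 with one copy into one shared array. */
  static Slice[] readRun(byte[] data, long[] addresses, int firstDoc, int count) {
    int start = (int) addresses[firstDoc];
    int end = (int) addresses[firstDoc + count];
    // One allocation and one bulk copy for the whole run.
    byte[] shared = Arrays.copyOfRange(data, start, end);
    Slice[] out = new Slice[count];
    for (int i = 0; i < count; i++) {
      int valueStart = (int) addresses[firstDoc + i];
      int valueEnd = (int) addresses[firstDoc + i + 1];
      // Each result is only a view; no per-value byte[] is allocated.
      out[i] = new Slice(shared, valueStart - start, valueEnd - valueStart);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] data = "aabbbc".getBytes(StandardCharsets.UTF_8);
    long[] addresses = {0, 2, 5, 6}; // three docs: "aa", "bbb", "c"
    int[] docs = {1, 2};
    if (isContiguous(docs, 0, docs.length)) {
      for (Slice s : readRun(data, addresses, docs[0], docs.length)) {
        System.out.println(s.utf8());
      }
    }
  }
}
```

The trade-off is that all views pin the shared array, so a caller that retains a single value keeps the whole batch buffer alive.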

import org.apache.lucene.util.BytesRef;

- /** A per-document numeric value. */
+ /** A per-document binary value. */

Contributor


👍

Allocate memory only if the backing BytesRef is not contiguous
@costin costin requested a review from martijnvg June 26, 2026 20:54