feat(inspect): add ArrowRowBuilder for materializing Arrow batches by WZhuo · Pull Request #780 · apache/iceberg-cpp

WZhuo · 2026-06-25T03:19:03Z

Summary

Adds ArrowRowBuilder (inspect/row_builder_internal), a schema-driven helper that materializes in-memory rows into an Arrow ArrowArray (a batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append helpers, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.

This is the first of a series splitting metadata-table support into focused PRs; the InMemoryBatchReader and the SnapshotsTable::Scan integration are intended to follow in separate PRs that build on this.

What's included

ArrowRowBuilder::Make/column/FinishRow/Finish and typed helpers AppendNull/AppendBoolean/AppendInt/AppendString/AppendStringMap.
The implementation lives in the core iceberg library — it only needs nanoarrow + ToArrowSchema (no Apache Arrow), matching peers like manifest_adapter and arrow_c_data_util.
Unit tests in row_builder_test.cc covering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into the metadata_table_test target.

Testing

CMake (Ninja): cmake --build build --target metadata_table_test then ran it — 9/9 tests pass (4 existing MetadataTableTest + 5 new ArrowRowBuilderTest). ctest green.
The test verifies output by importing the produced C-data into Apache Arrow (arrow::ImportRecordBatch), so its target is USE_BUNDLE.

Notes

The test is registered under CMake's bundle build only. The meson build (which has no Apache Arrow/bundle layer) is left unchanged; the core-only metadata_table_test.cc continues to build there.
Developed with AI-assisted tooling, reviewed by the author.

Add ArrowRowBuilder (inspect/row_builder_internal) to materialize in-memory rows into an ArrowArray for an arbitrary Iceberg schema, with typed append helpers (AppendNull/Boolean/Int/String/StringMap) reused by later metadata tables.

wgtmac · 2026-06-25T15:06:49Z

    file_writer.cc
    inspect/history_table.cc
    inspect/metadata_table.cc
+    inspect/row_builder_internal.cc


We can just name it to row_builder.cc because _internal suffix is used for marking header files that do not need to be installed.

wgtmac · 2026-06-25T15:07:21Z

+
+#pragma once
+
+/// \file iceberg/inspect/row_builder_internal.h


Why putting it here instead of treating it as a general utility for building arrow arrays?

wgtmac · 2026-06-25T15:24:46Z

+  ArrowError error;
+  ICEBERG_NANOARROW_RETURN_UNEXPECTED_WITH_ERROR(
+      ArrowArrayInitFromSchema(array.get(), &arrow_schema, &error), error);
+  ICEBERG_NANOARROW_RETURN_UNEXPECTED(ArrowArrayStartAppending(array.get()));


If this call fails, the ArrowArray initialized above is destroyed without ArrowArrayRelease, so nanoarrow-owned state leaks. Please guard it right after ArrowArrayInitFromSchema and only disarm the guard once ownership has moved into ArrowRowBuilder.

wgtmac · 2026-06-25T15:29:16Z

I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code.

We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, position_delete_writer.cc and arrow_c_data_util.cc both guard the initialized ArrowArray immediately after ArrowArrayInitFromSchema, while this new builder currently misses that and can leak if ArrowArrayStartAppending fails. manifest_adapter.cc also has local primitive/list/map append helpers, including map-shape validation, and this PR adds another set of primitive/string-map helpers.

A concrete way to keep this scoped would be:

Add an internal move-only nanoarrow array builder/owner that handles init/start/finish/release/ownership transfer.
Move the simple append helpers (null, signed/unsigned int, double, string, bytes, and maybe map entry validation) behind that internal helper.
Keep higher-level code like ManifestAdapter, batch projection, and metadata-table row materialization separate, since those have different domain semantics.

That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor.

WZhuo force-pushed the arrow-row-builder branch from b0a8615 to b42f0da Compare June 25, 2026 03:38

wgtmac reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:arrow-row-builder

WZhuo commented Jun 25, 2026

Uh oh!

wgtmac Jun 25, 2026

Uh oh!

wgtmac Jun 25, 2026

Uh oh!

wgtmac Jun 25, 2026

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		#pragma once

		/// \file iceberg/inspect/row_builder_internal.h

Uh oh!

Conversation

WZhuo commented Jun 25, 2026

Summary

What's included

Testing

Notes

Uh oh!

wgtmac Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

wgtmac Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

wgtmac Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

wgtmac commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants