Skip to content

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780

Open
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:arrow-row-builder
Open

feat(inspect): add ArrowRowBuilder for materializing Arrow batches#780
WZhuo wants to merge 1 commit into
apache:mainfrom
WZhuo:arrow-row-builder

Conversation

@WZhuo

@WZhuo WZhuo commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds ArrowRowBuilder (inspect/row_builder_internal), a schema-driven helper that materializes in-memory rows into an Arrow ArrowArray (a batch) for an arbitrary Iceberg schema. It wraps the nanoarrow boilerplate and exposes per-column access plus typed append helpers, so metadata tables (snapshots, history, manifests, …) can emit rows without re-implementing it.

This is the first of a series splitting metadata-table support into focused PRs; the InMemoryBatchReader and the SnapshotsTable::Scan integration are intended to follow in separate PRs that build on this.

What's included

  • ArrowRowBuilder::Make/column/FinishRow/Finish and typed helpers AppendNull/AppendBoolean/AppendInt/AppendString/AppendStringMap.
  • The implementation lives in the core iceberg library — it only needs nanoarrow + ToArrowSchema (no Apache Arrow), matching peers like manifest_adapter and arrow_c_data_util.
  • Unit tests in row_builder_test.cc covering typed appends (int32/string/int64/boolean/map), null handling for optional columns, multi-entry/empty string maps, zero-row batches, and column-index bounds. Compiled into the metadata_table_test target.

Testing

  • CMake (Ninja): cmake --build build --target metadata_table_test then ran it — 9/9 tests pass (4 existing MetadataTableTest + 5 new ArrowRowBuilderTest). ctest green.
  • The test verifies output by importing the produced C-data into Apache Arrow (arrow::ImportRecordBatch), so its target is USE_BUNDLE.

Notes

  • The test is registered under CMake's bundle build only. The meson build (which has no Apache Arrow/bundle layer) is left unchanged; the core-only metadata_table_test.cc continues to build there.
  • Developed with AI-assisted tooling, reviewed by the author.

Add ArrowRowBuilder (inspect/row_builder_internal) to materialize
in-memory rows into an ArrowArray for an arbitrary Iceberg schema,
with typed append helpers (AppendNull/Boolean/Int/String/StringMap)
reused by later metadata tables.
@WZhuo WZhuo force-pushed the arrow-row-builder branch from b0a8615 to b42f0da Compare June 25, 2026 03:38
file_writer.cc
inspect/history_table.cc
inspect/metadata_table.cc
inspect/row_builder_internal.cc

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just name it to row_builder.cc because _internal suffix is used for marking header files that do not need to be installed.


#pragma once

/// \file iceberg/inspect/row_builder_internal.h

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it here instead of treating it as a general utility for building arrow arrays?

ArrowError error;
ICEBERG_NANOARROW_RETURN_UNEXPECTED_WITH_ERROR(
ArrowArrayInitFromSchema(array.get(), &arrow_schema, &error), error);
ICEBERG_NANOARROW_RETURN_UNEXPECTED(ArrowArrayStartAppending(array.get()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this call fails, the ArrowArray initialized above is destroyed without ArrowArrayRelease, so nanoarrow-owned state leaks. Please guard it right after ArrowArrayInitFromSchema and only disarm the guard once ownership has moved into ArrowRowBuilder.

@wgtmac

wgtmac commented Jun 25, 2026

Copy link
Copy Markdown
Member

I think this is a good point to factor out a small internal nanoarrow builder/helper layer instead of adding another local copy of the same lifecycle code.

We already have the same pattern in a few production paths: initialize from schema, start appending, append child values, finish each row/list/map element, then finish building. For example, position_delete_writer.cc and arrow_c_data_util.cc both guard the initialized ArrowArray immediately after ArrowArrayInitFromSchema, while this new builder currently misses that and can leak if ArrowArrayStartAppending fails. manifest_adapter.cc also has local primitive/list/map append helpers, including map-shape validation, and this PR adds another set of primitive/string-map helpers.

A concrete way to keep this scoped would be:

  1. Add an internal move-only nanoarrow array builder/owner that handles init/start/finish/release/ownership transfer.
  2. Move the simple append helpers (null, signed/unsigned int, double, string, bytes, and maybe map entry validation) behind that internal helper.
  3. Keep higher-level code like ManifestAdapter, batch projection, and metadata-table row materialization separate, since those have different domain semantics.

That should fix the leak path here and avoid repeating this fragile nanoarrow lifecycle in the next metadata-table PRs without turning this into a broad refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants