[spark] Support V2 UPDATE for data evolution tables by kerwin-zk · Pull Request #8214 · apache/paimon

kerwin-zk · 2026-06-12T04:24:00Z

Purpose

Support V2 UPDATE for data evolution tables.

Tests

CI

JingsongLi · 2026-06-12T06:51:05Z

How to support delete row ids? cc @leaves12138

kerwin-zk · 2026-06-12T07:26:17Z

How to support delete row ids? cc @leaves12138

@JingsongLi Deleted row ids are simply retired: they become holes in the row-id space. Surviving rows always keep their original row ids, and no physical _ROW_ID column is ever written, so row ids stay derived from firstRowId + position, exactly like the files produced by DataEvolutionDeleteRewriter in #8182.

JingsongLi

+1 from my side, no blocking issues found

JingsongLi · 2026-06-17T08:38:58Z

How to support delete row ids? cc @leaves12138

@JingsongLi Deleted row ids are simply retired: they become holes in the row-id space. Surviving rows always keep their original row ids, and no physical _ROW_ID column is ever written, so row ids stay derived from firstRowId + position, exactly like the files produced by DataEvolutionDeleteRewriter in #8182.

@kerwin-zk Maintaining the index after deletion is too troublesome. Deletion vectors are probably the correct solution. Can you support only UPDATE in this PR?

JingsongLi · 2026-06-20T02:17:26Z

+      .withIgnorePreviousFiles(true)
+      .getWrite
+      .asInstanceOf[AbstractFileStoreWrite[PaimonInternalRow]]
+      .createWriter(partition, 0)


This path skips the IOManager setup that the normal V2 writer does before creating append writers. If a data-evolution table has write-buffer-for-append=true (with the default spillable buffer), AppendOnlyWriter builds an ExternalBuffer with a null IOManager; once the buffer spills, it will fail at ioManager.createChannel(). Please pass an IOManager into this TableWriteImpl (and close it with the writer), or otherwise disable the buffered append path here.
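
The failure mode described in this comment can be illustrated with a toy model. This is a minimal sketch, not Paimon's code: `SpillSketch`, `IoManager`, `CountingIoManager`, and `SpillableBuffer` are hypothetical stand-ins for the real IOManager and ExternalBuffer, and the capacity of 2 rows is arbitrary. It only shows why a null IO manager goes unnoticed until the first spill.

```java
import java.util.ArrayList;
import java.util.List;

class SpillSketch {
    /** Hypothetical stand-in for an IO manager that hands out spill channels. */
    interface IoManager {
        String createChannel();
    }

    /** Trivial IO manager that only counts the channels it created. */
    static final class CountingIoManager implements IoManager {
        int channels = 0;

        @Override
        public String createChannel() {
            return "channel-" + (channels++);
        }
    }

    /** Toy buffer: holds up to `capacity` rows in memory, then spills. */
    static final class SpillableBuffer {
        private final IoManager ioManager; // null if the caller never set one
        private final int capacity;
        private final List<String> inMemory = new ArrayList<>();

        SpillableBuffer(IoManager ioManager, int capacity) {
            this.ioManager = ioManager;
            this.capacity = capacity;
        }

        void add(String row) {
            if (inMemory.size() >= capacity) {
                spill();
            }
            inMemory.add(row);
        }

        private void spill() {
            // Throws NullPointerException when ioManager is null.
            String channel = ioManager.createChannel();
            inMemory.clear(); // pretend the rows were written to `channel`
        }
    }

    /** Returns true if writing `rows` rows fails because the IO manager is missing. */
    static boolean failsOnSpill(IoManager ioManager, int rows) {
        SpillableBuffer buffer = new SpillableBuffer(ioManager, 2);
        try {
            for (int i = 0; i < rows; i++) {
                buffer.add("row-" + i);
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnSpill(null, 2));                    // false: never spills
        System.out.println(failsOnSpill(null, 5));                    // true: first spill fails
        System.out.println(failsOnSpill(new CountingIoManager(), 5)); // false: spill succeeds
    }
}
```

Small writes pass with a null IO manager, which is why a test with only a few rows would likely not catch the problem.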

@JingsongLi done

JingsongLi · 2026-06-23T07:07:39Z

+        // consistent as long as its range is fully covered by ranges deleted in this commit
+        // (concurrent rewrites of those files are caught by the regular deleted-file conflict
+        // checks).
+        Map<Pair<BinaryRow, Integer>, List<Range>> deletedRanges = new HashMap<>();


Is this needed by UPDATE?

@JingsongLi Yes, this is needed by UPDATE.

This logic is not for supporting SQL DELETE. It exists for the copy-on-write replacement that UPDATE performs in one commit: old physical data files are recorded as manifest DELETE entries, and rewritten files are added back with the original row ids.

The rewritten ADD files may be sub-ranges of a deleted file because of file rolling, so they do not necessarily have the exact same row-id range as any file in the current snapshot.

For example, an existing file covers row ids [0, 100). The UPDATE deletes that physical file and may add rewritten files [0, 40) and [40, 100). The old existingIndex.containsExactly(rowRange) check would require each ADD range to exactly match an existing snapshot file and would falsely report Row ID existence conflict.
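
The difference between the two checks can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in this PR: `RowRangeCoverageSketch`, `RowRange`, `containsExactly`, and `fullyCovered` are hypothetical names, and ranges are modeled as half-open `[from, to)` to match the example above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowRangeCoverageSketch {
    /** Half-open row-id range [from, to). Hypothetical stand-in, not Paimon's Range. */
    static final class RowRange {
        final long from;
        final long to;

        RowRange(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Old-style check: the added range must equal some existing file's range. */
    static boolean containsExactly(List<RowRange> existing, RowRange added) {
        for (RowRange r : existing) {
            if (r.from == added.from && r.to == added.to) {
                return true;
            }
        }
        return false;
    }

    /** Relaxed check: the added range must lie entirely inside the deleted ranges. */
    static boolean fullyCovered(List<RowRange> deleted, RowRange added) {
        List<RowRange> sorted = new ArrayList<>(deleted);
        sorted.sort(Comparator.comparingLong((RowRange r) -> r.from));
        long coveredUpTo = added.from;
        for (RowRange r : sorted) {
            if (r.from > coveredUpTo) {
                break; // gap: no later range can start earlier
            }
            coveredUpTo = Math.max(coveredUpTo, r.to);
            if (coveredUpTo >= added.to) {
                return true;
            }
        }
        return coveredUpTo >= added.to;
    }

    public static void main(String[] args) {
        List<RowRange> snapshotFiles = List.of(new RowRange(0, 100));
        List<RowRange> deletedInCommit = List.of(new RowRange(0, 100));
        RowRange first = new RowRange(0, 40);
        RowRange second = new RowRange(40, 100);

        System.out.println(containsExactly(snapshotFiles, first));  // false
        System.out.println(fullyCovered(deletedInCommit, first));   // true
        System.out.println(fullyCovered(deletedInCommit, second));  // true
    }
}
```

A range such as [90, 120) would still be rejected by the relaxed check, because row ids 100 to 119 are not covered by any range deleted in the same commit.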

JingsongLi · 2026-06-23T11:16:01Z

@kerwin-zk We do not intend to support the COPY ON WRITE update / delete method, as it would result in severe index pollution and generate many small files.

JingsongLi reviewed Jun 13, 2026

kerwin-zk force-pushed the spark-v2-dml-data-evolution branch from c5d12ff to bcbbfb0 Compare June 17, 2026 13:34

kerwin-zk changed the title ~~[spark] Support V2 DELETE and UPDATE for data evolution tables~~ [spark] Support V2 UPDATE for data evolution tables Jun 17, 2026

kerwin-zk force-pushed the spark-v2-dml-data-evolution branch 2 times, most recently from 55f0db4 to 1927c17 Compare June 18, 2026 04:32

JingsongLi reviewed Jun 20, 2026

kerwin-zk force-pushed the spark-v2-dml-data-evolution branch 2 times, most recently from 6ebde90 to 3470749 Compare June 21, 2026 15:47

[spark] Support V2 UPDATE for data evolution tables

01443f1

kerwin-zk force-pushed the spark-v2-dml-data-evolution branch from 3470749 to 01443f1 Compare June 22, 2026 15:41

JingsongLi reviewed Jun 23, 2026

JingsongLi closed this Jun 23, 2026

kerwin-zk wants to merge 1 commit into apache:master from kerwin-zk:spark-v2-dml-data-evolution
