[spark] Support V2 UPDATE for data evolution tables #8214
How to support delete row ids? cc @leaves12138
@JingsongLi Deleted row ids are simply retired — they become holes in the row-id space. Surviving rows always keep their original row ids, and no physical |
JingsongLi left a comment
+1 from my side, no blocking issues found
@kerwin-zk Maintaining the index after deletion is too troublesome. The correct solution is probably deletion vectors. Can you support only UPDATE in this PR?
c5d12ff to bcbbfb0
55f0db4 to 1927c17
    .withIgnorePreviousFiles(true)
    .getWrite
    .asInstanceOf[AbstractFileStoreWrite[PaimonInternalRow]]
    .createWriter(partition, 0)
This path skips the IOManager setup that the normal V2 writer does before creating append writers. If a data-evolution table has write-buffer-for-append=true (with the default spillable buffer), AppendOnlyWriter builds an ExternalBuffer with a null IOManager; once the buffer spills, it will fail at ioManager.createChannel(). Please pass an IOManager into this TableWriteImpl (and close it with the writer), or otherwise disable the buffered append path here.
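The failure mode and the requested fix can be illustrated with a toy model. This is only a sketch: `SpillManager` and `SpillableWriter` are hypothetical stand-ins written for this example, not Paimon's `IOManager`, `ExternalBuffer`, or `AppendOnlyWriter`, and the row limit stands in for the real memory-based spill trigger. It shows a writer that works until its buffer fills, then needs a manager to create a spill channel, and that closes the manager together with the writer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names here are hypothetical, not Paimon classes.
final class SpillLifecycleSketch {

    /** Owns a temporary spill directory and the channel files created in it. */
    static final class SpillManager implements AutoCloseable {
        private final Path dir;
        private final List<Path> channels = new ArrayList<>();

        SpillManager() {
            try {
                dir = Files.createTempDirectory("spill-sketch");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        Path createChannel() {
            try {
                Path channel = Files.createTempFile(dir, "channel", ".bin");
                channels.add(channel);
                return channel;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void close() {
            try {
                for (Path channel : channels) {
                    Files.deleteIfExists(channel);
                }
                Files.deleteIfExists(dir);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Buffers rows in memory and spills them to a channel once the buffer is full. */
    static final class SpillableWriter implements AutoCloseable {
        private final SpillManager spillManager; // null reproduces the reported failure mode
        private final int maxInMemoryRows;
        private final List<String> buffer = new ArrayList<>();
        private int spilledRows;

        SpillableWriter(SpillManager spillManager, int maxInMemoryRows) {
            this.spillManager = spillManager;
            this.maxInMemoryRows = maxInMemoryRows;
        }

        void write(String row) {
            if (buffer.size() == maxInMemoryRows) {
                spill();
            }
            buffer.add(row);
        }

        private void spill() {
            // Throws NullPointerException when no manager was supplied.
            Path channel = spillManager.createChannel();
            try {
                Files.write(channel, buffer);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            spilledRows += buffer.size();
            buffer.clear();
        }

        int spilledRows() {
            return spilledRows;
        }

        @Override
        public void close() {
            // The writer owns the manager, so both are released together.
            if (spillManager != null) {
                spillManager.close();
            }
        }
    }

    public static void main(String[] args) {
        try (SpillableWriter writer = new SpillableWriter(new SpillManager(), 2)) {
            for (int i = 0; i < 5; i++) {
                writer.write("row-" + i);
            }
            System.out.println("spilled rows: " + writer.spilledRows());
        }
    }
}
```

In this model a writer built with a `null` manager accepts rows normally and fails only on the first spill, which is why such a bug can go unnoticed in small tests.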
6ebde90 to 3470749
3470749 to 01443f1
    // consistent as long as its range is fully covered by ranges deleted in this commit
    // (concurrent rewrites of those files are caught by the regular deleted-file conflict
    // checks).
    Map<Pair<BinaryRow, Integer>, List<Range>> deletedRanges = new HashMap<>();
Is this needed by UPDATE?
@JingsongLi Yes, this is needed by UPDATE.
This logic is not for supporting SQL DELETE. It exists for the copy-on-write replacement that UPDATE performs in one commit: old physical data files are recorded as manifest DELETE entries, and rewritten files are added back with the original row ids.
The rewritten ADD files may be sub-ranges of a deleted file because of file rolling, so they do not necessarily have the exact same row-id range as any file in the current snapshot.
For example, suppose an existing file covers row ids [0, 100). The UPDATE deletes that physical file and may add rewritten files covering [0, 40) and [40, 100). The old existingIndex.containsExactly(rowRange) check would require each ADD range to exactly match an existing snapshot file, so it would falsely report a Row ID existence conflict.
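The coverage rule described above can be sketched as follows. This is a minimal illustration, not the PR's actual conflict-check code: `RowIdCoverage`, its `Range` class, and `isCovered` are hypothetical names, and ranges are modeled as half-open [from, to) to match the example. An ADD range is accepted when it lies entirely inside the union of the ranges deleted in the same commit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not Paimon's conflict-detection code.
final class RowIdCoverage {

    /** Half-open row-id range [from, to). */
    static final class Range {
        final long from;
        final long to;

        Range(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Returns true if {@code added} lies entirely within the union of {@code deleted}. */
    static boolean isCovered(Range added, List<Range> deleted) {
        List<Range> sorted = new ArrayList<>(deleted);
        sorted.sort(Comparator.comparingLong((Range r) -> r.from));

        // cursor is the first row id of 'added' not yet known to be covered.
        long cursor = added.from;
        for (Range d : sorted) {
            if (cursor >= added.to) {
                break; // everything is already covered
            }
            if (d.from > cursor) {
                break; // gap at cursor; later ranges start even further right
            }
            if (d.to > cursor) {
                cursor = d.to;
            }
        }
        return cursor >= added.to;
    }

    public static void main(String[] args) {
        List<Range> deleted = List.of(new Range(0, 100));
        System.out.println(isCovered(new Range(0, 40), deleted));
        System.out.println(isCovered(new Range(40, 100), deleted));
        System.out.println(isCovered(new Range(90, 120), deleted));
    }
}
```

With a single deleted range [0, 100), both rewritten sub-ranges pass, while a range that extends past 100 does not. An exact-match check would instead reject both sub-ranges because neither equals [0, 100).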
@kerwin-zk We do not intend to support copy-on-write UPDATE or DELETE, as it would cause severe index pollution and generate many small files.
Purpose
Support V2 UPDATE for data evolution tables.
Tests
CI