Add support for ARIES UNDO and constant time recovery by gburd · Pull Request #28 · gburd/postgres

gburd · 2026-07-03T13:40:43Z

No description provided.

- Hourly upstream sync from postgres/postgres (24x daily)
- AI-powered PR reviews using AWS Bedrock Claude Sonnet 4.5
- Multi-platform CI via existing Cirrus CI configuration
- Cost tracking and comprehensive documentation

Features:
- Automatic issue creation on sync conflicts
- PostgreSQL-specific code review prompts (C, SQL, docs, build)
- Cost limits: $15/PR, $200/month
- Inline PR comments with security/performance labels
- Skip draft PRs to save costs

Documentation:
- .github/SETUP_SUMMARY.md - Quick setup overview
- .github/QUICKSTART.md - 15-minute setup guide
- .github/PRE_COMMIT_CHECKLIST.md - Verification checklist
- .github/docs/ - Detailed guides for sync, AI review, Bedrock

See .github/README.md for complete overview

Complete Phase 3: Windows builds + fix sync for CI/CD commits

Phase 3: Windows Dependency Build System
- Implement full build workflow (OpenSSL, zlib, libxml2)
- Smart caching by version hash (80% cost reduction)
- Dependency bundling with manifest generation
- Weekly auto-refresh + manual triggers
- PowerShell download helper script
- Comprehensive usage documentation

Sync Workflow Fix:
- Allow .github/ commits (CI/CD config) on master
- Detect and reject code commits outside .github/
- Merge upstream while preserving .github/ changes
- Create issues only for actual pristine violations

Documentation:
- Complete Windows build usage guide
- Update all status docs to 100% complete
- Phase 3 completion summary

All three CI/CD phases complete (100%):
✅ Hourly upstream sync with .github/ preservation
✅ AI-powered PR reviews via Bedrock Claude 4.5
✅ Windows dependency builds with smart caching

Cost: $40-60/month total

See .github/PHASE3_COMPLETE.md for details

Fix sync to allow 'dev setup' commits on master

The sync workflow was failing because the 'dev setup v19' commit modifies files outside .github/. Updated workflows to recognize commits with messages starting with 'dev setup' as allowed on master.

Changes:
- Detect 'dev setup' commits by message pattern (case-insensitive)
- Allow merge if commits are .github/ OR dev setup OR both
- Update merge messages to reflect preserved changes
- Document pristine master policy with examples

This allows personal development environment commits (IDE configs, debugging tools, shell aliases, Nix configs, etc.) on master without violating the pristine mirror policy. Future dev environment updates should start with 'dev setup' in the commit message to be automatically recognized and preserved.

See .github/docs/pristine-master-policy.md for complete policy
See .github/DEV_SETUP_FIX.md for fix summary

Optimize CI/CD costs by skipping builds for pristine commits

Add cost optimization to Windows dependency builds to avoid expensive builds when only pristine commits are pushed (dev setup commits or .github/ configuration changes).

Changes:
- Add check-changes job to detect pristine-only pushes
- Skip Windows builds when all commits are dev setup or .github/ only
- Add comprehensive cost optimization documentation
- Update README with cost savings (~40% reduction)

Expected savings: ~$3-5/month on Windows builds, ~$40-47/month total through combined optimizations. Manual dispatch and scheduled builds always run regardless.

Sparsemap is a memory-efficient data structure for maintaining sparse sets of integers using hierarchical bitmaps. It supports O(1) set/get operations and efficient iteration over set bits while using far less memory than a dense bitmap for sparse populations.

The implementation provides:
- sparsemap_set/get/is_set for individual bit manipulation
- sparsemap_scan for efficient forward iteration
- sparsemap_select for rank-based selection
- Configurable initial capacity with automatic growth

Used by the UNDO subsystem for tracking allocated pages and by RECNO for free-space management within relation forks. Includes a TAP regression test module (test_sparsemap) exercising all public API operations.
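To make the hierarchical-bitmap idea concrete, here is a minimal two-level sketch. It is not the sparsemap API from this commit: the SketchMap type, the sketch_map_* names, the fixed 4096-bit capacity, and the use of calloc instead of palloc are all assumptions made for illustration. A summary word records which leaf words are non-empty, so leaves are allocated only on first use and a scan can skip empty regions in one step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_CHUNKS 64                    /* one summary bit per leaf word */
#define SKETCH_BITS   (SKETCH_CHUNKS * 64)  /* fixed capacity of this sketch */

typedef struct SketchMap
{
    uint64_t  summary;                /* bit c set => leaf c has a set bit */
    uint64_t *chunk[SKETCH_CHUNKS];   /* leaf words, allocated on first use */
} SketchMap;

/* Set bit idx; false on out-of-range index or allocation failure. */
static bool
sketch_map_set(SketchMap *m, unsigned idx)
{
    unsigned c = idx / 64;

    assert(m != NULL);
    if (idx >= SKETCH_BITS)
        return false;
    if (m->chunk[c] == NULL)
    {
        m->chunk[c] = calloc(1, sizeof(uint64_t));
        if (m->chunk[c] == NULL)
            return false;
    }
    *m->chunk[c] |= UINT64_C(1) << (idx % 64);
    m->summary |= UINT64_C(1) << c;
    return true;
}

/* Test bit idx; unallocated leaves read as all-zero. */
static bool
sketch_map_is_set(const SketchMap *m, unsigned idx)
{
    unsigned c = idx / 64;

    if (idx >= SKETCH_BITS || m->chunk[c] == NULL)
        return false;
    return ((*m->chunk[c] >> (idx % 64)) & 1) != 0;
}

/* Return the first set bit >= from, or -1 if there is none. */
static int
sketch_map_scan(const SketchMap *m, unsigned from)
{
    unsigned idx = from;

    while (idx < SKETCH_BITS)
    {
        unsigned c = idx / 64;

        if (((m->summary >> c) & 1) == 0)
        {
            idx = (c + 1) * 64;     /* skip the whole empty leaf */
            continue;
        }
        if (sketch_map_is_set(m, idx))
            return (int) idx;
        idx++;
    }
    return -1;
}
```

With bits 5 and 4000 set, a scan starting at 6 consults the summary word to jump over the 61 empty leaves in between rather than probing each bit.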

Header-only implementation of a probabilistic skip list providing O(log n) insert, delete, and lookup operations with O(n) space. Compared to rbtree, skip lists offer simpler implementation, better cache locality for sequential scans, and lock-free read potential.

The implementation provides:
- Type-safe macros for defining typed skip lists (DEFINE_SKIPLIST)
- Configurable maximum height (up to 32 levels)
- Forward iteration via SKIPLIST_FOREACH
- Range queries and nearest-neighbor lookup
- Memory allocation via palloc (TopMemoryContext by default)

Used by the UNDO subsystem for maintaining ordered transaction metadata and by RECNO for HLC-ordered page directories. Includes a TAP regression test module (test_skiplist) exercising insertion, deletion, iteration, and edge cases.
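For readers unfamiliar with the structure, the following is a minimal sketch of probabilistic skip-list insert and lookup. It does not use the DEFINE_SKIPLIST macros from this commit; the SketchSkipList type, the sketch_skiplist_* functions, the int key, the height cap of 8, and the use of calloc and rand() are assumptions chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_MAX_HEIGHT 8

typedef struct SketchNode
{
    int                key;
    struct SketchNode *next[SKETCH_MAX_HEIGHT];
} SketchNode;

typedef struct SketchSkipList
{
    SketchNode head;    /* sentinel; head.key is never compared */
    int        height;  /* number of levels currently in use (>= 1) */
} SketchSkipList;

static void
sketch_skiplist_init(SketchSkipList *sl)
{
    assert(sl != NULL);
    for (int i = 0; i < SKETCH_MAX_HEIGHT; i++)
        sl->head.next[i] = NULL;
    sl->height = 1;
}

/* Each extra level is kept with probability 1/2. */
static int
sketch_random_height(void)
{
    int h = 1;

    while (h < SKETCH_MAX_HEIGHT && (rand() & 1))
        h++;
    return h;
}

/* Insert key; false if it already exists or allocation fails. */
static bool
sketch_skiplist_insert(SketchSkipList *sl, int key)
{
    SketchNode *update[SKETCH_MAX_HEIGHT];
    SketchNode *x = &sl->head;
    SketchNode *n;
    int         h;

    /* Remember, per level, the last node whose key is below the new key. */
    for (int i = SKETCH_MAX_HEIGHT - 1; i >= 0; i--)
    {
        while (x->next[i] != NULL && x->next[i]->key < key)
            x = x->next[i];
        update[i] = x;
    }
    if (x->next[0] != NULL && x->next[0]->key == key)
        return false;

    n = calloc(1, sizeof(SketchNode));
    if (n == NULL)
        return false;
    n->key = key;
    h = sketch_random_height();
    if (h > sl->height)
        sl->height = h;
    for (int i = 0; i < h; i++)
    {
        n->next[i] = update[i]->next[i];
        update[i]->next[i] = n;
    }
    return true;
}

static bool
sketch_skiplist_contains(const SketchSkipList *sl, int key)
{
    const SketchNode *x = &sl->head;

    for (int i = sl->height - 1; i >= 0; i--)
        while (x->next[i] != NULL && x->next[i]->key < key)
            x = x->next[i];
    x = x->next[0];
    return x != NULL && x->key == key;
}
```

Level 0 is an ordinary sorted linked list, which is what a forward iterator would walk; the random node heights affect only search cost, not the resulting order.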

Introduce two TableAmRoutine booleans and the begin_bulk_insert callback that the UNDO subsystem builds on, plus the RelationAmSupportsUndo() accessor index AMs use to gate UNDO record generation on the parent table. am_supports_undo marks an AM that registers an UNDO resource manager and emits UNDO records tagged with its own rmid; the UNDO core stays AM-agnostic and interprets the payload only through that RM's callbacks. am_inplace_update_keeps_tid marks an AM that updates in place and keeps the row's TID, so the executor can skip redundant index re-inserts for unchanged keys. The heap AM leaves both false. This commit only adds the routine fields and the accessor; no AM sets the flags yet.
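A rough sketch of how the two flags might be consulted. The field names am_supports_undo and am_inplace_update_keeps_tid come from the commit message; the Sketch* structs, the accessor's signature, and the sketch_can_skip_index_insert helper are stand-ins invented here, not the declarations in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the two new TableAmRoutine booleans. */
typedef struct SketchTableAmRoutine
{
    bool am_supports_undo;            /* AM registers an UNDO rmgr */
    bool am_inplace_update_keeps_tid; /* UPDATE keeps the row's TID */
} SketchTableAmRoutine;

/* Stand-in for a relation descriptor carrying its table AM. */
typedef struct SketchRelation
{
    const SketchTableAmRoutine *rd_tableam;
} SketchRelation;

/* Hypothetical shape of the accessor index AMs would call. */
static bool
sketch_relation_am_supports_undo(const SketchRelation *rel)
{
    assert(rel != NULL);
    return rel->rd_tableam != NULL && rel->rd_tableam->am_supports_undo;
}

/*
 * Hypothetical executor-side check: an index re-insert is redundant only
 * when the TID is stable across the update and the indexed key is unchanged.
 */
static bool
sketch_can_skip_index_insert(const SketchRelation *rel, bool key_changed)
{
    assert(rel != NULL);
    return rel->rd_tableam != NULL &&
        rel->rd_tableam->am_inplace_update_keeps_tid &&
        !key_changed;
}
```

An AM that leaves both flags false, as heap does here, always takes the existing code paths.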

Add the AM-agnostic UNDO engine: the in-WAL UNDO record format and insertion path, the per-relation RELUNDO fork with its own resource manager, the shared sLog tuple-state map, the rollback apply driver and compensation-log generation, the discard horizon, and the background revert/undo workers. Register the UNDO, ATM, and RELUNDO resource managers and wire the subsystem into transaction start/commit/abort, two-phase commit, recovery, and process startup. The engine interprets UNDO payloads only through per-RM callbacks and the RelUndo*_hook function pointers (defined here, left NULL), so the core has no compile-time knowledge of any specific access method. Heap, vacuum, pruning, reloptions, and executor integration consume only the AM-agnostic interfaces. No UNDO-producing AM is registered yet: RegisterUndoRmgrs() initializes the dispatch table but registers no per-AM handlers. The index-AM apply handlers and the AM that sets am_supports_undo arrive in later commits.
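The AM-agnostic dispatch described above can be pictured as a table of callbacks indexed by resource-manager id. The sketch below is illustrative only: the record layout, the sketch_* names, the table size, and the choice to return -1 for a record with no registered handler are assumptions, not the engine's actual interfaces. Records are applied newest first, as rollback requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RMGRS 8

typedef struct SketchUndoRecord
{
    uint8_t rmid;     /* which resource manager owns the payload */
    int     payload;  /* opaque to the core; only the handler reads it */
} SketchUndoRecord;

typedef void (*SketchUndoApplyFn) (const SketchUndoRecord *rec, void *ctx);

/* Dispatch table; every slot starts out NULL (no handler registered). */
static SketchUndoApplyFn sketch_undo_handlers[SKETCH_MAX_RMGRS];

static void
sketch_register_undo_rmgr(uint8_t rmid, SketchUndoApplyFn fn)
{
    assert(rmid < SKETCH_MAX_RMGRS);
    sketch_undo_handlers[rmid] = fn;
}

/*
 * Apply records in reverse (newest first).  Returns the number applied,
 * or -1 on the first record whose rmid has no registered handler.
 */
static int
sketch_undo_apply(const SketchUndoRecord *recs, int nrecs, void *ctx)
{
    int applied = 0;

    for (int i = nrecs - 1; i >= 0; i--)
    {
        uint8_t rmid = recs[i].rmid;

        if (rmid >= SKETCH_MAX_RMGRS || sketch_undo_handlers[rmid] == NULL)
            return -1;
        sketch_undo_handlers[rmid] (&recs[i], ctx);
        applied++;
    }
    return applied;
}

/* Example handler: records the order in which payloads were undone. */
typedef struct SketchTrace
{
    int seen[8];
    int nseen;
} SketchTrace;

static void
sketch_trace_handler(const SketchUndoRecord *rec, void *ctx)
{
    SketchTrace *t = ctx;

    if (t->nseen < 8)
        t->seen[t->nseen++] = rec->payload;
}
```

Because the core only ever calls through the table, adding a new UNDO-producing AM means registering one more handler rather than touching the apply loop.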

Add the UNDO resource-manager handlers for the nbtree and hash index AMs and register them from RegisterUndoRmgrs(). On rollback of an aborting transaction, the nbtree handler re-descends to the leaf entry by key and heap TID before marking it dead, so a committed entry that shifted onto the recorded slot under concurrent inserts or leaf splits is never killed; entries inside posting-list tuples are left for VACUUM. The hash handler reverses its own inserts analogously. Both are gated by RelationAmSupportsUndo() on the parent table, so they are inert until an UNDO-supporting table AM exists.

Add pg_setxattr/pg_getxattr/pg_removexattr/pg_listxattr wrappers over the platform extended-attribute syscalls (Linux/*BSD/macOS, no-op stubs where unsupported) and build them into libpgport. The transactional file-ops resource manager added next uses these to record and reverse xattr mutations.

Filesystem mutations the server performs outside the buffer manager -- creating and removing directories, copying trees, writing version files, managing symlinks, setting extended attributes -- historically had no WAL coverage and relied on best-effort cleanup callbacks that do not survive a crash. Add FILEOPS: a resource manager (RM_FILEOPS_ID) that records each mutation as a deferred pending operation, executes it at commit, and WAL-logs it so redo reproduces it during crash recovery and standby replay. This commit adds the operation engine, the public FileOps* API, the WAL record formats, the redo handler, and the rmgrdesc descriptors. A README under src/backend/storage/file explains why FILEOPS exists and how to use the API. The rollback (UNDO) side and the rewiring of existing callers follow in subsequent commits.
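The deferred-operation model can be sketched as a per-transaction queue that is drained at commit and discarded at abort. This is not the FileOps* API: the SketchQueue and SketchFs types and the sketch_fileops_* functions are invented for illustration, and an in-memory set of directory names stands in for the real filesystem so the example has no side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX      16
#define SKETCH_PATH_LEN 32

typedef enum SketchOpKind { SKETCH_OP_MKDIR, SKETCH_OP_RMDIR } SketchOpKind;

typedef struct SketchPendingOp
{
    SketchOpKind kind;
    char         path[SKETCH_PATH_LEN];
} SketchPendingOp;

typedef struct SketchQueue
{
    SketchPendingOp ops[SKETCH_MAX];
    int             nops;
} SketchQueue;

/* Toy filesystem: just the set of directories that currently exist. */
typedef struct SketchFs
{
    char dirs[SKETCH_MAX][SKETCH_PATH_LEN];
    int  ndirs;
} SketchFs;

static int
sketch_fs_find(const SketchFs *fs, const char *path)
{
    for (int i = 0; i < fs->ndirs; i++)
        if (strcmp(fs->dirs[i], path) == 0)
            return i;
    return -1;
}

/* Record the mutation; nothing touches the filesystem yet. */
static bool
sketch_fileops_defer(SketchQueue *q, SketchOpKind kind, const char *path)
{
    assert(q != NULL && path != NULL);
    if (q->nops >= SKETCH_MAX || strlen(path) >= SKETCH_PATH_LEN)
        return false;
    q->ops[q->nops].kind = kind;
    snprintf(q->ops[q->nops].path, SKETCH_PATH_LEN, "%s", path);
    q->nops++;
    return true;
}

/* At commit: execute queued operations in order, then empty the queue. */
static void
sketch_fileops_at_commit(SketchQueue *q, SketchFs *fs)
{
    for (int i = 0; i < q->nops; i++)
    {
        const SketchPendingOp *op = &q->ops[i];
        int at = sketch_fs_find(fs, op->path);

        /* A real engine would WAL-log the operation here for redo. */
        if (op->kind == SKETCH_OP_MKDIR && at < 0 && fs->ndirs < SKETCH_MAX)
        {
            snprintf(fs->dirs[fs->ndirs], SKETCH_PATH_LEN, "%s", op->path);
            fs->ndirs++;
        }
        else if (op->kind == SKETCH_OP_RMDIR && at >= 0)
        {
            fs->ndirs--;
            if (at != fs->ndirs)
                memcpy(fs->dirs[at], fs->dirs[fs->ndirs], SKETCH_PATH_LEN);
        }
    }
    q->nops = 0;
}

/* At abort: the queued operations never happened. */
static void
sketch_fileops_at_abort(SketchQueue *q)
{
    q->nops = 0;
}
```

Deferring to commit is what makes abort trivial for these operations: a directory queued for creation by a transaction that later aborts is simply never created.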

The FILEOPS engine added in the previous commit makes filesystem mutations crash-safe via WAL redo, but redo alone only reproduces a committed operation -- it cannot reverse one when the surrounding transaction aborts. Operations such as chmod, chown, truncate, and setxattr overwrite prior state in place, so undoing them requires the before-image that was captured when the operation ran. This commit adds fileops_undo.c, which registers a handler with the UNDO machinery (RM_UNDO_ID, subtypes FILEOPS_UNDO_*). For each reversible operation it records the before-image -- the original mode for a chmod, the prior owner for a chown, the original length for a truncate, the previous extended-attribute value for a setxattr, and so on -- and during UNDO application performs the inverse action: unlink a created file, rename back, or restore the saved mode, owner, length, or xattr value. FileopsUndoRmgrInit() registers the handler at startup alongside the other built-in UNDO resource managers. The storage/file build manifests are updated to compile fileops_undo.c, and undo.c now includes storage/fileops.h and calls FileopsUndoRmgrInit() from RegisterUndoRmgrs().

With the FILEOPS engine and its UNDO handlers in place, replace the raw filesystem mutations in the transaction-sensitive code paths with the FileOps* API so they become atomic with the transaction and crash-safe. CREATE/DROP DATABASE (dbcommands.c), CREATE/DROP TABLESPACE and ALTER DATABASE ... SET TABLESPACE (tablespace.c), and the directory-copy helper (copydir.c) now register deferred operations instead of calling mkdir/symlink/rename/rmtree directly, and xact.c drives the pending-op queue at commit, abort, subtransaction boundaries, and PREPARE. pg_waldump gains the fileops rmgr descriptor (fileopsdesc.c symlink plus its rmgrdesc.c table entry) so XLOG_FILEOPS_* records render in waldump output. The change ships user documentation (fileops.sgml and its filelist/postgres wiring, a worked example), a test_fileops contrib module exercising the API, regression coverage (regress/fileops.sql), and recovery TAP tests that crash mid-operation and verify redo and rollback. typedefs.list records the new types.

PostgreSQL's heap appends a new tuple version on every UPDATE and abandons the old one to VACUUM. For update-heavy workloads on narrow hot rows -- counters, queue heads, running aggregates, time-series tail writes -- the resulting version churn dominates buffer traffic, bloats relations, and keeps autovacuum permanently behind.

RECNO is a table access method (amname "recno", RM_RECNO_ID) that updates tuples in place. An UPDATE overwrites the committed bytes on the main fork; the prior image is preserved in the per-relation UNDO fork (RELUNDO_FORKNUM) so a ROLLBACK restores it and a concurrent snapshot reader can still reconstruct the version it is entitled to see. Heap is untouched; a relation opts in with USING recno.

Visibility is driven by a hybrid logical clock rather than per-tuple xmin/xmax scans. Each tuple header carries an HLC commit stamp; a snapshot captures an HLC horizon and a tuple is visible when its stamp precedes the horizon and its writer has committed. Transient state (uncommitted / deleted / updated) lives in header flag bits that UNDO clears on rollback through the RelUndoClearTransientFlags hook, and in-place delta reversal runs through RelUndoReverseDelta; RECNO installs both at init via RecnoRelUndoInstallHooks() so the UNDO core never sees a RECNO tuple layout.

Same-page updates rewrite the slot directly; updates that no longer fit spill to an overflow chain with optional per-attribute dictionary compression (LZ4/ZSTD, ANALYZE-refreshed). A sparsemap-backed dirty map and a partitioned secondary log (sLog) serve before-images to old readers without a separate version-store cache; free space and visibility are tracked in RECNO-private FSM and VM forks.

WAL coverage is complete: insert, in-place and out-of-place update, delete, tuple-lock, VACUUM, dict writes, and overflow all log redo records under RM_RECNO_ID with matching recovery routines, and crash recovery cooperates with the UNDO driver to roll back in-place loser transactions. Index AMs over a RECNO table force a heap recheck on index-only scans and tolerate stale index entries left by in-place updates until before-image reclamation retires them.

The commit also wires RECNO into the supporting infrastructure: pg_am / pg_proc catalog entries, the recno rmgr description routine, pageinspect coverage (pageinspect 1.13 -> 1.14), the in-tree documentation chapter, and isolation, regression, and crash-recovery test suites exercising MVCC correctness, overflow, dictionary compression, and dual-mode UNDO rollback.

Every committed in-place UPDATE now appends an 8-byte RelUndoRecPtr to its new on-page image, guarded by RECNO_TUPLE_HAS_VERSION_PTR, pointing at the head of the relation's persistent version chain in the UNDO fork. Never-updated rows stay at their base length; a row grows by 8 bytes once on its first UPDATE and stays that size for subsequent same-size updates on the CAS fast path.

The CAS fast path now targets base+8 and stamps the freshly reserved RelUndoRecPtr before computing the byte-diff, so the diff and the page memcpy both see the final slot layout. First-time +8 growth is performed on the exclusive-lock path, sized to the full slot with the verptr at the tail and logged as a full image to avoid primary/replica divergence. The widen-to-full-slot fixup runs before the critical section, since its repalloc/pfree are forbidden once START_CRIT_SECTION is entered.

Accessors RecnoTupleGetVersionPtr/SetVersionPtr read and write the trailing field via memcpy (unaligned-safe) from the on-page slot tail, keyed on ItemIdGetLength rather than any logical length field.

Also fixes two latent reverse-apply bugs that activate once reads rely on reconstruction: size the DELTA_UPDATE reconstruction buffer to the diff's old_total_len (not the current on-page length, which a shrinking update makes too small), and promote the slot-too-small restore skip from DEBUG1 to WARNING.

RECNO: reconstruct prior versions from UNDO fork, drain sLog (WS-PVS2/3/4)

Implement the CTR paper's sLog/PVS division of labor: the durable per-relation UNDO fork becomes the Persistent Version Store, and MVCC readers reconstruct prior tuple versions on demand from its byte-diffs instead of keeping full before-images in the in-memory sLog.

WS-PVS2: RecnoReconstructVisibleVersion (recno_pvs.c) walks a tuple's version chain via its trailing verptr -> RelUndoReadRecord, reverse-applying each RELUNDO_DELTA_UPDATE / RELUNDO_UPDATE record until it reaches the version visible to the reader's snapshot. Replaces the committed-UPDATE branch of the former SLogTupleGetSharedBeforeImage call sites on all fetch paths (seq, TID, plain/IOS index, bitmap).

WS-PVS3: stop publishing committed-UPDATE before-images into the sLog. Phase 1 removes the DSA before-image allocation and the shared-image getter; Phase 2 stops retaining committed-UPDATE markers in the flat hash and migrates the lost-update probe onto the fork (RecnoTupleHasCommittedUpdateAfter). Together these drain both the DSA memory limit and the bucket-count table_full saturation.

WS-PVS4: complete the RelUndoRecPtr generation-counter machinery so a recycled fork page cannot alias a discarded probe record. Bump the metapage counter on free-list recycle and validate it on read; the counter is durably logged with the page reinit in one WAL record.

Tests: recno TAP 5/5, recovery 061/063/065, isolation 148/148 all pass. The pre-existing varlen-growing-UPDATE regress failures and the select_parallel worker-count flake are unchanged from baseline.

RECNO: COW-reference unchanged overflow columns on UPDATE (WS-Q2 Layer 0)

A narrow-column UPDATE of a wide row re-stored the unchanged wide column into fresh overflow pages, doubling on-disk size until VACUUM caught up.

Carry a 32-bit whole-value content hash in RecnoOverflowPtr (reclaimed from the dead ov_padding/ov_flags fields so the struct stays 20 bytes and RECNO_OVERFLOW_PTR_SIZE is unchanged -- growing it would defeat the force-shrink UPDATE recovery path). At the exclusive-path form site, snapshot the old on-page overflow pointers by attnum while the buffer is locked, then for each over-threshold varlena compare length+hash against the old pointer (zero I/O). On a match, byte-verify against the fetched old chain and, if equal, build a pointer varlena that references the old chain verbatim and skip re-storing.

The byte-verify makes the hash a pure performance prefilter, never a correctness dependency: a collision merely wastes a fetch and falls through to a normal re-store. Sharing is safe without refcounts because RECNO UPDATE is strictly in-place and VACUUM Pass 1 already unions chain locators from every live HAS_OVERFLOW tuple, so a shared chain stays live while any referencer is.

RECNO: test VACUUM retains version diffs a live snapshot needs (WS-PVS4)

Committed in-place UPDATEs keep their before-image only as a byte-diff in the per-relation UNDO fork; RelUndoVacuum discards diffs whose updater xid precedes GetOldestNonRemovableTransactionId. A live REPEATABLE READ snapshot holds that horizon back, so VACUUM must leave the diff intact and the snapshot must still reconstruct the prior version.

The existing recno-retained-reclamation spec never runs VACUUM and recno-vacuum-concurrent only checks DELETE visibility, so the retention gate had no direct coverage. This spec pins a snapshot, updates+commits+VACUUMs the row, and asserts the snapshot still reads the original value.

RECNO: reserve version-pointer headroom at insert (WS-PVS1 fix)

Reserve sizeof(RelUndoRecPtr) trailing headroom on every inserted tuple so the first committed in-place UPDATE is a same-length overwrite rather than an +8-byte growth. Without this, a densely packed page of fixed-width rows has no residual free space for the per-row +8 growth, and a multi-row UPDATE fails with "updated recno tuple does not fit on page" once more than a handful of rows on one page are updated in a single command.

Every inserted tuple is now born with RECNO_TUPLE_HAS_VERSION_PTR set and an InvalidRelUndoRecPtr trailing field. Readers already treat an invalid chain head as "no history", so never-updated rows are semantically unchanged; the first UPDATE stamps the real chain-head pointer into the pre-reserved slot via the existing CAS fast path.

Patched both the single-insert (recno_tuple_insert) and batch (recno_multi_insert) paths. recno regression expected sizes updated to reflect the +8B/row on-page footprint (HEAP sizes unchanged). recno isolation suite: 149/149.
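The trailing version-pointer accessors described in the first of these commits can be sketched as follows. This is an approximation, not RecnoTupleGetVersionPtr/SetVersionPtr themselves: the SketchRelUndoRecPtr typedef, the flag value, and the function signatures are assumptions. What it does illustrate is locating the field from the slot length (the role ItemIdGetLength plays on a real page) and copying it with memcpy, so an unaligned slot tail is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t SketchRelUndoRecPtr;   /* assumed 8-byte chain-head pointer */

#define SKETCH_HAS_VERSION_PTR 0x01u    /* hypothetical header flag bit */

/* Write the pointer into the last 8 bytes of the on-page slot. */
static bool
sketch_set_version_ptr(unsigned char *slot, size_t slot_len,
                       unsigned flags, SketchRelUndoRecPtr ptr)
{
    assert(slot != NULL);
    if ((flags & SKETCH_HAS_VERSION_PTR) == 0 || slot_len < sizeof(ptr))
        return false;
    /* memcpy rather than a cast: the tail need not be 8-byte aligned */
    memcpy(slot + slot_len - sizeof(ptr), &ptr, sizeof(ptr));
    return true;
}

/* Read the pointer back from the slot tail; false if the flag is unset. */
static bool
sketch_get_version_ptr(const unsigned char *slot, size_t slot_len,
                       unsigned flags, SketchRelUndoRecPtr *out)
{
    assert(slot != NULL && out != NULL);
    if ((flags & SKETCH_HAS_VERSION_PTR) == 0 || slot_len < sizeof(*out))
        return false;
    memcpy(out, slot + slot_len - sizeof(*out), sizeof(*out));
    return true;
}
```

Keying on the slot length rather than a logical length stored in the tuple means the field is always found at the physical end of the slot, including when headroom was reserved at insert time and not yet stamped.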

RecnoVacuumCrossPageDefrag held table buffer content locks (the source page EXCLUSIVE across the whole move loop, plus a re-read SHARE lock on the moved tuple) while calling index_insert. index_insert descends into _bt_search, which takes index leaf locks: a TABLE->INDEX lock order. Concurrent nbtree bottom-up deletion takes INDEX->TABLE (it SHARE-locks the table page from under the index leaf lock in recno_index_delete_tuples). Buffer content locks are LWLock-style and invisible to the deadlock detector, so the cycle hangs forever rather than being aborted.

Defer index maintenance until every table page content lock is released. While the destination page is still locked, copy each moved tuple off-page into a palloc'd buffer and record its new TID; after UnlockReleaseBuffer on both source and destination pages, replay the index inserts from those copies with no table content lock held. This mirrors heap VACUUM, which never calls index_insert under a heap buffer content lock. Safe because overflow-carrying pages are skipped by the defrag, so the copied tuple bytes are self-contained.
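The shape of the fix (copy under lock, release, then replay) can be sketched with a boolean standing in for the table buffer content locks. All names here are hypothetical, malloc replaces palloc, and sketch_index_insert merely refuses to run while the "lock" is held so the ordering is checkable; none of this is the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct SketchState
{
    bool table_lock_held;   /* stands in for the table content locks */
    int  index_inserts;     /* number of successful index inserts */
} SketchState;

typedef struct SketchDeferredInsert
{
    unsigned char *copy;    /* off-page copy of the moved tuple */
    size_t         len;
    uint64_t       new_tid; /* where the tuple now lives */
} SketchDeferredInsert;

/* Stand-in for index_insert: must never run under a table content lock. */
static bool
sketch_index_insert(SketchState *st, const SketchDeferredInsert *d)
{
    if (st->table_lock_held || d->copy == NULL)
        return false;
    st->index_inserts++;
    return true;
}

/* Move n tuples; returns the number of index inserts done, or -1 on OOM. */
static int
sketch_defrag_move(SketchState *st, const unsigned char **tuples,
                   const size_t *lens, int n, uint64_t first_new_tid)
{
    SketchDeferredInsert *pending;
    int done = 0;

    assert(st != NULL);
    if (n <= 0)
        return 0;
    pending = calloc((size_t) n, sizeof(*pending));
    if (pending == NULL)
        return -1;

    st->table_lock_held = true;     /* lock source and destination pages */
    for (int i = 0; i < n; i++)
    {
        /* Copy the tuple off-page and remember its new TID. */
        pending[i].copy = malloc(lens[i]);
        if (pending[i].copy != NULL)
            memcpy(pending[i].copy, tuples[i], lens[i]);
        pending[i].len = lens[i];
        pending[i].new_tid = first_new_tid + (uint64_t) i;
    }
    st->table_lock_held = false;    /* release both pages first ... */

    for (int i = 0; i < n; i++)     /* ... then replay the index inserts */
    {
        if (sketch_index_insert(st, &pending[i]))
            done++;
        free(pending[i].copy);
    }
    free(pending);
    return done;
}
```

Moving the index_insert calls into the first loop would make every one of them fail in this sketch, which is the toy analogue of the TABLE->INDEX ordering the commit removes.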

gburd added 11 commits July 3, 2026 09:51

Add left-right lock (LRLock), a wait-free read lock primitive

2034adc

github-actions Bot force-pushed the master branch 2 times, most recently from 00f76fb to da08252 Compare July 3, 2026 17:16

gburd added 3 commits July 3, 2026 15:01

[DO NOT MERGE] Benchmarks: RECNO and UNDO performance test suite

b29db98

github-actions Bot force-pushed the master branch from da08252 to 6850c08 Compare July 3, 2026 19:05

github-actions Bot force-pushed the master branch from 6850c08 to 3b6f413 Compare July 3, 2026 20:55

gburd force-pushed the undo branch from 2d70341 to 078c936 Compare July 3, 2026 20:56

github-actions Bot force-pushed the master branch from 3b6f413 to 5c6cce4 Compare July 3, 2026 21:59

gburd wants to merge 15 commits into master from undo
