Add support for ARIES UNDO and constant time recovery#28

Draft
gburd wants to merge 15 commits into master from undo

gburd commented Jul 3, 2026
Owner

No description provided.

gburd added 11 commits July 3, 2026 09:51
  - Hourly upstream sync from postgres/postgres (24x daily)
  - AI-powered PR reviews using AWS Bedrock Claude Sonnet 4.5
  - Multi-platform CI via existing Cirrus CI configuration
  - Cost tracking and comprehensive documentation

  Features:
  - Automatic issue creation on sync conflicts
  - PostgreSQL-specific code review prompts (C, SQL, docs, build)
  - Cost limits: $15/PR, $200/month
  - Inline PR comments with security/performance labels
  - Skip draft PRs to save costs

  Documentation:
  - .github/SETUP_SUMMARY.md - Quick setup overview
  - .github/QUICKSTART.md - 15-minute setup guide
  - .github/PRE_COMMIT_CHECKLIST.md - Verification checklist
  - .github/docs/ - Detailed guides for sync, AI review, Bedrock

  See .github/README.md for complete overview

Complete Phase 3: Windows builds + fix sync for CI/CD commits

Phase 3: Windows Dependency Build System
- Implement full build workflow (OpenSSL, zlib, libxml2)
- Smart caching by version hash (80% cost reduction)
- Dependency bundling with manifest generation
- Weekly auto-refresh + manual triggers
- PowerShell download helper script
- Comprehensive usage documentation

Sync Workflow Fix:
- Allow .github/ commits (CI/CD config) on master
- Detect and reject code commits outside .github/
- Merge upstream while preserving .github/ changes
- Create issues only for actual pristine violations

Documentation:
- Complete Windows build usage guide
- Update all status docs to 100% complete
- Phase 3 completion summary

All three CI/CD phases complete (100%):
✅ Hourly upstream sync with .github/ preservation
✅ AI-powered PR reviews via Bedrock Claude 4.5
✅ Windows dependency builds with smart caching

Cost: $40-60/month total
See .github/PHASE3_COMPLETE.md for details

Fix sync to allow 'dev setup' commits on master

The sync workflow was failing because the 'dev setup v19' commit
modifies files outside .github/. Updated workflows to recognize
commits with messages starting with 'dev setup' as allowed on master.

Changes:
- Detect 'dev setup' commits by message pattern (case-insensitive)
- Allow merge if commits are .github/ OR dev setup OR both
- Update merge messages to reflect preserved changes
- Document pristine master policy with examples

This allows personal development environment commits (IDE configs,
debugging tools, shell aliases, Nix configs, etc.) on master without
violating the pristine mirror policy.

Future dev environment updates should start with 'dev setup' in the
commit message to be automatically recognized and preserved.

See .github/docs/pristine-master-policy.md for complete policy
See .github/DEV_SETUP_FIX.md for fix summary

Optimize CI/CD costs by skipping builds for pristine commits

Add cost optimization to Windows dependency builds to avoid expensive
builds when only pristine commits are pushed (dev setup commits or
.github/ configuration changes).

Changes:
- Add check-changes job to detect pristine-only pushes
- Skip Windows builds when all commits are dev setup or .github/ only
- Add comprehensive cost optimization documentation
- Update README with cost savings (~40% reduction)

Expected savings: ~$3-5/month on Windows builds, ~$40-47/month total
through combined optimizations.

Manual dispatch and scheduled builds always run regardless.
Sparsemap is a memory-efficient data structure for maintaining sparse
sets of integers using hierarchical bitmaps.  It supports O(1) set/get
operations and efficient iteration over set bits while using far less
memory than a dense bitmap for sparse populations.

The implementation provides:
  - sparsemap_set/get/is_set for individual bit manipulation
  - sparsemap_scan for efficient forward iteration
  - sparsemap_select for rank-based selection
  - Configurable initial capacity with automatic growth

Used by the UNDO subsystem for tracking allocated pages and by RECNO
for free-space management within relation forks.

Includes a TAP regression test module (test_sparsemap) exercising all
public API operations.
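
As a rough illustration of the hierarchical-bitmap idea, and not the sparsemap code from this commit, the sketch below keeps a one-word summary level over 64 fixed chunks so a scan can skip empty chunks. Every name here (toy_sparsemap, toy_set, toy_is_set, toy_scan) is hypothetical, and the fixed 4096-bit capacity is an assumption made to keep it short; the real structure grows automatically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy: 64 chunks of 64 bits, covering integers 0..4095. */
typedef struct toy_sparsemap
{
    uint64_t summary;       /* bit c set => chunks[c] is non-zero */
    uint64_t chunks[64];
} toy_sparsemap;

static void
toy_set(toy_sparsemap *m, unsigned idx, bool value)
{
    unsigned c = idx / 64;
    uint64_t bit = UINT64_C(1) << (idx % 64);

    assert(idx < 64 * 64);
    if (value)
        m->chunks[c] |= bit;
    else
        m->chunks[c] &= ~bit;

    /* Keep the summary level in sync with the chunk's emptiness. */
    if (m->chunks[c] != 0)
        m->summary |= UINT64_C(1) << c;
    else
        m->summary &= ~(UINT64_C(1) << c);
}

static bool
toy_is_set(const toy_sparsemap *m, unsigned idx)
{
    assert(idx < 64 * 64);
    return ((m->chunks[idx / 64] >> (idx % 64)) & 1) != 0;
}

/* Return the first set index >= start, or -1 if there is none. */
static int
toy_scan(const toy_sparsemap *m, unsigned start)
{
    for (unsigned c = start / 64; c < 64; c++)
    {
        uint64_t w;

        if (!(m->summary & (UINT64_C(1) << c)))
            continue;           /* whole chunk is empty, skip it */
        w = m->chunks[c];
        if (c == start / 64)
            w &= ~UINT64_C(0) << (start % 64);  /* drop bits below start */
        for (unsigned b = 0; b < 64; b++)
            if (w & (UINT64_C(1) << b))
                return (int) (c * 64 + b);
    }
    return -1;
}
```

The summary word is what makes iteration cheap for sparse populations: a chunk with no set bits costs one test instead of 64.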
Header-only implementation of a probabilistic skip list providing
O(log n) insert, delete, and lookup operations with O(n) space.
Compared to rbtree, skip lists offer simpler implementation, better
cache locality for sequential scans, and lock-free read potential.

The implementation provides:
  - Type-safe macros for defining typed skip lists (DEFINE_SKIPLIST)
  - Configurable maximum height (up to 32 levels)
  - Forward iteration via SKIPLIST_FOREACH
  - Range queries and nearest-neighbor lookup
  - Memory allocation via palloc (TopMemoryContext by default)

Used by the UNDO subsystem for maintaining ordered transaction
metadata and by RECNO for HLC-ordered page directories.

Includes a TAP regression test module (test_skiplist) exercising
insertion, deletion, iteration, and edge cases.
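
For readers unfamiliar with the structure, here is a minimal sketch of a probabilistic skip list with insert and lookup. It is not the header from this commit: the toy_* names are hypothetical, it is not macro-generated or type-generic, it uses malloc rather than palloc, and the maximum height of 8 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_MAX_HEIGHT 8

typedef struct toy_node
{
    int key;
    int height;
    struct toy_node *next[TOY_MAX_HEIGHT];
} toy_node;

typedef struct toy_skiplist
{
    toy_node head;      /* sentinel; its key is never compared */
    int height;         /* highest level currently in use */
} toy_skiplist;

static void
toy_sl_init(toy_skiplist *sl)
{
    sl->height = 1;
    sl->head.height = TOY_MAX_HEIGHT;
    for (int i = 0; i < TOY_MAX_HEIGHT; i++)
        sl->head.next[i] = NULL;
}

/* Coin-flip height: each extra level is kept with probability 1/2. */
static int
toy_random_height(void)
{
    int h = 1;

    while (h < TOY_MAX_HEIGHT && (rand() & 1))
        h++;
    return h;
}

/* Insert key; returns false on duplicate or allocation failure. */
static bool
toy_sl_insert(toy_skiplist *sl, int key)
{
    toy_node *update[TOY_MAX_HEIGHT];
    toy_node *x = &sl->head;
    toy_node *n;

    /* Descend from the top level, remembering the predecessor per level. */
    for (int i = TOY_MAX_HEIGHT - 1; i >= 0; i--)
    {
        while (x->next[i] != NULL && x->next[i]->key < key)
            x = x->next[i];
        update[i] = x;
    }
    if (x->next[0] != NULL && x->next[0]->key == key)
        return false;

    n = malloc(sizeof(toy_node));
    if (n == NULL)
        return false;
    n->key = key;
    n->height = toy_random_height();
    for (int i = 0; i < n->height; i++)
    {
        n->next[i] = update[i]->next[i];
        update[i]->next[i] = n;
    }
    if (n->height > sl->height)
        sl->height = n->height;
    return true;
}

static bool
toy_sl_contains(const toy_skiplist *sl, int key)
{
    const toy_node *x = &sl->head;

    for (int i = sl->height - 1; i >= 0; i--)
        while (x->next[i] != NULL && x->next[i]->key < key)
            x = x->next[i];
    x = x->next[0];
    return x != NULL && x->key == key;
}
```

Level 0 is an ordinary sorted linked list, which is why forward iteration is a plain pointer walk.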
Introduce two TableAmRoutine booleans and the begin_bulk_insert callback
that the UNDO subsystem builds on, plus the RelationAmSupportsUndo()
accessor that index AMs use to gate UNDO record generation on the parent
table.

am_supports_undo marks an AM that registers an UNDO resource manager and
emits UNDO records tagged with its own rmid; the UNDO core stays
AM-agnostic and interprets the payload only through that RM's callbacks.
am_inplace_update_keeps_tid marks an AM that updates in place and keeps
the row's TID, so the executor can skip redundant index re-inserts for
unchanged keys.  The heap AM leaves both false.

This commit only adds the routine fields and the accessor; no AM sets the
flags yet.
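
The way callers might consult the two flags can be sketched as follows. This is a stand-alone model, not the patch: ToyTableAmRoutine, ToyRelation, and both helper functions are hypothetical stand-ins for the real PostgreSQL types, and the executor rule deliberately ignores heap's own HOT optimization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two new routine fields. */
typedef struct ToyTableAmRoutine
{
    bool am_supports_undo;
    bool am_inplace_update_keeps_tid;
} ToyTableAmRoutine;

typedef struct ToyRelation
{
    const ToyTableAmRoutine *tableam;   /* NULL if not a table */
} ToyRelation;

/* Accessor in the spirit of RelationAmSupportsUndo(). */
static bool
ToyRelationAmSupportsUndo(const ToyRelation *rel)
{
    return rel != NULL && rel->tableam != NULL &&
        rel->tableam->am_supports_undo;
}

/*
 * Simplified executor-side decision: an in-place AM that keeps the TID
 * needs no new index entry when the indexed keys did not change.
 */
static bool
toy_needs_index_reinsert(const ToyRelation *rel, bool keys_changed)
{
    assert(rel != NULL && rel->tableam != NULL);
    if (rel->tableam->am_inplace_update_keeps_tid && !keys_changed)
        return false;
    return true;
}
```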
Add the AM-agnostic UNDO engine: the in-WAL UNDO record format and
insertion path, the per-relation RELUNDO fork with its own resource
manager, the shared sLog tuple-state map, the rollback apply driver and
compensation-log generation, the discard horizon, and the background
revert/undo workers.  Register the UNDO, ATM, and RELUNDO resource
managers and wire the subsystem into transaction start/commit/abort,
two-phase commit, recovery, and process startup.

The engine interprets UNDO payloads only through per-RM callbacks and the
RelUndo*_hook function pointers (defined here, left NULL), so the core has
no compile-time knowledge of any specific access method.  Heap, vacuum,
pruning, reloptions, and executor integration consume only the
AM-agnostic interfaces.

No UNDO-producing AM is registered yet: RegisterUndoRmgrs() initializes
the dispatch table but registers no per-AM handlers.  The index-AM apply
handlers and the AM that sets am_supports_undo arrive in later commits.
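
The "initialized but empty dispatch table" state described above can be pictured with a small model. The names and the callback signature below are assumptions made for illustration; the actual per-RM callback set in the patch is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_UNDO_MAX_RMID 16

/* Hypothetical per-RM apply callback: interprets an opaque payload. */
typedef bool (*toy_undo_apply_fn) (const void *payload, size_t len);

static toy_undo_apply_fn toy_undo_dispatch[TOY_UNDO_MAX_RMID];

/* Initialize the table; registers no per-AM handlers. */
static void
toy_register_undo_rmgrs(void)
{
    for (int i = 0; i < TOY_UNDO_MAX_RMID; i++)
        toy_undo_dispatch[i] = NULL;
}

static void
toy_undo_register_handler(unsigned rmid, toy_undo_apply_fn fn)
{
    assert(rmid < TOY_UNDO_MAX_RMID);
    toy_undo_dispatch[rmid] = fn;
}

/* The core never looks inside the payload; it only routes by rmid. */
static bool
toy_undo_apply(unsigned rmid, const void *payload, size_t len)
{
    if (rmid >= TOY_UNDO_MAX_RMID || toy_undo_dispatch[rmid] == NULL)
        return false;           /* no handler registered for this RM */
    return toy_undo_dispatch[rmid] (payload, len);
}

/* Sample handler used only to exercise the dispatch path. */
static size_t toy_last_len;

static bool
toy_sample_handler(const void *payload, size_t len)
{
    (void) payload;
    toy_last_len = len;
    return true;
}
```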
Add the UNDO resource-manager handlers for the nbtree and hash index
AMs and register them from RegisterUndoRmgrs().  On rollback of an
aborting transaction, the nbtree handler re-descends to the leaf entry
by key and heap TID before marking it dead, so a committed entry that
shifted onto the recorded slot under concurrent inserts or leaf splits
is never killed; entries inside posting-list tuples are left for VACUUM.
The hash handler reverses its own inserts analogously.  Both are gated by
RelationAmSupportsUndo() on the parent table, so they are inert until an
UNDO-supporting table AM exists.
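
Why the handler matches on key and heap TID instead of trusting the recorded slot can be seen in a toy model of a leaf page. The flat array and the toy_* names are assumptions for illustration; the real handler re-descends the B-tree and has to deal with posting lists and page locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical leaf entry: index key, heap TID, and a dead marker. */
typedef struct toy_leaf_entry
{
    int key;
    unsigned tid;
    bool dead;
} toy_leaf_entry;

/*
 * Undo an index insert by locating the entry with BOTH the recorded key
 * and the recorded heap TID.  Returns the slot marked dead, or -1 if the
 * entry is no longer on this leaf.
 */
static int
toy_undo_index_insert(toy_leaf_entry *leaf, int nentries,
                      int key, unsigned tid)
{
    for (int i = 0; i < nentries; i++)
    {
        if (leaf[i].key == key && leaf[i].tid == tid)
        {
            leaf[i].dead = true;
            return i;
        }
    }
    return -1;
}
```

If the aborted insert was recorded at slot 1 and a concurrent committed insert later shifted it to slot 2, a slot-based undo would kill the committed entry; the key-plus-TID match does not.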
Add pg_setxattr/pg_getxattr/pg_removexattr/pg_listxattr wrappers over the
platform extended-attribute syscalls (Linux/*BSD/macOS, no-op stubs where
unsupported) and build them into libpgport.  The transactional file-ops
resource manager added next uses these to record and reverse xattr
mutations.
Filesystem mutations the server performs outside the buffer manager --
creating and removing directories, copying trees, writing version files,
managing symlinks, setting extended attributes -- historically had no WAL
coverage and relied on best-effort cleanup callbacks that do not survive a
crash.  Add FILEOPS: a resource manager (RM_FILEOPS_ID) that records each
mutation as a deferred pending operation, executes it at commit, and
WAL-logs it so redo reproduces it during crash recovery and standby replay.

This commit adds the operation engine, the public FileOps* API, the WAL
record formats, the redo handler, and the rmgrdesc descriptors.  A README
under src/backend/storage/file explains why FILEOPS exists and how to use
the API.  The rollback (UNDO) side and the rewiring of existing callers
follow in subsequent commits.
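
The deferred-operation model (register now, execute at commit, drop at abort) might look roughly like this. It is a sketch under assumptions: the queue type, the toy_* function names, and the executor callback are hypothetical and do not reproduce the FileOps* API; a real commit path would also WAL-log each operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

typedef enum toy_fileop_kind
{
    TOY_OP_MKDIR,
    TOY_OP_RMDIR,
    TOY_OP_SYMLINK
} toy_fileop_kind;

typedef struct toy_pending_op
{
    toy_fileop_kind kind;
    const char *path;
} toy_pending_op;

typedef struct toy_fileops_queue
{
    toy_pending_op ops[TOY_MAX_PENDING];
    size_t n;
} toy_fileops_queue;

typedef void (*toy_exec_fn) (const toy_pending_op *op, void *arg);

/* Record the mutation; nothing touches the filesystem yet. */
static bool
toy_fileops_register(toy_fileops_queue *q, toy_fileop_kind kind,
                     const char *path)
{
    if (q->n == TOY_MAX_PENDING)
        return false;
    q->ops[q->n].kind = kind;
    q->ops[q->n].path = path;
    q->n++;
    return true;
}

/* At commit: run every pending op in registration order. */
static size_t
toy_fileops_at_commit(toy_fileops_queue *q, toy_exec_fn exec, void *arg)
{
    size_t done = q->n;

    for (size_t i = 0; i < q->n; i++)
        exec(&q->ops[i], arg);
    q->n = 0;
    return done;
}

/* At abort: the deferred ops never ran, so dropping them is enough. */
static void
toy_fileops_at_abort(toy_fileops_queue *q)
{
    q->n = 0;
}

/* Example executor that only counts what it is asked to run. */
static void
toy_count_exec(const toy_pending_op *op, void *arg)
{
    (void) op;
    (*(int *) arg)++;
}
```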
The FILEOPS engine added in the previous commit makes filesystem
mutations crash-safe via WAL redo, but redo alone only reproduces a
committed operation -- it cannot reverse one when the surrounding
transaction aborts.  Operations such as chmod, chown, truncate, and
setxattr overwrite prior state in place, so undoing them requires the
before-image that was captured when the operation ran.

This commit adds fileops_undo.c, which registers a handler with the
UNDO machinery (RM_UNDO_ID, subtypes FILEOPS_UNDO_*).  For each
reversible operation it records the before-image -- the original mode
for a chmod, the prior owner for a chown, the original length for a
truncate, the previous extended-attribute value for a setxattr, and so
on -- and during UNDO application performs the inverse action: unlink a
created file, rename back, or restore the saved mode, owner, length, or
xattr value.  FileopsUndoRmgrInit() registers the handler at startup
alongside the other built-in UNDO resource managers.
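
The before-image idea for chmod can be shown with plain POSIX calls. This is a sketch, not fileops_undo.c: the toy_chmod_undo struct and both functions are hypothetical, and the before-image lives in memory here instead of in an UNDO record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical before-image for one chmod. */
typedef struct toy_chmod_undo
{
    const char *path;
    mode_t old_mode;        /* permission bits before the change */
} toy_chmod_undo;

/* Capture the current mode, then apply the new one. */
static bool
toy_chmod_with_undo(const char *path, mode_t new_mode, toy_chmod_undo *undo)
{
    struct stat st;

    assert(path != NULL && undo != NULL);
    if (stat(path, &st) != 0)
    {
        perror("stat");
        return false;
    }
    undo->path = path;
    undo->old_mode = st.st_mode & 07777;
    if (chmod(path, new_mode) != 0)
    {
        perror("chmod");
        return false;
    }
    return true;
}

/* Inverse action: restore the saved mode. */
static bool
toy_chmod_apply_undo(const toy_chmod_undo *undo)
{
    return chmod(undo->path, undo->old_mode) == 0;
}
```

The same pattern (save prior state, apply, restore on undo) extends to owner, length, and xattr values; only the captured field changes.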

The storage/file build manifests are updated to compile fileops_undo.c,
and undo.c now includes storage/fileops.h and calls
FileopsUndoRmgrInit() from RegisterUndoRmgrs().
With the FILEOPS engine and its UNDO handlers in place, replace the
raw filesystem mutations in the transaction-sensitive code paths with
the FileOps* API so they become atomic with the transaction and
crash-safe.  CREATE/DROP DATABASE (dbcommands.c), CREATE/DROP
TABLESPACE and ALTER DATABASE ... SET TABLESPACE (tablespace.c), and
the directory-copy helper (copydir.c) now register deferred operations
instead of calling mkdir/symlink/rename/rmtree directly, and xact.c
drives the pending-op queue at commit, abort, subtransaction
boundaries, and PREPARE.

pg_waldump gains the fileops rmgr descriptor (fileopsdesc.c symlink
plus its rmgrdesc.c table entry) so XLOG_FILEOPS_* records render in
waldump output.  The change ships user documentation (fileops.sgml and
its filelist/postgres wiring, a worked example), a test_fileops
contrib module exercising the API, regression coverage
(regress/fileops.sql), and recovery TAP tests that crash mid-operation
and verify redo and rollback.  typedefs.list records the new types.
github-actions bot force-pushed the master branch 2 times, most recently from 00f76fb to da08252 on July 3, 2026 17:16
gburd added 3 commits July 3, 2026 15:01
PostgreSQL's heap appends a new tuple version on every UPDATE and
abandons the old one to VACUUM.  For update-heavy workloads on narrow
hot rows -- counters, queue heads, running aggregates, time-series
tail writes -- the resulting version churn dominates buffer traffic,
bloats relations, and keeps autovacuum permanently behind.

RECNO is a table access method (amname "recno", RM_RECNO_ID) that
updates tuples in place.  An UPDATE overwrites the committed bytes on
the main fork; the prior image is preserved in the per-relation UNDO
fork (RELUNDO_FORKNUM) so a ROLLBACK restores it and a concurrent
snapshot reader can still reconstruct the version it is entitled to
see.  Heap is untouched; a relation opts in with USING recno.

Visibility is driven by a hybrid logical clock rather than per-tuple
xmin/xmax scans.  Each tuple header carries an HLC commit stamp; a
snapshot captures an HLC horizon and a tuple is visible when its
stamp precedes the horizon and its writer has committed.  Transient
state (uncommitted / deleted / updated) lives in header flag bits
that UNDO clears on rollback through the RelUndoClearTransientFlags
hook, and in-place delta reversal runs through RelUndoReverseDelta;
RECNO installs both at init via RecnoRelUndoInstallHooks() so the
UNDO core never sees a RECNO tuple layout.
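
The visibility rule stated above (stamp precedes the horizon and the writer has committed) reduces to a short predicate. The sketch below is illustrative only: the toy_* names are hypothetical, and the HLC layout (wall-clock milliseconds shifted left by 16 bits, with a logical counter in the low bits) is an assumption, not something this message specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: (wall-clock ms << 16) | 16-bit logical counter. */
typedef uint64_t toy_hlc;

/* Advance the clock: never go backwards even if wall time does. */
static toy_hlc
toy_hlc_next(toy_hlc last, uint64_t wall_ms)
{
    toy_hlc from_wall = wall_ms << 16;

    return from_wall > last ? from_wall : last + 1;
}

/* Hypothetical slice of a tuple header. */
typedef struct toy_tuple_header
{
    toy_hlc commit_stamp;
    bool writer_committed;
} toy_tuple_header;

/* Visible iff the writer committed and its stamp precedes the horizon. */
static bool
toy_tuple_is_visible(const toy_tuple_header *t, toy_hlc snapshot_horizon)
{
    return t->writer_committed && t->commit_stamp < snapshot_horizon;
}
```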

Same-page updates rewrite the slot directly; updates that no longer
fit spill to an overflow chain with optional per-attribute dictionary
compression (LZ4/ZSTD, ANALYZE-refreshed).  A sparsemap-backed dirty
map and a partitioned secondary log (sLog) serve before-images to old
readers without a separate version-store cache; free space and
visibility are tracked in RECNO-private FSM and VM forks.

WAL coverage is complete: insert, in-place and out-of-place update,
delete, tuple-lock, VACUUM, dict writes, and overflow all log redo
records under RM_RECNO_ID with matching recovery routines, and crash
recovery cooperates with the UNDO driver to roll back in-place loser
transactions.  Index AMs over a RECNO table force a heap recheck on
index-only scans and tolerate stale index entries left by in-place
updates until before-image reclamation retires them.

The commit also wires RECNO into the supporting infrastructure:
pg_am / pg_proc catalog entries, the recno rmgr description routine,
pageinspect coverage (pageinspect 1.13 -> 1.14), the in-tree
documentation chapter, and isolation, regression, and crash-recovery
test suites exercising MVCC correctness, overflow, dictionary
compression, and dual-mode UNDO rollback.
Every committed in-place UPDATE now appends an 8-byte RelUndoRecPtr to
its new on-page image, guarded by RECNO_TUPLE_HAS_VERSION_PTR, pointing
at the head of the relation's persistent version chain in the UNDO fork.
Never-updated rows stay at their base length; a row grows by 8 bytes
once on its first UPDATE and stays that size for subsequent same-size
updates on the CAS fast path.

The CAS fast path now targets base+8 and stamps the freshly reserved
RelUndoRecPtr before computing the byte-diff, so the diff and the page
memcpy both see the final slot layout. First-time +8 growth is performed
on the exclusive-lock path, sized to the full slot with the verptr at
the tail and logged as a full image to avoid primary/replica divergence.
The widen-to-full-slot fixup runs before the critical section, since its
repalloc/pfree are forbidden once START_CRIT_SECTION is entered.

Accessors RecnoTupleGetVersionPtr/SetVersionPtr read and write the
trailing field via memcpy (unaligned-safe) from the on-page slot tail,
keyed on ItemIdGetLength rather than any logical length field.
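
A stand-alone sketch of unaligned-safe accessors for a trailing 8-byte field follows. The toy_* names are hypothetical, the slot length is passed in directly instead of coming from ItemIdGetLength, and treating 0 as the invalid pointer value is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t ToyRelUndoRecPtr;

#define TOY_INVALID_REL_UNDO_REC_PTR ((ToyRelUndoRecPtr) 0)

/* Read the trailing pointer; memcpy avoids any alignment requirement. */
static ToyRelUndoRecPtr
toy_get_version_ptr(const unsigned char *slot, size_t slot_len)
{
    ToyRelUndoRecPtr p;

    assert(slot_len >= sizeof(p));
    memcpy(&p, slot + slot_len - sizeof(p), sizeof(p));
    return p;
}

/* Write the trailing pointer into the last 8 bytes of the slot. */
static void
toy_set_version_ptr(unsigned char *slot, size_t slot_len, ToyRelUndoRecPtr p)
{
    assert(slot_len >= sizeof(p));
    memcpy(slot + slot_len - sizeof(p), &p, sizeof(p));
}

/* An invalid chain head means the row has no prior versions. */
static bool
toy_has_history(const unsigned char *slot, size_t slot_len)
{
    return toy_get_version_ptr(slot, slot_len) != TOY_INVALID_REL_UNDO_REC_PTR;
}
```

Locating the field from the physical slot length means the accessor keeps working when the logical tuple length and the on-page slot length differ.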

Also fixes two latent reverse-apply bugs that activate once reads rely
on reconstruction: size the DELTA_UPDATE reconstruction buffer to the
diff's old_total_len (not the current on-page length, which a shrinking
update makes too small), and promote the slot-too-small restore skip
from DEBUG1 to WARNING.

RECNO: reconstruct prior versions from UNDO fork, drain sLog (WS-PVS2/3/4)

Implement the CTR paper's sLog/PVS division of labor: the durable
per-relation UNDO fork becomes the Persistent Version Store, and MVCC
readers reconstruct prior tuple versions on demand from its byte-diffs
instead of keeping full before-images in the in-memory sLog.

WS-PVS2: RecnoReconstructVisibleVersion (recno_pvs.c) walks a tuple's
version chain via its trailing verptr -> RelUndoReadRecord, reverse-
applying each RELUNDO_DELTA_UPDATE / RELUNDO_UPDATE record until it
reaches the version visible to the reader's snapshot. Replaces the
committed-UPDATE branch of the former SLogTupleGetSharedBeforeImage
call sites on all fetch paths (seq, TID, plain/IOS index, bitmap).
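
The chain walk can be modelled in a few lines. This is a simplified sketch, not RecnoReconstructVisibleVersion: the record layout, the array-index "pointers", the fixed tuple length, and the use of a single commit stamp for visibility are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical undo record: the bytes an UPDATE overwrote. */
typedef struct toy_undo_rec
{
    int prev;                   /* index of the next-older record, -1 = end */
    uint64_t old_stamp;         /* commit stamp of the version it restores */
    size_t off;                 /* where the overwritten bytes start */
    size_t len;                 /* how many bytes were overwritten */
    unsigned char old_bytes[8];
} toy_undo_rec;

/*
 * Rewrite buf in place to the newest version whose stamp precedes the
 * horizon.  Returns false if the chain ends before one is found.
 */
static bool
toy_reconstruct(unsigned char *buf, size_t buflen, uint64_t cur_stamp,
                int head, const toy_undo_rec *recs, uint64_t horizon)
{
    uint64_t stamp = cur_stamp;
    int p = head;

    while (stamp >= horizon)
    {
        const toy_undo_rec *r;

        if (p < 0)
            return false;       /* no version old enough exists */
        r = &recs[p];
        assert(r->len <= sizeof(r->old_bytes));
        assert(r->off + r->len <= buflen);
        memcpy(buf + r->off, r->old_bytes, r->len);     /* reverse-apply */
        stamp = r->old_stamp;
        p = r->prev;
    }
    return true;
}
```

A reader whose snapshot already sees the current version pays nothing; older snapshots pay one reverse-apply per intervening update.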

WS-PVS3: stop publishing committed-UPDATE before-images into the sLog.
Phase 1 removes the DSA before-image allocation and the shared-image
getter; Phase 2 stops retaining committed-UPDATE markers in the flat
hash and migrates the lost-update probe onto the fork
(RecnoTupleHasCommittedUpdateAfter). Together these drain both the DSA
memory limit and the bucket-count table_full saturation.

WS-PVS4: complete the RelUndoRecPtr generation-counter machinery so a
recycled fork page cannot alias a discarded probe record. Bump the
metapage counter on free-list recycle and validate it on read; the
counter is durably logged with the page reinit in one WAL record.

Tests: recno TAP 5/5, recovery 061/063/065, isolation 148/148 all pass.
The pre-existing varlen-growing-UPDATE regress failures and the
select_parallel worker-count flake are unchanged from baseline.

RECNO: COW-reference unchanged overflow columns on UPDATE (WS-Q2 Layer 0)

A narrow-column UPDATE of a wide row re-stored the unchanged wide column
into fresh overflow pages, doubling on-disk size until VACUUM caught up.

Carry a 32-bit whole-value content hash in RecnoOverflowPtr (reclaimed
from the dead ov_padding/ov_flags fields so the struct stays 20 bytes and
RECNO_OVERFLOW_PTR_SIZE is unchanged -- growing it would defeat the
force-shrink UPDATE recovery path). At the exclusive-path form site,
snapshot the old on-page overflow pointers by attnum while the buffer is
locked, then for each over-threshold varlena compare length+hash against
the old pointer (zero I/O). On a match, byte-verify against the fetched
old chain and, if equal, build a pointer varlena that references the old
chain verbatim and skip re-storing. The byte-verify makes the hash a pure
performance prefilter, never a correctness dependency: a collision merely
wastes a fetch and falls through to a normal re-store.
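
The prefilter-then-verify decision can be sketched as below. Assumptions are flagged loudly: FNV-1a stands in for whatever 32-bit hash the patch uses (the message does not say), the toy_overflow_ptr struct is hypothetical and is not the 20-byte RecnoOverflowPtr, and the "fetched old chain" is just an in-memory buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in hash (FNV-1a, 32-bit); the real hash function is not specified. */
static uint32_t
toy_hash32(const unsigned char *p, size_t n)
{
    uint32_t h = UINT32_C(2166136261);

    for (size_t i = 0; i < n; i++)
    {
        h ^= p[i];
        h *= UINT32_C(16777619);
    }
    return h;
}

/* Hypothetical overflow pointer: length, content hash, old chain bytes. */
typedef struct toy_overflow_ptr
{
    uint32_t len;
    uint32_t hash;
    const unsigned char *chain;     /* models the stored old value */
} toy_overflow_ptr;

/*
 * Decide whether the new value may reference the old chain verbatim.
 * Length+hash is only a cheap prefilter; the memcmp is what makes the
 * answer correct, so a hash collision costs a fetch but nothing else.
 */
static bool
toy_can_share_chain(const toy_overflow_ptr *old,
                    const unsigned char *newval, uint32_t newlen)
{
    if (old->len != newlen || old->hash != toy_hash32(newval, newlen))
        return false;           /* rejected without touching the chain */
    return memcmp(old->chain, newval, newlen) == 0;
}
```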

Sharing is safe without refcounts because RECNO UPDATE is strictly
in-place and VACUUM Pass 1 already unions chain locators from every live
HAS_OVERFLOW tuple, so a shared chain stays live while any referencer is.

RECNO: test VACUUM retains version diffs a live snapshot needs (WS-PVS4)

Committed in-place UPDATEs keep their before-image only as a byte-diff in
the per-relation UNDO fork; RelUndoVacuum discards diffs whose updater xid
precedes GetOldestNonRemovableTransactionId.  A live REPEATABLE READ
snapshot holds that horizon back, so VACUUM must leave the diff intact and
the snapshot must still reconstruct the prior version.

The existing recno-retained-reclamation spec never runs VACUUM and
recno-vacuum-concurrent only checks DELETE visibility, so the retention
gate had no direct coverage.  This spec pins a snapshot, updates+commits+
VACUUMs the row, and asserts the snapshot still reads the original value.

RECNO: reserve version-pointer headroom at insert (WS-PVS1 fix)

Reserve sizeof(RelUndoRecPtr) trailing headroom on every inserted tuple
so the first committed in-place UPDATE is a same-length overwrite rather
than an +8-byte growth. Without this, a densely packed page of
fixed-width rows has no residual free space for the per-row +8 growth,
and a multi-row UPDATE fails with "updated recno tuple does not fit on
page" once more than a handful of rows on one page are updated in a
single command.

Every inserted tuple is now born with RECNO_TUPLE_HAS_VERSION_PTR set
and an InvalidRelUndoRecPtr trailing field. Readers already treat an
invalid chain head as "no history", so never-updated rows are
semantically unchanged; the first UPDATE stamps the real chain-head
pointer into the pre-reserved slot via the existing CAS fast path.

Patched both the single-insert (recno_tuple_insert) and batch
(recno_multi_insert) paths. recno regression expected sizes updated to
reflect the +8B/row on-page footprint (HEAP sizes unchanged). recno
isolation suite: 149/149.
RecnoVacuumCrossPageDefrag held table buffer content locks (the source
page EXCLUSIVE across the whole move loop, plus a re-read SHARE lock on
the moved tuple) while calling index_insert. index_insert descends into
_bt_search, which takes index leaf locks: a TABLE->INDEX lock order.
Concurrent nbtree bottom-up deletion takes INDEX->TABLE (it SHARE-locks
the table page from under the index leaf lock in
recno_index_delete_tuples). Buffer content locks are LWLock-style and
invisible to the deadlock detector, so the cycle hangs forever rather
than being aborted.

Defer index maintenance until every table page content lock is
released. While the destination page is still locked, copy each moved
tuple off-page into a palloc'd buffer and record its new TID; after
UnlockReleaseBuffer on both source and destination pages, replay the
index inserts from those copies with no table content lock held. This
mirrors heap VACUUM, which never calls index_insert under a heap buffer
content lock. Safe because overflow-carrying pages are skipped by the
defrag, so the copied tuple bytes are self-contained.
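
The copy-then-replay ordering can be illustrated with a toy model in which a flag stands in for the table buffer content lock. Everything here is hypothetical (toy_defrag, toy_index_insert, the flag, the fake TIDs); the point is only that the index insert asserts no table lock is held when it runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool toy_table_lock_held = false;    /* models the content lock */
static int toy_index_inserts = 0;

/* Would take index leaf locks, so it must run without the table lock. */
static void
toy_index_insert(const unsigned char *tuple, size_t len, unsigned new_tid)
{
    assert(!toy_table_lock_held);
    (void) tuple;
    (void) len;
    (void) new_tid;
    toy_index_inserts++;
}

typedef struct toy_moved
{
    unsigned char *copy;        /* off-page copy of the moved tuple */
    size_t len;
    unsigned new_tid;
} toy_moved;

/* Returns the number of tuples moved and indexed, or -1 on failure. */
static int
toy_defrag(const unsigned char *const *tuples, const size_t *lens, int n)
{
    toy_moved *moved;
    int nmoved = 0;

    if (n <= 0)
        return 0;
    moved = malloc(sizeof(toy_moved) * (size_t) n);
    if (moved == NULL)
        return -1;

    /* Phase 1: under the lock, only copy tuples and record new TIDs. */
    toy_table_lock_held = true;
    for (int i = 0; i < n; i++)
    {
        assert(lens[i] > 0);
        moved[nmoved].copy = malloc(lens[i]);
        if (moved[nmoved].copy == NULL)
            break;
        memcpy(moved[nmoved].copy, tuples[i], lens[i]);
        moved[nmoved].len = lens[i];
        moved[nmoved].new_tid = 100u + (unsigned) i;    /* made-up TID */
        nmoved++;
    }
    toy_table_lock_held = false;    /* models UnlockReleaseBuffer */

    /* Phase 2: replay index inserts with no table lock held. */
    for (int i = 0; i < nmoved; i++)
    {
        toy_index_insert(moved[i].copy, moved[i].len, moved[i].new_tid);
        free(moved[i].copy);
    }
    free(moved);
    return nmoved;
}
```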