pg_fts: a bm25 full-text search engine #27

Draft
gburd wants to merge 109 commits into master from fts

@gburd gburd commented Jul 3, 2026

No description provided.

github-actions bot force-pushed the master branch 6 times, most recently from ce2678e to 82cc694 (July 4, 2026 16:04)
gburd added 24 commits July 4, 2026 13:56
This is the first stage of a native full-text search subsystem intended to
provide true BM25/BM25F relevance ranking with index-only scoring and a
richer query language, addressing long-standing limitations of the
tsvector/tsquery + GIN stack: no corpus statistics (N, avgdl, df) are stored
anywhere, ts_rank is cover-density rather than BM25, and GIN posting lists
carry only TIDs so ranked queries must always recheck the heap.

Rather than land that as one large patch, the work is structured as a
reviewable series (see FTS_NEXTGEN_PLAN.md). This first commit introduces
only the SQL surface, evaluated by sequential scan, with no index access
method -- the same way tsvector/tsquery were originally introduced.

Adds two types:

  ftsdoc    an analyzed document (sorted, de-duplicated terms with term
            frequencies, plus the document length that BM25 will need)
  ftsquery  a parsed boolean query (AND/OR/NOT and grouping)

Both are varlena and TOAST-able with version-tagged binary send/recv formats.
A hand-written recursive-descent parser produces the query; the grammar is
small enough not to warrant a generator. Matching is a boolean stack machine
over the postfix item list, mirroring TS_execute.
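
As an illustration of the stack-machine idea only (not the patch's C code), a minimal Python sketch follows. The `("term", ...)` / `("op", ...)` item encoding and the name `match_postfix` are assumptions made for this example.

```python
def match_postfix(items, doc_terms):
    """Evaluate a postfix boolean query against one document's term set.

    items is a list of (kind, value) pairs: ("term", "quick") pushes a
    membership test; ("op", "AND" | "OR" | "NOT") combines stack entries.
    This encoding is hypothetical, chosen only for the sketch.
    """
    stack = []
    for kind, value in items:
        if kind == "term":
            stack.append(value in doc_terms)
        elif value == "NOT":
            stack.append(not stack.pop())
        elif value in ("AND", "OR"):
            right = stack.pop()
            left = stack.pop()
            stack.append((left and right) if value == "AND" else (left or right))
        else:
            raise ValueError("unknown operator: %r" % (value,))
    if len(stack) != 1:
        raise ValueError("malformed postfix query")
    return stack[0]


# "quick AND NOT lazy" in postfix order.
QUERY = [("term", "quick"), ("term", "lazy"), ("op", "NOT"), ("op", "AND")]
```

Evaluating `QUERY` against the term set `{"quick", "brown"}` succeeds, while `{"quick", "lazy"}` fails because the NOT branch is false.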

The stage-1 tokenizer is deliberately minimal (ASCII case-fold, split on
non-alphanumerics). It is isolated behind fts_analyze_text() so that a later
stage can reuse PostgreSQL's existing text-search parser and dictionary
pipeline (snowball, ispell, synonyms, thesaurus, stopwords) without changing
the types, the operator, or the on-disk format.

Includes a regression test exercising analysis, query parsing and canonical
output, all boolean match cases, sequential-scan use in a WHERE clause, and
error handling for malformed queries.
Add to_ftsdoc(regconfig, text), which runs an installed text search
configuration's parser and dictionary chain via parsetext() and folds the
normalized lexemes into an ftsdoc.  This reuses the existing snowball/ispell/
synonym/thesaurus/stopword pipeline rather than reimplementing tokenization.
Shipped as extension upgrade 1.0 -> 1.1.
Add fts_bm25(doc, query, n_docs, avgdl, dfs), computing the Okapi BM25 score
with Lucene-style IDF and standard k1/b defaults.  Corpus statistics are
caller-supplied for now (the bm25 index AM will maintain them later), which is
enough to validate the scoring math by sequential scan.  Shipped as 1.1 -> 1.2.
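
For reference, the scoring math can be sketched in Python as below. This is a hedged reading of "Okapi BM25 with Lucene-style IDF": the defaults k1=1.2 and b=0.75, the dict-based document representation, and the retained (k1 + 1) numerator factor are assumptions of the sketch, not details confirmed by the commit.

```python
import math


def bm25(doc_tf, doc_len, query_terms, n_docs, avgdl, dfs, k1=1.2, b=0.75):
    """Okapi BM25 with a Lucene-style (always non-negative) IDF.

    doc_tf      term -> term frequency in this document (hypothetical shape)
    doc_len     document length in tokens
    dfs         document frequency of each query term, parallel to query_terms
    """
    score = 0.0
    for term, df in zip(query_terms, dfs):
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # Lucene-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1.0 - b + b * doc_len / avgdl)
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score
```

With caller-supplied statistics, a document of exactly average length containing the term once scores just the IDF, which makes the formula easy to check by hand.
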
Add a real index access method (USING bm25) over an ftsdoc column that answers
the @@@ operator via a bitmap scan.  The build scans the heap, collects
per-term postings (tid, tf), and writes a metapage (N, sum(doclen), nterms), a
sorted dictionary, and chained posting pages -- all through GenericXLog, so the
index is crash-safe and replicated without a custom resource manager.

The scan evaluates the boolean ftsquery by set algebra over posting lists
(AND=intersect, OR=union, NOT=complement against the indexed universe),
matching @@@ semantics exactly with no heap access.  Corpus statistics are
maintained for the coming index-only BM25 scoring.
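
The set-algebra evaluation can be pictured with the following Python sketch, which uses in-memory sets in place of on-disk posting lists. The item encoding and the function name are hypothetical.

```python
def eval_query_sets(items, postings, universe):
    """Evaluate a postfix boolean query by set algebra over posting lists.

    postings  term -> iterable of document ids containing the term
    universe  set of all indexed document ids (needed for NOT)
    """
    stack = []
    for kind, value in items:
        if kind == "term":
            stack.append(set(postings.get(value, ())))
        elif value == "NOT":
            # Complement against the indexed universe.
            stack.append(universe - stack.pop())
        elif value == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(left & right)
        elif value == "OR":
            right, left = stack.pop(), stack.pop()
            stack.append(left | right)
        else:
            raise ValueError("unknown operator: %r" % (value,))
    if len(stack) != 1:
        raise ValueError("malformed postfix query")
    return stack[0]
```

NOT is the only operator that needs the universe, which is why the index has to be able to enumerate every indexed document.
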

The skeleton is build-once: aminsert raises an error directing REINDEX, and
incremental maintenance (pending list + background merge) is a later stage.
Shipped as extension upgrade 1.2 -> 1.3.
Add fts_bm25_opts(doc, query, n_docs, avgdl, k1, b, variant, dfs) supporting
lucene, robertson (classic), atire, and bm25+ IDF/scoring variants with
explicit k1/b, for reproducing reference implementations (Lucene/bm25s) in
conformance tests.  Shipped as 1.3 -> 1.4.
Add fts_highlight(text, query, pre, post) and fts_snippet(text, query, pre,
post, ellipsis, max_tokens), giving FTS5-parity result presentation.  Both
tokenize the source with the same folding as the analyzer and mark query-term
matches; snippet slides a token window and returns the densest match region.
Shipped as 1.4 -> 1.5.
Add tsquery_to_ftsquery() and an ASSIGNMENT cast so existing tsquery values and
queries port to the @@@ operator with minimal churn: &/|/! map to AND/OR/NOT.
The phrase operator <-> degrades to AND with a NOTICE (phrase support is a
later stage), preserving recall.  Shipped as 1.5 -> 1.6.
Add prefix matching to the query language: a trailing '*' on a term (e.g.
quick*) matches any document term with that prefix.  Implemented in the parser
(a per-item FTS_QF_PREFIX flag, carried through send/recv), the sequential
matcher (binary-search lower bound on the sorted term set), and the bm25 index
scan (union the posting lists of all dictionary terms sharing the prefix).
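
A small Python sketch of the lower-bound technique, assuming a sorted list of ASCII terms (Python compares strings by code point, which coincides with byte order for ASCII). The function names are illustrative only.

```python
from bisect import bisect_left


def has_prefix(sorted_terms, prefix):
    """Sequential-matcher case: does any document term start with prefix?"""
    i = bisect_left(sorted_terms, prefix)
    return i < len(sorted_terms) and sorted_terms[i].startswith(prefix)


def terms_with_prefix(sorted_terms, prefix):
    """Index-scan case: collect every dictionary term sharing the prefix.

    Matching terms are contiguous in sorted order, so the walk starts at the
    lower bound and stops at the first term that no longer matches.
    """
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches
```

In the index scan, the posting lists of the returned terms would then be unioned.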

Phrase and NEAR need per-term positions, which the stage-1 ftsdoc format omits;
they follow as an ftsdoc v2 format addition.
Add fts_index_stats(regclass) -> (ndocs, avgdl, nterms) and fts_index_df(
regclass, ftsquery) -> float8[], reading N, avgdl and per-term document
frequency from the bm25 index metapage and dictionary.  BM25 can now be scored
from statistics the index maintains rather than caller guesses, closing the
loop between the AM and the scorer.  (Streaming index-only WAND top-K is a
further optimization.)  Shipped as 1.6 -> 1.7.
…partial)

Update the README to reflect the nine qualified stages (versions 1.0-1.7 plus
prefix queries) and to state honestly what remains: phrase/NEAR, WAND top-K,
incremental maintenance, contentless indexes, the parity gate, and the
fuzzy/regex stages.
The bm25 access method's aminsert no longer errors: it appends the new
document verbatim to an in-index chain of pending pages and bumps the metapage
N and sum(doclen).  The scan searches pending documents directly with the
per-document matcher, so newly inserted rows are immediately visible to @@@
without a REINDEX.  Per-term df in the dictionary stays stale until a merge
(REINDEX), matching GIN fastupdate's documented behavior.  All page writes go
through GenericXLog.  Shipped as 1.7 -> 1.8 (bm25 metapage format changed;
REINDEX required for pre-1.8 bm25 indexes).
Extend the ftsdoc format to v2, storing per-term token positions, and add
quoted-phrase query syntax ("a b c"): the parser emits an FTS_OP_PHRASE chain
(distance 1), and the matcher verifies adjacency by intersecting term position
lists.  The bm25 index treats a phrase as AND for candidate generation and now
requests a bitmap recheck, so @@@ re-evaluates adjacency exactly against the
heap ftsdoc.  Position-free v1 docs remain valid (phrase degrades to AND).

NEAR(a b, k) reuses the same distance-aware phrase_step and is a small parser
addition (comma + integer) on top of this.  Shipped as 1.8 -> 1.9 (ftsdoc
format v2).
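
Adjacency checking by position-list intersection can be sketched as follows. This Python version covers only the exact-phrase case (distance 1) and uses a dict of position lists as a stand-in for the v2 ftsdoc; both helper names are hypothetical.

```python
def positions_of(tokens):
    """Build term -> ascending list of token positions."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return positions


def phrase_match(positions, phrase):
    """True when the terms of a non-empty phrase occur at consecutive positions."""
    # Candidate positions where the phrase matched so far ends.
    candidates = set(positions.get(phrase[0], ()))
    for term in phrase[1:]:
        following = set(positions.get(term, ()))
        # Keep only matches whose next position holds the next phrase term.
        candidates = {p + 1 for p in candidates} & following
        if not candidates:
            return False
    return bool(candidates)
```

A document that contains all phrase terms but not adjacently passes the AND candidate test and is rejected here, which is the role the recheck plays on the index path.
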
The bm25 index stores only postings, never document text, so an expression
index on to_ftsdoc(text_column) is exactly FTS5's external-content model: the
text lives in the base table, the index is derived from it, and @@@ queries
(including phrases, via recheck) work against the expression.  Shipped as
1.9 -> 1.10 (documentation marker; no new SQL objects).
Add two ftsquery term forms:
  term~k  matches document terms within Levenshtein distance k (default 2),
          using core varstr_levenshtein_less_equal (bounded, no new dependency)
  /re/    matches document terms against a POSIX regex via core's cached
          regex engine

Both are evaluated per-document in the matcher.  The bm25 index returns all
indexed tuples as candidates for fuzzy/regex queries and the bitmap heap
recheck applies the exact test, so results are correct through the index.

This follows the plan's 'no new dependency for the common case' path; the
pg_tre trigram-formula pre-filter (with its Lime grammar converted to
Bison+Flex, and sparsemap v5.1.1 for posting compression) to narrow candidates
at scale is future work.  Shipped as 1.10 -> 1.11.
Add bench/bench.sql and bench/README: a reproducible A/B harness comparing the
bm25 stack against tsvector + GIN + ts_rank on a user-supplied corpus (index
size, ranked top-10 EXPLAIN ANALYZE).  The full parity gate (latency
percentiles, NDCG vs qrels, concurrent-ingest throughput, Lucene/bm25s score
conformance) is documented as a manual, reported measurement rather than a
make-check regression, since it needs an external corpus.

Update README.pg_fts to describe all implemented stages (1.0-1.11), the full
query language, a worked BM25 ranking example, and the remaining future work
(WAND top-K, trigram pre-filter for fuzzy/regex, NEAR, background merge,
BM25F).
Add fts_bm25f(docs ftsdoc[], query, weights[], n_docs, avgdls[], dfs[]): the
Robertson/Zaragoza BM25F, where per-field term frequencies are length-
normalized per field and combined by weight before the tf-saturation step
(not a naive sum of per-field BM25 scores).  This lets a term in a heavily
weighted field (e.g. title) outrank the same term in the body.  Shipped as
1.11 -> 1.12.
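
The difference from summing per-field BM25 scores is easiest to see in code. The Python sketch below follows the usual Robertson/Zaragoza formulation under several assumptions: one shared b for all fields, one df per query term, defaults k1=1.2 and b=0.75, and a Lucene-style IDF. None of these are confirmed by the commit.

```python
import math


def bm25f(field_tfs, field_lens, query_terms, weights, n_docs, avgdls, dfs,
          k1=1.2, b=0.75):
    """BM25F: combine weighted, length-normalized tf per field, then saturate.

    field_tfs   one dict (term -> tf) per field
    field_lens  token count of each field in this document
    weights     per-field boost
    avgdls      per-field average length over the corpus
    """
    score = 0.0
    for term, df in zip(query_terms, dfs):
        pseudo_tf = 0.0
        for tf_map, dl, w, avgdl in zip(field_tfs, field_lens, weights, avgdls):
            tf = tf_map.get(term, 0)
            if tf:
                # Normalize within the field, then weight.
                pseudo_tf += w * tf / (1.0 - b + b * dl / avgdl)
        if pseudo_tf == 0.0:
            continue
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # A single saturation step over the combined pseudo-frequency.
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

Because saturation is applied once to the combined value, a hit in a field with weight 5 behaves like five occurrences rather than like five times the score of one occurrence.
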
Add bm25_merge_pending(): read the existing dictionary + posting chains and all
pending documents back into a build state, rewrite the merged structure into
fresh blocks, and repoint the metapage -- no heap access.  Wired into
amvacuumcleanup (VACUUM now folds pending docs automatically) and exposed as
fts_merge(regclass) for on-demand merge.  Merging resolves the df staleness
that incremental inserts introduce (formerly-pending terms gain dictionary df).

Old blocks are left unreferenced and reclaimed by REINDEX; an FSM-based page
recycler is future work.  Shipped as 1.12 -> 1.13.
Add fts_search(index, query, k) -> setof(ctid, score): BM25 top-k computed
entirely from the index -- postings supply per-doc tf, the dictionary supplies
df and a per-term max-tf impact bound (now stored), the metapage supplies N and
avgdl -- with no heap access.  Per-document scores accumulate across query
terms and the top-k are returned by descending score; join on ctid to fetch
rows.  This is the index-only-scoring path (no heap fetch to rank), the core
performance win for ranked search.

Stored per-term max_tf in the dictionary provides the WAND upper bound for
document skipping.  Full executor integration via amcanorderbyop (an ORDER BY
score LIMIT k ordering scan with block-max WAND early termination) and exact
per-document |D| in postings are the remaining optimizations.  Shipped as
1.13 -> 1.14.
Add pg_fts_trgm.c: reduce a fuzzy term to its trigrams and test only document
terms sharing a trigram with it (Levenshtein is the expensive step).  This is
the pg_tre-style pruning that makes fuzzy matching viable on a large
vocabulary, applied at the term level.  Results are unchanged: the filter only
skips candidates that provably cannot match (pigeonhole: a match within k edits
shares a trigram when the term has more than k trigrams) and falls back to a
full scan for short terms.  A persistent on-disk trigram posting index in the
bm25 AM (the full three-tier funnel) is the remaining scale work.  Shipped as
1.14 -> 1.15.
fts_search() returned candidate ctids straight from the postings, which can
reference dead or updated tuples that the index has not yet merged out.  It now
opens the base table and checks each candidate against the active snapshot via
table_index_fetch_tuple, returning only visible tuples in score order and
stopping once k visible rows are found.  This makes the SRF correct under all
isolation levels, matching the visibility contract the @@@ bitmap path already
gets from the executor's bitmap heap scan + recheck.

(The @@@ operator path was already MVCC-correct: amgetbitmap sets recheck=true
and the bitmap heap scan applies snapshot visibility.  All page I/O uses the
buffer manager, and every writer is WAL-logged via GenericXLog, so the index is
correct on physical standbys and after crash recovery.)
Posting pages now store postings as a BM25PostingPageHdr (count) plus a varint
stream of (docid-gap, tf), instead of a raw 12-byte BM25Posting array.  docids
(block*MaxHeapTuplesPerPage + offset) are sorted ascending per term so gaps are
small and pack to 1-2 bytes; tf likewise.  A single bm25_page_decode() feeds
all five posting readers (scan eval, prefix, universe, term-postings, merge),
and bm25_write_postings() encodes.  This is the index-size win needed to
compete at scale; results are byte-for-byte identical (compression is
transparent).
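
A rough Python sketch of gap-plus-varint encoding follows. The LEB128-style byte layout (7 payload bits per byte, high bit as continuation flag) is an assumption; the commit does not spell out the exact format. The value 291 for MaxHeapTuplesPerPage corresponds to PostgreSQL's default 8 kB block size and is likewise assumed here.

```python
MAX_HEAP_TUPLES_PER_PAGE = 291  # assumed: default 8 kB pages


def to_docid(block, offset):
    """Map a heap TID (block, offset) to a single ascending integer."""
    return block * MAX_HEAP_TUPLES_PER_PAGE + offset


def encode_varint(value):
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation set
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data, pos):
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def encode_postings(postings):
    """postings: list of (docid, tf) sorted by ascending docid."""
    out, prev = bytearray(), 0
    for docid, tf in postings:
        out += encode_varint(docid - prev)  # small gap instead of full docid
        out += encode_varint(tf)
        prev = docid
    return bytes(out)


def decode_postings(data):
    postings, pos, docid = [], 0, 0
    while pos < len(data):
        gap, pos = decode_varint(data, pos)
        tf, pos = decode_varint(data, pos)
        docid += gap
        postings.append((docid, tf))
    return postings
```

Three nearby postings encode to a handful of bytes in this scheme, compared with 36 bytes as raw 12-byte structs.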

Also fix a buffer-pin leak in the stage-7 pending-insert path: when the tail
pending page was full, tailbuf was unlocked but not unpinned before being
re-read as oldtail.  Surfaced by the larger compression test under cassert.
No format-version bump is noted here because the format is unreleased; REINDEX
any existing bm25 index.
fts_search now uses a proper DAAT WAND: per-term cursors over the docid-sorted
posting lists, a top-k heap with a running score threshold, and per-term
max-contribution bounds (from the stored per-term max_tf) so documents that
cannot enter the current top-k are skipped via the pivot rule rather than fully
scored.  Results are byte-for-byte identical to the previous exact accumulate
path (verified: no regression-test diffs), with the WAND pruning as the speedup.

Posting pages now also carry per-page block-max_tf and first-docid in the page
opaque, the on-disk foundation for block-level skipping in a future page-cursor
WAND.  WAND over-fetches candidates so the MVCC visibility filter still yields k
visible rows.  Full amcanorderbyop executor integration (ORDER BY score LIMIT k
driving the ordering scan directly) remains as polish on top of this.
Postings now carry |D| (document token count) as a third varint per posting,
so the index-only scorer applies exact BM25 length normalization instead of
approximating |D| = avgdl.  This is a relevance-accuracy fix: for equal tf, a
shorter document now correctly outranks a longer one (matching Lucene/bm25s),
which is what the top-k test now shows (a 1-token 'quick' doc outranks a
3-token doc).  The WAND upper bound uses the shortest-document norm
(tf + k1*(1-b)) so it never underestimates and never prunes a qualifying hit.
REINDEX required (posting format changed).
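
Why the shortest-document norm is a safe upper bound can be checked with a short Python sketch. The formulas assume the common BM25 form with a (k1 + 1) numerator and defaults k1=1.2, b=0.75; the function names are hypothetical.

```python
def term_score(idf, tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Exact per-term BM25 contribution using the posting's own |D|."""
    return idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))


def term_upper_bound(idf, max_tf, k1=1.2, b=0.75):
    """Bound for WAND: the largest stored tf with the |D| -> 0 denominator.

    The contribution grows with tf and shrinks with |D|, so no posting of the
    term can score above this value.
    """
    return idf * max_tf * (k1 + 1.0) / (max_tf + k1 * (1.0 - b))
```

Since the bound never underestimates, a document skipped because the sum of its term bounds is below the threshold could not have entered the top-k.
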
gburd added 29 commits July 4, 2026 16:28
Comment-only sweep (no behavior change): several file/function headers still
described an early 'skeleton' that has since been fully built, which is
misleading to readers:
- pg_fts_am.c header: described a build-once, REINDEX-to-refresh, non-segmented
  index with (tid,tf) posting arrays -> now describes the segmented design
  (FOR-packed 128-doc blocks, per-block impacts, trigram index, livedocs
  tombstones, pending buffer + flush + tiered merge).
- pg_fts_am_scan.c header + bm25_lookup_term/bm25_lookup_prefix: no longer a
  bitmap-only 'skeleton scanning sorted arrays'; document IOS, WAND/MaxScore,
  fuzzy/regex, VM-aware counts, and the sparse-block-index seek.
- BM25PendingItem: 'REINDEX for now' -> flush via fts_merge()/VACUUM cleanup.
- pg_fts_rank.c / pg_fts.h / pg_fts_query.c: the index DOES maintain N/avgdl/df,
  and phrase/NEAR/prefix/fuzzy/regex ARE implemented (not 'later stages').
- fts_doc_has_fuzzy: comment claimed it scans all terms; it already applies the
  length + trigram pre-filters described.
No stubs or dead code found beyond the bulkdelete/migrate issues fixed in the
prior two commits.
…int)

The BM25BlockHdr comment still claimed the intra-block payload was a varint
stream with FOR/PFOR 'a later swap' -- but bm25_for_pack/bm25_for_unpack
(frame-of-reference bit-packing of the docid-gap/tf/doclen columns) has been the
encoding for a while.  Correct the description; note only patched-FOR/PFOR for
outliers remains a possible refinement.
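
For readers unfamiliar with frame-of-reference packing, a toy Python sketch follows. It uses one arbitrary-precision integer as the bit buffer; the real bm25_for_pack/bm25_for_unpack layout is not described in this message, so the layout and the names here are assumptions.

```python
def for_pack(values):
    """Pack a non-empty column as (base, bit width, packed bits).

    Each value is stored as its offset from the column minimum, using just
    enough bits for the largest offset.
    """
    base = min(values)
    width = max((v - base).bit_length() for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - base) << (i * width)
    return base, width, packed


def for_get(base, width, packed, i):
    """Random access to element i without decoding the whole block."""
    mask = (1 << width) - 1
    return base + ((packed >> (i * width)) & mask)
```

Fixed-width slots are what make random access cheap, which is the property a per-posting decoder on the WAND hot path relies on.
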
…re 1/6)

bm25_build accumulated the entire corpus's terms + postings in one in-memory
build state and wrote a single segment -- a very large CREATE INDEX could
exhaust maintenance_work_mem / RAM (one of the faults that motivated the
segmented redesign).  Now the build callback checks MemoryContextMemAllocated
against a budget (maintenance_work_mem, floored at 32MB) between tuples and,
when exceeded, flushes the accumulated terms as an immutable segment and resets
the build context to continue within a bounded footprint.  A document's terms
are always fully accumulated before a flush, so no doc is split across segments
and per-segment ndocs/sumdoclen stay consistent.  After the scan the residual
is flushed and bm25_merge_segments compacts the segments so a fresh index is not
left fragmented.

Verified: 60k-doc build at maintenance_work_mem=1MB flushes multiple segments
and returns byte-identical results to a single-segment build (counts, ndocs,
ranked top-k all match seqscan); regression green.
bm25_merge_segments merged the WHOLE directory into one segment on every trigger
-- O(index) write amplification under steady inserts.  Replace with a Lucene
TieredMergePolicy in miniature: sort live segments by size and merge only a run
of similarly-sized segments (within BM25_MERGE_SIZE_FACTOR, >= BM25_MERGE_TIER_MIN
of them), so small flushes coalesce cheaply while large segments are rarely
rewritten.  bm25_merge_selected merges a chosen subset and rewrites the metapage
directory preserving the order of the kept segments and appending the merged
one; the driver loops until no tier qualifies and the count is within budget.
Tombstoned docs are still dropped as segments are read.

Verified: 20 flushes coalesce to 3 segments (not 21), 1500 docs preserved and
queryable; delete+re-merge keeps counts correct (1250==seqscan); regression
green, zero crashes.
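
The selection step can be sketched in Python as below. The values 4 and 4 standing in for BM25_MERGE_SIZE_FACTOR and BM25_MERGE_TIER_MIN are placeholders, and the rule "every segment in the run is within the factor of the run's smallest member" is one plausible reading of "similarly-sized", not necessarily the patch's exact test.

```python
def pick_merge_run(sizes, size_factor=4, tier_min=4):
    """Return indices of a run of similarly-sized segments to merge, or [].

    sizes        size of each live segment, in directory order
    size_factor  hypothetical stand-in for BM25_MERGE_SIZE_FACTOR
    tier_min     hypothetical stand-in for BM25_MERGE_TIER_MIN
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    for start in range(len(order)):
        smallest = sizes[order[start]]
        end = start
        # Extend the run while the next segment is within the size factor.
        while (end + 1 < len(order)
               and sizes[order[end + 1]] <= smallest * size_factor):
            end += 1
        if end - start + 1 >= tier_min:
            return [order[i] for i in range(start, end + 1)]
    return []
```

A driver would call this repeatedly, merging the returned run each time, until it returns an empty list.
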
…/6-a)

bm25_costestimate delegated entirely to genericcostestimate, which over-prices a
ranked 'ORDER BY d <=> q LIMIT k' scan (a generic full index scan) so the
planner could pick seqscan+sort instead of the index that natively honours the
ORDER BY.  Now: keep the generic selectivity/pages/rows (needed for honest row
estimates), but when path->indexorderbys is set (a <=> ordering scan), price it
as block-max WAND/MaxScore actually behaves -- a modest startup plus a small
per-tuple cost and only a fraction of a page fetch per match -- reflecting that
WAND with a pushed-down LIMIT does work sublinear in the match set.  Plain @@@
scans keep the generic estimate.

The estimate is conservative: the per-tuple cost is low but nonzero, so large
LIMITs still scale.  Verified:
with seqscan enabled the planner now chooses the Index Scan for a ranked LIMIT
query; regression green.
bm25_lookup_prefix scanned the ENTIRE dictionary chain for term* queries.  Since
dictionary entries are byte-sorted and each segment already has a sparse
per-page block index (used by bm25_dict_seek for exact lookups), the matching
terms are contiguous: seek to the page that can hold the prefix, scan forward
only while entries could still start with it, and stop at the first term that
sorts past the prefix.  No new on-disk structure -- just a smarter walk, now
sublinear in the dictionary.  Signature takes the BM25SegMeta (for
dictindexstart); bm25_eval_query caller updated.

Verified exact vs a full seqscan over a 30k-term multi-page dictionary:
term001* = 100, term1* = 10000, word* = 30000, zzz* = 0, term29999* = 1 -- all
match; regression green.  (A front-coded/FST dictionary would compress the term
bytes further but is not needed for sublinear prefix search.)
…e 2/6)

The metapage segment directory is a fixed segs[] array.  With the size-tiered
merge (feature 3) the live segment count stays far below the cap, so a chained
overflow directory -- which would complicate all 49 reader sites that iterate
meta.segs[0..nsegments] -- is unwarranted for a case the merge policy makes
unreachable.  Instead double the cap to 128 (the metapage is then ~6.2KB of the
~8KB page, still comfortable) as a wider safety margin, and make the overflow
path a clear, actionable error (index name + VACUUM/REINDEX hint) rather than a
terse message; data is never corrupted at the limit.

Verified metapage still fits and the index builds/queries correctly; regression
green.  (Chaining is documented as unnecessary given tiered merge.)
…ranted)

Evaluated patched-FOR (PFOR) intra-block encoding for the docid-gap/tf/doclen
columns.  Instrumented the FOR packer on a Zipfian 50k-doc corpus: extracting
the top ~1/16 outliers per column would shrink the column bytes by only ~7%,
which is under ~0.5% of the whole index (the columns are already narrow -- tf
and doclen are small, and docid gaps are tiny within a common term's dense
blocks).  That does not justify adding exception-handling to the hot-path
random-access decoder (bm25_for_get, used per scored posting in WAND).  Plain
FOR is kept; the block-encoding comment now records this measured decision
rather than implying PFOR is pending.
…return)

1. add_posting term-key collision (data loss): two distinct terms >= 64 bytes
   sharing their first 64 bytes hash to the same padded key; the old code, on a
   key hit that failed exact comparison, created a new BuildTerm and CLOBBERED
   the hash entry's termidx -- fragmenting a term's postings across dictionary
   entries that readers (which stop at the first match) never reach, giving
   wrong df/counts.  Now BuildTerms chain per key (BuildTerm.next) and
   add_posting walks the chain for the truly-equal term.  Verified: three
   70-char terms sharing a 64-byte prefix each return the correct count.

2. bm25_canreturn returned true unconditionally, but the index is NOT covering.
   The planner then chose an index-only scan for SELECT <indexed ftsdoc column>
   and returned the placeholder all-NULL tuple -- i.e. NULLs instead of the real
   value (verified: 'SELECT d WHERE d @@@ q' gave d IS NULL).  Return false so
   IOS is never used to fetch a real attribute.  count(*) now uses a bitmap/
   plain index scan (correct; the fast transparent-count IOS is given up because
   the @@@ restriction column is in the IOS coverage check); the explicit
   visibility-map-aware fts_count() remains the fast count.  Corrected the
   now-inaccurate IOS comments; bm25_set_itup kept as a guarded no-op.

Full regression green; zero crashes/leaks.
…t accuracy)

MED:
- ftsdoc_recv/ftsquery_recv: bound the wire-supplied element count against the
  remaining message length before palloc, rejecting hostile/corrupt binary
  input (overflow/OOM at a trust boundary).
- sm_create NULL-checks at all three call sites (libc malloc can fail).
- Document that ranked (WAND) results cover merged segments only; pending docs
  are @@@/fts_count-visible and become ranked after a flush (fts_merge/VACUUM).

LOW (dead code / stale comments now matching the implementation):
- Remove unused page-opaque fields (block_max_tf/first_docid_hi/lo -- block-max
  lives in BM25BlockHdr) and the dead BM25PostingPageHdr struct.
- Fix bm25_write_postings comment (FOR blocks, not delta+varint pages).
- trgm_index.c header: trigram -> TERM-ORDINAL sets (not docids); reconcile the
  contradictory "popular trigrams skipped" comments (nothing is skipped).
- pg_fts.h / pg_fts.control: drop "stage 1, no index AM yet" (the bm25 AM,
  ranking, segments exist).
- pg_fts_aux.c: drop the fts_score_explain() claim (never implemented).
- pg_fts_match.c: matcher does phrase/NEAR/prefix/fuzzy/regex, not just boolean.
- ftsquery_out: it is a display rendering, not a guaranteed parser round-trip.
- fts_regex_trigrams / trgm.c: clarify union-then-recheck (not "ANDed").

Full regression green; zero crashes/leaks.
… features

The README still described the pre-segmented design (delta+varint postings, a
version list ending at 1.15, 'benchmark belongs alongside').  Rewrite to
document: the segmented storage architecture (dictionary+block-index,
FOR-packed 128-doc blocks with impact bounds, trigram index, livedocs
tombstones, pending buffer + flush + size-tiered merge + VACUUM tombstoning);
the query-execution paths (bitmap @@@, WAND/MaxScore <=> ordering scan with
lazy column decode, fts_count); versions through 1.19; and an honest
limitations section (resumable cursor, impact-ordered postings; PFOR and chained
directory evaluated and declined).  Points at bench/RESULTS_*.md for parity.
Replace "next-generation" and similar phrasing in the README title/intro,
pg_fts.h header, PGFILEDESC, control-file comment, and two code comments with
neutral, factual descriptions.  Drop the reference to the out-of-tree plan
document and the "headline" benchmark summary (the results files carry the
numbers).
The static library for vendor/sm.c passed -Wno-declaration-after-statement
unconditionally; MSVC's cl rejects the GCC-style spelling with
'D8021: invalid numeric argument', failing the Windows CI build.  Filter the
flag through cc.get_supported_arguments() so it is used on GCC/Clang and
dropped on MSVC (which accepts C99 mixed declarations natively).
The Windows/Visual Studio CI job failed compiling the vendored sparsemap: cl
rejects the GCC-style -Wno-declaration-after-statement flag, and the source
uses GCC extensions MSVC lacks (__attribute__((aligned/format/always_inline/
hot)) and the POSIX ssize_t).

Refresh vendor/sm.[ch] from upstream sparsemap and add vendor/sm_compat.h, a
portability shim included first by both: on _MSC_VER it neutralizes
__attribute__(...) (every use is an alignment/diagnostic/inlining hint that does
not change layout at the natural alignment of the members on x64) and typedefs
ssize_t from SSIZE_T (<BaseTsd.h>).  Normalize the fallthrough markers to the
project's all-caps /* FALLTHROUGH */ style, and gate both the fallthrough and
declaration-after-statement warning suppressions through
cc.get_supported_arguments (meson) / a target-scoped CFLAGS override (make) so
they apply on GCC/Clang and are dropped on MSVC.  The only pg_fts-specific delta
to the vendored code remains the SPARSEMAP_PREFIX=__pg_bm25_ namespacing block.

Local build is warning-clean under -Wimplicit-fallthrough=5; qualify PASS.
…MSVC

Two failures from the CI clang/MSVC builds not seen under local gcc:

- CompilerWarnings (clang -Werror): bm25_varint_encode/bm25_varint_decode and
  the BM25_MAX_POSTING_BYTES macro were dead once the FOR block codec replaced
  the varint posting encoding.  Remove them.

- Windows/MSVC: sparsemap's SM_LIKELY/SM_UNLIKELY expanded to __builtin_expect
  unconditionally (its comment already claimed they were no-ops elsewhere).
  Guard them behind __GNUC__/__clang__ like the other SM_* intrinsics, falling
  back to the bare condition on MSVC.

qualify PASS; warning-clean under -Wimplicit-fallthrough=5.
Linux Meson (32-bit) CI crashed in the tombstone VACUUM path (delete-all then
reuse).  In sparsemap's __sm_map_unset(), the size_t byte-offset variable
'offset' was overloaded with SM_IDX_MAX (== UINT64_MAX) as a 'gate coalescing
off' sentinel.  On ILP32 targets size_t is 32-bit, so the store truncated to
0xFFFFFFFF while the gate test 'offset != SM_IDX_MAX' promoted it back and
compared against 0xFFFFFFFFFFFFFFFF -- never equal -- so coalescing ran on an
uninitialized/invalidated chunk and crashed (the -Woverflow warnings at
sm.c:1607/3636/... were the tell).

Introduce a size_t-width sentinel SM_UNSET_NO_COALESCE ((size_t)-1) for the
byte-offset gate and use it at the three no-op/invalidated-pointer sites and the
coalesce test.  Fixed upstream in ~/ws/sparsemap and re-vendored; upstream test
suite (test_main, test_coverage 7801 expectations, portability, RLE, prefix,
large-index) all pass.  qualify PASS; regression green.
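
The truncation can be modelled in a few lines of Python by masking to the width of size_t. This only simulates the C integer conversions described above; `store_in_size_t` is a helper invented for the illustration.

```python
UINT64_MAX = (1 << 64) - 1


def store_in_size_t(value, bits):
    """Model assigning an integer to a size_t that is `bits` wide."""
    return value & ((1 << bits) - 1)


# Old sentinel: SM_IDX_MAX (== UINT64_MAX) stored into a 32-bit size_t.
old_offset_ilp32 = store_in_size_t(UINT64_MAX, 32)
# The gate test compares the promoted 32-bit value with the 64-bit constant,
# so on ILP32 it always reports "coalesce" even when the sentinel was stored.
old_gate_says_coalesce = old_offset_ilp32 != UINT64_MAX

# New sentinel: (size_t)-1, which has the same width as the variable.
new_sentinel_ilp32 = store_in_size_t(-1, 32)
new_offset_ilp32 = store_in_size_t(new_sentinel_ilp32, 32)
new_gate_says_coalesce = new_offset_ilp32 != new_sentinel_ilp32

# On LP64 the old sentinel round-trips, which is why 64-bit builds were fine.
old_gate_says_coalesce_lp64 = store_in_size_t(UINT64_MAX, 64) != UINT64_MAX
```

The old gate is wrongly open only on the 32-bit model, matching the platform-specific crash.
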
…t fix)

The previous fix missed the first gate-off site in __sm_map_unset (the
'no chunks in the map' branch still assigned SM_IDX_MAX to the size_t offset),
so the 32-bit tombstone VACUUM crash persisted.  Route that site through
SM_UNSET_NO_COALESCE too, and change __sm_chunk_rank's rank->rem 'infinity'
sentinel from UINT64_MAX to SIZE_MAX (rank->rem is size_t; the value is only
ever used as a large remaining-count, but the constant must not truncate).

Verified on genuine i686 (gcc -m32 on Fedora): the delete-all-then-reuse
coalescing reproducer passes and the build is free of -Woverflow/-Wuninitialized
in sm.c.  64-bit upstream suite still green (test_main 44/44, coverage 7801,
portability, RLE); pg_fts regression green.
Real-corpus testing exposed a correctness bug: after a document is deleted and
VACUUMed (tombstoned in its segment), if its heap slot is reused by a NEW
inserted document (same TID/docid), the new document was not found by @@@ /
fts_count.  Two causes, both fixed:

1. Pending docs were tombstone-filtered.  A doc in the pending write buffer is a
   live heap tuple by definition, but bm25_collect_matches OR'd pending matches
   into the accumulator BEFORE the tombstone filter, so a pending doc reusing a
   tombstoned TID was dropped.  Collect pending matches separately and union
   them AFTER filtering; pending docs are never tombstoned.

2. Tombstones were applied globally across segments.  bm25_docid_tombstoned
   checked a docid against EVERY segment's tombstone map, but a tombstone
   belongs to exactly one segment: a TID deleted in segment A may be legitimately
   reused by a live doc that a newer segment B indexes.  Filter each segment's
   own match contribution against only THAT segment's tombstone map
   (bm25_filter_tombstoned_seg), and in the WAND ranked path skip own-segment
   tombstoned docids inside each cursor (cursors are per-segment).  Removed the
   now-unused global bm25_filter_tombstoned / bm25_docid_tombstoned.

The regress tombstone test previously encoded the buggy 0 as expected output;
corrected to 60 and added an fts_count assertion so the count path is covered
too.  qualify PASS; regression green.
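
Both fixes amount to where the tombstone filter is applied, which the following Python sketch contrasts using plain sets. The segment dictionaries and function names are hypothetical stand-ins for the per-segment match and tombstone structures.

```python
def matches_global_filter(segments, pending):
    """Old behaviour: union everything, then filter by every tombstone map."""
    matched = set(pending)
    tombstones = set()
    for seg in segments:
        matched |= seg["matches"]
        tombstones |= seg["tombstones"]
    return matched - tombstones


def matches_per_segment_filter(segments, pending):
    """Fixed behaviour: each segment is filtered by its own tombstones only,
    and pending matches are added afterwards, unfiltered."""
    matched = set()
    for seg in segments:
        matched |= seg["matches"] - seg["tombstones"]
    return matched | set(pending)


# docid 7 was deleted in segment A, then its heap slot was reused by a new
# document that segment B indexes; docid 9 is a pending doc reusing a slot
# that segment A also tombstoned.
SEG_A = {"matches": {7, 9}, "tombstones": {7, 9}}
SEG_B = {"matches": {7}, "tombstones": set()}
```

With these inputs the global filter returns nothing, while the per-segment filter returns both live documents.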
A document whose analyzed ftsdoc did not fit on a single pending page raised
'ftsdoc too large for a bm25 pending page' and failed the INSERT/UPDATE -- real
corpora (e.g. long Wikipedia articles) hit this routinely, so the index could
not be built over them.

When an ftsdoc exceeds the pending-page capacity, index it directly as its own
one-document segment (bm25_insert_oversized_as_segment) via the existing build
machinery: segment posting storage is a chain of FOR-packed pages with no
per-document size limit.  Such documents are rare, so building a small segment
per oversized insert is acceptable; corpus N/sumdoclen stay correct
(bm25_meta_add_segment accounts the doc, and the pending path is bypassed so
there is no double count).  Added a regression test with a ~4000-token document.

qualify PASS; regression green.
Vendor sparsemap v5.2.0, which brings the MSVC portability shim (SM_ALIGNED /
ssize_t / guarded __builtin_*) and the O(N) coalescing performance fix upstream.
This replaces the local vendor/sm_compat.h shim (now redundant) and my earlier
one-off MSVC edits; sm.h is byte-identical to pristine v5.2.0 and sm.c differs
only by the __pg_bm25_ SPARSEMAP_PREFIX namespacing block plus the fix below.

v5.2.0 still has the 32-bit unset-coalesce truncation bug (its own change did
not touch the gate): in __sm_map_unset the size_t byte-offset 'offset' is set to
SM_IDX_MAX (== UINT64_MAX) as a 'gate coalescing off' sentinel, which truncates
to 0xFFFFFFFF on ILP32 so the '!= SM_IDX_MAX' gate (which promotes offset back
to 64 bits) never matches -> coalescing runs on an uninitialized chunk and
crashes.  Route the four gate-off sites and the gate test through a size_t-width
sentinel SM_UNSET_NO_COALESCE.  (Fixed in ~/ws/sparsemap for upstreaming;
vendored here.)

Consumer-side alignment fix: struct sparsemap is declared SM_ALIGNED(8), but
palloc only guarantees MAXALIGN (4 on ILP32), so palloc0(n*sizeof(sm_t)) placed
the per-segment tombstone maps at 4-aligned addresses -> -fsanitize=alignment
abort in the Linux Meson (32-bit) CI job.  Allocate that array with
palloc_aligned(..., 8, 0).  Verified on genuine i686 under
-m32 -fsanitize=undefined,alignment: the delete+coalesce+reopen path runs clean
(live=60) with an 8-aligned sm_t.  qualify PASS; 64-bit regression green;
upstream suite (test_main 44/44, coverage 7805, portability, RLE) green.
Indexing an oversized document creates a one-document segment
(bm25_insert_oversized_as_segment).  A bulk INSERT/UPDATE over a corpus with
many large documents (e.g. rebuilding the expression index over 2M Wikipedia
rows) could therefore create one segment per oversized row and hit the hard
BM25_MAX_SEGMENTS (128) cap before the next VACUUM merged anything -- 'bm25
index reached the maximum of 128 segments'.

After flushing an oversized segment, if the live segment count is within 16 of
the cap, run bm25_merge_segments() to coalesce.  Verified: 200 oversized
documents now index without overflow (merged to well under the cap) and all
remain searchable.  qualify PASS; regression green.
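
A rough simulation of the trigger described above, as a sketch rather than the
extension's code.  It assumes a merge coalesces all live segments into one (the
real bm25_merge_segments() policy may differ); should_merge and
index_oversized_docs are hypothetical names.

```python
BM25_MAX_SEGMENTS = 128  # hard cap quoted in the commit message
MERGE_HEADROOM = 16      # "within 16 of the cap"


def should_merge(live_segments):
    """True once the live segment count is within MERGE_HEADROOM of the cap."""
    return live_segments >= BM25_MAX_SEGMENTS - MERGE_HEADROOM


def index_oversized_docs(n_docs, merge_enabled=True):
    """Simulate n_docs oversized inserts; return (final_count, peak_count)."""
    live = 0
    peak = 0
    for _ in range(n_docs):
        live += 1  # each oversized document is flushed as its own segment
        if live > BM25_MAX_SEGMENTS:
            raise RuntimeError("bm25 index reached the maximum of 128 segments")
        peak = max(peak, live)
        if merge_enabled and should_merge(live):
            live = 1  # assumption: the merge coalesces everything into one segment
    return live, peak
```

Under these assumptions 200 oversized inserts peak at 112 live segments and
never reach the cap, whereas the same run with the trigger disabled fails on the
129th insert.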
Head-to-head on 2,000,000 real English-Wikipedia articles (PG20devel,
r7i.8xlarge).  Match sets verified identical to the GIN path before timing.

Headline (median of 9, warm): ranked retrieval -- the actual FTS use case --
is where pg_fts wins, and the win grows with term frequency because GIN's
ts_rank must fetch+score+sort every match while pg_fts stops early via block-max
WAND: ranked top-10 common&mid 15 ms vs 65 ms (4.2x), top-100 common 75 ms vs
3029 ms (40x), top-10 two-common 50 ms vs 1641 ms (33x).  Plain counts / boolean
AND tie (both bitmap-scan); fts_count beats count(*) 3.7x.  Cost: ~2.5x larger
index (per-posting tf/|D|/positions), comparable single-threaded build.

Adds bench/get_wikipedia.py (HF parquet -> TSV loader) and bench/bench_fixed.sh
(pinned-term median-of-9 A/B runner).
…ch bench

sparsemap v5.2.1 upstreams the ILP32 gate fix I carried locally in the vendored
copy (SM_UNSET_NO_COALESCE, the size_t-width coalesce sentinel) and adds an
independent bug fix -- a multi-run RLE-chunk corruption in __sm_separate_rle_chunk
(wrong payload-vector placement and pivot-size accounting that desync'd the
sequential chunk walk).  Re-vendor v5.2.1: sm.h is byte-identical to pristine
v5.2.1 and sm.c now differs ONLY by the __pg_bm25_ SPARSEMAP_PREFIX block (no
local modifications remain).  Verified on i686 (gcc -m32 -fsanitize=undefined,
alignment): the delete+coalesce+reopen path is clean; 64-bit upstream suite
green (test_main 44/44, coverage 7820).  qualify PASS; regression green.

Also add bench/RESULTS_VS_PGSEARCH_WIKI.md: an honest pg_search 0.24.1 head-to-
head on 2M real Wikipedia articles (PG17.10).  pg_search wins ranked retrieval
(flat ~9ms; pg_fts's docid-ordered WAND degrades with term frequency, 26->70ms)
and common-term counts (aggregate pushdown); pg_fts's fts_count wins selective
counts (1.9-2.4ms) and its index is 1.55x smaller (3590 vs 5574 MB).  The gaps
are the documented codec investments (impact-ordered postings, COUNT/aggregate
Custom Scan pushdown) plus parallel scan -- architecture, not tuning.
…AIO)

Code-cited capability matrix and answers to adopter questions:

- Concurrent builds: CIC/REINDEX CONCURRENTLY work; aminsert routes concurrent
  writes to the searchable pending list.
- Index-only scans: no, non-covering by design; fts_count is the fast-count
  path.
- Compaction: VACUUM+fts_merge drops tombstones logically, REINDEX reclaims
  physical space; there is no online REPACK.
- Feature parity vs pg_search/Tantivy, ZomboDB/ES and tsvector/GIN: has
  BM25/BM25F, phrase/NEAR/prefix/fuzzy/regex, <=> ordering scan, fts_count,
  highlight/snippet, tombstone deletes, WAL/GenericXLog safety.  Gaps: no
  parallel build/scan, no IOS, no aggregation pushdown, no impact-ordered
  postings.
- Logical-replication drop-in: no, it is a re-platform (different API, indexes
  provisioned per-subscriber).  Physical replication + crash recovery ARE safe.
- AIO: none of its own.  The build heap scan gets core read_stream for free;
  nextblk pointer-chains defeat prefetch and WAND is anti-prefetch by design;
  only the cold merge full-scan could benefit, deferred until measured.
Closes the ranked-retrieval gap vs Tantivy/pg_search, whose latency stays flat
as term frequency grows while pg_fts's docid-ordered block-max WAND degraded
(top-100 over a common term walked the whole ~5300-block posting list).

Format v3 adds a per-term impact-ordered block skip directory: for a term with
df >= BM25_SKIPDIR_MIN (2048), the writer records every posting block's
(blk, off, max_tf, min_doclen) and stores them sorted DESCENDING by an
avgdl-independent impact proxy (max_tf desc, min_doclen asc) in a BM25_SKIPDIR
page chain, referenced from three new BM25DictEntry fields (skipstart/skipoff/
nblocks).  The single-term ranked scan (fts_search_impact_single, dispatched
from fts_search_wand when nterms==1 and a directory exists) visits blocks
best-first and stops once k results beat the recomputed bound of the next
block -- Tantivy-style early termination.

Soundness (where the prior float-impact attempt failed on avgdl drift): only
raw integers are stored; the bound is RECOMPUTED at query time from
(max_tf, min_doclen) at the current avgdl, so it is always an exact upper bound
as the corpus grows.  The stored sort order is a visitation heuristic only --
exactness comes from the recomputed bound + the WAND stop condition.  Verified:
the index top-k set is identical to a brute-force seqscan+sort (regression test
with a >2048-doc term and distinct top scores).  Rare terms carry no directory
and use the existing exact docid-ordered scan; v2 indexes are read with that
scan too (version-gated), so no forced rewrite -- REINDEX gains the directory.
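
A small sketch of why recomputing the bound from raw integers stays sound under
avgdl drift.  It uses the textbook single-term BM25 formula with k1 = 1.2 and
b = 0.75, which are assumed defaults rather than values taken from pg_fts; the
function names and the sample block are hypothetical.

```python
K1, B = 1.2, 0.75  # conventional BM25 defaults; assumed, not taken from pg_fts


def bm25_term_score(tf, doclen, idf, avgdl, k1=K1, b=B):
    """Single-term BM25 contribution of one posting."""
    norm = k1 * (1.0 - b + b * doclen / avgdl)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def block_bound(max_tf, min_doclen, idf, avgdl):
    """Upper bound for a block, recomputed from the stored raw integers.

    The score rises with tf and falls with doclen, so the block's largest tf
    combined with its smallest doclen bounds every posting in the block.
    """
    return bm25_term_score(max_tf, min_doclen, idf, avgdl)


# Hypothetical posting block: one (tf, doclen) pair per document.
POSTINGS = [(5, 400), (2, 100), (1, 120), (3, 900)]
MAX_TF = max(tf for tf, _ in POSTINGS)
MIN_DOCLEN = min(dl for _, dl in POSTINGS)
IDF = 1.0


def best_score(avgdl):
    """True best score in the block at the given avgdl (brute force)."""
    return max(bm25_term_score(tf, dl, IDF, avgdl) for tf, dl in POSTINGS)


# A float impact frozen at build time (avgdl = 100) goes stale as avgdl grows.
stale_float_bound = best_score(avgdl=100.0)
```

In this example the recomputed block_bound() dominates best_score() at every
avgdl tried, while the frozen float is exceeded once avgdl has grown to 1000,
which is presumably the kind of drift the earlier float-impact attempt ran into.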

Format BM25_VERSION 2->3; extension 1.19->1.20 (migration is a marker, no SQL
surface change).  qualify PASS; regression green.
The skip directory is stored sorted by a max_tf proxy, but the true per-block
impact bound also depends on min_doclen (impact rises as min_doclen falls), so a
block with a smaller max_tf can have a higher bound than an earlier entry.
Stopping at the current entry's bound was therefore unsound -- it could halt
before a later, higher-bound block, returning a wrong/incomplete top-k (observed
at 2M scale: index top-k distances non-ascending and missing lower-distance
docs vs a seqscan).

Fix: after loading a term's directory, sort it by the EXACT impact bound
recomputed at the current avgdl (descending) before the scan.  Then the
front-of-remaining early-stop is a valid ceiling on every not-yet-visited block
and the returned top-k is exact.  Verified with distinct scores at 8000 docs:
index top-15 byte-identical to a brute-force seqscan+sort.  (Insertion sort;
near-O(n) since the stored max_tf order is already close to bound order.)
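
A toy reproduction of the failure and the fix, as a sketch under assumptions:
textbook BM25 with k1 = 1.2 and b = 0.75 (not confirmed pg_fts values), three
hypothetical blocks, and a hypothetical scan() that stops as soon as the current
k-th score reaches the next block's bound.

```python
import heapq

K1, B = 1.2, 0.75  # conventional BM25 defaults; assumed
IDF, AVGDL = 1.0, 300.0


def score(tf, doclen, idf=IDF, avgdl=AVGDL):
    norm = K1 * (1.0 - B + B * doclen / avgdl)
    return idf * tf * (K1 + 1.0) / (tf + norm)


def make_block(postings):
    """postings: list of (docid, tf, doclen); records the block's raw summary."""
    return {"max_tf": max(tf for _, tf, _ in postings),
            "min_doclen": min(dl for _, _, dl in postings),
            "postings": postings}


def bound(block):
    return score(block["max_tf"], block["min_doclen"])


def scan(ordered_blocks, k):
    """Visit blocks in the given order; stop when the k-th score >= next bound."""
    heap = []  # min-heap of (score, docid)
    for blk in ordered_blocks:
        if len(heap) == k and heap[0][0] >= bound(blk):
            break
        for docid, tf, doclen in blk["postings"]:
            item = (score(tf, doclen), docid)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


BLOCKS = [
    make_block([(1, 6, 2000), (2, 1, 2500)]),  # highest max_tf, long documents
    make_block([(3, 5, 4000), (4, 2, 4200)]),
    make_block([(5, 4, 50), (6, 1, 60)]),      # lower max_tf, very short documents
]

# Stored order: the (max_tf desc, min_doclen asc) proxy.
stored_order = sorted(BLOCKS, key=lambda b: (-b["max_tf"], b["min_doclen"]))
# Fixed order: the exact bound at the current avgdl, descending.
exact_order = sorted(BLOCKS, key=bound, reverse=True)
brute = sorted(((score(tf, dl), d) for b in BLOCKS for d, tf, dl in b["postings"]),
               reverse=True)
```

Scanning in stored order halts after the first block and returns document 1,
although document 5 in the last block scores higher; scanning in exact-bound
order returns the same top-k as the brute-force list.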
The k*4 (min 64) over-fetch forced the impact-directory single-term scan to
fetch far more than the LIMIT (a LIMIT 10 fetched 64), defeating its early
stop.  k*2 (min 32) still tolerates ~50% invisible top rows before the SRF
under-fills, and the amgettuple ordering scan grows k via its own retry, so
correctness holds while the early-stop can fire.  qualify PASS; regression green.
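
The over-fetch arithmetic can be sketched as follows; overfetch and
tolerated_invisible_fraction are hypothetical helper names, not functions from
the extension.

```python
def overfetch(k, factor, floor):
    """Number of candidates fetched for a LIMIT of k."""
    return max(k * factor, floor)


def tolerated_invisible_fraction(k, factor, floor):
    """Share of fetched rows that may be invisible before fewer than k remain."""
    fetched = overfetch(k, factor, floor)
    return (fetched - k) / fetched


old_fetch = overfetch(10, factor=4, floor=64)  # previous k*4 (min 64) policy
new_fetch = overfetch(10, factor=2, floor=32)  # k*2 (min 32) policy
```

For LIMIT 10 the fetch drops from 64 to 32 candidates, and once k is large
enough for the k*2 term to dominate, exactly half of the fetched rows can be
invisible before the result under-fills.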
… real text)

The v3 impact-ordered block skip directory (commits f049f3a/ef637ba/68ce28f)
was correct and avgdl-drift-safe, but instrumented block-visit counts on 2M
real Wikipedia show it delivers NO early termination: within a term the
per-block impact bounds cluster in a razor-thin band just above the top-k
threshold (constant idf; a common term has some high-tf doc in nearly every
block), so best-first block ordering still visits ~99% of blocks before it can
stop.  For 'year' (df 678k, 5296 blocks) the scan visited 5282; for 'hungary'
170/173.  No measured latency win on any query band, at the cost of format v3,
a skip-page chain, ~3% larger index and a per-query sort.
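
Why clustered bounds defeat early termination can be sketched with a lower bound
on the work: a best-first scan cannot stop before it has visited every block
whose bound exceeds the final k-th score.  The helper name and the two synthetic
bound lists are hypothetical; only the visit counts come from the measurements
above.

```python
def min_blocks_visited(block_bounds, kth_score):
    """Blocks whose bound exceeds the final k-th score must all be visited."""
    return sum(1 for b in block_bounds if b > kth_score)


# Synthetic: 1000 bounds in a thin band just above the threshold.
clustered = [5.0 + 0.0001 * i for i in range(1000)]
# Synthetic: 1000 bounds that fall off quickly.
spread = [10.0 / (i + 1) for i in range(1000)]

THRESHOLD = 4.99
clustered_visits = min_blocks_visited(clustered, THRESHOLD)
spread_visits = min_blocks_visited(spread, THRESHOLD)

# Measured visit fractions quoted above.
year_fraction = 5282 / 5296
hungary_fraction = 170 / 173
```

With the thin band every block must be visited, while the fast-decaying bounds
allow a stop after two blocks; the measured Wikipedia terms look like the former
case.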

Reverting to format v2 / extension 1.19.  pg_search's flat ranked latency comes
from a compact columnar segment codec (far less decode per candidate) plus
query parallelism, not an impact skip structure -- matching it is a codec
rewrite, out of scope here.  bench/NOTE_IMPACT_ORDERING.md records the attempt,
the measurements, and the conclusion so it is not re-tried blindly.

qualify PASS; regression green; index format unchanged (v2).