pg_fts: a bm25 full-text search engine by gburd · Pull Request #27 · gburd/postgres

gburd · 2026-07-03T13:07:36Z


This is the first stage of a native full-text search subsystem intended to provide true BM25/BM25F relevance ranking with index-only scoring and a richer query language, addressing long-standing limitations of the tsvector/tsquery + GIN stack: no corpus statistics (N, avgdl, df) are stored anywhere, ts_rank is cover-density rather than BM25, and GIN posting lists carry only TIDs so ranked queries must always recheck the heap.

Rather than land that as one large patch, the work is structured as a reviewable series (see FTS_NEXTGEN_PLAN.md). This first commit introduces only the SQL surface, evaluated by sequential scan, with no index access method -- the same way tsvector/tsquery were originally introduced.

Adds two types:

- ftsdoc: an analyzed document (sorted, de-duplicated terms with term frequencies, plus the document length that BM25 will need)
- ftsquery: a parsed boolean query (AND/OR/NOT and grouping)

Both are varlena and TOAST-able with version-tagged binary send/recv formats. A hand-written recursive-descent parser produces the query; the grammar is small enough not to warrant a generator. Matching is a boolean stack machine over the postfix item list, mirroring TS_execute.

The stage-1 tokenizer is deliberately minimal (ASCII case-fold, split on non-alphanumerics). It is isolated behind fts_analyze_text() so that a later stage can reuse PostgreSQL's existing text-search parser and dictionary pipeline (snowball, ispell, synonyms, thesaurus, stopwords) without changing the types, the operator, or the on-disk format.

Includes a regression test exercising analysis, query parsing and canonical output, all boolean match cases, sequential-scan use in a WHERE clause, and error handling for malformed queries.
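To make the stage-1 behavior concrete, here is a minimal Python sketch of the tokenizer and the postfix stack matcher described above. It is an illustration only, not the C code: the function names `analyze` and `matches` and the operator spellings in the postfix list are assumptions made for this example.

```python
import re
from collections import Counter


def analyze(text):
    """Split on non-alphanumerics and fold ASCII case.

    Returns (sorted list of (term, tf) pairs, document length in tokens),
    which is the information the message says an ftsdoc carries.
    """
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    counts = Counter(tokens)
    return sorted(counts.items()), len(tokens)


def matches(doc_terms, postfix):
    """Evaluate a postfix boolean query against one analyzed document.

    Operators are the upper-case strings AND/OR/NOT (an assumption of this
    sketch); every other item is treated as a lower-case term.
    """
    present = {term for term, _tf in doc_terms}
    stack = []
    for item in postfix:
        if item == "NOT":
            stack.append(not stack.pop())
        elif item in ("AND", "OR"):
            right, left = stack.pop(), stack.pop()
            stack.append((left and right) if item == "AND" else (left or right))
        else:
            stack.append(item in present)
    return stack.pop()
```

For example, the query `quick AND fox AND NOT dog` would be passed as the postfix list `["quick", "fox", "AND", "dog", "NOT", "AND"]`.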

Add to_ftsdoc(regconfig, text), which runs an installed text search configuration's parser and dictionary chain via parsetext() and folds the normalized lexemes into an ftsdoc. This reuses the existing snowball/ispell/synonym/thesaurus/stopword pipeline rather than reimplementing tokenization. Shipped as extension upgrade 1.0 -> 1.1.

Add fts_bm25(doc, query, n_docs, avgdl, dfs), computing the Okapi BM25 score with Lucene-style IDF and standard k1/b defaults. Corpus statistics are caller-supplied for now (the bm25 index AM will maintain them later), which is enough to validate the scoring math by sequential scan. Shipped as 1.1 -> 1.2.
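The scoring math can be sketched in a few lines of Python. This is a textbook Okapi BM25 with the Lucene-style IDF `ln(1 + (N - df + 0.5) / (df + 0.5))` and the common defaults k1 = 1.2, b = 0.75. The argument layout loosely mirrors the SQL signature, but the function name, the dict-based document, and the inclusion of the `(k1 + 1)` numerator factor are assumptions of this sketch, not confirmed details of fts_bm25().

```python
import math


def bm25(doc_tf, doc_len, query_terms, n_docs, avgdl, dfs, k1=1.2, b=0.75):
    """Okapi BM25 for one document.

    doc_tf      -- dict mapping term -> term frequency in this document
    doc_len     -- document length in tokens
    query_terms -- list of query terms
    dfs         -- document frequency for each query term, same order
    """
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    score = 0.0
    for term, df in zip(query_terms, dfs):
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score
```

With equal tf, a shorter document gets a smaller `norm` and therefore a higher score, and a rarer term (smaller df) gets a larger IDF.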

Add a real index access method (USING bm25) over an ftsdoc column that answers the @@@ operator via a bitmap scan. The build scans the heap, collects per-term postings (tid, tf), and writes a metapage (N, sum(doclen), nterms), a sorted dictionary, and chained posting pages -- all through GenericXLog, so the index is crash-safe and replicated without a custom resource manager. The scan evaluates the boolean ftsquery by set algebra over posting lists (AND=intersect, OR=union, NOT=complement against the indexed universe), matching @@@ semantics exactly with no heap access. Corpus statistics are maintained for the coming index-only BM25 scoring. The skeleton is build-once: aminsert raises an error directing REINDEX, and incremental maintenance (pending list + background merge) is a later stage. Shipped as extension upgrade 1.2 -> 1.3.
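The set-algebra evaluation described above can be illustrated with Python sets standing in for posting lists. This is a sketch under simplifying assumptions: postings live in a dict keyed by term, TIDs are `(block, offset)` tuples, and `eval_query` is a hypothetical name, not a function from the patch.

```python
def eval_query(postfix, postings, universe):
    """Evaluate a postfix boolean query to a set of TIDs.

    postings -- dict mapping term -> list of (tid, tf)
    universe -- set of every indexed TID, needed so NOT can complement
    """
    stack = []
    for item in postfix:
        if item == "NOT":
            stack.append(universe - stack.pop())      # complement
        elif item in ("AND", "OR"):
            right, left = stack.pop(), stack.pop()
            stack.append(left & right if item == "AND" else left | right)
        else:
            stack.append({tid for tid, _tf in postings.get(item, ())})
    return stack.pop()
```

NOT is the reason the scan needs the indexed universe: without it there is nothing to complement against.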

Add fts_bm25_opts(doc, query, n_docs, avgdl, k1, b, variant, dfs) supporting lucene, robertson (classic), atire, and bm25+ IDF/scoring variants with explicit k1/b, for reproducing reference implementations (Lucene/bm25s) in conformance tests. Shipped as 1.3 -> 1.4.

Add fts_highlight(text, query, pre, post) and fts_snippet(text, query, pre, post, ellipsis, max_tokens), giving FTS5-parity result presentation. Both tokenize the source with the same folding as the analyzer and mark query-term matches; snippet slides a token window and returns the densest match region. Shipped as 1.4 -> 1.5.
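The sliding-window idea behind the snippet function can be sketched as follows. The parameter names follow the SQL signature, but the body is an assumption-laden illustration: it re-joins tokens with single spaces (dropping the original punctuation) and breaks ties by taking the earliest densest window, neither of which is confirmed by the source.

```python
import re


def snippet(text, query_terms, pre="<b>", post="</b>", ellipsis="...", max_tokens=8):
    """Return the max_tokens-wide window containing the most query-term hits."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    wanted = {t.lower() for t in query_terms}
    hits = [1 if t.lower() in wanted else 0 for t in tokens]

    width = min(max_tokens, len(tokens))
    best_start, best = 0, sum(hits[:width])
    current = best
    # Slide the window one token at a time, updating the hit count in O(1).
    for start in range(1, len(tokens) - width + 1):
        current += hits[start + width - 1] - hits[start - 1]
        if current > best:
            best_start, best = start, current

    window = tokens[best_start:best_start + width]
    out = " ".join(pre + t + post if t.lower() in wanted else t for t in window)
    if best_start > 0:
        out = ellipsis + out
    if best_start + width < len(tokens):
        out = out + ellipsis
    return out
```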

Add tsquery_to_ftsquery() and an ASSIGNMENT cast so existing tsquery values and queries port to the @@@ operator with minimal churn: &/|/! map to AND/OR/NOT. The phrase operator <-> degrades to AND with a NOTICE (phrase support is a later stage), preserving recall. Shipped as 1.5 -> 1.6.

Add prefix matching to the query language: a trailing '*' on a term (e.g. quick*) matches any document term with that prefix. Implemented in the parser (a per-item FTS_QF_PREFIX flag, carried through send/recv), the sequential matcher (binary-search lower bound on the sorted term set), and the bm25 index scan (union the posting lists of all dictionary terms sharing the prefix). Phrase and NEAR need per-term positions, which the stage-1 ftsdoc format omits; they follow as an ftsdoc v2 format addition.
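Both prefix paths rely on the same property: in a sorted term list, all terms sharing a prefix are contiguous and start at the lower bound of the prefix itself. A small Python sketch (the function names are hypothetical, and Python's code-point string ordering stands in for the byte ordering used on disk):

```python
from bisect import bisect_left


def has_prefix(sorted_terms, prefix):
    """Sequential-matcher case: does any document term start with prefix?"""
    i = bisect_left(sorted_terms, prefix)
    return i < len(sorted_terms) and sorted_terms[i].startswith(prefix)


def terms_with_prefix(sorted_terms, prefix):
    """Index-scan case: collect every dictionary term sharing the prefix.

    Seek to the lower bound, walk forward, and stop at the first term that
    no longer starts with the prefix.
    """
    i = bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out
```

The posting lists of the terms returned by `terms_with_prefix` would then be unioned to answer `quick*`.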

Add fts_index_stats(regclass) -> (ndocs, avgdl, nterms) and fts_index_df(regclass, ftsquery) -> float8[], reading N, avgdl and per-term document frequency from the bm25 index metapage and dictionary. BM25 can now be scored from statistics the index maintains rather than caller guesses, closing the loop between the AM and the scorer. (Streaming index-only WAND top-K is a further optimization.) Shipped as 1.6 -> 1.7.

…partial) Update the README to reflect the nine qualified stages (versions 1.0-1.7 plus prefix queries) and to state honestly what remains: phrase/NEAR, WAND top-K, incremental maintenance, contentless indexes, the parity gate, and the fuzzy/regex stages.

The bm25 access method's aminsert no longer errors: it appends the new document verbatim to an in-index chain of pending pages and bumps the metapage N and sum(doclen). The scan searches pending documents directly with the per-document matcher, so newly inserted rows are immediately visible to @@@ without a REINDEX. Per-term df in the dictionary stays stale until a merge (REINDEX), matching GIN fastupdate's documented behavior. All page writes go through GenericXLog. Shipped as 1.7 -> 1.8 (bm25 metapage format changed; REINDEX required for pre-1.8 bm25 indexes).

Extend the ftsdoc format to v2, storing per-term token positions, and add quoted-phrase query syntax ("a b c"): the parser emits an FTS_OP_PHRASE chain (distance 1), and the matcher verifies adjacency by intersecting term position lists. The bm25 index treats a phrase as AND for candidate generation and now requests a bitmap recheck, so @@@ re-evaluates adjacency exactly against the heap ftsdoc. Position-free v1 docs remain valid (phrase degrades to AND). NEAR(a b, k) reuses the same distance-aware phrase_step and is a small parser addition (comma + integer) on top of this. Shipped as 1.8 -> 1.9 (ftsdoc format v2).
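Adjacency checking by position-list intersection can be sketched like this. The helper names are hypothetical, and the sketch handles only the exact-distance phrase case; it does not model NEAR or the v1 fallback to AND.

```python
def index_positions(tokens):
    """Build term -> list of token positions, as an ftsdoc v2 would store."""
    positions = {}
    for pos, term in enumerate(tokens):
        positions.setdefault(term, []).append(pos)
    return positions


def phrase_match(positions, phrase, distance=1):
    """True if the phrase terms occur in order, each `distance` after the last."""
    if not phrase or any(term not in positions for term in phrase):
        return False
    candidates = set(positions[phrase[0]])
    for term in phrase[1:]:
        here = set(positions[term])
        # Keep only the chains that can be extended by this term.
        candidates = {p + distance for p in candidates if p + distance in here}
        if not candidates:
            return False
    return True
```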

The bm25 index stores only postings, never document text, so an expression index on to_ftsdoc(text_column) is exactly FTS5's external-content model: the text lives in the base table, the index is derived from it, and @@@ queries (including phrases, via recheck) work against the expression. Shipped as 1.9 -> 1.10 (documentation marker; no new SQL objects).

Add two ftsquery term forms:

- term~k matches document terms within Levenshtein distance k (default 2), using core varstr_levenshtein_less_equal (bounded, no new dependency)
- /re/ matches document terms against a POSIX regex via core's cached regex engine

Both are evaluated per-document in the matcher. The bm25 index returns all indexed tuples as candidates for fuzzy/regex queries and the bitmap heap recheck applies the exact test, so results are correct through the index. This follows the plan's 'no new dependency for the common case' path; the pg_tre trigram-formula pre-filter (with its Lime grammar converted to Bison+Flex, and sparsemap v5.1.1 for posting compression) to narrow candidates at scale is future work. Shipped as 1.10 -> 1.11.

Add bench/bench.sql and bench/README: a reproducible A/B harness comparing the bm25 stack against tsvector + GIN + ts_rank on a user-supplied corpus (index size, ranked top-10 EXPLAIN ANALYZE). The full parity gate (latency percentiles, NDCG vs qrels, concurrent-ingest throughput, Lucene/bm25s score conformance) is documented as a manual, reported measurement rather than a make-check regression, since it needs an external corpus. Update README.pg_fts to describe all implemented stages (1.0-1.11), the full query language, a worked BM25 ranking example, and the remaining future work (WAND top-K, trigram pre-filter for fuzzy/regex, NEAR, background merge, BM25F).

Add fts_bm25f(docs ftsdoc[], query, weights[], n_docs, avgdls[], dfs[]): the Robertson/Zaragoza BM25F, where per-field term frequencies are length-normalized per field and combined by weight before the tf-saturation step (not a naive sum of per-field BM25 scores). This lets a term in a heavily weighted field (e.g. title) outrank the same term in the body. Shipped as 1.11 -> 1.12.
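The difference from summing per-field BM25 scores is where saturation happens. A hedged Python sketch of the Robertson/Zaragoza formulation follows; it assumes one shared b for all fields, one df per query term, and the Lucene-style IDF, none of which the message confirms for fts_bm25f().

```python
import math


def bm25f(field_tfs, field_lens, weights, avgdls, query_terms, n_docs, dfs,
          k1=1.2, b=0.75):
    """BM25F for one document with several fields.

    field_tfs  -- one dict (term -> tf) per field
    field_lens -- token count of each field in this document
    weights    -- per-field weights
    avgdls     -- corpus-average length of each field
    """
    score = 0.0
    for term, df in zip(query_terms, dfs):
        # Step 1: length-normalize tf within each field, then combine by weight.
        pseudo_tf = 0.0
        for tfs, length, weight, avgdl in zip(field_tfs, field_lens, weights, avgdls):
            tf = tfs.get(term, 0)
            if tf:
                pseudo_tf += weight * tf / (1.0 - b + b * length / avgdl)
        if pseudo_tf == 0.0:
            continue
        # Step 2: saturate once, on the combined pseudo-frequency.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

With weights `[3, 1]` for (title, body), a single title hit produces a larger pseudo-frequency than a single body hit and so scores higher.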

Add bm25_merge_pending(): read the existing dictionary + posting chains and all pending documents back into a build state, rewrite the merged structure into fresh blocks, and repoint the metapage -- no heap access. Wired into amvacuumcleanup (VACUUM now folds pending docs automatically) and exposed as fts_merge(regclass) for on-demand merge. Merging resolves the df staleness that incremental inserts introduce (formerly-pending terms gain dictionary df). Old blocks are left unreferenced and reclaimed by REINDEX; an FSM-based page recycler is future work. Shipped as 1.12 -> 1.13.

Add fts_search(index, query, k) -> setof(ctid, score): BM25 top-k computed entirely from the index -- postings supply per-doc tf, the dictionary supplies df and a per-term max-tf impact bound (now stored), the metapage supplies N and avgdl -- with no heap access. Per-document scores accumulate across query terms and the top-k are returned by descending score; join on ctid to fetch rows. This is the index-only-scoring path (no heap fetch to rank), the core performance win for ranked search. Stored per-term max_tf in the dictionary provides the WAND upper bound for document skipping. Full executor integration via amcanorderbyop (an ORDER BY score LIMIT k ordering scan with block-max WAND early termination) and exact per-document |D| in postings are the remaining optimizations. Shipped as 1.13 -> 1.14.

Add pg_fts_trgm.c: reduce a fuzzy term to its trigrams and test only document terms sharing a trigram with it (Levenshtein is the expensive step). This is the pg_tre-style pruning that makes fuzzy matching viable on a large vocabulary, applied at the term level. Results are unchanged: the filter only skips candidates that provably cannot match (pigeonhole: a match within k edits shares a trigram when the term has more than k trigrams) and falls back to a full scan for short terms. A persistent on-disk trigram posting index in the bm25 AM (the full three-tier funnel) is the remaining scale work. Shipped as 1.14 -> 1.15.
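A Python sketch of the term-level pre-filter follows. It is not pg_fts_trgm.c: the function names are hypothetical, trigrams are taken without padding, and the "safe to filter" condition is this sketch's own reading of the pigeonhole argument (a term of length at least 3(k+1) contains k+1 non-overlapping trigrams, and k edits can damage at most k of them).

```python
def trigrams(term):
    """Set of overlapping 3-character substrings (no padding)."""
    return {term[i:i + 3] for i in range(len(term) - 2)}


def levenshtein_at_most(a, b, k):
    """Bounded edit distance: True iff distance(a, b) <= k."""
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > k:
            return False  # no cell in later rows can drop back below k
        prev = cur
    return prev[-1] <= k


def fuzzy_matches(doc_terms, term, k=2):
    """Return the document terms within k edits of term."""
    can_filter = len(term) >= 3 * (k + 1)  # otherwise fall back to a full scan
    wanted = trigrams(term)
    hits = []
    for candidate in doc_terms:
        if can_filter and not (wanted & trigrams(candidate)):
            continue  # provably not within k edits; skip the expensive step
        if levenshtein_at_most(candidate, term, k):
            hits.append(candidate)
    return hits
```

The filter only removes candidates, so the result is the same as running the bounded Levenshtein test on every term.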

fts_search() returned candidate ctids straight from the postings, which can reference dead or updated tuples that the index has not yet merged out. It now opens the base table and checks each candidate against the active snapshot via table_index_fetch_tuple, returning only visible tuples in score order and stopping once k visible rows are found. This makes the SRF correct under all isolation levels, matching the visibility contract the @@@ bitmap path already gets from the executor's bitmap heap scan + recheck. (The @@@ operator path was already MVCC-correct: amgetbitmap sets recheck=true and the bitmap heap scan applies snapshot visibility. All page I/O uses the buffer manager, and every writer is WAL-logged via GenericXLog, so the index is correct on physical standbys and after crash recovery.)

Posting pages now store postings as a BM25PostingPageHdr (count) plus a varint stream of (docid-gap, tf), instead of a raw 12-byte BM25Posting array. docids (block*MaxHeapTuplesPerPage + offset) are sorted ascending per term so gaps are small and pack to 1-2 bytes; tf likewise. A single bm25_page_decode() feeds all five posting readers (scan eval, prefix, universe, term-postings, merge), and bm25_write_postings() encodes. This is the index-size win needed to compete at scale; results are byte-for-byte identical (compression is transparent). Also fix a buffer-pin leak in the stage-7 pending-insert path: when the tail pending page was full, tailbuf was unlocked but not unpinned before being re-read as oldtail. Surfaced by the larger compression test under cassert. Shipped without a format version bump note here (unreleased); REINDEX any existing bm25 index.
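The size win comes from storing gaps between ascending docids rather than full TIDs. The sketch below assumes a LEB128-style varint (7 payload bits per byte), a page of 291 tuples as a stand-in for MaxHeapTuplesPerPage, and no page header; the exact byte layout of the patch is not stated in the message, and all function names here are hypothetical.

```python
MAX_TUPLES_PER_PAGE = 291  # assumed stand-in for MaxHeapTuplesPerPage


def docid(block, offset):
    """Flatten a heap TID into one ascending integer."""
    return block * MAX_TUPLES_PER_PAGE + offset


def put_varint(out, value):
    """Append value using 7 bits per byte; the high bit marks continuation."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def get_varint(buf, pos):
    """Decode one varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_postings(postings):
    """postings: list of (docid, tf) sorted by docid."""
    out, prev = bytearray(), 0
    for d, tf in postings:
        put_varint(out, d - prev)  # gap to the previous docid
        put_varint(out, tf)
        prev = d
    return bytes(out)


def decode_postings(buf):
    pos = prev = 0
    result = []
    while pos < len(buf):
        gap, pos = get_varint(buf, pos)
        tf, pos = get_varint(buf, pos)
        prev += gap
        result.append((prev, tf))
    return result
```

Three postings that would take 36 bytes as raw 12-byte structs encode to 7 bytes here, because two of the gaps and all of the tf values fit in one byte each.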

fts_search now uses a proper DAAT WAND: per-term cursors over the docid-sorted posting lists, a top-k heap with a running score threshold, and per-term max-contribution bounds (from the stored per-term max_tf) so documents that cannot enter the current top-k are skipped via the pivot rule rather than fully scored. Results are byte-for-byte identical to the previous exact accumulate path (verified: no regression-test diffs), with the WAND pruning as the speedup. Posting pages now also carry per-page block-max_tf and first-docid in the page opaque, the on-disk foundation for block-level skipping in a future page-cursor WAND. WAND over-fetches candidates so the MVCC visibility filter still yields k visible rows. Full amcanorderbyop executor integration (ORDER BY score LIMIT k driving the ordering scan directly) remains as polish on top of this.

Postings now carry |D| (document token count) as a third varint per posting, so the index-only scorer applies exact BM25 length normalization instead of approximating |D| = avgdl. This is a relevance-accuracy fix: for equal tf, a shorter document now correctly outranks a longer one (matching Lucene/bm25s), which is what the top-k test now shows (a 1-token 'quick' doc outranks a 3-token doc). The WAND upper bound uses the shortest-document norm (tf + k1*(1-b)) so it never underestimates and never prunes a qualifying hit. REINDEX required (posting format changed).
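Why the shortest-document norm is a safe upper bound can be checked numerically. The sketch below uses the textbook tf component with a `(k1 + 1)` numerator, which is an assumption about the exact form used in the patch; the bound argument does not depend on that constant factor.

```python
def tf_part(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Length-normalized tf component of BM25 for one posting."""
    return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avgdl))


def tf_upper_bound(max_tf, k1=1.2, b=0.75):
    """Bound for a term: largest stored tf, normalized as if |D| were 0.

    The denominator tf + k1*(1-b) is the smallest any real document can
    produce, and the expression is increasing in tf, so no posting of this
    term can score above it.
    """
    return max_tf * (k1 + 1.0) / (max_tf + k1 * (1.0 - b))
```

Multiplying the bound by the term's IDF gives a per-term maximum contribution that never underestimates, which is what WAND needs in order to skip documents without losing a qualifying hit.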

Comment-only sweep (no behavior change): several file/function headers still described an early 'skeleton' that has since been fully built, which is misleading to readers:

- pg_fts_am.c header: described a build-once, REINDEX-to-refresh, non-segmented index with (tid,tf) posting arrays -> now describes the segmented design (FOR-packed 128-doc blocks, per-block impacts, trigram index, livedocs tombstones, pending buffer + flush + tiered merge).
- pg_fts_am_scan.c header + bm25_lookup_term/bm25_lookup_prefix: no longer a bitmap-only 'skeleton scanning sorted arrays'; document IOS, WAND/MaxScore, fuzzy/regex, VM-aware counts, and the sparse-block-index seek.
- BM25PendingItem: 'REINDEX for now' -> flush via fts_merge()/VACUUM cleanup.
- pg_fts_rank.c / pg_fts.h / pg_fts_query.c: the index DOES maintain N/avgdl/df, and phrase/NEAR/prefix/fuzzy/regex ARE implemented (not 'later stages').
- fts_doc_has_fuzzy: comment claimed it scans all terms; it already applies the length + trigram pre-filters described.

No stubs or dead code found beyond the bulkdelete/migrate issues fixed in the prior two commits.

…int) The BM25BlockHdr comment still claimed the intra-block payload was a varint stream with FOR/PFOR 'a later swap' -- but bm25_for_pack/bm25_for_unpack (frame-of-reference bit-packing of the docid-gap/tf/doclen columns) has been the encoding for a while. Correct the description; note only patched-FOR/PFOR for outliers remains a possible refinement.
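For readers unfamiliar with frame-of-reference packing, here is a small Python sketch of the idea: store one base value per block and each element as a fixed-width offset from it, which permits random access without decoding the whole block. Using the block minimum as the reference, the little-endian bit order, and the function names are all assumptions of this sketch, not details of bm25_for_pack/bm25_for_unpack.

```python
def for_pack(values):
    """Pack a non-empty list of non-negative ints as (base, width, bytes)."""
    base = min(values)
    width = max(v - base for v in values).bit_length()  # bits per element
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - base) << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return base, width, packed.to_bytes(nbytes, "little")


def for_get(base, width, data, i):
    """Random access to element i without unpacking the other elements."""
    if width == 0:
        return base  # every value in the block equals the base
    packed = int.from_bytes(data, "little")
    return base + ((packed >> (i * width)) & ((1 << width) - 1))
```

Four values in the range 100..107 need 3 bits each, so the whole column fits in 2 bytes plus the base and width.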

…re 1/6) bm25_build accumulated the entire corpus's terms + postings in one in-memory build state and wrote a single segment -- a very large CREATE INDEX could exhaust maintenance_work_mem / RAM (one of the faults that motivated the segmented redesign). Now the build callback checks MemoryContextMemAllocated against a budget (maintenance_work_mem, floored at 32MB) between tuples and, when exceeded, flushes the accumulated terms as an immutable segment and resets the build context to continue within a bounded footprint. A document's terms are always fully accumulated before a flush, so no doc is split across segments and per-segment ndocs/sumdoclen stay consistent. After the scan the residual is flushed and bm25_merge_segments compacts the segments so a fresh index is not left fragmented. Verified: 60k-doc build at maintenance_work_mem=1MB flushes multiple segments and returns byte-identical results to a single-segment build (counts, ndocs, ranked top-k all match seqscan); regression green.

bm25_merge_segments merged the WHOLE directory into one segment on every trigger -- O(index) write amplification under steady inserts. Replace with a Lucene TieredMergePolicy in miniature: sort live segments by size and merge only a run of similarly-sized segments (within BM25_MERGE_SIZE_FACTOR, >= BM25_MERGE_TIER_MIN of them), so small flushes coalesce cheaply while large segments are rarely rewritten. bm25_merge_selected merges a chosen subset and rewrites the metapage directory preserving the order of the kept segments and appending the merged one; the driver loops until no tier qualifies and the count is within budget. Tombstoned docs are still dropped as segments are read. Verified: 20 flushes coalesce to 3 segments (not 21), 1500 docs preserved and queryable; delete+re-merge keeps counts correct (1250==seqscan); regression green, zero crashes.
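The selection rule can be sketched in Python. The two constants below are hypothetical values chosen for illustration (the message names BM25_MERGE_SIZE_FACTOR and BM25_MERGE_TIER_MIN but does not give their values), and the sketch omits the segment-count budget and tombstone handling.

```python
SIZE_FACTOR = 2.0  # hypothetical value for BM25_MERGE_SIZE_FACTOR
TIER_MIN = 4       # hypothetical value for BM25_MERGE_TIER_MIN


def pick_merge_run(sizes):
    """Return indexes of a run of similarly sized segments, or [] if none.

    Segments are considered smallest first; a run is every segment whose size
    is within SIZE_FACTOR of the run's smallest member.
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    for start in range(len(order)):
        limit = sizes[order[start]] * SIZE_FACTOR
        end = start + 1
        while end < len(order) and sizes[order[end]] <= limit:
            end += 1
        if end - start >= TIER_MIN:
            return order[start:end]
    return []


def merge_until_stable(sizes):
    """Driver loop: merge qualifying runs until no tier qualifies.

    Kept segments retain their order; the merged segment is appended.
    """
    sizes = list(sizes)
    while True:
        run = pick_merge_run(sizes)
        if not run:
            return sizes
        merged = sum(sizes[i] for i in run)
        sizes = [s for i, s in enumerate(sizes) if i not in run] + [merged]
```

Four small flushes next to one large segment coalesce among themselves and leave the large segment untouched, which is the write-amplification saving the message describes.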

…/6-a) bm25_costestimate delegated entirely to genericcostestimate, which over-prices a ranked 'ORDER BY d <=> q LIMIT k' scan (a generic full index scan) so the planner could pick seqscan+sort instead of the index that natively honours the ORDER BY. Now: keep the generic selectivity/pages/rows (needed for honest row estimates), but when path->indexorderbys is set (a <=> ordering scan), price it as block-max WAND/MaxScore actually behaves -- a modest startup plus a small per-tuple cost and only a fraction of a page fetch per match -- reflecting that WAND with a pushed-down LIMIT does work sublinear in the match set. Plain @@@ scans keep the generic estimate. Conservative (low but nonzero per-tuple so large LIMITs still scale). Verified: with seqscan enabled the planner now chooses the Index Scan for a ranked LIMIT query; regression green.

bm25_lookup_prefix scanned the ENTIRE dictionary chain for term* queries. Since dictionary entries are byte-sorted and each segment already has a sparse per-page block index (used by bm25_dict_seek for exact lookups), the matching terms are contiguous: seek to the page that can hold the prefix, scan forward only while entries could still start with it, and stop at the first term that sorts past the prefix. No new on-disk structure -- just a smarter walk, now sublinear in the dictionary. Signature takes the BM25SegMeta (for dictindexstart); bm25_eval_query caller updated. Verified exact vs a full seqscan over a 30k-term multi-page dictionary: term001* = 100, term1* = 10000, word* = 30000, zzz* = 0, term29999* = 1 -- all match; regression green. (A front-coded/FST dictionary would compress the term bytes further but is not needed for sublinear prefix search.)

…e 2/6) The metapage segment directory is a fixed segs[] array. With the size-tiered merge (feature 3) the live segment count stays far below the cap, so a chained overflow directory -- which would complicate all 49 reader sites that iterate meta.segs[0..nsegments] -- is unwarranted for a case the merge policy makes unreachable. Instead double the cap to 128 (the metapage is then ~6.2KB of the ~8KB page, still comfortable) as a wider safety margin, and make the overflow path a clear, actionable error (index name + VACUUM/REINDEX hint) rather than a terse message; data is never corrupted at the limit. Verified metapage still fits and the index builds/queries correctly; regression green. (Chaining is documented as unnecessary given tiered merge.)

…ranted) Evaluated patched-FOR (PFOR) intra-block encoding for the docid-gap/tf/doclen columns. Instrumented the FOR packer on a Zipfian 50k-doc corpus: extracting the top ~1/16 outliers per column would shrink the column bytes by only ~7%, which is under ~0.5% of the whole index (the columns are already narrow -- tf and doclen are small, and docid gaps are tiny within a common term's dense blocks). That does not justify adding exception-handling to the hot-path random-access decoder (bm25_for_get, used per scored posting in WAND). Plain FOR is kept; the block-encoding comment now records this measured decision rather than implying PFOR is pending.

…return)

1. add_posting term-key collision (data loss): two distinct terms >= 64 bytes sharing their first 64 bytes hash to the same padded key; the old code, on a key hit that failed exact comparison, created a new BuildTerm and CLOBBERED the hash entry's termidx -- fragmenting a term's postings across dictionary entries that readers (which stop at the first match) never reach, giving wrong df/counts. Now BuildTerms chain per key (BuildTerm.next) and add_posting walks the chain for the truly-equal term. Verified: three 70-char terms sharing a 64-byte prefix each return the correct count.

2. bm25_canreturn returned true unconditionally, but the index is NOT covering. The planner then chose an index-only scan for SELECT <indexed ftsdoc column> and returned the placeholder all-NULL tuple -- i.e. NULLs instead of the real value (verified: 'SELECT d WHERE d @@@ q' gave d IS NULL). Return false so IOS is never used to fetch a real attribute. count(*) now uses a bitmap/plain index scan (correct; the fast transparent-count IOS is given up because the @@@ restriction column is in the IOS coverage check); the explicit visibility-map-aware fts_count() remains the fast count. Corrected the now-inaccurate IOS comments; bm25_set_itup kept as a guarded no-op.

Full regression green; zero crashes/leaks.
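The chaining fix for the first issue can be modeled in Python. This is a sketch of the idea only: a dict keyed by the padded 64-byte key stands in for the hash table, a Python list stands in for the BuildTerm.next chain, and the class and method names are hypothetical.

```python
KEY_BYTES = 64  # terms are hashed on their first 64 bytes, zero-padded


class BuildState:
    def __init__(self):
        # padded key -> chain of [full_term, postings] entries
        self.buckets = {}

    def add_posting(self, term, tid, tf):
        key = term[:KEY_BYTES].ljust(KEY_BYTES, b"\0")
        chain = self.buckets.setdefault(key, [])
        # Walk the chain for the truly equal term instead of overwriting
        # the bucket when the key matches but the full term does not.
        for entry in chain:
            if entry[0] == term:
                entry[1].append((tid, tf))
                return
        chain.append([term, [(tid, tf)]])

    def df(self, term):
        key = term[:KEY_BYTES].ljust(KEY_BYTES, b"\0")
        for full_term, postings in self.buckets.get(key, []):
            if full_term == term:
                return len(postings)
        return 0
```

Three 70-byte terms that share their first 64 bytes land in one bucket but keep separate posting lists, so each reports its own df.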

…t accuracy)

MED:
- ftsdoc_recv/ftsquery_recv: bound the wire-supplied element count against the remaining message length before palloc, rejecting hostile/corrupt binary input (overflow/OOM at a trust boundary).
- sm_create NULL-checks at all three call sites (libc malloc can fail).
- Document that ranked (WAND) results cover merged segments only; pending docs are @@@/fts_count-visible and become ranked after a flush (fts_merge/VACUUM).

LOW (dead code / stale comments now matching the implementation):
- Remove unused page-opaque fields (block_max_tf/first_docid_hi/lo -- block-max lives in BM25BlockHdr) and the dead BM25PostingPageHdr struct.
- Fix bm25_write_postings comment (FOR blocks, not delta+varint pages).
- trgm_index.c header: trigram -> TERM-ORDINAL sets (not docids); reconcile the contradictory "popular trigrams skipped" comments (nothing is skipped).
- pg_fts.h / pg_fts.control: drop "stage 1, no index AM yet" (the bm25 AM, ranking, segments exist).
- pg_fts_aux.c: drop the fts_score_explain() claim (never implemented).
- pg_fts_match.c: matcher does phrase/NEAR/prefix/fuzzy/regex, not just boolean.
- ftsquery_out: it is a display rendering, not a guaranteed parser round-trip.
- fts_regex_trigrams / trgm.c: clarify union-then-recheck (not "ANDed").

Full regression green; zero crashes/leaks.

… features The README still described the pre-segmented design (delta+varint postings, a version list ending at 1.15, 'benchmark belongs alongside'). Rewrite to document: the segmented storage architecture (dictionary+block-index, FOR-packed 128-doc blocks with impact bounds, trigram index, livedocs tombstones, pending buffer + flush + size-tiered merge + VACUUM tombstoning); the query-execution paths (bitmap @@@, WAND/MaxScore <=> ordering scan with lazy column decode, fts_count); versions through 1.19; and an honest limitations section (resumable cursor, impact-ordered postings; PFOR and chained directory evaluated and declined). Points at bench/RESULTS_*.md for parity.

Replace "next-generation" and similar phrasing in the README title/intro, pg_fts.h header, PGFILEDESC, control-file comment, and two code comments with neutral, factual descriptions. Drop the reference to the out-of-tree plan document and the "headline" benchmark summary (the results files carry the numbers).

The static library for vendor/sm.c passed -Wno-declaration-after-statement unconditionally; MSVC's cl rejects the GCC-style spelling with 'D8021: invalid numeric argument', failing the Windows CI build. Filter the flag through cc.get_supported_arguments() so it is used on GCC/Clang and dropped on MSVC (which accepts C99 mixed declarations natively).

The Windows/Visual Studio CI job failed compiling the vendored sparsemap: cl rejects the GCC-style -Wno-declaration-after-statement flag, and the source uses GCC extensions MSVC lacks (__attribute__((aligned/format/always_inline/hot)) and the POSIX ssize_t). Refresh vendor/sm.[ch] from upstream sparsemap and add vendor/sm_compat.h, a portability shim included first by both: on _MSC_VER it neutralizes __attribute__(...) (every use is an alignment/diagnostic/inlining hint that does not change layout at the natural alignment of the members on x64) and typedefs ssize_t from SSIZE_T (<BaseTsd.h>). Normalize the fallthrough markers to the project's all-caps /* FALLTHROUGH */ style, and gate both the fallthrough and declaration-after-statement warning suppressions through cc.get_supported_arguments (meson) / a target-scoped CFLAGS override (make) so they apply on GCC/Clang and are dropped on MSVC. The only pg_fts-specific delta to the vendored code remains the SPARSEMAP_PREFIX=__pg_bm25_ namespacing block. Local build is warning-clean under -Wimplicit-fallthrough=5; qualify PASS.

…MSVC

Two failures from the CI clang/MSVC builds not seen under local gcc:

- CompilerWarnings (clang -Werror): bm25_varint_encode/bm25_varint_decode and the BM25_MAX_POSTING_BYTES macro were dead once the FOR block codec replaced the varint posting encoding. Remove them.
- Windows/MSVC: sparsemap's SM_LIKELY/SM_UNLIKELY expanded to __builtin_expect unconditionally (its comment already claimed they were no-ops elsewhere). Guard them behind __GNUC__/__clang__ like the other SM_* intrinsics, falling back to the bare condition on MSVC.

qualify PASS; warning-clean under -Wimplicit-fallthrough=5.

Linux Meson (32-bit) CI crashed in the tombstone VACUUM path (delete-all then reuse). In sparsemap's __sm_map_unset(), the size_t byte-offset variable 'offset' was overloaded with SM_IDX_MAX (== UINT64_MAX) as a 'gate coalescing off' sentinel. On ILP32 targets size_t is 32-bit, so the store truncated to 0xFFFFFFFF while the gate test 'offset != SM_IDX_MAX' promoted it back and compared against 0xFFFFFFFFFFFFFFFF -- never equal -- so coalescing ran on an uninitialized/invalidated chunk and crashed (the -Woverflow warnings at sm.c:1607/3636/... were the tell). Introduce a size_t-width sentinel SM_UNSET_NO_COALESCE ((size_t)-1) for the byte-offset gate and use it at the three no-op/invalidated-pointer sites and the coalesce test. Fixed upstream in ~/ws/sparsemap and re-vendored; upstream test suite (test_main, test_coverage 7801 expectations, portability, RLE, prefix, large-index) all pass. qualify PASS; regression green.

…t fix) The previous fix missed the first gate-off site in __sm_map_unset (the 'no chunks in the map' branch still assigned SM_IDX_MAX to the size_t offset), so the 32-bit tombstone VACUUM crash persisted. Route that site through SM_UNSET_NO_COALESCE too, and change __sm_chunk_rank's rank->rem 'infinity' sentinel from UINT64_MAX to SIZE_MAX (rank->rem is size_t; the value is only ever used as a large remaining-count, but the constant must not truncate). Verified on genuine i686 (gcc -m32 on Fedora): the delete-all-then-reuse coalescing reproducer passes and the build is free of -Woverflow/-Wuninitialized in sm.c. 64-bit upstream suite still green (test_main 44/44, coverage 7801, portability, RLE); pg_fts regression green.

Real-corpus testing exposed a correctness bug: after a document is deleted and VACUUMed (tombstoned in its segment), if its heap slot is reused by a NEW inserted document (same TID/docid), the new document was not found by @@@ / fts_count. Two causes, both fixed:

1. Pending docs were tombstone-filtered. A doc in the pending write buffer is a live heap tuple by definition, but bm25_collect_matches OR'd pending matches into the accumulator BEFORE the tombstone filter, so a pending doc reusing a tombstoned TID was dropped. Collect pending matches separately and union them AFTER filtering; pending docs are never tombstoned.

2. Tombstones were applied globally across segments. bm25_docid_tombstoned checked a docid against EVERY segment's tombstone map, but a tombstone belongs to exactly one segment: a TID deleted in segment A may be legitimately reused by a live doc that a newer segment B indexes. Filter each segment's own match contribution against only THAT segment's tombstone map (bm25_filter_tombstoned_seg), and in the WAND ranked path skip own-segment tombstoned docids inside each cursor (cursors are per-segment). Removed the now-unused global bm25_filter_tombstoned / bm25_docid_tombstoned.

The regress tombstone test previously encoded the buggy 0 as expected output; corrected to 60 and added an fts_count assertion so the count path is covered too. qualify PASS; regression green.
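Both causes come down to the order and scope of the tombstone filter, which a few lines of Python can model. The functions below are hypothetical illustrations with sets standing in for match bitmaps and tombstone maps; `collect_matches_global` reproduces the earlier behavior only as this message describes it.

```python
def collect_matches(segments, pending):
    """Fixed behavior: filter each segment by its own tombstones, then add pending.

    segments -- list of (matches, tombstones) pairs, each a set of docids
    pending  -- set of docids matched in the pending buffer (always live)
    """
    live = set()
    for matches, tombstones in segments:
        live |= matches - tombstones
    return live | pending


def collect_matches_global(segments, pending):
    """Earlier behavior, kept for contrast: one global tombstone filter
    applied after pending matches were already merged in."""
    everything = set(pending).union(*(m for m, _t in segments))
    all_tombstones = set().union(*(t for _m, t in segments))
    return everything - all_tombstones
```

When docid 7 is tombstoned in an older segment and then reused by a new document, the global filter drops the new document whether it sits in a newer segment or in the pending buffer; the per-segment filter keeps it in both cases.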

A document whose analyzed ftsdoc did not fit on a single pending page raised 'ftsdoc too large for a bm25 pending page' and failed the INSERT/UPDATE -- real corpora (e.g. long Wikipedia articles) hit this routinely, so the index could not be built over them. When an ftsdoc exceeds the pending-page capacity, index it directly as its own one-document segment (bm25_insert_oversized_as_segment) via the existing build machinery: segment posting storage is a chain of FOR-packed pages with no per-document size limit. Such documents are rare, so building a small segment per oversized insert is acceptable; corpus N/sumdoclen stay correct (bm25_meta_add_segment accounts the doc, and the pending path is bypassed so there is no double count). Added a regression test with a ~4000-token document.

qualify PASS; regression green.

Vendor sparsemap v5.2.0, which brings the MSVC portability shim (SM_ALIGNED / ssize_t / guarded __builtin_*) and the O(N) coalescing performance fix upstream. This replaces the local vendor/sm_compat.h shim (now redundant) and my earlier one-off MSVC edits; sm.h is byte-identical to pristine v5.2.0 and sm.c differs only by the __pg_bm25_ SPARSEMAP_PREFIX namespacing block plus the fix below. v5.2.0 still has the 32-bit unset-coalesce truncation bug (its own change did not touch the gate): in __sm_map_unset the size_t byte-offset 'offset' is set to SM_IDX_MAX (== UINT64_MAX) as a 'gate coalescing off' sentinel, which truncates to 0xFFFFFFFF on ILP32 so the '!= SM_IDX_MAX' gate (which promotes offset back to 64 bits) never matches -> coalescing runs on an uninitialized chunk and crashes. Route the four gate-off sites and the gate test through a size_t-width sentinel SM_UNSET_NO_COALESCE. (Fixed in ~/ws/sparsemap for upstreaming; vendored here.) Consumer-side alignment fix: struct sparsemap is declared SM_ALIGNED(8), but palloc only guarantees MAXALIGN (4 on ILP32), so palloc0(n*sizeof(sm_t)) placed the per-segment tombstone maps at 4-aligned addresses -> -fsanitize=alignment abort in the Linux Meson (32-bit) CI job. Allocate that array with palloc_aligned(..., 8, 0). Verified on genuine i686 under -m32 -fsanitize=undefined,alignment: the delete+coalesce+reopen path runs clean (live=60) with an 8-aligned sm_t. qualify PASS; 64-bit regression green; upstream suite (test_main 44/44, coverage 7805, portability, RLE) green.

Indexing an oversized document creates a one-document segment (bm25_insert_oversized_as_segment). A bulk INSERT/UPDATE over a corpus with many large documents (e.g. rebuilding the expression index over 2M Wikipedia rows) could therefore create one segment per oversized row and hit the hard BM25_MAX_SEGMENTS (128) cap before the next VACUUM merged anything -- 'bm25 index reached the maximum of 128 segments'. After flushing an oversized segment, if the live segment count is within 16 of the cap, run bm25_merge_segments() to coalesce. Verified: 200 oversized documents now index without overflow (merged to well under the cap) and all remain searchable. qualify PASS; regression green.

Head-to-head on 2,000,000 real English-Wikipedia articles (PG20devel, r7i.8xlarge). Match sets verified identical to the GIN path before timing. Headline (median of 9, warm): ranked retrieval -- the actual FTS use case -- is where pg_fts wins, and the win grows with term frequency because GIN's ts_rank must fetch+score+sort every match while pg_fts stops early via block-max WAND: ranked top-10 common&mid 15 ms vs 65 ms (4.2x), top-100 common 75 ms vs 3029 ms (40x), top-10 two-common 50 ms vs 1641 ms (33x). Plain counts / boolean AND tie (both bitmap-scan); fts_count beats count(*) 3.7x. Cost: ~2.5x larger index (per-posting tf/|D|/positions), comparable single-threaded build. Adds bench/get_wikipedia.py (HF parquet -> TSV loader) and bench/bench_fixed.sh (pinned-term median-of-9 A/B runner).

…ch bench sparsemap v5.2.1 upstreams the ILP32 gate fix I carried locally in the vendored copy (SM_UNSET_NO_COALESCE, the size_t-width coalesce sentinel) and adds an independent bug fix -- a multi-run RLE-chunk corruption in __sm_separate_rle_chunk (wrong payload-vector placement and pivot-size accounting that desync'd the sequential chunk walk). Re-vendor v5.2.1: sm.h is byte-identical to pristine v5.2.1 and sm.c now differs ONLY by the __pg_bm25_ SPARSEMAP_PREFIX block (no local modifications remain). Verified on i686 (gcc -m32 -fsanitize=undefined,alignment): the delete+coalesce+reopen path is clean; 64-bit upstream suite green (test_main 44/44, coverage 7820). qualify PASS; regression green. Also add bench/RESULTS_VS_PGSEARCH_WIKI.md: an honest pg_search 0.24.1 head-to-head on 2M real Wikipedia articles (PG17.10). pg_search wins ranked retrieval (flat ~9ms; pg_fts's docid-ordered WAND degrades with term frequency, 26->70ms) and common-term counts (aggregate pushdown); pg_fts's fts_count wins selective counts (1.9-2.4ms) and its index is 1.55x smaller (3590 vs 5574 MB). The gaps are the documented codec investments (impact-ordered postings, COUNT/aggregate Custom Scan pushdown) plus parallel scan -- architecture, not tuning.

…AIO) Code-cited capability matrix and answers to adopter questions: concurrent builds (CIC/REINDEX CONCURRENTLY work -- aminsert routes concurrent writes to the searchable pending list), index-only scans (no, non-covering by design; fts_count is the fast-count path), compaction (VACUUM+fts_merge drops tombstones logically, REINDEX reclaims physical space -- no online REPACK), feature parity vs pg_search/Tantivy, ZomboDB/ES and tsvector/GIN (has BM25/BM25F, phrase/NEAR/prefix/fuzzy/regex, <=> ordering scan, fts_count, highlight/snippet, tombstone deletes, WAL/GenericXLog safety; gaps: no parallel build/scan, no IOS, no aggregation pushdown, no impact-ordered postings), logical-replication drop-in (no -- re-platform: different API, indexes provisioned per-subscriber; physical replication + crash recovery ARE safe), and AIO (none of its own; build heap scan gets core read_stream free; nextblk pointer-chains defeat prefetch and WAND is anti-prefetch by design; only the cold merge full-scan could benefit, deferred until measured).

Closes the ranked-retrieval gap vs Tantivy/pg_search, whose latency stays flat as term frequency grows while pg_fts's docid-ordered block-max WAND degraded (top-100 over a common term walked the whole ~5300-block posting list). Format v3 adds a per-term impact-ordered block skip directory: for a term with df >= BM25_SKIPDIR_MIN (2048), the writer records every posting block's (blk, off, max_tf, min_doclen) and stores them sorted DESCENDING by an avgdl-independent impact proxy (max_tf desc, min_doclen asc) in a BM25_SKIPDIR page chain, referenced from three new BM25DictEntry fields (skipstart/skipoff/nblocks). The single-term ranked scan (fts_search_impact_single, dispatched from fts_search_wand when nterms==1 and a directory exists) visits blocks best-first and stops once k results beat the recomputed bound of the next block -- Tantivy-style early termination. Soundness (where the prior float-impact attempt failed on avgdl drift): only raw integers are stored; the bound is RECOMPUTED at query time from (max_tf, min_doclen) at the current avgdl, so it is always an exact upper bound as the corpus grows. The stored sort order is a visitation heuristic only -- exactness comes from the recomputed bound + the WAND stop condition. Verified: the index top-k set is identical to a brute-force seqscan+sort (regression test with a >2048-doc term and distinct top scores). Rare terms carry no directory and use the existing exact docid-ordered scan; v2 indexes are read with that scan too (version-gated), so no forced rewrite -- REINDEX gains the directory. Format BM25_VERSION 2->3; extension 1.19->1.20 (migration is a marker, no SQL surface change). qualify PASS; regression green.
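The query-time bound recomputation can be sketched with the textbook BM25 term score. Assumptions: the `SkipEntry` layout, the function names, and the k1/b values used in the usage note are hypothetical, and pg_fts's actual scoring formula may differ in detail. The point is only that the score rises with tf and falls with document length, so (max_tf, min_doclen) evaluated at the current avgdl bounds every posting in the block.

```c
#include <stdint.h>

/* Hypothetical directory entry: only the two stored integers matter here. */
typedef struct SkipEntry
{
    uint32_t max_tf;      /* largest term frequency in the block */
    uint32_t min_doclen;  /* shortest document length in the block */
} SkipEntry;

/* Textbook BM25 contribution of one term to one document. */
static double
bm25_score(double idf, double tf, double doclen,
           double avgdl, double k1, double b)
{
    double norm = k1 * (1.0 - b + b * doclen / avgdl);

    return idf * tf * (k1 + 1.0) / (tf + norm);
}

/*
 * Upper bound for every posting in the block, recomputed from the stored
 * integers at whatever avgdl the corpus has right now.
 */
static double
block_upper_bound(const SkipEntry *e, double idf,
                  double avgdl, double k1, double b)
{
    return bm25_score(idf, e->max_tf, e->min_doclen, avgdl, k1, b);
}
```

For example, with k1 = 1.2, b = 0.75 and an entry (max_tf = 5, min_doclen = 100), the bound is higher at avgdl = 400 than at avgdl = 200. A float bound frozen at avgdl = 200 would therefore under-estimate once the corpus average grows, which is the drift problem that storing raw integers avoids.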

The skip directory is stored sorted by a max_tf proxy, but the true per-block impact bound also depends on min_doclen (impact rises as min_doclen falls), so a block with a smaller max_tf can have a higher bound than an earlier entry. Stopping at the current entry's bound was therefore unsound -- it could halt before a later, higher-bound block, returning a wrong/incomplete top-k (observed at 2M scale: index top-k distances non-ascending and missing lower-distance docs vs a seqscan). Fix: after loading a term's directory, sort it by the EXACT impact bound recomputed at the current avgdl (descending) before the scan. Then the front-of-remaining early-stop is a valid ceiling on every not-yet-visited block and the returned top-k is exact. Verified with distinct scores at 8000 docs: index top-15 byte-identical to a brute-force seqscan+sort. (Insertion sort; near-O(n) since the stored max_tf order is already close to bound order.)
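A small sketch of the fix, assuming the textbook BM25 formula and hypothetical names (`SkipEntry`, `entry_bound`, `sort_by_bound_desc`): an insertion sort reorders the loaded directory by the exact bound at the current avgdl, so the entry at the front of the unvisited range is a ceiling for everything after it.

```c
#include <stdint.h>

/* Hypothetical directory entry holding the two stored integers. */
typedef struct SkipEntry
{
    uint32_t max_tf;
    uint32_t min_doclen;
} SkipEntry;

/* Textbook BM25 score at (max_tf, min_doclen): the block's exact upper bound. */
static double
entry_bound(const SkipEntry *e, double idf, double avgdl, double k1, double b)
{
    double norm = k1 * (1.0 - b + b * (double) e->min_doclen / avgdl);

    return idf * (double) e->max_tf * (k1 + 1.0) / ((double) e->max_tf + norm);
}

/* Insertion sort, descending by bound; cheap when input is nearly ordered. */
static void
sort_by_bound_desc(SkipEntry *dir, int n, double idf,
                   double avgdl, double k1, double b)
{
    for (int i = 1; i < n; i++)
    {
        SkipEntry key = dir[i];
        double    key_bound = entry_bound(&key, idf, avgdl, k1, b);
        int       j = i - 1;

        /* Shift entries with a smaller bound one slot to the right. */
        while (j >= 0 && entry_bound(&dir[j], idf, avgdl, k1, b) < key_bound)
        {
            dir[j + 1] = dir[j];
            j--;
        }
        dir[j + 1] = key;
    }
}
```

With avgdl = 100, k1 = 1.2 and b = 0.75 (assumed values), the entry (max_tf = 2, min_doclen = 10) has a higher bound than (max_tf = 3, min_doclen = 1000), so a directory stored in max_tf order is not in bound order, and stopping at the first entry's bound could skip the better block.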

The k*4 (min 64) over-fetch forced the impact-directory single-term scan to fetch far more than the LIMIT (a LIMIT 10 fetched 64), defeating its early stop. k*2 (min 32) still tolerates ~50% invisible top rows before the SRF under-fills, and the amgettuple ordering scan grows k via its own retry, so correctness holds while the early-stop can fire. qualify PASS; regression green.
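The two over-fetch rules amount to the following arithmetic. The function names are hypothetical; only the multipliers and floors come from the commit message.

```c
/* Previous rule: fetch k*4 candidates, never fewer than 64. */
static int
overfetch_old(int k)
{
    int n = k * 4;

    return n < 64 ? 64 : n;
}

/* New rule: fetch k*2 candidates, never fewer than 32. */
static int
overfetch_new(int k)
{
    int n = k * 2;

    return n < 32 ? 32 : n;
}
```

For LIMIT 10 the fetch target drops from 64 to 32, and for LIMIT 100 from 400 to 200. In the second case exactly half of the fetched rows can be invisible before fewer than k remain.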

… real text) The v3 impact-ordered block skip directory (commits f049f3a/ef637ba/68ce28f) was correct and avgdl-drift-safe, but instrumented block-visit counts on 2M real Wikipedia show it delivers NO early termination: within a term the per-block impact bounds cluster in a razor-thin band just above the top-k threshold (constant idf; a common term has some high-tf doc in nearly every block), so best-first block ordering still visits ~99% of blocks before it can stop. For 'year' (df 678k, 5296 blocks) the scan visited 5282; for 'hungary' 170/173. No measured latency win on any query band, at the cost of format v3, a skip-page chain, ~3% larger index and a per-query sort. Reverting to format v2 / extension 1.19. pg_search's flat ranked latency comes from a compact columnar segment codec (far less decode per candidate) plus query parallelism, not an impact skip structure -- matching it is a codec rewrite, out of scope here. bench/NOTE_IMPACT_ORDERING.md records the attempt, the measurements, and the conclusion so it is not re-tried blindly. qualify PASS; regression green; index format unchanged (v2).

github-actions Bot force-pushed the master branch 6 times, most recently from ce2678e to 82cc694 on July 4, 2026 16:04

gburd added 24 commits July 4, 2026 13:56

pg_fts: update README for BM25F, merge, index-only top-k, trigram filter

ba511cc

gburd added 29 commits July 4, 2026 16:28

pg_fts: tidy stale 'later tiered merge' comment (merge is implemented)

2d93f60

gburd force-pushed the fts branch from 5b9eb44 to c6bbfc7 on July 4, 2026 20:36

gburd wants to merge 109 commits into master from fts
