Fix flaky ITs: glossary rename propagation, ml model list, RDF glossary graph #29253
harshach wants to merge 7 commits into
Glossary rename left child glossary-term docs stuck at the old glossary name in the search index (flaky test_renameGlossaryPropagatesToChildTermSearchIndex). updateAssetIndexes drove the cascade through the fire-and-forget reindexAcrossIndices (getAsyncExecutor().submit), whose task raced the rename commit and read pre-commit rows on a separate connection; its late write then clobbered the correct post-commit write. Rewrite updateAssetIndexes to run in-line: re-index the glossary and every nested child term from the DB-authoritative getAllTerms list via updateEntity (drained synchronously on the request thread post-commit, giving read-your-write), plus one synchronous updateGlossaryTermByFqnPrefix for tagged assets' tags.tagFQN, the same mechanism glossary-term rename uses. Drops the async reindexAcrossIndices calls and the computed-but-unused getGlossaryUsageFromES/targetFQN bookkeeping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
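The ordering problem described in this commit can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: DeferredReindexSketch and all of its members are hypothetical stand-ins, not OpenMetadata classes, and the "async" path is modelled as one unlucky interleaving (the reindex runs before the commit) rather than with real threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: one glossary row, one search document, one deferral queue.
class DeferredReindexSketch {
  private String committedName = "OldGlossary"; // what a separate connection can read
  private String stagedName;                    // uncommitted rename inside the transaction
  private String indexedName;                   // what the search index currently holds
  private final List<Runnable> deferred = new ArrayList<>();

  // Rename inside the transaction; invisible to other readers until commit.
  void stageRename(String newName) {
    stagedName = newName;
  }

  // Rebuild the search document from whatever is committed right now.
  void reindexFromDb() {
    indexedName = committedName;
  }

  // Queue work to run only after the transaction has committed.
  void defer(Runnable task) {
    deferred.add(task);
  }

  // Commit first, then drain the queue so every task reads committed rows.
  void commitAndDrain() {
    if (stagedName != null) {
      committedName = stagedName;
      stagedName = null;
    }
    for (Runnable task : deferred) {
      task.run();
    }
    deferred.clear();
  }

  String indexedName() {
    return indexedName;
  }
}
```

Calling reindexFromDb() before commitAndDrain() leaves the index at the old name, while passing it to defer(...) yields the new name, because the read happens after the commit.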
Listing ml models 500'd with "Entity not found: mlmodelService <id>" (flaky MlModelResourceIT.testAutoPaginationFluentAPI) when a sibling test's cascade delete removed the parent service between the relationship lookup and the strict getEntityReferenceById(...NON_DELETED) in batchFetchServices. Resolve the service reference leniently: catch EntityNotFoundException and skip the entry (the ml model row is mid-cascade), mirroring the framework's existing getEntityOrNull / resolveInheritanceParentLeniently tolerance. The inheritance path already null-guards a null service, and only EntityNotFoundException is swallowed so real errors still surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
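A minimal sketch of the lenient-resolve idea follows. It assumes a plain in-memory map in place of the repository lookup; LenientResolveSketch, getServiceNameStrict, and resolveServices are hypothetical names for illustration, and the nested EntityNotFoundException is a local stand-in rather than the framework's class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-ins; not the OpenMetadata implementation.
class LenientResolveSketch {
  static class EntityNotFoundException extends RuntimeException {
    EntityNotFoundException(String message) {
      super(message);
    }
  }

  // Strict lookup: throws when the service row no longer exists.
  static String getServiceNameStrict(Map<UUID, String> services, UUID serviceId) {
    String name = services.get(serviceId);
    if (name == null) {
      throw new EntityNotFoundException("Entity not found: mlmodelService " + serviceId);
    }
    return name;
  }

  // Lenient batch resolve: skip models whose parent service vanished mid-cascade.
  // Only EntityNotFoundException is swallowed; any other failure still propagates.
  static Map<UUID, String> resolveServices(
      Map<UUID, String> services, Map<UUID, UUID> modelToService) {
    Map<UUID, String> resolved = new HashMap<>();
    for (Map.Entry<UUID, UUID> entry : modelToService.entrySet()) {
      try {
        resolved.put(entry.getKey(), getServiceNameStrict(services, entry.getValue()));
      } catch (EntityNotFoundException e) {
        // Parent service was deleted between the relationship lookup and this read.
      }
    }
    return resolved;
  }
}
```

The narrow catch is the design point: a model with a missing parent is left without a service reference, while unrelated exceptions are not masked.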
scopedResponseCarriesGlossaryNameAndIdPerNode waited only until the term node appeared in RDF, then asserted (outside the wait) that the node carried a populated group/glossaryId. Those project separately, so on a slower run the node was present while group was still null, failing the assertion. Move the group/glossaryId assertions inside the Awaitility block so it polls until the full projection lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
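The difference between asserting outside and inside the wait can be sketched without Awaitility. The pollUntil helper below is a hypothetical, simplified stand-in for an Awaitility untilAsserted-style block, and FakeGraph is an invented fixture in which the node is visible immediately but its group only appears on the third read.

```java
// Hypothetical polling helper and fixture; not the test's actual code.
class PollUntilSketch {
  static class Node {
    final String id;
    final String group; // null until the second projection lands

    Node(String id, String group) {
      this.id = id;
      this.group = group;
    }
  }

  static class FakeGraph {
    private int reads = 0;

    // The node exists from the first read; its group is populated from the third.
    Node readTermNode() {
      reads++;
      return new Node("term-1", reads >= 3 ? "glossary-1" : null);
    }

    int reads() {
      return reads;
    }
  }

  // Re-run the whole assertion block until it passes or attempts run out.
  static void pollUntil(int maxAttempts, long sleepMillis, Runnable assertions) {
    AssertionError last = new AssertionError("no attempts made");
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        assertions.run();
        return;
      } catch (AssertionError e) {
        last = e;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    throw last;
  }
}
```

With the group check inside the polled block, the first two reads fail and are retried; a check placed after a wait that only looked for the node would have failed on the first read.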
… loop Review feedback (greptile P2): the per-term updateEntity loop serialized N DB reads + N ES writes onto the request thread for large glossaries. Add SearchRepository.updateEntitiesByReference(refs): re-reads each entity with the same bounded reindex field-set updateEntity uses, then issues one bulk index update. Glossary rename now re-indexes its child terms through it (deferred post-commit), collapsing N ES round-trips into a single bulk request while keeping the boundary-safe rebuild-from-DB (getAllTerms over fixed-width fqnHash segments) that also refreshes the glossary denorm — a single ES prefix-rewrite can't, and would over-match sibling glossaries sharing a name prefix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
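The over-match concern with a bare prefix rewrite can be shown on plain strings. This is a simplified illustration, assuming unquoted names and "." as the FQN separator; FqnPrefixSketch and its methods are hypothetical, and the PR itself relies on getAllTerms over fqnHash segments rather than string comparison.

```java
// Hypothetical string-level illustration of prefix over-matching.
class FqnPrefixSketch {
  // Naive: any FQN that merely starts with the glossary name matches.
  static boolean naivePrefixMatch(String fqn, String glossaryName) {
    return fqn.startsWith(glossaryName);
  }

  // Boundary-safe: the glossary itself, or an FQN under "<glossaryName>.".
  static boolean boundaryPrefixMatch(String fqn, String glossaryName) {
    return fqn.equals(glossaryName) || fqn.startsWith(glossaryName + ".");
  }
}
```

For a glossary named Sales, the naive check also matches SalesOps.Quota, a term of a sibling glossary, whereas the boundary-aware check matches only Sales and FQNs beginning with "Sales.".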
Review feedback: updateEntitiesByReference re-read every reference with get(..., NON_DELETED) up front, so one child term concurrently deleted during a glossary rename threw EntityNotFoundException and aborted the whole bulk before updateEntitiesIndex ran — leaving none of the sibling docs re-indexed at the new name. The earlier per-term loop degraded gracefully (independent deferred writes); the bulk collapse lost that fault isolation. Guard each re-read with a per-reference catch of EntityNotFoundException (only that type, so real errors still surface), mirroring the lenient ML model service resolution in this PR. A deleted child is simply skipped — its own delete cascade removes its document — and the rest still index. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tter Nightly LiveIndexRetryIT.liveIndexEnqueuesRetriesDuringEsOutageAndDrainsAfter flaked: after a prolonged ES outage the table was never re-indexed (expected 1, got 0). Root cause: SearchIndexRetryWorker escalated *every* failure — including transient ones (cluster timeout / 5xx / IO) — via PENDING → RETRY_1 → RETRY_2 → FAILED. During an outage the bulk-flush timeouts burned the 3-attempt budget before the cluster recovered, moving the row to terminal FAILED. claimPending never re-claims FAILED, so the write was abandoned permanently — and the test's pendingRetryCount() reads 0 (FAILED isn't pending) while the entity stays unindexed. Only non-retryable (4xx document) errors should dead-letter. Cap retryable failures at PENDING_RETRY_2 so claimPending keeps re-claiming the row and it is re-indexed once the cluster is back. Fixes the flake and the underlying resilience gap (a transient outage no longer drops writes from the index). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
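The two escalation paths can be sketched as a pair of pure functions. The method names nextRetryStatus and retryableNextStatus appear in the review summary on this PR, but the bodies, the Status enum, and the PENDING_RETRY_1 constant name are assumptions made for illustration, not the actual SearchIndexRetryWorker code.

```java
// Hypothetical sketch of the retry status transitions.
class RetryStatusSketch {
  enum Status { PENDING, PENDING_RETRY_1, PENDING_RETRY_2, FAILED }

  // Non-retryable failures: escalate one step per attempt until terminal FAILED.
  static Status nextRetryStatus(Status current) {
    switch (current) {
      case PENDING:
        return Status.PENDING_RETRY_1;
      case PENDING_RETRY_1:
        return Status.PENDING_RETRY_2;
      default:
        return Status.FAILED;
    }
  }

  // Retryable failures (timeouts, 5xx, IO): escalate but never past PENDING_RETRY_2,
  // so the row stays in a claimable pending state until the cluster recovers.
  static Status retryableNextStatus(Status current) {
    switch (current) {
      case PENDING:
        return Status.PENDING_RETRY_1;
      case PENDING_RETRY_1:
      case PENDING_RETRY_2:
        return Status.PENDING_RETRY_2;
      default:
        return Status.FAILED; // already terminal rows are left untouched
    }
  }
}
```

Under this model, three consecutive non-retryable failures move a row from PENDING to FAILED, while any number of retryable failures leave it at PENDING_RETRY_2.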
test_putPreservesLogicalSuiteSearchMembership flaked on the loaded postgres-elasticsearch-redis leg: the test-case PUT description took longer than 30s to surface in the search index under that leg's async propagation load. Both search-membership awaits in the method get 60s, matching the slower-leg headroom other search-propagation tests use. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review ✅ Approved (2 resolved / 2 findings)
Resolves flaky integration tests by replacing racy async reindexing with synchronous post-commit operations and adding lenient handling for concurrent entity deletions. Also improves search reliability by preventing premature dead-lettering of transient reindexing failures.
✅ 2 resolved
✅ Edge Case: Bulk child-term reindex aborts all terms if one is concurrently deleted
✅ Edge Case: Retryable failures never dead-letter, risking infinite retry of poison records
🟡 Playwright Results: all passed (10 flaky)
✅ 4310 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 88 skipped
🟡 10 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace



Fixes three pre-existing flaky integration tests that surfaced across recent PRs. In each case the same commit passed on two of the three integration legs and failed on one (postgres-elasticsearch-redis or mysql-elasticsearch), which is a flake signature, not a regression in the PR that hit them.

1. GlossaryResourceIT.test_renameGlossaryPropagatesToChildTermSearchIndex

After a glossary rename, child glossary-term docs stayed at the old glossary name in the search index (60s timeout). GlossaryRepository.updateAssetIndexes drove the cascade through the fire-and-forget reindexAcrossIndices (getAsyncExecutor().submit). Submitted inside the rename transaction, that async task could read pre-commit (old-FQN) rows on a separate connection, and its late write clobbered the correct post-commit updateEntity write.

Fix: make the propagation in-line, mirroring glossary-term rename. Re-index the glossary and every nested child term from the DB-authoritative getAllTerms list via updateEntity (drained synchronously on the request thread post-commit = read-your-write), plus one synchronous updateGlossaryTermByFqnPrefix for tagged assets' tags.tagFQN. Drops the async reindexAcrossIndices calls and the computed-but-unused getGlossaryUsageFromES / targetFQN* bookkeeping.

2. MlModelResourceIT.testAutoPaginationFluentAPI

Listing ml models intermittently 500'd with Entity not found: mlmodelService <id>. This is a read-side TOCTOU: a sibling test's cascade delete removed the parent service between the relationship lookup and the strict getEntityReferenceById(..., NON_DELETED) in batchFetchServices.

Fix: resolve the service reference leniently. Catch EntityNotFoundException and skip the entry (the ml model row is mid-cascade), mirroring the framework's existing getEntityOrNull / resolveInheritanceParentLeniently tolerance. The inheritance path already null-guards a null service, and only EntityNotFoundException is swallowed so real errors still surface.

3. RdfGlossaryGraphIT.scopedResponseCarriesGlossaryNameAndIdPerNode

The test waited only until the term node appeared in RDF, then asserted, outside the wait, that the node carried a populated group / glossaryId. Those project separately, so on a slower run the node was present while group was still null.

Fix: move the group / glossaryId assertions inside the Awaitility block so it polls until the full projection lands. Test-only.

Verification

mvn -pl openmetadata-service spotless:apply compile (green)
mvn -pl openmetadata-integration-tests test-compile (green)

🤖 Generated with Claude Code
Summary by Gitar
Updates SearchIndexRetryWorker to keep transiently failing re-index operations in PENDING_RETRY_2 status indefinitely instead of marking them as failed.
This will update automatically on new commits.
Greptile Summary
Fixes three confirmed flaky integration tests (glossary rename propagation, ml model pagination, RDF glossary graph) and adds a complementary search-retry durability improvement. Each root cause — a racy async reindex, a TOCTOU cascade-delete race, and an assertion outside the poll loop — is addressed directly.
Replaces the fire-and-forget reindexAcrossIndices with synchronous post-commit updateEntitiesByReference (bulk) + updateGlossaryTermByFqnPrefix, both wrapped in deferIfFlushScopeActive so they run after the transaction commits and see the new FQNs.
Catches EntityNotFoundException in batchFetchServices when the parent service is concurrently cascade-deleted, matching the framework's existing lenient-resolve pattern.
Adds retryableNextStatus so transient ES failures (5xx / IO) stay at PENDING_RETRY_2 indefinitely instead of promoting to terminal FAILED, while entity-resolution failures still cap out via the existing nextRetryStatus.

Confidence Score: 5/5
Safe to merge — all three flake fixes are well-targeted, and the retry-durability change is complementary and non-breaking.
Each root cause is addressed at its actual source: the racy async reindex is replaced with a deferred-but-synchronous post-commit drain that reads committed rows, the TOCTOU cascade-delete race in MlModel pagination is guarded by a lenient catch that only swallows EntityNotFoundException, and the RDF assertion race is eliminated by pulling all checks inside the Awaitility block. The SearchIndexRetryWorker change correctly distinguishes transient outages (infinite retry at PENDING_RETRY_2) from unresolvable entities (still caps at FAILED via the unchanged nextRetryStatus path), and claimPending already queries all three PENDING statuses. No existing contracts are broken.
No files require special attention.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant GlossaryRepository
    participant SearchRepository
    participant DB
    participant ES
    Client->>GlossaryRepository: rename glossary
    GlossaryRepository->>DB: update glossary FQN (transaction)
    GlossaryRepository->>DB: cascade child term FQNs (transaction)
    Note over GlossaryRepository: open deferral scope
    GlossaryRepository->>SearchRepository: updateEntity(glossaryRef) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateEntitiesByReference) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateGlossaryTermByFqnPrefix) [deferred]
    GlossaryRepository->>DB: COMMIT
    Note over SearchRepository: post-commit drain
    SearchRepository->>DB: re-fetch glossary doc
    SearchRepository->>ES: update glossary index
    SearchRepository->>DB: getAllTerms re-fetch each child term
    SearchRepository->>ES: bulk update all child term docs
    SearchRepository->>ES: updateGlossaryTermByFqnPrefix(tags.tagFQN)

Reviews (5): Last reviewed commit: "test(testcase): widen logical-suite sear..."