Skip to content

Fix flaky ITs: glossary rename propagation, ml model list, RDF glossary graph#29253

Open
harshach wants to merge 7 commits into
mainfrom
harshach/flaky-test-triage
Open

Fix flaky ITs: glossary rename propagation, ml model list, RDF glossary graph#29253
harshach wants to merge 7 commits into
mainfrom
harshach/flaky-test-triage

Conversation

@harshach

@harshach harshach commented Jun 20, 2026

Copy link
Copy Markdown
Collaborator

Fixes three pre-existing flaky integration tests that surfaced across recent PRs. In each case the same commit passed on two of the three integration legs and failed on one (postgres-elasticsearch-redis or mysql-elasticsearch) — a flake signature, not a regression in the PR that hit them.

1. GlossaryResourceIT.test_renameGlossaryPropagatesToChildTermSearchIndex

After a glossary rename, child glossary-term docs stayed at the old glossary name in the search index (60s timeout).

GlossaryRepository.updateAssetIndexes drove the cascade through the fire-and-forget reindexAcrossIndices (getAsyncExecutor().submit). Submitted inside the rename transaction, that async task could read pre-commit (old-FQN) rows on a separate connection, and its late write clobbered the correct post-commit updateEntity write.

Fix: make the propagation in-line, mirroring glossary-term rename — re-index the glossary and every nested child term from the DB-authoritative getAllTerms list via updateEntity (drained synchronously on the request thread post-commit = read-your-write), plus one synchronous updateGlossaryTermByFqnPrefix for tagged assets' tags.tagFQN. Drops the async reindexAcrossIndices calls and the computed-but-unused getGlossaryUsageFromES/targetFQN* bookkeeping.

2. MlModelResourceIT.testAutoPaginationFluentAPI

Listing ml models intermittently 500'd with Entity not found: mlmodelService <id> — a read-side TOCTOU: a sibling test's cascade delete removed the parent service between the relationship lookup and the strict getEntityReferenceById(..., NON_DELETED) in batchFetchServices.

Fix: resolve the service reference leniently — catch EntityNotFoundException and skip the entry (the ml model row is mid-cascade), mirroring the framework's existing getEntityOrNull / resolveInheritanceParentLeniently tolerance. The inheritance path already null-guards a null service, and only EntityNotFoundException is swallowed so real errors still surface.

3. RdfGlossaryGraphIT.scopedResponseCarriesGlossaryNameAndIdPerNode

The test waited only until the term node appeared in RDF, then asserted — outside the wait — that the node carried a populated group/glossaryId. Those project separately, so on a slower run the node was present while group was still null.

Fix: move the group/glossaryId assertions inside the Awaitility block so it polls until the full projection lands. Test-only.

Verification

  • mvn -pl openmetadata-service spotless:apply compile — green
  • mvn -pl openmetadata-integration-tests test-compile — green
  • The ITs themselves require the Testcontainers harness (ES + DB + Fuseki) to run end-to-end.

🤖 Generated with Claude Code


Summary by Gitar

  • Search reliability:
    • Modified SearchIndexRetryWorker to keep transiently failing re-index operations in PENDING_RETRY_2 status indefinitely instead of marking them as failed.
    • This change prevents search index document loss during transient cluster outages by allowing retries to continue until recovery.

This will update automatically on new commits.

Greptile Summary

Fixes three confirmed flaky integration tests (glossary rename propagation, ml model pagination, RDF glossary graph) and adds a complementary search-retry durability improvement. Each root cause — a racy async reindex, a TOCTOU cascade-delete race, and an assertion outside the poll loop — is addressed directly.

  • GlossaryRepository: replaces fire-and-forget reindexAcrossIndices with synchronous post-commit updateEntitiesByReference (bulk) + updateGlossaryTermByFqnPrefix, both wrapped in deferIfFlushScopeActive so they run after the transaction commits and see the new FQNs.
  • MlModelRepository: swallows EntityNotFoundException in batchFetchServices when the parent service is concurrently cascade-deleted, matching the framework's existing lenient-resolve pattern.
  • SearchIndexRetryWorker: introduces retryableNextStatus so transient ES failures (5xx / IO) stay at PENDING_RETRY_2 indefinitely instead of promoting to terminal FAILED, while entity-resolution failures still cap out via the existing nextRetryStatus.

Confidence Score: 5/5

Safe to merge — all three flake fixes are well-targeted, and the retry-durability change is complementary and non-breaking.

Each root cause is addressed at its actual source: the racy async reindex is replaced with a deferred-but-synchronous post-commit drain that reads committed rows, the TOCTOU cascade-delete race in MlModel pagination is guarded by a lenient catch that only swallows EntityNotFoundException, and the RDF assertion race is eliminated by pulling all checks inside the Awaitility block. The SearchIndexRetryWorker change correctly distinguishes transient outages (infinite retry at PENDING_RETRY_2) from unresolvable entities (still caps at FAILED via the unchanged nextRetryStatus path), and claimPending already queries all three PENDING statuses. No existing contracts are broken.

No files require special attention.

Important Files Changed

Filename Overview
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryRepository.java Rewrites updateAssetIndexes to use synchronous, deferred post-commit bulk re-indexing instead of racy fire-and-forget reindexAcrossIndices; fix is logically sound and mirrors the GlossaryTerm rename path.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Adds updateEntitiesByReference, a new bulk-reindex helper that mirrors updateEntity(EntityReference) but collapses N individual ES round-trips into one bulk request; EntityNotFoundException on concurrent deletes is correctly swallowed.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexRetryWorker.java Splits retry escalation into two policies: retryableNextStatus (stays at PENDING_RETRY_2 for IO/5xx outages) vs nextRetryStatus (still caps at FAILED for entity-resolution failures); distinction is intentional and correct.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/MlModelRepository.java resolveServiceRefLeniently catches EntityNotFoundException for concurrent cascade-deletes; only EntityNotFoundException is swallowed so real errors still propagate.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/RdfGlossaryGraphIT.java Moves group/glossaryId assertions inside the Awaitility block so the poll covers the full projection, not just node presence; correct fix for the flake.
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/TestCaseResourceIT.java Doubles the Awaitility timeout from 30s to 60s for two search-index membership checks; straightforward tolerance improvement.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant GlossaryRepository
    participant SearchRepository
    participant DB
    participant ES

    Client->>GlossaryRepository: rename glossary
    GlossaryRepository->>DB: update glossary FQN (transaction)
    GlossaryRepository->>DB: cascade child term FQNs (transaction)
    Note over GlossaryRepository: open deferral scope
    GlossaryRepository->>SearchRepository: updateEntity(glossaryRef) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateEntitiesByReference) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateGlossaryTermByFqnPrefix) [deferred]
    GlossaryRepository->>DB: COMMIT
    Note over SearchRepository: post-commit drain
    SearchRepository->>DB: re-fetch glossary doc
    SearchRepository->>ES: update glossary index
    SearchRepository->>DB: getAllTerms re-fetch each child term
    SearchRepository->>ES: bulk update all child term docs
    SearchRepository->>ES: updateGlossaryTermByFqnPrefix(tags.tagFQN)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Client
    participant GlossaryRepository
    participant SearchRepository
    participant DB
    participant ES

    Client->>GlossaryRepository: rename glossary
    GlossaryRepository->>DB: update glossary FQN (transaction)
    GlossaryRepository->>DB: cascade child term FQNs (transaction)
    Note over GlossaryRepository: open deferral scope
    GlossaryRepository->>SearchRepository: updateEntity(glossaryRef) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateEntitiesByReference) [deferred]
    GlossaryRepository->>SearchRepository: deferIfFlushScopeActive(updateGlossaryTermByFqnPrefix) [deferred]
    GlossaryRepository->>DB: COMMIT
    Note over SearchRepository: post-commit drain
    SearchRepository->>DB: re-fetch glossary doc
    SearchRepository->>ES: update glossary index
    SearchRepository->>DB: getAllTerms re-fetch each child term
    SearchRepository->>ES: bulk update all child term docs
    SearchRepository->>ES: updateGlossaryTermByFqnPrefix(tags.tagFQN)
Loading

Reviews (5): Last reviewed commit: "test(testcase): widen logical-suite sear..." | Re-trigger Greptile

harshach and others added 3 commits June 20, 2026 12:42
Glossary rename left child glossary-term docs stuck at the old glossary
name in the search index (flaky test_renameGlossaryPropagatesToChildTerm
SearchIndex). updateAssetIndexes drove the cascade through the
fire-and-forget reindexAcrossIndices (getAsyncExecutor().submit), whose
task raced the rename commit, read pre-commit rows on a separate
connection, and its late write clobbered the correct post-commit write.

Rewrite updateAssetIndexes to run in-line: re-index the glossary and every
nested child term from the DB-authoritative getAllTerms list via
updateEntity (drained synchronously on the request thread post-commit =
read-your-write), plus one synchronous updateGlossaryTermByFqnPrefix for
tagged assets' tags.tagFQN — the same mechanism glossary-term rename uses.
Drops the async reindexAcrossIndices calls and the computed-but-unused
getGlossaryUsageFromES/targetFQN bookkeeping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Listing ml models 500'd with "Entity not found: mlmodelService <id>"
(flaky MlModelResourceIT.testAutoPaginationFluentAPI) when a sibling
test's cascade delete removed the parent service between the relationship
lookup and the strict getEntityReferenceById(...NON_DELETED) in
batchFetchServices.

Resolve the service reference leniently: catch EntityNotFoundException and
skip the entry (the ml model row is mid-cascade), mirroring the framework's
existing getEntityOrNull / resolveInheritanceParentLeniently tolerance. The
inheritance path already null-guards a null service, and only
EntityNotFoundException is swallowed so real errors still surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scopedResponseCarriesGlossaryNameAndIdPerNode waited only until the term
node appeared in RDF, then asserted (outside the wait) that the node
carried a populated group/glossaryId. Those project separately, so on a
slower run the node was present while group was still null, failing the
assertion. Move the group/glossaryId assertions inside the Awaitility
block so it polls until the full projection lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

  • No GitHub issue is linked. Link an issue in the Development section of the PR (or add Fixes #12345 to the description). For a same-org cross-repo issue, add Fixes open-metadata/<repo>#123 to the description.

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 20, 2026
@harshach harshach added the skip-pr-checks Bypass PR metadata validation check label Jun 20, 2026
… loop

Review feedback (greptile P2): the per-term updateEntity loop serialized
N DB reads + N ES writes onto the request thread for large glossaries.

Add SearchRepository.updateEntitiesByReference(refs): re-reads each entity
with the same bounded reindex field-set updateEntity uses, then issues one
bulk index update. Glossary rename now re-indexes its child terms through
it (deferred post-commit), collapsing N ES round-trips into a single bulk
request while keeping the boundary-safe rebuild-from-DB (getAllTerms over
fixed-width fqnHash segments) that also refreshes the glossary denorm — a
single ES prefix-rewrite can't, and would over-match sibling glossaries
sharing a name prefix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review feedback: updateEntitiesByReference re-read every reference with
get(..., NON_DELETED) up front, so one child term concurrently deleted
during a glossary rename threw EntityNotFoundException and aborted the
whole bulk before updateEntitiesIndex ran — leaving none of the sibling
docs re-indexed at the new name. The earlier per-term loop degraded
gracefully (independent deferred writes); the bulk collapse lost that
fault isolation.

Guard each re-read with a per-reference catch of EntityNotFoundException
(only that type, so real errors still surface), mirroring the lenient ML
model service resolution in this PR. A deleted child is simply skipped —
its own delete cascade removes its document — and the rest still index.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tter

Nightly LiveIndexRetryIT.liveIndexEnqueuesRetriesDuringEsOutageAndDrainsAfter
flaked: after a prolonged ES outage the table was never re-indexed
(expected 1, got 0). Root cause: SearchIndexRetryWorker escalated *every*
failure — including transient ones (cluster timeout / 5xx / IO) — via
PENDING → RETRY_1 → RETRY_2 → FAILED. During an outage the bulk-flush
timeouts burned the 3-attempt budget before the cluster recovered, moving
the row to terminal FAILED. claimPending never re-claims FAILED, so the
write was abandoned permanently — and the test's pendingRetryCount() reads
0 (FAILED isn't pending) while the entity stays unindexed.

Only non-retryable (4xx document) errors should dead-letter. Cap retryable
failures at PENDING_RETRY_2 so claimPending keeps re-claiming the row and it
is re-indexed once the cluster is back. Fixes the flake and the underlying
resilience gap (a transient outage no longer drops writes from the index).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_putPreservesLogicalSuiteSearchMembership flaked on the loaded
postgres-elasticsearch-redis leg: the test-case PUT description took
longer than 30s to surface in the search index under that leg's async
propagation load. Both search-membership awaits in the method get 60s,
matching the slower-leg headroom other search-propagation tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gitar-bot

gitar-bot Bot commented Jun 21, 2026

Copy link
Copy Markdown
Code Review ✅ Approved 2 resolved / 2 findings

Resolves flaky integration tests by replacing racy async reindexing with synchronous post-commit operations and adding lenient handling for concurrent entity deletions. Also improves search reliability by preventing premature dead-lettering of transient reindexing failures.

✅ 2 resolved
Edge Case: Bulk child-term reindex aborts all terms if one is concurrently deleted

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:1566-1580 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryRepository.java:580-588
updateEntitiesByReference builds the entity list by calling entityRepository.get(null, id, fields) for every reference up front, then bulk-indexes them. get(...) defaults to Include.NON_DELETED (EntityRepository.java) and throws EntityNotFoundException for a soft-deleted/missing row. Because the re-read loop has no per-reference guard, a single child term that is concurrently deleted during the glossary rename aborts the whole loop before updateEntitiesIndex runs — so NONE of the sibling child-term docs get re-indexed at the new glossary name.

This is a regression versus the previous commit's per-term loop (for (child) updateEntity(child.getEntityReference())), where each term was an independent deferred write: one failing term only affected that term, the rest still indexed.

Compounding this, the deferred closure in updateAssetIndexes is registered with locator entityId = updated.getId() (the glossary's id) and entityType = GLOSSARY_TERM. On drain failure, enqueueSearchWriteRetry enqueues (glossaryId, newFqn, GLOSSARY_TERM) to SearchIndexRetryQueue — a glossary-term that does not exist — so the retry can never rebuild the actual child-term documents, leaving them stuck at the old glossary name. This re-opens the exact staleness this PR set out to fix, in the (PR-relevant) concurrent-delete scenario.

Suggested fix: make updateEntitiesByReference tolerant of individually missing references (mirroring the lenient EntityNotFoundException handling added for ML models in this same PR) so one deleted child does not abort indexing of the rest.

Edge Case: Retryable failures never dead-letter, risking infinite retry of poison records

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexRetryWorker.java:199-213 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexRetryWorker.java:593-602 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexRetryWorker.java:677-679
In handleProcessingError, retryable failures now use retryableNextStatus, which caps at PENDING_RETRY_2 and never escalates to STATUS_FAILED. Combined with isRetryable, this creates an unbounded retry loop for genuinely-unprocessable records.

isRetryable (lines 593-602) returns true for any IOException, and crucially returns true as a catch-all whenever no search status code can be extracted (status <= 0return true). So any unexpected exception thrown during reindexEntityCascade (e.g., an NPE, a serialization/mapping bug, or any non-Elasticsearch/OpenSearch error) is classified retryable. With the new logic such a record stays at PENDING_RETRY_2 forever, is re-claimed by claimPending every poll (~5s), and updateFailureAndRetryCount keeps incrementing its retry count without bound.

This is intentional and correct for a true cluster outage (gated by waitForClientAvailability), but for a poison record where the cluster is healthy and the failure is deterministic, it means: (1) a permanent busy-loop consuming a worker slot, (2) the underlying bug is masked rather than surfaced via a terminal FAILED state and non_retryable/alerting metrics, and (3) the row can never be cleared.

Consider keeping the outage-tolerant behavior only for clearly-transient infrastructure failures (status codes via isRetryableStatusCode, IOException, client-unavailable), while letting truly unknown/deterministic exceptions escalate to FAILED after a bounded count — or add an absolute attempt/age ceiling on the retryable path so poison records eventually dead-letter.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud

Copy link
Copy Markdown

@github-actions

github-actions Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (10 flaky)

✅ 4310 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 301 0 1 4
🟡 Shard 2 813 0 2 9
🟡 Shard 3 815 0 1 8
🟡 Shard 4 857 0 2 12
🟡 Shard 5 732 0 1 47
🟡 Shard 6 792 0 3 8
🟡 10 flaky test(s) (passed on retry)
  • Pages/Roles.spec.ts › Roles page should work properly (shard 1, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/Glossary/GlossaryRemoveOperations.spec.ts › should add and remove reviewer from glossary term (shard 2, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show online status badge on user profile for active users (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Enum (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs skip-pr-checks Bypass PR metadata validation check

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant