fix(cases): async case duration sync #2781
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ee692687f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
✅ No security or compliance issues detected. Reviewed everything up to e8d25a0.
1 issue found across 18 files
Confidence score: 3/5
- There is a concrete reliability risk in `tracecat/cases/durations/consumer.py`: failed/unacked jobs are only reclaimed on idle reads, so retries can be starved indefinitely when the stream stays busy.
- Given the medium severity (6/10) with fairly high confidence (8/10) and direct user-facing impact on retry behavior, this carries some merge risk rather than being a minor housekeeping issue.
- Pay close attention to `tracecat/cases/durations/consumer.py`: reclaim logic tied to idle reads may prevent timely recovery of failed/unacked jobs under sustained load.
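
One way to avoid the starvation described above is to reclaim pending jobs on a timer instead of only when a read comes back empty. The sketch below is a minimal illustration using an in-memory stand-in for the stream; `FakeStream`, `poll_once`, and `reclaim_interval` are hypothetical names, not taken from the PR. With Redis Streams the reclaim step would roughly correspond to a command such as `XAUTOCLAIM`, but that mapping is an assumption about the consumer's design.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FakeStream:
    """In-memory stand-in for a stream with new and pending (unacked) entries."""

    new: deque[str] = field(default_factory=deque)
    pending: deque[str] = field(default_factory=deque)

    def read_new(self, count: int) -> list[str]:
        return [self.new.popleft() for _ in range(min(count, len(self.new)))]

    def claim_pending(self, count: int) -> list[str]:
        return [self.pending.popleft() for _ in range(min(count, len(self.pending)))]


def poll_once(stream, state, *, now, reclaim_interval=5.0, count=10):
    """Return the jobs to process in this loop iteration.

    Reclaim is driven by elapsed time, so a stream that never goes idle
    cannot postpone retries forever.
    """
    jobs: list[str] = []
    if now - state["last_reclaim"] >= reclaim_interval:
        jobs.extend(stream.claim_pending(count))
        state["last_reclaim"] = now
    jobs.extend(stream.read_new(count))
    return jobs
```

Under this assumption a failed job waits at most about one `reclaim_interval` before it is retried, regardless of how busy the stream is.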
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19974b485d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b21e7f525d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8d25a0a4b
    if self.event_type is CaseEventType.CASE_VIEWED:
        raise ValueError("case_viewed cannot be used as a duration anchor")

Remove case_viewed from the duration picker
This new rejection makes the existing duration picker option unusable: I checked frontend/src/components/cases/case-duration-options.ts, and it still includes value: "case_viewed". In that scenario users can select “Case viewed” when creating or updating a duration definition, but the API now rejects the request with a validation error; either remove/migrate that option (and any existing definitions) or keep the anchor supported.
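
A minimal sketch of how the validator and the picker could share one rule, so the two cannot drift apart. It is written in Python for illustration only; the real picker lives in the TypeScript frontend, and `AUDIT_ONLY`, `DurationAnchor`, and `picker_options` are hypothetical names. The enum lists just two event types as an example.

```python
from dataclasses import dataclass
from enum import Enum


class CaseEventType(str, Enum):
    # Illustrative subset only; the real enum has more members.
    CASE_VIEWED = "case_viewed"
    STATUS_CHANGED = "status_changed"


# Event types that are audit-only and must never anchor a duration.
AUDIT_ONLY = frozenset({CaseEventType.CASE_VIEWED})


@dataclass(frozen=True)
class DurationAnchor:
    event_type: CaseEventType

    def __post_init__(self) -> None:
        # Same rejection as in the reviewed snippet, driven by the shared set.
        if self.event_type in AUDIT_ONLY:
            raise ValueError(
                f"{self.event_type.value} cannot be used as a duration anchor"
            )


def picker_options() -> list[str]:
    """Values a picker could offer, derived from the same rule as the validator."""
    return [e.value for e in CaseEventType if e not in AUDIT_ONLY]
```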
1 issue found across 5 files (changes from recent commits).
`tracecat/cases/durations/service.py`, line 190:

    @@ -167,6 +165,31 @@ async def delete_definition(self, duration_id: uuid.UUID) -> None:
    +        )
    +        result = await self.session.execute(stmt)
    +        duration_service = CaseDurationService(session=self.session, role=self.role)
    +        async for case_id in cooperative_every(result.scalars().all(), every=8):
    +            await duration_service.sync_case_durations(case_id)

P2: Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using `.all()` to avoid O(N) memory spikes on large workspaces.
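
The batching suggested above could look roughly like the following keyset-paged iterator, which holds at most one page of IDs in memory. This is a sketch under assumptions: `iter_case_ids`, `make_fake_fetch`, and `batch_size` are hypothetical, and the fake fetch function stands in for a query of the form `WHERE id > :cursor ORDER BY id LIMIT :n`, not the PR's actual statement.

```python
import asyncio
import uuid
from collections.abc import AsyncIterator, Awaitable, Callable
from typing import Optional

FetchPage = Callable[[Optional[uuid.UUID], int], Awaitable[list[uuid.UUID]]]


async def iter_case_ids(
    fetch_page: FetchPage, batch_size: int = 500
) -> AsyncIterator[uuid.UUID]:
    """Yield case IDs page by page, using the last seen ID as the cursor."""
    cursor: Optional[uuid.UUID] = None
    while True:
        page = await fetch_page(cursor, batch_size)
        if not page:
            return
        for case_id in page:
            yield case_id
        cursor = page[-1]  # next page starts strictly after this ID


def make_fake_fetch(all_ids: list[uuid.UUID]) -> FetchPage:
    """In-memory stand-in for an ordered, limited database query."""
    ordered = sorted(all_ids)

    async def fetch_page(cursor: Optional[uuid.UUID], limit: int) -> list[uuid.UUID]:
        rows = [i for i in ordered if cursor is None or i > cursor]
        return rows[:limit]

    return fetch_page


async def collect(all_ids: list[uuid.UUID], batch_size: int) -> list[uuid.UUID]:
    return [i async for i in iter_case_ids(make_fake_fetch(all_ids), batch_size)]
```

In the reviewed loop, `iter_case_ids(...)` would take the place of `result.scalars().all()`; whether it composes cleanly with `cooperative_every` depends on that helper's signature, which is not shown here.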
Found 1 test failure on Blacksmith runners.
Summary
Fixes ENG-1462: https://linear.app/tracecat/issue/ENG-1462/investigate-slow-case-loads-from-synchronous-duration-sync
This removes case-duration recomputation from case read paths and moves normal mutation-driven duration materialization into an after-commit, coalesced Redis stream consumer.
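
To illustrate what coalescing means here, the sketch below collapses a batch of queued jobs to one recompute per `(workspace_id, case_id)` and drops event types that cannot affect durations. `SyncJob`, `IRRELEVANT`, and `coalesce` are hypothetical names, and the set of irrelevant event types is an assumption beyond `case_viewed`, which the PR treats as audit-only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncJob:
    workspace_id: str
    case_id: str
    event_type: str


# Assumed set of event types that never change a duration.
IRRELEVANT = frozenset({"case_viewed"})


def coalesce(jobs: list[SyncJob]) -> list[tuple[str, str]]:
    """Return one (workspace_id, case_id) key per case, in first-seen order."""
    seen: dict[tuple[str, str], None] = {}
    for job in jobs:
        if job.event_type in IRRELEVANT:
            continue  # audit-only events do not trigger a recompute
        seen.setdefault((job.workspace_id, job.case_id), None)
    return list(seen)
```

A burst of mutations on one hot case then costs a single recompute per batch instead of one per event.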
Changes:
- Make `GET /cases/{case_id}/durations` read-only; it now lists materialized rows without syncing first.
- Add `duration_sync="async" | "inline" | "none"` to case event creation.
- Treat `CASE_VIEWED` as audit-only for durations and reject `case_viewed` as a duration anchor.
- [...] `(workspace_id, case_id)`, skips irrelevant event types, and uses per-case PG advisory locks.
- [...] `TRACECAT__CASE_DURATION_SYNC_ENABLED=false`, so the flag is a safe async-worker kill switch.

Motivation
Opening a case could previously trigger writes from GET-time paths:
- [...] `CASE_VIEWED` event,
- [...] `sync_case_durations()`, and
- `GET /durations` also synced durations before listing.

When workflows mutated the same case concurrently, those synchronous recomputes amplified write contention around `case_duration` and held request transactions open. Other non-case pages remained snappy because they did not hit this write-on-read path.

Benchmarks
Hot-case profile for before/after comparison:
- `1` case
- `40` duration definitions
- `300` history events
- `4` mutators x `8` mutations
- `12` case loads
- `3` baseline case loads
- `10 ms` load interval

Hot-case old-path runs
- `113.8 ms`, max `159.0 ms` | `184.0 ms`, p95 `222.2 ms`, max `222.2 ms` | `291.5 ms`, p95 `928.3 ms`, max `942.2 ms` | `70`, ungranted `3`, case_duration ungranted `0`
- `138.0 ms`, max `170.5 ms` | `122.6 ms`, p95 `320.3 ms`, max `320.3 ms` | `170.5 ms`, p95 `209.5 ms`, max `237.9 ms`
- `131.0 ms`, max `189.3 ms` | `124.7 ms`, p95 `292.5 ms`, max `292.5 ms` | `155.2 ms`, p95 `179.6 ms`, max `206.0 ms`

Hot-case new async-worker runs
- `55.9 ms`, p95 `71.2 ms`, max `71.2 ms` | `121.2 ms`, p95 `181.0 ms`, max `181.0 ms` | `118.6 ms`, p95 `155.3 ms`, max `189.0 ms` | `0`
- `68.6 ms`, p95 `91.7 ms`, max `91.7 ms` | `124.6 ms`, p95 `235.2 ms`, max `235.2 ms` | `117.7 ms`, p95 `217.9 ms`, max `221.9 ms` | `0`
- `71.4 ms`, p95 `118.3 ms`, max `118.3 ms` | `127.9 ms`, p95 `154.0 ms`, max `154.0 ms` | `118.1 ms`, p95 `149.9 ms`, max `150.6 ms` | `0`

Main signal: the old hot update path had mutation p95 around `928.3 ms`; the current PR commit is `149.9 ms` in the same reduced hot-case profile.

Existing burst/health benchmark on new implementation
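Profile:
- `20` cases
- `80` definitions
- `600` history events per case
- `1` update per case
- `50 ms` health interval
- `1000 ms` health timeout

- `315.2 ms` | `322.4 ms` | `322.5 ms` | `20`
- `2.0 ms` | `2.3 ms` | `2.3 ms` | `4`
- `5.3 ms` | `142.3 ms` | `142.3 ms` | `14`
- `1.6 ms` | `2.3 ms` | `2.3 ms` | `6`
- `2.9 ms` | `3.4 ms` | `3.4 ms` | `4`
- `47.5 ms` | `193.7 ms` | `193.7 ms` | `14`
- `2.7 ms` | `3.4 ms` | `3.4 ms` | `6`

Other values: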
- `0.674 s`
- `0`, burst `0`, cooldown `0`
- `0`

Verification
- `uv run pytest tests/unit/test_case_events_service.py tests/unit/test_cases_service.py tests/unit/test_case_duration_service.py tests/unit/test_case_duration_router.py tests/unit/test_case_duration_sync_consumer.py`: `107 passed`
- `uv run ruff check ...`
- `uv run ruff format --check ...`
- `uv run basedpyright ...`: `0 errors, 0 warnings, 0 notes`
- `TRACECAT_RUN_CASE_DURATION_BENCHMARKS=1 ... uv run pytest tests/integration/test_case_duration_benchmarks.py -k hot_case -s`

Summary by cubic
Moves case-duration recomputation off reads and event writes to an async Redis-stream consumer to fix ENG-1462 slow case loads. The durations endpoint is now read-only, and the worker reads backlog on startup for reliable sync.
New Features
- [...] `duration_sync="async" | "inline" | "none"` (create-case uses inline). Definition create/update enqueue cursor-paged backfills; when the worker is disabled, they backfill inline to preserve behavior. Service role `tracecat-case-duration-sync` added.
- Treat `CASE_VIEWED` as audit-only; reject `case_viewed` as a duration anchor.
- `GET /cases/{id}/durations` lists materialized rows only.

Bug Fixes
- [...] `case_closed`, `case_reopened` map to `status_changed`).

Written for commit e8d25a0. Summary will update on new commits.
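
The three `duration_sync` modes and the inline fallback for definition backfills, as described in the summaries, can be sketched as a small dispatcher. This is an illustration under assumptions: the function names, the `enqueue` and `sync_inline` callbacks, and the returned labels are hypothetical, and the per-case enqueue simplifies the cursor-paged backfill jobs the PR mentions. How the `"async"` mode behaves for ordinary events when the worker is disabled is not stated in the source, so the sketch does not model it.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import Literal

DurationSync = Literal["async", "inline", "none"]
Action = Callable[[str], Awaitable[None]]


async def handle_event_durations(
    case_id: str, mode: DurationSync, *, enqueue: Action, sync_inline: Action
) -> str:
    """Dispatch on the duration_sync mode and report what was done."""
    if mode == "async":
        await enqueue(case_id)  # hand off to the stream consumer
        return "enqueued"
    if mode == "inline":
        await sync_inline(case_id)  # recompute inside the request
        return "synced"
    if mode == "none":
        return "skipped"
    raise ValueError(f"unknown duration_sync mode: {mode!r}")


async def backfill_after_definition_change(
    case_ids: Iterable[str],
    *,
    worker_enabled: bool,
    enqueue: Action,
    sync_inline: Action,
) -> str:
    """Enqueue when the worker is enabled; otherwise backfill inline."""
    action = enqueue if worker_enabled else sync_inline
    for case_id in case_ids:
        await action(case_id)
    return "enqueued" if worker_enabled else "synced"
```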