fix(cases): async case duration sync by daryllimyt · Pull Request #2781 · TracecatHQ/tracecat

daryllimyt · 2026-05-29T20:26:55Z

Summary

Fixes ENG-1462: https://linear.app/tracecat/issue/ENG-1462/investigate-slow-case-loads-from-synchronous-duration-sync

This removes case-duration recomputation from case read paths and moves normal mutation-driven duration materialization into an after-commit, coalesced Redis stream consumer.
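The coalescing idea in the paragraph above can be illustrated with a minimal sketch. It is not the PR's consumer code: `SyncJob`, `coalesce`, and the event-type strings other than `case_viewed` are hypothetical names chosen for illustration, and the real set of skipped event types is not shown in this description.

```python
from dataclasses import dataclass

# Assumption: only "case_viewed" is known from the PR text to be audit-only.
IRRELEVANT_EVENT_TYPES = frozenset({"case_viewed"})


@dataclass(frozen=True)
class SyncJob:
    """Hypothetical shape of one queued duration-sync request."""

    workspace_id: str
    case_id: str
    event_type: str


def coalesce(jobs: list[SyncJob]) -> list[tuple[str, str]]:
    """Return one (workspace_id, case_id) key per case, in first-seen order."""
    seen: dict[tuple[str, str], None] = {}
    for job in jobs:
        if job.event_type in IRRELEVANT_EVENT_TYPES:
            continue  # audit-only events never trigger a recompute
        seen.setdefault((job.workspace_id, job.case_id), None)
    return list(seen)
```

With this shape, a burst of many mutations on one hot case collapses into a single recompute for that case per batch, which is the property the benchmarks below exercise.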

Changes:

Make GET /cases/{case_id}/durations read-only; it now lists materialized rows without syncing first.
Add duration_sync="async" | "inline" | "none" to case event creation.
Keep create-case duration materialization inline so new cases have rows immediately.
Treat CASE_VIEWED as audit-only for durations and reject case_viewed as a duration anchor.
Enqueue mutation-driven case duration sync after commit.
Add a dedicated Redis stream consumer that coalesces by (workspace_id, case_id), skips irrelevant event types, and uses per-case PG advisory locks.
Fall back to inline sync when TRACECAT__CASE_DURATION_SYNC_ENABLED=false, so the flag is a safe async-worker kill switch.
Keep failed coalesced case jobs pending instead of letting transient sync errors stop the consumer task.
Generate updated frontend client types for the new service role and endpoint description.
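As a rough sketch of how the `duration_sync` modes and the kill switch listed above might interact: the helper name `effective_duration_sync` and its `worker_enabled` parameter are assumptions made for illustration; how the PR actually reads `TRACECAT__CASE_DURATION_SYNC_ENABLED` is not shown here.

```python
from typing import Literal

DurationSync = Literal["async", "inline", "none"]


def effective_duration_sync(
    requested: DurationSync, worker_enabled: bool
) -> DurationSync:
    """Resolve the mode actually used for one event write.

    When the async worker is disabled, an "async" request falls back to
    "inline" so durations are still materialized. "inline" and "none"
    are returned unchanged.
    """
    if requested == "async" and not worker_enabled:
        return "inline"
    return requested
```

Under this reading, turning the flag off never drops a sync; it only moves the work back onto the request path.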

Motivation

Opening a case could previously trigger writes from GET-time paths:

case detail view tracking created a CASE_VIEWED event,
event creation synchronously called sync_case_durations(), and
GET /durations also synced durations before listing.

When workflows mutated the same case concurrently, those synchronous recomputes amplified write contention around case_duration and held request transactions open. Other non-case pages remained snappy because they did not hit this write-on-read path.

Benchmarks

Hot-case profile for before/after comparison:

1 case
40 duration definitions
300 history events
4 mutators x 8 mutations
12 case loads
3 baseline case loads
10 ms load interval

Hot-case old-path runs

Mode	Case-load baseline	Case-load burst	Mutation latency	Notes
update	p50 `113.8 ms`, max `159.0 ms`	p50 `184.0 ms`, p95 `222.2 ms`, max `222.2 ms`	p50 `291.5 ms`, p95 `928.3 ms`, max `942.2 ms`	max lock waits `70`, ungranted `3`, case_duration ungranted `0`
event	p50 `138.0 ms`, max `170.5 ms`	p50 `122.6 ms`, p95 `320.3 ms`, max `320.3 ms`	p50 `170.5 ms`, p95 `209.5 ms`, max `237.9 ms`	old-path variant
sync	p50 `131.0 ms`, max `189.3 ms`	p50 `124.7 ms`, p95 `292.5 ms`, max `292.5 ms`	p50 `155.2 ms`, p95 `179.6 ms`, max `206.0 ms`	old-path variant

Hot-case new async-worker runs

Run	Case-load baseline	Case-load burst	Mutation latency	Errors
worker-inclusive, isolated stream/db	p50 `55.9 ms`, p95 `71.2 ms`, max `71.2 ms`	p50 `121.2 ms`, p95 `181.0 ms`, max `181.0 ms`	p50 `118.6 ms`, p95 `155.3 ms`, max `189.0 ms`	`0`
simplified implementation	p50 `68.6 ms`, p95 `91.7 ms`, max `91.7 ms`	p50 `124.6 ms`, p95 `235.2 ms`, max `235.2 ms`	p50 `117.7 ms`, p95 `217.9 ms`, max `221.9 ms`	`0`
current PR commit	p50 `71.4 ms`, p95 `118.3 ms`, max `118.3 ms`	p50 `127.9 ms`, p95 `154.0 ms`, max `154.0 ms`	p50 `118.1 ms`, p95 `149.9 ms`, max `150.6 ms`	`0`

Main signal: the old hot update path had mutation p95 around 928.3 ms; the current PR commit is 149.9 ms in the same reduced hot-case profile.
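For reference, the ratio between those two p95 values, worked out from the tables above (the variable names are only for illustration):

```python
old_p95_ms = 928.3  # old hot update path, mutation p95
new_p95_ms = 149.9  # current PR commit, mutation p95

speedup = old_p95_ms / new_p95_ms
reduction_pct = (1 - new_p95_ms / old_p95_ms) * 100

# Prints: 6.2x lower, 84% reduction
print(f"{speedup:.1f}x lower, {reduction_pct:.0f}% reduction")
```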

Existing burst/health benchmark on new implementation

Profile:

20 cases
80 definitions
600 history events per case
1 update per case
50 ms health interval
1000 ms health timeout

Metric	p50	p95	max	Samples
update latency	`315.2 ms`	`322.4 ms`	`322.5 ms`	`20`
health baseline	`2.0 ms`	`2.3 ms`	`2.3 ms`	`4`
health burst	`5.3 ms`	`142.3 ms`	`142.3 ms`	`14`
health cooldown	`1.6 ms`	`2.3 ms`	`2.3 ms`	`6`
loop lag baseline	`2.9 ms`	`3.4 ms`	`3.4 ms`	`4`
loop lag burst	`47.5 ms`	`193.7 ms`	`193.7 ms`	`14`
loop lag cooldown	`2.7 ms`	`3.4 ms`	`3.4 ms`	`6`

Other values:

burst elapsed: 0.674 s
health errors: baseline 0, burst 0, cooldown 0
update errors: 0

Verification

uv run pytest tests/unit/test_case_events_service.py tests/unit/test_cases_service.py tests/unit/test_case_duration_service.py tests/unit/test_case_duration_router.py tests/unit/test_case_duration_sync_consumer.py
- 107 passed
uv run ruff check ...
- passed
uv run ruff format --check ...
- passed
uv run basedpyright ...
- 0 errors, 0 warnings, 0 notes
Pre-commit hooks during commit:
- ruff check/format passed
- generated frontend client passed after installing frontend dependencies
- python type check passed
- frontend biome check passed
- frontend type check passed
Current PR hot-case benchmark:
- TRACECAT_RUN_CASE_DURATION_BENCHMARKS=1 ... uv run pytest tests/integration/test_case_duration_benchmarks.py -k hot_case -s
- passed

Summary by cubic

Moves case-duration recomputation off reads and event writes to an async Redis-stream consumer to fix ENG-1462 slow case loads. The durations endpoint is now read-only, and the worker reads backlog on startup for reliable sync.

New Features
- Async duration materialization via Redis stream: coalesces per case, runs as an API background task, and has a kill switch that falls back to inline. Event creation supports duration_sync="async" | "inline" | "none" (create-case uses inline). Definition create/update enqueue cursor-paged backfills; when the worker is disabled, they backfill inline to preserve behavior. Service role tracecat-case-duration-sync added.
- Treat CASE_VIEWED as audit-only; reject case_viewed as a duration anchor. GET /cases/{id}/durations lists materialized rows only.
Bug Fixes
- Reliability: use transaction-scoped per-case PG advisory locks, reclaim and process idle/pending jobs, read stream backlog on group creation, keep locked/failed jobs pending, and force unconditional sync when backfill coalesces with case events.
- Correctly match status-change aliases when deciding if a sync is needed (e.g., case_closed, case_reopened map to status_changed).
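
The alias matching described in the last bullet could look roughly like the sketch below. The two alias entries come from the bullet itself; the function names and the idea of normalizing both sides of the comparison are assumptions, not the PR's implementation.

```python
# Alias -> canonical event type, per the two examples named above.
STATUS_ALIASES = {
    "case_closed": "status_changed",
    "case_reopened": "status_changed",
}


def canonical_event_type(event_type: str) -> str:
    """Map a status-change alias to its canonical type; pass others through."""
    return STATUS_ALIASES.get(event_type, event_type)


def needs_sync(event_type: str, anchor_event_types: set[str]) -> bool:
    """True when the event matches some duration anchor after normalization."""
    anchors = {canonical_event_type(a) for a in anchor_event_types}
    return canonical_event_type(event_type) in anchors
```

Normalizing both the incoming event and the configured anchors means a definition anchored on `case_closed` still triggers a sync when the stored event type is `status_changed`, and vice versa.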

Written for commit e8d25a0. Summary will update on new commits.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ee692687f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

zeropath-ai · 2026-05-29T20:34:19Z

✅ No security or compliance issues detected. Reviewed everything up to e8d25a0.

Security Overview

🔎 Scanned files: 19 changed file(s)
🔗 Scan Link: https://zeropath.com/app/repositories/00dffd6c-8834-4dc9-b6d8-b44cd1622986?scanId=06dfd3cc-3129-4c19-940a-412f1a3e65dd&codeScanTypes=PrScan&tab=issues

Detected Code Changes

Change type: Enhancement

► frontend/src/client/schemas.gen.ts
- Add 'tracecat-case-duration-sync' to Role type

► frontend/src/client/services.gen.ts
- Update description for list_case_durations

► frontend/src/client/types.gen.ts
- Add 'tracecat-case-duration-sync' to service_id type

► tests/integration/test_case_duration_benchmarks.py
- Add new environment variables for hot case benchmarks
- Implement _sync_initial_case_durations helper function
- Implement _load_case_page_once helper function
- Implement _load_case_page_repeatedly helper function
- Implement _run_hot_case_update_burst helper function
- Add new benchmark test case: test_hot_case_load_latency_during_async_duration_mutation_burst

► tests/unit/test_case_duration_router.py
- Add unit test for list_case_durations being read-only

► tests/unit/test_case_duration_service.py
- Add unit tests for backfill enqueueing after definition creation/update
- Add unit tests for inline syncing when async duration sync is disabled
- Add test case for duration anchor rejecting CASE_VIEWED event

► tracecat/api/app.py
- Add background task for case duration sync consumer

► tracecat/cases/durations/consumer.py
- Implement CaseDurationSyncConsumer for async job processing
- Implement _ensure_group to create Redis stream group
- Implement _handle_entries to process incoming jobs
- Implement _sync_case_duration to handle individual case syncs
- Implement _process_backfill_job for backfill jobs
- Implement _claim_idle_messages for job recovery
- Add utility functions for job parsing and error handling
- Add start_case_duration_sync_consumer function

► tests/unit/test_case_events_service.py
- Add unit tests for event creation enqueuing duration sync by default
- Add unit tests for event creation allowing inline duration sync
- Add unit tests for event creation syncing inline when async duration sync is disabled

► tests/unit/test_cases_service.py
- Add stub for enqueue_case_duration_sync_after_commit and publish_case_event_payload

Change type: Refactor

► tests/integration/test_case_duration_benchmarks.py
- Refactor benchmark configuration and helper functions

► tests/unit/test_case_duration_service.py
- Refactor tests to use new stubbed functions

► tests/unit/test_case_events_service.py
- Refactor stubbed functions

Change type: Configuration changes

► tests/integration/test_case_duration_benchmarks.py
- Add TRACECAT_CASE_DURATION_BENCHMARK_HOT_CASE_* environment variables

cubic-dev-ai

1 issue found across 18 files

Confidence score: 3/5

There is a concrete reliability risk in tracecat/cases/durations/consumer.py: failed/unacked jobs are only reclaimed on idle reads, so retries can be starved indefinitely when the stream stays busy.
Given the medium severity (6/10) with fairly high confidence (8/10) and direct user-facing impact on retry behavior, this carries some merge risk rather than being a minor housekeeping issue.
Pay close attention to tracecat/cases/durations/consumer.py - reclaim logic tied to idle reads may prevent timely recovery of failed/unacked jobs under sustained load.
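
One way to avoid the starvation described above is to schedule reclaim on elapsed time rather than on whether the last read came back empty. The sketch below shows only that scheduling piece; `ReclaimTimer` is a hypothetical helper, and how the later commit "fix(cases): reclaim duration sync retries" actually addresses the issue is not shown in this page.

```python
import time
from typing import Callable


class ReclaimTimer:
    """Tracks when pending-job reclaim is next due, independent of read results."""

    def __init__(
        self, interval_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._interval_s = interval_s
        self._clock = clock
        self._next_due = clock()  # due immediately on startup

    def due(self) -> bool:
        """Return True at most once per interval, then re-arm the timer."""
        now = self._clock()
        if now < self._next_due:
            return False
        self._next_due = now + self._interval_s
        return True
```

A consumer loop would call `due()` on every iteration, busy or idle, and run its reclaim step (for example the `_claim_idle_messages` method named in the file summary) whenever it returns True.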


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19974b485d


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b21e7f525d


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8d25a0a4b


chatgpt-codex-connector · 2026-06-18T20:47:28Z

+        if self.event_type is CaseEventType.CASE_VIEWED:
+            raise ValueError("case_viewed cannot be used as a duration anchor")


Remove case_viewed from the duration picker

This new rejection makes the existing duration picker option unusable: I checked frontend/src/components/cases/case-duration-options.ts, and it still includes value: "case_viewed". In that scenario users can select “Case viewed” when creating or updating a duration definition, but the API now rejects the request with a validation error; either remove/migrate that option (and any existing definitions) or keep the anchor supported.


cubic-dev-ai

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tracecat/cases/durations/service.py">

<violation number="1" location="tracecat/cases/durations/service.py:190">
P2: Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using `.all()` to avoid O(N) memory spikes on large workspaces.</violation>
</file>


cubic-dev-ai · 2026-06-18T21:07:59Z

+        )
+        result = await self.session.execute(stmt)
+        duration_service = CaseDurationService(session=self.session, role=self.role)
+        async for case_id in cooperative_every(result.scalars().all(), every=8):


P2: Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using .all() to avoid O(N) memory spikes on large workspaces.
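
The batching suggested in this comment can be sketched with keyset (cursor) pagination. This is an illustration under assumptions, not the service code: `fetch_page` is a hypothetical callback standing in for a query that returns up to `limit` case IDs greater than `after_id` in ascending order.

```python
from collections.abc import Callable, Iterator
from typing import Optional

# Hypothetical callback: (after_id, limit) -> next page of IDs, ascending.
FetchPage = Callable[[Optional[str], int], list[str]]


def iter_case_id_batches(
    fetch_page: FetchPage, batch_size: int
) -> Iterator[list[str]]:
    """Yield case IDs in bounded batches instead of loading them all at once."""
    after_id: Optional[str] = None
    while True:
        page = fetch_page(after_id, batch_size)
        if not page:
            return
        yield page
        after_id = page[-1]  # cursor for the next page
```

Memory then stays proportional to `batch_size` rather than to the number of cases in the workspace, at the cost of one query per batch.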

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At tracecat/cases/durations/service.py, line 190:

<comment>Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using `.all()` to avoid O(N) memory spikes on large workspaces.</comment>

<file context>
@@ -167,6 +165,31 @@ async def delete_definition(self, duration_id: uuid.UUID) -> None:
+        )
+        result = await self.session.execute(stmt)
+        duration_service = CaseDurationService(session=self.session, role=self.role)
+        async for case_id in cooperative_every(result.scalars().all(), every=8):
+            await duration_service.sync_case_durations(case_id)
+
</file context>

blacksmith-sh · 2026-06-18T21:13:06Z

Found 1 test failure on Blacksmith runners:

Failure

Test: `test_config/test_config_boolean_env_values_use_env_bool`

fix(cases): async case duration sync

8ee6926

daryllimyt added the tests, fix, performance, and cases labels May 29, 2026

chatgpt-codex-connector Bot reviewed May 29, 2026

Comment thread tracecat/cases/durations/consumer.py Outdated

cubic-dev-ai Bot reviewed May 29, 2026

Comment thread tracecat/cases/durations/consumer.py Outdated

daryllimyt added 2 commits June 1, 2026 14:03

fix(cases): match status duration aliases

95f5e75

fix(cases): reclaim duration sync retries

19974b4

chatgpt-codex-connector Bot reviewed Jun 1, 2026

Comment thread tracecat/cases/durations/consumer.py Outdated

Comment thread tracecat/cases/durations/consumer.py Outdated

daryllimyt added 2 commits June 18, 2026 16:07

fix(cases): preserve duration backfill sync

4d7437b

fix(cases): read duration sync backlog

b21e7f5

chatgpt-codex-connector Bot reviewed Jun 18, 2026

Comment thread tracecat/cases/durations/consumer.py Outdated

Comment thread tracecat/cases/durations/service.py Outdated

daryllimyt added 2 commits June 18, 2026 16:40

fix(cases): use transaction duration sync locks

b25d6b6

fix(cases): backfill durations without worker

e8d25a0

chatgpt-codex-connector Bot reviewed Jun 18, 2026

cubic-dev-ai Bot reviewed Jun 18, 2026
