fix(cases): async case duration sync#2781

Open. daryllimyt wants to merge 7 commits into main from daryl/eng-1462-async-case-duration-sync

@daryllimyt daryllimyt commented May 29, 2026


Summary

Fixes ENG-1462: https://linear.app/tracecat/issue/ENG-1462/investigate-slow-case-loads-from-synchronous-duration-sync

This removes case-duration recomputation from case read paths and moves normal mutation-driven duration materialization into an after-commit, coalesced Redis stream consumer.

Changes:

  • Make GET /cases/{case_id}/durations read-only; it now lists materialized rows without syncing first.
  • Add duration_sync="async" | "inline" | "none" to case event creation.
  • Keep create-case duration materialization inline so new cases have rows immediately.
  • Treat CASE_VIEWED as audit-only for durations and reject case_viewed as a duration anchor.
  • Enqueue mutation-driven case duration sync after commit.
  • Add a dedicated Redis stream consumer that coalesces by (workspace_id, case_id), skips irrelevant event types, and uses per-case PG advisory locks.
  • Fall back to inline sync when TRACECAT__CASE_DURATION_SYNC_ENABLED=false, so the flag is a safe async-worker kill switch.
  • Keep failed coalesced case jobs pending instead of letting transient sync errors stop the consumer task.
  • Generate updated frontend client types for the new service role and endpoint description.
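A minimal sketch of how the `duration_sync` mode and the kill switch could interact, assuming the flag is read from the environment as a boolean. The helper names `parse_enabled` and `resolve_duration_sync` and the accepted flag spellings are hypothetical; only the mode values and the `TRACECAT__CASE_DURATION_SYNC_ENABLED` variable come from this PR.

```python
import os
from typing import Literal, Optional

DurationSyncMode = Literal["async", "inline", "none"]


def parse_enabled(raw: Optional[str], default: bool = True) -> bool:
    """Parse a boolean env value; the accepted spellings are an assumption."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_duration_sync(
    requested: DurationSyncMode, *, worker_enabled: bool
) -> DurationSyncMode:
    """Return the mode that should actually run for one event."""
    if requested == "async" and not worker_enabled:
        # Kill switch: with the worker disabled, async requests run inline
        # so durations are still materialized.
        return "inline"
    # "inline" and "none" are honored regardless of the flag.
    return requested


worker_enabled = parse_enabled(os.environ.get("TRACECAT__CASE_DURATION_SYNC_ENABLED"))
effective_mode = resolve_duration_sync("async", worker_enabled=worker_enabled)
```

In this sketch only the `async` mode is affected by the flag, which matches the description of the flag as a kill switch for the async worker rather than for duration sync as a whole.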

Motivation

Opening a case could previously trigger writes from GET-time paths:

  • case detail view tracking created a CASE_VIEWED event,
  • event creation synchronously called sync_case_durations(), and
  • GET /durations also synced durations before listing.

When workflows mutated the same case concurrently, those synchronous recomputes amplified write contention around case_duration and held request transactions open. Other non-case pages remained snappy because they did not hit this write-on-read path.

Benchmarks

Hot-case profile for before/after comparison:

  • 1 case
  • 40 duration definitions
  • 300 history events
  • 4 mutators x 8 mutations
  • 12 case loads
  • 3 baseline case loads
  • 10 ms load interval

Hot-case old-path runs

| Mode | Case-load baseline | Case-load burst | Mutation latency | Notes |
| --- | --- | --- | --- | --- |
| update | p50 113.8 ms, max 159.0 ms | p50 184.0 ms, p95 222.2 ms, max 222.2 ms | p50 291.5 ms, p95 928.3 ms, max 942.2 ms | max lock waits 70, ungranted 3, case_duration ungranted 0 |
| event | p50 138.0 ms, max 170.5 ms | p50 122.6 ms, p95 320.3 ms, max 320.3 ms | p50 170.5 ms, p95 209.5 ms, max 237.9 ms | old-path variant |
| sync | p50 131.0 ms, max 189.3 ms | p50 124.7 ms, p95 292.5 ms, max 292.5 ms | p50 155.2 ms, p95 179.6 ms, max 206.0 ms | old-path variant |

Hot-case new async-worker runs

| Run | Case-load baseline | Case-load burst | Mutation latency | Errors |
| --- | --- | --- | --- | --- |
| worker-inclusive, isolated stream/db | p50 55.9 ms, p95 71.2 ms, max 71.2 ms | p50 121.2 ms, p95 181.0 ms, max 181.0 ms | p50 118.6 ms, p95 155.3 ms, max 189.0 ms | 0 |
| simplified implementation | p50 68.6 ms, p95 91.7 ms, max 91.7 ms | p50 124.6 ms, p95 235.2 ms, max 235.2 ms | p50 117.7 ms, p95 217.9 ms, max 221.9 ms | 0 |
| current PR commit | p50 71.4 ms, p95 118.3 ms, max 118.3 ms | p50 127.9 ms, p95 154.0 ms, max 154.0 ms | p50 118.1 ms, p95 149.9 ms, max 150.6 ms | 0 |

Main signal: the old hot update path had mutation p95 around 928.3 ms; the current PR commit is 149.9 ms in the same reduced hot-case profile.
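For scale, the two mutation p95 figures quoted above work out as follows. This is plain arithmetic on the reported numbers, not a new measurement, and each figure comes from a single benchmark run.

```python
old_p95_ms = 928.3  # old hot update path, mutation p95
new_p95_ms = 149.9  # current PR commit, mutation p95

speedup = old_p95_ms / new_p95_ms
reduction_pct = (1 - new_p95_ms / old_p95_ms) * 100

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% lower p95")
# prints "6.2x faster, 83.9% lower p95"
```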

Existing burst/health benchmark on new implementation

Profile:

  • 20 cases
  • 80 definitions
  • 600 history events per case
  • 1 update per case
  • 50 ms health interval
  • 1000 ms health timeout

| Metric | p50 | p95 | max | Samples |
| --- | --- | --- | --- | --- |
| update latency | 315.2 ms | 322.4 ms | 322.5 ms | 20 |
| health baseline | 2.0 ms | 2.3 ms | 2.3 ms | 4 |
| health burst | 5.3 ms | 142.3 ms | 142.3 ms | 14 |
| health cooldown | 1.6 ms | 2.3 ms | 2.3 ms | 6 |
| loop lag baseline | 2.9 ms | 3.4 ms | 3.4 ms | 4 |
| loop lag burst | 47.5 ms | 193.7 ms | 193.7 ms | 14 |
| loop lag cooldown | 2.7 ms | 3.4 ms | 3.4 ms | 6 |

Other values:

  • burst elapsed: 0.674 s
  • health errors: baseline 0, burst 0, cooldown 0
  • update errors: 0

Verification

  • uv run pytest tests/unit/test_case_events_service.py tests/unit/test_cases_service.py tests/unit/test_case_duration_service.py tests/unit/test_case_duration_router.py tests/unit/test_case_duration_sync_consumer.py
    • 107 passed
  • uv run ruff check ...
    • passed
  • uv run ruff format --check ...
    • passed
  • uv run basedpyright ...
    • 0 errors, 0 warnings, 0 notes
  • Pre-commit hooks during commit:
    • ruff check/format passed
    • generated frontend client passed after installing frontend dependencies
    • python type check passed
    • frontend biome check passed
    • frontend type check passed
  • Current PR hot-case benchmark:
    • TRACECAT_RUN_CASE_DURATION_BENCHMARKS=1 ... uv run pytest tests/integration/test_case_duration_benchmarks.py -k hot_case -s
    • passed

Summary by cubic

Moves case-duration recomputation off reads and event writes to an async Redis-stream consumer to fix ENG-1462 slow case loads. The durations endpoint is now read-only, and the worker reads backlog on startup for reliable sync.

  • New Features

    • Async duration materialization via Redis stream: coalesces per case, runs as an API background task, and has a kill switch that falls back to inline. Event creation supports duration_sync="async" | "inline" | "none" (create-case uses inline). Definition create/update enqueue cursor-paged backfills; when the worker is disabled, they backfill inline to preserve behavior. Service role tracecat-case-duration-sync added.
    • Treat CASE_VIEWED as audit-only; reject case_viewed as a duration anchor. GET /cases/{id}/durations lists materialized rows only.
  • Bug Fixes

    • Reliability: use transaction-scoped per-case PG advisory locks, reclaim and process idle/pending jobs, read stream backlog on group creation, keep locked/failed jobs pending, and force unconditional sync when backfill coalesces with case events.
    • Correctly match status-change aliases when deciding if a sync is needed (e.g., case_closed, case_reopened map to status_changed).
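The coalescing and alias matching described in this summary can be sketched roughly as below. This is an illustration under assumptions, not the consumer in `tracecat/cases/durations/consumer.py`: the `Job` shape, the `coalesce` helper, and the alias table are hypothetical, and only the two aliases named above are taken from the PR.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical alias table; only these two aliases are named in the PR.
STATUS_ALIASES = {
    "case_closed": "status_changed",
    "case_reopened": "status_changed",
}


@dataclass(frozen=True)
class Job:
    workspace_id: str
    case_id: str
    event_type: str


def normalize(event_type: str) -> str:
    """Map an alias to its canonical event type; other types pass through."""
    return STATUS_ALIASES.get(event_type, event_type)


def coalesce(jobs: Iterable[Job], relevant: set[str]) -> list[tuple[str, str]]:
    """Return one (workspace_id, case_id) key per case that needs a sync.

    Jobs whose normalized event type is not in `relevant` are skipped, and
    repeated jobs for the same case collapse into a single key (first-seen
    order is preserved because dicts keep insertion order).
    """
    keys: dict[tuple[str, str], None] = {}
    for job in jobs:
        if normalize(job.event_type) not in relevant:
            continue
        keys.setdefault((job.workspace_id, job.case_id), None)
    return list(keys)


batch = [
    Job("w1", "c1", "case_closed"),
    Job("w1", "c1", "status_changed"),
    Job("w1", "c2", "case_viewed"),
    Job("w1", "c3", "case_reopened"),
]
to_sync = coalesce(batch, relevant={"status_changed"})
```

Here `c1` is synced once even though two jobs arrived for it, `c2` is skipped because `case_viewed` is audit-only, and `c3` is kept because `case_reopened` normalizes to `status_changed`.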

Written for commit e8d25a0. Summary will update on new commits.

daryllimyt added labels on May 29, 2026: tests (changes to unit and integration tests), fix (bug fix), performance (changes that improve performance), cases (case management improvements and changes)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ee692687f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tracecat/cases/durations/consumer.py Outdated
zeropath-ai Bot commented May 29, 2026

No security or compliance issues detected. Reviewed everything up to e8d25a0.

Security Overview

Detected Code Changes

Enhancement

  • frontend/src/client/schemas.gen.ts
    • Add 'tracecat-case-duration-sync' to Role type
  • frontend/src/client/services.gen.ts
    • Update description for list_case_durations
  • frontend/src/client/types.gen.ts
    • Add 'tracecat-case-duration-sync' to service_id type
  • tests/integration/test_case_duration_benchmarks.py
    • Add new environment variables for hot case benchmarks
    • Implement _sync_initial_case_durations helper function
    • Implement _load_case_page_once helper function
    • Implement _load_case_page_repeatedly helper function
    • Implement _run_hot_case_update_burst helper function
    • Add new benchmark test case: test_hot_case_load_latency_during_async_duration_mutation_burst
  • tests/unit/test_case_duration_router.py
    • Add unit test for list_case_durations being read-only
  • tests/unit/test_case_duration_service.py
    • Add unit tests for backfill enqueueing after definition creation/update
    • Add unit tests for inline syncing when async duration sync is disabled
    • Add test case for duration anchor rejecting CASE_VIEWED event
  • tracecat/api/app.py
    • Add background task for case duration sync consumer
  • tracecat/cases/durations/consumer.py
    • Implement CaseDurationSyncConsumer for async job processing
    • Implement _ensure_group to create Redis stream group
    • Implement _handle_entries to process incoming jobs
    • Implement _sync_case_duration to handle individual case syncs
    • Implement _process_backfill_job for backfill jobs
    • Implement _claim_idle_messages for job recovery
    • Add utility functions for job parsing and error handling
  • tracecat/cases/durations/consumer.py
    • Add start_case_duration_sync_consumer function
  • tests/unit/test_case_events_service.py
    • Add unit tests for event creation enqueuing duration sync by default
    • Add unit tests for event creation allowing inline duration sync
    • Add unit tests for event creation syncing inline when async duration sync is disabled
  • tests/unit/test_cases_service.py
    • Add stub for enqueue_case_duration_sync_after_commit and publish_case_event_payload

Refactor

  • tests/integration/test_case_duration_benchmarks.py
    • Refactor benchmark configuration and helper functions
  • tests/unit/test_case_duration_service.py
    • Refactor tests to use new stubbed functions
  • tests/unit/test_case_events_service.py
    • Refactor stubbed functions

Configuration changes

  • tests/integration/test_case_duration_benchmarks.py
    • Add TRACECAT_CASE_DURATION_BENCHMARK_HOT_CASE_* environment variables

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 18 files

Confidence score: 3/5

  • There is a concrete reliability risk in tracecat/cases/durations/consumer.py: failed/unacked jobs are only reclaimed on idle reads, so retries can be starved indefinitely when the stream stays busy.
  • Given the medium severity (6/10) with fairly high confidence (8/10) and direct user-facing impact on retry behavior, this carries some merge risk rather than being a minor housekeeping issue.
  • Pay close attention to tracecat/cases/durations/consumer.py - reclaim logic tied to idle reads may prevent timely recovery of failed/unacked jobs under sustained load.
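One way to avoid the starvation described here is to gate reclaiming on elapsed time instead of on an empty read. The sketch below simulates that with a hypothetical `ReclaimTimer` and a fake clock; it does not talk to Redis and is not necessarily how this PR resolved the comment. In a real consumer the reclaim step would be whatever call takes over idle pending entries (for Redis streams, typically `XAUTOCLAIM`).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ReclaimTimer:
    """Tracks when pending entries were last reclaimed (hypothetical helper)."""

    interval_s: float
    last_run: float = float("-inf")

    def due(self, now: float) -> bool:
        """Return True, and record the run, once the interval has elapsed."""
        if now - self.last_run >= self.interval_s:
            self.last_run = now
            return True
        return False


def run_loop(
    batches: Sequence[Sequence[str]],
    timer: ReclaimTimer,
    clock: Callable[[int], float],
) -> list[int]:
    """Simulate the consumer loop; return the iterations where a reclaim ran."""
    reclaimed_at: list[int] = []
    for i, batch in enumerate(batches):
        # Reclaim is checked on every iteration, even when the read was not
        # empty, so a busy stream cannot postpone retries indefinitely.
        if timer.due(clock(i)):
            reclaimed_at.append(i)
        for _entry in batch:
            pass  # a real consumer would process and ack each entry here
    return reclaimed_at


# Six consecutive non-empty reads, one second apart, reclaim every 2 seconds.
busy_stream = [["job"]] * 6
runs = run_loop(busy_stream, ReclaimTimer(interval_s=2.0), clock=lambda i: float(i))
```

With these inputs the reclaim step runs on iterations 0, 2, and 4 even though the stream never goes idle.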

Comment thread tracecat/cases/durations/consumer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19974b485d

Comment thread tracecat/cases/durations/consumer.py Outdated
Comment thread tracecat/cases/durations/consumer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b21e7f525d

Comment thread tracecat/cases/durations/consumer.py Outdated
Comment thread tracecat/cases/durations/service.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8d25a0a4b

Comment on lines +92 to +93
if self.event_type is CaseEventType.CASE_VIEWED:
    raise ValueError("case_viewed cannot be used as a duration anchor")

P2 Badge Remove case_viewed from the duration picker

This new rejection makes the existing duration picker option unusable: I checked frontend/src/components/cases/case-duration-options.ts, and it still includes value: "case_viewed". In that scenario users can select “Case viewed” when creating or updating a duration definition, but the API now rejects the request with a validation error; either remove/migrate that option (and any existing definitions) or keep the anchor supported.

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 5 files (changes from recent commits).

)
result = await self.session.execute(stmt)
duration_service = CaseDurationService(session=self.session, role=self.role)
async for case_id in cooperative_every(result.scalars().all(), every=8):

P2: Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using .all() to avoid O(N) memory spikes on large workspaces.
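A rough sketch of the batching this comment asks for, using keyset pagination so that only one page of IDs is held in memory at a time. Everything here is illustrative: `iter_case_ids` and `fetch_page` are hypothetical names, integer IDs stand in for the real case IDs, and the in-memory list stands in for a database query ordered by ID.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional, Sequence

# fetch_page(after, limit) returns up to `limit` IDs greater than `after`,
# in ascending order. It stands in for a paged database query.
FetchPage = Callable[[Optional[int], int], Awaitable[Sequence[int]]]


async def iter_case_ids(
    fetch_page: FetchPage, page_size: int = 100
) -> AsyncIterator[int]:
    """Yield IDs page by page instead of materializing the full result."""
    cursor: Optional[int] = None
    while True:
        page = await fetch_page(cursor, page_size)
        if not page:
            return
        for case_id in page:
            yield case_id
        # Keyset cursor: the next page starts after the last ID seen.
        cursor = page[-1]


async def demo() -> list[int]:
    all_ids = list(range(1, 8))  # stand-in data, already sorted

    async def fetch_page(after: Optional[int], limit: int) -> Sequence[int]:
        remaining = [i for i in all_ids if after is None or i > after]
        return remaining[:limit]

    synced: list[int] = []
    async for case_id in iter_case_ids(fetch_page, page_size=3):
        synced.append(case_id)  # a real backfill would sync this case here
    return synced


synced_ids = asyncio.run(demo())
```

Peak memory in this shape is bounded by `page_size` rather than by the number of cases in the workspace, at the cost of one query per page.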

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tracecat/cases/durations/service.py, line 190:

<comment>Inline backfill loads all case IDs into memory before syncing. Stream or batch IDs instead of using `.all()` to avoid O(N) memory spikes on large workspaces.</comment>

<file context>
@@ -167,6 +165,31 @@ async def delete_definition(self, duration_id: uuid.UUID) -> None:
+        )
+        result = await self.session.execute(stmt)
+        duration_service = CaseDurationService(session=self.session, role=self.role)
+        async for case_id in cooperative_every(result.scalars().all(), every=8):
+            await duration_service.sync_case_durations(case_id)
+
</file context>

blacksmith-sh Bot commented Jun 18, 2026

Found 1 test failure on Blacksmith runners:

  • test_config/test_config_boolean_env_values_use_env_bool
