Draft
Changes from 6 commits
2 changes: 2 additions & 0 deletions .github/workflows/test.yml
@@ -26,6 +26,7 @@ jobs:
unit-test:
if: github.repository_owner == 'getsentry'
runs-on: ${{ matrix.os }}
+ timeout-minutes: 30
strategy:
matrix:
os: [ubuntu-24.04, ubuntu-24.04-arm]
@@ -40,6 +41,7 @@ jobs:
integration-test:
if: github.repository_owner == 'getsentry'
runs-on: ${{ matrix.os }}
+ timeout-minutes: 30
strategy:
fail-fast: false
matrix:
3 changes: 3 additions & 0 deletions action.yaml
@@ -278,3 +278,6 @@ runs:
echo "::group::Inspect failure - docker compose logs"
docker compose logs
echo "::endgroup::"
+ echo "::group::Inspect failure - docker stats"
+ docker stats --no-stream
+ echo "::endgroup::"
4 changes: 2 additions & 2 deletions docker-compose.yml
@@ -202,8 +202,8 @@ services:
KAFKA_TOOLS_LOG4J_LOGLEVEL: "WARN"
ulimits:
nofile:
- soft: 4096
- hard: 4096
+ soft: 100000
+ hard: 100000
volumes:
- "sentry-kafka:/var/lib/kafka/data"
- "sentry-kafka-log:/var/lib/kafka/log"
12 changes: 11 additions & 1 deletion sentry/sentry.conf.example.py
@@ -221,7 +221,17 @@ def get_internal_network():
DEFAULT_KAFKA_OPTIONS = {
"bootstrap.servers": "kafka:9092",
"message.max.bytes": 50000000,
- "socket.timeout.ms": 1000,
+ "socket.timeout.ms": 10000, # Timeout for individual socket operations (send/recv)
+ "request.timeout.ms": 30000, # Max time to wait for a broker response before failing

Collaborator

From logs:

  uptime-results-1                         | %4|1778036634.157|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 30400 ms without a successful response from the group coordinator (broker 1001, last error was Success): revoking assignment and rejoining group

Searching for "Consumer group session timed out" shows a couple more Kafka timeouts after 30 s as well. Maybe we should increase the Kafka timeout even further?

Collaborator Author

@aminvakil holy sh*t you're right

Collaborator

To be honest, I didn't catch this myself; I've also been using AI to review stuff for a while now :)

Collaborator Author

But the problem persists. I have no idea what to blame other than GitHub's arm64 runners.

Collaborator

Yeah, they are too slow; this happens to me in other projects as well.

I searched for "Consumer group session timed out" in the latest run:

  process-spans-1                            | %4|1779437870.346|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 31691 ms without a successful response from the group coordinator (broker -1, last error was Success): revoking assignment and rejoining group

It seems they still have a 30-second timeout.
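
To make the search above repeatable, here is a minimal Python sketch that pulls the elapsed time out of each `SESSTMOUT` line. The function name `find_session_timeouts` and the constant names are hypothetical; the two sample lines are the ones quoted in this thread, and the parsing assumes the `<service> | <message>` prefix that `docker compose logs` produced there.

```python
import re

# Matches the session-timeout message quoted in this thread and captures
# the elapsed time in milliseconds.
SESSION_TIMEOUT_RE = re.compile(
    r"Consumer group session timed out \(in join-state [^)]*\) "
    r"after (?P<elapsed_ms>\d+) ms"
)


def find_session_timeouts(log_text):
    """Return a list of (service, elapsed_ms) for each session-timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = SESSION_TIMEOUT_RE.search(line)
        if match is None:
            continue
        # Assumption: each line starts with "<service> | ", as in the logs above.
        service = line.split("|", 1)[0].strip()
        hits.append((service, int(match.group("elapsed_ms"))))
    return hits


# The two log lines quoted in this review thread.
SAMPLE_LOG = "\n".join([
    "  uptime-results-1                         | %4|1778036634.157|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 30400 ms without a successful response from the group coordinator (broker 1001, last error was Success): revoking assignment and rejoining group",
    "  process-spans-1                            | %4|1779437870.346|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 31691 ms without a successful response from the group coordinator (broker -1, last error was Success): revoking assignment and rejoining group",
])

print(find_session_timeouts(SAMPLE_LOG))
# [('uptime-results-1', 30400), ('process-spans-1', 31691)]
```

Both values sit just above 30000 ms, which is consistent with a 30-second session timeout still being in effect.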

Collaborator Author

This is according to AI:

Root cause summary:

  1. Silent config drop (Bug 1): KAFKA_CLUSTERS["default"] = DEFAULT_KAFKA_OPTIONS uses the legacy flat format. get_kafka_consumer_cluster_options() detects this and, because it's called with only_bootstrap=True, extracts only bootstrap.servers. All other settings, including session.timeout.ms and heartbeat.interval.ms, are silently discarded.
  2. HACK override (Bug 2): Even if the settings survived, build_consumer_config in consumers/__init__.py unconditionally overrides session.timeout.ms to match max_poll_interval_ms when that value is < 45000ms. Since --max-poll-interval-ms defaults to 30000 in run.py, session.timeout.ms is force-set to 30s, which is exactly what the ~31.7s timeout error reflects.

Recommended fix (no code change required): Pass --max-poll-interval-ms 300000 (or any value ≥ 45000) to the sentry run consumer process-spans command in the self-hosted docker-compose.yml. This disables the HACK and allows the session timeout to stay at the safe default of 45s.
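
The two behaviours described in that summary can be sketched in a few lines of Python. This is only an illustration of the summary as written, not Sentry's actual code: the function names `extract_cluster_options` and `effective_session_timeout_ms` are hypothetical, and the 45000 ms threshold is taken from the summary rather than verified against the source.

```python
# Threshold and default quoted in the summary above (assumption, not verified).
SESSION_TIMEOUT_DEFAULT_MS = 45_000


def extract_cluster_options(cluster_options, only_bootstrap):
    """Hypothetical stand-in for Bug 1: with only_bootstrap=True, every
    setting except bootstrap.servers is dropped from the flat-format dict."""
    if only_bootstrap:
        return {"bootstrap.servers": cluster_options["bootstrap.servers"]}
    return dict(cluster_options)


def effective_session_timeout_ms(max_poll_interval_ms):
    """Hypothetical stand-in for Bug 2: the session timeout is forced down to
    max_poll_interval_ms whenever that value is below the 45 s threshold."""
    if max_poll_interval_ms < SESSION_TIMEOUT_DEFAULT_MS:
        return max_poll_interval_ms
    return SESSION_TIMEOUT_DEFAULT_MS


flat_options = {
    "bootstrap.servers": "kafka:9092",
    "session.timeout.ms": 60000,
    "heartbeat.interval.ms": 20000,
}

print(extract_cluster_options(flat_options, only_bootstrap=True))
# {'bootstrap.servers': 'kafka:9092'}
print(effective_session_timeout_ms(30_000))   # 30000
print(effective_session_timeout_ms(300_000))  # 45000
```

If the summary is accurate, this would explain why raising session.timeout.ms in DEFAULT_KAFKA_OPTIONS has no visible effect while --max-poll-interval-ms stays at 30000.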

Let me try to mess things up again

Collaborator

Seems like it wasn't fixed, right?

Collaborator Author

It wasn't. But the first try works. I wonder if it's the cache, but I could be wrong.

+ "retries": 5, # Number of retries for transient/retriable request failures
+ "retry.backoff.ms": 1000, # Wait time between retry attempts
+ "reconnect.backoff.ms": 1000, # Initial wait before reconnecting after a lost connection
+ "reconnect.backoff.max.ms": 10000, # Upper bound for exponential backoff on reconnect attempts
+ # Session & heartbeat settings must satisfy:
+ # heartbeat.interval.ms < session.timeout.ms < max.poll.interval.ms
+ "session.timeout.ms": 60000, # Grace period before the broker evicts an unresponsive consumer (default: 45s)
+ "heartbeat.interval.ms": 20000, # How often the consumer sends a heartbeat; should be at most 1/3 of session.timeout.ms
+ "max.poll.interval.ms": 600000, # Max allowed time between poll() calls before the consumer is considered dead
}

SENTRY_EVENTSTREAM = "sentry.eventstream.kafka.KafkaEventStream"
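
The ordering constraint stated in the comments of this hunk can be checked mechanically. The following is a minimal sketch, assuming the options are held in a plain dict keyed like DEFAULT_KAFKA_OPTIONS; the helper name `check_consumer_timeouts` is hypothetical and not part of Sentry or of any Kafka client.

```python
def check_consumer_timeouts(options):
    """Return a list of human-readable problems with the timeout settings.

    Checks the rule from the config comments:
    heartbeat.interval.ms < session.timeout.ms < max.poll.interval.ms,
    plus the heartbeat being at most one third of the session timeout.
    """
    heartbeat = options["heartbeat.interval.ms"]
    session = options["session.timeout.ms"]
    max_poll = options["max.poll.interval.ms"]

    problems = []
    if not heartbeat < session:
        problems.append("heartbeat.interval.ms must be below session.timeout.ms")
    if not session < max_poll:
        problems.append("session.timeout.ms must be below max.poll.interval.ms")
    if heartbeat * 3 > session:
        problems.append("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms")
    return problems


# Values proposed in this diff.
proposed = {
    "session.timeout.ms": 60000,
    "heartbeat.interval.ms": 20000,
    "max.poll.interval.ms": 600000,
}

print(check_consumer_timeouts(proposed))  # []
```

The values proposed in the diff pass all three checks, with the heartbeat sitting exactly at one third of the session timeout.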