Oximeter: add queue depth metric. #10736

Open
jmcarp wants to merge 2 commits into main from jmcarp/oximeter-queue-depth-metric

@jmcarp jmcarp commented Jul 2, 2026

In #10683, we added a metric counting the number of samples evicted from the oximeter collector's database_batcher queue. Operators can use that metric to detect that samples are being lost, but not that samples are potentially about to be lost. In other words, we have to wait for something to go wrong to take action.

This patch adds a queue depth metric to the collector as well: a histogram recording the current length of the queue each time we push a new batch of samples onto it. Operators can use this metric to detect when the queue depth is approaching its maximum length.
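As a rough illustration of the idea (not the code in this patch), the sketch below records queue depth into a fixed-bucket histogram after each batch push. `DepthHistogram`, `push_batch`, the bucket bounds, and the use of `u64` as a stand-in for a sample are all hypothetical, and it assumes depth is observed after the batch is appended; oximeter's own histogram and queue types are not reproduced here.

```rust
use std::collections::VecDeque;

/// Hypothetical fixed-bucket histogram of queue depths.
/// This is NOT oximeter's histogram type, just a stand-in.
struct DepthHistogram {
    /// Inclusive upper bound of each bucket, in ascending order.
    upper_bounds: Vec<usize>,
    /// One count per bucket, plus a final overflow bucket for
    /// depths larger than every bound.
    counts: Vec<u64>,
}

impl DepthHistogram {
    fn new(upper_bounds: Vec<usize>) -> Self {
        let counts = vec![0; upper_bounds.len() + 1];
        Self { upper_bounds, counts }
    }

    /// Record one observation of the queue depth.
    fn observe(&mut self, depth: usize) {
        // Index of the first bucket whose upper bound is >= depth,
        // or the overflow bucket if there is none.
        let idx = self.upper_bounds.partition_point(|&bound| bound < depth);
        self.counts[idx] += 1;
    }
}

/// Append a batch to the queue, then record the resulting depth.
/// (Assumption: depth is sampled after the push, once per batch.)
fn push_batch(
    queue: &mut VecDeque<u64>,
    batch: Vec<u64>,
    hist: &mut DepthHistogram,
) {
    queue.extend(batch);
    hist.observe(queue.len());
}
```

With bounds chosen relative to the queue's maximum length, counts accumulating in the upper buckets would indicate that the queue is running close to full before any samples are evicted.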

Part of #10552.

Note: this builds on #10683. You can review just the second commit, or review both commits here and ignore #10683.

And I promise I'm almost done thinking about #10552. This is one of the last papercuts before I'll feel like this is sorted.

jmcarp added 2 commits July 2, 2026 10:50
…her.

Oximeter sends samples from all collection tasks to a shared database batcher,
which then inserts samples into ClickHouse. The database batcher uses a bounded
queue that drops old samples when adding a new sample would overflow the queue.
We currently log a warning when dropping an old sample, but operators would
have to proactively check those logs in order to notice data loss via this
queue. To make dropped samples more visible, this patch introduces a new
oximeter metric that counts the number of dropped samples in the database
batcher.
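
The drop-oldest behavior and the counter described above can be sketched roughly as follows. `BoundedQueue` and its methods are hypothetical names for illustration, not the batcher's actual types, and the real queue may evict per batch rather than per item.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded queue that evicts the oldest item on overflow
/// and keeps a cumulative count of evictions.
struct BoundedQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
    /// Total number of items evicted since the queue was created.
    dropped: u64,
}

impl<T> BoundedQueue<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be nonzero");
        Self {
            items: VecDeque::with_capacity(capacity),
            capacity,
            dropped: 0,
        }
    }

    /// Push a new item. If the queue is full, the oldest item is
    /// evicted first, counted, and returned to the caller.
    fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.items.len() == self.capacity {
            self.dropped += 1;
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    fn len(&self) -> usize {
        self.items.len()
    }

    /// Cumulative eviction count; this is the value a "dropped
    /// samples" counter metric would report.
    fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

Because the count is cumulative rather than reset on each report, a late report still reflects every eviction since startup.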

Note: if the batcher isn't able to push samples to the database at all, we
won't be able to record the new metric! However, we write the new metric at the
head of the queue, and dropped sample counts persist for the lifetime of the
oximeter agent, so we'll be able to push metrics unless the queue is wildly
oversaturated.

Part of #10552.

In #10683, we added a metric counting the number of samples evicted from the
oximeter collector's database_batcher queue. Operators can use that metric to
detect that samples are being lost, but not that samples are potentially about
to be lost. In other words, we have to wait for something to go wrong to take
action.

This patch adds a queue depth metric to the collector as well: a histogram
recording the current length of the queue each time we push a new batch of
samples onto it. Operators can use this metric to detect when the queue depth
is approaching its maximum length.

Part of #10552.
@jmcarp jmcarp requested a review from bnaecker July 2, 2026 20:48