fix(worker-executor): bound component cache size and concurrent compilations#3643
fix(worker-executor): bound component cache size and concurrent compilations#3643kmatasfp wants to merge 5 commits into
Conversation
✅ Deploy Preview for golemcloud canceled.
|
mschuwalow
left a comment
There was a problem hiding this comment.
Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better
| assert!(cache.contains_key(&3).await); | ||
| } | ||
|
|
||
| #[test(flavor = "multi_thread", worker_threads = 4)] |
There was a problem hiding this comment.
test-r does not have such attributes (all test runs in a shared multi-threaded tokio runtime)
We have a compilation service already. The thing this PR limits only runs if the agent instance is created "too early" after deployment. The compilation cache service itself can be infinitely horizontally scaled to reduce this gap. I don't see why the semaphore-based concurrency limitation would not be good enough here |
| let mut eviction_needed = false; | ||
| let result = { | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst); | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::Relaxed); |
There was a problem hiding this comment.
This agent review comment seems valid:
1. get_or_insert_spawned still has the unfixed trigger + clobber-prone behavior — and a real bounded cache uses it.
The PR fixed the count race in get_or_insert and get_or_insert_pending (switched to new_count > capacity + Relaxed), but get_or_insert_spawned was left untouched:
let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst); // line 342
...
let old_count = self_clone.state.count.fetch_add(1, Ordering::SeqCst); // line 374
record_cache_size(self_clone.name, old_count.saturating_add(1));
if Some(old_count) == self_clone.capacity { // line 378 — the exact bug being fixed
eviction_needed = true;
}This path is reached via get_or_insert_simple_spawned, which worker_read_only_cache uses — a capacity-bounded cache (Some(cache_capacity), default 256, LeastRecentlyUsed(1)):
- bounded cache creation: worker/mod.rs:440
- usage: worker/mod.rs:1096
Worse: the PR replaced the eviction's count.store(survivors) with a relative fetch_sub, so count is no longer reset to the true survivor count. Under main's old code, the == capacity trigger could self-heal because every eviction reset the counter; now drift in this path is permanent, so once the read-only cache's count skips past capacity it can stop evicting and grow unbounded — exactly the failure the PR's own test description warns about.
Fix: apply the same change to get_or_insert_spawned (lines 374–378): let new_count = old_count.saturating_add(1); and if self_clone.capacity.is_some_and(|c| new_count > c). Also reconcile the SeqCst/Relaxed decision here. Consider adding a get_or_insert_simple_spawned capacity test, since the new race test only covers the non-spawned path.
| @@ -0,0 +1,147 @@ | |||
| // Copyright 2024-2026 Golem Cloud | |||
| // | |||
| // Licensed under the Golem Source Available License v1.1 (the "License"); | |||
There was a problem hiding this comment.
This header is different than all the others :)
| component_id, | ||
| component_revision, | ||
| reason: format!("{e}"), | ||
| // Bound concurrent download+compile work across the |
There was a problem hiding this comment.
I think we should not limit the downloads (at least not necessarily by the same amount, but by connection count / pool considerations which I think we already have). The explanation (cold-start storm of components exhausting memory) is weak to me because we DO calculate with the component sizes in the memory admission layer.
That said, if I remember correctly we calculate with the wasm size with a constant factor (2?) which is probably too low. We could prefill an optional known compiled size in the registry service back from the compilation cache service and calculate the memory limits with that (not in this PR, of course, just an idea)
There was a problem hiding this comment.
Yeah, I agree the current comment overstates the "download" part. The limiter is meant for the local load/compile fallback: in-process component cache miss + compiled artifact store miss -> raw component download into a full Vec<u8> -> Component::from_binary temporary allocations -> component.serialize() .cwasm buffer -> write to compiled artifact store.
Admission accounts for guest linear memory and a shared component-revision charge from registry component size metadata, but it does not reserve memory for this fallback case, ram for downloaded bytes, compiled bytes etc we hold briefly. With many cold starts for components whose compiled artifacts are missing, we can have many copies of that working set concurrently and run out of memory. This is especially expensive in case of TS agents to have multiple "copies" of the same thing in memory, even if it is briefly, it adds up. And on top of ram it also takes cpu time.
Some background when I was doing density testing I saw host side memory growing very fast and us using more cpu before I added this limiter. Test was creating 100s of unique components.
And the idea of having a better estimate of how much memory each component will take loaded into memory would def help with density and all the limits we have
Ignore my previous comment. I brainfarted and thought this change is in the compilation service (which does use a queue for this (non-durable though)). In the executor this is fine 👍 |
No description provided.