Skip to content

fix(worker-executor): bound component cache size and concurrent compilations#3643

Open
kmatasfp wants to merge 5 commits into
mainfrom
port/component-compilation-limits
Open

fix(worker-executor): bound component cache size and concurrent compilations#3643
kmatasfp wants to merge 5 commits into
mainfrom
port/component-compilation-limits

Conversation

@kmatasfp

Copy link
Copy Markdown
Contributor

No description provided.

@kmatasfp kmatasfp requested a review from a team June 17, 2026 07:54
@netlify

netlify Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deploy Preview for golemcloud canceled.

Name Link
🔨 Latest commit c83588c
🔍 Latest deploy log https://app.netlify.com/projects/golemcloud/deploys/6a34da2fc3e2a50008b2c751

@mschuwalow mschuwalow left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better

Comment thread golem-common/src/cache.rs Outdated
assert!(cache.contains_key(&3).await);
}

#[test(flavor = "multi_thread", worker_threads = 4)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test-r does not have such attributes (all test runs in a shared multi-threaded tokio runtime)

@vigoo

vigoo commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better

We have a compilation service already. The thing this PR limits only runs if the agent instance is created "too early" after deployment. The compilation cache service itself can be infinitely horizontally scaled to reduce this gap.

I don't see why the semaphore-based concurrency limitation would not be good enough here

Comment thread golem-common/src/cache.rs
let mut eviction_needed = false;
let result = {
let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst);
let own_id = self.state.last_id.fetch_add(1, Ordering::Relaxed);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This agent review comment seems valid:

1. get_or_insert_spawned still has the unfixed trigger + clobber-prone behavior — and a real bounded cache uses it.
The PR fixed the count race in get_or_insert and get_or_insert_pending (switched to new_count > capacity + Relaxed), but get_or_insert_spawned was left untouched:

let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst);          // line 342
...
let old_count = self_clone.state.count.fetch_add(1, Ordering::SeqCst);   // line 374
record_cache_size(self_clone.name, old_count.saturating_add(1));
if Some(old_count) == self_clone.capacity {                              // line 378 — the exact bug being fixed
    eviction_needed = true;
}

This path is reached via get_or_insert_simple_spawned, which worker_read_only_cache uses — a capacity-bounded cache (Some(cache_capacity), default 256, LeastRecentlyUsed(1)):

  • bounded cache creation: worker/mod.rs:440
  • usage: worker/mod.rs:1096

Worse: the PR replaced the eviction's count.store(survivors) with a relative fetch_sub, so count is no longer reset to the true survivor count. Under main's old code, the == capacity trigger could self-heal because every eviction reset the counter; now drift in this path is permanent, so once the read-only cache's count skips past capacity it can stop evicting and grow unbounded — exactly the failure the PR's own test description warns about.

Fix: apply the same change to get_or_insert_spawned (lines 374–378): let new_count = old_count.saturating_add(1); and if self_clone.capacity.is_some_and(|c| new_count > c). Also reconcile the SeqCst/Relaxed decision here. Consider adding a get_or_insert_simple_spawned capacity test, since the new race test only covers the non-spawned path.

@@ -0,0 +1,147 @@
// Copyright 2024-2026 Golem Cloud
//
// Licensed under the Golem Source Available License v1.1 (the "License");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This header is different than all the others :)

component_id,
component_revision,
reason: format!("{e}"),
// Bound concurrent download+compile work across the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not limit the downloads (at least not necessarily by the same amount, but by connection count / pool considerations which I think we already have). The explanation (cold-start storm of components exhausting memory) is weak to me because we DO calculate with the component sizes in the memory admission layer.

That said, if I remember correctly we calculate with the wasm size with a constant factor (2?) which is probably too low. We could prefill an optional known compiled size in the registry service back from the compilation cache service and calculate the memory limits with that (not in this PR, of course, just an idea)

@kmatasfp kmatasfp Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree the current comment overstates the "download" part. The limiter is meant for the local load/compile fallback: in-process component cache miss + compiled artifact store miss -> raw component download into a full Vec<u8> -> Component::from_binary temporary allocations -> component.serialize() .cwasm buffer -> write to compiled artifact store.

Admission accounts for guest linear memory and a shared component-revision charge from registry component size metadata, but it does not reserve memory for this fallback case, ram for downloaded bytes, compiled bytes etc we hold briefly. With many cold starts for components whose compiled artifacts are missing, we can have many copies of that working set concurrently and run out of memory. This is especially expensive in case of TS agents to have multiple "copies" of the same thing in memory, even if it is briefly, it adds up. And on top of ram it also takes cpu time.

Some background when I was doing density testing I saw host side memory growing very fast and us using more cpu before I added this limiter. Test was creating 100s of unique components.

And the idea of having a better estimate of how much memory each component will take loaded into memory would def help with density and all the limits we have

@mschuwalow

Copy link
Copy Markdown
Contributor

Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better

We have a compilation service already. The thing this PR limits only runs if the agent instance is created "too early" after deployment. The compilation cache service itself can be infinitely horizontally scaled to reduce this gap.

I don't see why the semaphore-based concurrency limitation would not be good enough here

Ignore my previous comment. I brainfarted and thought this change is in the compilation service (which does use a queue for this (non-durable though)). In the executor this is fine 👍

@kmatasfp kmatasfp requested a review from vigoo June 19, 2026 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants