fix(worker-executor): bound component cache size and concurrent compilations by kmatasfp · Pull Request #3643 · golemcloud/golem

kmatasfp · 2026-06-17T07:54:52Z

No description provided.

netlify · 2026-06-17T07:54:57Z

✅ Deploy Preview for golemcloud canceled.

Name	Link
🔨 Latest commit	`c83588c`
🔍 Latest deploy log	https://app.netlify.com/projects/golemcloud/deploys/6a34da2fc3e2a50008b2c751

mschuwalow

Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better

vigoo · 2026-06-18T14:34:10Z

        assert!(cache.contains_key(&3).await);
    }

+    #[test(flavor = "multi_thread", worker_threads = 4)]


test-r does not have such attributes (all tests run in a shared multi-threaded tokio runtime)

vigoo · 2026-06-18T14:38:35Z

> Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better

We have a compilation service already. The thing this PR limits only runs if the agent instance is created "too early" after deployment. The compilation cache service itself can be infinitely horizontally scaled to reduce this gap.

I don't see why the semaphore-based concurrency limitation would not be good enough here

vigoo · 2026-06-18T14:41:14Z

        let mut eviction_needed = false;
        let result = {
-            let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst);
+            let own_id = self.state.last_id.fetch_add(1, Ordering::Relaxed);


This agent review comment seems valid:

1. get_or_insert_spawned still has the unfixed trigger + clobber-prone behavior — and a real bounded cache uses it.
The PR fixed the count race in get_or_insert and get_or_insert_pending (switched to new_count > capacity + Relaxed), but get_or_insert_spawned was left untouched:

    let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst); // line 342
    ...
    let old_count = self_clone.state.count.fetch_add(1, Ordering::SeqCst); // line 374
    record_cache_size(self_clone.name, old_count.saturating_add(1));
    if Some(old_count) == self_clone.capacity { // line 378: the exact bug being fixed
        eviction_needed = true;
    }

This path is reached via get_or_insert_simple_spawned, which worker_read_only_cache uses — a capacity-bounded cache (Some(cache_capacity), default 256, LeastRecentlyUsed(1)):

bounded cache creation: worker/mod.rs:440

usage: worker/mod.rs:1096

Worse: the PR replaced the eviction's count.store(survivors) with a relative fetch_sub, so count is no longer reset to the true survivor count. Under main's old code, the == capacity trigger could self-heal because every eviction reset the counter; now drift in this path is permanent, so once the read-only cache's count skips past capacity it can stop evicting and grow unbounded — exactly the failure the PR's own test description warns about.

Fix: apply the same change to get_or_insert_spawned (lines 374–378): let new_count = old_count.saturating_add(1); and if self_clone.capacity.is_some_and(|c| new_count > c). Also reconcile the SeqCst/Relaxed decision here. Consider adding a get_or_insert_simple_spawned capacity test, since the new race test only covers the non-spawned path.
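To make the difference between the two triggers concrete, here is a minimal, std-only sketch. It is not the code from the PR: the function names, the bare `AtomicUsize` counter and the capacity of 4 are assumptions for illustration, and eviction itself is not modelled. It only counts how often each trigger would request an eviction once the counter has already drifted past the capacity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Trigger in the style quoted above: fires only when the pre-increment
/// count is exactly equal to the capacity.
fn needs_eviction_exact(count: &AtomicUsize, capacity: Option<usize>) -> bool {
    let old_count = count.fetch_add(1, Ordering::Relaxed);
    Some(old_count) == capacity
}

/// Trigger in the style of the suggested fix: fires whenever the
/// post-increment count exceeds the capacity.
fn needs_eviction_over(count: &AtomicUsize, capacity: Option<usize>) -> bool {
    let old_count = count.fetch_add(1, Ordering::Relaxed);
    let new_count = old_count.saturating_add(1);
    capacity.is_some_and(|c| new_count > c)
}

/// Performs `inserts` counter increments starting from `start` and returns
/// how many of them requested an eviction. No entries are ever removed,
/// so this isolates the trigger condition only.
fn evictions_requested(
    trigger: fn(&AtomicUsize, Option<usize>) -> bool,
    start: usize,
    inserts: usize,
    capacity: Option<usize>,
) -> usize {
    let count = AtomicUsize::new(start);
    (0..inserts).filter(|_| trigger(&count, capacity)).count()
}

fn main() {
    let capacity = Some(4);
    // The counter starts at capacity + 1, i.e. it has already skipped past
    // the value the exact-match trigger is waiting for.
    let exact = evictions_requested(needs_eviction_exact, 5, 10, capacity);
    let over = evictions_requested(needs_eviction_over, 5, 10, capacity);
    println!("exact-match trigger fired {exact} times");
    println!("over-capacity trigger fired {over} times");
}
```

In this model the exact-match trigger never fires again once the counter is above the capacity, while the over-capacity trigger fires on every insert until something brings the count back down.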

vigoo · 2026-06-18T14:43:01Z

@@ -0,0 +1,147 @@
+// Copyright 2024-2026 Golem Cloud
+//
+// Licensed under the Golem Source Available License v1.1 (the "License");


This header is different than all the others :)

vigoo · 2026-06-18T14:47:16Z

-                                        component_id,
-                                        component_revision,
-                                        reason: format!("{e}"),
+                            // Bound concurrent download+compile work across the


I think we should not limit the downloads (at least not necessarily by the same amount, but by connection count / pool considerations, which I think we already have). The explanation (a cold-start storm of components exhausting memory) seems weak to me because we DO account for the component sizes in the memory admission layer.

That said, if I remember correctly we use the wasm size multiplied by a constant factor (2?), which is probably too low. We could have the compilation cache service report an optional known compiled size back to the registry service and calculate the memory limits with that (not in this PR, of course, just an idea)
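A rough sketch of the estimate described in this comment, under stated assumptions: the function name, the constant and the `Option<u64>` parameter are hypothetical, and the factor of 2 is only what the comment recalls, not a confirmed value from the codebase.

```rust
/// Hypothetical multiplier applied to the raw wasm size when no better
/// estimate is available. The value 2 is taken from the reviewer's
/// recollection and may not match the real admission code.
const WASM_SIZE_FACTOR: u64 = 2;

/// Returns the number of bytes to charge for a component: the known
/// compiled size if one was reported, otherwise wasm size times a constant.
fn estimated_component_memory(wasm_size: u64, known_compiled_size: Option<u64>) -> u64 {
    known_compiled_size.unwrap_or_else(|| wasm_size.saturating_mul(WASM_SIZE_FACTOR))
}

fn main() {
    let wasm_size = 10 * 1024 * 1024;
    // Without a reported compiled size the constant factor is used.
    println!("fallback: {}", estimated_component_memory(wasm_size, None));
    // With a reported compiled size (an invented number here) it takes precedence.
    println!(
        "reported: {}",
        estimated_component_memory(wasm_size, Some(75 * 1024 * 1024))
    );
}
```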

kmatasfp · 2026-06-19

Yeah, I agree the current comment overstates the "download" part. The limiter is meant for the local load/compile fallback: in-process component cache miss + compiled artifact store miss -> raw component download into a full Vec<u8> -> Component::from_binary temporary allocations -> component.serialize() .cwasm buffer -> write to compiled artifact store.

Admission accounts for guest linear memory and a shared component-revision charge derived from the registry's component size metadata, but it does not reserve memory for this fallback case: the RAM for the downloaded bytes, the compiled bytes, and other buffers we hold briefly. With many cold starts for components whose compiled artifacts are missing, many copies of that working set can exist concurrently and we can run out of memory. This is especially expensive for TS agents, where holding multiple "copies" of the same thing in memory adds up even if each is short-lived. On top of RAM, it also costs CPU time.

Some background: when I was doing density testing, before I added this limiter, I saw host-side memory growing very fast and CPU usage going up. The test was creating hundreds of unique components.

And the idea of having a better estimate of how much memory each component will take once loaded would definitely help with density and with all the limits we have.
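As an illustration of the kind of limiter discussed here, the following sketch bounds how many simulated load/compile fallbacks run at once. It is not the PR's implementation: it swaps in a small std-only semaphore built from `Mutex` and `Condvar` on plain threads instead of an async semaphore, and all names (`Semaphore`, `with_permit`, `run_cold_starts`) and numbers are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (hypothetical helper, std-only).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    /// `permits` must be greater than zero, otherwise callers block forever.
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then runs `work` while holding it.
    /// Simplification: a panic inside `work` would leak the permit.
    fn with_permit<T>(&self, work: impl FnOnce() -> T) -> T {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        drop(permits); // release the lock while the work runs

        let result = work();

        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
        result
    }
}

/// Runs `jobs` simulated compilations with at most `limit` in flight and
/// returns the highest concurrency that was observed.
fn run_cold_starts(jobs: usize, limit: usize) -> usize {
    let semaphore = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (semaphore, in_flight, peak) =
                (semaphore.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                semaphore.with_permit(|| {
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    // Stand-in for download + compile + serialize.
                    thread::sleep(Duration::from_millis(5));
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                })
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    let peak = run_cold_starts(32, 4);
    println!("peak concurrent compilations: {peak}");
    assert!(peak <= 4);
}
```

The point of the sketch is only that the temporary working set (downloaded bytes, compile-time allocations, serialized buffer) exists at most `limit` times at once, no matter how many cold starts arrive together.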

mschuwalow · 2026-06-18T14:47:25Z

> > Eventually I would prefer a "proper" queue for the compilations instead of doing it implicitly and in-memory only using a semaphore. But for now this is better
>
> We have a compilation service already. The thing this PR limits only runs if the agent instance is created "too early" after deployment. The compilation cache service itself can be infinitely horizontally scaled to reduce this gap.
>
> I don't see why the semaphore-based concurrency limitation would not be good enough here

Ignore my previous comment. I brainfarted and thought this change is in the compilation service (which does use a queue for this (non-durable though)). In the executor this is fine 👍

fix(worker-executor): bound component cache size and concurrent compilations

9208d3b

kmatasfp requested a review from a team June 17, 2026 07:54

Merge branch 'main' into port/component-compilation-limits

5aafa7d

mschuwalow approved these changes Jun 18, 2026

vigoo reviewed Jun 18, 2026

kmatasfp and others added 3 commits June 18, 2026 16:10

Merge branch 'main' into port/component-compilation-limits

387f9ad

fix(worker-executor): address component limiter review

999b45c

fix(common): prevent cache count drift during eviction

c83588c

kmatasfp requested a review from vigoo June 19, 2026 05:59

fix(worker-executor): bound component cache size and concurrent compilations #3643

kmatasfp wants to merge 5 commits into main from port/component-compilation-limits

3 participants
