-
Notifications
You must be signed in to change notification settings - Fork 203
fix(worker-executor): bound component cache size and concurrent compilations #3643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kmatasfp
wants to merge
5
commits into
main
Choose a base branch
from
port/component-compilation-limits
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 2 commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
9208d3b
fix(worker-executor): bound component cache size and concurrent compi…
kmatasfp 5aafa7d
Merge branch 'main' into port/component-compilation-limits
kmatasfp 387f9ad
Merge branch 'main' into port/component-compilation-limits
kmatasfp 999b45c
fix(worker-executor): address component limiter review
kmatasfp c83588c
fix(common): prevent cache count drift during eviction
kmatasfp File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -47,8 +47,16 @@ pub struct Cache<K, PV, V, E> { | |
| full_cache_eviction: FullCacheEvictionMode, | ||
| background_handle: Arc<Mutex<Option<JoinHandle<()>>>>, | ||
| name: &'static str, | ||
| /// Test-only seam: when set, awaited inside the full-cache eviction between | ||
| /// snapshotting the entries to keep and committing the new size, so a test | ||
| /// can deterministically interleave a concurrent insert at that point. | ||
| #[cfg(test)] | ||
| evict_interleave: Arc<Mutex<Option<EvictInterleaveHook>>>, | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| type EvictInterleaveHook = Arc<dyn Fn() -> Pin<Box<dyn Future<Output = ()> + Send>> + Send + Sync>; | ||
|
|
||
| pub trait SimpleCache<K, V, E> { | ||
| fn get_or_insert_simple<F>(&self, key: &K, f: F) -> impl Future<Output = Result<V, E>> | ||
| where | ||
|
|
@@ -141,6 +149,8 @@ impl< | |
| full_cache_eviction, | ||
| background_handle: Arc::new(Mutex::new(None)), | ||
| name, | ||
| #[cfg(test)] | ||
| evict_interleave: Arc::new(Mutex::new(None)), | ||
| }; | ||
|
|
||
| if let Some(capacity) = capacity { | ||
|
|
@@ -177,6 +187,20 @@ impl< | |
| cache | ||
| } | ||
|
|
||
| /// Test-only: installs a hook awaited inside full-cache eviction between | ||
| /// computing the surviving entry set and committing the new size, so a test | ||
| /// can deterministically interleave a concurrent insert at that point. | ||
| #[cfg(test)] | ||
| fn set_evict_interleave(&self, hook: EvictInterleaveHook) { | ||
| *self.evict_interleave.lock().unwrap() = Some(hook); | ||
| } | ||
|
|
||
| /// Test-only: removes the eviction interleave hook. | ||
| #[cfg(test)] | ||
| fn clear_evict_interleave(&self) { | ||
| *self.evict_interleave.lock().unwrap() = None; | ||
| } | ||
|
|
||
| /// Tries to get a cached value for the given key. If the value is missing or is pending, it returns None. | ||
| pub async fn try_get(&self, key: &K) -> Option<V> { | ||
| let result = self | ||
|
|
@@ -234,7 +258,7 @@ impl< | |
| { | ||
| let mut eviction_needed = false; | ||
| let result = { | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst); | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::Relaxed); | ||
| let result = self.get_or_add_as_pending(key, own_id, f1).await?; | ||
| match result { | ||
| Item::Pending { | ||
|
|
@@ -257,11 +281,12 @@ impl< | |
| }, | ||
| ) | ||
| .await; | ||
| let old_count = self.state.count.fetch_add(1, Ordering::SeqCst); | ||
| let old_count = self.state.count.fetch_add(1, Ordering::Relaxed); | ||
| let new_count = old_count.saturating_add(1); | ||
|
|
||
| record_cache_size(self.name, old_count.saturating_add(1)); | ||
| record_cache_size(self.name, new_count); | ||
|
|
||
| if Some(old_count) == self.capacity { | ||
| if self.capacity.is_some_and(|capacity| new_count > capacity) { | ||
| eviction_needed = true; | ||
| } | ||
| } else { | ||
|
|
@@ -397,7 +422,7 @@ impl< | |
| F2: FnOnce(&PV) -> Pin<Box<dyn Future<Output = Result<V, E>> + Send>> + Send + 'static, | ||
| { | ||
| { | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::SeqCst); | ||
| let own_id = self.state.last_id.fetch_add(1, Ordering::Relaxed); | ||
| let result = self.get_or_add_as_pending(key, own_id, f1).await?; | ||
| match result { | ||
| Item::Pending { | ||
|
|
@@ -429,11 +454,15 @@ impl< | |
| ) | ||
| .await; | ||
| let old_count = | ||
| self_clone.state.count.fetch_add(1, Ordering::SeqCst); | ||
| self_clone.state.count.fetch_add(1, Ordering::Relaxed); | ||
| let new_count = old_count.saturating_add(1); | ||
|
|
||
| record_cache_size(self_clone.name, old_count.saturating_add(1)); | ||
| record_cache_size(self_clone.name, new_count); | ||
|
|
||
| if Some(old_count) == self_clone.capacity { | ||
| if self_clone | ||
| .capacity | ||
| .is_some_and(|capacity| new_count > capacity) | ||
| { | ||
| self_clone.evict().await; | ||
| } | ||
| } else { | ||
|
|
@@ -490,7 +519,7 @@ impl< | |
| pub async fn remove(&self, key: &K) { | ||
| let removed = self.state.items.remove_async(key).await.is_some(); | ||
| if removed { | ||
| let count = self.state.count.fetch_sub(1, Ordering::SeqCst); | ||
| let count = self.state.count.fetch_sub(1, Ordering::Relaxed); | ||
| record_cache_size(self.name, count.saturating_sub(1)); | ||
| } | ||
| } | ||
|
|
@@ -530,7 +559,7 @@ impl< | |
| if let Some(state) = weak_state.upgrade() { | ||
| let removed = state.items.remove_sync(&key).is_some(); | ||
| if removed { | ||
| let count = state.count.fetch_sub(1, Ordering::SeqCst); | ||
| let count = state.count.fetch_sub(1, Ordering::Relaxed); | ||
| record_cache_size(name, count.saturating_sub(1)); | ||
| } | ||
| } | ||
|
|
@@ -559,20 +588,31 @@ impl< | |
| } | ||
|
|
||
| async fn evict_least_recently_used(&self, count: usize) { | ||
| let mut keys_to_keep = vec![]; | ||
| let mut cached = vec![]; | ||
| self.state | ||
| .items | ||
| .iter_async(|key, value| { | ||
| if let Item::Cached { last_access, .. } = value { | ||
| keys_to_keep.push((key.clone(), last_access.elapsed().as_millis())) | ||
| cached.push((key.clone(), last_access.elapsed().as_millis())) | ||
| } | ||
| true | ||
| }) | ||
| .await; | ||
|
|
||
| keys_to_keep.sort_by_key(|(_, v)| *v); | ||
| keys_to_keep.truncate(keys_to_keep.len().saturating_sub(count)); | ||
| let keys_to_keep: HashSet<&K> = keys_to_keep.iter().map(|(k, _)| k).collect(); | ||
| // Sort most-recently-used first (smallest elapsed first) so truncating | ||
| // the tail drops the oldest entries and keeps the newest. | ||
| cached.sort_by_key(|(_, elapsed)| *elapsed); | ||
|
|
||
| // Keep at most `cached_len - count` entries, and never more than the | ||
| // configured capacity, so an over-capacity cache is always trimmed back | ||
| // down to the bound regardless of how far it overshot. | ||
| let cached_len = cached.len(); | ||
| let mut keep = cached_len.saturating_sub(count); | ||
| if let Some(capacity) = self.capacity { | ||
| keep = keep.min(capacity); | ||
| } | ||
| cached.truncate(keep); | ||
| let keys_to_keep: HashSet<&K> = cached.iter().map(|(k, _)| k).collect(); | ||
|
|
||
| self.state | ||
| .items | ||
|
|
@@ -581,21 +621,56 @@ impl< | |
| Item::Pending { .. } => true, | ||
| }) | ||
| .await; | ||
| self.state.count.store(keys_to_keep.len(), Ordering::SeqCst); | ||
| record_cache_size(self.name, keys_to_keep.len()); | ||
|
|
||
| // Test-only seam: let a test interleave a concurrent insert here, after | ||
| // the surviving set has been computed but before the size is committed, | ||
| // to deterministically exercise the count race. | ||
| #[cfg(test)] | ||
| { | ||
| let hook = self.evict_interleave.lock().unwrap().clone(); | ||
| if let Some(hook) = hook { | ||
| hook().await; | ||
| } | ||
| } | ||
|
|
||
| // Decrement by the number of cached entries actually removed rather than | ||
| // overwriting the counter. A blind store would clobber the increments of | ||
| // inserts that completed concurrently with this eviction, drifting the | ||
| // size below the real count and disabling the capacity trigger. | ||
| let removed = cached_len.saturating_sub(keep); | ||
| let new_count = self | ||
| .state | ||
| .count | ||
| .fetch_sub(removed, Ordering::Relaxed) | ||
| .saturating_sub(removed); | ||
| record_cache_size(self.name, new_count); | ||
| } | ||
|
|
||
| async fn evict_older_than(&self, ttl: Duration) { | ||
| let removed = Arc::new(std::sync::atomic::AtomicUsize::new(0)); | ||
| let removed_in_retain = removed.clone(); | ||
| self.state | ||
| .items | ||
| .retain_async(|_, item| match item { | ||
| Item::Cached { last_access, .. } => last_access.elapsed() < ttl, | ||
| Item::Cached { last_access, .. } => { | ||
| let keep = last_access.elapsed() < ttl; | ||
| if !keep { | ||
| removed_in_retain.fetch_add(1, Ordering::Relaxed); | ||
| } | ||
| keep | ||
| } | ||
| Item::Pending { .. } => true, | ||
| }) | ||
| .await; | ||
| let count = self.state.items.len(); | ||
| self.state.count.store(count, Ordering::SeqCst); | ||
| record_cache_size(self.name, count); | ||
| // Decrement by the number of cached entries actually removed rather than | ||
| // overwriting the counter, so concurrent insert increments are not lost. | ||
| let removed = removed.load(Ordering::Relaxed); | ||
| let new_count = self | ||
| .state | ||
| .count | ||
| .fetch_sub(removed, Ordering::Relaxed) | ||
| .saturating_sub(removed); | ||
| record_cache_size(self.name, new_count); | ||
| } | ||
|
|
||
| async fn update_last_access(&self, key: &K) { | ||
|
|
@@ -1316,6 +1391,77 @@ mod tests { | |
| assert!(cache.contains_key(&3).await); | ||
| } | ||
|
|
||
| #[test(flavor = "multi_thread", worker_threads = 4)] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. test-r does not have such attributes (all test runs in a shared multi-threaded tokio runtime) |
||
| async fn capacity_holds_when_insert_races_eviction() { | ||
| // Deterministically reproduces the production count race. Capacity | ||
| // eviction snapshots the surviving entry set and then *blindly stores* | ||
| // that as the new size. If an insert lands between the snapshot and the | ||
| // store, its increment is clobbered, so the cache's notion of its own | ||
| // size drifts below the real number of cached entries. Because eviction | ||
| // is only triggered when an insert observes the size exactly equal to | ||
| // capacity, once the size has drifted below capacity the trigger is | ||
| // never hit again and the cache grows without bound. | ||
| // | ||
| // A test seam pauses eviction at exactly that window so the race is | ||
| // forced every run rather than relying on timing. | ||
| let capacity = 4usize; | ||
| let cache = bounded_cache(capacity, "capacity_race"); | ||
|
|
||
| // Fill exactly to capacity. | ||
| for i in 0..capacity as u64 { | ||
| cache | ||
| .get_or_insert_simple(&i, || async move { Ok(i) }) | ||
| .await | ||
| .unwrap(); | ||
| } | ||
| assert_eq!(cache.iter().await.len(), capacity); | ||
|
|
||
| // Arrange for a concurrent insert to fire while eviction is paused at the | ||
| // seam (after computing survivors, before committing the size). The hook | ||
| // runs once. | ||
| let cache_for_hook = cache.clone(); | ||
| let fired = Arc::new(std::sync::atomic::AtomicBool::new(false)); | ||
| let fired_clone = fired.clone(); | ||
| cache.set_evict_interleave(Arc::new(move || { | ||
| let cache = cache_for_hook.clone(); | ||
| let fired = fired_clone.clone(); | ||
| Box::pin(async move { | ||
| if !fired.swap(true, Ordering::SeqCst) { | ||
| // A unique insert completing inside the eviction window: its | ||
| // count increment will be clobbered by the eviction's store. | ||
| cache | ||
| .get_or_insert_simple(&1000u64, || async move { Ok(1000) }) | ||
| .await | ||
| .unwrap(); | ||
| } | ||
| }) | ||
| })); | ||
|
|
||
| // This insert crosses capacity and triggers the (now racing) eviction. | ||
| cache | ||
| .get_or_insert_simple(&100u64, || async move { Ok(100) }) | ||
| .await | ||
| .unwrap(); | ||
| cache.clear_evict_interleave(); | ||
|
|
||
| // After the clobber, insert more unique keys. With the bug, the size has | ||
| // drifted below capacity so the eviction trigger is never hit again and | ||
| // the real cached population grows unbounded past capacity. | ||
| for k in 0..50u64 { | ||
| cache | ||
| .get_or_insert_simple(&(2000 + k), || async move { Ok(2000 + k) }) | ||
| .await | ||
| .unwrap(); | ||
| } | ||
|
|
||
| let size = cache.iter().await.len(); | ||
| assert!( | ||
| size <= capacity, | ||
| "cache with capacity {capacity} grew to {size} cached entries after an insert raced \ | ||
| eviction; the capacity bound is not being enforced" | ||
| ); | ||
| } | ||
|
|
||
| #[test] | ||
| async fn background_eviction_older_than_ttl() { | ||
| let cache: Cache<u64, (), u64, String> = Cache::new( | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This agent review comment seems valid:
1.
get_or_insert_spawnedstill has the unfixed trigger + clobber-prone behavior — and a real bounded cache uses it.The PR fixed the count race in
get_or_insertandget_or_insert_pending(switched tonew_count > capacity+Relaxed), butget_or_insert_spawnedwas left untouched:This path is reached via
get_or_insert_simple_spawned, whichworker_read_only_cacheuses — a capacity-bounded cache (Some(cache_capacity), default 256,LeastRecentlyUsed(1)):Worse: the PR replaced the eviction's
count.store(survivors)with a relativefetch_sub, so count is no longer reset to the true survivor count. Under main's old code, the== capacitytrigger could self-heal because every eviction reset the counter; now drift in this path is permanent, so once the read-only cache's count skips pastcapacityit can stop evicting and grow unbounded — exactly the failure the PR's own test description warns about.Fix: apply the same change to
get_or_insert_spawned(lines 374–378):let new_count = old_count.saturating_add(1);andif self_clone.capacity.is_some_and(|c| new_count > c). Also reconcile theSeqCst/Relaxeddecision here. Consider adding aget_or_insert_simple_spawnedcapacity test, since the new race test only covers the non-spawned path.