test(resolver): add tests for cache thread-safety and defensive copies #1027

LalatenduMohanty wants to merge 2 commits into python-wheel-build:main
Conversation
> ℹ️ Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused this review to avoid overwhelming you with comments on an influx of new commits.

No actionable comments were generated in the recent review. 🎉

📝 Walkthrough: Introduces thread-safe candidate caching in the resolver.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_resolver.py`:
- Around line 1355-1363: The test currently uses a single _SlowProvider instance
for all threads so it doesn't exercise resolver.BaseProvider.resolver_cache
across instances; modify the thread setup so each thread gets its own provider
instance (e.g., create a list of providers = [_SlowProvider() for _ in range(4)]
and pass providers[i] into resolve_in_thread or construct a new _SlowProvider()
in the Thread args) so the class-scoped resolver_cache and any cross-instance
locking/racing are actually tested; update references to the single provider
variable accordingly.
- Around line 1311-1313: The test improperly seeds the cache by appending to the
list returned by _get_cached_candidates(identifier), which will break if that
method returns a defensive copy; instead directly populate the provider's
internal cache storage (e.g. set provider._cached_candidates[identifier] =
[_make_candidate("test-pkg", "1.0.0")] or use the provider's explicit cache
write helper if one exists) so the cache state is actually mutated for the test;
update the lines that call _get_cached_candidates to write into
provider._cached_candidates (or the appropriate internal cache structure) rather
than appending to the returned list.
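The two fixes above can be combined into one minimal, runnable sketch. `_SlowProvider`, `resolve_in_thread`, and `resolver_cache` mirror the names in the comments, but the class body here is a stand-in, not fromager's real provider:

```python
import threading

class _SlowProvider:
    # Class-scoped cache shared across ALL instances, standing in for
    # resolver.BaseProvider.resolver_cache.
    resolver_cache: dict = {}

    def find_candidates(self, identifier: str) -> list[str]:
        return [f"{identifier}-1.0.0"]

results: dict[int, list[str]] = {}

def resolve_in_thread(i: int, provider: "_SlowProvider") -> None:
    # Each thread gets its OWN provider instance, so only the class-level
    # cache (and any cross-instance locking) is actually shared.
    results[i] = provider.find_candidates("test-pkg")

providers = [_SlowProvider() for _ in range(4)]
threads = [
    threading.Thread(target=resolve_in_thread, args=(i, providers[i]))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)
```

The test then asserts on `results` rather than on state reached through a single shared provider instance.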
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01abf8fd-b526-4448-83f9-d08a75259191
📒 Files selected for processing (1)
tests/test_resolver.py
Force-pushed: 3f56e40 to 03b4d17
🧹 Nitpick comments (2)

tests/test_resolver.py (1)

1327-1372: Thread-safety test properly exercises the class-level cache.

Using separate provider instances (line 1354) per thread correctly tests the shared class-level `resolver_cache`. The barrier ensures all threads hit the cache simultaneously.

One robustness note: `t.join(timeout=10)` doesn't raise if threads are still alive. Consider checking `t.is_alive()` after the joins to fail fast on unexpected hangs:

```diff
  for t in threads:
      t.join(timeout=10)
+ assert not any(t.is_alive() for t in threads), "Threads did not complete in time"
```

This prevents silent test passes when threads hang.
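The point about `join` can be demonstrated standalone; this sketch uses a trivial worker in place of the real per-thread resolution work:

```python
import threading
import time

def worker() -> None:
    time.sleep(0.01)  # stand-in for the real per-thread resolution work

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    # join() with a timeout returns silently even if the thread is still
    # running; it never raises on expiry...
    t.join(timeout=10)
# ...so check liveness explicitly to fail fast instead of passing silently.
hung = [t for t in threads if t.is_alive()]
assert not hung, "Threads did not complete in time"
```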
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_resolver.py` around lines 1327-1372: The test test_find_cached_candidates_thread_safe should fail fast if any thread hangs: after joining each thread (the `for t in threads: t.join(timeout=10)` loop) check each thread's liveness (using `t.is_alive()`) and raise/assert if any thread remains alive so the test fails instead of silently passing; update the test to perform this liveness check after the join loop (or immediately after each join), referencing the threads list and Thread objects created in resolve_in_thread to detect and report hangs.

src/fromager/resolver.py (1)
612-621: Empty result caching edge case.

If `find_candidates()` returns an empty list, `if cached_candidates:` evaluates to `False` on subsequent calls, causing repeated invocations. Consider using a sentinel or `None` to distinguish "cache miss" from "cached empty":

```diff
- cached_candidates = self._get_cached_candidates(identifier)
- if cached_candidates:
+ cache_key = (type(self), self.cache_key)
+ provider_cache = self.resolver_cache.get(identifier, {})
+ if cache_key in provider_cache:
+     cached_candidates = list(provider_cache[cache_key])
      logger.debug(...)
      return cached_candidates
```

This is pre-existing behavior and may be acceptable if empty results are rare or cheap to recompute.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fromager/resolver.py` around lines 612 - 621, The current cached_candidates truthiness check treats an empty list as a cache miss and causes repeated recomputation; change the caching logic so _get_cached_candidates returns None for a miss and stores/returns an actual empty list when a lookup succeeded but produced no results, then replace the conditional in the resolver (the block using _get_identifier_lock and cached_candidates) to test "cached_candidates is not None" (or compare against a sentinel) so an explicit cached empty list is honored and avoids repeated find_candidates() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/fromager/resolver.py`:
- Around line 612-621: The current cached_candidates truthiness check treats an
empty list as a cache miss and causes repeated recomputation; change the caching
logic so _get_cached_candidates returns None for a miss and stores/returns an
actual empty list when a lookup succeeded but produced no results, then replace
the conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.
In `@tests/test_resolver.py`:
- Around line 1327-1372: The test test_find_cached_candidates_thread_safe should
fail fast if any thread hangs: after joining each thread (the for t in threads:
t.join(timeout=10) loop) check each thread's liveness (using t.is_alive()) and
raise/assert if any thread remains alive so the test fails instead of silently
passing; update the test to perform this liveness check after the join loop (or
immediately after each join) referencing the threads list and Thread objects
created in resolve_in_thread to detect and report hangs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 11662622-87de-4775-8478-b23f82a0c01e
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
…cking

Add per-identifier locks to _find_cached_candidates() and return defensive copies from _get_cached_candidates() to prevent concurrent threads from corrupting cached candidate lists during parallel builds.

A single global lock would serialize all resolution work, so a per-identifier scheme is used instead: threads resolving different packages proceed concurrently while threads resolving the same package wait for the first to populate the cache.

Closes: python-wheel-build#1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
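The per-identifier scheme described in the commit message can be sketched as follows. The names (`_meta_lock`, `_cache_locks`, `_get_identifier_lock`) follow the review thread, but the bodies are illustrative, not fromager's actual code:

```python
import threading

_meta_lock = threading.Lock()
_cache_locks: dict[str, threading.Lock] = {}
_cache: dict[str, list] = {}

def _get_identifier_lock(identifier: str) -> threading.Lock:
    # The meta lock is held only long enough to create or fetch the
    # per-identifier lock, never during resolution itself.
    with _meta_lock:
        return _cache_locks.setdefault(identifier, threading.Lock())

def resolve(identifier: str, find) -> list:
    with _get_identifier_lock(identifier):
        # Threads resolving the SAME identifier serialize here; threads
        # resolving different identifiers hold different locks and proceed.
        if identifier not in _cache:
            _cache[identifier] = find(identifier)
        return list(_cache[identifier])  # defensive copy

calls: list[str] = []

def slow_find(identifier: str) -> list:
    calls.append(identifier)
    return [f"{identifier}-1.0"]

first = resolve("pkg-a", slow_find)
second = resolve("pkg-a", slow_find)  # cache hit: slow_find not called again
```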
Force-pushed: 03b4d17 to 67541ae
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/resolver.py`:
- Around line 583-595: _get_cached_candidates currently uses setdefault to
create and return an empty list on cache miss, which collapses "no cached
result" and "cached empty result"; change the logic so misses and cached empty
lists are distinguishable: in _get_cached_candidates (and similarly at lines
noted around 626-633) do not create a new list on miss—look up provider_cache
with provider_cache.get((cls, self.cache_key)) or test key presence and return
None for a miss, and update _find_cached_candidates (or the caller that checks
the cache) to treat a cached empty list as a cache hit by checking for key
presence (is not None or key in provider_cache / resolver_cache) instead of
truthiness, ensuring find_candidates() is not re-run for previously-cached empty
results.
- Around line 617-621: The debug log calls that currently use logger.debug("%s:
...", identifier, ...) (around the unfiltered candidates message and the cache
hit/miss messages in the resolver) must be executed inside the per-request
logging context helper; wrap each of those logger.debug calls in a with
req_ctxvar_context(): block so they carry the standard per-requirement context
(use the existing identifier variable as before and apply this change to the
block around the unfiltered candidates message and the subsequent cache hit/miss
debug lines referenced in the same function).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1e32f653-084a-455d-9f76-697fe6b0c32d
📒 Files selected for processing (2)
src/fromager/resolver.py
tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_resolver.py
The old truthiness check `if cached_candidates:` treated an empty list as a cache miss, silently discarding valid results. In practice this has no effect because find_matches raises ResolverException on empty candidates, terminating the resolution before a second lookup can occur. But the cache should not discard data it already computed. Use None as the "not yet cached" sentinel instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
```python
resolver_cache: typing.ClassVar[ResolverCache] = {}
_cache_locks: typing.ClassVar[dict[str, threading.Lock]] = {}
_meta_lock: typing.ClassVar[threading.Lock] = threading.Lock()
```
_cache_locks grows unbounded as new identifiers are resolved. Locks are never cleaned up.
We will potentially see long-running processes or large dependency graphs soon when multiple version bootstrap is enabled. This will accumulate locks for every package ever resolved.
Can we add lock cleanup for clear_cache()? Something like:
```python
@classmethod
def clear_cache(cls, identifier: str | None = None) -> None:
    """Clear the global resolver cache and associated locks."""
    with cls._meta_lock:
        if identifier is None:
            cls.resolver_cache.clear()
            cls._cache_locks.clear()
        else:
            canon_name = canonicalize_name(identifier)
            cls.resolver_cache.pop(canon_name, None)  # Entry may not exist
            cls._cache_locks.pop(canon_name, None)  # Lock may not exist
```
This is a good point. We can add the lock cleanup. I was also thinking of adding this, but the existing code was not creating a situation where this is an issue. Will take a deeper look.
```python
    Must be called under the per-identifier lock from _get_identifier_lock.
    """
    cls = type(self)
    provider_cache = cls.resolver_cache.setdefault(identifier, {})
```
This creates an empty dict even when just checking if an identifier is cached.

Suggestion:

```python
provider_cache = cls.resolver_cache.get(identifier, {})
candidate_cache = provider_cache.get((cls, self.cache_key))
```
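The difference is easy to demonstrate in isolation: `setdefault` inserts an entry as a side effect of what should be a pure lookup, while `get` does not.

```python
cache: dict[str, dict] = {}

cache.setdefault("pkg", {})  # read-modify: inserts {} under "pkg"
cache.get("other", {})       # pure read: "other" is NOT inserted
```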
```python
    assert second[0].version == Version("1.0.0")
```

```python
def test_find_cached_candidates_thread_safe() -> None:
```
test_find_cached_candidates_thread_safe() only tests threads resolving the same package. Maybe we should also verify that different packages don't block each other?
Simpler solution: the caller is not allowed to corrupt the state of the cache. The method _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates(). Is that actually a problem in real life? AFAIK only the bootstrap phase resolves packages, and the bootstrap phase is single threaded.
fix(resolver): make resolver cache thread-safe with per-identifier locking
Add failing tests that demonstrate two bugs in the resolver cache:

- _get_cached_candidates returns a direct reference to the internal cache list, allowing callers to corrupt shared state by mutation.
- _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates().

These tests will pass once the resolver cache is made thread-safe with proper locking and defensive copies.

Closes: #1024
Co-Authored-By: Claude <claude@anthropic.co
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>