test(resolver): add tests for cache thread-safety and defensive copies #1027

Open
LalatenduMohanty wants to merge 2 commits into python-wheel-build:main from LalatenduMohanty:fix/resolver-cache-thread-safety

Conversation

@LalatenduMohanty
Member

@LalatenduMohanty LalatenduMohanty commented Apr 6, 2026

fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to _find_cached_candidates() and return
defensive copies from _get_cached_candidates() to prevent concurrent
threads from corrupting cached candidate lists during parallel builds.

A single global lock would serialize all resolution work, so a
per-identifier scheme is used instead — threads resolving different
packages proceed concurrently while threads resolving the same
package wait for the first to populate the cache.

Add failing tests that demonstrate two bugs in the resolver cache:

  1. _get_cached_candidates returns a direct reference to the internal cache list, allowing callers to corrupt shared state by mutation.

  2. _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates().

These tests will pass once the resolver cache is made thread-safe with proper locking and defensive copies.

Closes: #1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
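The scheme described above can be sketched in a few lines. This is a simplified illustration, not the actual fromager code: `CandidateCache`, `get_or_compute`, and the `compute` callback are invented names standing in for the provider's cache, `_find_cached_candidates()`, and `find_candidates()`.

```python
import threading


class CandidateCache:
    """Sketch of per-identifier locking with defensive copies."""

    _cache: dict[str, list[str]] = {}
    _locks: dict[str, threading.Lock] = {}
    _meta_lock = threading.Lock()  # guards the lock map itself

    @classmethod
    def _get_identifier_lock(cls, identifier: str) -> threading.Lock:
        # Creating a per-identifier lock must itself be synchronized,
        # otherwise two threads could each create a different lock.
        with cls._meta_lock:
            return cls._locks.setdefault(identifier, threading.Lock())

    @classmethod
    def get_or_compute(cls, identifier, compute):
        # Threads resolving different identifiers use different locks
        # and therefore never block each other.
        with cls._get_identifier_lock(identifier):
            cached = cls._cache.get(identifier)
            if cached is not None:
                return list(cached)  # defensive copy: callers may mutate it
            result = list(compute())
            cls._cache[identifier] = list(result)  # store our own copy too
            return result
```

The defensive copies on both read and write mean a caller appending to the returned list cannot corrupt what later callers see.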

@LalatenduMohanty LalatenduMohanty requested a review from a team as a code owner April 6, 2026 04:41
@coderabbitai

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ec44240-9664-47f9-ad87-ddb00ca2174a

📥 Commits

Reviewing files that changed from the base of the PR and between 67541ae and 8218e5b.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/fromager/resolver.py

📝 Walkthrough

Walkthrough

Introduces thread-safe candidate caching in src/fromager/resolver.py: adds a class-level meta-lock, a per-identifier lock map with a _get_identifier_lock() helper, _get_cached_candidates() that returns a defensive copy (or None), and _set_cached_candidates() to atomically store copies. _find_cached_candidates() now uses per-id locking, avoids in-place mutation, and supports a non-cached materialization path. Adds tests in tests/test_resolver.py including helpers and providers, a defensive-copy test, and a multithreaded test ensuring find_candidates() is invoked exactly once.
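The multithreaded test mentioned in the walkthrough can be sketched roughly as follows. This is a standalone approximation using a plain dict and a single lock in place of the real provider classes; the function and variable names are invented for the example.

```python
import threading


def test_concurrent_lookup_invokes_compute_once() -> None:
    # Several threads race on the same identifier; the lock must
    # ensure the expensive lookup runs exactly once.
    calls: list[int] = []
    cache: dict[str, list[str]] = {}
    lock = threading.Lock()  # stand-in for the per-identifier lock

    def find_candidates() -> list[str]:
        calls.append(1)  # safe: only ever called while holding `lock`
        return ["cand-1.0.0"]

    def resolve() -> list[str]:
        with lock:
            if "pkg" not in cache:
                cache["pkg"] = find_candidates()
            return list(cache["pkg"])  # defensive copy

    barrier = threading.Barrier(4)
    results: list[list[str]] = []
    results_lock = threading.Lock()

    def worker() -> None:
        barrier.wait()  # line all threads up to maximize contention
        r = resolve()
        with results_lock:
            results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    assert not any(t.is_alive() for t in threads), "threads hung"
    assert len(calls) == 1
    assert all(r == ["cand-1.0.0"] for r in results)
```

The barrier forces all workers to hit the cache at the same moment, which is what makes the `len(calls) == 1` assertion meaningful.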

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding tests for cache thread-safety and defensive copies to the resolver module.
Description check ✅ Passed The description clearly relates to the changeset, explaining the thread-safety issues being fixed and the tests being added to verify the fixes.
Linked Issues check ✅ Passed The PR fully addresses issue #1024 objectives: implements per-identifier locking to prevent cache corruption [1024], returns defensive copies to prevent external mutation [1024], and adds tests verifying both defensive-copy behavior and thread-safe cache population [1024].
Out of Scope Changes check ✅ Passed All changes are tightly scoped to resolver cache thread-safety: test scaffolding and two new tests in test_resolver.py, and cache synchronization logic in resolver.py. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci label Apr 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_resolver.py`:
- Around line 1355-1363: The test currently uses a single _SlowProvider instance
for all threads so it doesn't exercise resolver.BaseProvider.resolver_cache
across instances; modify the thread setup so each thread gets its own provider
instance (e.g., create a list of providers = [_SlowProvider() for _ in range(4)]
and pass providers[i] into resolve_in_thread or construct a new _SlowProvider()
in the Thread args) so the class-scoped resolver_cache and any cross-instance
locking/racing are actually tested; update references to the single provider
variable accordingly.
- Around line 1311-1313: The test improperly seeds the cache by appending to the
list returned by _get_cached_candidates(identifier), which will break if that
method returns a defensive copy; instead directly populate the provider's
internal cache storage (e.g. set provider._cached_candidates[identifier] =
[_make_candidate("test-pkg", "1.0.0")] or use the provider's explicit cache
write helper if one exists) so the cache state is actually mutated for the test;
update the lines that call _get_cached_candidates to write into
provider._cached_candidates (or the appropriate internal cache structure) rather
than appending to the returned list.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01abf8fd-b526-4448-83f9-d08a75259191

📥 Commits

Reviewing files that changed from the base of the PR and between b5df8e2 and e441fe9.

📒 Files selected for processing (1)
  • tests/test_resolver.py

@LalatenduMohanty LalatenduMohanty force-pushed the fix/resolver-cache-thread-safety branch from 3f56e40 to 03b4d17 Compare April 6, 2026 12:52

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/test_resolver.py (1)

1327-1372: Thread-safety test properly exercises the class-level cache.

Using separate provider instances (line 1354) per thread correctly tests the shared class-level resolver_cache. The barrier ensures all threads hit the cache simultaneously.

One robustness note: t.join(timeout=10) doesn't raise if threads are still alive. Consider checking t.is_alive() after joins to fail fast on unexpected hangs:

     for t in threads:
         t.join(timeout=10)
+    assert not any(t.is_alive() for t in threads), "Threads did not complete in time"

This prevents silent test passes when threads hang.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_resolver.py` around lines 1327 - 1372, The test
test_find_cached_candidates_thread_safe should fail fast if any thread hangs:
after joining each thread (the for t in threads: t.join(timeout=10) loop) check
each thread's liveness (using t.is_alive()) and raise/assert if any thread
remains alive so the test fails instead of silently passing; update the test to
perform this liveness check after the join loop (or immediately after each join)
referencing the threads list and Thread objects created in resolve_in_thread to
detect and report hangs.
src/fromager/resolver.py (1)

612-621: Empty result caching edge case.

If find_candidates() returns an empty list, if cached_candidates: evaluates to False on subsequent calls, causing repeated invocations. Consider using a sentinel or None to distinguish "cache miss" from "cached empty":

-        cached_candidates = self._get_cached_candidates(identifier)
-        if cached_candidates:
+        cache_key = (type(self), self.cache_key)
+        provider_cache = self.resolver_cache.get(identifier, {})
+        if cache_key in provider_cache:
+            cached_candidates = list(provider_cache[cache_key])
             logger.debug(...)
             return cached_candidates

This is pre-existing behavior and may be acceptable if empty results are rare or cheap to recompute.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/resolver.py` around lines 612 - 621, The current
cached_candidates truthiness check treats an empty list as a cache miss and
causes repeated recomputation; change the caching logic so
_get_cached_candidates returns None for a miss and stores/returns an actual
empty list when a lookup succeeded but produced no results, then replace the
conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/fromager/resolver.py`:
- Around line 612-621: The current cached_candidates truthiness check treats an
empty list as a cache miss and causes repeated recomputation; change the caching
logic so _get_cached_candidates returns None for a miss and stores/returns an
actual empty list when a lookup succeeded but produced no results, then replace
the conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.

In `@tests/test_resolver.py`:
- Around line 1327-1372: The test test_find_cached_candidates_thread_safe should
fail fast if any thread hangs: after joining each thread (the for t in threads:
t.join(timeout=10) loop) check each thread's liveness (using t.is_alive()) and
raise/assert if any thread remains alive so the test fails instead of silently
passing; update the test to perform this liveness check after the join loop (or
immediately after each join) referencing the threads list and Thread objects
created in resolve_in_thread to detect and report hangs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 11662622-87de-4775-8478-b23f82a0c01e

📥 Commits

Reviewing files that changed from the base of the PR and between 3f56e40 and 03b4d17.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py

@LalatenduMohanty LalatenduMohanty marked this pull request as draft April 6, 2026 13:05
fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to _find_cached_candidates() and return
defensive copies from _get_cached_candidates() to prevent concurrent
threads from corrupting cached candidate lists during parallel builds.

A single global lock would serialize all resolution work, so a
per-identifier scheme is used instead — threads resolving different
packages proceed concurrently while threads resolving the same
package wait for the first to populate the cache.

Closes: python-wheel-build#1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
@LalatenduMohanty LalatenduMohanty force-pushed the fix/resolver-cache-thread-safety branch from 03b4d17 to 67541ae Compare April 6, 2026 13:38
@LalatenduMohanty LalatenduMohanty marked this pull request as ready for review April 6, 2026 14:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/resolver.py`:
- Around line 583-595: _get_cached_candidates currently uses setdefault to
create and return an empty list on cache miss, which collapses "no cached
result" and "cached empty result"; change the logic so misses and cached empty
lists are distinguishable: in _get_cached_candidates (and similarly at lines
noted around 626-633) do not create a new list on miss—look up provider_cache
with provider_cache.get((cls, self.cache_key)) or test key presence and return
None for a miss, and update _find_cached_candidates (or the caller that checks
the cache) to treat a cached empty list as a cache hit by checking for key
presence (is not None or key in provider_cache / resolver_cache) instead of
truthiness, ensuring find_candidates() is not re-run for previously-cached empty
results.
- Around line 617-621: The debug log calls that currently use logger.debug("%s:
...", identifier, ...) (around the unfiltered candidates message and the cache
hit/miss messages in the resolver) must be executed inside the per-request
logging context helper; wrap each of those logger.debug calls in a with
req_ctxvar_context(): block so they carry the standard per-requirement context
(use the existing identifier variable as before and apply this change to the
block around the unfiltered candidates message and the subsequent cache hit/miss
debug lines referenced in the same function).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e32f653-084a-455d-9f76-697fe6b0c32d

📥 Commits

Reviewing files that changed from the base of the PR and between 03b4d17 and 67541ae.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_resolver.py

The old truthiness check `if cached_candidates:` treated an empty
list as a cache miss, silently discarding valid results. In practice
this has no effect because find_matches raises ResolverException on
empty candidates, terminating the resolution before a second lookup
can occur. But the cache should not discard data it already computed.

Use None as the "not yet cached" sentinel instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>

resolver_cache: typing.ClassVar[ResolverCache] = {}
_cache_locks: typing.ClassVar[dict[str, threading.Lock]] = {}
_meta_lock: typing.ClassVar[threading.Lock] = threading.Lock()
Contributor


_cache_locks grows unbounded as new identifiers are resolved. Locks are never cleaned up.

We will potentially see long-running processes or large dependency graphs soon when multiple version bootstrap is enabled. This will accumulate locks for every package ever resolved.

Can we add lock cleanup for clear_cache()? Something like:

@classmethod
def clear_cache(cls, identifier: str | None = None) -> None:
    """Clear global resolver cache and associated locks."""
    with cls._meta_lock:
        if identifier is None:
            cls.resolver_cache.clear()
            cls._cache_locks.clear()
        else:
            canon_name = canonicalize_name(identifier)
            cls.resolver_cache.pop(canon_name, None)  # Entry may not exist
            cls._cache_locks.pop(canon_name, None)  # Lock may not exist

Member Author


This is a good point. We can add the lock cleanup. I was also thinking of adding this but the existing code was not causing the situation when this is an issue. Will take a deeper look.

Must be called under the per-identifier lock from _get_identifier_lock.
"""
cls = type(self)
provider_cache = cls.resolver_cache.setdefault(identifier, {})
Contributor


This creates an empty dict even when just checking if an identifier is cached.

Suggestion:

provider_cache = cls.resolver_cache.get(identifier, {})
candidate_cache = provider_cache.get((cls, self.cache_key))

assert second[0].version == Version("1.0.0")


def test_find_cached_candidates_thread_safe() -> None:
Contributor


test_find_cached_candidates_thread_safe() only tests threads resolving the same package. Maybe we should also verify that different packages don't block each other?

@tiran
Collaborator

tiran commented Apr 7, 2026

_get_cached_candidates returns a direct reference to the internal cache list, allowing callers to corrupt shared state by mutation.

Simpler solution: The caller is not allowed to corrupt the state of the cache.

The method _get_cached_candidates and _find_cached_candidates are internal, private implementation details. The methods are carefully written to use atomic dict and list operations. They don't corrupt the caches.

_find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates().

Is that actually a problem in real life? AFAIK only bootstrap phase is resolving packages. The bootstrap phase is single threaded.


Successfully merging this pull request may close these issues.

BaseProvider.resolver_cache is not thread-safe and can corrupt candidate lists during parallel builds
