test(resolver): add tests for cache thread-safety and defensive copies #1027

Open
LalatenduMohanty wants to merge 2 commits into python-wheel-build:main from LalatenduMohanty:fix/resolver-cache-thread-safety

Conversation

@LalatenduMohanty
Member

@LalatenduMohanty LalatenduMohanty commented Apr 6, 2026

fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to _find_cached_candidates() and return
defensive copies from _get_cached_candidates() to prevent concurrent
threads from corrupting cached candidate lists during parallel builds.

A single global lock would serialize all resolution work, so a
per-identifier scheme is used instead — threads resolving different
packages proceed concurrently while threads resolving the same
package wait for the first to populate the cache.

Add failing tests that demonstrate two bugs in the resolver cache:

  1. _get_cached_candidates returns a direct reference to the internal cache list, allowing callers to corrupt shared state by mutation.

  2. _find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates().

These tests will pass once the resolver cache is made thread-safe with proper locking and defensive copies.

Closes: #1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
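The scheme described above can be sketched in a few lines. This is a simplified illustration, not the actual fromager code: `CandidateCache`, `get_or_compute`, and the `compute` callback are invented names standing in for the provider's cache, `_find_cached_candidates()`, and `find_candidates()`.

```python
import threading


class CandidateCache:
    """Sketch of per-identifier locking with defensive copies."""

    _cache: dict[str, list[str]] = {}
    _locks: dict[str, threading.Lock] = {}
    _meta_lock = threading.Lock()  # guards the lock map itself

    @classmethod
    def _get_identifier_lock(cls, identifier: str) -> threading.Lock:
        # Creating a per-identifier lock must itself be synchronized,
        # otherwise two threads could each create a different lock.
        with cls._meta_lock:
            return cls._locks.setdefault(identifier, threading.Lock())

    @classmethod
    def get_or_compute(cls, identifier, compute):
        # Threads resolving different identifiers use different locks
        # and therefore never block each other.
        with cls._get_identifier_lock(identifier):
            cached = cls._cache.get(identifier)
            if cached is not None:
                return list(cached)  # defensive copy: callers may mutate it
            result = list(compute())
            cls._cache[identifier] = list(result)  # store our own copy too
            return result
```

The defensive copies on both read and write mean a caller appending to the returned list cannot corrupt what later callers see.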

@LalatenduMohanty LalatenduMohanty requested a review from a team as a code owner April 6, 2026 04:41
@coderabbitai

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ec44240-9664-47f9-ad87-ddb00ca2174a

📥 Commits

Reviewing files that changed from the base of the PR and between 67541ae and 8218e5b.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/fromager/resolver.py

📝 Walkthrough

Walkthrough

Introduces thread-safe candidate caching in src/fromager/resolver.py: adds a class-level meta-lock, a per-identifier lock map with a _get_identifier_lock() helper, _get_cached_candidates() that returns a defensive copy (or None), and _set_cached_candidates() to atomically store copies. _find_cached_candidates() now uses per-id locking, avoids in-place mutation, and supports a non-cached materialization path. Adds tests in tests/test_resolver.py including helpers and providers, a defensive-copy test, and a multithreaded test ensuring find_candidates() is invoked exactly once.
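The multithreaded test mentioned in the walkthrough can be sketched roughly as follows. This is a standalone approximation using a plain dict and a single lock in place of the real provider classes; the function and variable names are invented for the example.

```python
import threading


def test_concurrent_lookup_invokes_compute_once() -> None:
    # Several threads race on the same identifier; the lock must
    # ensure the expensive lookup runs exactly once.
    calls: list[int] = []
    cache: dict[str, list[str]] = {}
    lock = threading.Lock()  # stand-in for the per-identifier lock

    def find_candidates() -> list[str]:
        calls.append(1)  # safe: only ever called while holding `lock`
        return ["cand-1.0.0"]

    def resolve() -> list[str]:
        with lock:
            if "pkg" not in cache:
                cache["pkg"] = find_candidates()
            return list(cache["pkg"])  # defensive copy

    barrier = threading.Barrier(4)
    results: list[list[str]] = []
    results_lock = threading.Lock()

    def worker() -> None:
        barrier.wait()  # line all threads up to maximize contention
        r = resolve()
        with results_lock:
            results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    assert not any(t.is_alive() for t in threads), "threads hung"
    assert len(calls) == 1
    assert all(r == ["cand-1.0.0"] for r in results)
```

The barrier forces all workers to hit the cache at the same moment, which is what makes the `len(calls) == 1` assertion meaningful.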

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding tests for cache thread-safety and defensive copies to the resolver module.
Description check ✅ Passed The description clearly relates to the changeset, explaining the thread-safety issues being fixed and the tests being added to verify the fixes.
Linked Issues check ✅ Passed The PR fully addresses issue #1024 objectives: implements per-identifier locking to prevent cache corruption [1024], returns defensive copies to prevent external mutation [1024], and adds tests verifying both defensive-copy behavior and thread-safe cache population [1024].
Out of Scope Changes check ✅ Passed All changes are tightly scoped to resolver cache thread-safety: test scaffolding and two new tests in test_resolver.py, and cache synchronization logic in resolver.py. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci label Apr 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_resolver.py`:
- Around line 1355-1363: The test currently uses a single _SlowProvider instance
for all threads so it doesn't exercise resolver.BaseProvider.resolver_cache
across instances; modify the thread setup so each thread gets its own provider
instance (e.g., create a list of providers = [_SlowProvider() for _ in range(4)]
and pass providers[i] into resolve_in_thread or construct a new _SlowProvider()
in the Thread args) so the class-scoped resolver_cache and any cross-instance
locking/racing are actually tested; update references to the single provider
variable accordingly.
- Around line 1311-1313: The test improperly seeds the cache by appending to the
list returned by _get_cached_candidates(identifier), which will break if that
method returns a defensive copy; instead directly populate the provider's
internal cache storage (e.g. set provider._cached_candidates[identifier] =
[_make_candidate("test-pkg", "1.0.0")] or use the provider's explicit cache
write helper if one exists) so the cache state is actually mutated for the test;
update the lines that call _get_cached_candidates to write into
provider._cached_candidates (or the appropriate internal cache structure) rather
than appending to the returned list.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01abf8fd-b526-4448-83f9-d08a75259191

📥 Commits

Reviewing files that changed from the base of the PR and between b5df8e2 and e441fe9.

📒 Files selected for processing (1)
  • tests/test_resolver.py

@LalatenduMohanty LalatenduMohanty force-pushed the fix/resolver-cache-thread-safety branch from 3f56e40 to 03b4d17 Compare April 6, 2026 12:52

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/test_resolver.py (1)

1327-1372: Thread-safety test properly exercises the class-level cache.

Using separate provider instances (line 1354) per thread correctly tests the shared class-level resolver_cache. The barrier ensures all threads hit the cache simultaneously.

One robustness note: t.join(timeout=10) doesn't raise if threads are still alive. Consider checking t.is_alive() after joins to fail fast on unexpected hangs:

     for t in threads:
         t.join(timeout=10)
+    assert not any(t.is_alive() for t in threads), "Threads did not complete in time"

This prevents silent test passes when threads hang.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_resolver.py` around lines 1327 - 1372, The test
test_find_cached_candidates_thread_safe should fail fast if any thread hangs:
after joining each thread (the for t in threads: t.join(timeout=10) loop) check
each thread's liveness (using t.is_alive()) and raise/assert if any thread
remains alive so the test fails instead of silently passing; update the test to
perform this liveness check after the join loop (or immediately after each join)
referencing the threads list and Thread objects created in resolve_in_thread to
detect and report hangs.
src/fromager/resolver.py (1)

612-621: Empty result caching edge case.

If find_candidates() returns an empty list, if cached_candidates: evaluates to False on subsequent calls, causing repeated invocations. Consider using a sentinel or None to distinguish "cache miss" from "cached empty":

-        cached_candidates = self._get_cached_candidates(identifier)
-        if cached_candidates:
+        cache_key = (type(self), self.cache_key)
+        provider_cache = self.resolver_cache.get(identifier, {})
+        if cache_key in provider_cache:
+            cached_candidates = list(provider_cache[cache_key])
             logger.debug(...)
             return cached_candidates

This is pre-existing behavior and may be acceptable if empty results are rare or cheap to recompute.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/resolver.py` around lines 612 - 621, The current
cached_candidates truthiness check treats an empty list as a cache miss and
causes repeated recomputation; change the caching logic so
_get_cached_candidates returns None for a miss and stores/returns an actual
empty list when a lookup succeeded but produced no results, then replace the
conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/fromager/resolver.py`:
- Around line 612-621: The current cached_candidates truthiness check treats an
empty list as a cache miss and causes repeated recomputation; change the caching
logic so _get_cached_candidates returns None for a miss and stores/returns an
actual empty list when a lookup succeeded but produced no results, then replace
the conditional in the resolver (the block using _get_identifier_lock and
cached_candidates) to test "cached_candidates is not None" (or compare against a
sentinel) so an explicit cached empty list is honored and avoids repeated
find_candidates() calls.

In `@tests/test_resolver.py`:
- Around line 1327-1372: The test test_find_cached_candidates_thread_safe should
fail fast if any thread hangs: after joining each thread (the for t in threads:
t.join(timeout=10) loop) check each thread's liveness (using t.is_alive()) and
raise/assert if any thread remains alive so the test fails instead of silently
passing; update the test to perform this liveness check after the join loop (or
immediately after each join) referencing the threads list and Thread objects
created in resolve_in_thread to detect and report hangs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 11662622-87de-4775-8478-b23f82a0c01e

📥 Commits

Reviewing files that changed from the base of the PR and between 3f56e40 and 03b4d17.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py

@LalatenduMohanty LalatenduMohanty marked this pull request as draft April 6, 2026 13:05
fix(resolver): make resolver cache thread-safe with per-identifier locking

Add per-identifier locks to _find_cached_candidates() and return
defensive copies from _get_cached_candidates() to prevent concurrent
threads from corrupting cached candidate lists during parallel builds.

A single global lock would serialize all resolution work, so a
per-identifier scheme is used instead — threads resolving different
packages proceed concurrently while threads resolving the same
package wait for the first to populate the cache.

Closes: python-wheel-build#1024
Co-Authored-By: Claude <claude@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>
@LalatenduMohanty LalatenduMohanty force-pushed the fix/resolver-cache-thread-safety branch from 03b4d17 to 67541ae Compare April 6, 2026 13:38
@LalatenduMohanty LalatenduMohanty marked this pull request as ready for review April 6, 2026 14:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/resolver.py`:
- Around line 583-595: _get_cached_candidates currently uses setdefault to
create and return an empty list on cache miss, which collapses "no cached
result" and "cached empty result"; change the logic so misses and cached empty
lists are distinguishable: in _get_cached_candidates (and similarly at lines
noted around 626-633) do not create a new list on miss—look up provider_cache
with provider_cache.get((cls, self.cache_key)) or test key presence and return
None for a miss, and update _find_cached_candidates (or the caller that checks
the cache) to treat a cached empty list as a cache hit by checking for key
presence (is not None or key in provider_cache / resolver_cache) instead of
truthiness, ensuring find_candidates() is not re-run for previously-cached empty
results.
- Around line 617-621: The debug log calls that currently use logger.debug("%s:
...", identifier, ...) (around the unfiltered candidates message and the cache
hit/miss messages in the resolver) must be executed inside the per-request
logging context helper; wrap each of those logger.debug calls in a with
req_ctxvar_context(): block so they carry the standard per-requirement context
(use the existing identifier variable as before and apply this change to the
block around the unfiltered candidates message and the subsequent cache hit/miss
debug lines referenced in the same function).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e32f653-084a-455d-9f76-697fe6b0c32d

📥 Commits

Reviewing files that changed from the base of the PR and between 03b4d17 and 67541ae.

📒 Files selected for processing (2)
  • src/fromager/resolver.py
  • tests/test_resolver.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_resolver.py

The old truthiness check `if cached_candidates:` treated an empty
list as a cache miss, silently discarding valid results. In practice
this has no effect because find_matches raises ResolverException on
empty candidates, terminating the resolution before a second lookup
can occur. But the cache should not discard data it already computed.

Use None as the "not yet cached" sentinel instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lalatendu Mohanty <lmohanty@redhat.com>

resolver_cache: typing.ClassVar[ResolverCache] = {}
_cache_locks: typing.ClassVar[dict[str, threading.Lock]] = {}
_meta_lock: typing.ClassVar[threading.Lock] = threading.Lock()
Contributor


_cache_locks grows unbounded as new identifiers are resolved. Locks are never cleaned up.

We will potentially see long-running processes or large dependency graphs soon when multiple version bootstrap is enabled. This will accumulate locks for every package ever resolved.

Can we add lock cleanup for clear_cache()? Something like:

@classmethod
def clear_cache(cls, identifier: str | None = None) -> None:
    """Clear global resolver cache and associated locks."""
    with cls._meta_lock:
        if identifier is None:
            cls.resolver_cache.clear()
            cls._cache_locks.clear()
        else:
            canon_name = canonicalize_name(identifier)
            cls.resolver_cache.pop(canon_name, None)  # Entry may not exist
            cls._cache_locks.pop(canon_name, None)  # Lock may not exist

Member Author


This is a good point. We can add the lock cleanup. I was also thinking of adding this but the existing code was not causing the situation when this is an issue. Will take a deeper look.

Must be called under the per-identifier lock from _get_identifier_lock.
"""
cls = type(self)
provider_cache = cls.resolver_cache.setdefault(identifier, {})
Contributor


This creates an empty dict even when just checking if an identifier is cached.

Suggestion:

provider_cache = cls.resolver_cache.get(identifier, {})
candidate_cache = provider_cache.get((cls, self.cache_key))

assert second[0].version == Version("1.0.0")


def test_find_cached_candidates_thread_safe() -> None:
Contributor


test_find_cached_candidates_thread_safe() only tests threads resolving the same package. Maybe we should also verify that different packages don't block each other?

@tiran
Collaborator

tiran commented Apr 7, 2026

_get_cached_candidates returns a direct reference to the internal cache list, allowing callers to corrupt shared state by mutation.

Simpler solution: The caller is not allowed to corrupt the state of the cache.

The method _get_cached_candidates and _find_cached_candidates are internal, private implementation details. The methods are carefully written to use atomic dict and list operations. They don't corrupt the caches.

_find_cached_candidates has no synchronization, so concurrent threads bypass the cache and redundantly call find_candidates().

Is that actually a problem in real life? AFAIK only bootstrap phase is resolving packages. The bootstrap phase is single threaded.


Successfully merging this pull request may close these issues.

BaseProvider.resolver_cache is not thread-safe and can corrupt candidate lists during parallel builds
