fix(execute_class): add async lock to prevent double deploy by deanq · Pull Request #274 · runpod/flash

deanq · 2026-03-15T05:35:44Z

Summary

Add asyncio.Lock to RemoteClassWrapper._ensure_initialized() so that concurrent calls cannot both deploy resources (AE-2370)
Uses double-checked locking: a fast-path check of self._initialized before lock acquisition, and a second check inside the lock
Add bug-probe test class TestEnsureInitializedRace with 4 tests validating the fix

What was happening

Two concurrent requests both pass the if not self._initialized check, both call get_or_deploy_resource, both deploy — wasting resources and orphaning one stub. The second assignment silently overwrites the first.
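
The race described here can be reproduced in isolation. The sketch below is only an illustration: `UnlockedWrapper` and `_deploy` are hypothetical stand-ins, not the real `RemoteClassWrapper` or `get_or_deploy_resource`, and the `asyncio.sleep(0)` merely simulates the point where a real deploy would yield to the event loop.

```python
import asyncio


class UnlockedWrapper:
    """Hypothetical stand-in for the pre-fix wrapper (not the real class)."""

    def __init__(self):
        self._initialized = False
        self._stub = None
        self.deploy_calls = 0

    async def _deploy(self):
        # Assumed placeholder for get_or_deploy_resource.
        self.deploy_calls += 1
        await asyncio.sleep(0)  # yield to the loop, as a network call would
        return object()

    async def _ensure_initialized(self):
        if not self._initialized:      # both callers pass this check
            self._stub = await self._deploy()
            self._initialized = True   # second assignment overwrites the first stub


async def demo_race():
    wrapper = UnlockedWrapper()
    await asyncio.gather(
        wrapper._ensure_initialized(),
        wrapper._ensure_initialized(),
    )
    return wrapper.deploy_calls


if __name__ == "__main__":
    print(asyncio.run(demo_race()))
```

Because the flag is only set after the awaited deploy returns, the second caller sees `_initialized` still `False` and deploys again.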

Changes

File	Change
`execute_class.py`	Add `import asyncio`, `self._init_lock` in `__init__`, wrap `_ensure_initialized` with double-checked lock
`tests/bug_probes/test_class_execution.py`	4 tests: concurrent-calls-deploy-once, initialized-flag-set, second-call-skips, deploy-failure-releases-lock

Test plan

make quality-check passes (85.50% coverage)
Bug probe TestEnsureInitializedRace validates concurrent calls deploy exactly once
Existing test suite unaffected (53 passed, 1 skipped)

Closes AE-2370

runpod-Henrik

1. The fix — correct

Double-checked locking is the right pattern here. Fast path avoids lock overhead on every method call after initialization; slow path acquires the lock and re-checks before deploying.

if self._initialized:           # fast path — no lock after init
    return
async with self._init_lock:
    if self._initialized:       # re-check after acquiring lock
        return
    ...
    self._initialized = True    # set only after stub is ready

Three details that are all correct:

asyncio.Lock() in __init__: safe on Python ≥3.10 (Flash's minimum). Since 3.10 a lock no longer captures an event loop at construction; it binds to the running loop lazily, the first time a caller has to wait on it. No DeprecationWarning or RuntimeError.
_initialized = True after stub_resource() — if stub creation raises, _initialized stays False and the lock is released by async with. Retry works correctly. This is tested by test_deploy_failure_releases_lock_and_allows_retry.
_init_lock not accessible via __getattr__ — __getattr__ only fires on missing attributes. Since _init_lock is set in __init__ it's found by normal attribute lookup before __getattr__ is called. No interaction.

2. Question: `asyncio.sleep(0.05)` timing assumption in concurrency test

test_concurrent_calls_deploy_only_once creates two tasks and sleeps 50ms to let both reach the gate:

task1 = asyncio.create_task(wrapper_instance._ensure_initialized())
task2 = asyncio.create_task(wrapper_instance._ensure_initialized())
await asyncio.sleep(0.05)   # hope both tasks reached gate.wait() by now
gate.set()

If the host is slow (loaded CI runner), task2 may not have started before gate.set() fires. It then runs after _initialized is already True, takes the fast path, and the test still passes without ever exercising the lock. The test becomes a timing-sensitive no-op rather than a race proof.

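A more reliable pattern uses a counter and an event to confirm that a task is in-flight inside deploy before releasing. With the lock in place only one task can be inside deploy at a time, so the event has to fire on the first arrival; waiting for two arrivals would hang on the fixed code:

arrived = 0
first_arrived = asyncio.Event()

async def slow_deploy(config):
    nonlocal deploy_call_count, arrived
    deploy_call_count += 1
    arrived += 1
    first_arrived.set()   # a task is inside deploy and holds the init lock
    await gate.wait()
    return MagicMock()

# After creating tasks, wait until a task has called deploy before releasing
await first_arrived.wait()
gate.set()
await asyncio.gather(task1, task2)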

Not blocking — the current test catches the bug reliably on any reasonable machine — but worth knowing for CI robustness.
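
For reference, a self-contained version of a deterministic probe might look like the sketch below. It is not the test from the PR: `LockedWrapper`, `run_probe`, and the injected `deploy` callable are assumptions made so the example runs on its own. Since the lock lets only one task into the deploy mock, the sketch signals on first entry rather than counting to two.

```python
import asyncio


class LockedWrapper:
    """Hypothetical stand-in using double-checked locking (not the real class)."""

    def __init__(self, deploy):
        self._deploy = deploy
        self._initialized = False
        self._stub = None
        self._init_lock = asyncio.Lock()

    async def _ensure_initialized(self):
        if self._initialized:          # fast path, no lock after init
            return
        async with self._init_lock:
            if self._initialized:      # re-check after acquiring the lock
                return
            self._stub = await self._deploy()
            self._initialized = True


async def run_probe():
    deploy_calls = 0
    deploy_entered = asyncio.Event()
    gate = asyncio.Event()

    async def slow_deploy():
        nonlocal deploy_calls
        deploy_calls += 1
        deploy_entered.set()           # a task is inside deploy, holding the lock
        await gate.wait()
        return object()

    wrapper = LockedWrapper(slow_deploy)
    task1 = asyncio.create_task(wrapper._ensure_initialized())
    task2 = asyncio.create_task(wrapper._ensure_initialized())
    await deploy_entered.wait()        # no sleep-based guess about timing
    await asyncio.sleep(0)             # extra loop turn so the other task reaches the lock
    gate.set()
    await asyncio.gather(task1, task2)
    return deploy_calls
```

Run against an unlocked wrapper, the same probe would count two deploys, so it fails for the right reason instead of passing by accident.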

3. Gap: lock is per-instance, not per resource

If the same resource config is passed to two separate create_remote_class() calls, two RemoteClassWrapper instances are created with two independent _init_lock instances. Concurrent initialization of those two wrappers could still double-deploy at the ResourceManager level. That's out of scope for this PR — but worth confirming: does ResourceManager.get_or_deploy_resource guard against concurrent deploys for the same resource config from different wrapper instances? If not, that's a separate ticket.
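
If a follow-up ticket turns out to be needed, one possible shape for a cross-instance guard is a lock per resource key. The sketch below is purely hypothetical: `KeyedDeployer`, `get_or_deploy`, and the string keys are invented for illustration and say nothing about how `ResourceManager` actually behaves.

```python
import asyncio
from collections import defaultdict


class KeyedDeployer:
    """Hypothetical per-resource guard; not the real ResourceManager."""

    def __init__(self, deploy):
        self._deploy = deploy
        self._locks = defaultdict(asyncio.Lock)  # one lock per resource key
        self._resources = {}

    async def get_or_deploy(self, key):
        if key in self._resources:               # fast path
            return self._resources[key]
        async with self._locks[key]:
            if key in self._resources:           # re-check under the lock
                return self._resources[key]
            resource = await self._deploy(key)
            self._resources[key] = resource
            return resource


async def demo_keyed():
    calls = []

    async def deploy(key):
        calls.append(key)
        await asyncio.sleep(0)                   # simulated I/O
        return f"resource-for-{key}"

    deployer = KeyedDeployer(deploy)
    results = await asyncio.gather(
        deployer.get_or_deploy("gpu-a"),
        deployer.get_or_deploy("gpu-a"),
        deployer.get_or_deploy("gpu-b"),
    )
    return calls, results
```

Keying the lock by resource keeps unrelated deploys concurrent while serializing callers that want the same resource, regardless of which wrapper instance they come from.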

4. Tests — solid

Four tests covering concurrent deploy-once, flag set correctly, idempotency, and failure-path lock release. The failure-path test (added in the second commit) is the most important correctness guarantee and it's well structured.
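
The failure-path guarantee can be illustrated with a minimal sketch. This is not the PR's test: `RetryableWrapper`, `flaky_deploy`, and `demo_retry` are hypothetical names, and the deploy function fails exactly once to simulate a transient error.

```python
import asyncio


class RetryableWrapper:
    """Hypothetical stand-in for the fixed wrapper (not the real class)."""

    def __init__(self, deploy):
        self._deploy = deploy
        self._initialized = False
        self._stub = None
        self._init_lock = asyncio.Lock()

    async def _ensure_initialized(self):
        if self._initialized:
            return
        async with self._init_lock:        # released even if deploy raises
            if self._initialized:
                return
            self._stub = await self._deploy()
            self._initialized = True       # reached only when deploy succeeded


async def demo_retry():
    attempts = 0

    async def flaky_deploy():
        nonlocal attempts
        attempts += 1
        if attempts == 1:
            raise RuntimeError("transient deploy failure")
        return object()

    wrapper = RetryableWrapper(flaky_deploy)
    try:
        await wrapper._ensure_initialized()
    except RuntimeError:
        pass
    # After the failure: flag still False, lock no longer held.
    state_after_failure = (wrapper._initialized, wrapper._init_lock.locked())
    await wrapper._ensure_initialized()    # the retry goes through
    return state_after_failure, wrapper._initialized, attempts
```

Setting the flag as the last statement inside the `async with` block is what makes the retry possible: an exception skips the assignment, and the context manager still releases the lock.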

Verdict

PASS. The fix is correct, uses the right asyncio primitives for Python ≥3.10, and the flag ordering is right. Two asks: (1) acknowledge the timing assumption in the concurrency test or use the counter pattern above, and (2) confirm whether ResourceManager provides the cross-instance guarantee, or file a follow-up ticket if it doesn't.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copilot

Pull request overview

This PR addresses AE-2370 by preventing concurrent _ensure_initialized() calls on the same RemoteClassWrapper instance from triggering multiple deployments, using an asyncio.Lock with double-checked locking.

Changes:

Add an asyncio.Lock (self._init_lock) to RemoteClassWrapper and guard _ensure_initialized() with double-checked locking.
Add bug-probe tests validating single-deploy behavior under concurrent calls and retry behavior after transient deploy failures.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`src/runpod_flash/execute_class.py`	Adds an async init lock and wraps initialization to prevent double deployment under concurrency.
`tests/bug_probes/test_class_execution.py`	Adds async race-condition regression tests for `_ensure_initialized()` (including retry-on-failure).
`tests/bug_probes/__init__.py`	Initializes the new `bug_probes` test package (empty file).

runpod-Henrik

Follow-up on prior review

The four tests are solid — flag-set, idempotency, concurrent deploy-once, and the failure-path retry are all covered.

Two open items from the prior review remain, both non-blocking:

asyncio.sleep(0.05) timing — counter pattern not adopted. The sleep-based approach works in practice but can silently degrade to a no-op on a loaded CI runner where task2 doesn't reach gate.wait() before gate.set() fires. Low risk, but worth knowing.
Cross-instance double-deploy — if two separate create_remote_class() calls produce two RemoteClassWrapper instances for the same resource config, they each have their own _init_lock and could still race at the ResourceManager level. Confirming whether ResourceManager.get_or_deploy_resource guards this case, or filing a follow-up ticket if it doesn't, would close the loop.

Verdict: PASS. The fix is correct and the flag ordering is right.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

…double deploy Without a lock, concurrent calls to _ensure_initialized both pass the check and both call get_or_deploy_resource, wasting resources and orphaning one stub. Uses double-checked locking: fast-path check before lock acquisition, second check inside the lock. Closes AE-2370

runpod-Henrik reviewed Mar 16, 2026

deanq requested a review from Copilot March 16, 2026 19:58

Copilot started reviewing on behalf of deanq March 16, 2026 19:58 View session

Copilot AI reviewed Mar 16, 2026

deanq force-pushed the fix/AE-2370-ensure-initialized-race branch from c5ca393 to b076846 Compare March 17, 2026 20:00

runpod-Henrik reviewed Mar 18, 2026

deanq force-pushed the fix/AE-2370-ensure-initialized-race branch 3 times, most recently from 7b57673 to 914d41c Compare March 25, 2026 23:06

KAJdev approved these changes Mar 31, 2026

deanq added 3 commits April 1, 2026 12:15

fix(execute_class): address PR review feedback

3ac029e

- Replace misleading carried-over comment with accurate description - Add inline comments explaining double-checked locking pattern - Add failure-path test: deploy exception releases lock, allows retry

fix(tests): address PR #274 review feedback

0e1d2e9

- Rename TestNEW1_EnsureInitializedRace to TestEnsureInitializedRace - Update docstring to describe regression guard, not pre-fix state - Replace asyncio.sleep(0.05) with explicit deploy_entered Event

deanq force-pushed the fix/AE-2370-ensure-initialized-race branch from bda7ffb to 0e1d2e9 Compare April 1, 2026 19:23

deanq merged commit 4ef36c8 into main Apr 1, 2026
4 checks passed

deanq deleted the fix/AE-2370-ensure-initialized-race branch April 1, 2026 19:29

runpod-release-please-bot bot mentioned this pull request Apr 1, 2026

chore: release 1.11.3 #299

Merged
