Fix: Resolve watcher service deadlock during self-restart on folder deletion#1269
Fix: Resolve watcher service deadlock during self-restart on folder deletion#1269DeveloperAmrit wants to merge 1 commit into
Conversation
WalkthroughThis PR addresses a deadlock in the watcher service thread synchronization and updates OpenAPI schema documentation. The watcher changes prevent the worker thread from joining itself by detecting caller context and delegating internal restarts to daemon threads. Schema definitions are reordered for consistency. ChangesWatcher Thread Self-Restart Deadlock Prevention
OpenAPI Schema Documentation Updates
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
sync-microservice/app/utils/watcher.py (1)
334-351: ⚡ Quick winClarify restart return semantics for the async path.
The function returns
Truewith two different meanings: immediately after scheduling (in-thread path) versus after actual completion (external path). The docstring and main caller inroutes/watcher.pytreat the return value as completion status ("Folder watcher restarted successfully"), but the in-thread path only schedules the restart asynchronously. This semantic mismatch should be resolved by updating the docstring to distinguish between scheduled and completed states, or add a note clarifying the return value is "True if restart completed (external) or scheduled (in-thread)".Proposed patch
def watcher_util_restart_folder_watcher() -> bool: """ Restart the folder watcher by stopping the current one and starting fresh. Returns: - True if restart was successful, False otherwise + True if restart completed (external call) or was successfully scheduled + (in-thread call), False otherwise """ logger.info("Restarting folder watcher...") if watcher_thread and threading.current_thread() == watcher_thread: # We are inside the watcher thread, so we can't join ourselves. # Launch a background thread to do the restart. def restart_worker(): watcher_util_stop_folder_watcher() watcher_util_start_folder_watcher() threading.Thread(target=restart_worker, daemon=True).start() + logger.info("Watcher restart scheduled on background thread") return True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@sync-microservice/app/utils/watcher.py` around lines 334 - 351, The function watcher_util_restart_folder_watcher currently returns True both when the restart is scheduled asynchronously from inside the watcher thread and when it has completed from an external caller, causing ambiguous semantics for callers (e.g., routes/watcher.py). Update the watcher_util_restart_folder_watcher docstring to explicitly state the two return meanings: "True if restart completed when called externally, or True if restart was scheduled when called from inside the watcher thread (async)"; also mention that callers who need confirmation of completion should call watcher_util_stop_folder_watcher/watcher_util_start_folder_watcher directly or implement a synchronization mechanism. Reference watcher_util_restart_folder_watcher, watcher_util_stop_folder_watcher, watcher_util_start_folder_watcher and the caller routes/watcher.py in your change so readers can find the relevant logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@sync-microservice/app/utils/watcher.py`:
- Around line 343-351: Add an automated regression test that exercises the
self-restart path inside the watcher worker so we never hang on a self-join:
spawn a real watcher_thread (or simulate it via running the watcher worker
function on a new Thread), ensure threading.current_thread() == watcher_thread
when calling the restart logic, and invoke the code path that creates
restart_worker (the branch referencing watcher_thread and
threading.current_thread()). In the test, mock or patch
watcher_util_stop_folder_watcher and watcher_util_start_folder_watcher to record
calls, start the restart path from inside that watcher thread, and assert within
a short timeout that (a) the background restart thread was started, (b) mocked
stop/start functions were called, and (c) the test thread does not block (use a
join with timeout or an event to fail on hang). Target functions/symbols to
locate code: watcher_thread, restart_worker, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher, and the branch that checks
threading.current_thread() == watcher_thread.
---
Nitpick comments:
In `@sync-microservice/app/utils/watcher.py`:
- Around line 334-351: The function watcher_util_restart_folder_watcher
currently returns True both when the restart is scheduled asynchronously from
inside the watcher thread and when it has completed from an external caller,
causing ambiguous semantics for callers (e.g., routes/watcher.py). Update the
watcher_util_restart_folder_watcher docstring to explicitly state the two return
meanings: "True if restart completed when called externally, or True if restart
was scheduled when called from inside the watcher thread (async)"; also mention
that callers who need confirmation of completion should call
watcher_util_stop_folder_watcher/watcher_util_start_folder_watcher directly or
implement a synchronization mechanism. Reference
watcher_util_restart_folder_watcher, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher and the caller routes/watcher.py in your
change so readers can find the relevant logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a3f3ca87-e0f6-4f81-b9e1-9fafe54e788d
📒 Files selected for processing (2)
docs/backend/backend_python/openapi.jsonsync-microservice/app/utils/watcher.py
| if watcher_thread and threading.current_thread() == watcher_thread: | ||
| # We are inside the watcher thread, so we can't join ourselves. | ||
| # Launch a background thread to do the restart. | ||
| def restart_worker(): | ||
| watcher_util_stop_folder_watcher() | ||
| watcher_util_start_folder_watcher() | ||
|
|
||
| threading.Thread(target=restart_worker, daemon=True).start() | ||
| return True |
There was a problem hiding this comment.
Add an automated regression test for the self-restart path.
This is critical thread-lifecycle logic. Please add a test that triggers restart from inside the watcher worker path and verifies there is no self-join hang and restart completes.
As per coding guidelines, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@sync-microservice/app/utils/watcher.py` around lines 343 - 351, Add an
automated regression test that exercises the self-restart path inside the
watcher worker so we never hang on a self-join: spawn a real watcher_thread (or
simulate it via running the watcher worker function on a new Thread), ensure
threading.current_thread() == watcher_thread when calling the restart logic, and
invoke the code path that creates restart_worker (the branch referencing
watcher_thread and threading.current_thread()). In the test, mock or patch
watcher_util_stop_folder_watcher and watcher_util_start_folder_watcher to record
calls, start the restart path from inside that watcher thread, and assert within
a short timeout that (a) the background restart thread was started, (b) mocked
stop/start functions were called, and (c) the test thread does not block (use a
join with timeout or an event to fail on hang). Target functions/symbols to
locate code: watcher_thread, restart_worker, watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher, and the branch that checks
threading.current_thread() == watcher_thread.
Shevilll
left a comment
There was a problem hiding this comment.
Peer Review: Watcher Service Thread Deadlock Fix
Excellent diagnosis and elegant fix for this thread deadlock! Concurrency bugs involving background thread .join() calls are notorious for being hard to capture, debug, and resolve.
Your analysis of the deadlock in issue #1031 is spot on:
- When a folder deletion was detected,
watcher_threadcalledwatcher_util_restart_folder_watcher(), which in turn invokedwatcher_util_stop_folder_watcher(). - Inside
watcher_util_stop_folder_watcher(), the thread attempted to executewatcher_thread.join(timeout=5.0). - Since the caller thread was
watcher_thread, it was waiting for itself to exit. This blocked itself from ever exiting, resulting in a hang until the 5-second timeout expired.
Critique of the Proposed Solution
Your proposed fixes are highly effective and safe:
- Preventing Self-Joining: Checking
threading.current_thread() != watcher_threadprevents the thread from attempting to join itself. - Spawning the Async Restart Worker: Spawning a background daemon
restart_workerwhenthreading.current_thread() == watcher_threadis a very clever way to decouple the shutdown and startup sequence from the running thread itself. Since therestart_workeris on a separate thread, it can safely block onwatcher_thread.join(). Meanwhile, the originalwatcher_threadreturnsTrueand exits gracefully, unblocking the join!
Addressing CodeRabbit's Review Comments
CodeRabbit has requested changes on this PR regarding two distinct issues:
1. Return Semantics Ambiguity (Docstring Update)
Currently, watcher_util_restart_folder_watcher() returns True in two different scenarios:
- Synchronously completed (when called from an external thread).
- Asynchronously scheduled (when called from within
watcher_thread).
This can confuse callers like routes/watcher.py that might assume a True return value guarantees the new watcher is already up and running.
Recommendation: Update the docstring of watcher_util_restart_folder_watcher to clarify this distinction:
def watcher_util_restart_folder_watcher() -> bool:
"""
Restart the folder watcher by stopping the current one and starting fresh.
Returns:
True if the restart was successfully completed (when called from an external thread)
or successfully scheduled on a background thread (when called from within the watcher thread).
False otherwise.
Note:
If calling from inside the watcher thread, this operation is non-blocking (async).
Callers requiring immediate, synchronous confirmation of completion should call
watcher_util_stop_folder_watcher() and watcher_util_start_folder_watcher() directly
or implement a synchronization mechanism.
"""2. Automated Regression Test
To ensure this deadlock never creeps back in, adding an automated regression test is a highly recommended practice. Since the sync-microservice currently lacks a dedicated test suite, you can place this test inside a new test file or integrate it as a separate test case.
Here is a blueprint for a robust unit test using Python's unittest and unittest.mock frameworks. It simulates calling watcher_util_restart_folder_watcher from the watcher thread itself and asserts that the restart worker executes correctly without deadlocking:
import unittest
from unittest.mock import patch, MagicMock
import threading
import time
from app.utils.watcher import (
watcher_util_restart_folder_watcher,
watcher_util_stop_folder_watcher,
watcher_util_start_folder_watcher
)
class TestWatcherRestartDeadlock(unittest.TestCase):
@patch("app.utils.watcher.watcher_util_stop_folder_watcher")
@patch("app.utils.watcher.watcher_util_start_folder_watcher")
@patch("app.utils.watcher.watcher_thread")
def test_self_restart_schedules_async_andnt_deadlock(self, mock_watcher_thread_var, mock_start, mock_stop):
"""
Verify that calling restart from within the watcher thread does not deadlock,
returns True immediately, and schedules the stop/start calls on a daemon thread.
"""
# 1. Simulate that the active thread is the global watcher_thread
active_thread = threading.current_thread()
mock_watcher_thread_var.__eq__.return_value = True # force current_thread() == watcher_thread
# We manually patch the global watcher_thread to be the current executing thread
with patch("app.utils.watcher.watcher_thread", active_thread):
# 2. Call restart from the active watcher thread context
success = watcher_util_restart_folder_watcher()
# Assert it schedules and returns True immediately
self.assertTrue(success)
# Since it runs asynchronously, stop/start shouldn't be called *instantly*
# Wait briefly (e.g., 50ms) for the daemon thread to run
time.sleep(0.05)
# 3. Verify background worker executed stop & start
mock_stop.assert_called_once()
mock_start.assert_called_once()
if __name__ == "__main__":
unittest.main()By providing these minor updates, you will address CodeRabbit's requests and make this PR extremely robust and ready for merge! Great job on this contribution.
Addressed Issues:
Fixes #1031
Screenshots/Recordings:
Additional Notes:
Description
This PR fixes a bug where the sync-microservice watcher service would freeze and eventually fail when attempting to restart itself after detecting a deleted folder.
Previously, when a folder was deleted, the background thread (watcher_thread) triggered a restart sequence that called watcher_util_stop_folder_watcher(). Inside that stop function, the thread attempted to execute watcher_thread.join(). Since it was joining itself, this caused a deadlock that hung for the 5-second timeout, preventing a clean restart and throwing a runtime error.
Changes made
How to test
AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
Check one of the checkboxes below:
I have used the following AI models and tools: TODO
Checklist
Summary by CodeRabbit
Documentation
Bug Fixes