Skip to content

Fix deadlock, UAF, and lock inversion in PathWatcher#28104

Closed
robobun wants to merge 1 commit intomainfrom
claude/path-watcher-deadlock-fix-v2
Closed

Fix deadlock, UAF, and lock inversion in PathWatcher#28104
robobun wants to merge 1 commit intomainfrom
claude/path-watcher-deadlock-fix-v2

Conversation

@robobun
Copy link
Copy Markdown
Collaborator

@robobun robobun commented Mar 14, 2026

Summary

Targeted fix for three concurrency bugs in src/bun.js/node/path_watcher.zig that cause fs.watch() to deadlock or crash:

  • unregisterWatcher self-deadlock: The defer calling deinit() fired while this.mutex was still held (Zig defers are LIFO). Hoisted to a should_deinit flag checked by a defer registered before the lock, so LIFO ordering yields unlock → deinit.
  • unrefPendingTask UAF: Same pattern — deinit() called under this.mutex, then destroy(this) means the deferred mutex.unlock() writes to freed memory. Same fix pattern.
  • processWatcher AB/BA lock inversion: Worker thread held watcher.mutex → acquired manager.mutex via _decrementPathRef. Main thread held manager.mutex → wanted watcher.mutex. Fix: capture append result, unlock watcher.mutex before calling _decrementPathRef.

Also applies the same should_deinit pattern to unrefPendingDirectory to avoid self-deadlock when deinit() re-locks via setClosed().

This is the same approach as #27469, rebased onto current main.

Test plan

  • Debug build compiles successfully
  • fs.watch.test.ts passes (permission tests fail due to running as root — pre-existing, same on main)
  • CI green

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c

Three targeted fixes in path_watcher.zig:

1. unregisterWatcher self-deadlock: The defer that called deinit()
   fired while this.mutex was still held (Zig defers are LIFO).
   Hoist to a should_deinit flag checked by a defer registered
   before the lock, so LIFO ordering yields unlock then deinit.

2. unrefPendingTask UAF: Same pattern — deinit() called under
   this.mutex, then destroy(this) means the deferred unlock is
   a UAF. Same fix: should_deinit flag with early defer.

3. processWatcher AB/BA lock inversion: Worker thread held
   watcher.mutex then acquired manager.mutex via _decrementPathRef.
   Main thread held manager.mutex then wanted watcher.mutex.
   Fix: capture append result, unlock watcher.mutex before calling
   _decrementPathRef.

Also fixes unrefPendingDirectory with the same should_deinit
pattern to avoid self-deadlock when deinit() re-locks this.mutex
via setClosed().

https://claude.ai/code/session_01PxdzvsDCznAw3KZ5zSPF5c
@robobun
Copy link
Copy Markdown
Collaborator Author

robobun commented Mar 14, 2026

Updated 8:39 AM PT - Mar 14th, 2026

@claude, your commit 693d6c0 has 10 failures in Build #39620 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 28104

That installs a local version of the PR into your bun-28104 executable, so you can run:

bun-28104 --bun

@github-actions
Copy link
Copy Markdown
Contributor

Found 8 issues this PR may fix:

  1. Deadlock: all threads permanently blocked due to lock ordering violation between allocator and thread pool (macOS arm64, Bun 1.3.5 SEA) #26762 - Deadlock with lock ordering violation between allocator and thread pool - directly addresses the mutex deadlock prevention
  2. fs.watch() throws unhandled EACCES when a file inside watched directory is not readable #28038 - fs.watch() throws unhandled EACCES - prevents crashes during error conditions with deferred cleanup
  3. fs.watch does not work after previous .close() #18919 - fs.watch does not work after previous .close() - fixes cleanup issues that prevent proper reuse after close operations
  4. fs.watch stops emitting change events for a file after it is deleted and recreated (Linux: WSL2/Ubuntu) #24875 - fs.watch stops emitting change events after file deletion/recreation - prevents state corruption during file lifecycle events
  5. Bun watch #19004 - Bun watch crash on Windows - prevents use-after-free crashes during watcher cleanup
  6. PANIC on hot-reload when modifying imported modules (release ,debug) and HUGE memory leak (debug). #8513 - PANIC on hot-reload with memory leak - addresses improper cleanup causing panics and memory leaks
  7. Possible race condition with --filter #10357 - Possible race condition with --filter - fixes threading issues with improved mutex ordering
  8. fs.watch goes to infinite loop with big files #6955 - fs.watch infinite loop with big files - prevents deadlocks that cause infinite loops during cleanup

If this is helpful, consider adding Fixes number to the PR description to auto-close the issue on merge.

Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 14, 2026

Walkthrough

Refactors deinitialization logic in path watcher to defer cleanup operations until after mutex locks are released, avoiding potential deadlocks and use-after-free scenarios during task and directory watcher teardown.

Changes

Cohort / File(s) Summary
Path Watcher Deinitialization Safety
src/bun.js/node/path_watcher.zig
Replaces direct deinit() calls with deferred cleanup patterns using should_deinit flags, ensuring deinitialization occurs after releasing locks. Adjusts reference counting and control flow around watcher teardown to prevent deadlocks and resource leaks during error scenarios.
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing three concurrency bugs (deadlock, UAF, lock inversion) in PathWatcher, which is the core objective of the changeset.
Description check ✅ Passed The description covers both required template sections: detailed explanation of what the PR does with specific bug fixes, and a test plan showing local verification steps completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/bun.js/node/path_watcher.zig`:
- Around line 42-55: The race is that deinit_on_last_task can be set after the
last task cleared has_pending_tasks but before deinit() checks
hasPendingTasks(), leaving teardown untriggered; fix by making
PathWatcherManager.deinit() synchronize with the same mutex or atomically
coordinate with the pending-task flag: either grab the same this.mutex (the
mutex used by unrefPendingTask()/the block that does has_pending_tasks.store)
around the hasPendingTasks() check and the setting of deinit_on_last_task, or
perform an atomic read/compare (e.g. load has_pending_tasks with appropriate
ordering and if it is zero call the manager teardown immediately) and then store
deinit_on_last_task — ensure the code path that sets deinit_on_last_task cannot
race past a concurrent pending_tasks==0 store without triggering deinit
(reference symbols: PathWatcherManager.deinit(),
hasPendingTasks()/has_pending_tasks, deinit_on_last_task,
this.mutex/unrefPendingTask).
- Around line 847-863: The worker path can race into double-destroy by deferring
this.deinit() when it observes this.isClosed() while the closer concurrently
tears down the watcher; to fix, do not schedule deinit from the worker side:
inside the block that runs when pending_directories becomes 0 (the code using
this.pending_directories, this.has_pending_directories.store, and the isClosed()
check), remove the should_deinit/defer deinit logic and instead perform only
safe teardown work under the mutex (e.g. call unregisterWatcher() or set a
"closed-and-drained" state) and let the close path perform the actual
this.deinit() once it holds the proper ownership; ensure you reference and use
this.pending_directories, this.has_pending_directories, this.isClosed(),
unregisterWatcher(), and deinit() to implement this change so no deferred deinit
can run on freed memory.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c8035eef-7d0c-4045-802f-fb725d7d3456

📥 Commits

Reviewing files that changed from the base of the PR and between 1f134a1 and 693d6c0.

📒 Files selected for processing (1)
  • src/bun.js/node/path_watcher.zig

Comment on lines +42 to +55
// deinit() may destroy(this). Defer it until after unlock so we don't
// unlock() a freed mutex.
var should_deinit = false;
defer if (should_deinit) this.deinit();

this.mutex.lock();
defer this.mutex.unlock();
this.pending_tasks -= 1;
if (this.deinit_on_last_task and this.pending_tasks == 0) {
if (this.pending_tasks == 0) {
// Clear unconditionally: if tasks drain to zero before deinit() runs,
// gating this on deinit_on_last_task leaves the flag stale-true and
// deinit() keeps deferring on a count that is already zero.
this.has_pending_tasks.store(false, .release);
this.deinit();
if (this.deinit_on_last_task) should_deinit = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

deinit_on_last_task can still be armed after the final task already exited.

This fixes the stale-has_pending_tasks case, but PathWatcherManager.deinit() still does hasPendingTasks() before taking this.mutex. If the last worker reaches Line 50 after that load but before deinit() stores deinit_on_last_task = true, Line 55 stays false, the worker returns, and no later unrefPendingTask() exists to finish teardown. That leaves the manager/main watcher leaked after the last watcher closes on the error path.

Suggested fix
 fn deinit(this: *PathWatcherManager) void {
@@
-        if (this.hasPendingTasks()) {
-            this.mutex.lock();
-            defer this.mutex.unlock();
-            // deinit when all tasks are done
-            this.deinit_on_last_task = true;
-            return;
-        }
+        {
+            this.mutex.lock();
+            defer this.mutex.unlock();
+            if (this.pending_tasks > 0) {
+                // deinit when all tasks are done
+                this.deinit_on_last_task = true;
+                return;
+            }
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/node/path_watcher.zig` around lines 42 - 55, The race is that
deinit_on_last_task can be set after the last task cleared has_pending_tasks but
before deinit() checks hasPendingTasks(), leaving teardown untriggered; fix by
making PathWatcherManager.deinit() synchronize with the same mutex or atomically
coordinate with the pending-task flag: either grab the same this.mutex (the
mutex used by unrefPendingTask()/the block that does has_pending_tasks.store)
around the hasPendingTasks() check and the setting of deinit_on_last_task, or
perform an atomic read/compare (e.g. load has_pending_tasks with appropriate
ordering and if it is zero call the manager teardown immediately) and then store
deinit_on_last_task — ensure the code path that sets deinit_on_last_task cannot
race past a concurrent pending_tasks==0 store without triggering deinit
(reference symbols: PathWatcherManager.deinit(),
hasPendingTasks()/has_pending_tasks, deinit_on_last_task,
this.mutex/unrefPendingTask).

Comment on lines +847 to +863
// deinit() calls setClosed() which re-locks this.mutex, and may then
// proceed to destroy(this). Defer it until after unlock so we don't
// self-deadlock or unlock() a freed mutex.
var should_deinit = false;
defer if (should_deinit) this.deinit();

this.mutex.lock();
defer this.mutex.unlock();
this.pending_directories -= 1;
if (this.isClosed() and this.pending_directories == 0) {
if (this.pending_directories == 0) {
// Clear unconditionally: if the scan drains to zero before close()
// runs (the common case — scan is fast, close happens later),
// gating this on isClosed() leaves the flag stale-true. deinit()
// then early-returns on hasPendingDirectories() forever,
// unregisterWatcher never runs, and every fd the scan opened leaks.
this.has_pending_directories.store(false, .release);
this.deinit();
if (this.isClosed()) should_deinit = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

PathWatcher.deinit() can race the last unrefPendingDirectory() into a double destroy.

deinit() still calls setClosed() and then separately loads hasPendingDirectories(). If the close path releases this.mutex in setClosed() and the worker reaches Lines 856-863 before that load, the worker queues a deferred this.deinit() because closed == true while the closer simultaneously sees hasPendingDirectories() == false and tears the watcher down immediately. The deferred call then runs on freed memory.

Suggested fix
 pub fn deinit(this: *PathWatcher) void {
-        this.setClosed();
-        if (this.hasPendingDirectories()) {
-            // will be freed on last directory
-            return;
-        }
+        {
+            this.mutex.lock();
+            defer this.mutex.unlock();
+            this.closed.store(true, .release);
+            if (this.pending_directories > 0) {
+                // will be freed on last directory
+                return;
+            }
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// deinit() calls setClosed() which re-locks this.mutex, and may then
// proceed to destroy(this). Defer it until after unlock so we don't
// self-deadlock or unlock() a freed mutex.
var should_deinit = false;
defer if (should_deinit) this.deinit();
this.mutex.lock();
defer this.mutex.unlock();
this.pending_directories -= 1;
if (this.isClosed() and this.pending_directories == 0) {
if (this.pending_directories == 0) {
// Clear unconditionally: if the scan drains to zero before close()
// runs (the common case — scan is fast, close happens later),
// gating this on isClosed() leaves the flag stale-true. deinit()
// then early-returns on hasPendingDirectories() forever,
// unregisterWatcher never runs, and every fd the scan opened leaks.
this.has_pending_directories.store(false, .release);
this.deinit();
if (this.isClosed()) should_deinit = true;
pub fn deinit(this: *PathWatcher) void {
{
this.mutex.lock();
defer this.mutex.unlock();
this.closed.store(true, .release);
if (this.pending_directories > 0) {
// will be freed on last directory
return;
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/bun.js/node/path_watcher.zig` around lines 847 - 863, The worker path can
race into double-destroy by deferring this.deinit() when it observes
this.isClosed() while the closer concurrently tears down the watcher; to fix, do
not schedule deinit from the worker side: inside the block that runs when
pending_directories becomes 0 (the code using this.pending_directories,
this.has_pending_directories.store, and the isClosed() check), remove the
should_deinit/defer deinit logic and instead perform only safe teardown work
under the mutex (e.g. call unregisterWatcher() or set a "closed-and-drained"
state) and let the close path perform the actual this.deinit() once it holds the
proper ownership; ensure you reference and use this.pending_directories,
this.has_pending_directories, this.isClosed(), unregisterWatcher(), and deinit()
to implement this change so no deferred deinit can run on freed memory.

Comment on lines 846 to 865
pub fn unrefPendingDirectory(this: *PathWatcher) void {
// deinit() calls setClosed() which re-locks this.mutex, and may then
// proceed to destroy(this). Defer it until after unlock so we don't
// self-deadlock or unlock() a freed mutex.
var should_deinit = false;
defer if (should_deinit) this.deinit();

this.mutex.lock();
defer this.mutex.unlock();
this.pending_directories -= 1;
if (this.isClosed() and this.pending_directories == 0) {
if (this.pending_directories == 0) {
// Clear unconditionally: if the scan drains to zero before close()
// runs (the common case — scan is fast, close happens later),
// gating this on isClosed() leaves the flag stale-true. deinit()
// then early-returns on hasPendingDirectories() forever,
// unregisterWatcher never runs, and every fd the scan opened leaks.
this.has_pending_directories.store(false, .release);
this.deinit();
if (this.isClosed()) should_deinit = true;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Race condition in unrefPendingDirectory: both the main thread (via detach()->deinit()) and a worker thread (via the deferred should_deinit path) can execute PathWatcher.deinit() concurrently, causing a double-free/UAF. After the worker clears has_pending_directories and releases the mutex, the main thread's deinit() sees hasPendingDirectories()==false at line 909 and proceeds to destroy(this) at line 936, while the worker's deferred this.deinit() at line 851 then fires on freed memory. Adding an atomic deinit_started flag with a compare-and-swap guard in deinit() would ensure only one thread proceeds with cleanup.

Extended reasoning...

The Bug

The deferred should_deinit pattern introduced in unrefPendingDirectory (lines 846-865) creates a race window between the worker thread and the main thread that can lead to double-free or use-after-free.

Race Scenario Step-by-Step

Consider a user calling watcher.close() (which triggers detach()->deinit() on the main thread) while a DirectoryRegisterTask worker is finishing its last directory scan (which calls unrefPendingDirectory on the worker thread):

  1. Thread A (main): detach() -> deinit() -> setClosed() acquires this.mutex, stores closed=true, releases this.mutex (lines 840-843)
  2. Thread B (worker): unrefPendingDirectory() acquires this.mutex (line 853), decrements pending_directories to 0 (line 855), stores has_pending_directories=false with .release ordering (line 862), reads isClosed() -> true (set by Thread A in step 1), sets should_deinit=true (line 863), then the defer this.mutex.unlock() fires (line 854)
  3. Thread A: hasPendingDirectories() (line 909) performs an atomic .acquire load of has_pending_directories -> sees false (Thread B cleared it in step 2; visible via the happens-before chain: A mutex.unlock -> B mutex.lock -> B .release store -> B mutex.unlock -> A .acquire load)
  4. Thread A: Proceeds past the guard at line 909, calls manager.unregisterWatcher(this) (line 925), then bun.default_allocator.destroy(this) (line 936) — PathWatcher is now freed
  5. Thread B: The outer defer fires (line 851): should_deinit == true -> this.deinit() -> this.setClosed() -> this.mutex.lock()UAF: dereferences freed this

Even if step 5 doesn't crash on the mutex access, the second deinit() proceeds to call unregisterWatcher and destroy again, causing a double-free.

Why This Wasn't Possible Before

Before this PR, unrefPendingDirectory called this.deinit() while still holding this.mutex. Since deinit()->setClosed() tries to re-acquire this.mutex (a non-recursive lock), this self-deadlocked — which is the bug this PR intends to fix. The deadlock, while itself a bug, prevented the worker thread from ever completing deinit(), so only the main thread could proceed with cleanup. The PR correctly fixes the self-deadlock by deferring deinit() until after the mutex is released, but this creates the window between mutex.unlock (DEFER-2 in LIFO order) and deinit (DEFER-1) where the main thread can observe the cleared has_pending_directories flag and race ahead to destroy the watcher.

Impact

This is a real concurrency bug that can cause crashes (segfault on UAF or double-free) in production when a user closes a recursive fs.watch() watcher while directory scanning is still in progress. The race window is narrow but plausible since it only requires the main-thread close and the last worker directory-unref to overlap.

Suggested Fix

Add an atomic deinit_started flag (e.g., std.atomic.Value(bool)) to PathWatcher and guard deinit() with an atomic compare-and-swap:

pub fn deinit(this: *PathWatcher) void {
    // Ensure only one thread proceeds with cleanup
    if (this.deinit_started.cmpxchgStrong(false, true, .acq_rel, .acquire) != null) return;
    this.setClosed();
    // ... rest of deinit
}

This ensures that whichever thread reaches deinit() first performs the cleanup, and the other thread's call is a no-op.

Comment on lines +629 to 635
var should_deinit = false;
defer if (should_deinit) this.deinit();

this.mutex.lock();
defer this.mutex.unlock();

var watchers = this.watchers.slice();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 Pre-existing AB/BA lock inversion between onError() and deinit(): onError() acquires this.mutex then default_manager_mutex (lines 311, 326), while deinit() acquires them in the opposite order (lines 671, 685). This PR makes the deadlock path more reachable by deferring deinit() outside the lock in unregisterWatcher, but the root cause (inconsistent lock ordering in onError) is pre-existing. Consider fixing onError() to release this.mutex before acquiring default_manager_mutex, matching the order used by deinit().

Extended reasoning...

What the bug is

onError() (line 306) and deinit() (line 669) acquire this.mutex and default_manager_mutex in opposite orders, creating a classic AB/BA lock inversion:

  • onError() (watcher thread): acquires this.mutex (line 311), then default_manager_mutex (line 326)
  • deinit(): acquires default_manager_mutex (line 671), then conditionally acquires this.mutex (line 685, when hasPendingTasks() is true)

How the PR makes this more reachable

Before this PR, deinit() was called from unregisterWatcher and unrefPendingTask while this.mutex was already held. This caused a self-deadlock at line 685 (re-locking a non-recursive mutex), which was itself a bug — but it also meant the cross-thread AB/BA path could never be reached from those call sites.

After this PR, deinit() is deferred until after this.mutex is released (the should_deinit pattern). This correctly fixes the self-deadlock and UAF, but now deinit() enters with no locks held and acquires default_manager_mutex first, then this.mutex — the opposite order from onError().

Concrete deadlock scenario

  1. Thread A (main thread): unregisterWatcher sets should_deinit = true, releases this.mutex, deferred deinit() fires, acquires default_manager_mutex (line 671), calls hasPendingTasks() which returns true (directory scans in flight), tries this.mutex.lock() at line 685 — blocks.
  2. Thread B (watcher thread): onError() fires, acquires this.mutex (line 311), tries default_manager_mutex.lock() at line 326 — blocks.
  3. Result: classic deadlock — each thread holds the lock the other needs.

Why this is pre-existing

The inconsistent lock ordering in onError() is not introduced by this PR — onError() is unchanged. Moreover, onError() itself calls this.deinit() at line 332 after releasing both locks (via the block scope ending at line 329). This means a concurrent deinit() from another path (e.g., a second error, or a watcher closing) could already race against onError() acquiring default_manager_mutex while holding this.mutex. The PR increases the surface area for hitting this inversion but does not introduce the fundamental ordering inconsistency.

Impact and fix

The deadlock requires a watcher error occurring simultaneously with the last watcher being unregistered while directory scan tasks are pending — a narrow but real window. The fix would be to make onError() release this.mutex before acquiring default_manager_mutex, similar to how processWatcher was fixed in this same PR. For example, the watchers block (lines 310-328) could be split: iterate and notify watchers under this.mutex, release it, then acquire default_manager_mutex to clear default_manager.

@robobun
Copy link
Copy Markdown
Collaborator Author

robobun commented Apr 17, 2026

Superseded by #29391 — same approach, rebased onto current main with no behavior changes beyond the defer-ordering fix (the unconditional has_pending_* clear in this PR caused ASAN regressions).

@robobun robobun closed this Apr 17, 2026
Jarred-Sumner pushed a commit that referenced this pull request Apr 17, 2026
## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of #27469 (by @chrislloyd) onto
current main — previous attempts (#27957, #28088, #28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in #28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes #27469, #27957, #28088, #28104.

---

**Re: duplicate-bot flagging #26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging #18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
structwafel pushed a commit to structwafel/bun that referenced this pull request Apr 25, 2026
…9391)

## What does this PR do?

Targeted fix for four concurrency bugs in
`src/bun.js/node/path_watcher.zig` that cause `fs.watch()` to deadlock
or crash. This is the minimal port of oven-sh#27469 (by @chrislloyd) onto
current main — previous attempts (oven-sh#27957, oven-sh#28088, oven-sh#28104) added scope
and introduced ASAN regressions. This PR applies only the defer-ordering
fix.

## Root causes

### `unregisterWatcher` self-deadlock (primary)

The deinit-on-last-watcher `defer` was registered *after* the
`mutex.lock()`/`defer mutex.unlock()` pair:

```zig
this.mutex.lock();
defer this.mutex.unlock();           // fires last
var watchers = this.watchers.slice();
defer {                               // fires BEFORE unlock
    if (this.deinit_on_last_watcher and this.watcher_count == 0) {
        this.deinit();                // called holding this.mutex
    }
}
```

Zig defers fire LIFO, so `deinit()` ran while still holding
`this.mutex`. `deinit()` re-acquires `this.mutex` when
`hasPendingTasks()` is true. The mutex is non-recursive → self-deadlock
(observed as `__ulock_wait2` hang on macOS, debug-build `panic: Deadlock
detected` on Linux).

Also UAF: if `deinit()` completes it calls `destroy(this)`, then the
deferred `mutex.unlock()` writes to freed memory.

### `unrefPendingTask` UAF

Same shape: `deinit()` called while holding `this.mutex`. If it
completes, `destroy(this)` means the deferred `mutex.unlock()` is a UAF.

### `unrefPendingDirectory` self-deadlock

`deinit()` calls `setClosed()` which re-locks `this.mutex`. Called while
holding → self-deadlock.

### `processWatcher` AB/BA lock inversion

Worker thread holds `watcher.mutex` → calls
`manager._decrementPathRef()` on the OOM error path, which acquires
`manager.mutex`. Main thread `unregisterWatcher` holds `manager.mutex` →
wants `watcher.mutex`. Classic AB/BA deadlock.

## Fix

All four use the same pattern: set a `should_deinit` flag under the
lock, checked by a `defer` registered *before* the lock/unlock pair.
LIFO ordering then yields unlock → deinit. For `processWatcher`, capture
the append result and release `watcher.mutex` before calling
`_decrementPathRef`.

No semantic changes to when the `has_pending_*` atomics are cleared —
the conditions remain exactly as before, only `deinit()` is moved
outside the lock.

## How did you verify your code works?

- `bun bd test test/js/node/watch/fs.watch.deadlock.test.ts` passes
- Without the fix (src/ stashed), the same test fails with `panic:
Deadlock detected` from the debug-build mutex checker
- All Node.js parallel `test-fs-watch-recursive-*.js` tests pass under
ASAN (these regressed in oven-sh#28104)
- `test/regression/issue/3657.test.ts` passes under ASAN
- `test/js/node/watch/fs.watch.test.ts` — 30 pass, 2 pre-existing
root-permission failures (same on main)

Supersedes oven-sh#27469, oven-sh#27957, oven-sh#28088, oven-sh#28104.

---

**Re: duplicate-bot flagging oven-sh#26385** — that PR fixes a different lock
inversion (`PathWatcherManager.mutex` ↔ `Watcher.mutex`, around
`main_watcher.remove()`). This PR fixes `PathWatcherManager.mutex`
self-deadlock and `PathWatcher.mutex` ↔ `PathWatcherManager.mutex`
inversion. They're complementary, not duplicates.

**Re: find-issues-bot flagging oven-sh#18919** — tested the repro; it hits a
separate pre-existing use-after-poison in the File Watcher thread that
also reproduces on main without this change. Not claiming that fix here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants