fix: stop host-daemon from resurrecting destroyed environments (native watcher crash) by brsbl · Pull Request #58 · ymichael/bb

brsbl · 2026-05-30T02:29:23Z

Fix: host-daemon resurrecting destroyed environments (native file-watcher crash) + watch-lifecycle hardening

Fixes the native file-watcher crash that was repeatedly killing the desktop app, plus related watch-lifecycle hardening.

Root cause

requireWorkspaceEnvironment → RuntimeManager.ensureEnvironment re-provisioned and re-subscribed an FSEvents watcher for any environment named by a workspace.* command, with no guard against already-destroyed envs. With ~300 destroyed worktrees, every workspace.status poll resurrected a dead env + watcher → continuous FSEvents churn → null-pointer segfault in @parcel/watcher (watcher.node, FSEventsCallback → DirTree::add/find) → whole-app crash (EXC_BAD_ACCESS). Crash reports: bb-2026-05-29-145501.ips, bb-2026-05-29-183342.ips.

Changes

Tombstone destroyed envs so they're never re-watched.
Idempotent environment.destroy.
reconcileLiveEnvironments on every (re)connect, driven by a new required liveEnvironmentIds field in the session-open response — drops watchers/runtimes for idle envs the server no longer lists as live, keeps envs with active work.
Bounded WorkspaceStatusWatcher retry (60 attempts, resets on success; no tight loop).
P2-A heal-gap fix: reconcile now also lifts a tombstone for any env that reappears in liveEnvironmentIds, so an idle env can't get stuck tombstoned-but-ready after a failed teardown.
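
The tombstone and idempotent-destroy changes above can be pictured with a minimal TypeScript sketch. This is a simplified model under stated assumptions, not the PR's code: `RuntimeManagerSketch`, `EnvironmentDestroyedError`, `isWatching`, and the return values of `destroyEnvironment` are hypothetical. Only the method names `ensureEnvironment`, `requireWorkspaceEnvironment`, `destroyEnvironment`, the `destroyedEnvironmentIds` set, and the `environment_destroyed` code come from the PR text.

```typescript
// Hypothetical, simplified model of the tombstone guard (not the real RuntimeManager).
class EnvironmentDestroyedError extends Error {
  readonly code = "environment_destroyed";
}

class RuntimeManagerSketch {
  // Live environments; a real entry would own a runtime and an FSEvents watcher.
  private readonly entries = new Map<string, { watching: boolean }>();
  // Tombstones: ids the daemon has destroyed and must not re-watch.
  private readonly destroyedEnvironmentIds = new Set<string>();

  // Explicit (re)provisioning, e.g. from thread.start: clears any tombstone.
  ensureEnvironment(envId: string): void {
    this.destroyedEnvironmentIds.delete(envId);
    if (!this.entries.has(envId)) {
      this.entries.set(envId, { watching: true });
    }
  }

  // Path taken by workspace.* commands: refuses tombstoned environments
  // instead of silently re-provisioning them.
  requireWorkspaceEnvironment(envId: string): void {
    if (this.destroyedEnvironmentIds.has(envId)) {
      throw new EnvironmentDestroyedError(`environment ${envId} was destroyed`);
    }
    this.ensureEnvironment(envId);
  }

  // Idempotent destroy: records the tombstone even when no live entry exists,
  // and a repeat call succeeds without recreating anything.
  destroyEnvironment(envId: string): "destroyed" | "already_destroyed" {
    const hadEntry = this.entries.delete(envId);
    this.destroyedEnvironmentIds.add(envId);
    return hadEntry ? "destroyed" : "already_destroyed";
  }

  isWatching(envId: string): boolean {
    return this.entries.has(envId);
  }
}
```

In this model a `workspace.status` poll for a destroyed environment fails fast with `environment_destroyed` and never creates a watcher, which is the property that stops the FSEvents churn.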

Validation

Full-repo typecheck green.
Tests green: host-daemon, host-watcher, contract, db, server — plus new tests for each scope above (incl. P2-A).

Known / deferred (not in this PR)

P2-B: the session-open contract is now required+strict on both sides, so a mixed-version (old daemon ↔ new server) reconnect fails session-open. Fine for the bundled app, which restarts server+daemon together. Accepted.
P3: a thread.start can briefly lift a tombstone racing a just-processed destroy; self-heals on the next reconcile, no FSEvents leak. No action.
Deferred: upgrade @parcel/watcher past 2.5.6 and fix the darwin-x64-vs-arm64 prebuild mismatch. The native segfault can't be caught from JS; eliminating the churn (this PR) is the real fix. Separate follow-up.
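
To make the P2-B trade-off concrete, here is a hedged sketch of what a required and strict session-open check could look like. The real contract is not shown in this PR page, so `parseSessionOpenResponse`, `SessionOpenResponse`, and the `sessionId` field are assumptions made for illustration; only `liveEnvironmentIds` is named in the source.

```typescript
// Hypothetical response shape; only liveEnvironmentIds is taken from the PR text.
interface SessionOpenResponse {
  sessionId: string; // assumed field, for illustration only
  liveEnvironmentIds: string[];
}

const ALLOWED_KEYS = new Set(["sessionId", "liveEnvironmentIds"]);

// Strict parse: rejects unknown keys and a missing liveEnvironmentIds field.
function parseSessionOpenResponse(raw: unknown): SessionOpenResponse {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("session-open response must be an object");
  }
  const record = raw as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    if (!ALLOWED_KEYS.has(key)) {
      throw new Error(`unexpected field: ${key}`);
    }
  }
  const sessionId = record["sessionId"];
  const liveEnvironmentIds = record["liveEnvironmentIds"];
  if (typeof sessionId !== "string") {
    throw new Error("sessionId must be a string");
  }
  if (
    !Array.isArray(liveEnvironmentIds) ||
    !liveEnvironmentIds.every((id) => typeof id === "string")
  ) {
    throw new Error("liveEnvironmentIds is required and must be a string array");
  }
  return { sessionId, liveEnvironmentIds: liveEnvironmentIds as string[] };
}
```

Under this kind of check, a payload from a peer that predates the field is rejected outright, which is why a mixed-version reconnect fails rather than degrading silently.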

Incident report (full context)

What was actually crashing the app (fixed)

Daemon resurrecting dead environments → file-watcher crash (the main one) — see Root cause above. Fixed by this PR.
Two duplicate backends fighting over the database — two orphaned pnpm run dev stacks had been running since May 22, each a full second bb server+daemon pointed at the same data dir + DB. Multiple backends contending on one SQLite file is the likely cause of a separate database-layer crash (better_sqlite3 segfault, bb-2026-05-30-004416.ips). Operational fix (outside this PR): both stacks killed.
Database bloat — bug #1 (Fix env-daemon CI flakes) generated a flood of command records that were never pruned; host_daemon_commands reached 174k rows / 388 MB, pushing bb.db to 727 MB. Operational fix (outside this PR): pruned terminal commands + VACUUM → 177 MB.

Looked alarming but harmless

The "Failed to reprime app data change cache" ENOENT flood = references to deleted thread-storage folders. Not a crash.
Stale-data warnings: a thread.rename on a provider-less thread; the status app posting to a deleted thread (404). Cosmetic.

The desktop app hard-crashed with a native @parcel/watcher segfault (FSEventsCallback -> DirTree::add/find). Root cause is an in-memory watch-lifecycle leak in the host-daemon: requireWorkspaceEnvironment -> RuntimeManager.ensureEnvironment re-provisions and re-subscribes an FSEvents watcher for ANY environment referenced by a workspace.* command, with no guard against environments the daemon already destroyed. With ~300 destroyed managed worktrees in the moss project, every workspace.status poll resurrected a dead environment + watcher, churning FSEvents and feeding the native crash.

Fix (daemon-owned watch/runtime lifecycle):

- RuntimeManager tombstones destroyed environments; destroyEnvironment records the tombstone (even with no live entry) and requireWorkspaceEnvironment refuses to reconnect a tombstoned env (ExpectedCommandDispatchError "environment_destroyed"), so it is never re-watched. ensureEnvironment clears the tombstone when an env is explicitly (re)provisioned.
- environment.destroy is idempotent: a repeat destroy returns success instead of resurrecting the workspace.
- reconcileLiveEnvironments(liveIds), driven by a new liveEnvironmentIds field on the session-open response, runs on every (re)connect. It drops watchers + runtimes for idle environments the server no longer considers live (destroyed while the daemon was disconnected, whose destroy command never arrived) and tombstones them. Environments with active threads or terminals are never dropped.
- WorkspaceStatusWatcher retries are now bounded (give up after a capped number of attempts) so a permanently-missing/invalid path stops re-subscribing instead of retrying forever.

Tests: RuntimeManager tombstone + reconcile behavior; dispatch-level resurrection guard + idempotent destroy; bounded watcher retry; server session-open returns only non-destroyed environments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
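
The bounded-retry behavior described for WorkspaceStatusWatcher can be sketched as a small failure counter. This is an illustration under assumptions, not the watcher's actual code: `BoundedRetry`, `recordFailure`, `recordSuccess`, and `isExhausted` are hypothetical names, and the delay between attempts is left out. The default cap of 60 and the reset-on-success rule are taken from the PR description.

```typescript
// Hypothetical retry budget: allows a capped number of consecutive failures.
class BoundedRetry {
  private failures = 0;
  private readonly maxAttempts: number;

  constructor(maxAttempts: number = 60) {
    this.maxAttempts = maxAttempts;
  }

  // Record a failed subscribe attempt; returns true if another retry is allowed.
  recordFailure(): boolean {
    this.failures += 1;
    return this.failures < this.maxAttempts;
  }

  // A successful subscribe resets the budget, so only consecutive failures count.
  recordSuccess(): void {
    this.failures = 0;
  }

  isExhausted(): boolean {
    return this.failures >= this.maxAttempts;
  }
}
```

A caller would schedule the next subscribe attempt only while `recordFailure()` returns true, so a permanently missing path stops being re-subscribed once the budget is spent.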

brsbl · 2026-05-30T02:45:06Z

Safety review: GO — no P0/P1. Tracking the reviewer's P2/P3 follow-ups here (documentation only; no code change in this PR). Also appended to the PR description's follow-ups section.

Safety-review follow-ups (review came back GO — no P0/P1)

P2-A — reconcile does not heal a stuck tombstone (idle managed-worktree only, recoverable): destroyEnvironment (runtime-manager.ts) tombstones before the teardown that can throw (destroyedEnvironmentIds.add(...) then runtime.shutdown() / workspace.destroy()). If teardown throws anything other than path_not_found, the command fails and the server reverts the env destroying → ready, but the daemon stays tombstoned — so every workspace.status / workspace.diff for that idle env returns environment_destroyed until a thread.start/terminal lifts the tombstone via ensureEnvironment. reconcileLiveEnvironments only adds tombstones (it iterates entries, and a tombstoned env has no entry), so reconnect does not heal it. Suggested fix: in reconcileLiveEnvironments, also remove from destroyedEnvironmentIds any id present in liveEnvironmentIds; or only tombstone after teardown succeeds. Impact: idle managed-worktree only, recoverable, never affects active threads.
P2-B — mixed-version session-open is incompatible by design: the session-open response liveEnvironmentIds field is now required + strict on both sides, so an old-daemon ↔ new-server (or vice-versa) reconnect fails session-open. Fine for the bundled desktop app, which restarts server + daemon together (the hot-swap quits + relaunches the whole app), but noted for any independent/rolling deploy.
P3 (minor) — thread.start/terminal lifts the tombstone unconditionally via ensureEnvironment. A thread.start racing a just-processed destroy can lift the tombstone; it self-heals on the next reconcile and causes no FSEvents leak (createEntry provisions before subscribing).

Safety-review follow-up P2-A.

destroyEnvironment tombstones an environment before the teardown (runtime.shutdown()/workspace.destroy()) that can throw. If teardown fails with anything other than path_not_found the command errors, the server reverts the environment destroying->ready, but the daemon stays tombstoned. reconcileLiveEnvironments only ADDED tombstones (it iterates entries, and a tombstoned env has no entry), so reconnect never healed it and every workspace.status/diff for that idle env returned environment_destroyed until a thread/terminal happened to re-provision it.

reconcileLiveEnvironments now also LIFTS the tombstone for any environment id the server reports live, so a failed-teardown env recovers on the next session open.

Adds a test: destroyed -> tombstoned, then present in liveEnvironmentIds on reconcile -> tombstone lifted and the env is watchable again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
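
The reconcile rules, including the P2-A heal step, can be sketched as a free function over two collections. This is a minimal model under assumptions: the `EnvEntry` shape with `activeThreads` and `activeTerminals` counters and the function signature are hypothetical, and a real implementation would also shut down runtimes and unsubscribe watchers when it drops an entry.

```typescript
// Hypothetical entry shape; the real entries hold runtimes and watchers.
interface EnvEntry {
  activeThreads: number;
  activeTerminals: number;
}

function reconcileLiveEnvironments(
  entries: Map<string, EnvEntry>,
  destroyedEnvironmentIds: Set<string>,
  liveEnvironmentIds: readonly string[],
): void {
  const live = new Set(liveEnvironmentIds);

  // Heal step (P2-A): anything the server reports live must not stay tombstoned.
  live.forEach((id) => {
    destroyedEnvironmentIds.delete(id);
  });

  // Drop step: idle entries the server no longer lists are removed and tombstoned.
  // Deleting from a Map inside forEach is safe in JavaScript.
  entries.forEach((entry, id) => {
    if (live.has(id)) return;
    const busy = entry.activeThreads > 0 || entry.activeTerminals > 0;
    if (busy) return; // environments with active work are never dropped
    entries.delete(id);
    destroyedEnvironmentIds.add(id);
  });
}
```

Running the heal step on every session open means a tombstone left behind by a failed teardown lasts at most until the next reconnect, provided the server still lists the environment as live.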

brsbl wants to merge 2 commits into ymichael:main from brsbl:bb/fix-host-daemon-env-watch-lifecycle-leak-destroy-thr_c5xxwwvknt
