fix: stop host-daemon from resurrecting destroyed environments (native watcher crash)#58

Draft
brsbl wants to merge 2 commits into ymichael:main from brsbl:bb/fix-host-daemon-env-watch-lifecycle-leak-destroy-thr_c5xxwwvknt

brsbl commented May 30, 2026

Fix: host-daemon resurrecting destroyed environments (native file-watcher crash) + watch-lifecycle hardening

Fixes the native file-watcher crash that was repeatedly killing the desktop app, plus related watch-lifecycle hardening.

Root cause

requireWorkspaceEnvironment → RuntimeManager.ensureEnvironment re-provisioned and re-subscribed an FSEvents watcher for any environment named by a workspace.* command, with no guard against already-destroyed envs. With ~300 destroyed worktrees, every workspace.status poll resurrected a dead env + watcher → continuous FSEvents churn → null-pointer segfault in @parcel/watcher (watcher.node, FSEventsCallback → DirTree::add/find) → whole-app crash (EXC_BAD_ACCESS). Crash reports: bb-2026-05-29-145501.ips, bb-2026-05-29-183342.ips.
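To make the missing guard concrete, here is a minimal sketch of a tombstone check in front of the provisioning path. It is not the code in this PR: `RuntimeManagerSketch`, `EnvironmentDestroyedError`, and `subscribeCount` (a stand-in for FSEvents subscriptions) are hypothetical, and only the method names and the `environment_destroyed` code are taken from the description.

```typescript
// Hypothetical, simplified model of the daemon's environment registry.
// Method names mirror the PR text; the bodies are illustrative only.
class EnvironmentDestroyedError extends Error {
  readonly code = "environment_destroyed";
  constructor(id: string) {
    super(`environment ${id} was destroyed`);
  }
}

class RuntimeManagerSketch {
  private readonly entries = new Map<string, { watching: boolean }>();
  private readonly destroyedEnvironmentIds = new Set<string>();
  subscribeCount = 0; // stands in for the number of watcher subscriptions made

  // Explicit (re)provisioning, e.g. from thread.start: lifts any tombstone.
  ensureEnvironment(id: string): void {
    this.destroyedEnvironmentIds.delete(id);
    if (!this.entries.has(id)) {
      this.entries.set(id, { watching: true });
      this.subscribeCount += 1;
    }
  }

  // Path used by workspace.* commands: refuses tombstoned environments,
  // so a status poll can no longer resurrect a destroyed environment.
  requireWorkspaceEnvironment(id: string): void {
    if (this.destroyedEnvironmentIds.has(id)) {
      throw new EnvironmentDestroyedError(id);
    }
    this.ensureEnvironment(id);
  }

  // Idempotent: records the tombstone even when no live entry exists.
  destroyEnvironment(id: string): void {
    this.destroyedEnvironmentIds.add(id);
    this.entries.delete(id);
  }
}
```

In this model, polling a destroyed environment any number of times leaves `subscribeCount` unchanged, while an explicit `ensureEnvironment` call still brings the environment back.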

Changes

  • Tombstone destroyed envs so they're never re-watched.
  • Idempotent environment.destroy.
  • reconcileLiveEnvironments on every (re)connect, driven by a new required liveEnvironmentIds field in the session-open response — drops watchers/runtimes for idle envs the server no longer lists as live, keeps envs with active work.
  • Bounded WorkspaceStatusWatcher retry (60 attempts, resets on success; no tight loop).
  • P2-A heal-gap fix: reconcile now also lifts a tombstone for any env that reappears in liveEnvironmentIds, so an idle env can't get stuck tombstoned-but-ready after a failed teardown.
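The bounded-retry item above can be pictured with a small sketch. This is an illustration under assumptions, not the real `WorkspaceStatusWatcher`: `BoundedRetrySubscriber`, the injected `schedule` function, and the 1000 ms delay are hypothetical, and subscription is modeled as a synchronous call that throws on failure to keep the example short. Only the cap of 60 and the reset-on-success rule come from the description.

```typescript
// Hypothetical sketch; none of these names come from the real watcher.
type Subscribe = () => void; // throws when the path is missing or invalid
type Schedule = (retry: () => void, delayMs: number) => void;

class BoundedRetrySubscriber {
  private failedAttempts = 0;
  private readonly subscribe: Subscribe;
  private readonly schedule: Schedule;
  private readonly maxAttempts: number;
  private readonly delayMs: number;
  gaveUp = false;

  constructor(
    subscribe: Subscribe,
    schedule: Schedule,
    maxAttempts = 60, // cap quoted in the PR description
    delayMs = 1000, // assumed delay; the PR does not state one
  ) {
    this.subscribe = subscribe;
    this.schedule = schedule;
    this.maxAttempts = maxAttempts;
    this.delayMs = delayMs;
  }

  attempt(): void {
    if (this.gaveUp) return;
    try {
      this.subscribe();
      this.failedAttempts = 0; // success resets the retry budget
    } catch {
      this.failedAttempts += 1;
      if (this.failedAttempts >= this.maxAttempts) {
        this.gaveUp = true; // stop re-subscribing to a permanently bad path
        return;
      }
      // Every retry goes through the scheduler, so failures never spin in a tight loop.
      this.schedule(() => this.attempt(), this.delayMs);
    }
  }
}
```

Injecting the scheduler keeps the retry policy testable without real timers; in production-style code it would presumably wrap `setTimeout`.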

Validation

  • Full-repo typecheck green.
  • Tests green: host-daemon, host-watcher, contract, db, server — plus new tests for each scope above (incl. P2-A).

Known / deferred (not in this PR)

  • P2-B: the session-open contract is now required+strict on both sides, so a mixed-version (old daemon ↔ new server) reconnect fails session-open. Fine for the bundled app, which restarts server+daemon together. Accepted.
  • P3: a thread.start can briefly lift a tombstone racing a just-processed destroy; self-heals on the next reconcile, no FSEvents leak. No action.
  • Deferred: upgrade @parcel/watcher past 2.5.6 and fix the darwin-x64-vs-arm64 prebuild mismatch. The native segfault can't be caught from JS; eliminating the churn (this PR) is the real fix. Separate follow-up.

Incident report (full context)

What was actually crashing the app (fixed)

  1. Daemon resurrecting dead environments → file-watcher crash (the main one) — see Root cause above. Fixed by this PR.
  2. Two duplicate backends fighting over the database — two orphaned pnpm run dev stacks had been running since May 22, each a full second bb server+daemon pointed at the same data dir + DB. Multiple backends contending on one SQLite file is the likely cause of a separate database-layer crash (better_sqlite3 segfault, bb-2026-05-30-004416.ips). Operational fix (outside this PR): both stacks killed.
  3. Database bloat: bug 1 above (the daemon resurrecting dead environments) generated a flood of command records that were never pruned; host_daemon_commands reached 174k rows / 388 MB, pushing bb.db to 727 MB. Operational fix (outside this PR): pruned terminal commands + VACUUM → 177 MB.

Looked alarming but harmless

  • The "Failed to reprime app data change cache" ENOENT flood = references to deleted thread-storage folders. Not a crash.
  • Stale-data warnings: a thread.rename on a provider-less thread; the status app posting to a deleted thread (404). Cosmetic.

The desktop app hard-crashed with a native @parcel/watcher segfault
(FSEventsCallback -> DirTree::add/find). Root cause is an in-memory
watch-lifecycle leak in the host-daemon: requireWorkspaceEnvironment ->
RuntimeManager.ensureEnvironment re-provisions and re-subscribes an
FSEvents watcher for ANY environment referenced by a workspace.* command,
with no guard against environments the daemon already destroyed. With ~300
destroyed managed worktrees in the moss project, every workspace.status
poll resurrected a dead environment + watcher, churning FSEvents and
feeding the native crash.

Fix (daemon-owned watch/runtime lifecycle):
- RuntimeManager tombstones destroyed environments; destroyEnvironment
  records the tombstone (even with no live entry) and requireWorkspaceEnvironment
  refuses to reconnect a tombstoned env (ExpectedCommandDispatchError
  "environment_destroyed"), so it is never re-watched. ensureEnvironment
  clears the tombstone when an env is explicitly (re)provisioned.
- environment.destroy is idempotent: a repeat destroy returns success
  instead of resurrecting the workspace.
- reconcileLiveEnvironments(liveIds), driven by a new liveEnvironmentIds
  field on the session-open response, runs on every (re)connect. It drops
  watchers + runtimes for idle environments the server no longer considers
  live (destroyed while the daemon was disconnected, whose destroy command
  never arrived) and tombstones them. Environments with active threads or
  terminals are never dropped.
- WorkspaceStatusWatcher retries are now bounded (give up after a capped
  number of attempts) so a permanently-missing/invalid path stops
  re-subscribing instead of retrying forever.

Tests: RuntimeManager tombstone + reconcile behavior; dispatch-level
resurrection guard + idempotent destroy; bounded watcher retry; server
session-open returns only non-destroyed environments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

brsbl commented May 30, 2026

Safety review: GO — no P0/P1. Tracking the reviewer's P2/P3 follow-ups here (documentation only; no code change in this PR). Also appended to the PR description's follow-ups section.

Safety-review follow-ups (review came back GO — no P0/P1)

  • P2-A — reconcile does not heal a stuck tombstone (idle managed-worktree only, recoverable): destroyEnvironment (runtime-manager.ts) tombstones before the teardown that can throw (destroyedEnvironmentIds.add(...) then runtime.shutdown() / workspace.destroy()). If teardown throws anything other than path_not_found, the command fails and the server reverts the env destroying → ready, but the daemon stays tombstoned — so every workspace.status / workspace.diff for that idle env returns environment_destroyed until a thread.start/terminal lifts the tombstone via ensureEnvironment. reconcileLiveEnvironments only adds tombstones (it iterates entries, and a tombstoned env has no entry), so reconnect does not heal it. Suggested fix: in reconcileLiveEnvironments, also remove from destroyedEnvironmentIds any id present in liveEnvironmentIds; or only tombstone after teardown succeeds. Impact: idle managed-worktree only, recoverable, never affects active threads.
  • P2-B — mixed-version session-open is incompatible by design: the session-open response liveEnvironmentIds field is now required + strict on both sides, so an old-daemon ↔ new-server (or vice-versa) reconnect fails session-open. Fine for the bundled desktop app, which restarts server + daemon together (the hot-swap quits + relaunches the whole app), but noted for any independent/rolling deploy.
  • P3 (minor) — thread.start/terminal lifts the tombstone unconditionally via ensureEnvironment. A thread.start racing a just-processed destroy can lift the tombstone; it self-heals on the next reconcile and causes no FSEvents leak (createEntry provisions before subscribing).
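The first suggested fix for P2-A (during reconcile, also remove from destroyedEnvironmentIds any id present in liveEnvironmentIds) can be sketched as follows. This is a simplified model, not the code in runtime-manager.ts: `ReconcileSketch`, `EntrySketch`, and the `hasActiveWork` flag are hypothetical stand-ins for the real entries and their thread/terminal tracking.

```typescript
// Hypothetical sketch of reconcile-on-reconnect; the real RuntimeManager differs.
interface EntrySketch {
  hasActiveWork: boolean; // stands in for "has active threads or terminals"
}

class ReconcileSketch {
  readonly entries = new Map<string, EntrySketch>();
  readonly destroyedEnvironmentIds = new Set<string>();

  reconcileLiveEnvironments(liveEnvironmentIds: readonly string[]): void {
    const live = new Set<string>(liveEnvironmentIds);

    // P2-A heal: an id the server reports as live must not stay tombstoned,
    // even though a tombstoned environment has no entry to iterate over.
    for (const id of liveEnvironmentIds) {
      this.destroyedEnvironmentIds.delete(id);
    }

    // Drop idle environments the server no longer lists, and tombstone them.
    // Snapshot the entries first so deleting during the loop is safe.
    for (const [id, entry] of Array.from(this.entries.entries())) {
      if (live.has(id) || entry.hasActiveWork) continue;
      this.entries.delete(id); // real code would also stop the watcher and runtime
      this.destroyedEnvironmentIds.add(id);
    }
  }
}
```

Lifting tombstones in a separate pass over `liveEnvironmentIds` is what closes the gap described above: the drop pass only ever sees environments that still have an entry.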

Safety-review follow-up P2-A. destroyEnvironment tombstones an environment
before the teardown (runtime.shutdown()/workspace.destroy()) that can throw.
If teardown fails with anything other than path_not_found the command errors,
the server reverts the environment destroying->ready, but the daemon stays
tombstoned. reconcileLiveEnvironments only ADDED tombstones (it iterates
entries, and a tombstoned env has no entry), so reconnect never healed it and
every workspace.status/diff for that idle env returned environment_destroyed
until a thread/terminal happened to re-provision it.

reconcileLiveEnvironments now also LIFTS the tombstone for any environment id
the server reports live, so a failed-teardown env recovers on the next session
open. Adds a test: destroyed -> tombstoned, then present in liveEnvironmentIds
on reconcile -> tombstone lifted and the env is watchable again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>