fix: stop host-daemon from resurrecting destroyed environments (native watcher crash) #58
Draft
brsbl wants to merge 2 commits into
The desktop app hard-crashed with a native @parcel/watcher segfault (FSEventsCallback -> DirTree::add/find).

Root cause is an in-memory watch-lifecycle leak in the host-daemon: requireWorkspaceEnvironment -> RuntimeManager.ensureEnvironment re-provisions and re-subscribes an FSEvents watcher for ANY environment referenced by a workspace.* command, with no guard against environments the daemon already destroyed. With ~300 destroyed managed worktrees in the moss project, every workspace.status poll resurrected a dead environment + watcher, churning FSEvents and feeding the native crash.

Fix (daemon-owned watch/runtime lifecycle):
- RuntimeManager tombstones destroyed environments; destroyEnvironment records the tombstone (even with no live entry) and requireWorkspaceEnvironment refuses to reconnect a tombstoned env (ExpectedCommandDispatchError "environment_destroyed"), so it is never re-watched. ensureEnvironment clears the tombstone when an env is explicitly (re)provisioned.
- environment.destroy is idempotent: a repeat destroy returns success instead of resurrecting the workspace.
- reconcileLiveEnvironments(liveIds), driven by a new liveEnvironmentIds field on the session-open response, runs on every (re)connect. It drops watchers + runtimes for idle environments the server no longer considers live (destroyed while the daemon was disconnected, whose destroy command never arrived) and tombstones them. Environments with active threads or terminals are never dropped.
- WorkspaceStatusWatcher retries are now bounded (give up after a capped number of attempts) so a permanently-missing/invalid path stops re-subscribing instead of retrying forever.

Tests: RuntimeManager tombstone + reconcile behavior; dispatch-level resurrection guard + idempotent destroy; bounded watcher retry; server session-open returns only non-destroyed environments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
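The tombstone guard and the bounded retry described in this commit message can be illustrated with a minimal sketch. It is not the daemon's code: `RuntimeManagerSketch`, `EnvironmentDestroyedError`, `BoundedRetrySketch`, `isWatched`, and the entry shape are hypothetical stand-ins, and the real `ExpectedCommandDispatchError` and watcher wiring are not shown in this PR text. The sketch only assumes the behavior stated above: destroy always records a tombstone, the guard runs before any re-provisioning, explicit provisioning clears the tombstone, and retries stop at a cap.

```typescript
// Hypothetical stand-in for ExpectedCommandDispatchError("environment_destroyed").
class EnvironmentDestroyedError extends Error {
  readonly code = "environment_destroyed";
  constructor(environmentId: string) {
    super(`environment ${environmentId} was destroyed`);
  }
}

// Simplified, assumed model of the daemon's RuntimeManager.
class RuntimeManagerSketch {
  // Live entries; in the real daemon each would own a runtime and a watcher.
  private readonly entries = new Map<string, { watching: boolean }>();
  // Ids of environments this daemon has destroyed.
  private readonly tombstones = new Set<string>();

  // Explicit (re)provisioning clears the tombstone and creates the entry.
  ensureEnvironment(id: string): void {
    this.tombstones.delete(id);
    if (!this.entries.has(id)) this.entries.set(id, { watching: true });
  }

  // Records the tombstone even when no live entry exists, so a repeat
  // destroy is a successful no-op instead of a resurrection.
  destroyEnvironment(id: string): void {
    this.tombstones.add(id);
    this.entries.delete(id);
  }

  // Guard for workspace.* commands: the tombstone check must run before
  // ensureEnvironment, because ensureEnvironment would clear it.
  requireWorkspaceEnvironment(id: string): { watching: boolean } {
    if (this.tombstones.has(id)) throw new EnvironmentDestroyedError(id);
    this.ensureEnvironment(id);
    return this.entries.get(id)!;
  }

  isWatched(id: string): boolean {
    return this.entries.has(id);
  }
}

// Assumed shape of a bounded retry budget for a watcher subscription.
class BoundedRetrySketch {
  private failures = 0;
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Call after a failed subscribe; returns false once the cap is reached.
  recordFailure(): boolean {
    this.failures = Math.min(this.failures + 1, this.maxAttempts);
    return this.failures < this.maxAttempts;
  }

  // A successful subscribe resets the budget.
  recordSuccess(): void {
    this.failures = 0;
  }
}
```

In this model a status poll for a destroyed id fails fast with `environment_destroyed` and creates no entry, which is the property that stops the watcher churn. The delay between retries is left out; the sketch only covers the cap and the reset.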
Safety review: GO (no P0/P1). Tracking the reviewer's P2/P3 follow-ups here (documentation only; no code change in this PR). Also appended to the PR description's follow-ups section.
Safety-review follow-ups (review came back GO, no P0/P1)
Safety-review follow-up P2-A.

destroyEnvironment tombstones an environment before the teardown (runtime.shutdown()/workspace.destroy()) that can throw. If teardown fails with anything other than path_not_found the command errors, the server reverts the environment destroying->ready, but the daemon stays tombstoned. reconcileLiveEnvironments only ADDED tombstones (it iterates entries, and a tombstoned env has no entry), so reconnect never healed it and every workspace.status/diff for that idle env returned environment_destroyed until a thread/terminal happened to re-provision it.

reconcileLiveEnvironments now also LIFTS the tombstone for any environment id the server reports live, so a failed-teardown env recovers on the next session open.

Adds a test: destroyed -> tombstoned, then present in liveEnvironmentIds on reconcile -> tombstone lifted and the env is watchable again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
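The reconcile behavior after this follow-up can be sketched as below. This is an assumed, simplified model and not the daemon's implementation: `ReconcileSketch`, `EnvEntry`, and the `activeThreads`/`activeTerminals` counters are hypothetical names. It shows the two directions described above: idle environments missing from the server's live list are dropped and tombstoned, and any id the server reports live has its tombstone lifted.

```typescript
// Hypothetical entry shape; the real daemon tracks runtimes and watchers.
interface EnvEntry {
  activeThreads: number;
  activeTerminals: number;
}

class ReconcileSketch {
  readonly entries = new Map<string, EnvEntry>();
  readonly tombstones = new Set<string>();

  // liveIds stands in for the liveEnvironmentIds field of the
  // session-open response.
  reconcileLiveEnvironments(liveIds: Iterable<string>): void {
    const live = new Set(liveIds);

    // Lift direction: a tombstoned env has no entry, so this cannot be
    // done by iterating entries. It heals a failed teardown that the
    // server reverted to ready.
    for (const id of Array.from(live)) {
      this.tombstones.delete(id);
    }

    // Drop direction: remove idle entries the server no longer lists.
    for (const [id, entry] of Array.from(this.entries)) {
      const busy = entry.activeThreads > 0 || entry.activeTerminals > 0;
      if (live.has(id) || busy) continue; // never drop active work
      this.entries.delete(id);
      this.tombstones.add(id);
    }
  }
}
```

Keeping the lift step separate from the entry loop is the point of the fix: before it, a tombstone with no matching entry was invisible to reconcile.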
Fix: host-daemon resurrecting destroyed environments (native file-watcher crash) + watch-lifecycle hardening
Fixes the native file-watcher crash that was repeatedly killing the desktop app, plus related watch-lifecycle hardening.
Root cause
requireWorkspaceEnvironment → RuntimeManager.ensureEnvironment re-provisioned and re-subscribed an FSEvents watcher for any environment named by a workspace.* command, with no guard against already-destroyed envs. With ~300 destroyed worktrees, every workspace.status poll resurrected a dead env + watcher → continuous FSEvents churn → null-pointer segfault in @parcel/watcher (watcher.node, FSEventsCallback → DirTree::add/find) → whole-app crash (EXC_BAD_ACCESS). Crash reports: bb-2026-05-29-145501.ips, bb-2026-05-29-183342.ips.
Changes
- Tombstones for destroyed environments; idempotent environment.destroy.
- reconcileLiveEnvironments on every (re)connect, driven by a new required liveEnvironmentIds field in the session-open response: drops watchers/runtimes for idle envs the server no longer lists as live, keeps envs with active work.
- Bounded WorkspaceStatusWatcher retry (60 attempts, resets on success; no tight loop).
- Reconcile also lifts tombstones for ids in liveEnvironmentIds, so an idle env can't get stuck tombstoned-but-ready after a failed teardown.
Validation
Known / deferred (not in this PR)
- thread.start can briefly lift a tombstone racing a just-processed destroy; self-heals on the next reconcile, no FSEvents leak. No action.
- Upgrade @parcel/watcher past 2.5.6 and fix the darwin-x64-vs-arm64 prebuild mismatch. The native segfault can't be caught from JS; eliminating the churn (this PR) is the real fix. Separate follow-up.
Incident report (full context)
What was actually crashing the app (fixed)
- pnpm run dev stacks had been running since May 22, each a full second bb server+daemon pointed at the same data dir + DB. Multiple backends contending on one SQLite file is the likely cause of a separate database-layer crash (better_sqlite3 segfault, bb-2026-05-30-004416.ips). Operational fix (outside this PR): both stacks killed.
- host_daemon_commands reached 174k rows / 388 MB, pushing bb.db to 727 MB. Operational fix (outside this PR): pruned terminal commands + VACUUM → 177 MB.
Looked alarming but harmless
thread.rename on a provider-less thread; the status app posting to a deleted thread (404). Cosmetic.