
Orphaned shellper processes accumulate again (regression of #389): af cleanup / Tower restart leave untracked shellpers running #1007

@waleedkadous


Problem

Orphaned shellper-main.js processes are accumulating again — a regression of #389 (and the earlier #294 / #341 / #324, all CLOSED). On the dev box right now there are 51 live shellper processes but Tower's authoritative registry (GET /api/terminals, which is global across all workspaces) only tracks 27. That leaves 24 orphans — running shellpers with no Tower session attached in any workspace, all reparented to PPID 1 (their original Tower instance died).

Forensics (2026-06-06)

Method — subtract Tower's global live set from the running shellper set:

  • Tower /api/terminals: 27 tracked PIDs (codev 6 + shannon 20 + writing 1 — matches /api/workspaces counts exactly).
  • ps for shellper-main.js: 51 PIDs.
  • Orphans = running − tracked = 24, all PPID 1.
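The subtraction above is a plain set difference. A minimal Python sketch follows, assuming the running shellpers have already been collected as a PID-to-PPID mapping (for example from ps) and the tracked PIDs come from /api/terminals. The function names and sample PIDs are illustrative only and are not taken from the dev box.

```python
def find_orphans(running: dict[int, int], tracked: set[int]) -> list[int]:
    """Return PIDs that are running but absent from Tower's registry.

    running maps each shellper PID to its parent PID (PPID).
    tracked is the set of PIDs Tower reports as live sessions.
    """
    return sorted(pid for pid in running if pid not in tracked)


def reparented_to_init(running: dict[int, int], pids: list[int]) -> list[int]:
    """Of the given PIDs, keep those whose parent is PID 1."""
    return [pid for pid in pids if running[pid] == 1]


# Hypothetical PIDs, for illustration only.
running = {101: 1, 102: 4000, 103: 1, 104: 4000}
tracked = {102, 104}

orphans = find_orphans(running, tracked)
print(orphans)                               # [101, 103]
print(reparented_to_init(running, orphans))  # [101, 103]
```

Checking the PPID separately is a cheap sanity filter: an untracked shellper whose parent is still a live Tower may just be mid-registration rather than orphaned.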

The 24 orphans cluster into:

| Bucket | Count | Notes |
|---|---|---|
| E2E test leftovers | 14 | ~/.agent-farm/test-workspaces/codev-reconnect-* + /tmp /bin/echo fixtures; node-25 |
| Old shannon sessions | 9 | node-25 zombies + two stray /bin/bash inside live builder worktrees (bugfix-2135, bugfix-2164) |
| Old codev session | 1 | pid was a node-25 claude session |

Two distinct leak sources

  1. af cleanup doesn't kill the shellper. This is the original #389 defect (af cleanup doesn't kill shellper processes; orphaned processes accumulate). The worktree, branch, and Tower row are removed, but the process survives. A sibling symptom is the stale-but-tracked row: for codev bugfix-985 the worktree is deleted, yet Tower still lists pid 31931 as a live session.
  2. Tower restart abandons old shellpers. The node-25→node-26 migration restarted Tower; the old node-25 shellpers detached (PPID 1) instead of being reaped. This is the #324 failure mode (Shellper processes do not survive Tower restart despite detached:true) resurfacing. Most of the 24 orphans are node-25.

E2E reconnect tests also leak their fixture shellpers (14 of the 24) — the test harness should tear these down.
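One way a test harness could guarantee teardown is to run each fixture in its own session and kill the whole process group when the test scope exits. The sketch below is written in Python for illustration and is not the project's actual harness; `fixture_process` and `_kill_group` are hypothetical names.

```python
import contextlib
import os
import signal
import subprocess


@contextlib.contextmanager
def fixture_process(argv: list[str], grace_seconds: float = 2.0):
    """Run argv in its own session; kill its whole process group on exit."""
    # start_new_session=True runs setsid() in the child, so the child's PID
    # is also its process-group ID and anything it spawns joins that group.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        yield proc
    finally:
        _kill_group(proc, grace_seconds)


def _kill_group(proc: subprocess.Popen, grace_seconds: float) -> None:
    # Polite request first.
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        pass  # nothing left to signal in the group
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass  # leader ignored SIGTERM; fall through to SIGKILL
    # Sweep any group member that survived SIGTERM.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except (ProcessLookupError, PermissionError):
        pass
    proc.wait()  # reap the leader so no zombie is left behind
```

Usage would look like `with fixture_process(["/bin/echo", "hi"]) as proc: ...`. Because the cleanup sits in a `finally` clause, it also runs when the test body raises.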

Expected

  • af cleanup kills the shellper process group (and children), not just the Tower row.
  • Tower startup reaps shellpers from a prior instance that no client will ever reattach to (or they should self-terminate when their parent Tower dies — detached:true should not mean "outlive Tower forever").
  • E2E reconnect tests clean up their spawned shellpers.
  • Add a reconciliation/GC pass: any shellper-main.js not present in /api/terminals is an orphan and can be reaped.
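The reaping step in the expectations above could look roughly like the following Python sketch. It assumes POSIX and assumes each shellper leads its own process group, which Node's documented behaviour for detached: true implies on non-Windows platforms. `pid_alive` and `reap_process_group` are hypothetical names, not existing Tower or af functions.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def reap_process_group(pid: int, grace_seconds: float = 5.0) -> bool:
    """SIGTERM pid's group, escalate to SIGKILL after the grace period.

    Returns True once pid no longer exists.
    """
    try:
        leads_group = os.getpgid(pid) == pid
    except ProcessLookupError:
        return True  # already gone

    def send(sig: int) -> None:
        # Only signal the whole group when pid is its leader; otherwise we
        # could hit unrelated processes that merely share the group.
        if leads_group:
            os.killpg(pid, sig)
        else:
            os.kill(pid, sig)

    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            send(sig)
        except ProcessLookupError:
            return True
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            if not pid_alive(pid):
                return True
            time.sleep(0.05)
    return not pid_alive(pid)
```

A GC pass would call `reap_process_group` for each PID that is running but missing from /api/terminals. Polling with signal 0 is used instead of waiting on the process because orphans reparented to PID 1 are not children of the reaper.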

Detection one-liner

# orphans = running shellpers not in Tower's global registry
comm -23 \
  <(ps -eo pid,command | grep shellper-main.js | grep -v grep | awk '{print $1}' | sort) \
  <(curl -s http://127.0.0.1:4100/api/terminals | python3 -c 'import sys,json;[print(t["pid"]) for t in json.load(sys.stdin)["terminals"]]' | sort)

Related (all CLOSED): #389, #341, #324, #294.

Labels: area/tower, bug
