cocoon vm status defaults to indefinite polling; orphans easily on shell disconnect #48

@CMGS

Summary

cocoon vm status [VMID] defaults to a 5-second polling loop with no built-in deadline or pipe-close detection. When the caller's terminal/tmux disconnects without delivering SIGHUP to the cocoon process (very common when piped through sudo or wrapped in bash -c and the controlling session dies), the polling loop survives indefinitely.

On the cocoonset-gke-private node cocoon-pool-2 we found a 21-day-old orphan from this exact pattern:

1965315 1 root S 21-08:24:21 sudo cocoon vm status --format json
1965317 1965315 root Sl 21-08:24:21 cocoon vm status --format json
1965480 1 bytedang+ Ss 21-08:18:49 bash -c sudo cocoon vm status <vmid> --format json | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('pid', 0))"

The python3 wrapper calls json.load(sys.stdin), which reads until EOF before it parses anything. cocoon keeps its end of the pipe open, so EOF never arrives and the reader never exits → cocoon never receives SIGPIPE → cocoon keeps polling. The user's intent was a one-shot inspect; they got a forever loop instead.

Root cause

cmd/vm/status.go:77-80:

interval, _ := cmd.Flags().GetInt("interval")
if interval <= 0 {
    interval = 5 //nolint:mnd
}

There is no one-shot path; vm status always enters the polling loop. The proper one-shot is vm inspect, but the affordance is unclear — many operators reach for vm status <vmid> expecting --format json to print once and exit.

Proposed fix

Pick one or more:

  1. Default to one-shot, require --watch (or --event, currently a flag) to enter the polling loop. Breaking change but matches Unix convention (status vs. watch).
  2. --interval=0 means one-shot instead of "use default 5s". Easy, non-breaking; documented in the flag help.
  3. Detect broken stdout: trap SIGPIPE / EPIPE on the emit path and exit cleanly. Catches the python3 + sudo orphan pattern even when neither (1) nor (2) is opted into.
  4. Deadline: --timeout flag for the polling loop, defaulting to e.g. 1h so a forgotten vm status doesn't run for 21 days.

(1)+(3) together would have prevented the orphan entirely. (3) alone catches the python3-stdin pattern at the cost of not solving disconnected-tty cases when stdout is the original tty (already closed cleanly by sshd).

Out of scope
