Summary
On the cocoonset-gke-private node cocoon-pool-2 we observed 8 cloud-hypervisor / firecracker processes alive on the host with no matching entry in cocoon's vm DB. The processes ranged from 1 to 4 days old. After cocoon gc, the rundirs were reclaimed, but the CH processes themselves had to be SIGTERM'd by hand.
This is the classic "DB cleaned, VM not killed" leak — the inverse of the documented DeleteAll invariant that dir cleanup precedes DB delete so a failed cleanup leaves a retry-able record.
How it can happen
hypervisor/stop.go: the DeleteAll flow per VM:

```go
b.WithRunningVM(ctx, &rec, func(_ int) error {
	if !force { return fmt.Errorf("running (force required)") }
	return stopOne(ctx, id)
})
// if runningErr is nil OR ErrNotRunning, proceed:
RemoveVMDirs(rec.RunDir, rec.LogDir)
DB.Update(delete record)
```
WithRunningVM returns ErrNotRunning when either:
- the PID file is missing (fs.ErrNotExist), or
- VerifyProcessCmdline(pid, BinaryName, SocketPath) returns false: the PID exists but its /proc/<pid>/cmdline doesn't match.
In both cases DeleteAll treats the VM as "already gone" and cleans the DB without killing anything. If a real CH is running but cocoon's PID file is wrong or missing (stale from a previous lifecycle, manually wiped, file-system corruption, etc.), the live CH process is silently orphaned.
A second related path: even when stopOne is invoked, the CH-side forceTerminate reads the same PID via WithRunningVM and calls terminateWithPidfd. utils/pidfd_linux.go:16-17:
```go
if !VerifyProcessCmdline(pid, binaryName, expectArg) {
	return true, nil
}
```
Returning (true, nil) ("handled, no error") means TerminateProcess returns nil → stopOne returns nil → DB cleanup proceeds, without ever sending a signal. Same outcome as path 1 if the cmdline check is flaky.
Why pidfile/cmdline could be wrong on production
- PID file race: cocoon writes the pidfile after cmd.Start but before the process is fully verified. If the cocoon CLI is killed in this window, the pidfile may not exist.
- PID recycle: on a long-running host with many VMs cycling, the kernel can recycle PIDs. VerifyProcessCmdline is the guard, but it's only as good as /proc/<pid>/cmdline matching the strict basename plus the api-socket substring. Any drift (e.g., the pidfile points at an unrelated process whose cmdline doesn't contain the api.sock string) yields a false negative: "not running".
- Rundir wiped by hand or by an external tool while CH is still attached via fds (CH keeps the socket open; pidfile gone but CH alive).
- Crash mid-startup: cocoon started CH, wrote the pidfile, then crashed before persisting the DB record. This leaves no DB entry at all, though, so it does not match the observed shape.
The observed shape on production is DB-empty + CH-alive, which matches WithRunningVM ⇒ ErrNotRunning ⇒ DB.Update(delete) completing cleanly after the PID file or cmdline check fired falsely.
Proposed fixes
Pick one (combinable):
- Last-line verification before DB delete: after WithRunningVM returns ErrNotRunning, scan /proc for any cloud-hypervisor / firecracker process whose --api-socket argument contains the VM's runDir. If found, treat it as "really running, escalate", not "gone".
- Don't lose the API socket as a side channel: if the api.sock UDS still answers (CH exposes a vm.info endpoint), the VM is alive regardless of what the pidfile says. Probe the socket as a tiebreaker before declaring not-running.
- Tighten VerifyProcessCmdline: require an exact match on the full api-socket path (currently a substring check on the rest of argv). This reduces false negatives from cmdline truncation / locale weirdness, but it's the existing approach being bypassed, so it's only defense in depth.
(1) and (2) are the durable fixes because they don't depend on the pidfile being trustworthy. (1) is purely a filesystem scan; (2) needs a short timeout to keep vm rm snappy when CH is genuinely gone but the socket file lingers.
Out-of-band cleanup recommendation
Once the leak is detected, cocoon gc doesn't kill the orphan VMM processes (it only reclaims rundirs whose VM ID has no DB entry, and even then only once the rundir is reachable). We had to SIGTERM the orphan PIDs by hand, then run cocoon gc.
A cocoon doctor (or cocoon gc --processes) flag that scans for cloud-hypervisor / firecracker processes whose --api-socket rundir is not in any backend's DB and offers to kill them would close the loop.
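The core of such a flag reduces to a set difference plus a signal loop. A minimal sketch, with hypothetical names: found would come from a /proc scan keyed by the runDir extracted from each --api-socket argument, and dbRunDirs from the union of every backend's DB.

```go
package main

import (
	"fmt"
	"syscall"
)

// orphans returns the PIDs whose runDir is not claimed by any DB record.
// found maps PID -> runDir extracted from that process's --api-socket arg;
// dbRunDirs is the union of runDirs across all backends' DBs.
func orphans(found map[int]string, dbRunDirs map[string]bool) []int {
	var out []int
	for pid, dir := range found {
		if !dbRunDirs[dir] {
			out = append(out, pid)
		}
	}
	return out
}

// reap sends SIGTERM to each orphan; a real implementation would confirm
// with the operator first and escalate to SIGKILL after a grace period.
func reap(pids []int) {
	for _, pid := range pids {
		_ = syscall.Kill(pid, syscall.SIGTERM)
	}
}

func main() {
	found := map[int]string{101: "/run/cocoon/vm-a", 102: "/run/cocoon/vm-b"}
	db := map[string]bool{"/run/cocoon/vm-a": true}
	fmt.Println(orphans(found, db)) // only vm-b's PID is an orphan
}
```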
Related
A prior bug where sudo cocoon vm rm --force made exec.CommandContext's SIGKILL miss the cocoon CLI grandchild increased the odds that a vm rm was cancelled mid-flight. That's since been removed, which lowers the trigger rate, but the underlying cocoon-side DB-vs-process invariant still needs fixing.