
vm rm --force: DB delete proceeds when WithRunningVM reports ErrNotRunning, but CH may actually be alive #49

@CMGS

Description


Summary

On the cocoonset-gke-private node cocoon-pool-2 we observed 8 cloud-hypervisor / firecracker processes alive on the host with no matching entry in cocoon's vm DB. The processes were between 1 and 4 days old. After cocoon gc, the rundirs were reclaimed, but the CH processes themselves had to be SIGTERM'd by hand.

This is the classic "DB cleaned, VM not killed" leak: the inverse of the documented DeleteAll invariant that directory cleanup precedes the DB delete, so a failed cleanup leaves a retryable record.

How it can happen

hypervisor/stop.go:DeleteAll flow per VM:

b.WithRunningVM(ctx, &rec, func(_ int) error {
    if !force { return fmt.Errorf("running (force required)") }
    return stopOne(ctx, id)
})
// if runningErr is nil OR ErrNotRunning, proceed:
RemoveVMDirs(rec.RunDir, rec.LogDir)
DB.Update(delete record)

WithRunningVM returns ErrNotRunning when either:

  1. The PID file is missing (fs.ErrNotExist), or
  2. VerifyProcessCmdline(pid, BinaryName, SocketPath) returns false — the PID exists but its /proc/<pid>/cmdline doesn't match.

In both cases DeleteAll treats the VM as "already gone" and cleans the DB without killing anything. If a real CH is running but cocoon's PID file is wrong or missing (stale from a previous lifecycle, manually wiped, file-system corruption, etc.), the live CH process is silently orphaned.

A second related path: even when stopOne is invoked, the CH-side forceTerminate reads the same PID via WithRunningVM and calls terminateWithPidfd. utils/pidfd_linux.go:16-17:

if !VerifyProcessCmdline(pid, binaryName, expectArg) {
    return true, nil
}

Returning (true, nil) ("handled, no error") means TerminateProcess returns nil → stopOne returns nil → DB cleanup proceeds, without ever sending a signal. Same outcome as path 1 if the cmdline check is flaky.
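A runnable sketch of that second path (the function name and the syscall.Kill stand-in for the pidfd machinery are assumptions; only the (true, nil) early return mirrors the quoted code):

```go
package main

import (
	"fmt"
	"syscall"
)

// forceTerminateSketch mirrors the quoted utils/pidfd_linux.go behavior:
// a failed cmdline check returns (true, nil), i.e. "handled, no error",
// so no signal is ever delivered to the process.
func forceTerminateSketch(pid int, cmdlineMatches func(int) bool) (handled bool, err error) {
	if !cmdlineMatches(pid) {
		return true, nil // the leak path: success reported, nothing signalled
	}
	// The real code opens a pidfd and signals through it; plain Kill
	// stands in here to keep the sketch self-contained.
	return true, syscall.Kill(pid, syscall.SIGTERM)
}

func main() {
	// Simulate a flaky cmdline check against a live process.
	handled, err := forceTerminateSketch(4242, func(int) bool { return false })
	fmt.Println(handled, err) // true <nil> — the caller proceeds to DB cleanup
}
```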

Why pidfile/cmdline could be wrong on production

  • PID file race: cocoon writes the pidfile after cmd.Start but before the process is fully verified. If the cocoon CLI is killed in this window, the pidfile may never be written even though CH is already running.
  • PID recycle: on a long-running host with many VMs cycling, the kernel can recycle PIDs. VerifyProcessCmdline is the guard, but it's only as good as /proc/<pid>/cmdline matching the strict basename plus the api-socket substring. Any drift (e.g., the pidfile points at an unrelated process whose cmdline doesn't contain the api.sock string) yields a false negative: "not running".
  • Rundir wiped by hand or by an external tool while CH is still attached via fds (CH keeps the socket open; pidfile gone but CH alive).
  • Crash mid-startup: cocoon started CH, wrote the pidfile, then crashed before persisting the DB record. (This leaves no DB entry at all, though, so it doesn't match the observed shape.)

The observed shape on production is DB-empty + CH-alive, which cleanly matches WithRunningVM ⇒ ErrNotRunning ⇒ DB.Update(delete): either the PID file check or the cmdline check fired falsely.

Proposed fixes

Pick one (combinable):

  1. Last-line verification before DB delete: after WithRunningVM returns ErrNotRunning, scan /proc for any cloud-hypervisor / firecracker process whose --api-socket argument contains the VM's runDir. If found, treat as "real running, escalate", not "gone".
  2. Don't lose the API socket as a side-channel: if the api.sock UDS is still answerable (CH exports a vm.info endpoint), the VM is alive regardless of what the pidfile says. Probe the socket as a tiebreaker before declaring not-running.
  3. Tighten VerifyProcessCmdline: require an exact match on the full api-socket path (currently a substring check on the rest of argv). Reduces false negatives from cmdline truncation / locale weirdness — but this is the existing approach being bypassed, so it's only a defense in depth.

(1) or (2) is the durable fix because they don't depend on the pidfile being trustworthy. (1) is purely a filesystem scan; (2) needs a short timeout to keep vm rm snappy when CH is genuinely gone but the socket file lingers.

Out-of-band cleanup recommendation

Once the leak is detected, cocoon gc doesn't kill the orphan VMM processes (it only reclaims rundirs whose VM ID has no DB entry — and even that only after the rundir is reachable). We had to:

SIGTERM the orphan PIDs by hand → cocoon gc

A cocoon doctor (or cocoon gc --processes) flag that scans for cloud-hypervisor / firecracker processes whose --api-socket rundir is not in any backend's DB and offers to kill them would close the loop.
