
vm rm --force: DB delete proceeds when WithRunningVM reports ErrNotRunning, but CH may actually be alive #49

@CMGS

Description


Summary

On the cocoonset-gke-private node cocoon-pool-2 we observed 8 cloud-hypervisor / firecracker processes alive on the host with no matching entry in cocoon's vm DB. The processes were between 1 and 4 days old. After cocoon gc, the rundirs were reclaimed, but the CH processes themselves had to be SIGTERM'd by hand.

This is the classic "DB cleaned, VM not killed" leak: the inverse of the documented DeleteAll invariant that directory cleanup precedes the DB delete, so a failed cleanup leaves a retryable record.

How it can happen

hypervisor/stop.go:DeleteAll flow per VM:

b.WithRunningVM(ctx, &rec, func(_ int) error {
    if !force { return fmt.Errorf("running (force required)") }
    return stopOne(ctx, id)
})
// if runningErr is nil OR ErrNotRunning, proceed:
RemoveVMDirs(rec.RunDir, rec.LogDir)
DB.Update(delete record)

WithRunningVM returns ErrNotRunning when either:

  1. The PID file is missing (fs.ErrNotExist), or
  2. VerifyProcessCmdline(pid, BinaryName, SocketPath) returns false — the PID exists but its /proc/<pid>/cmdline doesn't match.

In both cases DeleteAll treats the VM as "already gone" and cleans the DB without killing anything. If a real CH is running but cocoon's PID file is wrong or missing (stale from a previous lifecycle, manually wiped, file-system corruption, etc.), the live CH process is silently orphaned.

A second related path: even when stopOne is invoked, the CH-side forceTerminate reads the same PID via WithRunningVM and calls terminateWithPidfd. utils/pidfd_linux.go:16-17:

if !VerifyProcessCmdline(pid, binaryName, expectArg) {
    return true, nil
}

Returning (true, nil) ("handled, no error") means TerminateProcess returns nil → stopOne returns nil → DB cleanup proceeds, without ever sending a signal. Same outcome as path 1 if the cmdline check is flaky.
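A runnable sketch of that second path (the function name and the syscall.Kill stand-in for the pidfd machinery are assumptions; only the (true, nil) early return mirrors the quoted code):

```go
package main

import (
	"fmt"
	"syscall"
)

// forceTerminateSketch mirrors the quoted utils/pidfd_linux.go behavior:
// a failed cmdline check returns (true, nil), i.e. "handled, no error",
// so no signal is ever delivered to the process.
func forceTerminateSketch(pid int, cmdlineMatches func(int) bool) (handled bool, err error) {
	if !cmdlineMatches(pid) {
		return true, nil // the leak path: success reported, nothing signalled
	}
	// The real code opens a pidfd and signals through it; plain Kill
	// stands in here to keep the sketch self-contained.
	return true, syscall.Kill(pid, syscall.SIGTERM)
}

func main() {
	// Simulate a flaky cmdline check against a live process.
	handled, err := forceTerminateSketch(4242, func(int) bool { return false })
	fmt.Println(handled, err) // true <nil> — the caller proceeds to DB cleanup
}
```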

Why pidfile/cmdline could be wrong on production

  • PID file race: cocoon writes the pidfile after cmd.Start but before the process is fully verified. If the cocoon CLI is killed in this window, the pidfile may never be written even though CH is already running.
  • PID recycle: on a long-running host with many VMs cycling, the kernel can recycle PIDs. VerifyProcessCmdline is the guard, but it's only as good as /proc/<pid>/cmdline matching the strict basename plus the api-socket substring. Any drift (e.g., the pidfile points at an unrelated process whose cmdline doesn't contain the api.sock string) yields a false negative: "not running".
  • Rundir wiped by hand or by an external tool while CH is still attached via fds (CH keeps the socket open; pidfile gone but CH alive).
  • Crash mid-startup: cocoon started CH, wrote the pidfile, then crashed before persisting the DB record. (This leaves no DB entry at all, though, so it doesn't match the observed shape.)

The observed shape on production is DB-empty + CH-alive, which cleanly matches WithRunningVM ⇒ ErrNotRunning ⇒ DB.Update(delete): either the PID file check or the cmdline check fired falsely.

Proposed fixes

Pick one (combinable):

  1. Last-line verification before DB delete: after WithRunningVM returns ErrNotRunning, scan /proc for any cloud-hypervisor / firecracker process whose --api-socket argument contains the VM's runDir. If found, treat as "real running, escalate", not "gone".
  2. Don't lose the API socket as a side-channel: if the api.sock UDS is still answerable (CH exports a vm.info endpoint), the VM is alive regardless of what the pidfile says. Probe the socket as a tiebreaker before declaring not-running.
  3. Tighten VerifyProcessCmdline: require an exact match on the full api-socket path (currently a substring check on the rest of argv). Reduces false negatives from cmdline truncation / locale weirdness — but this is the existing approach being bypassed, so it's only a defense in depth.

(1) or (2) is the durable fix because they don't depend on the pidfile being trustworthy. (1) is purely a filesystem scan; (2) needs a short timeout to keep vm rm snappy when CH is genuinely gone but the socket file lingers.

Out-of-band cleanup recommendation

Once the leak is detected, cocoon gc doesn't kill the orphan VMM processes (it only reclaims rundirs whose VM ID has no DB entry — and even that only after the rundir is reachable). We had to:

SIGTERM the orphan PIDs by hand → cocoon gc

A cocoon doctor (or cocoon gc --processes) flag that scans for cloud-hypervisor / firecracker processes whose --api-socket rundir is not in any backend's DB and offers to kill them would close the loop.
