
fix: memory leaks #131

Merged
NathanFlurry merged 1 commit into main from fix-runtime-memory-leaks on Jun 25, 2026


@NathanFlurry NathanFlurry commented Jun 25, 2026


Summary

A memory-leak audit of the VM runtime found 12 leaks (the secure-exec half of a cross-repo audit; the agent-os half is rivet-dev/agentos#1535). Common root cause: tracking maps/registries are populated on create but released only by a happy-path teardown that is skipped when a '?' returns early on error, or that never runs at all. Per-timer threads with no cancellation are the other source. Each fix releases on every termination path and ships a fast, safeguard-firing test (no resource-saturation tests, per the testing rules).

Fixes

| ID | Crate / file | Leak → fix |
| --- | --- | --- |
| H1 | sidecar/{vm,service}.rs | dispose_vm_internal removed per-VM tracking only after ~7 fallible ? steps, and the dispose_session/remove_connection loops ?-out on the first failure, abandoning later VMs → VM removed before the fallible teardown half; cleanup runs unconditionally (error surfaced after); loops attempt-all + aggregate |
| M6 | sidecar/{service,vm}.rs | extension_process_output_buffers cleared only on successful handoff → also cleared on VM disposal |
| M5 | sidecar/{service,stdio}.rs | stdio active_sessions set never shrank → disposed sessions are now untracked (also stops the ~250µs event-pump from iterating dead entries) |
| L4 | sidecar/{execution,state}.rs | loopback-TLS registry kept dead Weak entries until the next lazy retain() → endpoints remove their own entry on Drop (guarded by ptr_eq + last-peer check) |
| H1 | sidecar-browser/service.rs | same dispose-short-circuit on vms/contexts (each holding a BrowserKernel) → cleared on every dispose path |
| H2 / M3 | execution/javascript.rs | guest _scheduleTimer / kernel timers spawned untracked OS threads with uncapped delay (thread-exhaustion DoS, Arc pinning) → delay capped + cancellable via generation check + clear-on-teardown |
| M1 | v8-runtime/bridge.rs | VM_CONTEXTS slot freed only on the error path → evicted on context finalize; reused isolates no longer creep to MAX_VM_CONTEXTS |
| M2 | v8-runtime/session.rs | pending promise-resolver v8::Globals could drop after their isolate on run_event_loop early-exit (leak and a V8 lifetime-contract violation) → reset before isolate teardown on every exit path |
| M4 | kernel/socket_table.rs | socket id allocated before the backlog check; full-backlog connect failure leaked it (both unix + inet) → id allocated only after the check passes |
| L2 | vfs/posix/overlay_fs.rs | rename failure orphaned staged snapshot entries → rolled back on copy/rename error |
| L3 | vfs/engine/mem/metadata_store.rs | snapshots grew forever (no delete, gc() was a no-op) → delete_snapshot() + a real gc() that reclaims unreferenced blocks |
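
For readers unfamiliar with the H1 pattern, here is a minimal sketch of "untrack first, clean up unconditionally, surface the error after, and attempt every item". It is not the sidecar code: `Vm`, `Registry`, `fail_teardown`, and both method names are hypothetical stand-ins, and the real teardown has several fallible steps rather than one.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a per-VM record; not the real sidecar type.
pub struct Vm {
    pub fail_teardown: bool,
}

impl Vm {
    /// Fallible teardown half (a single step here, for brevity).
    fn teardown(&self) -> Result<(), String> {
        if self.fail_teardown {
            Err("teardown failed".to_string())
        } else {
            Ok(())
        }
    }
}

#[derive(Default)]
pub struct Registry {
    vms: HashMap<u32, Vm>,
    /// Side table that must be released together with the VM.
    output_buffers: HashMap<u32, Vec<u8>>,
}

impl Registry {
    pub fn insert(&mut self, id: u32, vm: Vm) {
        self.vms.insert(id, vm);
        self.output_buffers.insert(id, Vec::new());
    }

    /// Total number of tracked entries across both maps.
    pub fn tracked(&self) -> usize {
        self.vms.len() + self.output_buffers.len()
    }

    /// Untrack first, then run the fallible half. The error is returned
    /// only after every map has released its entry.
    pub fn dispose_vm(&mut self, id: u32) -> Result<(), String> {
        let Some(vm) = self.vms.remove(&id) else {
            return Ok(()); // already disposed
        };
        let result = vm.teardown();
        self.output_buffers.remove(&id); // unconditional cleanup
        result.map_err(|e| format!("vm {id}: {e}"))
    }

    /// Attempt every VM and aggregate failures instead of returning on the first.
    pub fn dispose_all(&mut self) -> Result<(), Vec<String>> {
        let mut ids: Vec<u32> = self.vms.keys().copied().collect();
        ids.sort_unstable();
        let errors: Vec<String> = ids
            .into_iter()
            .filter_map(|id| self.dispose_vm(id).err())
            .collect();
        if errors.is_empty() {
            Ok(())
        } else {
            Err(errors)
        }
    }
}
```

With a '?' inside the loop, the first failing VM would end `dispose_all` and leave every later VM tracked; collecting the errors keeps the loop going and still reports each failure.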

L1 (the bounded, documented Box::into_raw isolate-handle, ~5 KB, reclaimed at process exit) is intentionally left as-is.

Cross-repo note (enables agent-os H4)

This adds an Extension::on_session_disposed hook, fired on DisposeReason::ConnectionClosed. agent-os's ACP extension already has a ready cleanup_sessions_for_connection; once this lands and is published, agent-os can override the hook to fully close its H4 (ACP sessions leaking on client disconnect). The hook is additive and has a default, so existing Extension impls do not break.
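
As an illustration of why a defaulted trait method is non-breaking, here is a small sketch. Only the names `Extension`, `on_session_disposed`, and `DisposeReason::ConnectionClosed` come from this PR; the method's parameters, the `Explicit` variant, `Legacy`, `SessionCache`, and `notify_disposed` are assumptions made up for the example.

```rust
use std::collections::HashMap;

/// Why a session went away. Only ConnectionClosed is named in the PR;
/// Explicit is a hypothetical placeholder variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisposeReason {
    ConnectionClosed,
    Explicit,
}

pub trait Extension {
    /// Additive hook: the default body is a no-op, so impls written
    /// before the hook existed compile unchanged. Signature is assumed.
    fn on_session_disposed(&mut self, _session_id: u64, _reason: DisposeReason) {}
}

/// An extension that predates the hook and does not override it.
pub struct Legacy;

impl Extension for Legacy {}

/// Hypothetical extension that keeps per-session state and frees it in the hook.
#[derive(Default)]
pub struct SessionCache {
    pub sessions: HashMap<u64, String>,
}

impl Extension for SessionCache {
    fn on_session_disposed(&mut self, session_id: u64, _reason: DisposeReason) {
        self.sessions.remove(&session_id);
    }
}

/// Host side (assumed shape): fire the hook only for client disconnects.
pub fn notify_disposed(
    exts: &mut [&mut dyn Extension],
    session_id: u64,
    reason: DisposeReason,
) {
    if reason == DisposeReason::ConnectionClosed {
        for ext in exts.iter_mut() {
            ext.on_session_disposed(session_id, reason);
        }
    }
}
```

`Legacy` never mentions the hook and still implements the trait, which is the "no break" property; `SessionCache` shows the kind of override agent-os could add once the hook is published.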

Trust model

All twelve are reachable through ordinary guest execution or session/VM/process churn, so they are in scope under the sidecar↔executor boundary. H2/M3 and M1 are guest-amplifiable (timer-thread exhaustion; context-slot exhaustion), and capping the previously uncapped timer delay also tightens "limits bounded by default".
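
To make the H2/M3 mitigation concrete, the following is a rough sketch of a delay cap combined with a generation check. The constant name MAX_TIMER_DELAY_MS appears in the follow-up commit, but its value here is an assumption, and `Timers`, `capped_delay`, `schedule`, and `clear_all` are hypothetical names rather than the runtime's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Name taken from the follow-up commit; the value is an assumption.
pub const MAX_TIMER_DELAY_MS: u64 = 60_000;

/// Clamp a guest-supplied delay so one timer cannot park a thread indefinitely.
pub fn capped_delay(requested_ms: u64) -> Duration {
    Duration::from_millis(requested_ms.min(MAX_TIMER_DELAY_MS))
}

#[derive(Default)]
pub struct Timers {
    /// Bumped on teardown; timers compare against the value they were born with.
    generation: Arc<AtomicU64>,
}

impl Timers {
    /// Spawn a timer thread that remembers the generation it was created in.
    /// The thread returns true if the callback ran, false if it was suppressed.
    pub fn schedule<F>(&self, delay_ms: u64, on_fire: F) -> JoinHandle<bool>
    where
        F: FnOnce() + Send + 'static,
    {
        let generation = Arc::clone(&self.generation);
        let born = generation.load(Ordering::SeqCst);
        let delay = capped_delay(delay_ms);
        thread::spawn(move || {
            thread::sleep(delay);
            // Generation check: a teardown since scheduling suppresses the callback.
            if generation.load(Ordering::SeqCst) != born {
                return false;
            }
            on_fire();
            true
        })
    }

    /// Clear-on-teardown: bump the generation so outstanding timers become no-ops.
    pub fn clear_all(&self) {
        self.generation.fetch_add(1, Ordering::SeqCst);
    }
}
```

In this sketch a cancelled thread still sleeps out its delay before exiting, so the cap is what bounds how long a stale thread can linger; the generation check only guarantees that its callback never runs after teardown.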

Testing

cargo test per crate, all green: sidecar (dispose-lifecycle incl. error-path reclaim, M5 untrack, on_session_disposed-on-ConnectionClosed, L4 registry), sidecar-browser (H1), execution (timer cap + generation suppression), v8-runtime 113 (M1 finalize sweep, M2 reset-before-teardown — V8 tests subprocess-isolated), kernel (full-backlog id not consumed, unix+inet), vfs-core (L2 rollback, L3 delete_snapshot + gc). New V8 tests follow the crate's subprocess convention; no saturation tests added.
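
The kernel check listed above (full-backlog id not consumed) can be pictured with a simplified table like the one below. This is a sketch under assumptions, not kernel/socket_table.rs: `SocketTable`, `Listener`, `ConnectError`, the string addresses, and the string socket state are all invented for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum ConnectError {
    NoListener,
    BacklogFull,
}

/// Hypothetical listener record with a bounded accept queue.
struct Listener {
    backlog_limit: usize,
    pending: Vec<u64>, // ids of sockets waiting to be accepted
}

#[derive(Default)]
pub struct SocketTable {
    next_id: u64,
    sockets: HashMap<u64, &'static str>, // id -> state, simplified
    listeners: HashMap<String, Listener>,
}

impl SocketTable {
    pub fn listen(&mut self, addr: &str, backlog_limit: usize) {
        self.listeners.insert(
            addr.to_string(),
            Listener { backlog_limit, pending: Vec::new() },
        );
    }

    /// Number of socket entries currently held by the table.
    pub fn open_sockets(&self) -> usize {
        self.sockets.len()
    }

    pub fn connect(&mut self, addr: &str) -> Result<u64, ConnectError> {
        let listener = self
            .listeners
            .get_mut(addr)
            .ok_or(ConnectError::NoListener)?;
        // Check first: a rejected connect must not consume an id or an entry.
        if listener.pending.len() >= listener.backlog_limit {
            return Err(ConnectError::BacklogFull);
        }
        // Allocate only once the connection is certain to be queued.
        let id = self.next_id;
        self.next_id += 1;
        self.sockets.insert(id, "pending");
        listener.pending.push(id);
        Ok(id)
    }
}
```

A fast test then fills a backlog of one, asserts that the next connect fails with `BacklogFull`, and asserts that `open_sockets()` did not grow, which exercises the safeguard without saturating any real resource.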

@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-131 June 25, 2026 21:33 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-131 June 25, 2026 21:33 Destroyed
@NathanFlurry NathanFlurry changed the title from "fix: plug twelve runtime memory leaks (sidecar, kernel, VFS, V8)" to "fix: memory leaks" Jun 25, 2026
A leak audit of the VM runtime found tracking maps/registries populated on
create but released only on a happy-path teardown (skipped on error '?' or
never), plus per-timer threads with no cancellation. Each fix releases on
every termination path; each ships a fast safeguard-firing test.

sidecar:
- dispose_vm_internal removes per-VM tracking on every exit path (VM removed
  before the fallible teardown half; cleanup runs unconditionally, error
  surfaced after) and the dispose_session/remove_connection loops attempt
  every item and aggregate errors instead of '?'-ing out (H1).
- extension_process_output_buffers cleared on VM disposal, not just on
  successful handoff (M6).
- disposed sessions are now untracked from the stdio active_sessions set (M5).
- loopback-TLS endpoints remove their own registry entry on Drop instead of
  relying on the lazy retain() sweep (L4).
- new additive Extension::on_session_disposed hook, fired on
  DisposeReason::ConnectionClosed, so extensions (e.g. ACP) can free
  per-session state on client disconnect (enables agent-os H4).
sidecar-browser: vms/contexts maps cleared on every dispose path (H1).
execution: bridge/kernel timer threads are delay-capped and cancellable via a
  generation check + clear-on-teardown, so guest timers can't exhaust OS
  threads or outlive their session (H2, M3).
v8-runtime: VM_CONTEXTS slots evicted on context finalize, not only on error,
  so reused isolates don't hit MAX_VM_CONTEXTS (M1); pending promise-resolver
  Globals reset before the isolate is dropped on every run_event_loop exit
  (Shutdown/abort), fixing the leak and a V8 lifetime-contract violation (M2).
kernel: socket ids allocated only after the backlog check passes, so a
  full-backlog connect failure no longer consumes an id (M4).
vfs: rename rolls back staged snapshot entries on copy/rename failure (L2);
  InMemoryMetadataStore gains delete_snapshot() and a real gc() that reclaims
  unreferenced blocks (L3).

L1 (bounded, documented Box::into_raw isolate handle) intentionally left as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NathanFlurry NathanFlurry force-pushed the fix-runtime-memory-leaks branch from 98c634f to 28cd8da June 25, 2026 21:48
@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-131 June 25, 2026 21:48 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-131 June 25, 2026 21:48 Destroyed
@NathanFlurry NathanFlurry merged commit f183ed2 into main Jun 25, 2026
3 of 4 checks passed
NathanFlurry added a commit that referenced this pull request Jun 26, 2026
…sponses, fix service-test build

Fixes surfaced while syncing agent-os against latest secure-exec main:

1. limits: classify DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB (#129) and MAX_TIMER_DELAY_MS
   (#131) — both added without inventory entries, so limits_audit failed on main.
2. sidecar: accept_sidecar_response drops a stale sidecar_response with no matching
   pending request (UnmatchedResponse) or already completed (DuplicateResponse)
   instead of failing the whole sidecar — a per-VM callback can be answered by the
   host after that VM is disposed on the shared sidecar process. Real protocol
   violations stay fatal.
3. tests: re-export crate::EventSinkTransport into the source-included service test
   crate (#132 added the use in src/service.rs without the matching test re-export,
   breaking 'cargo test -p secure-exec-sidecar --test service' compilation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NathanFlurry added a commit that referenced this pull request Jun 26, 2026
…sponses, fix service-test build (#133)
