fix: memory leaks#131
Merged
Merged
Conversation
A leak audit of the VM runtime found tracking maps/registries populated on create but released only on a happy-path teardown (skipped on error '?' or never), plus per-timer threads with no cancellation. Each fix releases on every termination path; each ships a fast safeguard-firing test. sidecar: - dispose_vm_internal removes per-VM tracking on every exit path (VM removed before the fallible teardown half; cleanup runs unconditionally, error surfaced after) and the dispose_session/remove_connection loops attempt every item and aggregate errors instead of '?'-ing out (H1). - extension_process_output_buffers cleared on VM disposal, not just on successful handoff (M6). - disposed sessions are now untracked from the stdio active_sessions set (M5). - loopback-TLS endpoints remove their own registry entry on Drop instead of relying on the lazy retain() sweep (L4). - new additive Extension::on_session_disposed hook, fired on DisposeReason::ConnectionClosed, so extensions (e.g. ACP) can free per-session state on client disconnect (enables agent-os H4). sidecar-browser: vms/contexts maps cleared on every dispose path (H1). execution: bridge/kernel timer threads are delay-capped and cancellable via a generation check + clear-on-teardown, so guest timers can't exhaust OS threads or outlive their session (H2, M3). v8-runtime: VM_CONTEXTS slots evicted on context finalize, not only on error, so reused isolates don't hit MAX_VM_CONTEXTS (M1); pending promise-resolver Globals reset before the isolate is dropped on every run_event_loop exit (Shutdown/abort), fixing the leak and a V8 lifetime-contract violation (M2). kernel: socket ids allocated only after the backlog check passes, so a full-backlog connect failure no longer consumes an id (M4). vfs: rename rolls back staged snapshot entries on copy/rename failure (L2); InMemoryMetadataStore gains delete_snapshot() and a real gc() that reclaims unreferenced blocks (L3). L1 (bounded, documented Box::into_raw isolate handle) intentionally left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
98c634f to
28cd8da
Compare
NathanFlurry
added a commit
that referenced
this pull request
Jun 26, 2026
…sponses, fix service-test build Fixes surfaced while syncing agent-os against latest secure-exec main: 1. limits: classify DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB (#129) and MAX_TIMER_DELAY_MS (#131) — both added without inventory entries, so limits_audit failed on main. 2. sidecar: accept_sidecar_response drops a stale sidecar_response with no matching pending request (UnmatchedResponse) or already completed (DuplicateResponse) instead of failing the whole sidecar — a per-VM callback can be answered by the host after that VM is disposed on the shared sidecar process. Real protocol violations stay fatal. 3. tests: re-export crate::EventSinkTransport into the source-included service test crate (#132 added the use in src/service.rs without the matching test re-export, breaking 'cargo test -p secure-exec-sidecar --test service' compilation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NathanFlurry
added a commit
that referenced
this pull request
Jun 26, 2026
…sponses, fix service-test build (#133) Fixes surfaced while syncing agent-os against latest secure-exec main: 1. limits: classify DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB (#129) and MAX_TIMER_DELAY_MS (#131) — both added without inventory entries, so limits_audit failed on main. 2. sidecar: accept_sidecar_response drops a stale sidecar_response with no matching pending request (UnmatchedResponse) or already completed (DuplicateResponse) instead of failing the whole sidecar — a per-VM callback can be answered by the host after that VM is disposed on the shared sidecar process. Real protocol violations stay fatal. 3. tests: re-export crate::EventSinkTransport into the source-included service test crate (#132 added the use in src/service.rs without the matching test re-export, breaking 'cargo test -p secure-exec-sidecar --test service' compilation). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A memory-leak audit of the VM runtime found 12 leaks (the secure-exec half of a cross-repo audit; the agent-os half is rivet-dev/agentos#1535). Common root cause: tracking maps/registries are populated on create but released only on a happy-path teardown that is skipped on an error
?or never runs — plus per-timer threads with no cancellation. Each fix releases on every termination path and ships a fast, safeguard-firing test (no resource-saturation tests, per the testing rules).Fixes
sidecar/{vm,service}.rsdispose_vm_internalremoved per-VM tracking only after ~7 fallible?steps, and thedispose_session/remove_connectionloops?-out on the first failure, abandoning later VMs → VM removed before the fallible teardown half; cleanup runs unconditionally (error surfaced after); loops attempt-all + aggregatesidecar/{service,vm}.rsextension_process_output_bufferscleared only on successful handoff → also cleared on VM disposalsidecar/{service,stdio}.rsactive_sessionsset never shrank → disposed sessions are now untracked (also stops the ~250µs event-pump from iterating dead entries)sidecar/{execution,state}.rsWeakentries until the next lazyretain()→ endpoints remove their own entry onDrop(guarded byptr_eq+ last-peer check)sidecar-browser/service.rsvms/contexts(each holding aBrowserKernel) → cleared on every dispose pathexecution/javascript.rs_scheduleTimer/ kernel timers spawned untracked OS threads with uncapped delay (thread-exhaustion DoS, Arc pinning) → delay capped + cancellable via generation check + clear-on-teardownv8-runtime/bridge.rsVM_CONTEXTSslot freed only on the error path → evicted on context finalize; reused isolates no longer creep toMAX_VM_CONTEXTSv8-runtime/session.rsv8::Globals could drop after their isolate onrun_event_loopearly-exit (leak and a V8 lifetime-contract violation) → reset before isolate teardown on every exit pathkernel/socket_table.rsvfs/posix/overlay_fs.rsvfs/engine/mem/metadata_store.rssnapshotsgrew forever (no delete,gc()was a no-op) →delete_snapshot()+ a realgc()that reclaims unreferenced blocksL1 (the bounded, documented
Box::into_rawisolate-handle, ~5 KB, reclaimed at process exit) is intentionally left as-is.Cross-repo note (enables agent-os H4)
This adds an additive
Extension::on_session_disposedhook fired onDisposeReason::ConnectionClosed. agent-os's ACP extension already has a readycleanup_sessions_for_connection; once this lands and is published, agent-os can override the hook to fully close its H4 (ACP sessions leaking on client disconnect). Additive default = no break for existingExtensionimpls.Trust model
All are reachable through ordinary guest execution or session/VM/process churn (in-scope under the sidecar↔executor boundary). H2/M3 and M1 are guest-amplifiable (timer-thread exhaustion; context-slot exhaustion) and also tighten "limits bounded by default" (the previously uncapped timer delay).
Testing
cargo testper crate, all green: sidecar (dispose-lifecycle incl. error-path reclaim, M5 untrack, on_session_disposed-on-ConnectionClosed, L4 registry), sidecar-browser (H1), execution (timer cap + generation suppression), v8-runtime 113 (M1 finalize sweep, M2 reset-before-teardown — V8 tests subprocess-isolated), kernel (full-backlog id not consumed, unix+inet), vfs-core (L2 rollback, L3 delete_snapshot + gc). New V8 tests follow the crate's subprocess convention; no saturation tests added.