fix: memory leaks#1535
Merged
Merged
Conversation
6b41702 to
5987ce8
Compare
5987ce8 to
b6ff3a2
Compare
Member
Author
|
Ran a multi-agent self-review over the diff (5 reviewers: Rust client, Rust sidecar/H4, TS process maps, TS layers, test quality). Fixes applied in this push:
Reviewers judged the layers fix and the Rust client drain/abort logic sound. Two acknowledged limitations remain documented in the PR: H4's connection-disconnect path needs a per-connection teardown hook from secure-exec (separate repo), and a couple of unit tests can't reach a live transport. Gate: |
|
🚅 Deployed to the agentos-pr-1535 environment in agentos
🚅 Deployed to the agentos-pr-1535 environment in rivet-frontend
|
b6ff3a2 to
01c130a
Compare
01c130a to
2783747
Compare
2783747 to
856932c
Compare
856932c to
261039e
Compare
Tracking maps and spawned tasks were populated on create/spawn but never released on the resource's natural end-of-life, growing unbounded with ordinary process/session/VM churn over a long-running sidecar. Each fix frees on every termination path and ships a reproducing test. - rpc-client.ts: release trackedProcesses/trackedProcessesById + listener sets after trailing output drains; delete signalStates at the finishProcess chokepoint; clear all tracking maps + localMounts in dispose(). (H6, M8, H7) - agent-os.ts: delete _processes entry after the exit handler; clear in queryable per the process-table contract). (H5) - layers.ts: LayerStore.dispose() clears the in-memory layers map; drop sealed layers' filesystem payload at seal time. (H7) - client/process.rs + agent_os.rs: store output-callback JoinHandles and abort+drain the registry in shutdown(); track + abort the ACP event-pump task in shutdown(). (H3, M7) - agentos-sidecar/acp_extension.rs: evict an AcpSessionRecord on adapter process-exit, clear the sessions map in on_dispose(), bound stdout_buffer, and — completing H4 — override the on_session_disposed hook to evict a connection's sessions when the client disconnects without close_session. (H4) Pins secure-exec to preview 0.0.0-main.f183ed2 (the merged secure-exec leak fixes + the on_session_disposed connection-teardown hook H4 depends on). ci.yml: run `secure-exec-dep.mjs prepare-build` before the cargo steps so PR CI builds the Rust crates against the pinned secure-exec (clone-at-sha for preview pins, no-op for releases) — matching publish.yaml. Without this, a preview pin's unreleased crate API (the new Extension hook) isn't visible to PR CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
261039e to
1bdc38c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A memory-leak audit of agent-os + secure-exec surfaced 20 confirmed leaks. This PR fixes the 7 that live in agent-os (the secure-exec findings are a separate repo → follow-up PR). All seven were re-verified still-present on current
mainbefore fixing.Common root cause: tracking maps / spawned tasks are populated on create/spawn but released only on an explicit teardown call that may never happen (or never at all), so they grow unbounded with ordinary process / session / VM churn over a long-running sidecar. Each fix frees on every termination path and ships a reproducing test (fails pre-fix, passes post-fix).
Fixes
packages/core/src/sidecar/rpc-client.tstrackedProcesses/trackedProcessesById+onStdout/onStderrlistener sets never cleared → cleared infinishProcess()(the single exit chokepoint) and indispose()packages/core/src/sidecar/rpc-client.tssignalStatesnever deleted (siblingsignalRefresheswas) → delete onprocess_exited, clear indispose()packages/core/src/layers.ts(+localMountsin rpc-client)LayerStorehad no disposal; sealed layers retained fullFilesystemSnapshotExport→ addLayerStore.dispose()to clear the map; drop the filesystem payload at seal time; clearlocalMountsindispose()packages/core/src/agent-os.ts_processesmap deleted on no path;dispose()cleared_shells/_acpTerminalsbut not_processes→ delete after the exit handler, clear indispose()crates/client/src/process.rs,agent_os.rsSenderand awaited aClosedthat never fired → store theJoinHandles onProcessEntry, anddrain+abort the registry inshutdown()(drops the retained senders so the channel closes)crates/client/src/agent_os.rsJoinHandle;shutdown()never aborted it → store the handle, abort it inshutdown()(mirrorspending_shell_exits)crates/agentos-sidecar/src/acp_extension.rsAcpSessionRecordremoved only by explicitclose_session()→ evict on adapter process-exit, clear the map inon_dispose(), capstdout_buffer, and addcleanup_sessions_for_connection()Note on H4 (partial)
The strongest real-time path — client disconnect without
close_session— can't be fully closed from inside the crate: the pinned secure-exec host (v0.3.1) exposes no per-connection teardown callback to extensions. Shipped here: process-exit eviction (fires on agent exit),on_disposeclear-all (fires on extension teardown), and an always-onstdout_buffercap.cleanup_sessions_for_connection()is implemented and test-covered, ready to wire when secure-exec adds the hook.Trust model
All seven are reachable through ordinary guest execution / session / VM / process churn — in-scope under the sidecar↔executor boundary; none require malicious client config.
Testing
cargo test -p agentos-client— 2 new leak tests (registry drained + tasks aborted; tracked handle aborted) ✅cargo test -p agentos-sidecar acp_extension— 3 new tests (connection-teardown eviction, stdout cap, adapter-exit detection) ✅vitestcore leak suites — 6 tests across the 3 new files ✅cargo build(both crates),tsc --noEmit, and siblinglayers.test.tsregression all green.🤖 Generated with Claude Code