
fix: memory leaks #1535

Merged
NathanFlurry merged 1 commit into main from fix-core-memory-leaks
Jun 25, 2026

@NathanFlurry

Member

Summary

A memory-leak audit of agent-os + secure-exec surfaced 20 confirmed leaks. This PR fixes the 7 that live in agent-os (the secure-exec findings are a separate repo → follow-up PR). All seven were re-verified still-present on current main before fixing.

Common root cause: tracking maps / spawned tasks are populated on create/spawn but released only on an explicit teardown call that may never happen (or are never released at all), so they grow unbounded with ordinary process / session / VM churn over a long-running sidecar. Each fix frees on every termination path and ships a reproducing test (fails pre-fix, passes post-fix).
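
To illustrate the pattern behind "frees on every termination path", here is a minimal sketch of a tracking map with a single exit chokepoint. `ProcessRegistry`, `ExitReason`, and the method names are hypothetical and chosen only for illustration; they are not the actual agent-os types.

```typescript
// Hypothetical names throughout; a sketch of the pattern, not the agent-os code.
type ExitReason = "exited" | "pump-error" | "background-error";

interface TrackedProcess {
  pid: number;
  stdoutListeners: Set<(chunk: string) => void>;
}

class ProcessRegistry {
  private readonly tracked = new Map<number, TrackedProcess>();

  // Populate the map on spawn.
  spawn(pid: number): TrackedProcess {
    const entry: TrackedProcess = { pid, stdoutListeners: new Set() };
    this.tracked.set(pid, entry);
    return entry;
  }

  // Single exit chokepoint: every termination path calls this, so the
  // entry is released no matter why the process ended.
  finish(pid: number, _reason: ExitReason): void {
    const entry = this.tracked.get(pid);
    if (!entry) return; // already released, so finish() is idempotent
    entry.stdoutListeners.clear();
    this.tracked.delete(pid);
  }

  // Teardown releases whatever is still live.
  dispose(): void {
    this.tracked.forEach((entry) => entry.stdoutListeners.clear());
    this.tracked.clear();
  }

  get size(): number {
    return this.tracked.size;
  }
}
```

With this shape, a leak test can churn many processes through different exit reasons and assert that `size` returns to zero.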

Fixes

| ID | File | Leak → Fix |
| --- | --- | --- |
| H6 | packages/core/src/sidecar/rpc-client.ts | trackedProcesses/trackedProcessesById + onStdout/onStderr listener sets never cleared → cleared in finishProcess() (the single exit chokepoint) and in dispose() |
| M8 | packages/core/src/sidecar/rpc-client.ts | signalStates never deleted (sibling signalRefreshes was) → delete on process_exited, clear in dispose() |
| H7 | packages/core/src/layers.ts (+ localMounts in rpc-client) | in-memory LayerStore had no disposal; sealed layers retained full FilesystemSnapshotExport → add LayerStore.dispose() to clear the map; drop the filesystem payload at seal time; clear localMounts in dispose() |
| H5 | packages/core/src/agent-os.ts | _processes map deleted on no path; dispose() cleared _shells/_acpTerminals but not _processes → delete after the exit handler, clear in dispose() |
| H3 | crates/client/src/process.rs, agent_os.rs | per-process output-callback tasks held the broadcast Sender and awaited a Closed that never fired → store the JoinHandles on ProcessEntry, and drain+abort the registry in shutdown() (drops the retained senders so the channel closes) |
| M7 | crates/client/src/agent_os.rs | ACP event-pump task spawned without storing its JoinHandle; shutdown() never aborted it → store the handle, abort it in shutdown() (mirrors pending_shell_exits) |
| H4 | crates/agentos-sidecar/src/acp_extension.rs | AcpSessionRecord removed only by explicit close_session() → evict on adapter process-exit, clear the map in on_dispose(), cap stdout_buffer, and add cleanup_sessions_for_connection() |
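
As a rough illustration of the H7 idea (drop the heavy payload at seal time, clear the map on disposal), the sketch below uses a hypothetical `InMemoryLayerStore`. The type names, the `seal()` return value, and the payload shape are assumptions made for the example and may differ from the real `LayerStore` in layers.ts.

```typescript
// Hypothetical stand-in for a filesystem snapshot payload.
interface FilesystemPayload {
  files: Map<string, string>;
}

interface Layer {
  id: string;
  sealed: boolean;
  payload: FilesystemPayload | null; // null once the layer is sealed
}

class InMemoryLayerStore {
  private readonly layers = new Map<string, Layer>();

  create(id: string, payload: FilesystemPayload): void {
    this.layers.set(id, { id, sealed: false, payload });
  }

  // Sealing hands the payload to the caller and drops the store's own
  // reference, so a sealed layer keeps only lightweight metadata.
  seal(id: string): FilesystemPayload {
    const layer = this.layers.get(id);
    if (!layer || layer.sealed || layer.payload === null) {
      throw new Error(`layer ${id} cannot be sealed`);
    }
    const payload = layer.payload;
    layer.sealed = true;
    layer.payload = null;
    return payload;
  }

  holdsPayload(id: string): boolean {
    const layer = this.layers.get(id);
    return layer !== undefined && layer.payload !== null;
  }

  // Explicit disposal releases every remaining layer.
  dispose(): void {
    this.layers.clear();
  }

  get size(): number {
    return this.layers.size;
  }
}
```

In this sketch the caller becomes the sole owner of the sealed payload, which is one way to make sure the store does not pin it for the lifetime of the process.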

Note on H4 (partial)

The strongest real-time path — client disconnect without close_session — can't be fully closed from inside the crate: the pinned secure-exec host (v0.3.1) exposes no per-connection teardown callback to extensions. Shipped here: process-exit eviction (fires on agent exit), on_dispose clear-all (fires on extension teardown), and an always-on stdout_buffer cap. cleanup_sessions_for_connection() is implemented and test-covered, ready to wire when secure-exec adds the hook.
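
The connection-scoped eviction described above could look roughly like the following. This is a TypeScript analogue written for illustration only: the real code lives in the Rust crate, and `SessionTable`, `SessionRecord`, and the method names here are hypothetical.

```typescript
// TypeScript analogue of the Rust session map; all names are hypothetical.
interface SessionRecord {
  sessionId: string;
  connectionId: string;
}

class SessionTable {
  private readonly sessions = new Map<string, SessionRecord>();

  open(sessionId: string, connectionId: string): void {
    this.sessions.set(sessionId, { sessionId, connectionId });
  }

  // Explicit close: the only removal path before the fix.
  closeSession(sessionId: string): boolean {
    return this.sessions.delete(sessionId);
  }

  // Evict every session owned by one connection, for the case where the
  // client disconnects without closing its sessions. Returns the count.
  cleanupSessionsForConnection(connectionId: string): number {
    const doomed: string[] = [];
    this.sessions.forEach((record, id) => {
      if (record.connectionId === connectionId) doomed.push(id);
    });
    doomed.forEach((id) => this.sessions.delete(id));
    return doomed.length;
  }

  // Clear-all, analogous to what a dispose hook would do.
  clear(): void {
    this.sessions.clear();
  }

  get size(): number {
    return this.sessions.size;
  }
}
```

Until the host exposes a per-connection teardown callback, a function like this can be unit-tested in isolation but has no caller on the disconnect path.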

Trust model

All seven are reachable through ordinary guest execution / session / VM / process churn — in-scope under the sidecar↔executor boundary; none require malicious client config.

Testing

  • cargo test -p agentos-client — 2 new leak tests (registry drained + tasks aborted; tracked handle aborted) ✅
  • cargo test -p agentos-sidecar acp_extension — 3 new tests (connection-teardown eviction, stdout cap, adapter-exit detection) ✅
  • vitest core leak suites — 6 tests across the 3 new files ✅
  • Gate: cargo build (both crates), tsc --noEmit, and sibling layers.test.ts regression all green.

🤖 Generated with Claude Code

@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 20:36 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 20:36 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 6b41702 to 5987ce8 Compare June 25, 2026 21:01
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 21:01 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 21:01 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 5987ce8 to b6ff3a2 Compare June 25, 2026 21:01
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 21:01 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 21:01 Destroyed
@NathanFlurry

Member Author

Ran a multi-agent self-review over the diff (5 reviewers: Rust client, Rust sidecar/H4, TS process maps, TS layers, test quality). Fixes applied in this push:

  • Blocker (TS): finishProcess() was releasing process tracking synchronously, which defeated drainTrailingProcessOutput() and dropped trailing guest output (notably on the snapshot-fallback fast-exit path). Release is now deferred until the drain window closes (releaseProcessTrackingAfterDrain), keeping the entry + listeners alive for late process_output events.
  • M8 completeness: moved the signalStates/signalRefreshes deletes out of the process_exited branch into the single finishProcess chokepoint, so non-process_exited exit paths (pump-error, background-error, snapshot-fallback) also release them.
  • H4 robustness: the adapter-exit error producer now embeds ADAPTER_EXITED_ERROR_MARKER directly, so the eviction predicate can't silently regress if the message wording changes.
  • Tests strengthened: added a test for the wired on_dispose() session clear; coupled the adapter-exit predicate test to the real producer string; added a multibyte-UTF-8 cap_stdout_buffer case; asserted M7 pump-handle None→Some→None wiring and H3 output_tasks capture; made the layer-store seal test prove the payload moved into the snapshot rather than asserting only the validity guard.
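
The blocker fix in the first bullet (defer release until the drain window closes) can be sketched as below. `DrainingTracker` and its methods are hypothetical; in particular, what closes the drain window in the real code is not shown here, so the sketch exposes it as an explicit `releaseAfterDrain()` call.

```typescript
type OutputListener = (chunk: string) => void;

// Hypothetical sketch: the exit is recorded first, the release happens later.
class DrainingTracker {
  private readonly listeners = new Map<number, Set<OutputListener>>();
  private readonly draining = new Set<number>();

  track(pid: number, listener: OutputListener): void {
    const set = this.listeners.get(pid) ?? new Set<OutputListener>();
    set.add(listener);
    this.listeners.set(pid, set);
  }

  // Exit observed: mark the process as draining, but keep the entry and
  // its listeners alive so trailing output is not dropped.
  finishProcess(pid: number): void {
    if (this.listeners.has(pid)) this.draining.add(pid);
  }

  // Late output still reaches listeners while the entry exists.
  // Returns false once the entry has been released.
  deliverOutput(pid: number, chunk: string): boolean {
    const set = this.listeners.get(pid);
    if (!set) return false;
    set.forEach((listener) => listener(chunk));
    return true;
  }

  // Called when the drain window closes; only now is tracking released.
  releaseAfterDrain(pid: number): void {
    if (!this.draining.delete(pid)) return; // not finished, or already released
    this.listeners.get(pid)?.clear();
    this.listeners.delete(pid);
  }

  isTracked(pid: number): boolean {
    return this.listeners.has(pid);
  }
}
```

Releasing synchronously inside `finishProcess()` would make the `deliverOutput()` call for trailing output return false, which is the regression the self-review caught.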

Reviewers judged the layers fix and the Rust client drain/abort logic sound. Two acknowledged limitations remain documented in the PR: H4's connection-disconnect path needs a per-connection teardown hook from secure-exec (separate repo), and a couple of unit tests can't reach a live transport.

Gate: cargo test -p agentos-client (47) + -p agentos-sidecar acp_extension (19) + core vitest leak suites (8) all green; tsc --noEmit clean.

@NathanFlurry NathanFlurry changed the title from "fix(core): plug seven memory leaks in session/process lifecycle" to "fix: memory leaks" Jun 25, 2026
@railway-app

railway-app Bot commented Jun 25, 2026


🚅 Deployed to the agentos-pr-1535 environment in agentos

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| agentos | ✅ Success (View Logs) | Web | Jun 25, 2026 at 11:09 pm |

🚅 Deployed to the agentos-pr-1535 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| agent-os | 😴 Sleeping (View Logs) | | Jun 25, 2026 at 11:18 pm |

@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from b6ff3a2 to 01c130a Compare June 25, 2026 21:49
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 21:49 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 21:49 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 01c130a to 2783747 Compare June 25, 2026 22:26
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 22:26 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 22:26 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 2783747 to 856932c Compare June 25, 2026 22:45
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 22:45 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 22:45 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 856932c to 261039e Compare June 25, 2026 22:54
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 22:54 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 22:54 Destroyed
Tracking maps and spawned tasks were populated on create/spawn but never
released on the resource's natural end-of-life, growing unbounded with
ordinary process/session/VM churn over a long-running sidecar. Each fix
frees on every termination path and ships a reproducing test.

- rpc-client.ts: release trackedProcesses/trackedProcessesById + listener
  sets after trailing output drains; delete signalStates at the finishProcess
  chokepoint; clear all tracking maps + localMounts in dispose(). (H6, M8, H7)
- agent-os.ts: delete _processes entry after the exit handler; clear in
  dispose() ([...] queryable per the process-table contract). (H5)
- layers.ts: LayerStore.dispose() clears the in-memory layers map; drop sealed
  layers' filesystem payload at seal time. (H7)
- client/process.rs + agent_os.rs: store output-callback JoinHandles and
  abort+drain the registry in shutdown(); track + abort the ACP event-pump
  task in shutdown(). (H3, M7)
- agentos-sidecar/acp_extension.rs: evict an AcpSessionRecord on adapter
  process-exit, clear the sessions map in on_dispose(), bound stdout_buffer,
  and — completing H4 — override the on_session_disposed hook to evict a
  connection's sessions when the client disconnects without close_session. (H4)

Pins secure-exec to preview 0.0.0-main.f183ed2 (the merged secure-exec leak
fixes + the on_session_disposed connection-teardown hook H4 depends on).

ci.yml: run `secure-exec-dep.mjs prepare-build` before the cargo steps so PR CI
builds the Rust crates against the pinned secure-exec (clone-at-sha for preview
pins, no-op for releases) — matching publish.yaml. Without this, a preview pin's
unreleased crate API (the new Extension hook) isn't visible to PR CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NathanFlurry NathanFlurry force-pushed the fix-core-memory-leaks branch from 261039e to 1bdc38c Compare June 25, 2026 23:09
@railway-app railway-app Bot temporarily deployed to rivet-frontend / agentos-pr-1535 June 25, 2026 23:09 Destroyed
@railway-app railway-app Bot temporarily deployed to agentos / agentos-pr-1535 June 25, 2026 23:09 Destroyed
@NathanFlurry NathanFlurry merged commit c721d61 into main Jun 25, 2026
3 of 4 checks passed
@NathanFlurry NathanFlurry deleted the fix-core-memory-leaks branch June 25, 2026 23:28