feat(agents): context compaction (budget notices, mid-turn floor, background compaction)#4605
feat(agents): context compaction (budget notices, mid-turn floor, background compaction)#4605kevin-dp wants to merge 30 commits into
Conversation
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4605 +/- ##
==========================================
+ Coverage 57.56% 57.89% +0.32%
==========================================
Files 342 350 +8
Lines 39993 40644 +651
Branches 11633 11828 +195
==========================================
+ Hits 23023 23529 +506
- Misses 16933 17078 +145
Partials 37 37
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Electric Agents Mobile BuildLocal mobile checks ran for commit The EAS Android preview build was skipped because the |
🤖 Automated review — context compactionGenerated by a review agent. Severity-ranked; cite Findings[high] — mid-turn checkpoint over-covers and drops the kept tail on the next reconstruction — [low] — an orphaned [low] — stale "95%" comments — [nit] — Verified correct
Test gaps
OverallThe architecture is sound and unusually well-commented; the background lifecycle, supersession, watermark ordering, and timeout race are carefully designed and well-tested. The one finding I'd block on is the mid-turn over-coverage ( |
…4605] The mid-turn (sync floor) checkpoint was persisted with no attrs.watermark, so reconstruction fell back to the checkpoint row's own (latest) stream position and, on the next turn, dropped every item before it — including the verbatim tail the mid-turn summary deliberately excluded (keepTail). Those recent messages were in neither the summary nor the kept set, so they were silently lost on the 90% path. Fix: the mid-turn compactor now folds the WHOLE context into the summary (Codex-style — no verbatim pre-compaction tail), and the checkpoint is stamped with watermark = current timeline head. Summary and watermark now agree, so reconstruction folds exactly what was summarized and keeps everything appended afterward (the model's post-compaction output + the next prompt). This also removes the keepTail + tool-pair orphan-guard complexity. Recent context is still shown after a compaction via the sticky view (summary + messages appended since), so within-turn coherence is preserved. Also drop the stale "95%" hard-ceiling comments (it has been 90% since the ceiling was lowered to match Codex). Tests: rewrote compaction-midturn for the summarize-everything behavior; added a reconstruction test asserting post-compaction messages survive a mid-turn checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the review (8ce00aa): [high] mid-turn tail-loss — fixed. The mid-turn checkpoint was persisted with no [low] stale "95%" comments — fixed (hard ceiling is 90%). [low] orphaned Full runtime suite green (tsc clean). |
…#4605] A summarize is bounded by a ~120s hard timeout after which a terminal (complete/failed) checkpoint is always written, so a `running` checkpoint that lingers well past that is orphaned — its process crashed before writing the terminal row, and (with watermark-unique background ids) nothing supersedes it, pinning the spinner forever. Treat a `running` checkpoint older than 150s as stale and stop showing it, with a self-clearing timer so the spinner disappears even when no further events arrive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…4605] `pendingRequestMessageCount` records the incoming (uncompacted) message count, not the compacted list the adapter may return. That's intentional: pi-agent passes transformContext the full conversation each step, so the count indexes that original array — the next step's trailing slice then measures exactly the messages appended since. Add a comment so it doesn't read as a bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Circled back and addressed the remaining two as well, each in its own commit:
|
Claude Code ReviewSummary Adds context compaction to the agents runtime: a context-window usage gauge, model-facing budget notices, oversized tool-output truncation, a synchronous mid-turn compaction floor at the 90% hard ceiling, and non-blocking background (turn-end) compaction — all built on the event-sourced timeline as durable What's Working Well (Unchanged since iteration 6 — re-verified intact.)
Review of Changes Since Iteration 6
Issues Found Critical (Must Fix): None. Important (Should Fix): None. Suggestions (Nice to Have):
Issue Conformance No linked issue — per project convention a warning, substantially mitigated by an unusually detailed PR description and changeset that enumerate scope. Implementation matches the described scope (modulo the stale Previous Review Status Incremental review (iteration 7). All items raised across iterations 1–6 remain resolved. The sole change since iteration 6 is the test-fixture typecheck fix described above; the compaction logic is unchanged. Codecov reports all tests successful. The one cosmetic doc nit (stale Review iteration: 7 | 2026-06-18 |
#4605] The mid-turn checkpoint computed its watermark (timeline head) when the `complete` row was written — after the summarize await. Any event that materialized into the StreamDB during that (slow) await would bump the head past what the summary actually covered, so reconstruction could drop an un-summarized item next turn (item.at <= watermark). Narrow (mid-turn blocks the agent, and pending inbox rows don't materialize) but timing-dependent. Snapshot the head when the `running` row is written instead — before the await — so coverage and watermark derive from the same instant, matching how background compaction already captures its head up front. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eview #4605] The orphan-clearing logic added to CompactionIndicator had no test, and a regression where rows stop carrying `timestamp` would silently revert to the lingering-spinner bug (NaN → not orphaned → spinner stays). This package has no React-render harness, so extract the decision into a pure `isRunningCheckpointOrphaned(timestamp, now)` helper (with STALE_RUNNING_MS) in lib/ and unit-test it: fresh → shown, just under the deadline → shown, at/past the deadline → hidden, and missing/unparseable timestamp → shown (documented). The component now imports the helper, so the tested logic is exactly what runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the careful review. Went through each point: Addressed (one commit each):
Declined (with rationale):
Full runtime suite green; UI typechecks + the new helper test pass. |
Foundation for context compaction: measure and surface how full the model's context window is, with no behavior change yet. - pi-adapter: capture cache-INCLUSIVE prompt size (input + cacheRead + cacheWrite) and the model context window per step. The existing input_tokens deliberately excludes cache reads for budget accounting, but cached tokens still occupy the window, so a fullness gauge needs the inclusive total. - persist context_input_tokens + context_window on the step row (optional/additive) via the outbound bridge. - token-accountant: single source of truth for usage ratio, severity level, and the compaction thresholds (85% background / 95% ceiling). - UI: ContextUsageIndicator renders "NN% used" in the composer footer from the same helper, coloured at the 85/95 thresholds. Observational only — nothing compacts yet; this validates the token accounting before later phases act on it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ndow in pi-adapter The bridge persistence and the usage-ratio helper were covered, but the adapter's cache-INCLUSIVE total (input + cacheRead + cacheWrite) — the accuracy premise of the context gauge — was not. Assert it equals 1350 where the uncached input_tokens is 150, and that context_window is emitted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Once context usage reaches 25%, inject a <token_budget> notice into the model's messages stating remaining tokens + percent, so the model can pace itself. Recomputed each call from the latest step's persisted usage, so it is always current. - token-accountant: selectLatestContextUsage (latest step with usage), shouldSurfaceContextBudget (gate at 25%), formatContextBudgetNotice, and withContextBudgetNotice (inject just before the final message). - context-factory/runAgent: synthesize and inject before the model call. Synthesized (not persisted) on purpose: a self-superseding context row would leave load_context_history tombstones, which are misleading for an ephemeral budget hint. Synthesis stays deterministic (pure function of persisted steps) so replay/fork reproduce it, with no row churn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive a real agent run with a capturing streamFn and assert the <token_budget> notice reaches the model's context, gated on a seeded step's usage: present at 80% usage (with correct "20k tokens (20%) remaining" wording), absent at 10%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…se 2) Cap any single tool_result at ~10k tokens, replacing the body with a visible "[Output truncated: ...]" placeholder before the model call, so one giant output can't fill the context window on its own. Preserves toolCallId/isError so tool-call pairing stays valid. Mirrors Codex's per-message truncation. Truncation is always explicit, never silent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Foundation for context compaction: a compaction checkpoint is a context_inserted row tagged attrs.kind="compaction". timelineMessages now treats the newest such checkpoint's order as a watermark — items before it are dropped (summarized away) and the checkpoint renders the summary in their place. No checkpoint -> watermark is -Infinity -> a strict no-op, so this is inert until the summarizer (next step) writes one. Adds compaction.ts with the checkpoint constants and the Codex summarization prompt + summary prefix (reused verbatim) for the summarization step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
summarizeMessages sends the full history + Codex's summarization prompt to the conversation's own model (a cheap small-window model would overflow a near-full context) and prefixes the result with Codex's summary preamble. The model call is injected via a `complete` seam (defaults to pi-ai completeSimple) so it is unit-testable without a network call or API key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wires the compaction engine into the model-call path. Before a turn, if the last step left context at/over 95% of the window (and the history is actually large), summarize it, persist a compaction checkpoint (context_inserted kind=compaction), and send only the summary this turn — the current ask still arrives via runInput, and future turns reconstruct from the checkpoint watermark. Failure degrades gracefully (logs and proceeds uncompacted). - AgentConfig.summarizeComplete: model-call seam for the summarizer (defaults to the conversation model; injected by tests). - guard on estimated history size avoids re-compacting an already compacted (small) history while the last step's usage reads stale-high. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CONTEXT_USAGE_HARD_CEILING 0.95 -> 0.90, matching Codex's auto-compaction threshold. Drives the synchronous compaction trigger and the UI gauge's "critical" colour. Background start stays at 85%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drives the real runtime end-to-end and makes a real Anthropic summarization call, asserting the returned summary carries the Codex prefix and retains key conversation facts. Skipped unless RUN_LIVE_COMPACTION=1 + LIVE_ANTHROPIC_API_KEY are set, so it never runs (or costs) in CI. Verified passing with claude-haiku-4-5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Its purpose (confirming a real model summarizes correctly through the compaction path) is done. Ongoing coverage lives in the deterministic stubbed tests (compaction-trigger, compaction-summarize, timeline-compaction); a live test pinning a model id + needing a paid key would only rot. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e/failed) Foundation for surfacing synchronous compaction in the UI. The trigger now writes a `running` checkpoint before summarizing and updates it to `complete` (with the summary) or `failed` after. Only a `complete` checkpoint acts as the reconstruction watermark, so an in-flight or crashed compaction never hides history; running/failed checkpoints are UI-only markers, skipped from the model context. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Shows a spinner + "Compacting context…" in the composer footer while a synchronous compaction is in flight, reading the latest compaction checkpoint row and showing it while attrs.status is "running" (clears on complete). Tells the user why the turn paused and that their next prompt is being queued. Mirrors the ContextUsageIndicator pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ELECTRIC_AGENTS_COMPACT_CEILING (0..1, default 0.9) and ELECTRIC_AGENTS_COMPACT_MIN_TOKENS (default contextWindow/2) let the synchronous compaction path be exercised without filling a real window. RFC §12 tunables; default behavior unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaces a completed compaction checkpoint as a collapsed, expandable "Context compacted" marker in the message history, at the point compaction happened. Adds a compaction custom timeline source (mirroring the comments source) reading compaction_summary context_inserted rows, a compaction row kind across the timeline dispatch, and a CompactionTimelineRow card (InlineEventCard, expand to view the summary). Running/failed checkpoints are filtered out — those are shown by the live composer indicator instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compaction now runs before EVERY model step (not just between turns), so a single turn that balloons across many tool calls can no longer exhaust the context window before the turn ends. Wired through pi-agent-core's transformContext hook. - compaction-midturn.ts: createMidTurnCompactor — folds older messages into a summary, returns [summary, ...recent tail], caches the summary for the rest of the turn (re-summarizing chained off the prior summary only if the tail grows back over the ceiling). - pi-adapter: Codex-style token signal — anchor on the last step's REAL cache-inclusive usage + estimate only the trailing items appended since, vs the model's real context window (not an estimate of the whole history). transformContext + initialContextTokens options added. - context-factory: build the compactor and pass it to the adapter; removed the per-turn pre-sampling compaction block it replaces. Reuses the summarizer, checkpoint lifecycle (running/complete/failed), and UI. summarizeAgentMessages summarizes the AgentMessage[] the hook receives without re-converting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The keepTail boundary could split a tool_call/tool_result pair — folding the assistant tool_use into the summary while keeping the matching tool_result in the tail. Anthropic rejects the orphaned tool_result (400 invalid_request_error: "tool_result must have a corresponding tool_use block in the previous message"). Advance the fold boundary past any leading tool-result messages so the kept tail starts on a fresh user/assistant turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Non-blocking compaction so the user almost never waits for the 90% sync floor. Self-contained so it can be reverted as one commit. - Trigger (process-wake): after a turn whose usage ≥ 85% (env ELECTRIC_AGENTS_COMPACT_BG_CEILING), kick off a DETACHED summarization. Its checkpoint is applied at the NEXT turn's start, OR — if the summarize finishes while the entity is idle — immediately by waking the idle loop and writing the checkpoint WITHOUT running the agent (so the indicator never lingers past completion). The slow summarize never blocks; a fast follow-up prompt just runs un-compacted. - context-factory: maybeStartBackgroundCompaction / writeBackgroundCheckpoint / failBackgroundCheckpoint on the handler-context result. Snapshots the timeline head as the watermark; summarizes the whole reconstructed history; writes a background-flavored running→complete checkpoint. - Unique checkpoint id per generation (compaction-bg-<watermark>): context supersession keys on id alone, so with one shared id the NEXT background's `running` row silently superseded the PREVIOUS `complete` one — erasing the watermark and undoing every compaction (context never shrank, the indicator stuck on "running"). running→complete→failed of one generation share the id; the next generation can't clobber it. Mid-turn sync keeps the shared id on purpose (its re-summarization chain wants supersession). - Reconstruction (timeline-context): checkpoints carry a stored attrs.watermark and the summary is rendered AT that watermark — so a prompt+answer that arrived while a background summary ran (physically after the checkpoint, logically after the watermark) are kept verbatim AFTER the summary (RFC §8.5). Falls back to the row's order for sync. - UI: CompactionIndicator shows a subtle "Compacting in background…" for background checkpoints, distinct from the blocking sync indicator. - Summarize hardening (compaction-summarize): bound every summarize call with a 120s deadline — the anthropic provider only enforces a timeout/ abort when the caller passes timeoutMs/signal and never retries, so a stalled stream hung forever. Forward timeoutMs + an AbortSignal the provider honours AND race a hard timer; a timeout becomes an ordinary failure retried next turn-end. The mid-turn 90% sync floor stays as the safety net for single runaway turns. To drop the feature: revert this commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `minTokens` guard (default 2000, env ELECTRIC_AGENTS_COMPACT_MIN_TOKENS) never changed the outcome at any realistic ceiling — 90% of a real context window is always far above 2000 tokens, so the floor only ever mattered when testing with an artificially low ceiling. Codex has no equivalent floor (it triggers on a single token threshold). Remove the knob, its env override, and the now-unused positiveFromEnv helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…4605] The mid-turn (sync floor) checkpoint was persisted with no attrs.watermark, so reconstruction fell back to the checkpoint row's own (latest) stream position and, on the next turn, dropped every item before it — including the verbatim tail the mid-turn summary deliberately excluded (keepTail). Those recent messages were in neither the summary nor the kept set, so they were silently lost on the 90% path. Fix: the mid-turn compactor now folds the WHOLE context into the summary (Codex-style — no verbatim pre-compaction tail), and the checkpoint is stamped with watermark = current timeline head. Summary and watermark now agree, so reconstruction folds exactly what was summarized and keeps everything appended afterward (the model's post-compaction output + the next prompt). This also removes the keepTail + tool-pair orphan-guard complexity. Recent context is still shown after a compaction via the sticky view (summary + messages appended since), so within-turn coherence is preserved. Also drop the stale "95%" hard-ceiling comments (it has been 90% since the ceiling was lowered to match Codex). Tests: rewrote compaction-midturn for the summarize-everything behavior; added a reconstruction test asserting post-compaction messages survive a mid-turn checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#4605] A summarize is bounded by a ~120s hard timeout after which a terminal (complete/failed) checkpoint is always written, so a `running` checkpoint that lingers well past that is orphaned — its process crashed before writing the terminal row, and (with watermark-unique background ids) nothing supersedes it, pinning the spinner forever. Treat a `running` checkpoint older than 150s as stale and stop showing it, with a self-clearing timer so the spinner disappears even when no further events arrive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…4605] `pendingRequestMessageCount` records the incoming (uncompacted) message count, not the compacted list the adapter may return. That's intentional: pi-agent passes transformContext the full conversation each step, so the count indexes that original array — the next step's trailing slice then measures exactly the messages appended since. Add a comment so it doesn't read as a bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#4605] The mid-turn checkpoint computed its watermark (timeline head) when the `complete` row was written — after the summarize await. Any event that materialized into the StreamDB during that (slow) await would bump the head past what the summary actually covered, so reconstruction could drop an un-summarized item next turn (item.at <= watermark). Narrow (mid-turn blocks the agent, and pending inbox rows don't materialize) but timing-dependent. Snapshot the head when the `running` row is written instead — before the await — so coverage and watermark derive from the same instant, matching how background compaction already captures its head up front. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eview #4605] The orphan-clearing logic added to CompactionIndicator had no test, and a regression where rows stop carrying `timestamp` would silently revert to the lingering-spinner bug (NaN → not orphaned → spinner stays). This package has no React-render harness, so extract the decision into a pure `isRunningCheckpointOrphaned(timestamp, now)` helper (with STALE_RUNNING_MS) in lib/ and unit-test it: fresh → shown, just under the deadline → shown, at/past the deadline → hidden, and missing/unparseable timestamp → shown (documented). The component now imports the helper, so the tested logic is exactly what runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The RFC isn't checked in, so comments can't reference it; phase numbers are internal sequencing, not something the code should narrate. Remove the dangling RFC/§/phase references and tighten the surrounding comments to be brief and only where they clarify something non-obvious. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the `for (;;)` + breaks with a `do…while (appliedBackgroundDuringIdle)` so the loop's actual condition (re-idle while background compactions keep settling) is explicit, and trim the over-long `pendingBackgroundCompaction` comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tighten the over-long comments in the compaction indicator helper + component and the reconstruction watermark block, dropping a redundant inline restatement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
df21a6e to
295d9ab
Compare
The runAgent budget-notice fixture predated main's required doUnobserve field on HandlerContextConfig; the rebase auto-merged without flagging it, breaking the agents-runtime typecheck.
Adds context compaction to the agents runtime, which previously only truncated at the window limit. Modelled on OpenAI Codex's summarization but adapted to our event-sourced timeline: a compaction checkpoint is a durable
context_insertedrow placed at a stored watermark, so history reconstruction folds everything up to the watermark into a summary.What's included
context_input_tokens+context_windowper step;ContextUsageIndicatorshows "X% used" in the composer footer.<token_budget>message into the model context at 25 / 50 / 75% usage (synthesized at the model-call seam, not persisted).tool_resultwith a placeholder.transformContexthook (so a single tool-heavy turn can't blow the window).Design notes
completecheckpoints act as a reconstruction watermark;running/failedare UI-only and never hide history (crash-safe).ELECTRIC_AGENTS_COMPACT_CEILING,ELECTRIC_AGENTS_COMPACT_BG_CEILING,ELECTRIC_AGENTS_COMPACT_MIN_TOKENS.Testing
Note
PR #4596 (context-usage ring gauge + breakdown) is stacked on this branch; re-target it to
mainafter this merges.🤖 Generated with Claude Code