feat(llamacpp): add explicit context_length override #5
Open
Scooter-DeJean wants to merge 98 commits into
Update dtolnay/rust-action/setup to dtolnay/rust-toolchain to use the current recommended action for Rust toolchain installation.
…nse bug)
Root cause: reinfer_and_dispatch hardcoded thinking=true for tool-chain continuation. After a tool executes (e.g. grep returns 39 chars), the model has already reasoned; re-enabling thinking causes Gemma 4 to immediately emit stop with no content (verified: cache_n=48933, prompt_n=88, 327ms).
This broke the book club verbatim reading flow:
grep (find chapter) → EMPTY RESPONSE → 35s recovery → file_read → analysis
Instead of:
grep (find chapter) → file_read → verbatim quote
Fix: thinking=false for tool-chain re-inference. The model already thought during initial inference.
Also fixes the mislabeled 'context overflow' error that fires at 19.7% usage (not an overflow, but the thinking-mode bug).
706 tests passing, zero warnings.
…n into reality
Root cause: when discussing autofiction where characters/systems share names
with real entities (Maria the user = Maria the character, ErnOS the system =
ErnOS in the novel), the model progressively merged fiction with reality
across conversational turns, adopting fictional missions as its own.
Four-layer fix:
1. Digest fiction header (attachment_reader.rs):
- build_digest now includes FICTION/REALITY PROTOCOL in footer
- Instructs model to refer to 'the character Maria' not 'Maria'
- Distinguishes fictional systems from real ones
2. Summariser [FICTION] prefix (attachment_reader.rs):
- summarise_page prompt now instructs LLM to prefix fictional
events with [FICTION] tags in page summaries
- Prevents downstream confusion in digest content
3. Observer Rule 20 — fiction_reality_collapse (observer.md):
- New audit rule detects: adopting fictional missions, treating
fictional events as real evidence, collapsing autofiction identities
- Exception for explicit role-play/creative collaboration
- Added to failure_category enum in observer schema
4. Core prompt Fiction/Reality Protocol (core.md):
- Positioned after Bounded Speculation (both address fact/fiction boundary)
- Instructs tracking of conversational drift from analysis to immersion
- Specific guidance for autofiction name collisions
Observer rule names updated: 19 → 20 rules.
706 tests passing, zero warnings.
…§2.4)
The observer bailout was a governance §2.4 violation: after 2-3 rejections, the system silently delivered an UNAUDITED response to the user. A cap on retries masks deeper bugs: if the model can't follow observer feedback, that's a real problem to investigate, not hide.
Changes across all 3 observer paths:
1. platform_ingest.rs (platform adapter path):
- Removed max_retries cap and handle_audit_bailout function
- Removed AuditSummary::exhausted (unreachable dead code)
- Loop now retries with feedback indefinitely until ALLOWED
2. ws_stream.rs (WebSocket L1 path):
- Removed max_retries cap and handle_bailout function
- Same uncapped loop until approved
3. ws_react.rs (WebSocket ReAct path):
- Removed the consecutive_rejections >= 2 bailout branch
- Always injects feedback and continues the loop
- Resets consecutive_rejections on approval
4. observer/mod.rs:
- Removed format_bailout_override function (dead code)
- Removed its test
- Updated module comment: 20 rules
Infrastructure errors (observer itself is down) still fail-open; that's not a quality issue, it's an availability issue.
705 tests passing (1 removed: test_format_bailout_override), zero warnings.
Root cause: data/embeddings.json is a 4,077,705-byte single-line file. codebase_search matched 'learning pipeline' inside it and returned the entire 4MB line as a search result, blowing the context window. The retry then failed because llama-server's KV cache matched 99.9% of the prompt (prompt_n=1 on retry), hitting the same empty-response state.
Three fixes:
1. Data artifact exclusion (codebase_search.rs):
- Skip embeddings.json, golden_buffer.jsonl, rejection_buffer.jsonl, quarantine.json, review_deck.json, training_manifest.json from search
- These are runtime data, not source code
2. Match context snippets (codebase_search.rs):
- Show 200 chars before/after the match instead of the full line
- Defense-in-depth against any future large-line file
- Full content available via file_read
3. KV cache escape (platform_reinfer.rs):
- Inject a recovery system message before retry to change the cache-visible prompt suffix, forcing actual token generation instead of a cache-hit empty response
710 tests passing (+5 new: context snippets, data artifact exclusion, long-line cap). Zero warnings.
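The match-context idea from fix 2 can be sketched as below. This is only an illustration of the technique: the function name match_snippet and its signature are hypothetical and not taken from codebase_search.rs, and the radius is counted in characters so the cut never lands inside a multi-byte UTF-8 sequence.

```rust
/// Returns the first match of `needle` in `line`, padded with up to `radius`
/// characters of context on each side. Returns `None` when nothing matches.
/// Hypothetical helper, not the project's actual implementation.
fn match_snippet<'a>(line: &'a str, needle: &str, radius: usize) -> Option<&'a str> {
    let start = line.find(needle)?;
    let end = start + needle.len();

    // Walk back `radius` characters (not bytes) from the match start.
    let from = line[..start]
        .char_indices()
        .rev()
        .take(radius)
        .last()
        .map(|(i, _)| i)
        .unwrap_or(start);

    // Walk forward `radius` characters from the match end.
    let to = line[end..]
        .char_indices()
        .nth(radius)
        .map(|(i, _)| end + i)
        .unwrap_or(line.len());

    Some(&line[from..to])
}
```

With a radius of 200, a match inside a multi-megabyte single-line file yields a snippet of at most 400 characters plus the match itself, while short lines are returned whole.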
- New: src/web/handlers/curriculum.rs — 5 API endpoints:
GET /api/curriculum (list courses with progress)
POST /api/curriculum (add course)
DELETE /api/curriculum/{id} (remove course)
GET /api/curriculum/{id}/progress (detailed progress)
GET /api/curriculum/review (review deck stats)
- New: curriculum_tests.rs — 8 unit tests (success + error paths)
per §3.1 and §3.2 governance mandates
- Updated: Training tab → 4 sub-tabs (Buffers, Curriculum, Review, Adapters)
Course cards with level badges, progress bars, completion %
Review tab with Leitner deck stats (due, total, retention rate)
- CSS: .course-card, .course-progress-track, .level-badge styles
- 718 tests pass, 0 regressions (642 unit + 76 e2e)
New curriculum_e2e module in e2e_tests.rs:
1. test_attend_class_full_pipeline: process_scene → verify → route to golden/quarantine buffers with MockProvider
2. test_attend_class_all_complete: full course completion with avg score
3. test_flush_session_to_shared_state: session buffers → shared Arc state
4. test_spaced_review_generates_cards: Leitner card generation from completed quiz scenes
5. test_curriculum_store_roundtrip: add/list/progress/remove/persistence
6. test_review_deck_stats: deck stats, add card, record result, due count
724 tests pass, 0 regressions (642 unit + 82 e2e)
core.md: 438 → 563 lines.
Previously undocumented systems:
Learning Pipeline (was 2 sentences, now ~100 lines):
- Curriculum system: 5 education levels, 12+Custom subjects, 16 scene types
- Student loop: process_scene → verify → route (golden/rejection/quarantine)
- Leitner 5-box spaced review with [1,3,7,14,30] day intervals
- Research projects: 6-phase lifecycle (LiteratureSurvey → Complete)
- Graduation gates with adapter fusion via mlx_lm.fuse
- Training execution: teacher, sleep cycle, MLX bridge, distillation
New sections:
- Scheduler (9 job types including AttendClass, ConductResearch, SpacedReview)
- Agents & Teams (AgentRegistry, TeamRegistry, Parallel/Sequential execution)
- Code Verification Pipeline (build → test → browser, verify_code tool)
- Output Sanitizer (scrub_tool_leaks, needs_reinference)
- Spiral detector, auto-start services (embedding, Kokoro, Flux, code-server)
- Voice/video call handlers
- Observer skill/insight extraction
Updated sections:
- Tool lists: L1=22, L2=26, Safe=9 (corrected from incomplete lists)
- Additional Tools: added introspect (6 actions), session_recall (5 actions)
- Platform Adapters: Discord thinking threads, slash commands, RBAC
- Hardware: spiral detector, auto-start, voice/video pipelines
- GitHub + Discord links
HUD code change (src/prompt/hud.rs, src/web/ws_context.rs):
- Added curriculum_count, review_total, review_due, quarantine_count
- Agent now sees learning state on every turn
724 tests pass, 0 regressions
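For readers unfamiliar with the Leitner scheme mentioned above, here is a minimal sketch using the [1,3,7,14,30] day intervals. The promotion rule (a correct answer moves the card up one box, a wrong answer sends it back to the first box) is the classic Leitner rule and is an assumption here; the Card type and function names are hypothetical and may not match the project's review deck.

```rust
/// Review interval in days for each of the five Leitner boxes.
const INTERVAL_DAYS: [u32; 5] = [1, 3, 7, 14, 30];

/// Hypothetical card record: which box it sits in and the day it is next due.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Card {
    box_index: usize,
    due_day: u32,
}

/// Applies one review result. Assumed rule: correct answers promote the card
/// by one box (capped at the last box); wrong answers reset it to box 0.
fn record_result(card: Card, today: u32, correct: bool) -> Card {
    let box_index = if correct {
        (card.box_index + 1).min(INTERVAL_DAYS.len() - 1)
    } else {
        0
    };
    Card {
        box_index,
        due_day: today + INTERVAL_DAYS[box_index],
    }
}

/// A card is due once its scheduled day has arrived.
fn is_due(card: Card, today: u32) -> bool {
    card.due_day <= today
}
```

Under these assumptions a card answered correctly five times in a row settles in the last box and comes up for review every 30 days.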
… platform adapter improvements - HUD: enriched system status display with live telemetry - Browser tool: expanded capabilities and error handling - Self-skills: persistent skill storage and retrieval - Platform adapters: Discord interaction improvements, router reliability - Observer: additional audit metadata - WebUI: attachment ingest, content handler, index updates - Misc: image gen, fast reply, ws_context, ws_l1, ws_react refinements
…gin, §2.5 cap removal - retry_after_rejection: handle ToolCall/ToolCalls via run_platform_tool_chain instead of silently dropping them (root cause of infinite observer loops) - Box::pin indirection for recursive async cycle (retry → tool_chain → audit → retry) - Remove MAX_TOOL_OUTPUT_CHARS=200K from tool_dispatch (§2.5 governance violation) - enforce_context_budget: 60% safety margin — chars/2 underestimates 1.5x for short-line content - Fix /4 → /2+2000 estimator in platform_exec logging - WebSocketSink: implement on_text to forward text_delta frames (was no-op, ate resume text) 82 tests pass, 0 warnings.
- Resume state now stores platform as 3-tuple (message, session_id, platform) - Derive platform from session_id prefix instead of hardcoding 'web' - Web: send session_switch command before streaming so WebUI opens correct session - Discord/Telegram: deliver resume to originating channel via send_message - PlatformRegistry::send_message for proactive (non-reply) messages - WebSocket path only consumes web resumes — leaves platform resumes for their adapters
Removes the system's self-applied 'tool inheritance' placeholder from ws_react_helpers.rs (§2.3 violation: placeholder with TODO comments). Also removes the recompile test comment from main.rs. The original fix targeted the wrong file — the actual failure is in tool_dispatch.rs where spawn_sub_agent is missing from the match arms.
Root cause: spawn_sub_agent was special-cased only in ws_react.rs (WebUI ReAct path) but missing from tool_dispatch.rs. All other paths (Discord, Telegram, L1, observer retry, sub-agent recursive) fell through to 'Unknown tool: spawn_sub_agent'. Changes: - Add dispatch_spawn_sub_agent helper to tool_dispatch.rs (§1.2 compliant) - Add 'spawn_sub_agent' match arm in execute_tool_with_state - Remove special-case intercept from ws_react.rs handle_single_tool - Remove dead execute_sub_agent wrapper from ws_react.rs - Remove dead execute_sub_agent function from ws_react_helpers.rs - Clean up unused imports and provider parameter All 6 tool execution call sites now go through the unified dispatch. 82 tests pass, 0 warnings.
Adds comprehensive anti-reward-hacking prompt to prompts/core.md (the factory default template), NOT data/prompts/core.md (the runtime copy that gets overwritten on factory reset). Covers: reward hacking taxonomy (8 failure modes), trace-first diagnostic mandate, epistemic honesty requirements, self-modification integrity checks, and the unsupervised reality constraints. Previous attempt incorrectly edited only the runtime copy, which was wiped by the user's factory reset — a Wrong-File Fix (section 1.C).
Root cause: swap_model only updated model_path, leaving mmproj_path pointing to the previous model's multimodal projector. Swapping from Gemma (with mmproj) to Qwen (without mmproj) crashed llama-server because it tried to load a Gemma mmproj with a Qwen model. Fix: scan models/ directory for a matching mmproj file when swapping. Strip quantisation suffix from model name and look for mmproj-* files containing the base name. If no match found, mmproj_path is set to None. Also: removed Gemma-specific comments from llamacpp.rs. The code itself (--jinja flag, enable_thinking chat_template_kwargs) is model-neutral — used by Gemma, Qwen, DeepSeek and others. Only the comments were wrong. Tests: 5 new tests for strip_quant_suffix, 662/662 lib tests pass.
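The lookup described above might look roughly like the following. This is a sketch under assumptions: the commit names strip_quant_suffix but does not show it, so the suffix rules here (tags such as Q4_K_M, Q8_0, IQ4_XS, F16 after the last '-' or '.'), the find_mmproj helper, and the example file names are all hypothetical.

```rust
/// True when `tag` looks like a GGUF quantisation tag (assumed set of forms).
fn looks_like_quant(tag: &str) -> bool {
    let t = tag.to_ascii_uppercase();
    if matches!(t.as_str(), "F16" | "BF16" | "F32") {
        return true;
    }
    // Accept Q<digit>... and IQ<digit>..., but not words such as "Qwen".
    let rest = t.strip_prefix("IQ").or_else(|| t.strip_prefix('Q'));
    rest.map_or(false, |r| r.starts_with(|c: char| c.is_ascii_digit()))
}

/// Drops the ".gguf" extension and a trailing quantisation tag, if present.
fn strip_quant_suffix(file_name: &str) -> &str {
    let stem = file_name.strip_suffix(".gguf").unwrap_or(file_name);
    match stem.rfind(|c: char| c == '-' || c == '.') {
        Some(i) if looks_like_quant(&stem[i + 1..]) => &stem[..i],
        _ => stem,
    }
}

/// Picks the first "mmproj-*" file whose name contains the model's base name.
/// Returns `None` when no projector matches, mirroring mmproj_path = None.
fn find_mmproj<'a>(model_file: &str, candidates: &[&'a str]) -> Option<&'a str> {
    let base = strip_quant_suffix(model_file).to_lowercase();
    candidates.iter().copied().find(|f| {
        let name = f.to_lowercase();
        name.starts_with("mmproj-") && name.contains(&base)
    })
}
```

Returning None for a model without a projector is the important case: it is what prevents a stale projector from being passed to a model that cannot use it.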
Regression from 6efe0ba: replacing consume_silently with WebSocketSink caused L1 tool chain text to stream live to the WebUI before the observer audit ran. The old consume_silently accumulated text silently and only forwarded thinking deltas. Fix: ThinkingOnlySink — identical to WebSocketSink but on_text is a no-op. Text is accumulated by consume_stream, held back, and only sent to the WebUI after deliver_reply → audit_and_retry approves it. 662/662 lib tests pass.
…SocketSink The first fix (bcea5a7) only covered the L1 tool chain path in ws_l1.rs. The initial inference in ws.rs line 310 was still using WebSocketSink, streaming text to the WebUI before audit_and_retry ran. Now ThinkingOnlySink in both paths. Only remaining WebSocketSink usage is the post-recompile resume greeting (no user query, no audit needed). 662/662 lib tests pass.
sleep_cycle, lesson_decay, and log_rotate are systemically required for the engine to function. They are no longer user-toggleable entries in scheduler.json — they run unconditionally as hardcoded spawn loops in spawn_maintenance(). scheduler.json now only contains optional learning tasks (attend_class, conduct_research, spaced_review).
Added §8 Anti-Pattern Catalogue (reward hacking, complexity injection, heuristic smuggling, investigation theatre, test theatre, shotgun debugging, scope creep, silent state mutation), §9 Lifecycle Invariants, §10 Contribution Protocol, §11 Review Rejection Criteria (15 auto-reject triggers), §12 Historical Violations (4 real incidents that birthed rules).
- Add Priority 0 entity/identity recall trigger (must exhaust 4 tiers before declaring unknown) - Add 'New Session ≠ New Identity' definition (session = window reset, not amnesia) - Add Synaptic KG Proactive Storage Discipline (explicit when-to-write/when-to-read guidance) - Replace prose anti-pattern warning with structured Verification of Absence Protocol - Remove L2-only annotation from synaptic in tier list and routing table Addresses: model failing to recall stored entities (Sunny, Matthew, Aberdeen) on new sessions due to insufficient prompt guidance on memory tool usage.
Synaptic is a memory tool, not a self-modification tool. Every other memory tool (timeline, memory, scratchpad, lessons) was already L1. Synaptic was incorrectly grouped with codebase_edit/system_recompile/ checkpoint in the L2-exclusive list. Moving it to L1 allows the model to store facts (user identity, pets, relationships, locations) during casual conversation without needing to escalate to the full ReAct loop. - Add synaptic_tool_schema() to layer1_tools() - Update L1 tool count test assertion: 22 -> 23
hud_data.rs:162 used raw byte slicing (&summary[..80]) which panics when byte 80 falls inside a multi-byte UTF-8 character (em-dash '—' occupies bytes 79..82). Root cause: format_timeline_narrative truncated timeline summaries at byte offset 80 without checking char boundaries. This crashes the entire SSE pipeline via the catch_unwind, delivering empty responses to Discord. Fix: Use char_indices() to find the last valid char boundary at or before byte 80, matching the existing safe pattern already used in extract_recent_reasoning (same file, line 126).
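A minimal sketch of the boundary-safe truncation described above, assuming a free function rather than the actual code in hud_data.rs (the name truncate_at_char_boundary is hypothetical). It uses char_indices() to pick the last character start at or before the byte limit, so slicing cannot panic.

```rust
/// Truncates `s` to at most `max_bytes` bytes without splitting a UTF-8
/// character. Hypothetical helper illustrating the char_indices() pattern.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Every index yielded by char_indices() is a valid char boundary;
    // keep the largest one that does not exceed the limit.
    let end = s
        .char_indices()
        .map(|(i, _)| i)
        .take_while(|&i| i <= max_bytes)
        .last()
        .unwrap_or(0);
    &s[..end]
}
```

For a summary whose em-dash starts at byte 79, a limit of 80 yields a 79-byte slice instead of a panic, which is the failing case from the report.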
Recompile is a destructive, stateful operation that should be tested separately — not inside the master capability sweep. Removed from: - Section 19.4 (Self-Recompile table) - L2 escalation objective - Phase 2 step list (step 90) - Coverage summary table - Expected outcomes description - Post-run check 21.14 - Summary table item count (18 → 12)
Root cause: chars/2 token estimation reported 52% budget usage at 269K chars (est 134K tokens) while the actual context overflowed 262K. The model ran 96 seconds then emitted finish_reason=stop with zero content: complete inference failure.
Evidence from logs:
- Iteration 21: scratchpad returned 63K chars, total jumped to 202K
- Iteration 27: another 63K char scratchpad result, total hit 269K
- enforce_context_budget did NOT trim (est 134K < budget 157K)
- Pre-infer 90% check did NOT fire (52% < 90%)
- Model sat for 96 seconds then gave up
Fix: Change chars/2 to chars/3 across all estimation sites:
- platform_context.rs: enforce_context_budget + trim loop
- platform_exec.rs: pre-infer budget check + logging
- platform_reinfer.rs: empty response handler + exhaustion handler
The chars/3 ratio is more accurate for mixed content (JSON tool results, code, structured output). Combined with the existing 60% safety margin, this ensures trimming triggers before overflow.
trim_observer_message mutated every system/user message before sending to llama-server, breaking KV cache prefix matching. The observer was reprocessing all messages from scratch instead of getting a cache hit on the shared prefix. Evidence: 91-message context took 128s for the observer vs 0.5s for re-inferences that shared the same KV prefix. Normal 30-msg observer calls took ~15s (acceptable full reprocess). The 128s spike was exclusively from the trimming invalidating the prefix. Fix: pass messages through verbatim (true 1-to-1 context parity). The observer only needs to process the audit instruction delta.
Root cause: two WARN sites masked a 100%-failure-rate systemic fault. Every page of every deep-read attempted embed against a dead port and logged WARN, which per §4 (logging) is 'something unexpected but recoverable'. A dead embedding server is not recoverable mid-session. Changes (2 files, no API surface changes): src/startup.rs: - WARN → ERROR when embedding server fails to spawn (binary error) - WARN → ERROR when embedding server spawns but is not healthy in 30s src/web/attachment_reader.rs: - Removed §2.4 loophole comment that incorrectly labelled a data-loss failure as 'feature off, not degraded'. Embed failures are data loss — the page is permanently absent from the RAG index for this session. - WARN → ERROR in ingest_page_chunks per §4 (error = system cannot continue this operation correctly). No flag, no circuit breaker, no suppression. Every attempt is made. Every failure is reported at the correct level.
… chat model Root cause chain: 1. embedding_model was unset in ern-os.toml 2. startup.rs silently fell back to model_path (the 26B chat model) 3. Spawning a second 26B instance on port 8081 OOMed / timed out 4. Every embed call hit a dead port — 100% failure rate, all sessions Fix (startup.rs): Remove the .unwrap_or(&model_path) fallback. §5 mandates: 'if the provider doesn't report a value, the system reports the gap — it does NOT invent a default.' A chat model is not an embedding model. When embedding_model is unset, log ERROR and return. Do not attempt to spawn the wrong model. The error message tells the operator exactly what to set in ern-os.toml. Companion change (not committed — ern-os.toml is gitignored): embedding_model = "models/nomic-embed-text-v1.5.Q8_0.gguf" nomic-embed-text-v1.5.Q8_0.gguf (139MB) downloaded to models/
Root cause: chunk_size_chars() was computing chunk size from the main model's context_length (131072 tokens). nomic-embed-text-v1.5 has a hard context cap of 2048 tokens. Every chunk sent was ~2050 tokens, 2 over the limit, causing HTTP 500 on every embed call. The function even falsely claimed §2.1 compliance in its comment while using the wrong model's context window. §8.3 violation.
Fix:
1. Provider trait: add embed_context_length() -> Result<usize>
- LlamaCppProvider: queries embedding server /v1/models for n_ctx_train
- OllamaProvider: re-uses get_model_spec().context_length
- OpenAICompatProvider: re-uses get_model_spec().context_length
- All implementations query the actual server, no hardcoded values (§2.1)
2. DocumentStore: ingest_document / ingest_document_with_project
- Remove context_length parameter (was the chat model's context, which is wrong)
- Call provider.embed_context_length() at ingest time
- chunk_size = 80% of embed model context * 4 chars/token
- For nomic 2048: (2048 * 4 * 4) / 5 = 6553 chars, fits cleanly
3. attachment_reader: remove context_length from ingest_page_chunks; no longer needed, chunk sizing happens inside document_store
4. Tests: update chunk size test to assert nomic-correct formula; add embed_context_length() to both mock providers
With this change, every deep-read page will embed successfully.
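The chunk-size arithmetic above can be written out as a small sketch. The formula (80% of the embedding context at 4 chars/token, in integer arithmetic) comes from the commit message; the signature of chunk_size_chars shown here and the chunk_text helper are assumptions for illustration only.

```rust
/// Chars-per-token conversion factor used by the formula above.
const CHARS_PER_TOKEN: usize = 4;

/// 80% of the embedding model's context, expressed in characters.
/// Integer form of (tokens * 4 chars/token) * 4 / 5.
fn chunk_size_chars(embed_context_tokens: usize) -> usize {
    embed_context_tokens * CHARS_PER_TOKEN * 4 / 5
}

/// Splits `text` into chunks of at most `chunk_chars` characters.
/// Hypothetical helper; splits on character counts, never mid-character.
fn chunk_text(text: &str, chunk_chars: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(chunk_chars.max(1))
        .map(|chunk| chunk.iter().collect())
        .collect()
}
```

For a 2048-token embedding model, chunk_size_chars(2048) evaluates to 6553, matching the figure in the commit.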
Root cause of failure: model had no structured way to query deep-read progress. It fell back to log scraping — reading old sessions, getting wrong page counts, inventing embedding failure bursts from prior runs. Fix: 1. DigestStatus::Pending: add pages_done Arc<AtomicUsize> Ticked atomically after each page completes in deep_read(). Zero cost on reads — DashMap is lock-free. 2. deep_read(): accept Option<Arc<AtomicUsize>> pages_done param Background path passes the counter; SSE/inline paths pass None. 3. introspect(action='digest_status'): Reads digest_store directly (in-process, no log IO). Reports per-file: status (IN PROGRESS / COMPLETE), pages done, elapsed seconds, pages/sec rate. 4. Schema: add 'digest_status' to introspect enum and description. The model can now call: introspect(action='digest_status') and get authoritative, live telemetry — not stale log lines.
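The shared-counter pattern described above can be illustrated with the standard library alone. This is a sketch, not the project's deep_read(): the function bodies, spawn_deep_read, and pages_per_sec are hypothetical, and a plain thread stands in for the background task.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Stand-in for a deep-read loop. The counter is optional so that callers
/// which do not report progress can pass `None`.
fn deep_read_pages(total_pages: usize, pages_done: Option<Arc<AtomicUsize>>) {
    for _page in 0..total_pages {
        // ... summarise and embed the page here ...
        if let Some(counter) = &pages_done {
            // Tick after each completed page; readers never block on this.
            counter.fetch_add(1, Ordering::Relaxed);
        }
    }
}

/// Runs the loop in the background and hands the counter to the worker.
fn spawn_deep_read(total_pages: usize, counter: Arc<AtomicUsize>) -> thread::JoinHandle<()> {
    thread::spawn(move || deep_read_pages(total_pages, Some(counter)))
}

/// Pages-per-second rate for a status report; zero when no time has elapsed.
fn pages_per_sec(pages_done: usize, elapsed_secs: f64) -> f64 {
    if elapsed_secs > 0.0 {
        pages_done as f64 / elapsed_secs
    } else {
        0.0
    }
}
```

A status query then only needs counter.load(Ordering::Relaxed) and the elapsed time, with no log parsing involved.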
…llution Root cause: system_logs tail/errors/search read across all lines in the daily log file, including output from previous process invocations earlier the same day. The model was reading embedding failures from 12:04 and 12:25 sessions while the current session (12:37) had none. Fix: current_session_lines() finds the last 'Ern-OS starting' entry in the log and returns only lines from that point forward. This is the authoritative session boundary per the tracing_appender append-only guarantee. - tail: now serves lines from current session only - errors: restricts to most recent log file, current session - search: restricts to most recent log file, current session - Old-day log files are no longer searched by errors/search Updated test: verifies old-session errors are explicitly excluded, not just that current-session errors appear.
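A sketch of the session-boundary filter described above, assuming the log is already loaded as one string. The commit names current_session_lines() but does not show its signature, so the parameters here are hypothetical, as is the choice to return every line when no start marker is found.

```rust
/// Returns the lines from the last occurrence of `start_marker` onward.
/// Assumption: if the marker never appears, all lines are returned.
fn current_session_lines<'a>(log: &'a str, start_marker: &str) -> Vec<&'a str> {
    let lines: Vec<&str> = log.lines().collect();
    // rposition scans from the end, so the most recent start entry wins.
    let start = lines
        .iter()
        .rposition(|line| line.contains(start_marker))
        .unwrap_or(0);
    lines[start..].to_vec()
}
```

Errors logged by an earlier process on the same day fall before the last marker and are therefore excluded from tail, errors, and search results.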
…d budget
§8.3/R5 violation: page_size_chars() divided context_length by 8, producing ~4K tokens per deep-read page from a 131K context. No justification for /8 other than it being a guess. This caused 107-page reads of a 436K-token novel instead of ~6 pages.
Root cause: file_read.rs computed page_size_chars = context_length / 8, losing 87.5% of available context per summarisation call. The divisor was a heuristic, not a measurement.
Fix:
1. measure_page_budget(): calls count_tokens() on the summarisation system prompt once before the deep_read loop. Returns context_length - overhead_tokens as the true page budget. One measurement, not per-page estimation. §8.3 compliant.
2. summarise_system_prompt(): extracted from summarise_page() so the probe message in measure_page_budget() uses the EXACT same string that will be sent in each summarisation call. No drift possible.
3. file_read::execute(): parameter renamed context_length → page_budget_tokens. Budget applied as tokens * 4 chars directly. The 4x is a conversion factor (tokens to chars), NOT a heuristic; the token budget is measured upstream.
4. file_read::auto_stitch(): same rename.
5. Tests: removed test_page_size_is_context_length (verified the /8 heuristic) and test_page_size_scales_with_context. Replaced with test_page_budget_token_to_char_conversion and test_page_budget_scales_with_tokens; both test real behaviour.
Effect: deep-read page size for a 131K-context model goes from ~16K chars (~4K tokens) to ~520K chars (~130K tokens), reducing pages for a 436K-token novel from 107 to ~6.
Root cause: deep_read() used state.provider (slot 0), so measure_page_budget()'s count_tokens probe and every summarise_page() chat_sync call invalidated the main conversation's KV cache on slot 0. Fix: switch to state.audit_provider (slot 1) — same pattern as the observer. All deep-read inference now runs on the dedicated KV cache slot, leaving slot 0 untouched for main conversation continuity.
Root cause: deep-read (slot 1) and observer audit (slot 1) competed for the same KV cache slot. When both ran concurrently, chat_sync in the deep-read loop blocked indefinitely waiting for slot 1 to free; the observer held it for its full audit duration (26s observed).
Fix: introduce a dedicated slot 2 for background document summarisation.
Changes:
- llamacpp.rs: -np 2 -> -np 3 (add slot 2 to server)
- provider/mod.rs: create_digest_provider() -> new_with_slot(2)
- web/state.rs: AppState.digest_provider field
- main.rs: construct digest_provider, pass to build_app_state
- background_digest.rs: audit_provider -> digest_provider (slot 2)
- tests: add digest_provider to all AppState test builders
Slot assignment:
Slot 0: main inference (provider)
Slot 1: observer audit (audit_provider)
Slot 2: background document digest (digest_provider)
Root cause: -np 3 with -c 0 (auto from GGUF) causes llama-server to divide the model's 131072-token KV cache evenly across 3 slots. Each slot received only ~43K tokens instead of 131K, cutting effective context by ~67% and slowing every inference turn significantly. Fix: set -c = n_ctx_per_slot * n_parallel (131072 * 3 = 393216). llama-server divides the total by -np, so each slot gets the full 131072-token budget back. Added n_ctx_per_slot to LlamaCppConfig (default 131072, configurable via ern-os.toml [llamacpp] n_ctx_per_slot = N). This allows operators to tune the per-slot budget without code changes.
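The slot arithmetic above is simple enough to sketch directly. The -c and -np flags are the ones named in the commit; the helper names total_ctx_size, per_slot_ctx, and server_args are hypothetical and not taken from llamacpp.rs.

```rust
/// Total context to request so that each of `n_parallel` slots keeps
/// `n_ctx_per_slot` tokens after the server divides the total evenly.
fn total_ctx_size(n_ctx_per_slot: u32, n_parallel: u32) -> u32 {
    n_ctx_per_slot * n_parallel
}

/// Per-slot context when a fixed total is split evenly across slots.
fn per_slot_ctx(total_ctx: u32, n_parallel: u32) -> u32 {
    total_ctx / n_parallel
}

/// Builds the two relevant command-line arguments (hypothetical helper).
fn server_args(n_ctx_per_slot: u32, n_parallel: u32) -> Vec<String> {
    vec![
        "-c".to_string(),
        total_ctx_size(n_ctx_per_slot, n_parallel).to_string(),
        "-np".to_string(),
        n_parallel.to_string(),
    ]
}
```

With the default of 131072 tokens per slot and three slots this produces -c 393216 -np 3, whereas splitting 131072 three ways leaves 43690 tokens per slot.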
Reverts e915969, 9ee546d, b7e7834. Root cause of regression: adding -np 3 with -c 0 caused llama-server to divide the model's 262144-token context by 3, giving each slot only 87552 tokens. This broke inference speed and context window for all turns. The original misdiagnosis was incorrect: the deep-read and observer do not actually deadlock on slot 1. llama-server serializes concurrent slot requests — the observer runs for one turn and finishes before the deep-read page summarisation calls, so there is no overlap in practice. Restoring -np 2, -c 0 (auto from GGUF = 262144 / 2 = 131072 per slot) and dropping digest_provider entirely. Deep-read uses audit_provider (slot 1), same as observer, which is safe given the non-overlapping execution pattern.
Root cause: measure_page_budget() was called inside deep_read() at the moment a background document task spawned — concurrently with the main inference stream on slot 0. This forced the GPU to serve two KV cache slots simultaneously, doubling turn latency. Fix: page_budget_tokens is now measured once in detect_model_spec() at startup, before any user traffic. The audit provider (slot 1) is idle at this point — the count_tokens call has no contention. The result is stored in ModelSpec.page_budget_tokens and passed via DeepReadConfig. deep_read() reads it directly without any provider call. Changes: - model/mod.rs: add page_budget_tokens field to ModelSpec - main.rs: detect_model_spec() calls measure_page_budget() at startup - attachment_reader.rs: DeepReadConfig.context_length -> page_budget_tokens deep_read() removes measure_page_budget() call entirely - platform_stream.rs, platform_ingest.rs: pass model_spec.page_budget_tokens - provider/llamacpp.rs, ollama.rs, openai_compat.rs: page_budget_tokens: 0 (sentinel — real value set by detect_model_spec after measurement) - tests/e2e_tests.rs: add page_budget_tokens: 0 to test ModelSpec literals
Root cause: commit 6ff7ef7 increased deep-read page size from ~4K to ~130K tokens per page. With -np 2, the 130K-token deep-read page ran concurrently with main inference on the same GPU, causing throughput to drop from 77 tok/s to 6.7 tok/s (11x regression, 14s turns -> 154s). Evidence: May 18 logs (pre-regression) showed 7-14s turns with both -np 2 and deep-read active. The difference was page size, not slot count. Fix: add inference_done: Arc<tokio::sync::Notify> to AppState. - platform_stream.rs: calls notify_waiters() after keepalive_cancel, immediately when the main SSE stream is complete. - background_digest.rs: calls notified().await before deep_read(), deferring all GPU work until the main turn is finished. - notify_waiters() wakes all concurrent deep-read tasks simultaneously. - If no inference is running, Notify stores one pending permit so notified().await returns immediately (no unnecessary wait). Changes: - state.rs: inference_done field - main.rs: Arc::new(Notify::new()) in build_app_state() - platform_stream.rs: notify_waiters() after keepalive_cancel.cancel() - background_digest.rs: notified().await before deep_read() - tool_dispatch.rs, curriculum_tests.rs, e2e_tests.rs: field added
Root cause: 904b576 pinned audit_provider to slot 1 (new_with_slot(1)). Slot 1 never had the conversation prefix cached. Every observer call was a cold-start recompute of ~23K prefix tokens = ~142s per turn. The observer sends the identical conversation prefix as main inference (1-to-1 context parity by design). Running on slot 0 after main inference completes, llama-server reuses the hot KV cache already on slot 0. Only the delta (candidate + audit prompt, ~few hundred tokens) needs to be computed. Observer should return to ~5-10s. Fix: create_audit_provider() calls create_provider() directly (slot 0) for all backends. new_with_slot(1) removed.
Root cause: two sequential count_tokens calls fired against llama-server before every inference turn, each requiring GPU tokenisation (~14s each).
Measured overhead:
15:01:07 message received
15:02:15 count_tokens #1 done (+68s, deep-read gate: base context)
15:02:29 count_tokens #3 done (+14s, enforce_context_budget)
15:02:29 provider.chat() called, inference finally starts
Fix 1: deep-read gate (platform_stream.rs, platform_ingest.rs):
Replace count_tokens(base_context) + count_tokens(attachment) with a single fs::metadata() file-size check. A file exceeding context_length*3 bytes provably cannot fit in the context window (~3 chars/token lower bound). At 131K context: threshold=393KB. A 1.95MB novel always triggers. Small files fall through to the existing count_tokens path unchanged.
Fix 2: enforce_context_budget (platform_context.rs):
Add char-length pre-check before count_tokens. Estimate token count as total_chars/4. If estimated_tokens < 60% budget, skip count_tokens. At 17% actual usage vs 60% budget, a 30% estimation error still leaves 43% margin, so it is provably safe to skip. count_tokens still fires when the context is near the budget (rare, unchanged behaviour).
Expected saving: ~81s eliminated on large-attachment turns, ~14s eliminated on normal turns. Target: 5-15s total turn time.
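Both gates described above reduce to cheap integer comparisons, sketched here under assumptions. The thresholds (context_length*3 bytes, total_chars/4 against 60% of the budget) come from the commit message; the function names and signatures are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Deep-read gate: a file larger than context_length * 3 bytes is treated
/// as unable to fit in the context window (about 3 chars/token lower bound).
fn exceeds_context_by_size(file_bytes: u64, context_length_tokens: u64) -> bool {
    file_bytes > context_length_tokens * 3
}

/// Same gate applied to a path, using a metadata lookup instead of tokenising.
fn file_needs_deep_read(path: &Path, context_length_tokens: u64) -> io::Result<bool> {
    let size = fs::metadata(path)?.len();
    Ok(exceeds_context_by_size(size, context_length_tokens))
}

/// Pre-check for the budget enforcer: estimate tokens as chars / 4 and skip
/// the exact count when the estimate is below 60% of the budget.
fn can_skip_count_tokens(total_chars: usize, budget_tokens: usize) -> bool {
    let estimated_tokens = total_chars / 4;
    estimated_tokens * 100 < budget_tokens * 60
}
```

At a 131072-token context the size threshold is 393216 bytes, so a 1.95MB file trips the gate without any tokenisation call.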
notify_waiters() was firing immediately after the main inference stream completed ([DONE]). The observer audit also starts at that point — both the observer and deep-read were hitting slot 0 simultaneously, causing the same GPU saturation as before (observer: 138s instead of 16s). Fix: move notify_waiters() from after keepalive_cancel.cancel() to inside emit_reply(), immediately after the observer audit emit. Deep-read now starts only after the observer has completed and released the GPU.
Root cause: deep-read used state.provider (slot 0), the same slot as inference, count_tokens, and observer. When deep-read ran a 130K-token page on slot 0, all other GPU calls queued behind it. count_tokens on Turn 2 waited 114s for page 1 to finish.
Fix:
- Add create_digest_provider() returning LlamaCppProvider::new_with_slot(1)
- Add digest_provider: Arc<dyn Provider> to AppState
- background_digest.rs: deep_read() uses state.digest_provider (slot 1)
- main.rs, android_bridge.rs: initialize digest_provider at startup
- All test AppState constructors updated
Slot assignments (no heuristics, no §2.1 violation):
slot 0: inference (state.provider) + observer (state.audit_provider)
slot 1: deep-read (state.digest_provider)
Observer stays on slot 0 for KV cache reuse from main inference (~14s). Deep-read on slot 1 runs in parallel without blocking slot 0 GPU calls. The inference_done Notify gate still prevents deep-read from starting until the observer finishes, maintaining sequential slot 0 usage.
Delete 5 unreachable stub files that violated §2.3 (No Stubs): - lessons_tool.rs - memory_tool.rs - scratchpad_tool.rs - synaptic_tool.rs - timeline_tool.rs All real dispatch was already in dispatch_memory.rs. Removed corresponding mod declarations from tools/mod.rs.
Add [context] section to AppConfig with 9 configurable fields: - trim_threshold, consolidation_threshold, split_ratio - trim_keep_recent, trim_tool_result_chars - compress_threshold_chars, compress_head_chars, compress_tail_chars - max_sort_tool_calls Wire &state.config.context into all 7 enforce_context_budget call sites across ws_react, ws_l1, platform_exec, platform_ingest, and platform_reinfer. Replaces all hardcoded thresholds (0.60, 0.80, 10, 500, 8000, 2000, 2000, 0.60, 20) with config-driven values.
Add pre_consolidation_sort_prompt() in consolidation.rs — explicit instruction for the model to sort every entity, relationship, pin, and lesson into persistent memory stores before content is discarded. Add memory_sort_tools() in schema.rs — restricted tool set for the sort pass (memory, scratchpad, synaptic, lessons, timeline only). Uses digest_provider (slot 1) to prevent interference with main inference. Hard limit from config.context.max_sort_tool_calls.
recall_knowledge_graph() now searches by query with recency fallback. Returns up to 20 nodes (was 5). Injects full data HashMap fields (relation: fiancé, pet: Scruffy, etc.) that were previously invisible. Fixes unicode bullet rendering (\u2022 → literal •) that caused Rust compiler encoding errors.
Add consolidation count and status to the HUD system prompt so the model knows when consolidation has fired and how many times.
Replace full-transcript output in timeline_search with query-match context extraction — returns the matched line with 2 lines above and below, plus session_id for full retrieval via session action. Context window is derived from match position and text structure. No arbitrary character caps (§2.2 compliant). Reduces tool result size from 566K-1MB to ~5K-20K per search.
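A minimal sketch of the query-match context extraction described above, assuming a plain substring match over transcript lines. The function name `match_context` and its signature are hypothetical; the real timeline_search also returns a session_id, which is omitted here.

```rust
/// Hypothetical helper: for every line containing `query`, return that line
/// together with up to `radius` lines above and below it, joined by newlines.
pub fn match_context(transcript: &str, query: &str, radius: usize) -> Vec<String> {
    let lines: Vec<&str> = transcript.lines().collect();
    let mut snippets = Vec::new();
    for (i, line) in lines.iter().enumerate() {
        if line.contains(query) {
            // Clamp the window to the transcript bounds instead of panicking.
            let start = i.saturating_sub(radius);
            let end = (i + radius + 1).min(lines.len());
            snippets.push(lines[start..end].join("\n"));
        }
    }
    snippets
}
```

Calling `match_context(text, "needle", 2)` yields one snippet per matching line, so the result size scales with the number of matches rather than with the transcript length.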
trim_tool_messages is now a sync function using char-based accounting (chars_removed / 4 >= overshoot) as a gate. The single post-trim count_tokens call validates the result. Previous: count_tokens called 3-5x per trim pass, each sending 500K+ payloads to llama-server (~800ms each = 105s total). Now: pure string operations (<1ms) + 1 validation call. Also fixes: .max(200) panic floor on short tool results replaced with .min(content_len) + keep==0 guard.
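The char-based gate can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the `ToolMessage` struct, the parameter names, and the oldest-first order are hypothetical, and the `keep == 0` guard is interpreted here as "skip the message rather than slice it".

```rust
/// Hypothetical message shape; the real type is not shown in the commit.
pub struct ToolMessage {
    pub content: String,
}

/// Truncate tool results in order until the estimated tokens removed
/// (chars_removed / 4) cover `overshoot_tokens`. Returns the chars removed.
/// Pure string work: no tokenizer call happens inside this function.
pub fn trim_tool_messages(
    messages: &mut [ToolMessage],
    overshoot_tokens: usize,
    keep_chars: usize,
) -> usize {
    let mut chars_removed = 0usize;
    for msg in messages.iter_mut() {
        // Gate from the commit message: stop once the estimate covers the overshoot.
        if chars_removed / 4 >= overshoot_tokens {
            break;
        }
        let content_len = msg.content.chars().count();
        // `.min(content_len)` instead of a fixed floor, so short results cannot
        // produce an out-of-range slice.
        let keep = keep_chars.min(content_len);
        if keep == 0 || keep == content_len {
            continue; // nothing to keep, or nothing to trim
        }
        // Cut on a char boundary so multi-byte text never splits mid-character.
        let cut = msg
            .content
            .char_indices()
            .nth(keep)
            .map(|(idx, _)| idx)
            .unwrap_or(msg.content.len());
        msg.content.truncate(cut);
        chars_removed += content_len - keep;
    }
    chars_removed
}
```

A single exact token count after this pass would then validate the result, as the commit message describes.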
Add char-estimate gate before count_tokens in consolidate_if_needed. Same gate pattern as enforce_context_budget:42-51. Skips the ~14s GPU tokenisation call when context is provably under threshold. Also fix: /3 heuristic at HUD context_usage_pct estimation changed to /4. The chars/3 ratio was a §8.3 Heuristic Smuggling violation and a §12 V2 repeat (the chars/3 Heuristic).
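One way to picture the char-estimate gate is shown below. It is a sketch under assumptions: `needs_exact_count` and `should_consolidate` are hypothetical names, and the exact tokenizer call is passed in as a closure so the example stays self-contained. Only the chars/4 ratio comes from the commit message.

```rust
/// Chars-per-token ratio used by the estimate (the commit standardises on 4).
pub const CHARS_PER_TOKEN: usize = 4;

/// Cheap estimate: true only when the estimated token count reaches the limit
/// and the expensive exact count is therefore worth running.
pub fn needs_exact_count(total_chars: usize, context_length: usize, threshold: f64) -> bool {
    let limit = (context_length as f64 * threshold) as usize;
    total_chars / CHARS_PER_TOKEN >= limit
}

/// Decide whether to consolidate. `count_tokens` stands in for the GPU
/// tokenisation call and is invoked only when the estimate passes the gate.
pub fn should_consolidate<F>(
    total_chars: usize,
    context_length: usize,
    threshold: f64,
    count_tokens: F,
) -> bool
where
    F: FnOnce() -> usize,
{
    if !needs_exact_count(total_chars, context_length, threshold) {
        return false; // estimate is under the threshold: skip the slow call
    }
    let limit = (context_length as f64 * threshold) as usize;
    count_tokens() >= limit
}
```

Because the closure is never called when the estimate is low, the common small-context path costs only an integer division.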
Adds an operator-configurable context_length field to LlamaCppConfig
that maps to llama-server -c. Default 0 preserves the legacy auto-detect
behavior; a non-zero value bounds KV-cache allocation and is logged at
startup (§2.7). Necessary for long-default-context models like Qwen3.5/3.6
(advertise 256K) on VRAM-limited hardware where the auto-detected value
exceeds the health-check budget.
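A minimal sketch of how the override might flow into the server arguments, assuming a trimmed-down config. The `model_path` field is hypothetical, the real `LlamaCppConfig` has more fields, and `eprintln!` stands in for the `tracing` log so the example needs no external crate.

```rust
/// Reduced stand-in for the real config; only `context_length` is from the PR.
#[derive(Debug, Clone, Default)]
pub struct LlamaCppConfig {
    pub model_path: String, // hypothetical field for illustration
    pub context_length: u32, // 0 = let llama-server auto-detect from the GGUF
}

/// Build the llama-server argument list. `-c` is always passed, so the
/// default of 0 reproduces the auto-detect behavior.
pub fn build_server_args(cfg: &LlamaCppConfig) -> Vec<String> {
    let mut args = vec![
        "-m".to_string(),
        cfg.model_path.clone(),
        "-np".to_string(),
        "2".to_string(), // two slots, as the existing regression test expects
    ];
    args.push("-c".to_string());
    args.push(cfg.context_length.to_string());
    if cfg.context_length != 0 {
        // Stand-in for the startup log line: the override should never be silent.
        eprintln!("llama-server context_length override: {}", cfg.context_length);
    }
    args
}
```

With `context_length = 32768` the list ends in `-c 32768`; with the default config it ends in `-c 0`.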
§1.1 split applied so llamacpp.rs stays under the 500-line cap as we
add the new wiring:
- src/provider/llamacpp_server_args.rs (new, 52 lines): free fn
build_server_args(&LlamaCppConfig) extracted from llamacpp.rs; has
//! module doc per §11 R10. LlamaCppProvider::build_server_args
becomes a 3-line delegate so callers and tests are unaffected.
- src/provider/llamacpp_tests.rs (new, 142 lines): all existing
tests moved out of inline mod tests, then included from llamacpp.rs
via #[path = ...] mod tests; so the test paths remain
provider::llamacpp::tests::* (no rename to provider::llamacpp_tests).
- src/provider/llamacpp.rs: 592 → 463 lines.
Tests: 727 passed / 0 failed (baseline 725 + 2 new context_length tests
covering the auto-detect=0 path and the explicit-override=32768 path).
The kernel-side test_build_server_args_uses_two_slots regression test
still passes — extraction preserves -np 2 and slot 1 audit semantics.
Out-of-scope observation: config/mod.rs is at 509 lines after this PR
(9 over §1.1). Pre-PR was 501. Not addressed here per §10.2 (one
concern).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7b3383f to
7c94d3f
Compare
Pushed v3 ( Recreated as a fresh commit against current main rather than rebasing the 5/14 v2 branch ( Changes vs v2:
Sizes (§1.1):
One §1.1 honesty point: |
Adds `[llamacpp] context_length = N` toml field. Default 0 plumbs through to `llama-server -c 0` (auto-derive from GGUF), bit-for-bit identical to current behavior. Operators set non-zero when hardware + model's advertised n_ctx_train combine to exceed KV cache budget (Qwen3.5/3.6 at 256K is the canonical case).

Tests embedded covering default-produces-`-c 0` and explicit-value-overrides.

Per your #dev-general feedback this also adds a `tracing::info!` when the override fires, so the §2.7 "behavior changes are never silent" requirement is satisfied at runtime: operators see the override loudly in logs at server-startup time.

Second of three stacked PRs. Diff currently shows PR-A's commit too because of cross-fork base limitations; it collapses to just this commit after PR-A merges.