feat(llamacpp): add rpc_servers to enable multi-machine GPU mesh #6
Scooter-DeJean wants to merge 45 commits into
Update dtolnay/rust-action/setup to dtolnay/rust-toolchain to use the current recommended action for Rust toolchain installation.
…nse bug)
Root cause: reinfer_and_dispatch hardcoded thinking=true for tool-chain continuation. After a tool executes (e.g. grep returns 39 chars), the model has already reasoned; re-enabling thinking causes Gemma 4 to immediately emit stop with no content (verified: cache_n=48933, prompt_n=88, 327ms).
This broke the book club verbatim reading flow:
grep (find chapter) → EMPTY RESPONSE → 35s recovery → file_read → analysis
Instead of:
grep (find chapter) → file_read → verbatim quote
Fix: thinking=false for tool-chain re-inference. The model already thought during initial inference.
Also fixes the mislabeled 'context overflow' error that fires at 19.7% usage (not an overflow, but the thinking-mode bug).
706 tests passing, zero warnings.
…n into reality
Root cause: when discussing autofiction where characters/systems share names
with real entities (Maria the user = Maria the character, ErnOS the system =
ErnOS in the novel), the model progressively merged fiction with reality
across conversational turns, adopting fictional missions as its own.
Four-layer fix:
1. Digest fiction header (attachment_reader.rs):
- build_digest now includes FICTION/REALITY PROTOCOL in footer
- Instructs model to refer to 'the character Maria' not 'Maria'
- Distinguishes fictional systems from real ones
2. Summariser [FICTION] prefix (attachment_reader.rs):
- summarise_page prompt now instructs LLM to prefix fictional
events with [FICTION] tags in page summaries
- Prevents downstream confusion in digest content
3. Observer Rule 20 — fiction_reality_collapse (observer.md):
- New audit rule detects: adopting fictional missions, treating
fictional events as real evidence, collapsing autofiction identities
- Exception for explicit role-play/creative collaboration
- Added to failure_category enum in observer schema
4. Core prompt Fiction/Reality Protocol (core.md):
- Positioned after Bounded Speculation (both address fact/fiction boundary)
- Instructs tracking of conversational drift from analysis to immersion
- Specific guidance for autofiction name collisions
Observer rule names updated: 19 → 20 rules.
706 tests passing, zero warnings.
…§2.4)
The observer bailout was a governance §2.4 violation: after 2-3 rejections, the system silently delivered an UNAUDITED response to the user. A cap on retries masks deeper bugs: if the model can't follow observer feedback, that's a real problem to investigate, not hide.
Changes across all 3 observer paths:
1. platform_ingest.rs (platform adapter path):
- Removed max_retries cap and handle_audit_bailout function
- Removed AuditSummary::exhausted (unreachable dead code)
- Loop now retries with feedback indefinitely until ALLOWED
2. ws_stream.rs (WebSocket L1 path):
- Removed max_retries cap and handle_bailout function
- Same uncapped loop until approved
3. ws_react.rs (WebSocket ReAct path):
- Removed the consecutive_rejections >= 2 bailout branch
- Always injects feedback and continues the loop
- Resets consecutive_rejections on approval
4. observer/mod.rs:
- Removed format_bailout_override function (dead code)
- Removed its test
- Updated module comment: 20 rules
Infrastructure errors (observer itself is down) still fail-open: that's not a quality issue, it's an availability issue.
705 tests passing (1 removed: test_format_bailout_override), zero warnings.
Root cause: data/embeddings.json is a 4,077,705-byte single-line file. codebase_search matched 'learning pipeline' inside it and returned the entire 4MB line as a search result, blowing the context window. The retry then failed because llama-server's KV cache matched 99.9% of the prompt (prompt_n=1 on retry), hitting the same empty-response state.
Three fixes:
1. Data artifact exclusion (codebase_search.rs): Skip embeddings.json, golden_buffer.jsonl, rejection_buffer.jsonl, quarantine.json, review_deck.json, training_manifest.json from search. These are runtime data, not source code.
2. Match context snippets (codebase_search.rs): Show 200 chars before/after the match instead of the full line. Defense-in-depth against any future large-line file. Full content available via file_read.
3. KV cache escape (platform_reinfer.rs): Inject a recovery system message before retry to change the cache-visible prompt suffix, forcing actual token generation instead of cache-hit empty response.
710 tests passing (+5 new: context snippets, data artifact exclusion, long-line cap). Zero warnings.
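The "match context snippets" fix can be illustrated with a small std-only sketch. It is not the code in codebase_search.rs: the function names are hypothetical, and the window is measured in bytes and then widened or narrowed to the nearest UTF-8 boundary, which is an assumption (the commit message only says "200 chars").

```rust
/// Move `i` down to the nearest UTF-8 character boundary in `s`.
fn floor_boundary(s: &str, i: usize) -> usize {
    let mut i = i.min(s.len());
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Move `i` up to the nearest UTF-8 character boundary in `s`.
fn ceil_boundary(s: &str, i: usize) -> usize {
    let mut i = i.min(s.len());
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Return the first occurrence of `needle` in `line` together with up to
/// `context` bytes on each side, instead of the whole line.
/// Hypothetical helper; returns None when `needle` is absent.
fn context_snippet<'a>(line: &'a str, needle: &str, context: usize) -> Option<&'a str> {
    let start = line.find(needle)?;
    let end = start + needle.len();
    let from = floor_boundary(line, start.saturating_sub(context));
    let to = ceil_boundary(line, end.saturating_add(context));
    Some(&line[from..to])
}
```

With a context of 200, a multi-megabyte single-line file contributes only a few hundred bytes to the search result, however long the line is.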
- New: src/web/handlers/curriculum.rs — 5 API endpoints:
GET /api/curriculum (list courses with progress)
POST /api/curriculum (add course)
DELETE /api/curriculum/{id} (remove course)
GET /api/curriculum/{id}/progress (detailed progress)
GET /api/curriculum/review (review deck stats)
- New: curriculum_tests.rs — 8 unit tests (success + error paths)
per §3.1 and §3.2 governance mandates
- Updated: Training tab → 4 sub-tabs (Buffers, Curriculum, Review, Adapters)
Course cards with level badges, progress bars, completion %
Review tab with Leitner deck stats (due, total, retention rate)
- CSS: .course-card, .course-progress-track, .level-badge styles
- 718 tests pass, 0 regressions (642 unit + 76 e2e)
New curriculum_e2e module in e2e_tests.rs:
1. test_attend_class_full_pipeline: process_scene → verify → route to golden/quarantine buffers with MockProvider
2. test_attend_class_all_complete: full course completion with avg score
3. test_flush_session_to_shared_state: session buffers → shared Arc state
4. test_spaced_review_generates_cards: Leitner card generation from completed quiz scenes
5. test_curriculum_store_roundtrip: add/list/progress/remove/persistence
6. test_review_deck_stats: deck stats, add card, record result, due count
724 tests pass, 0 regressions (642 unit + 82 e2e)
core.md: 438 → 563 lines. Previously undocumented systems: Learning Pipeline (was 2 sentences, now ~100 lines): - Curriculum system: 5 education levels, 12+Custom subjects, 16 scene types - Student loop: process_scene → verify → route (golden/rejection/quarantine) - Leitner 5-box spaced review with [1,3,7,14,30] day intervals - Research projects: 6-phase lifecycle (LiteratureSurvey → Complete) - Graduation gates with adapter fusion via mlx_lm.fuse - Training execution: teacher, sleep cycle, MLX bridge, distillation New sections: - Scheduler (9 job types including AttendClass, ConductResearch, SpacedReview) - Agents & Teams (AgentRegistry, TeamRegistry, Parallel/Sequential execution) - Code Verification Pipeline (build → test → browser, verify_code tool) - Output Sanitizer (scrub_tool_leaks, needs_reinference) - Spiral detector, auto-start services (embedding, Kokoro, Flux, code-server) - Voice/video call handlers - Observer skill/insight extraction Updated sections: - Tool lists: L1=22, L2=26, Safe=9 (corrected from incomplete lists) - Additional Tools: added introspect (6 actions), session_recall (5 actions) - Platform Adapters: Discord thinking threads, slash commands, RBAC - Hardware: spiral detector, auto-start, voice/video pipelines - GitHub + Discord links HUD code change (src/prompt/hud.rs, src/web/ws_context.rs): - Added curriculum_count, review_total, review_due, quarantine_count - Agent now sees learning state on every turn 724 tests pass, 0 regressions
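The Leitner 5-box schedule mentioned above can be sketched as follows. Only the [1,3,7,14,30] day intervals come from the commit message; the `Card` type, the `record_result` name, and the promotion/demotion rules (a correct answer moves up one box, a miss returns to the first box) are the classic Leitner scheme and are assumptions about this codebase.

```rust
/// Review intervals in days for the five boxes, as listed in the commit message.
const LEITNER_INTERVALS_DAYS: [u32; 5] = [1, 3, 7, 14, 30];

/// Hypothetical review card: which box it sits in (0-based) and the day it is next due.
#[derive(Debug, Clone, PartialEq)]
struct Card {
    box_index: usize,
    due_day: u32,
}

/// Apply one review result on day `today`.
/// Assumed rules: correct promotes one box (capped at the last box),
/// incorrect demotes to the first box.
fn record_result(card: &mut Card, correct: bool, today: u32) {
    card.box_index = if correct {
        (card.box_index + 1).min(LEITNER_INTERVALS_DAYS.len() - 1)
    } else {
        0
    };
    card.due_day = today + LEITNER_INTERVALS_DAYS[card.box_index];
}
```

Under these assumed rules, a card answered correctly on day 10 from the first box moves to the second box and comes due again on day 13.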
… platform adapter improvements - HUD: enriched system status display with live telemetry - Browser tool: expanded capabilities and error handling - Self-skills: persistent skill storage and retrieval - Platform adapters: Discord interaction improvements, router reliability - Observer: additional audit metadata - WebUI: attachment ingest, content handler, index updates - Misc: image gen, fast reply, ws_context, ws_l1, ws_react refinements
…gin, §2.5 cap removal - retry_after_rejection: handle ToolCall/ToolCalls via run_platform_tool_chain instead of silently dropping them (root cause of infinite observer loops) - Box::pin indirection for recursive async cycle (retry → tool_chain → audit → retry) - Remove MAX_TOOL_OUTPUT_CHARS=200K from tool_dispatch (§2.5 governance violation) - enforce_context_budget: 60% safety margin — chars/2 underestimates 1.5x for short-line content - Fix /4 → /2+2000 estimator in platform_exec logging - WebSocketSink: implement on_text to forward text_delta frames (was no-op, ate resume text) 82 tests pass, 0 warnings.
- Resume state now stores platform as 3-tuple (message, session_id, platform) - Derive platform from session_id prefix instead of hardcoding 'web' - Web: send session_switch command before streaming so WebUI opens correct session - Discord/Telegram: deliver resume to originating channel via send_message - PlatformRegistry::send_message for proactive (non-reply) messages - WebSocket path only consumes web resumes — leaves platform resumes for their adapters
Removes the system's self-applied 'tool inheritance' placeholder from ws_react_helpers.rs (§2.3 violation: placeholder with TODO comments). Also removes the recompile test comment from main.rs. The original fix targeted the wrong file — the actual failure is in tool_dispatch.rs where spawn_sub_agent is missing from the match arms.
Root cause: spawn_sub_agent was special-cased only in ws_react.rs (WebUI ReAct path) but missing from tool_dispatch.rs. All other paths (Discord, Telegram, L1, observer retry, sub-agent recursive) fell through to 'Unknown tool: spawn_sub_agent'. Changes: - Add dispatch_spawn_sub_agent helper to tool_dispatch.rs (§1.2 compliant) - Add 'spawn_sub_agent' match arm in execute_tool_with_state - Remove special-case intercept from ws_react.rs handle_single_tool - Remove dead execute_sub_agent wrapper from ws_react.rs - Remove dead execute_sub_agent function from ws_react_helpers.rs - Clean up unused imports and provider parameter All 6 tool execution call sites now go through the unified dispatch. 82 tests pass, 0 warnings.
Adds comprehensive anti-reward-hacking prompt to prompts/core.md (the factory default template), NOT data/prompts/core.md (the runtime copy that gets overwritten on factory reset). Covers: reward hacking taxonomy (8 failure modes), trace-first diagnostic mandate, epistemic honesty requirements, self-modification integrity checks, and the unsupervised reality constraints. Previous attempt incorrectly edited only the runtime copy, which was wiped by the user's factory reset — a Wrong-File Fix (section 1.C).
Root cause: swap_model only updated model_path, leaving mmproj_path pointing to the previous model's multimodal projector. Swapping from Gemma (with mmproj) to Qwen (without mmproj) crashed llama-server because it tried to load a Gemma mmproj with a Qwen model. Fix: scan models/ directory for a matching mmproj file when swapping. Strip quantisation suffix from model name and look for mmproj-* files containing the base name. If no match found, mmproj_path is set to None. Also: removed Gemma-specific comments from llamacpp.rs. The code itself (--jinja flag, enable_thinking chat_template_kwargs) is model-neutral — used by Gemma, Qwen, DeepSeek and others. Only the comments were wrong. Tests: 5 new tests for strip_quant_suffix, 662/662 lib tests pass.
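A rough sketch of the matching step described above. The name strip_quant_suffix is taken from the commit message, but its body here is a guess: it assumes model files are named like `<base>-<quant>.gguf` with quant tags such as Q4_K_M, IQ4_XS, F16, or BF16. `find_mmproj` and the file names in the usage note are hypothetical.

```rust
/// True if `tail` looks like a GGUF quantisation tag (assumed naming convention).
fn looks_like_quant(tail: &str) -> bool {
    let upper = tail.to_ascii_uppercase();
    if matches!(upper.as_str(), "F16" | "BF16" | "F32") {
        return true;
    }
    // Accept tags such as Q4_K_M, Q8_0, IQ4_XS: (I)Q followed by a digit.
    let rest = upper.strip_prefix("IQ").or_else(|| upper.strip_prefix('Q'));
    rest.and_then(|r| r.chars().next())
        .map_or(false, |c| c.is_ascii_digit())
}

/// Drop the ".gguf" extension and a trailing "-<quant>" tag, if present.
fn strip_quant_suffix(file_name: &str) -> &str {
    let stem = file_name.strip_suffix(".gguf").unwrap_or(file_name);
    match stem.rsplit_once('-') {
        Some((base, tail)) if looks_like_quant(tail) => base,
        _ => stem,
    }
}

/// Pick the first "mmproj-*" entry whose name contains the model's base name.
/// Returns None when nothing matches, so the caller can clear mmproj_path.
fn find_mmproj<'a>(model_file: &str, dir_entries: &[&'a str]) -> Option<&'a str> {
    let base = strip_quant_suffix(model_file).to_ascii_lowercase();
    dir_entries.iter().copied().find(|name| {
        let lower = name.to_ascii_lowercase();
        lower.starts_with("mmproj-") && lower.contains(&base)
    })
}
```

In this sketch, swapping to a model with no matching projector yields None, which corresponds to the "mmproj_path is set to None" branch in the fix.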
Regression from 6efe0ba: replacing consume_silently with WebSocketSink caused L1 tool chain text to stream live to the WebUI before the observer audit ran. The old consume_silently accumulated text silently and only forwarded thinking deltas. Fix: ThinkingOnlySink — identical to WebSocketSink but on_text is a no-op. Text is accumulated by consume_stream, held back, and only sent to the WebUI after deliver_reply → audit_and_retry approves it. 662/662 lib tests pass.
…SocketSink The first fix (bcea5a7) only covered the L1 tool chain path in ws_l1.rs. The initial inference in ws.rs line 310 was still using WebSocketSink, streaming text to the WebUI before audit_and_retry ran. Now ThinkingOnlySink in both paths. Only remaining WebSocketSink usage is the post-recompile resume greeting (no user query, no audit needed). 662/662 lib tests pass.
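The hold-back behaviour of ThinkingOnlySink can be sketched like this. The names ThinkingOnlySink, on_text, and consume_stream appear in the commit messages; the trait shape, the `StreamEvent` enum, and the `sent_to_client` field are assumptions made for illustration.

```rust
/// Stream events as a provider might emit them (hypothetical shape).
enum StreamEvent {
    Thinking(String),
    Text(String),
}

/// Hypothetical sink interface: receives deltas as they arrive.
trait StreamSink {
    fn on_thinking(&mut self, delta: &str);
    fn on_text(&mut self, delta: &str);
}

/// Forwards thinking deltas live but never forwards reply text.
#[derive(Default)]
struct ThinkingOnlySink {
    sent_to_client: Vec<String>,
}

impl StreamSink for ThinkingOnlySink {
    fn on_thinking(&mut self, delta: &str) {
        self.sent_to_client.push(format!("thinking: {delta}"));
    }

    // Deliberate no-op: text must not reach the client before the audit.
    fn on_text(&mut self, _delta: &str) {}
}

/// Drive the sink and accumulate the reply text for a later audit step.
fn consume_stream(events: &[StreamEvent], sink: &mut dyn StreamSink) -> String {
    let mut reply = String::new();
    for event in events {
        match event {
            StreamEvent::Thinking(delta) => sink.on_thinking(delta),
            StreamEvent::Text(delta) => {
                sink.on_text(delta);
                reply.push_str(delta);
            }
        }
    }
    reply
}
```

The accumulated string is what an audit step would inspect; only after approval would the caller send it on to the client.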
sleep_cycle, lesson_decay, and log_rotate are systemically required for the engine to function. They are no longer user-toggleable entries in scheduler.json — they run unconditionally as hardcoded spawn loops in spawn_maintenance(). scheduler.json now only contains optional learning tasks (attend_class, conduct_research, spaced_review).
Added §8 Anti-Pattern Catalogue (reward hacking, complexity injection, heuristic smuggling, investigation theatre, test theatre, shotgun debugging, scope creep, silent state mutation), §9 Lifecycle Invariants, §10 Contribution Protocol, §11 Review Rejection Criteria (15 auto-reject triggers), §12 Historical Violations (4 real incidents that birthed rules).
- Add Priority 0 entity/identity recall trigger (must exhaust 4 tiers before declaring unknown) - Add 'New Session ≠ New Identity' definition (session = window reset, not amnesia) - Add Synaptic KG Proactive Storage Discipline (explicit when-to-write/when-to-read guidance) - Replace prose anti-pattern warning with structured Verification of Absence Protocol - Remove L2-only annotation from synaptic in tier list and routing table Addresses: model failing to recall stored entities (Sunny, Matthew, Aberdeen) on new sessions due to insufficient prompt guidance on memory tool usage.
Synaptic is a memory tool, not a self-modification tool. Every other memory tool (timeline, memory, scratchpad, lessons) was already L1. Synaptic was incorrectly grouped with codebase_edit/system_recompile/ checkpoint in the L2-exclusive list. Moving it to L1 allows the model to store facts (user identity, pets, relationships, locations) during casual conversation without needing to escalate to the full ReAct loop. - Add synaptic_tool_schema() to layer1_tools() - Update L1 tool count test assertion: 22 -> 23
hud_data.rs:162 used raw byte slicing (&summary[..80]) which panics when byte 80 falls inside a multi-byte UTF-8 character (em-dash '—' occupies bytes 79..82). Root cause: format_timeline_narrative truncated timeline summaries at byte offset 80 without checking char boundaries. This crashes the entire SSE pipeline via the catch_unwind, delivering empty responses to Discord. Fix: Use char_indices() to find the last valid char boundary at or before byte 80, matching the existing safe pattern already used in extract_recent_reasoning (same file, line 126).
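A minimal sketch of the safe truncation pattern the fix describes, using char_indices() to find the last character boundary at or before the byte limit. The function name is hypothetical and this is not the code in hud_data.rs.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a UTF-8 character.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // char_indices() yields the starting byte offset of every character;
    // each such offset is a valid place to cut. Keep the last one <= max_bytes.
    let end = s
        .char_indices()
        .map(|(offset, _)| offset)
        .take_while(|&offset| offset <= max_bytes)
        .last()
        .unwrap_or(0);
    &s[..end]
}
```

With the input "a\u{2014}b" (an em-dash occupying bytes 1..4), a limit of 2 returns "a" where the raw slice `&s[..2]` would panic.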
Recompile is a destructive, stateful operation that should be tested separately — not inside the master capability sweep. Removed from: - Section 19.4 (Self-Recompile table) - L2 escalation objective - Phase 2 step list (step 90) - Coverage summary table - Expected outcomes description - Post-run check 21.14 - Summary table item count (18 → 12)
Root cause: chars/2 token estimation reported 52% budget usage at 269K chars (est 134K tokens) while the actual context overflowed 262K. The model ran 96 seconds then emitted finish_reason=stop with zero content: a complete inference failure.
Evidence from logs:
- Iteration 21: scratchpad returned 63K chars, total jumped to 202K
- Iteration 27: another 63K char scratchpad result, total hit 269K
- enforce_context_budget did NOT trim (est 134K < budget 157K)
- Pre-infer 90% check did NOT fire (52% < 90%)
- Model sat for 96 seconds then gave up
Fix: Change chars/2 to chars/3 across all estimation sites:
- platform_context.rs: enforce_context_budget + trim loop
- platform_exec.rs: pre-infer budget check + logging
- platform_reinfer.rs: empty response handler + exhaustion handler
The chars/3 ratio is more accurate for mixed content (JSON tool results, code, structured output). Combined with the existing 60% safety margin, this ensures trimming triggers before overflow.
trim_observer_message mutated every system/user message before sending to llama-server, breaking KV cache prefix matching. The observer was reprocessing all messages from scratch instead of getting a cache hit on the shared prefix. Evidence: 91-message context took 128s for the observer vs 0.5s for re-inferences that shared the same KV prefix. Normal 30-msg observer calls took ~15s (acceptable full reprocess). The 128s spike was exclusively from the trimming invalidating the prefix. Fix: pass messages through verbatim (true 1-to-1 context parity). The observer only needs to process the audit instruction delta.
- Store thinking content in data/reasoning/*.jsonl (not just metadata) - Wire both WebUI and Discord paths to persist thinking - HUD 'Recent Reasoning' now reads actual chain-of-thought from disk - introspect(reasoning_log) shows formatted thinking excerpts - Auto-prune entries older than 1 hour (50-entry threshold) - Wire observer audit results to data/observer_history.json - Wire tool activity events to data/agent_activity.json - All introspect actions now backed by real data - Fix MockProvider missing count_tokens (pre-existing) - Clean stale test comment from main.rs - 745/745 tests pass
- Remove arbitrary 50-iteration cap from L1 tool chain (§2.1 violation) - Fix context trimming bug where oversized tool results were skipped - Parse n_prompt_tokens from 400 Bad Request for accurate budget enforcement - Add re-inference nudge for thinking-only empty replies - Clean up observer, tool dispatch, and introspect logging - Update e2e tests for signature changes
Strip Unicode box-drawing characters (U+2500, U+2550) used as section dividers. The model echoes these back as ASCII '---' at the start of responses, polluting output. Content and structure preserved.
llama-server can generate thinking tokens internally without flushing them to the HTTP response stream, causing parse_sse_stream to block indefinitely on stream.next().await. Spiral detection never fires because ThinkingDelta events never arrive. Fix: add a 10-second interval stall watchdog in parse_sse_stream that queries the server's own /slots endpoint (model-derived data per §2.1). When the server has decoded 500+ tokens but the client received <10 chunks, the stream is aborted and platform_stream retries with thinking disabled. The /slots URL is passed only from llamacpp.rs (§7.2 provider neutrality); other providers pass None. HEURISTIC: 500 decoded / <10 chunks thresholds derived from observed minimum generation speed (~13 tok/s). See check_server_stall doc comment for full derivation and error margin.
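The stall condition itself reduces to a small predicate. The sketch below uses the two thresholds quoted in the commit message; the constant and function names are hypothetical, and the real check_server_stall also has to query /slots and parse its response, which is omitted here.

```rust
/// Thresholds quoted in the commit message (a heuristic, see its derivation note).
const STALL_MIN_DECODED_TOKENS: u64 = 500;
const STALL_MAX_CLIENT_CHUNKS: u64 = 10;

/// True when the server reports substantial decoding progress while the
/// client has received almost nothing on the HTTP stream.
fn is_stream_stalled(server_decoded_tokens: u64, client_chunks_received: u64) -> bool {
    server_decoded_tokens >= STALL_MIN_DECODED_TOKENS
        && client_chunks_received < STALL_MAX_CLIENT_CHUNKS
}
```

A watchdog would evaluate this on each 10-second tick and, when it returns true, abort the stream so the caller can retry with thinking disabled.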
…-§15) New governance sections for open-source community safety: - §13 Security Mandates: secrets, input validation, shell execution, network security, auth/authz, capability auditing, dependency security, unsafe Rust - §14 Network & Mesh Safety: zero-trust, TLS 1.3 transport, message signing, capability-based access control, resource boundaries, data sovereignty - §15 AI Agent Contributor Safety: minimal authority, self-modification guardrails, tool execution boundaries, prompt injection defence Also adds rejection criteria R16-R20 for security violations and documents V5 (Stalled SSE Stream) in the historical record.
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a fast SSD-backed GPU, but is not
enough for genuinely slow loads:
- >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
any GPU work begins)
- models split across multiple backends via llama.cpp RPC, where layer
transfer over the network adds tens of seconds even on a LAN
- models that need long KV cache allocation on constrained GPUs
When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.
This commit makes the retry budget configurable via a new toml field:
[general]
provider_health_check_retries = 240 # 4-minute budget
Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.
Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
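As an illustration of how a configurable retry budget changes the startup loop, here is a std-only sketch. It is not the code in main.rs: `wait_for_provider`, the `probe` closure, and the constant name are hypothetical, and the real field is read from the toml via a default-fn as described above.

```rust
use std::time::Duration;

/// Legacy budget, used when the toml omits provider_health_check_retries.
const DEFAULT_HEALTH_CHECK_RETRIES: u32 = 60;

/// Poll `probe` up to `retries` times, sleeping `delay` between attempts.
/// Returns the 1-based attempt number that succeeded.
fn wait_for_provider<F>(retries: u32, delay: Duration, mut probe: F) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=retries {
        if probe() {
            return Ok(attempt);
        }
        std::thread::sleep(delay);
    }
    Err(format!(
        "Provider failed health check after {retries} attempts. Is the server running?"
    ))
}
```

A load that becomes healthy on attempt 75 fails under the default budget of 60 but succeeds when the operator sets 240.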
ci: update rust toolchain action
feat(general): make provider health-check retry budget configurable
🔴 REJECTED: Governance Violation
Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)
Evidence: This PR pushes
Required before re-review:
The new code itself (rpc_servers field, empty-string guard, §13.4 security warning, edge-case tests) is solid. The file it lands in is not. Fix the file, resubmit.
Note: This PR is stacked on #5 which has the same issue. Both need the file split.
Phase 1 (Critical): Shell/path containment, fail-closed observer, CSS injection gate
Phase 2 (Auth): X-Confirm-Destructive gates, git hash sanitize, secret storage
Phase 3 (Paths): 17/17 hardcoded data/ paths → config-driven data_dir
Phase 4 (Constants): 8 magic numbers → named constants across 6 files
Phase 5 (Logging): 12 silent let _ = failures → explicit tracing::error/warn
Phase 6 (Size): All files under 500 non-test lines
- browser_tool.rs (566→299) → browser_actions.rs
- system.rs (559→420) → system_interp.rs
- ws.rs (556→479) → ws_resume.rs
- stream_parser.rs (531→390) → stream_parser_util.rs
Phase 7 (Parity): Retry logic added to ollama.rs + openai_compat.rs
Verification: 669/669 tests passed, 0 warnings, 0 errors. 35 files changed, 4 new modules created.
- CONTRIBUTING.md: comprehensive contribution workflow with governance compliance requirements, three-tier testing mandate (automated + E2E + manual verification), anti-pattern awareness, and AI agent constraints. - .github/PULL_REQUEST_TEMPLATE.md: structured PR template enforcing governance checklist, root cause analysis, and manual testing evidence. - .github/ISSUE_TEMPLATE/bug_report.yml: structured bug report with environment details and component selection. - .github/ISSUE_TEMPLATE/feature_request.yml: feature request with governance alignment verification. - .github/workflows/ci.yml: enhanced CI with governance enforcement — file size limits, no todo/unimplemented, doc comment checks, and secret detection on PRs. All contributions must now demonstrate E2E and manual testing before submission. 669/669 tests pass, 0 warnings.
core.md: - Added formal Behavioral Framework with sections D1-D4 - D1: Sycophancy Taxonomy (S1-S7) with positive/negative examples and 'What IS NOT Sycophancy' boundary definitions - D2: Confabulation Taxonomy (C1-C5) with clear distinction between confabulation and honest error - D3: Self-Assessment Integrity consolidating anti-RLHF, anti-hedging, anti-dismissal, anti-external-framing, and critique evaluation - D4: Communication Standards consolidating output format, narration ban, first-person mandate, honesty, and curiosity - Removed duplicate standalone sections now consolidated in framework observer.md: - Restructured all rules with severity tiers: 🔴 CRITICAL (6): capability hallucination, ghost tooling, actionable harm, unparsed tools, memory recall skipped, tool narration, corporate deference 🟡 STANDARD (15): sycophancy, confabulation, stale knowledge, etc. - Each rule now has: Definition, Signal, NOT a violation - Added taxonomy cross-references (S1-S7, C1-C5) to observer rules - Added 3 new rules: 19. Proportionality Violation (response effort vs input effort) 20. Position Collapse (abandoning verified position under pressure) 21. Corporate Deference (elevated to standalone critical rule) - Expanded DO NOT BLOCK section with concrete examples rules.rs: - Updated RULE_NAMES to 21 entries (was 20) - Updated hardcoded fallback to match new 21-rule structure - Fixed tests: rule count assertion, rule numbering verification 669/669 tests pass, 0 warnings.
- README: test count 674 → 669 (badge + inline table) - docs/README.md: test count 674 → 669 - docs/testing.md: test count 674 → 669 - CONTRIBUTING.md: memory tier 5-tier → 7-tier (matches all other docs) - prompts/core.md: observer category count 19 → 21, added taxonomy references (S1-S7, C1-C5), added new categories (proportionality_violation, position_collapse) - Synced data/prompts/core.md with prompts/core.md All documentation now cross-referenced against actual codebase state.
Adds `[llamacpp] context_length = N` toml field. Default 0 plumbs through
to `llama-server -c 0` (auto-derive from GGUF) — bit-for-bit identical to
current behavior. Operators set non-zero when hardware + the model's
advertised n_ctx_train combine to exceed KV cache budget (e.g. Qwen3.5/3.6
models that advertise 256K on hardware that cannot hold that much KV).
§1.1 file-split applied as part of this PR:
- Extracted build_server_args to src/provider/llamacpp_server_args.rs as
a free function over &LlamaCppConfig. LlamaCppProvider keeps a one-line
delegating method to preserve the public API.
- Moved the inline mod tests to src/provider/llamacpp_tests.rs per the
§1.1 audit checklist ("Move tests into a dedicated tests.rs sibling
file if they exceed ~100 lines").
Post-split:
- llamacpp.rs: 422 lines (was 518)
- llamacpp_server_args.rs: 52 lines (new)
- llamacpp_tests.rs: 102 lines (new sibling)
tracing::info! fires at server-startup when context_length override is
non-zero, per §2.7 ("behavior changes are never silent").
Tests cover default-produces-`-c 0` (legacy behavior preserved) and
explicit-value-overrides. cargo test: 11 passed, 0 failed.
Adds `[llamacpp] rpc_servers = "host:port,..."` toml field. When set, llama-server distributes layers across this node's GPU AND the listed remote rpc-server endpoints, which enables 27B-32B-class models on combined two-GPU meshes. Default None; empty-string treated as None to avoid emitting `--rpc ` with no value (which llama-server rejects with a confusing parse error).
§13.4 security warning: tracing::warn! fires at server-startup whenever rpc_servers is set, surfacing the unauthenticated / unencrypted security posture of llama.cpp's rpc-server so operators see it in logs even if they never read the toml field's doc comment.
Stacked on MettaMazza#5: same §1.1 file-split inherited (build_server_args lives in src/provider/llamacpp_server_args.rs sibling). This PR's incremental additions live in that sibling file + llamacpp_tests.rs + src/config/mod.rs; llamacpp.rs is unchanged from MettaMazza#5.
Post-merge sizes:
- llamacpp.rs: 422 lines (unchanged from MettaMazza#5)
- llamacpp_server_args.rs: 72 lines (was 52 in MettaMazza#5)
- llamacpp_tests.rs: 136 lines (was 102 in MettaMazza#5)
Tests cover None → no `--rpc`, Some("") → no `--rpc`, Some(value) → `--rpc <value>` emitted verbatim.
cargo test: 14 passed, 0 failed (cumulative on MettaMazza#5 stack).
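The three tested cases can be sketched with a trimmed-down stand-in for the config. This is not the PR's build_server_args: the struct below carries only the one field, `rpc_args` is a hypothetical name, `eprintln!` stands in for tracing::warn! to keep the sketch std-only, and the host:port values in the usage are made up.

```rust
/// Stand-in for the real LlamaCppConfig, reduced to the field under discussion.
struct LlamaCppConfig {
    rpc_servers: Option<String>,
}

/// Build the `--rpc` portion of the llama-server argument list.
/// None and Some("") both produce no arguments, so `--rpc` is never
/// emitted without a value.
fn rpc_args(cfg: &LlamaCppConfig) -> Vec<String> {
    let mut args = Vec::new();
    if let Some(servers) = cfg.rpc_servers.as_deref().filter(|s| !s.is_empty()) {
        // Stand-in for the startup tracing::warn! described above.
        eprintln!("warning: rpc_servers is set; llama.cpp rpc-server traffic is unauthenticated and unencrypted");
        args.push("--rpc".to_string());
        // The value is passed through verbatim.
        args.push(servers.to_string());
    }
    args
}
```

For example, a config with `rpc_servers = Some("10.0.0.2:50052".to_string())` yields `["--rpc", "10.0.0.2:50052"]`, while both None and an empty string yield an empty list.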
b58add6 to
fa30fb9
✅ APPROVED: Rebase Required (Blocked on #5)
Governance audit: all clear. The rpc_servers implementation is clean: empty-string guard, §13.4 security warning, three-case test coverage. All R1–R20 pass. The §14.1 (zero-trust networking) tension with llama.cpp's unauthenticated RPC is acknowledged; your
Merge order: This is stacked on #5, which needs a rebase to resolve the
Advisory (non-blocking): Consider a lightweight
Adds `[llamacpp] rpc_servers = "host:port,..."` toml field. When set, llama-server distributes layers across this node's GPU AND the listed remote rpc-server endpoints, which enables 27B-32B-class models on combined two-GPU meshes. Default None; empty-string treated as None to avoid emitting `--rpc` with no value (which llama-server rejects).
Tests embedded covering None → no `--rpc`, Some("") → no `--rpc`, Some(value) → `--rpc <value>` emitted verbatim.
Per your #dev-general feedback this adds a `tracing::warn!` at server-startup whenever rpc_servers is set, surfacing the unauthenticated/unencrypted security posture of llama.cpp's rpc-server (§13.4, your new section). Operators see the warning even if they never read the field's doc comment.
Third of three stacked PRs. Diff currently shows PR-A's and PR-B's commits too because of cross-fork base limitations; it collapses to just this commit after A and B merge.