
feat(llamacpp): add rpc_servers to enable multi-machine GPU mesh#6

Open
Scooter-DeJean wants to merge 45 commits into MettaMazza:main from Scooter-DeJean:feat/rpc-servers

Conversation

@Scooter-DeJean
Contributor

Adds [llamacpp] rpc_servers = "host:port,..." toml field. When set, llama-server distributes layers across this node's GPU AND the listed remote rpc-server endpoints — enables 27B-32B-class models on combined two-GPU meshes. Default None; empty-string treated as None to avoid emitting --rpc with no value (which llama-server rejects).

Tests embedded covering None → no --rpc, Some("") → no --rpc, Some(value) → --rpc <value> emitted verbatim.

Per your #dev-general feedback this adds a tracing::warn! at server-startup whenever rpc_servers is set, surfacing the unauthenticated/unencrypted security posture of llama.cpp's rpc-server (§13.4 — your new section). Operators see the warning even if they never read the field's doc comment.
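
The three cases listed above (None, empty string, set value) can be sketched as follows. This is a minimal std-only illustration, not the PR's actual code: the trimmed-down `LlamaCppConfig` and the `rpc_args` helper are hypothetical, and `eprintln!` stands in for the `tracing::warn!` call so the sketch needs no external crates.

```rust
/// Hypothetical stand-in for the real config struct; only the field
/// discussed in this PR is modelled.
pub struct LlamaCppConfig {
    /// Comma-separated `host:port` list of remote rpc-server endpoints.
    pub rpc_servers: Option<String>,
}

/// Returns the extra llama-server arguments for RPC offload.
/// `None` and `Some("")` both yield no arguments, so `--rpc` is never
/// emitted without a value.
pub fn rpc_args(config: &LlamaCppConfig) -> Vec<String> {
    match config.rpc_servers.as_deref() {
        Some(servers) if !servers.is_empty() => {
            // Assumption: the PR logs via tracing::warn! at server startup;
            // eprintln! is used here only to keep the sketch dependency-free.
            eprintln!(
                "warning: rpc_servers is set; llama.cpp rpc-server traffic \
                 is unauthenticated and unencrypted"
            );
            // The value is passed through verbatim.
            vec!["--rpc".to_string(), servers.to_string()]
        }
        _ => Vec::new(),
    }
}
```

Treating the empty string as `None` in one match arm keeps the guard and the flag emission in the same place, so a future caller cannot emit the flag without passing through the check.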

Third of three stacked PRs. Diff currently shows PR-A's and PR-B's commits too because of cross-fork base limitations — collapses to just this commit after A and B merge.

kilo-code-bot Bot and others added 30 commits April 19, 2026 10:54
Update dtolnay/rust-action/setup to dtolnay/rust-toolchain to use the current recommended action for Rust toolchain installation.
…nse bug)

Root cause: reinfer_and_dispatch hardcoded thinking=true for tool-chain
continuation. After a tool executes (e.g. grep returns 39 chars), the model
has already reasoned — re-enabling thinking causes Gemma 4 to immediately
emit stop with no content (verified: cache_n=48933, prompt_n=88, 327ms).

This broke the book club verbatim reading flow:
  grep (find chapter) → EMPTY RESPONSE → 35s recovery → file_read → analysis
Instead of:
  grep (find chapter) → file_read → verbatim quote

Fix: thinking=false for tool-chain re-inference. The model already thought
during initial inference. Also fixes mislabeled 'context overflow' error
that fires at 19.7% usage (not overflow — thinking mode bug).

706 tests passing, zero warnings.
…n into reality

Root cause: when discussing autofiction where characters/systems share names
with real entities (Maria the user = Maria the character, ErnOS the system =
ErnOS in the novel), the model progressively merged fiction with reality
across conversational turns, adopting fictional missions as its own.

Four-layer fix:

1. Digest fiction header (attachment_reader.rs):
   - build_digest now includes FICTION/REALITY PROTOCOL in footer
   - Instructs model to refer to 'the character Maria' not 'Maria'
   - Distinguishes fictional systems from real ones

2. Summariser [FICTION] prefix (attachment_reader.rs):
   - summarise_page prompt now instructs LLM to prefix fictional
     events with [FICTION] tags in page summaries
   - Prevents downstream confusion in digest content

3. Observer Rule 20 — fiction_reality_collapse (observer.md):
   - New audit rule detects: adopting fictional missions, treating
     fictional events as real evidence, collapsing autofiction identities
   - Exception for explicit role-play/creative collaboration
   - Added to failure_category enum in observer schema

4. Core prompt Fiction/Reality Protocol (core.md):
   - Positioned after Bounded Speculation (both address fact/fiction boundary)
   - Instructs tracking of conversational drift from analysis to immersion
   - Specific guidance for autofiction name collisions

Observer rule names updated: 19 → 20 rules.
706 tests passing, zero warnings.
…§2.4)

The observer bailout was a governance §2.4 violation: after 2-3 rejections,
the system silently delivered an UNAUDITED response to the user. A cap on
retries masks deeper bugs — if the model can't follow observer feedback,
that's a real problem to investigate, not hide.

Changes across all 3 observer paths:

1. platform_ingest.rs (platform adapter path):
   - Removed max_retries cap and handle_audit_bailout function
   - Removed AuditSummary::exhausted (unreachable dead code)
   - Loop now retries with feedback indefinitely until ALLOWED

2. ws_stream.rs (WebSocket L1 path):
   - Removed max_retries cap and handle_bailout function
   - Same uncapped loop until approved

3. ws_react.rs (WebSocket ReAct path):
   - Removed the consecutive_rejections >= 2 bailout branch
   - Always injects feedback and continues the loop
   - Resets consecutive_rejections on approval

4. observer/mod.rs:
   - Removed format_bailout_override function (dead code)
   - Removed its test
   - Updated module comment: 20 rules

Infrastructure errors (observer itself is down) still fail-open — that's
not a quality issue, it's an availability issue.

705 tests passing (1 removed: test_format_bailout_override), zero warnings.
Root cause: data/embeddings.json is a 4,077,705-byte single-line file.
codebase_search matched 'learning pipeline' inside it and returned the
entire 4MB line as a search result, blowing the context window.

The retry then failed because llama-server's KV cache matched 99.9% of
the prompt (prompt_n=1 on retry), hitting the same empty-response state.

Three fixes:

1. Data artifact exclusion (codebase_search.rs):
   Skip embeddings.json, golden_buffer.jsonl, rejection_buffer.jsonl,
   quarantine.json, review_deck.json, training_manifest.json from search.
   These are runtime data, not source code.

2. Match context snippets (codebase_search.rs):
   Show 200 chars before/after the match instead of the full line.
   Defense-in-depth against any future large-line file.
   Full content available via file_read.

3. KV cache escape (platform_reinfer.rs):
   Inject a recovery system message before retry to change the
   cache-visible prompt suffix, forcing actual token generation
   instead of cache-hit empty response.

710 tests passing (+5 new: context snippets, data artifact exclusion,
long-line cap). Zero warnings.
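
The context-snippet fix (item 2 above) might look roughly like this. It is a sketch under assumptions: `match_snippet` is a hypothetical name, the radius is counted in characters rather than bytes, and only the first match on a line is handled.

```rust
/// Returns up to `radius` characters on each side of the first occurrence
/// of `needle` in `line`, or `None` when there is no match.
pub fn match_snippet(line: &str, needle: &str, radius: usize) -> Option<String> {
    let start = line.find(needle)?;
    let end = start + needle.len();

    // Walk back at most `radius` characters from the match start.
    // Falls back to `start` itself when there is nothing before the match.
    let from = line[..start]
        .char_indices()
        .rev()
        .take(radius)
        .last()
        .map(|(i, _)| i)
        .unwrap_or(start);

    // Walk forward at most `radius` characters from the match end.
    // Falls back to the end of the line when fewer characters remain.
    let to = line[end..]
        .char_indices()
        .nth(radius)
        .map(|(i, _)| end + i)
        .unwrap_or(line.len());

    Some(line[from..to].to_string())
}
```

Both offsets come from `char_indices`, so the slice never splits a multi-byte character even on a multi-megabyte single-line file.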
- New: src/web/handlers/curriculum.rs — 5 API endpoints:
  GET /api/curriculum (list courses with progress)
  POST /api/curriculum (add course)
  DELETE /api/curriculum/{id} (remove course)
  GET /api/curriculum/{id}/progress (detailed progress)
  GET /api/curriculum/review (review deck stats)

- New: curriculum_tests.rs — 8 unit tests (success + error paths)
  per §3.1 and §3.2 governance mandates

- Updated: Training tab → 4 sub-tabs (Buffers, Curriculum, Review, Adapters)
  Course cards with level badges, progress bars, completion %
  Review tab with Leitner deck stats (due, total, retention rate)

- CSS: .course-card, .course-progress-track, .level-badge styles

- 718 tests pass, 0 regressions (642 unit + 76 e2e)
New curriculum_e2e module in e2e_tests.rs:
1. test_attend_class_full_pipeline — process_scene → verify → route
   to golden/quarantine buffers with MockProvider
2. test_attend_class_all_complete — full course completion with avg score
3. test_flush_session_to_shared_state — session buffers → shared Arc state
4. test_spaced_review_generates_cards — Leitner card generation from
   completed quiz scenes
5. test_curriculum_store_roundtrip — add/list/progress/remove/persistence
6. test_review_deck_stats — deck stats, add card, record result, due count

724 tests pass, 0 regressions (642 unit + 82 e2e)
core.md: 438 → 563 lines. Previously undocumented systems:

Learning Pipeline (was 2 sentences, now ~100 lines):
- Curriculum system: 5 education levels, 12+Custom subjects, 16 scene types
- Student loop: process_scene → verify → route (golden/rejection/quarantine)
- Leitner 5-box spaced review with [1,3,7,14,30] day intervals
- Research projects: 6-phase lifecycle (LiteratureSurvey → Complete)
- Graduation gates with adapter fusion via mlx_lm.fuse
- Training execution: teacher, sleep cycle, MLX bridge, distillation
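
For illustration, the Leitner 5-box schedule with the [1,3,7,14,30] day intervals could be modelled as below. The `Card` struct and `record_result` function are hypothetical, and the rule that a missed card returns to box 1 is the classic Leitner convention, assumed here rather than confirmed for ErnOS.

```rust
/// Review intervals in days for boxes 1 through 5, as listed above.
pub const LEITNER_INTERVALS_DAYS: [u32; 5] = [1, 3, 7, 14, 30];

/// Hypothetical card state; `box_index` is zero-based (0 means box 1).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Card {
    pub box_index: usize,
    pub due_in_days: u32,
}

/// Applies one review result. A correct answer promotes the card by one
/// box (capped at box 5); a miss sends it back to box 1 (assumed rule).
pub fn record_result(card: Card, correct: bool) -> Card {
    let box_index = if correct {
        (card.box_index + 1).min(LEITNER_INTERVALS_DAYS.len() - 1)
    } else {
        0
    };
    Card {
        box_index,
        due_in_days: LEITNER_INTERVALS_DAYS[box_index],
    }
}
```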

New sections:
- Scheduler (9 job types including AttendClass, ConductResearch, SpacedReview)
- Agents & Teams (AgentRegistry, TeamRegistry, Parallel/Sequential execution)
- Code Verification Pipeline (build → test → browser, verify_code tool)
- Output Sanitizer (scrub_tool_leaks, needs_reinference)
- Spiral detector, auto-start services (embedding, Kokoro, Flux, code-server)
- Voice/video call handlers
- Observer skill/insight extraction

Updated sections:
- Tool lists: L1=22, L2=26, Safe=9 (corrected from incomplete lists)
- Additional Tools: added introspect (6 actions), session_recall (5 actions)
- Platform Adapters: Discord thinking threads, slash commands, RBAC
- Hardware: spiral detector, auto-start, voice/video pipelines
- GitHub + Discord links

HUD code change (src/prompt/hud.rs, src/web/ws_context.rs):
- Added curriculum_count, review_total, review_due, quarantine_count
- Agent now sees learning state on every turn

724 tests pass, 0 regressions
… platform adapter improvements

- HUD: enriched system status display with live telemetry
- Browser tool: expanded capabilities and error handling
- Self-skills: persistent skill storage and retrieval
- Platform adapters: Discord interaction improvements, router reliability
- Observer: additional audit metadata
- WebUI: attachment ingest, content handler, index updates
- Misc: image gen, fast reply, ws_context, ws_l1, ws_react refinements
…gin, §2.5 cap removal

- retry_after_rejection: handle ToolCall/ToolCalls via run_platform_tool_chain
  instead of silently dropping them (root cause of infinite observer loops)
- Box::pin indirection for recursive async cycle (retry → tool_chain → audit → retry)
- Remove MAX_TOOL_OUTPUT_CHARS=200K from tool_dispatch (§2.5 governance violation)
- enforce_context_budget: 60% safety margin — chars/2 underestimates 1.5x for short-line content
- Fix /4 → /2+2000 estimator in platform_exec logging
- WebSocketSink: implement on_text to forward text_delta frames (was no-op, ate resume text)

82 tests pass, 0 warnings.
- Resume state now stores platform as 3-tuple (message, session_id, platform)
- Derive platform from session_id prefix instead of hardcoding 'web'
- Web: send session_switch command before streaming so WebUI opens correct session
- Discord/Telegram: deliver resume to originating channel via send_message
- PlatformRegistry::send_message for proactive (non-reply) messages
- WebSocket path only consumes web resumes — leaves platform resumes for their adapters
Removes the system's self-applied 'tool inheritance' placeholder from
ws_react_helpers.rs (§2.3 violation: placeholder with TODO comments).
Also removes the recompile test comment from main.rs.

The original fix targeted the wrong file — the actual failure is in
tool_dispatch.rs where spawn_sub_agent is missing from the match arms.
Root cause: spawn_sub_agent was special-cased only in ws_react.rs (WebUI
ReAct path) but missing from tool_dispatch.rs. All other paths (Discord,
Telegram, L1, observer retry, sub-agent recursive) fell through to
'Unknown tool: spawn_sub_agent'.

Changes:
- Add dispatch_spawn_sub_agent helper to tool_dispatch.rs (§1.2 compliant)
- Add 'spawn_sub_agent' match arm in execute_tool_with_state
- Remove special-case intercept from ws_react.rs handle_single_tool
- Remove dead execute_sub_agent wrapper from ws_react.rs
- Remove dead execute_sub_agent function from ws_react_helpers.rs
- Clean up unused imports and provider parameter

All 6 tool execution call sites now go through the unified dispatch.
82 tests pass, 0 warnings.
Adds comprehensive anti-reward-hacking prompt to prompts/core.md (the
factory default template), NOT data/prompts/core.md (the runtime copy
that gets overwritten on factory reset).

Covers: reward hacking taxonomy (8 failure modes), trace-first
diagnostic mandate, epistemic honesty requirements, self-modification
integrity checks, and the unsupervised reality constraints.

Previous attempt incorrectly edited only the runtime copy, which was
wiped by the user's factory reset — a Wrong-File Fix (section 1.C).
Root cause: swap_model only updated model_path, leaving mmproj_path
pointing to the previous model's multimodal projector. Swapping from
Gemma (with mmproj) to Qwen (without mmproj) crashed llama-server
because it tried to load a Gemma mmproj with a Qwen model.

Fix: scan models/ directory for a matching mmproj file when swapping.
Strip quantisation suffix from model name and look for mmproj-* files
containing the base name. If no match found, mmproj_path is set to None.

Also: removed Gemma-specific comments from llamacpp.rs. The code itself
(--jinja flag, enable_thinking chat_template_kwargs) is model-neutral —
used by Gemma, Qwen, DeepSeek and others. Only the comments were wrong.

Tests: 5 new tests for strip_quant_suffix, 662/662 lib tests pass.
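
A rough sketch of the matching step described above, under stated assumptions: the set of quantisation tags recognised by `is_quant_tag`, the case-insensitive comparison, and the `find_mmproj` helper are all illustrative guesses, and the file names in any usage are made up. Only the name `strip_quant_suffix` comes from the commit message.

```rust
/// Strips a trailing quantisation tag such as `-Q4_K_M` or `-F16` from a
/// model file stem. The accepted tag shapes are an assumption.
pub fn strip_quant_suffix(stem: &str) -> &str {
    match stem.rsplit_once(|c: char| c == '-' || c == '.') {
        Some((base, tag)) if is_quant_tag(tag) => base,
        _ => stem,
    }
}

/// Recognises F16/F32/BF16 and tags starting with Q<digit> or IQ<digit>.
fn is_quant_tag(tag: &str) -> bool {
    let t = tag.to_ascii_uppercase();
    if matches!(t.as_str(), "F16" | "F32" | "BF16") {
        return true;
    }
    let rest = t.strip_prefix("IQ").or_else(|| t.strip_prefix('Q'));
    matches!(rest.and_then(|r| r.chars().next()), Some(c) if c.is_ascii_digit())
}

/// Hypothetical helper: picks the first `mmproj-*` file whose name
/// contains the model's base name, or `None` when nothing matches.
pub fn find_mmproj<'a>(model_stem: &str, files: &[&'a str]) -> Option<&'a str> {
    let base = strip_quant_suffix(model_stem).to_ascii_lowercase();
    files.iter().copied().find(|f| {
        let name = f.to_ascii_lowercase();
        name.starts_with("mmproj-") && name.contains(&base)
    })
}
```

Returning `None` when no projector matches mirrors the behaviour described in the fix: the stale `mmproj_path` is cleared instead of being carried over to the new model.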
Regression from 6efe0ba: replacing consume_silently with WebSocketSink
caused L1 tool chain text to stream live to the WebUI before the
observer audit ran. The old consume_silently accumulated text silently
and only forwarded thinking deltas.

Fix: ThinkingOnlySink — identical to WebSocketSink but on_text is a
no-op. Text is accumulated by consume_stream, held back, and only
sent to the WebUI after deliver_reply → audit_and_retry approves it.

662/662 lib tests pass.
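
The difference between the two sinks can be sketched as follows. The `StreamSink` trait, the `Event` enum, and the `sent` vectors are hypothetical simplifications; the real trait and `consume_stream` signatures are not shown in this thread.

```rust
/// Hypothetical sink trait with one callback per delta kind.
pub trait StreamSink {
    fn on_thinking(&mut self, delta: &str);
    fn on_text(&mut self, delta: &str);
}

/// Forwards both thinking and text deltas to the client immediately.
#[derive(Default)]
pub struct WebSocketSink {
    pub sent: Vec<String>,
}

impl StreamSink for WebSocketSink {
    fn on_thinking(&mut self, delta: &str) {
        self.sent.push(format!("thinking:{}", delta));
    }
    fn on_text(&mut self, delta: &str) {
        self.sent.push(format!("text:{}", delta));
    }
}

/// Forwards thinking deltas only; text is withheld until the audit approves.
#[derive(Default)]
pub struct ThinkingOnlySink {
    pub sent: Vec<String>,
}

impl StreamSink for ThinkingOnlySink {
    fn on_thinking(&mut self, delta: &str) {
        self.sent.push(format!("thinking:{}", delta));
    }
    fn on_text(&mut self, _delta: &str) {
        // Intentionally a no-op: the caller accumulates the reply text.
    }
}

/// Simplified stream event.
pub enum Event<'a> {
    Thinking(&'a str),
    Text(&'a str),
}

/// Feeds events to the sink and returns the accumulated reply text, which
/// the caller would deliver only after the observer audit passes.
pub fn consume_stream(events: &[Event<'_>], sink: &mut dyn StreamSink) -> String {
    let mut reply = String::new();
    for event in events {
        match event {
            Event::Thinking(delta) => sink.on_thinking(delta),
            Event::Text(delta) => {
                reply.push_str(delta);
                sink.on_text(delta);
            }
        }
    }
    reply
}
```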
…SocketSink

The first fix (bcea5a7) only covered the L1 tool chain path in ws_l1.rs.
The initial inference in ws.rs line 310 was still using WebSocketSink,
streaming text to the WebUI before audit_and_retry ran.

Now ThinkingOnlySink in both paths. Only remaining WebSocketSink usage
is the post-recompile resume greeting (no user query, no audit needed).

662/662 lib tests pass.
sleep_cycle, lesson_decay, and log_rotate are systemically required
for the engine to function. They are no longer user-toggleable entries
in scheduler.json — they run unconditionally as hardcoded spawn loops
in spawn_maintenance(). scheduler.json now only contains optional
learning tasks (attend_class, conduct_research, spaced_review).
Added §8 Anti-Pattern Catalogue (reward hacking, complexity injection,
heuristic smuggling, investigation theatre, test theatre, shotgun debugging,
scope creep, silent state mutation), §9 Lifecycle Invariants, §10
Contribution Protocol, §11 Review Rejection Criteria (15 auto-reject
triggers), §12 Historical Violations (4 real incidents that birthed rules).
- Add Priority 0 entity/identity recall trigger (must exhaust 4 tiers before declaring unknown)
- Add 'New Session ≠ New Identity' definition (session = window reset, not amnesia)
- Add Synaptic KG Proactive Storage Discipline (explicit when-to-write/when-to-read guidance)
- Replace prose anti-pattern warning with structured Verification of Absence Protocol
- Remove L2-only annotation from synaptic in tier list and routing table

Addresses: model failing to recall stored entities (Sunny, Matthew, Aberdeen)
on new sessions due to insufficient prompt guidance on memory tool usage.
Synaptic is a memory tool, not a self-modification tool. Every other
memory tool (timeline, memory, scratchpad, lessons) was already L1.
Synaptic was incorrectly grouped with codebase_edit/system_recompile/
checkpoint in the L2-exclusive list.

Moving it to L1 allows the model to store facts (user identity, pets,
relationships, locations) during casual conversation without needing
to escalate to the full ReAct loop.

- Add synaptic_tool_schema() to layer1_tools()
- Update L1 tool count test assertion: 22 -> 23
hud_data.rs:162 used raw byte slicing (&summary[..80]) which panics
when byte 80 falls inside a multi-byte UTF-8 character (em-dash '—'
occupies bytes 79..82).

Root cause: format_timeline_narrative truncated timeline summaries at
byte offset 80 without checking char boundaries. This crashes the
entire SSE pipeline via the catch_unwind, delivering empty responses
to Discord.

Fix: Use char_indices() to find the last valid char boundary at or
before byte 80, matching the existing safe pattern already used in
extract_recent_reasoning (same file, line 126).
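
A minimal sketch of the boundary-safe truncation described in the fix. The function name `truncate_at_char_boundary` is hypothetical; the actual code in hud_data.rs may be structured differently.

```rust
/// Truncates `s` to at most `max_bytes` bytes without splitting a UTF-8
/// character, returning a borrowed prefix.
pub fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Find the last character start offset at or before `max_bytes`.
    // Offset 0 always qualifies, so the fallback is never hit in practice.
    let cut = s
        .char_indices()
        .map(|(i, _)| i)
        .take_while(|&i| i <= max_bytes)
        .last()
        .unwrap_or(0);
    &s[..cut]
}
```

With 79 ASCII bytes followed by an em-dash, a limit of 80 cuts at byte 79 instead of panicking inside the three-byte character.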
Recompile is a destructive, stateful operation that should be tested
separately — not inside the master capability sweep. Removed from:
- Section 19.4 (Self-Recompile table)
- L2 escalation objective
- Phase 2 step list (step 90)
- Coverage summary table
- Expected outcomes description
- Post-run check 21.14
- Summary table item count (18 → 12)
Root cause: chars/2 token estimation reported 52% budget usage at
269K chars (est 134K tokens) while the actual context overflowed
262K. The model ran 96 seconds then emitted finish_reason=stop
with zero content — complete inference failure.

Evidence from logs:
- Iteration 21: scratchpad returned 63K chars, total jumped to 202K
- Iteration 27: another 63K char scratchpad result, total hit 269K
- enforce_context_budget did NOT trim (est 134K < budget 157K)
- Pre-infer 90% check did NOT fire (52% < 90%)
- Model sat for 96 seconds then gave up

Fix: Change chars/2 to chars/3 across all estimation sites:
- platform_context.rs: enforce_context_budget + trim loop
- platform_exec.rs: pre-infer budget check + logging
- platform_reinfer.rs: empty response handler + exhaustion handler

The chars/3 ratio is more accurate for mixed content (JSON tool
results, code, structured output). Combined with the existing 60%
safety margin, this ensures trimming triggers before overflow.
trim_observer_message mutated every system/user message before
sending to llama-server, breaking KV cache prefix matching. The
observer was reprocessing all messages from scratch instead of
getting a cache hit on the shared prefix.

Evidence: 91-message context took 128s for the observer vs 0.5s
for re-inferences that shared the same KV prefix. Normal 30-msg
observer calls took ~15s (acceptable full reprocess). The 128s
spike was exclusively from the trimming invalidating the prefix.

Fix: pass messages through verbatim (true 1-to-1 context parity).
The observer only needs to process the audit instruction delta.
MettaMazza and others added 7 commits May 5, 2026 11:49
- Store thinking content in data/reasoning/*.jsonl (not just metadata)
- Wire both WebUI and Discord paths to persist thinking
- HUD 'Recent Reasoning' now reads actual chain-of-thought from disk
- introspect(reasoning_log) shows formatted thinking excerpts
- Auto-prune entries older than 1 hour (50-entry threshold)
- Wire observer audit results to data/observer_history.json
- Wire tool activity events to data/agent_activity.json
- All introspect actions now backed by real data
- Fix MockProvider missing count_tokens (pre-existing)
- Clean stale test comment from main.rs
- 745/745 tests pass
- Remove arbitrary 50-iteration cap from L1 tool chain (§2.1 violation)
- Fix context trimming bug where oversized tool results were skipped
- Parse n_prompt_tokens from 400 Bad Request for accurate budget enforcement
- Add re-inference nudge for thinking-only empty replies
- Clean up observer, tool dispatch, and introspect logging
- Update e2e tests for signature changes
Strip Unicode box-drawing characters (U+2500, U+2550) used as section
dividers. The model echoes these back as ASCII '---' at the start of
responses, polluting output. Content and structure preserved.
llama-server can generate thinking tokens internally without flushing
them to the HTTP response stream, causing parse_sse_stream to block
indefinitely on stream.next().await. Spiral detection never fires
because ThinkingDelta events never arrive.

Fix: add a 10-second interval stall watchdog in parse_sse_stream that
queries the server's own /slots endpoint (model-derived data per §2.1).
When the server has decoded 500+ tokens but the client received <10
chunks, the stream is aborted and platform_stream retries with thinking
disabled. The /slots URL is passed only from llamacpp.rs (§7.2 provider
neutrality); other providers pass None.

HEURISTIC: 500 decoded / <10 chunks thresholds derived from observed
minimum generation speed (~13 tok/s). See check_server_stall doc
comment for full derivation and error margin.
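
The stall condition itself reduces to a small predicate. In this sketch `is_stalled` and the constant names are hypothetical, and the decoded-token count is passed in as a plain number because the way it is read from the /slots response is not shown here.

```rust
/// Thresholds quoted in the commit message above.
pub const STALL_MIN_DECODED_TOKENS: u64 = 500;
pub const STALL_MAX_CLIENT_CHUNKS: u64 = 10;

/// Returns true when the server reports substantial decoding progress
/// while the client has received almost nothing from the stream.
pub fn is_stalled(server_decoded_tokens: u64, client_chunks: u64) -> bool {
    server_decoded_tokens >= STALL_MIN_DECODED_TOKENS
        && client_chunks < STALL_MAX_CLIENT_CHUNKS
}
```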
…-§15)

New governance sections for open-source community safety:
- §13 Security Mandates: secrets, input validation, shell execution,
  network security, auth/authz, capability auditing, dependency
  security, unsafe Rust
- §14 Network & Mesh Safety: zero-trust, TLS 1.3 transport, message
  signing, capability-based access control, resource boundaries,
  data sovereignty
- §15 AI Agent Contributor Safety: minimal authority, self-modification
  guardrails, tool execution boundaries, prompt injection defence

Also adds rejection criteria R16-R20 for security violations and
documents V5 (Stalled SSE Stream) in the historical record.
The provider startup health-check is hardcoded to 60 retries (1s each) in
main.rs. 60s is fine for a 7B model on a fast SSD-backed GPU, but is not
enough for genuinely slow loads:

  - >20B models on slower disks (15GB GGUF read at 200MB/s = 75s before
    any GPU work begins)
  - models split across multiple backends via llama.cpp RPC, where layer
    transfer over the network adds tens of seconds even on a LAN
  - models that need long KV cache allocation on constrained GPUs

When this trips, llama-server is alive and progressing through the load,
but ErnOS exits with "Provider failed health check after 60 attempts. Is
the server running?" — a misleading error.

This commit makes the retry budget configurable via a new toml field:

    [general]
    provider_health_check_retries = 240   # 4-minute budget

Default is 60 via a default-fn — matches the legacy hardcoded value, so
existing tomls without this field deserialize to identical pre-patch
behavior.

Tests cover: default-is-60, missing-field-deserializes-to-60, explicit-
value-honored.
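
As a std-only sketch of the behaviour described above: the commit uses a serde default-fn on the toml field, which is modelled here as an `Option<u32>` so the example needs no external crates. `resolve_retries` and `wait_for_provider` are hypothetical names, and the probe closure stands in for the real health-check request.

```rust
use std::time::Duration;

/// Legacy hardcoded budget, kept as the default.
pub const DEFAULT_HEALTH_CHECK_RETRIES: u32 = 60;

/// Resolves the retry budget: a missing field falls back to 60.
pub fn resolve_retries(configured: Option<u32>) -> u32 {
    configured.unwrap_or(DEFAULT_HEALTH_CHECK_RETRIES)
}

/// Polls `probe` up to `retries` times, sleeping `interval` between
/// failed attempts. Returns the 1-based attempt number that succeeded.
pub fn wait_for_provider<F>(retries: u32, interval: Duration, mut probe: F) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    for attempt in 1..=retries {
        if probe() {
            return Ok(attempt);
        }
        // No need to sleep after the final failed attempt.
        if attempt < retries {
            std::thread::sleep(interval);
        }
    }
    Err(format!("Provider failed health check after {} attempts", retries))
}
```

With a one-second interval, a budget of 240 gives the four-minute window mentioned in the toml comment.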
MettaMazza added 2 commits May 7, 2026 08:45
ci: update rust toolchain action
feat(general): make provider health-check retry budget configurable
@MettaMazza
Owner

🔴 REJECTED — Governance Violation

Rule: §1.1 (Max 500 lines per file) / §11 R11 (Immediate rejection trigger)

Evidence: This PR pushes src/provider/llamacpp.rs to 611 lines — 111 lines over the §1.1 limit of 500. §11 states: "A PR is immediately rejected if it contains any of the following. No discussion, no exceptions."

Required before re-review:

  • Split llamacpp.rs as part of this PR. The file must be under 500 lines after merge. Extract tests into llamacpp_tests.rs, move build_server_args into a server_args.rs submodule, or both.

The new code itself (rpc_servers field, empty-string guard, §13.4 security warning, edge-case tests) is solid. The file it lands in is not. Fix the file, resubmit.

Note: This PR is stacked on #5 which has the same issue. Both need the file split.

MettaMazza and others added 6 commits May 8, 2026 10:03
Phase 1 (Critical): Shell/path containment, fail-closed observer, CSS injection gate
Phase 2 (Auth): X-Confirm-Destructive gates, git hash sanitize, secret storage
Phase 3 (Paths): 17/17 hardcoded data/ paths → config-driven data_dir
Phase 4 (Constants): 8 magic numbers → named constants across 6 files
Phase 5 (Logging): 12 silent let _ = failures → explicit tracing::error/warn
Phase 6 (Size): All files under 500 non-test lines
  - browser_tool.rs (566→299) → browser_actions.rs
  - system.rs (559→420) → system_interp.rs
  - ws.rs (556→479) → ws_resume.rs
  - stream_parser.rs (531→390) → stream_parser_util.rs
Phase 7 (Parity): Retry logic added to ollama.rs + openai_compat.rs

Verification: 669/669 tests passed, 0 warnings, 0 errors.
35 files changed, 4 new modules created.
- CONTRIBUTING.md: comprehensive contribution workflow with governance
  compliance requirements, three-tier testing mandate (automated + E2E +
  manual verification), anti-pattern awareness, and AI agent constraints.
- .github/PULL_REQUEST_TEMPLATE.md: structured PR template enforcing
  governance checklist, root cause analysis, and manual testing evidence.
- .github/ISSUE_TEMPLATE/bug_report.yml: structured bug report with
  environment details and component selection.
- .github/ISSUE_TEMPLATE/feature_request.yml: feature request with
  governance alignment verification.
- .github/workflows/ci.yml: enhanced CI with governance enforcement —
  file size limits, no todo/unimplemented, doc comment checks, and
  secret detection on PRs.

All contributions must now demonstrate E2E and manual testing before
submission. 669/669 tests pass, 0 warnings.
core.md:
- Added formal Behavioral Framework with sections D1-D4
- D1: Sycophancy Taxonomy (S1-S7) with positive/negative examples
  and 'What IS NOT Sycophancy' boundary definitions
- D2: Confabulation Taxonomy (C1-C5) with clear distinction between
  confabulation and honest error
- D3: Self-Assessment Integrity consolidating anti-RLHF, anti-hedging,
  anti-dismissal, anti-external-framing, and critique evaluation
- D4: Communication Standards consolidating output format, narration
  ban, first-person mandate, honesty, and curiosity
- Removed duplicate standalone sections now consolidated in framework

observer.md:
- Restructured all rules with severity tiers:
  🔴 CRITICAL (6): capability hallucination, ghost tooling, actionable
  harm, unparsed tools, memory recall skipped, tool narration, corporate
  deference
  🟡 STANDARD (15): sycophancy, confabulation, stale knowledge, etc.
- Each rule now has: Definition, Signal, NOT a violation
- Added taxonomy cross-references (S1-S7, C1-C5) to observer rules
- Added 3 new rules:
  19. Proportionality Violation (response effort vs input effort)
  20. Position Collapse (abandoning verified position under pressure)
  21. Corporate Deference (elevated to standalone critical rule)
- Expanded DO NOT BLOCK section with concrete examples

rules.rs:
- Updated RULE_NAMES to 21 entries (was 20)
- Updated hardcoded fallback to match new 21-rule structure
- Fixed tests: rule count assertion, rule numbering verification

669/669 tests pass, 0 warnings.
- README: test count 674 → 669 (badge + inline table)
- docs/README.md: test count 674 → 669
- docs/testing.md: test count 674 → 669
- CONTRIBUTING.md: memory tier 5-tier → 7-tier (matches all other docs)
- prompts/core.md: observer category count 19 → 21, added
  taxonomy references (S1-S7, C1-C5), added new categories
  (proportionality_violation, position_collapse)
- Synced data/prompts/core.md with prompts/core.md

All documentation now cross-referenced against actual codebase state.
Adds `[llamacpp] context_length = N` toml field. Default 0 plumbs through
to `llama-server -c 0` (auto-derive from GGUF) — bit-for-bit identical to
current behavior. Operators set non-zero when hardware + the model's
advertised n_ctx_train combine to exceed KV cache budget (e.g. Qwen3.5/3.6
models that advertise 256K on hardware that cannot hold that much KV).

§1.1 file-split applied as part of this PR:
- Extracted build_server_args to src/provider/llamacpp_server_args.rs as
  a free function over &LlamaCppConfig. LlamaCppProvider keeps a one-line
  delegating method to preserve the public API.
- Moved the inline mod tests to src/provider/llamacpp_tests.rs per the
  §1.1 audit checklist ("Move tests into a dedicated tests.rs sibling
  file if they exceed ~100 lines").

Post-split:
- llamacpp.rs: 422 lines (was 518)
- llamacpp_server_args.rs: 52 lines (new)
- llamacpp_tests.rs: 102 lines (new sibling)

tracing::info! fires at server-startup when context_length override is
non-zero, per §2.7 ("behavior changes are never silent").

Tests cover default-produces-`-c 0` (legacy behavior preserved) and
explicit-value-overrides. cargo test: 11 passed, 0 failed.
Adds `[llamacpp] rpc_servers = "host:port,..."` toml field. When set,
llama-server distributes layers across this node's GPU AND the listed
remote rpc-server endpoints — enables 27B-32B-class models on combined
two-GPU meshes. Default None; empty-string treated as None to avoid
emitting `--rpc ` with no value (which llama-server rejects with a
confusing parse error).

§13.4 security warning: tracing::warn! fires at server-startup whenever
rpc_servers is set, surfacing the unauthenticated / unencrypted security
posture of llama.cpp's rpc-server so operators see it in logs even if
they never read the toml field's doc comment.

Stacked on MettaMazza#5 — same §1.1 file-split inherited (build_server_args lives
in src/provider/llamacpp_server_args.rs sibling). This PR's incremental
additions live in that sibling file + llamacpp_tests.rs + src/config/mod.rs;
llamacpp.rs is unchanged from MettaMazza#5.

Post-merge sizes:
- llamacpp.rs: 422 lines (unchanged from MettaMazza#5)
- llamacpp_server_args.rs: 72 lines (was 52 in MettaMazza#5)
- llamacpp_tests.rs: 136 lines (was 102 in MettaMazza#5)

Tests cover None → no `--rpc`, Some("") → no `--rpc`, Some(value) →
`--rpc <value>` emitted verbatim. cargo test: 14 passed, 0 failed
(cumulative on MettaMazza#5 stack).
@Scooter-DeJean
Contributor Author

Same split inherited from #5. PR #6 incremental adds the rpc_servers field, the §13.4 security warning, and 3 tests (None / empty-string / set). llamacpp.rs unchanged at 422 lines; llamacpp_server_args.rs 72; llamacpp_tests.rs 136. cargo test passes 14/14 cumulative.

@MettaMazza
Owner

✅ APPROVED — Rebase Required (Blocked on #5)

Governance audit: all clear. The rpc_servers implementation is clean — empty-string guard, §13.4 security warning, three-case test coverage. All R1–R20 pass.

The §14.1 (zero-trust networking) tension with llama.cpp's unauthenticated RPC is acknowledged — your tracing::warn! recommending Tailscale/WireGuard is the right mitigation given that auth is llama.cpp's responsibility, not ours.

Merge order: This is stacked on #5, which needs a rebase to resolve the -np 1 vs. -np 2 slot count conflict. Once #5 is rebased and merged, rebase this onto the updated main and I'll merge.

Advisory (non-blocking): Consider a lightweight host:port format validation on the rpc_servers value — operators who forget the port will get an opaque llama-server error. A contains(':') check with tracing::warn! would improve DX. Can be a follow-up.
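
One possible shape for that follow-up check, offered as a sketch only. The helper names are hypothetical, and the test is slightly stricter than the suggested contains(':') check because it also requires the port to parse as a number. The caller would log each returned entry with tracing::warn!.

```rust
/// Returns the entries of a comma-separated `rpc_servers` value that do
/// not look like `host:port`. An empty result means nothing to warn about.
pub fn malformed_rpc_endpoints(rpc_servers: &str) -> Vec<&str> {
    rpc_servers
        .split(',')
        .map(str::trim)
        .filter(|entry| !looks_like_host_port(entry))
        .collect()
}

/// Splits on the last ':' so bracketed IPv6 hosts such as `[::1]:50052`
/// are accepted; requires a non-empty host and a numeric 16-bit port.
fn looks_like_host_port(entry: &str) -> bool {
    match entry.rsplit_once(':') {
        Some((host, port)) => !host.is_empty() && port.parse::<u16>().is_ok(),
        None => false,
    }
}
```

Because the check only reports entries and does not reject the config, the value can still be passed to llama-server verbatim, which keeps the existing behaviour unchanged.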
