Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding by Savid · Pull Request #274 · ethpandaops/panda

Savid · 2026-07-02T04:09:04Z

Summary

Three layers, bottom to top: the proxy embedding API becomes versioned — the existing v1/v2 routes are preserved byte-compatible and a new task-typed v3 is added; semantic search over runbooks becomes hybrid and multi-vector on top of it; and the runbook registry is rebuilt against an explicit authoring standard with retrieval-quality regression checks.

Note: an earlier revision of this branch did a hard cutover (single unversioned /embedding, v1/v2 deleted). That was abandoned — deployed fleets can't flag-day. The current design keeps old binaries working unchanged.

Embedding: versioned endpoints

Route	Contract	Config	Cache key
`/embed` (+`/check`) — v1	symmetric, no dims	`embedding:`	`{model}:{hash}`
`/v2/embedding` (+`/check`) — v2	symmetric, fixed dims, model+dims echoed	`embedding_v2:`	`{model}:{dims}:{hash}`
`/v3/embedding` (+`/check`) — v3	`task=query\|document` required → upstream `input_type` (asymmetric retrieval embedding); 400 on missing/unknown task	shares `embedding_v2:`	`{model}:{dims}:{task}:{hash}`

v3 is a protocol upgrade, not a new embedding space — it shares v2's service, config, and Redis namespace; key shapes are provably disjoint, and warm v1/v2 cache entries survive the proxy upgrade.
Servers negotiate newest-first: v3 probe → v2 probe (a dims-less echo from a historical proxy is tolerated) → confirmed v1 fallback via the restored embedding_model datasources field.
The search runtime treats (model, dimensions, protocol) as the index identity and auto-re-indexes when any changes — a proxy upgrading v2→v3 under a running server flips into task-typed retrieval on the next discovery tick, no restart. Clients verify the advertised space on every v2/v3 response and fail builds fast rather than caching wrong-space vectors.
Hardening: upstream vector-length validation against configured dims, dimensions ≥ 1 validation, bounded items and hashes on every route.

Deploy notes (no ordering requirement): old servers keep working against the new proxy on v1/v2 unchanged; new servers against an old proxy negotiate v2 (symmetric search, one log line noting task-typed retrieval awaits v3) and self-upgrade when the proxy does. Existing embedding:/embedding_v2: config blocks load with identical semantics; a dimensions: under the v1 block logs a warning that it's ignored.

Retrieval

pkg/textmatch: lexical token/trigram scorer shared between the resource resolver and search.
The runbook index is hybrid and multi-vector: a document-space metadata vector; one vector per trigger embedded in query space (a trigger is a hypothetical caller query — same-space embedding makes near-verbatim matches score ~1.0 on v3); per-section child vectors discounted ×0.95 so the authored routing surface wins near-ties. Score = best dense + 0.3 × lexical, unclamped, deterministically ordered. On v1/v2 proxies the same index works symmetric.
dotProduct guards mismatched vector lengths; a once-per-index warning covers the dims-mismatch window.

Runbook registry

Rebuilt as single-responsibility runbooks split by co-retrieval, with required triggers frontmatter (the primary retrieval surface), one-canonical-owner fact discipline, filled-example output contracts, and repo-wide ref-integrity tests. runbooks/AGENTS.md documents the standard; root AGENTS.md is the real guide with CLAUDE.md as a shim.
scripts/runbook-retrieval-check.sh: a golden-query matrix (51 queries incl. confusable near-miss pairs and out-of-scope probes) asserting top-1 retrieval against a running server — currently 51/51.

Verification

Full test suite with race detector + lint: clean. Independent review passes (wire-compat vs deployed shapes, state/concurrency, validation/tests, adversarial) with all findings fixed and re-verified.
Live end-to-end, three phases: fresh v3 negotiation and task-typed index build; a live downgrade to the actual pre-cutover proxy binary (v2 negotiated on the refresh tick, dims-less echo tolerated, symmetric index rebuilt through the old binary); a live upgrade back (protocol flip detected, task-typed re-index from warm cache in ~2s). Zero errors across all phases.

🤖 Generated with Claude Code

…dding cutover Runbook registry - Replace the runbook set with 19 single-responsibility runbooks split by co-retrieval: reference material merged where it is always read together (ethereum_protocol_model, clickhouse_querying, kurtosis_devnet), one debugging entry point (debug_ethereum_network) owning the network_target handle and CL/EL fault localization, and a seven-stage devnet issue pipeline sharing one issue/evidence shape (devnet_issue_contract). - Pipeline stage contracts are one filled example each with <=10 top-level fields, free-text summary before verdicts, and low|medium|high confidence bound to the criteria in evidence_discipline. - Frontmatter gains required `triggers` (example caller queries); the loader rejects runbooks without them. Registry tests enforce cross-ref resolution, ref uniqueness, description length, triggers, tags, body size, and repo-wide ref integrity across skills, CLI help, and root docs. - Discovery addressing inside runbook bodies is surface-neutral (bare runbooks:// refs, prose search phrasing); literal panda commands remain only where the CLI is the procedure itself. runbooks/AGENTS.md documents the authoring standard; the root guide moves to AGENTS.md with CLAUDE.md as a compatibility shim. - scripts/runbook-retrieval-check.sh: 31 golden queries (including confusable near-miss pairs and informational out-of-scope probes) asserting top-1 retrieval against a running server. Retrieval - pkg/textmatch: the lexical token/trigram scorer extracted from the server resource resolver and shared with semantic search. - The runbook index becomes hybrid and multi-vector: a document-space metadata vector; per-trigger vectors embedded in QUERY space via the new Embedder.EmbedQueryBatch (a trigger is a hypothetical caller query, so same-space embedding lets a near-verbatim query match score ~1.0); and per-section child vectors discounted x0.95 so the authored routing surface wins near-ties against incidental body matches. Final score is best dense + 0.3 * lexical, unclamped (clamping collapsed strong candidates into ties), ordered deterministically by score then file path. - dotProduct returns 0 on mismatched vector lengths instead of panicking or computing a truncated product; a once-per-index warning surfaces the mismatch window while ranking degrades to lexical-only until re-index. Embedding hard cutover (breaking) - One embedding path: /embedding and /embedding/check replace both /embed and /v2/embedding. The v1 embedding service, its config block, the version-probe fallback, and the embedding_model datasources advertisement are removed. - Every embed/check request requires task=query|document, mapped upstream to input_type for asymmetric retrieval embedding. A missing or unknown task is a 400 at the handler and an error in the service. - Embedding-space identity is model+dimensions+task end to end: advertised on every response and the startup probe, compared by the search runtime to re-index on any change, folded into proxy and local cache keys, and verified by the client on every response so a proxy config change mid-build fails fast instead of silently poisoning caches and indexes. - The proxy validates upstream vector length against configured dimensions and rejects non-positive dimensions at config load. The default model is google/gemini-embedding-2 at 1536 dimensions; the example config documents that the configured model must support asymmetric retrieval embedding. Verified with the full test suite and lint, plus live end-to-end runs: the index corpus built through a local proxy with real asymmetric embeddings, 31/31 golden retrieval queries, and EIP, consensus-spec, and example search plus ref reads spot-checked. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ry-embedding-cutover

…ng standard The runbook merged from master (#273) predates the registry standard and was only minimally patched during the merge. Rewrite it to comply: - imperative name ("Build a Devnet Consensus Bug Board") and the house Inputs/Output/Procedure/Self-Check skeleton with an "Owns ..." owns-line - bug object reshaped to 7 top-level fields: embeds the issue record owned by runbooks://devnet_issue_contract (summary-first reasoning, structured evidence items, low/medium/high confidence) with board presentation fields grouped under `board`; generator updated to the new shape and now renders a confidence badge, evidence list, and issue-summary fallback for the overview - severity rubric points at runbooks://ethereum_protocol_model for the restated finality thresholds; clustering points at runbooks://devnet_issue_fingerprint_dedupe; scan can seed from runbooks://devnet_watch lanes - surface-neutral addressing: examples-index searches spelled as prose, citations replaced by contract evidence refs, repro header no longer hardcodes the CLI dialect - fixed the malformed Kurtosis-config row in the cross-reference table and the deprecated datetime.utcnow() calls in the generator - added devnet_bug_report golden queries to scripts/runbook-retrieval-check.sh Generator smoke-tested end-to-end against full and minimal bug objects. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Delete the "starflinger asset store" instruction — the name appears nowhere else in the repo, so an agent can neither act on it nor verify it. Reword the two report_template.html mentions to carry their facts directly: the page architecture line states the properties (design tokens, zero external fetches, upload-safe) and the injected-JSON escaping tip explains the </script>-termination risk instead of citing a repo file sandbox agents cannot read. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-07-02T04:46:13Z

🐼 Smoke eval — `2b5276d`: ❌ 0/8 pass

📊 Interactive report — tokens p50 0 · tokens/solve 0.

Reference points: feat/runbook-registry-embedding-cutover@ee6c40f 0% · feat/runbook-registry-embedding-cutover@78489e6 0% · feat/runbook-registry-embedding-cutover@6ebd2e1 0%.

question	result	tokens	tools
`forky_node_coverage`	❌	14,631	4
`tracoor_node_coverage`	❌	14,102	4
`mainnet_block_arrival_p50`	❌	16,177	8
`list_datasources`	❌	13,457	3
`block_count_24h`	❌	16,571	10
`missed_slots_24h`	❌	14,721	5
`chartkit_default_arrival_distribution`	❌	39,064	14
`storage_upload_session_scoped`	❌	30,628	15

🔭 Langfuse traces (8 runs; ⚠️ = failed)

_{The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.}

Escaping: timeline text now goes through esc() like its sibling fields, and the escape guarantee names its real boundary — every structured field is escaped; the three *_html fields are the deliberate raw-HTML exceptions, restricted to the class vocabulary by convention, never for un-reviewed upstream text. Generator behavior: - BASELINE input rendered in the header, so the promised healthy-network board (baseline + zero bugs) is actually producible. - Client filter options carry lowercased values matching data-clients; selecting a client no longer matches nothing. - Severity header sorts on a numeric data-sevrank (critical first) instead of lexicographic minor > major > critical. - The viewer-vote localStorage parse is guarded; a corrupted value degrades to zero local votes instead of killing filters, sorting, and voting. - Labels are folded into data-text on rows and detail sections, making the advertised label matching true via text search. Prose accuracy: the cross-reference section now states which patterns the generator automates and that the rest are vocabulary for links composed inside *_html; the issue example marks its truncation inside the YAML, naming the omitted required fields (first_bad, affected, co_present, fingerprint, handles) with the owning contract; the compute-enclave input pointer goes to panda_compute_kurtosis_lifecycle per the entry point's network_target table. The registry standard gains the general rule: a truncated shared-shape example must mark the elision inside the copied block — agents copy the YAML they see, not the prose around it. Verified by executing the generator against fixtures (full bug, minimal bug, zero bugs, hostile HTML in timeline text): no raw script tags in the output, sevranks and lowercase filter values present, baseline rendered. Kept under the 500-line body cap by tightening prose and merging adjacent short CSS lines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…typed v3 Replace the branch's hard-cutover unversioned /embedding route with versioned endpoints so deployed proxies and servers upgrade independently: - /embed + /embed/check (v1): symmetric, no dimensions, `embedding:` config — wire shapes byte-compatible with what deployed servers call today. - /v2/embedding + /v2/embedding/check (v2): symmetric at fixed dimensions, `embedding_v2:` config; tolerates task-less requests and echoes model+dimensions (additive fields only). - /v3/embedding + /v3/embedding/check (v3): task=query|document REQUIRED, mapped to the upstream input_type for asymmetric retrieval embedding; missing/unknown task is a 400. Shares the embedding_v2 service — v3 is a protocol upgrade, not a new embedding space. Cache keys carry the space identity per protocol — {model}:{hash} (v1), {model}:{dims}:{hash} (v2), {model}:{dims}:{task}:{hash} (v3) — so symmetric and task-typed vectors never collide while v2 and v3 share one Redis namespace and warm entries survive proxy upgrades. The server negotiates newest-first: probe /v3/embedding/check, then /v2/embedding/check (a dimensions-less echo from a historical proxy is tolerated and assumed 1536), then confirm /embed/check before falling back to the embedding_model datasources advertisement (restored for old clients). The search runtime records (model, dimensions, protocol) as the index identity and re-indexes when any of the three changes, so a proxy upgrading v2->v3 under a running server flips into task-typed retrieval on the next discovery tick without a restart. Clients verify the advertised space on every v2/v3 response, failing a build fast instead of caching wrong-space vectors. Hardening kept from the cutover work: upstream vector-length validation against configured dimensions, dimensions >= 1 config validation (v2 block), bounded hashes/items on every route, and a startup warning when the ignored v1 dimensions field is set. Verified end to end: negotiation against this proxy (v3), a live downgrade to the actual pre-cutover proxy binary (v2, dimensions-less echo tolerated, symmetric index rebuilt), and a live upgrade back (protocol flip detected on the refresh tick, task-typed re-index from warm cache in seconds). Full race suite and lint clean; reviewed by independent wire-compat, concurrency, validation, and adversarial passes with all findings fixed and re-verified. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…trix Registry growth: five new runbooks (tracoor_invalid_artifact_forensics, debug_evm_execution_divergence, prometheus_devnet_health, reconcile_chain_sources, devnet_bug_board_html) plus surface and content iteration across the existing set. Retrieval tuning against the golden-query matrix (now 51 queries): the debugging entry point's triggers regain the query-shaped phrasings the neighbors were winning on ("why did it break", the client-roster fork trigger with "can you investigate", the network_target contract), closing the coin-flip margins the new sibling runbooks introduced. scripts/runbook-retrieval-check.sh extends the matrix for the new runbooks. Full matrix: 51 pass, 0 fail against a live server. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Matrix entries gain an optional |top2 tier for confusables between sibling runbooks of one pipeline family: the expected runbook must appear in the top 2 instead of winning top-1. Siblings cross-link and either answer resolves within one hop, so a 0.00x coin-flip between them is noise — demanding top-1 there invited fixing flips by mirroring the query text into triggers, which turns the check into a string-echo measurement. The header documents the tier semantics and that rule. Cross-family queries stay strict top-1. Five churn-prone sibling entries move to the top2 tier; all 51 queries pass against a live server. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Savid and others added 4 commits July 2, 2026 14:08

Merge remote-tracking branch 'origin/master' into feat/runbook-regist…

3187645

…ry-embedding-cutover

Savid mentioned this pull request Jul 2, 2026

eval: grader API errors are scored 0.00 instead of surfacing as errors #275

Open

Savid and others added 4 commits July 2, 2026 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274

Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274
Savid wants to merge 8 commits into
masterfrom
feat/runbook-registry-embedding-cutover

Savid commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Savid commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Embedding: versioned endpoints

Retrieval

Runbook registry

Verification

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐼 Smoke eval — 2b5276d: ❌ 0/8 pass

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Savid commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

🐼 Smoke eval — `2b5276d`: ❌ 0/8 pass