Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274
Open
Savid wants to merge 8 commits into
Open
Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274Savid wants to merge 8 commits into
Savid wants to merge 8 commits into
Conversation
…dding cutover Runbook registry - Replace the runbook set with 19 single-responsibility runbooks split by co-retrieval: reference material merged where it is always read together (ethereum_protocol_model, clickhouse_querying, kurtosis_devnet), one debugging entry point (debug_ethereum_network) owning the network_target handle and CL/EL fault localization, and a seven-stage devnet issue pipeline sharing one issue/evidence shape (devnet_issue_contract). - Pipeline stage contracts are one filled example each with <=10 top-level fields, free-text summary before verdicts, and low|medium|high confidence bound to the criteria in evidence_discipline. - Frontmatter gains required `triggers` (example caller queries); the loader rejects runbooks without them. Registry tests enforce cross-ref resolution, ref uniqueness, description length, triggers, tags, body size, and repo-wide ref integrity across skills, CLI help, and root docs. - Discovery addressing inside runbook bodies is surface-neutral (bare runbooks:// refs, prose search phrasing); literal panda commands remain only where the CLI is the procedure itself. runbooks/AGENTS.md documents the authoring standard; the root guide moves to AGENTS.md with CLAUDE.md as a compatibility shim. - scripts/runbook-retrieval-check.sh: 31 golden queries (including confusable near-miss pairs and informational out-of-scope probes) asserting top-1 retrieval against a running server. Retrieval - pkg/textmatch: the lexical token/trigram scorer extracted from the server resource resolver and shared with semantic search. - The runbook index becomes hybrid and multi-vector: a document-space metadata vector; per-trigger vectors embedded in QUERY space via the new Embedder.EmbedQueryBatch (a trigger is a hypothetical caller query, so same-space embedding lets a near-verbatim query match score ~1.0); and per-section child vectors discounted x0.95 so the authored routing surface wins near-ties against incidental body matches. Final score is best dense + 0.3 * lexical, unclamped (clamping collapsed strong candidates into ties), ordered deterministically by score then file path. - dotProduct returns 0 on mismatched vector lengths instead of panicking or computing a truncated product; a once-per-index warning surfaces the mismatch window while ranking degrades to lexical-only until re-index. Embedding hard cutover (breaking) - One embedding path: /embedding and /embedding/check replace both /embed and /v2/embedding. The v1 embedding service, its config block, the version-probe fallback, and the embedding_model datasources advertisement are removed. - Every embed/check request requires task=query|document, mapped upstream to input_type for asymmetric retrieval embedding. A missing or unknown task is a 400 at the handler and an error in the service. - Embedding-space identity is model+dimensions+task end to end: advertised on every response and the startup probe, compared by the search runtime to re-index on any change, folded into proxy and local cache keys, and verified by the client on every response so a proxy config change mid-build fails fast instead of silently poisoning caches and indexes. - The proxy validates upstream vector length against configured dimensions and rejects non-positive dimensions at config load. The default model is google/gemini-embedding-2 at 1536 dimensions; the example config documents that the configured model must support asymmetric retrieval embedding. Verified with the full test suite and lint, plus live end-to-end runs: the index corpus built through a local proxy with real asymmetric embeddings, 31/31 golden retrieval queries, and EIP, consensus-spec, and example search plus ref reads spot-checked. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ry-embedding-cutover
…ng standard The runbook merged from master (#273) predates the registry standard and was only minimally patched during the merge. Rewrite it to comply: - imperative name ("Build a Devnet Consensus Bug Board") and the house Inputs/Output/Procedure/Self-Check skeleton with an "Owns ..." owns-line - bug object reshaped to 7 top-level fields: embeds the issue record owned by runbooks://devnet_issue_contract (summary-first reasoning, structured evidence items, low/medium/high confidence) with board presentation fields grouped under `board`; generator updated to the new shape and now renders a confidence badge, evidence list, and issue-summary fallback for the overview - severity rubric points at runbooks://ethereum_protocol_model for the restated finality thresholds; clustering points at runbooks://devnet_issue_fingerprint_dedupe; scan can seed from runbooks://devnet_watch lanes - surface-neutral addressing: examples-index searches spelled as prose, citations replaced by contract evidence refs, repro header no longer hardcodes the CLI dialect - fixed the malformed Kurtosis-config row in the cross-reference table and the deprecated datetime.utcnow() calls in the generator - added devnet_bug_report golden queries to scripts/runbook-retrieval-check.sh Generator smoke-tested end-to-end against full and minimal bug objects. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Delete the "starflinger asset store" instruction — the name appears nowhere else in the repo, so an agent can neither act on it nor verify it. Reword the two report_template.html mentions to carry their facts directly: the page architecture line states the properties (design tokens, zero external fetches, upload-safe) and the injected-JSON escaping tip explains the </script>-termination risk instead of citing a repo file sandbox agents cannot read. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
🐼 Smoke eval —
|
| question | result | tokens | tools |
|---|---|---|---|
forky_node_coverage |
❌ | 14,631 | 4 |
tracoor_node_coverage |
❌ | 14,102 | 4 |
mainnet_block_arrival_p50 |
❌ | 16,177 | 8 |
list_datasources |
❌ | 13,457 | 3 |
block_count_24h |
❌ | 16,571 | 10 |
missed_slots_24h |
❌ | 14,721 | 5 |
chartkit_default_arrival_distribution |
❌ | 39,064 | 14 |
storage_upload_session_scoped |
❌ | 30,628 | 15 |
🔭 Langfuse traces (8 runs; ⚠️ = failed)
The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.
Escaping: timeline text now goes through esc() like its sibling fields, and the escape guarantee names its real boundary — every structured field is escaped; the three *_html fields are the deliberate raw-HTML exceptions, restricted to the class vocabulary by convention, never for un-reviewed upstream text. Generator behavior: - BASELINE input rendered in the header, so the promised healthy-network board (baseline + zero bugs) is actually producible. - Client filter options carry lowercased values matching data-clients; selecting a client no longer matches nothing. - Severity header sorts on a numeric data-sevrank (critical first) instead of lexicographic minor > major > critical. - The viewer-vote localStorage parse is guarded; a corrupted value degrades to zero local votes instead of killing filters, sorting, and voting. - Labels are folded into data-text on rows and detail sections, making the advertised label matching true via text search. Prose accuracy: the cross-reference section now states which patterns the generator automates and that the rest are vocabulary for links composed inside *_html; the issue example marks its truncation inside the YAML, naming the omitted required fields (first_bad, affected, co_present, fingerprint, handles) with the owning contract; the compute-enclave input pointer goes to panda_compute_kurtosis_lifecycle per the entry point's network_target table. The registry standard gains the general rule: a truncated shared-shape example must mark the elision inside the copied block — agents copy the YAML they see, not the prose around it. Verified by executing the generator against fixtures (full bug, minimal bug, zero bugs, hostile HTML in timeline text): no raw script tags in the output, sevranks and lowercase filter values present, baseline rendered. Kept under the 500-line body cap by tightening prose and merging adjacent short CSS lines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…typed v3
Replace the branch's hard-cutover unversioned /embedding route with versioned
endpoints so deployed proxies and servers upgrade independently:
- /embed + /embed/check (v1): symmetric, no dimensions, `embedding:` config —
wire shapes byte-compatible with what deployed servers call today.
- /v2/embedding + /v2/embedding/check (v2): symmetric at fixed dimensions,
`embedding_v2:` config; tolerates task-less requests and echoes
model+dimensions (additive fields only).
- /v3/embedding + /v3/embedding/check (v3): task=query|document REQUIRED,
mapped to the upstream input_type for asymmetric retrieval embedding;
missing/unknown task is a 400. Shares the embedding_v2 service — v3 is a
protocol upgrade, not a new embedding space.
Cache keys carry the space identity per protocol — {model}:{hash} (v1),
{model}:{dims}:{hash} (v2), {model}:{dims}:{task}:{hash} (v3) — so symmetric
and task-typed vectors never collide while v2 and v3 share one Redis
namespace and warm entries survive proxy upgrades.
The server negotiates newest-first: probe /v3/embedding/check, then
/v2/embedding/check (a dimensions-less echo from a historical proxy is
tolerated and assumed 1536), then confirm /embed/check before falling back to
the embedding_model datasources advertisement (restored for old clients). The
search runtime records (model, dimensions, protocol) as the index identity
and re-indexes when any of the three changes, so a proxy upgrading v2->v3
under a running server flips into task-typed retrieval on the next discovery
tick without a restart. Clients verify the advertised space on every v2/v3
response, failing a build fast instead of caching wrong-space vectors.
Hardening kept from the cutover work: upstream vector-length validation
against configured dimensions, dimensions >= 1 config validation (v2 block),
bounded hashes/items on every route, and a startup warning when the ignored
v1 dimensions field is set.
Verified end to end: negotiation against this proxy (v3), a live downgrade to
the actual pre-cutover proxy binary (v2, dimensions-less echo tolerated,
symmetric index rebuilt), and a live upgrade back (protocol flip detected on
the refresh tick, task-typed re-index from warm cache in seconds). Full race
suite and lint clean; reviewed by independent wire-compat, concurrency,
validation, and adversarial passes with all findings fixed and re-verified.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…trix
Registry growth: five new runbooks (tracoor_invalid_artifact_forensics,
debug_evm_execution_divergence, prometheus_devnet_health,
reconcile_chain_sources, devnet_bug_board_html) plus surface and content
iteration across the existing set.
Retrieval tuning against the golden-query matrix (now 51 queries): the
debugging entry point's triggers regain the query-shaped phrasings the
neighbors were winning on ("why did it break", the client-roster fork
trigger with "can you investigate", the network_target contract), closing
the coin-flip margins the new sibling runbooks introduced.
scripts/runbook-retrieval-check.sh extends the matrix for the new runbooks.
Full matrix: 51 pass, 0 fail against a live server.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Matrix entries gain an optional |top2 tier for confusables between sibling runbooks of one pipeline family: the expected runbook must appear in the top 2 instead of winning top-1. Siblings cross-link and either answer resolves within one hop, so a 0.00x coin-flip between them is noise — demanding top-1 there invited fixing flips by mirroring the query text into triggers, which turns the check into a string-echo measurement. The header documents the tier semantics and that rule. Cross-family queries stay strict top-1. Five churn-prone sibling entries move to the top2 tier; all 51 queries pass against a live server. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three layers, bottom to top: the proxy embedding API becomes versioned — the existing v1/v2 routes are preserved byte-compatible and a new task-typed v3 is added; semantic search over runbooks becomes hybrid and multi-vector on top of it; and the runbook registry is rebuilt against an explicit authoring standard with retrieval-quality regression checks.
Embedding: versioned endpoints
/embed(+/check) — v1embedding:{model}:{hash}/v2/embedding(+/check) — v2embedding_v2:{model}:{dims}:{hash}/v3/embedding(+/check) — v3task=query|documentrequired → upstreaminput_type(asymmetric retrieval embedding); 400 on missing/unknown taskembedding_v2:{model}:{dims}:{task}:{hash}embedding_modeldatasources field.dimensions ≥ 1validation, bounded items and hashes on every route.Deploy notes (no ordering requirement): old servers keep working against the new proxy on v1/v2 unchanged; new servers against an old proxy negotiate v2 (symmetric search, one log line noting task-typed retrieval awaits v3) and self-upgrade when the proxy does. Existing
embedding:/embedding_v2:config blocks load with identical semantics; adimensions:under the v1 block logs a warning that it's ignored.Retrieval
pkg/textmatch: lexical token/trigram scorer shared between the resource resolver and search.dotProductguards mismatched vector lengths; a once-per-index warning covers the dims-mismatch window.Runbook registry
triggersfrontmatter (the primary retrieval surface), one-canonical-owner fact discipline, filled-example output contracts, and repo-wide ref-integrity tests.runbooks/AGENTS.mddocuments the standard; rootAGENTS.mdis the real guide withCLAUDE.mdas a shim.scripts/runbook-retrieval-check.sh: a golden-query matrix (51 queries incl. confusable near-miss pairs and out-of-scope probes) asserting top-1 retrieval against a running server — currently 51/51.Verification
🤖 Generated with Claude Code