Skip to content

Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274

Open
Savid wants to merge 8 commits into
masterfrom
feat/runbook-registry-embedding-cutover
Open

Rebuild runbook registry with hybrid retrieval; hard cutover to task-typed embedding#274
Savid wants to merge 8 commits into
masterfrom
feat/runbook-registry-embedding-cutover

Conversation

@Savid

@Savid Savid commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

Three layers, bottom to top: the proxy embedding API becomes versioned — the existing v1/v2 routes are preserved byte-compatible and a new task-typed v3 is added; semantic search over runbooks becomes hybrid and multi-vector on top of it; and the runbook registry is rebuilt against an explicit authoring standard with retrieval-quality regression checks.

Note: an earlier revision of this branch did a hard cutover (single unversioned /embedding, v1/v2 deleted). That was abandoned — deployed fleets can't flag-day. The current design keeps old binaries working unchanged.

Embedding: versioned endpoints

Route Contract Config Cache key
/embed (+/check) — v1 symmetric, no dims embedding: {model}:{hash}
/v2/embedding (+/check) — v2 symmetric, fixed dims, model+dims echoed embedding_v2: {model}:{dims}:{hash}
/v3/embedding (+/check) — v3 task=query|document required → upstream input_type (asymmetric retrieval embedding); 400 on missing/unknown task shares embedding_v2: {model}:{dims}:{task}:{hash}
  • v3 is a protocol upgrade, not a new embedding space — it shares v2's service, config, and Redis namespace; key shapes are provably disjoint, and warm v1/v2 cache entries survive the proxy upgrade.
  • Servers negotiate newest-first: v3 probe → v2 probe (a dims-less echo from a historical proxy is tolerated) → confirmed v1 fallback via the restored embedding_model datasources field.
  • The search runtime treats (model, dimensions, protocol) as the index identity and auto-re-indexes when any changes — a proxy upgrading v2→v3 under a running server flips into task-typed retrieval on the next discovery tick, no restart. Clients verify the advertised space on every v2/v3 response and fail builds fast rather than caching wrong-space vectors.
  • Hardening: upstream vector-length validation against configured dims, dimensions ≥ 1 validation, bounded items and hashes on every route.

Deploy notes (no ordering requirement): old servers keep working against the new proxy on v1/v2 unchanged; new servers against an old proxy negotiate v2 (symmetric search, one log line noting task-typed retrieval awaits v3) and self-upgrade when the proxy does. Existing embedding:/embedding_v2: config blocks load with identical semantics; a dimensions: under the v1 block logs a warning that it's ignored.

Retrieval

  • pkg/textmatch: lexical token/trigram scorer shared between the resource resolver and search.
  • The runbook index is hybrid and multi-vector: a document-space metadata vector; one vector per trigger embedded in query space (a trigger is a hypothetical caller query — same-space embedding makes near-verbatim matches score ~1.0 on v3); per-section child vectors discounted ×0.95 so the authored routing surface wins near-ties. Score = best dense + 0.3 × lexical, unclamped, deterministically ordered. On v1/v2 proxies the same index works symmetric.
  • dotProduct guards mismatched vector lengths; a once-per-index warning covers the dims-mismatch window.

Runbook registry

  • Rebuilt as single-responsibility runbooks split by co-retrieval, with required triggers frontmatter (the primary retrieval surface), one-canonical-owner fact discipline, filled-example output contracts, and repo-wide ref-integrity tests. runbooks/AGENTS.md documents the standard; root AGENTS.md is the real guide with CLAUDE.md as a shim.
  • scripts/runbook-retrieval-check.sh: a golden-query matrix (51 queries incl. confusable near-miss pairs and out-of-scope probes) asserting top-1 retrieval against a running server — currently 51/51.

Verification

  • Full test suite with race detector + lint: clean. Independent review passes (wire-compat vs deployed shapes, state/concurrency, validation/tests, adversarial) with all findings fixed and re-verified.
  • Live end-to-end, three phases: fresh v3 negotiation and task-typed index build; a live downgrade to the actual pre-cutover proxy binary (v2 negotiated on the refresh tick, dims-less echo tolerated, symmetric index rebuilt through the old binary); a live upgrade back (protocol flip detected, task-typed re-index from warm cache in ~2s). Zero errors across all phases.

🤖 Generated with Claude Code

Savid and others added 4 commits July 2, 2026 14:08
…dding cutover

Runbook registry
- Replace the runbook set with 19 single-responsibility runbooks split by
  co-retrieval: reference material merged where it is always read together
  (ethereum_protocol_model, clickhouse_querying, kurtosis_devnet), one
  debugging entry point (debug_ethereum_network) owning the network_target
  handle and CL/EL fault localization, and a seven-stage devnet issue
  pipeline sharing one issue/evidence shape (devnet_issue_contract).
- Pipeline stage contracts are one filled example each with <=10 top-level
  fields, free-text summary before verdicts, and low|medium|high confidence
  bound to the criteria in evidence_discipline.
- Frontmatter gains required `triggers` (example caller queries); the loader
  rejects runbooks without them. Registry tests enforce cross-ref
  resolution, ref uniqueness, description length, triggers, tags, body size,
  and repo-wide ref integrity across skills, CLI help, and root docs.
- Discovery addressing inside runbook bodies is surface-neutral (bare
  runbooks:// refs, prose search phrasing); literal panda commands remain
  only where the CLI is the procedure itself. runbooks/AGENTS.md documents
  the authoring standard; the root guide moves to AGENTS.md with CLAUDE.md
  as a compatibility shim.
- scripts/runbook-retrieval-check.sh: 31 golden queries (including
  confusable near-miss pairs and informational out-of-scope probes)
  asserting top-1 retrieval against a running server.

Retrieval
- pkg/textmatch: the lexical token/trigram scorer extracted from the server
  resource resolver and shared with semantic search.
- The runbook index becomes hybrid and multi-vector: a document-space
  metadata vector; per-trigger vectors embedded in QUERY space via the new
  Embedder.EmbedQueryBatch (a trigger is a hypothetical caller query, so
  same-space embedding lets a near-verbatim query match score ~1.0); and
  per-section child vectors discounted x0.95 so the authored routing
  surface wins near-ties against incidental body matches. Final score is
  best dense + 0.3 * lexical, unclamped (clamping collapsed strong
  candidates into ties), ordered deterministically by score then file path.
- dotProduct returns 0 on mismatched vector lengths instead of panicking or
  computing a truncated product; a once-per-index warning surfaces the
  mismatch window while ranking degrades to lexical-only until re-index.

Embedding hard cutover (breaking)
- One embedding path: /embedding and /embedding/check replace both /embed
  and /v2/embedding. The v1 embedding service, its config block, the
  version-probe fallback, and the embedding_model datasources advertisement
  are removed.
- Every embed/check request requires task=query|document, mapped upstream
  to input_type for asymmetric retrieval embedding. A missing or unknown
  task is a 400 at the handler and an error in the service.
- Embedding-space identity is model+dimensions+task end to end: advertised
  on every response and the startup probe, compared by the search runtime
  to re-index on any change, folded into proxy and local cache keys, and
  verified by the client on every response so a proxy config change
  mid-build fails fast instead of silently poisoning caches and indexes.
- The proxy validates upstream vector length against configured dimensions
  and rejects non-positive dimensions at config load. The default model is
  google/gemini-embedding-2 at 1536 dimensions; the example config
  documents that the configured model must support asymmetric retrieval
  embedding.

Verified with the full test suite and lint, plus live end-to-end runs: the
index corpus built through a local proxy with real asymmetric embeddings,
31/31 golden retrieval queries, and EIP, consensus-spec, and example search
plus ref reads spot-checked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ng standard

The runbook merged from master (#273) predates the registry standard and was
only minimally patched during the merge. Rewrite it to comply:

- imperative name ("Build a Devnet Consensus Bug Board") and the house
  Inputs/Output/Procedure/Self-Check skeleton with an "Owns ..." owns-line
- bug object reshaped to 7 top-level fields: embeds the issue record owned by
  runbooks://devnet_issue_contract (summary-first reasoning, structured
  evidence items, low/medium/high confidence) with board presentation fields
  grouped under `board`; generator updated to the new shape and now renders a
  confidence badge, evidence list, and issue-summary fallback for the overview
- severity rubric points at runbooks://ethereum_protocol_model for the
  restated finality thresholds; clustering points at
  runbooks://devnet_issue_fingerprint_dedupe; scan can seed from
  runbooks://devnet_watch lanes
- surface-neutral addressing: examples-index searches spelled as prose,
  citations replaced by contract evidence refs, repro header no longer
  hardcodes the CLI dialect
- fixed the malformed Kurtosis-config row in the cross-reference table and
  the deprecated datetime.utcnow() calls in the generator
- added devnet_bug_report golden queries to scripts/runbook-retrieval-check.sh

Generator smoke-tested end-to-end against full and minimal bug objects.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Delete the "starflinger asset store" instruction — the name appears nowhere
else in the repo, so an agent can neither act on it nor verify it. Reword the
two report_template.html mentions to carry their facts directly: the page
architecture line states the properties (design tokens, zero external
fetches, upload-safe) and the injected-JSON escaping tip explains the
</script>-termination risk instead of citing a repo file sandbox agents
cannot read.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🐼 Smoke eval — 2b5276d: ❌ 0/8 pass

📊 Interactive report — tokens p50 0 · tokens/solve 0.

Reference points: feat/runbook-registry-embedding-cutover@ee6c40f 0% · feat/runbook-registry-embedding-cutover@78489e6 0% · feat/runbook-registry-embedding-cutover@6ebd2e1 0%.

question result tokens tools
forky_node_coverage 14,631 4
tracoor_node_coverage 14,102 4
mainnet_block_arrival_p50 16,177 8
list_datasources 13,457 3
block_count_24h 16,571 10
missed_slots_24h 14,721 5
chartkit_default_arrival_distribution 39,064 14
storage_upload_session_scoped 30,628 15
🔭 Langfuse traces (8 runs; ⚠️ = failed)

The report walks this branch's commits against the master baseline and the most recent release. A self-contained copy is in the run's eval-smoke-* artifact.

Savid and others added 4 commits July 2, 2026 14:55
Escaping: timeline text now goes through esc() like its sibling fields, and
the escape guarantee names its real boundary — every structured field is
escaped; the three *_html fields are the deliberate raw-HTML exceptions,
restricted to the class vocabulary by convention, never for un-reviewed
upstream text.

Generator behavior:
- BASELINE input rendered in the header, so the promised healthy-network
  board (baseline + zero bugs) is actually producible.
- Client filter options carry lowercased values matching data-clients;
  selecting a client no longer matches nothing.
- Severity header sorts on a numeric data-sevrank (critical first) instead
  of lexicographic minor > major > critical.
- The viewer-vote localStorage parse is guarded; a corrupted value degrades
  to zero local votes instead of killing filters, sorting, and voting.
- Labels are folded into data-text on rows and detail sections, making the
  advertised label matching true via text search.

Prose accuracy: the cross-reference section now states which patterns the
generator automates and that the rest are vocabulary for links composed
inside *_html; the issue example marks its truncation inside the YAML,
naming the omitted required fields (first_bad, affected, co_present,
fingerprint, handles) with the owning contract; the compute-enclave input
pointer goes to panda_compute_kurtosis_lifecycle per the entry point's
network_target table.

The registry standard gains the general rule: a truncated shared-shape
example must mark the elision inside the copied block — agents copy the
YAML they see, not the prose around it.

Verified by executing the generator against fixtures (full bug, minimal
bug, zero bugs, hostile HTML in timeline text): no raw script tags in the
output, sevranks and lowercase filter values present, baseline rendered.
Kept under the 500-line body cap by tightening prose and merging adjacent
short CSS lines.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…typed v3

Replace the branch's hard-cutover unversioned /embedding route with versioned
endpoints so deployed proxies and servers upgrade independently:

- /embed + /embed/check (v1): symmetric, no dimensions, `embedding:` config —
  wire shapes byte-compatible with what deployed servers call today.
- /v2/embedding + /v2/embedding/check (v2): symmetric at fixed dimensions,
  `embedding_v2:` config; tolerates task-less requests and echoes
  model+dimensions (additive fields only).
- /v3/embedding + /v3/embedding/check (v3): task=query|document REQUIRED,
  mapped to the upstream input_type for asymmetric retrieval embedding;
  missing/unknown task is a 400. Shares the embedding_v2 service — v3 is a
  protocol upgrade, not a new embedding space.

Cache keys carry the space identity per protocol — {model}:{hash} (v1),
{model}:{dims}:{hash} (v2), {model}:{dims}:{task}:{hash} (v3) — so symmetric
and task-typed vectors never collide while v2 and v3 share one Redis
namespace and warm entries survive proxy upgrades.

The server negotiates newest-first: probe /v3/embedding/check, then
/v2/embedding/check (a dimensions-less echo from a historical proxy is
tolerated and assumed 1536), then confirm /embed/check before falling back to
the embedding_model datasources advertisement (restored for old clients). The
search runtime records (model, dimensions, protocol) as the index identity
and re-indexes when any of the three changes, so a proxy upgrading v2->v3
under a running server flips into task-typed retrieval on the next discovery
tick without a restart. Clients verify the advertised space on every v2/v3
response, failing a build fast instead of caching wrong-space vectors.

Hardening kept from the cutover work: upstream vector-length validation
against configured dimensions, dimensions >= 1 config validation (v2 block),
bounded hashes/items on every route, and a startup warning when the ignored
v1 dimensions field is set.

Verified end to end: negotiation against this proxy (v3), a live downgrade to
the actual pre-cutover proxy binary (v2, dimensions-less echo tolerated,
symmetric index rebuilt), and a live upgrade back (protocol flip detected on
the refresh tick, task-typed re-index from warm cache in seconds). Full race
suite and lint clean; reviewed by independent wire-compat, concurrency,
validation, and adversarial passes with all findings fixed and re-verified.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…trix

Registry growth: five new runbooks (tracoor_invalid_artifact_forensics,
debug_evm_execution_divergence, prometheus_devnet_health,
reconcile_chain_sources, devnet_bug_board_html) plus surface and content
iteration across the existing set.

Retrieval tuning against the golden-query matrix (now 51 queries): the
debugging entry point's triggers regain the query-shaped phrasings the
neighbors were winning on ("why did it break", the client-roster fork
trigger with "can you investigate", the network_target contract), closing
the coin-flip margins the new sibling runbooks introduced.

scripts/runbook-retrieval-check.sh extends the matrix for the new runbooks.
Full matrix: 51 pass, 0 fail against a live server.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Matrix entries gain an optional |top2 tier for confusables between sibling
runbooks of one pipeline family: the expected runbook must appear in the top
2 instead of winning top-1. Siblings cross-link and either answer resolves
within one hop, so a 0.00x coin-flip between them is noise — demanding
top-1 there invited fixing flips by mirroring the query text into triggers,
which turns the check into a string-echo measurement. The header documents
the tier semantics and that rule. Cross-family queries stay strict top-1.

Five churn-prone sibling entries move to the top2 tier; all 51 queries pass
against a live server.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant