fix: annealing/ingestion race — layered defense (closes #402)#404
Conversation
… A) Annealing's candidate selection was graph-state-aware but queue-state- unaware. When the worker dissolved an ontology while operator-submitted ingest jobs were queued against it, those jobs dequeued with a missing target and silently never landed — accepted-then-dropped data loss. Before proposing any demote / merge / decompose / dissolve mutation against ontology X, the cycle now consults kg_api.jobs for non-terminal ingestion jobs targeting X and vetoes that candidate this cycle (skip, not defer) when any exist. Vetoes are logged structurally and counted in the cycle result so they're observable, not silent. Promotions are not vetoed: they create a new ontology and do not modify the source in a way that invalidates queued ingest jobs against it.
… C)
Shipping cadence was too aggressive: cycles evaluated brand-new and
near-empty ontologies before they accumulated enough signal to be judged
fairly, wasting LLM calls and widening the ingestion race window that
Defect A blocks.
Two new floors gate per-ontology cycle eligibility (migration 065):
min_ontology_age_epochs = 3 — ontology must exist ≥3 epoch ticks
before annealing can judge it
min_ontology_concept_count = 5 — ontology must hold ≥5 concepts before
annealing can judge it
Both demotion and promotion candidate selection consult the floors —
high-degree concepts in a sparse or brand-new source ontology haven't
earned an evaluation as natural new nuclei either.
Migration uses INSERT ... ON CONFLICT DO NOTHING so an operator-tuned
row is never overwritten. Existing keys (epoch_interval,
demotion_threshold, promotion_min_degree, max_proposals) are
intentionally not changed here.
…B1) If annealing dissolved a target ontology between job submit and execute, the worker silently called ensure_ontology_exists, which recreated the ontology under a fresh ontology_id. The operator saw "submitted" but their content never landed under the namespace they targeted — silent operator-visible data loss. Both ingest routes now stamp ontology_existed_at_submit on the job. The worker reads it through a new _validate_target_ontology helper: - existed_at_submit=True + node missing → loud raise with a distinct "vanished mid-flight" error string. The job-queue exception handler marks the job status='failed' with the message on the job's error column, which the job-list API already surfaces. - existed_at_submit=False + node missing → first-ever ingest, create it (that IS the operator's intent). - node present + frozen → existing ADR-200 Phase 2 frozen-rejection. Distinct error strings (ONTOLOGY_VANISHED_MID_FLIGHT_ERROR vs. ONTOLOGY_FROZEN_ERROR) so operators and Defect B2's tombstone path can distinguish the failure modes. Pre-B1 jobs without the flag default to the safer "vanished" behavior.
…B2)
B1 made every missing-target ingest fail loudly. That over-corrected:
when annealing dissolves an ontology between queue and execute, the
operator-submitted ingest IS active operator intent that should override
the background reorganization. Failing it strands data that the operator
explicitly told us to write.
B2 narrows the loud-fail trigger to a positive operator-intent signal:
ontology tombstones (migration 066). The operator-delete route writes a
tombstone row; annealing's dissolve_ontology path does not. The
ingestion worker consults the tombstone when its target is missing:
missing + tombstoned → loud fail with ONTOLOGY_TOMBSTONED_ERROR
("deliberately removed by an operator")
missing + no tombstone → recreate via ensure_ontology_exists with an
audit log line naming job_id, actor, and
existed_at_submit so the recreate event is
traceable
Three distinct error strings (tombstoned / frozen / vanished) so the
job-list API surfaces structurally different failure modes that an
operator needs to distinguish. The tombstone-read failure path is
deliberately tolerant — falling back to recreate is recoverable;
failing every ingest when one query fails is not. Defect A's queue-veto
remains the upstream safety layer.
Review: layered defense (#402)What this changes: queue-aware veto (A), per-ontology cadence floors (C), and tombstone-aware loud-fail/recreate (B1+B2). Defense is well-layered and the commit-by-commit decomposition tells the story cleanly. Test coverage on the new logic looks solid. The review below focuses on findings that change risk under load. 1. Operator-initiated dissolve bypasses the tombstone (load-bearing gap)Location: The commit message for B2 frames the distinction as "operator-delete writes a tombstone; annealing dissolve does not." But the operator-facing
This is the same silent-recreate failure mode #402 calls out, just triggered via the dissolve endpoint instead of the dissolve algorithm. Recommend writing a tombstone in this route too (with 2. Tombstone write and graph delete are not atomicLocation:
This is narrow but real. Two structural fixes worth considering: (a) wrap the graph delete and tombstone insert in a single connection/transaction so both succeed or both fail; (b) reorder — write the tombstone first (it costs nothing if the delete fails, and a stranded tombstone is recoverable by removing the row, whereas a stranded "ontology gone, no tombstone" window is the exact race that needs closing). Option (b) is the simpler fix. 3. Proposal executor does not re-check the queue vetoLocation: The A veto blocks proposal creation, but Recommend 4.
|
…D raise (#402, #404 review) PR-404 review findings #3, #4, #5: the previous worker gated tombstone lookup on "ont_node is None", which meant the operator-delete route could only safely write its tombstone *after* delete_ontology_node committed — a worker dequeue in that window saw a missing ontology with no tombstone yet and silently recreated. Reordering the route to write the tombstone first would shift the bug, not close it: in the new window (tombstone present, node still present) the worker would proceed to write content into a graph the operator is removing. Close both windows by checking the tombstone *before* the graph node. A tombstone present in either window — before or after the graph mutation — fails the in-flight ingest with ONTOLOGY_TOMBSTONED_ERROR rather than racing. Also restore the VANISHED raise that B2 made dead code. With Defect A's queue veto and the proposal-executor re-check (separate commit) in place, existed_at_submit=True + missing target should not happen under normal operation. Reaching that branch indicates a real anomaly (rename without job migration, manual surgery, residual race) — surface it with a distinct error string rather than silently recreating. The recreate branch narrows to existed_at_submit=False (first-ever ingest into a new name), which IS positive operator intent. Tombstone-read failure on a missing target now falls through to VANISHED rather than to silent recreate when existed_at_submit=True, which is the correct fallback when we cannot positively rule out an operator delete. Tests updated: the "missing without tombstone triggers recreate" case is now "missing after existing raises VANISHED"; a new test covers the unconditional check (tombstone present + node still present → TOMBSTONED); the tombstone-read-failure path splits by existed_at_submit (True → VANISHED, False → create).
…lve; clear on recreate (#402, #404 review) PR-404 review findings #1, #4 (advisor-revised), #6 (advisor): #1 — Operator-initiated POST /ontology/{name}/dissolve previously wrote no tombstone. Same data-loss class as operator-delete: a racing ingest against the dissolved name would silently recreate it. Dissolve now pre-flights existence + lifecycle in the route (so a refused dissolve doesn't leave a stale tombstone), then writes the tombstone, then calls dissolve_ontology. Annealing's dissolve path still does NOT write a tombstone — that asymmetry is what distinguishes "deliberately removed by operator" (loud-fail) from "absorbed by background reorganization" (recoverable). #4 — Reordering the delete route's tombstone write is only safe in combination with the unconditional tombstone check now in the worker. With that worker change in place, the tombstone is written as the first graph-mutating step of the delete (before delete_ontology_node and the source/embedding cascade). A worker dequeue at any point during the multi-step delete sees the tombstone and fails the ingest TOMBSTONED rather than racing into a graph the operator is removing. #6 — Once tombstones are checked unconditionally, an operator who deletes X and then re-creates X via POST /ontology/ would otherwise hit a permanent "create succeeded, ingest fails forever" trap. The create route now clears any existing tombstone for the same name — explicit operator intent to revive supersedes the prior removal intent. Extracted two helpers (_record_ontology_tombstone, _clear_ontology_ tombstone) so the delete, dissolve, and create routes share a single implementation. Tombstone-write failures are logged but not raised: the tombstone is defense-in-depth on top of the queue veto + the worker's distinct-error raises, and aborting the operator's requested operation because the tombstone INSERT failed would be a worse failure mode. Tests: dissolve route now requires get_ontology_node to return an active node before dissolve_ontology runs; a new test verifies the dissolve tombstone is written and references the absorption target; a new TestDeleteOntologyTombstone class verifies tombstone-before- delete_ontology_node ordering using a side-effect assertion on the helper; a new create-route test verifies tombstone clearing.
…later soft-skip (#402, #404 review) PR-404 review finding #2 + advisor flag: The annealing manager already vetoes demotion candidates with in-flight ingestion at *proposal creation* (Defect A). In autonomous mode the cycle-to-execute gap is small; in human-approval mode an approved proposal can sit for minutes-to-hours between creation and execution. A new ingest enqueued in that window would otherwise be silently dissolved out from under the operator. Re-check the queue at execute time, before calling dissolve_ontology. The advisor caught the secondary failure mode: the existing worker treats any executor return with success=False as 'failed' and writes that status to annealing_proposals. A vetoed-at-execute proposal would therefore be permanently dead, requiring re-approval after the queue clears — the queue veto becomes a footgun rather than a defense. Distinguish veto from real failure with a retry_later flag; the worker reverts the claim to 'approved' (not 'failed') so the proposal stays alive for the next cycle. Genuine failures (ontology gone, target invalid, etc.) still mark 'failed' as before. Residual TOCTOU: the SELECT and the dissolve commit are not atomic — a job enqueued between them slips through. Closing that fully needs an advisory lock on the ontology name in both job-enqueue and dissolve. For now, the ingestion worker's unconditional tombstone check + existed_at_submit=True + missing target → VANISHED raise remain the downstream backstop, and the residual window is the path that the worker-side raise was restored for. Tests: TestExecuteDemotion adds a vetoed-by-inflight test that asserts retry_later is True, the error message names the blocking job IDs, and dissolve_ontology is NOT called. New test_proposal_execution_ worker.py covers the worker-side soft-skip: retry_later=True reverts status to 'approved' (not 'failed'); genuine failures still mark 'failed'.
Second-pass review — fresh eyes on the responseThe three response commits address the original findings cleanly and the rationales in the commit messages are tight. The advisor-flagged fixes (worker-unconditional check, clear-on-create, retry_later soft-skip) are exactly the right shape. Two new findings surface from this pass, one of them parallel to #6. Finding A (request changes): rename does not clear tombstones — same shape as #6
This is the same class of trap that finding #6 closed for Suggested fix: after Finding B (request changes): dissolve leaves a stale tombstone on partial / TOCTOU failure
In any of these, the tombstone has committed and the ontology node still exists. The delete route's docstring acknowledges the analogous "force=true on a nonexistent name" case (lines 1099-1104). The dissolve route's docstring (lines 1817-1825) does not — and dissolve failure is closer to "didn't actually happen" than "operator chose to remove non-existent thing," so leaning on the same recovery path is harder to justify. Two options:
Option 1 is preferable; option 2 is acceptable if option 1 is judged to add too much branching. Acknowledged: what the response got right
Test coverage note (informational, not a finding)
SummaryTwo real changes requested (rename + dissolve stale-tombstone), parallel shape to fixes the response already accepted. Everything else lands cleanly. The reasoning in the commit messages and inline comments is consistent and load-bearing — the asymmetric route/worker design is well-defended. AI-assisted second-pass review via Claude Code |
… tombstone (#402, #404 review-2) Second-pass review of PR #404 surfaced two parallel-shape gaps to fixes already accepted from the first review: Finding A (rename gap) — POST /ontology/{name}/rename establishes an ontology under new_name. The unconditional tombstone check + create- clears-tombstone (advisor finding from review-1) enforces the same invariant at one entry point: any operator action that establishes an ontology under name N supersedes a prior tombstone(N). Rename must do the same, otherwise the delete-then-rename recovery flow fails: ingest into the renamed name dead-ends with TOMBSTONED. Mirror the create-route's _clear_ontology_tombstone(new_name) call after a successful rename. Finding B (dissolve stale tombstone) — dissolve route writes the tombstone before calling client.dissolve_ontology(), which has failure modes beyond the route-level pre-flight (source listing DB error, partial reassign failure, lifecycle TOCTOU between our get + its own re-check). On those, the tombstone has committed but the ontology still exists in the graph. Future ingests would fail TOMBSTONED against an extant ontology. Clear the tombstone on both failure paths (exception and structured success=False) before raising/handling the HTTPException. Tests: rename-clears-tombstone verifies the DELETE is executed against the new_name; dissolve_failure_clears_tombstone exercises the structured-failure rollback path (verifies both INSERT and DELETE ran against the same name).
Closes #402.
The annealing manager and ingestion queue did not coordinate. When annealing dissolved an ontology that still had ingest jobs queued against it, the worker dequeued with a missing target and silently recreated the ontology under a fresh id — operator saw "submitted," content never landed where they asked.
Four atomic commits, one per defect in #402, layered prevent → shrink → recover; then three follow-up commits responding to the review on this PR.
Summary
4282768fkg_api.jobsfor non-terminal ingestion jobs against each demotion candidate; vetoes structurally, logs the count + job_ids, counts vetoes in the cycle result. Promotions are not vetoed (they don't modify the source ontology).99cadd46min_ontology_age_epochs=3andmin_ontology_concept_count=5. Both demotion and promotion candidate selection consult them. Migration usesINSERT … ON CONFLICT DO NOTHINGso operator-tuned rows stay put; existing keys are deliberately not changed.f73737e5ontology_existed_at_submiton the job; a new_validate_target_ontologyhelper raises with a structured, distinct error string when the operator's intent has been invalidated. Job-queue exception handler maps tostatus='failed'with the message on theerrorcolumn; the job-list API already surfaces it.7fbe2498kg_api.ontology_tombstones. Operator-delete writes a tombstone; annealing dissolve does not. Worker consults it: missing + tombstoned → fail; missing + no tombstone → recreate with audit log (operator intent overrides background reorganization).Review response (PR-404, 5 findings)
Code review surfaced 5 findings; advisor reconciliation found that finding #4 as written would have introduced a worse bug (orphan content into a graph being deleted) unless paired with a worker-side change, and that finding #2 had a secondary footgun (vetoed-at-execute proposals permanently failed instead of soft-skipped). Three follow-up commits address all 5 plus both advisor flags.
8f6f0586(worker)existed_at_submit=True+ read-failure now falls through to VANISHED, no longer silently recreatesc0cc5f2b(routes)POST /ontology/clears any prior tombstone so operator-recreate doesn't leave a "create succeeds, ingest fails forever" trap2a99e555(executor)execute_demotionre-checks the queue veto before callingdissolve_ontology(closes the cycle-to-execute gap that's wide in human-approval mode); advisor-flagged footgun — vetoed proposals returnretry_later=Trueand the worker reverts status to'approved'(soft skip) instead of marking'failed'(permanent dead-end)A residual TOCTOU exists between the executor's veto SELECT and the dissolve commit — closing it fully needs an advisory lock on the ontology name in both job-enqueue and dissolve. Documented inline; the worker's
existed_at_submit=True+ missing → VANISHED raise is the downstream backstop for that residual window.Definition of done
Test plan
pytest tests/unit/+tests/api/test_ingest.py+tests/api/test_ontology_routes.py— 568 pass (full suite passes after review-response commits; new tests cover one observable per finding)status=failedwith tombstone error'approved'(not'failed') and dissolve did not runFiles
api/app/services/annealing_manager.py— veto logic + cadence floorsapi/app/services/proposal_executor.py— execute-time veto re-check +retry_laterapi/app/launchers/annealing.py,api/app/workers/annealing_worker.py— option plumbingapi/app/workers/ingestion_worker.py—_validate_target_ontology(unconditional tombstone check, VANISHED restored) +_ontology_tombstoneapi/app/workers/proposal_execution_worker.py— retry_later soft-skipapi/app/routes/ingest.py— stampontology_existed_at_submitapi/app/routes/ontology.py— tombstone helpers (_record_ontology_tombstone,_clear_ontology_tombstone); delete/dissolve write tombstone before mutation; create clears tombstoneschema/migrations/065_annealing_cadence_floors.sqlschema/migrations/066_ontology_tombstones.sqltests/unit/services/test_annealing_manager.py,tests/unit/services/test_proposal_executor.py,tests/unit/workers/test_ingestion_worker.py,tests/unit/workers/test_proposal_execution_worker.py,tests/api/test_ontology_routes.py