fix: annealing/ingestion race — layered defense (closes #402) by aaronsb · Pull Request #404 · aaronsb/knowledge-graph-system

aaronsb · 2026-05-23T04:14:42Z

Closes #402.

The annealing manager and ingestion queue did not coordinate. When annealing dissolved an ontology that still had ingest jobs queued against it, the worker dequeued with a missing target and silently recreated the ontology under a fresh id. The operator saw "submitted," but the content never landed where they asked.

Four atomic commits, one per defect in #402, layered prevent → shrink → recover; then three follow-up commits responding to the review on this PR.

Summary

| Layer | Commit | Defect |
| --- | --- | --- |
| Prevent | `4282768f` | A: Annealing vetoes demotion candidates with in-flight ingestion. Queue-aware: queries `kg_api.jobs` for non-terminal ingestion jobs against each demotion candidate; vetoes structurally, logs the count + job_ids, counts vetoes in the cycle result. Promotions are not vetoed (they don't modify the source ontology). |
| Shrink window | `99cadd46` | C: Per-ontology cadence floors. Migration 065 adds `min_ontology_age_epochs=3` and `min_ontology_concept_count=5`. Both demotion and promotion candidate selection consult them. Migration uses `INSERT … ON CONFLICT DO NOTHING` so operator-tuned rows stay put; existing keys are deliberately not changed. |
| Recover (loud) | `f73737e5` | B1: Worker fails loudly on missing target. Both ingest routes stamp `ontology_existed_at_submit` on the job; a new `_validate_target_ontology` helper raises with a structured, distinct error string when the operator's intent has been invalidated. Job-queue exception handler maps to `status='failed'` with the message on the `error` column; the job-list API already surfaces it. |
| Recover (intent) | `7fbe2498` | B2: Tombstone-aware idempotent recreate. Migration 066 adds `kg_api.ontology_tombstones`. Operator-delete writes a tombstone; annealing dissolve does not. Worker consults it: missing + tombstoned → fail; missing + no tombstone → recreate with audit log (operator intent overrides background reorganization). |

Review response (PR-404, 5 findings)

Code review surfaced 5 findings; advisor reconciliation found that finding #4 as written would have introduced a worse bug (orphan content into a graph being deleted) unless paired with a worker-side change, and that finding #2 had a secondary footgun (vetoed-at-execute proposals permanently failed instead of soft-skipped). Three follow-up commits address all 5 plus both advisor flags.

| Commit | Findings addressed |
| --- | --- |
| `8f6f0586` (worker) | #3 restore VANISHED raise; #4 worker checks tombstone unconditionally (the precondition for safely reordering tombstone writes upstream); #5 tombstone-read fallback reassessed: `existed_at_submit=True` + read-failure now falls through to VANISHED, no longer silently recreates |
| `c0cc5f2b` (routes) | #1 dissolve writes tombstone (same data-loss class as delete); #4 tombstone is the first graph-mutating step of delete; #6 advisor-flagged: `POST /ontology/` clears any prior tombstone so operator-recreate doesn't leave a "create succeeds, ingest fails forever" trap |
| `2a99e555` (executor) | #2 `execute_demotion` re-checks the queue veto before calling `dissolve_ontology` (closes the cycle-to-execute gap that's wide in human-approval mode); advisor-flagged footgun: vetoed proposals return `retry_later=True` and the worker reverts status to `'approved'` (soft skip) instead of marking `'failed'` (permanent dead-end) |

A residual TOCTOU exists between the executor's veto SELECT and the dissolve commit — closing it fully needs an advisory lock on the ontology name in both job-enqueue and dissolve. Documented inline; the worker's existed_at_submit=True + missing → VANISHED raise is the downstream backstop for that residual window.
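The advisory-lock follow-up mentioned above is not part of this PR. As a rough sketch of one way it could look, both the job-enqueue path and the dissolve path would take the same transaction-scoped PostgreSQL advisory lock, keyed on the ontology name. The helper name `ontology_lock_key` and the hashing scheme are assumptions for illustration; only `pg_advisory_xact_lock(bigint)` is a real PostgreSQL function.

```python
import hashlib

# Both enqueue and dissolve would run this inside their own transaction
# before touching the ontology. The lock is released at commit/rollback.
LOCK_SQL = "SELECT pg_advisory_xact_lock(%s)"


def ontology_lock_key(name: str) -> int:
    """Derive a stable signed 64-bit key from an ontology name.

    Hypothetical helper: pg_advisory_xact_lock takes a bigint, so the
    first 8 bytes of a SHA-256 digest are read as a signed integer.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

With a psycopg-style cursor this would be called as `cur.execute(LOCK_SQL, (ontology_lock_key(name),))` at the start of each transaction, so a dissolve and an enqueue against the same name serialize instead of interleaving.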

Definition of done

An operator running a batch ingest concurrent with annealing observes either successful ingestion or a clearly-failed job — never a silent drop.

A + executor re-check block the common race upstream at two points (proposal creation and proposal execution).
C shrinks the residual window for ontologies still settling.
B1 + B2 + worker-unconditional-tombstone turn the surviving cases into observable, structurally distinct failures (vanished / tombstoned / frozen) or honored recreations.
Dissolve route + create-clear-tombstone close the operator-initiated paths that B2 left exposed.

Test plan

pytest tests/unit/ + tests/api/test_ingest.py + tests/api/test_ontology_routes.py — 568 pass (full suite passes after review-response commits; new tests cover one observable per finding)
Migration 065 applied successfully on dev
Migration 066 applied successfully on dev
Manual smoke: queue an ingest job, force an annealing cycle, observe veto log + no proposal for that ontology
Manual smoke: delete an ontology via API, attempt re-ingest, observe status=failed with tombstone error
Manual smoke: approve a demotion proposal, enqueue an ingest for the same ontology, run the execution worker; verify the proposal status reverts to 'approved' (not 'failed') and dissolve did not run

Files

api/app/services/annealing_manager.py — veto logic + cadence floors
api/app/services/proposal_executor.py — execute-time veto re-check + retry_later
api/app/launchers/annealing.py, api/app/workers/annealing_worker.py — option plumbing
api/app/workers/ingestion_worker.py — _validate_target_ontology (unconditional tombstone check, VANISHED restored) + _ontology_tombstone
api/app/workers/proposal_execution_worker.py — retry_later soft-skip
api/app/routes/ingest.py — stamp ontology_existed_at_submit
api/app/routes/ontology.py — tombstone helpers (_record_ontology_tombstone, _clear_ontology_tombstone); delete/dissolve write tombstone before mutation; create clears tombstone
schema/migrations/065_annealing_cadence_floors.sql
schema/migrations/066_ontology_tombstones.sql
tests/unit/services/test_annealing_manager.py, tests/unit/services/test_proposal_executor.py, tests/unit/workers/test_ingestion_worker.py, tests/unit/workers/test_proposal_execution_worker.py, tests/api/test_ontology_routes.py

… A) Annealing's candidate selection was graph-state-aware but queue-state-unaware. When the worker dissolved an ontology while operator-submitted ingest jobs were queued against it, those jobs dequeued with a missing target and silently never landed: accepted-then-dropped data loss.

Before proposing any demote / merge / decompose / dissolve mutation against ontology X, the cycle now consults kg_api.jobs for non-terminal ingestion jobs targeting X and vetoes that candidate this cycle (skip, not defer) when any exist. Vetoes are logged structurally and counted in the cycle result so they're observable, not silent.

Promotions are not vetoed: they create a new ontology and do not modify the source in a way that invalidates queued ingest jobs against it.
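A minimal in-memory sketch of the veto described in this commit message, assuming job rows have already been fetched from kg_api.jobs. The `Job` shape, the terminal-status set, and both function names are illustrative assumptions; the real check is a single SQL statement inside the annealing manager.

```python
from dataclasses import dataclass

# Assumed status vocabulary. The PR only confirms that 'pending' counts
# as non-terminal; the terminal values listed here are illustrative.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


@dataclass(frozen=True)
class Job:
    job_id: str
    job_type: str
    ontology: str
    status: str


def inflight_ingestion_targets(jobs, candidates):
    """Map each candidate ontology to the ids of its non-terminal ingestion jobs."""
    wanted = set(candidates)
    blocked = {}
    for job in jobs:
        if (
            job.job_type == "ingestion"
            and job.ontology in wanted
            and job.status not in TERMINAL_STATUSES
        ):
            blocked.setdefault(job.ontology, []).append(job.job_id)
    return blocked


def apply_queue_veto(candidates, jobs):
    """Split demotion candidates into (allowed, vetoed) for this cycle.

    Vetoed candidates are skipped, not deferred; the returned mapping
    carries the blocking job ids so the veto can be logged and counted.
    """
    blocked = inflight_ingestion_targets(jobs, candidates)
    allowed = [name for name in candidates if name not in blocked]
    return allowed, blocked
```

For example, with one pending ingestion job against X, `apply_queue_veto(["X", "Y"], jobs)` keeps Y as a candidate and reports X together with the blocking job id.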

… C) Shipping cadence was too aggressive: cycles evaluated brand-new and near-empty ontologies before they accumulated enough signal to be judged fairly, wasting LLM calls and widening the ingestion race window that Defect A blocks.

Two new floors gate per-ontology cycle eligibility (migration 065):

- min_ontology_age_epochs = 3: ontology must exist ≥3 epoch ticks before annealing can judge it
- min_ontology_concept_count = 5: ontology must hold ≥5 concepts before annealing can judge it

Both demotion and promotion candidate selection consult the floors: high-degree concepts in a sparse or brand-new source ontology haven't earned an evaluation as natural new nuclei either.

Migration uses INSERT ... ON CONFLICT DO NOTHING so an operator-tuned row is never overwritten. Existing keys (epoch_interval, demotion_threshold, promotion_min_degree, max_proposals) are intentionally not changed here.
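The two floors amount to a simple eligibility gate. The sketch below assumes ontology age is the difference between the current epoch and a creation epoch; how the real manager derives age is not shown in this PR, and the names `CadenceFloors` and `is_cycle_eligible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CadenceFloors:
    # Defaults mirror the migration 065 seed values quoted above.
    min_ontology_age_epochs: int = 3
    min_ontology_concept_count: int = 5


def is_cycle_eligible(created_epoch, current_epoch, concept_count,
                      floors=CadenceFloors()):
    """Return True when an ontology is old enough and large enough to judge.

    Assumption: age is measured as current_epoch - created_epoch.
    """
    age = current_epoch - created_epoch
    return (
        age >= floors.min_ontology_age_epochs
        and concept_count >= floors.min_ontology_concept_count
    )
```

Both floors must hold, so a large but brand-new ontology and an old but near-empty one are each skipped.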

…B1) If annealing dissolved a target ontology between job submit and execute, the worker silently called ensure_ontology_exists, which recreated the ontology under a fresh ontology_id. The operator saw "submitted" but their content never landed under the namespace they targeted: silent operator-visible data loss.

Both ingest routes now stamp ontology_existed_at_submit on the job. The worker reads it through a new _validate_target_ontology helper:

- existed_at_submit=True + node missing → loud raise with a distinct "vanished mid-flight" error string. The job-queue exception handler marks the job status='failed' with the message on the job's error column, which the job-list API already surfaces.
- existed_at_submit=False + node missing → first-ever ingest, create it (that IS the operator's intent).
- node present + frozen → existing ADR-200 Phase 2 frozen-rejection.

Distinct error strings (ONTOLOGY_VANISHED_MID_FLIGHT_ERROR vs. ONTOLOGY_FROZEN_ERROR) so operators and Defect B2's tombstone path can distinguish the failure modes. Pre-B1 jobs without the flag default to the safer "vanished" behavior.

…B2) B1 made every missing-target ingest fail loudly. That over-corrected: when annealing dissolves an ontology between queue and execute, the operator-submitted ingest IS active operator intent that should override the background reorganization. Failing it strands data that the operator explicitly told us to write.

B2 narrows the loud-fail trigger to a positive operator-intent signal: ontology tombstones (migration 066). The operator-delete route writes a tombstone row; annealing's dissolve_ontology path does not. The ingestion worker consults the tombstone when its target is missing:

- missing + tombstoned → loud fail with ONTOLOGY_TOMBSTONED_ERROR ("deliberately removed by an operator")
- missing + no tombstone → recreate via ensure_ontology_exists with an audit log line naming job_id, actor, and existed_at_submit so the recreate event is traceable

Three distinct error strings (tombstoned / frozen / vanished) so the job-list API surfaces structurally different failure modes that an operator needs to distinguish.

The tombstone-read failure path is deliberately tolerant: falling back to recreate is recoverable; failing every ingest when one query fails is not. Defect A's queue-veto remains the upstream safety layer.

aaronsb · 2026-05-23T19:19:13Z

Review: layered defense (#402)

What this changes: queue-aware veto (A), per-ontology cadence floors (C), and tombstone-aware loud-fail/recreate (B1+B2). Defense is well-layered and the commit-by-commit decomposition tells the story cleanly. Test coverage on the new logic looks solid.

The review below focuses on findings that change risk under load.

1. Operator-initiated dissolve bypasses the tombstone (load-bearing gap)

Location: api/app/routes/ontology.py:1740-1794 (POST /ontology/{name}/dissolve)

The commit message for B2 frames the distinction as "operator-delete writes a tombstone; annealing dissolve does not." But the operator-facing dissolve_ontology route is also operator intent and is also a missing-target producer, and it does not write a tombstone. Sequence:

Operator A queues ingest job against X.
Operator B calls POST /ontology/X/dissolve (intentional, deliberate).
Worker dequeues A's job, target missing, no tombstone → recreate path runs with audit log line. Content lands under a fresh ontology_id named X that Operator B explicitly intended to retire.

This is the same silent-recreate failure mode #402 calls out, just triggered via the dissolve endpoint instead of the dissolve algorithm. Recommend writing a tombstone in this route too (with reason="operator-initiated dissolve via API" for distinguishability). The A veto does not cover this path — A only consults kg_api.jobs from within the annealing manager.

2. Tombstone write and graph delete are not atomic

Location: api/app/routes/ontology.py:1147-1184

delete_ontology_node runs first (its own implicit transaction); the tombstone INSERT runs afterward on a separately checked-out connection. Window:

T0: delete_ontology_node commits — ontology gone from graph.
T1: worker on another process dequeues, calls _validate_target_ontology, queries ontology_tombstones, finds none.
T2: worker takes the recreate path.
T3: route writes tombstone.

This is narrow but real. Two structural fixes worth considering: (a) wrap the graph delete and tombstone insert in a single connection/transaction so both succeed or both fail; (b) reorder — write the tombstone first (it costs nothing if the delete fails, and a stranded tombstone is recoverable by removing the row, whereas a stranded "ontology gone, no tombstone" window is the exact race that needs closing).

Option (b) is the simpler fix.

3. Proposal executor does not re-check the queue veto

Location: api/app/services/proposal_executor.py:166-201

The A veto blocks proposal creation, but execute_demotion only re-validates "ontology still exists" and "not pinned/frozen." Approved proposals that pre-date a newly-enqueued ingest job will still dissolve. With operator-approved (non-autonomous) cycles this can sit pending for arbitrary time.

Recommend execute_demotion re-run the same _get_inflight_ingestion_targets({name}) check before calling dissolve_ontology and return {success: False, error: "in-flight ingestion blocking demotion"} if any rows surface. In autonomous mode the window is short; in human-approval mode it's wide-open.

4. `ONTOLOGY_VANISHED_MID_FLIGHT_ERROR` is dead code after B2

Location: api/app/workers/ingestion_worker.py:30-37, 95-147

After B2, _validate_target_ontology raises only ONTOLOGY_TOMBSTONED_ERROR or ONTOLOGY_FROZEN_ERROR. The "vanished" string is defined and the comment claims it surfaces on "legacy/pre-B2 jobs where ontology_existed_at_submit was True but the tombstone path was not checked" — but I don't see a code path that emits it. The existed_at_submit parameter survived only as an audit-log field.

Two acceptable resolutions: delete the constant + the parameter and document that B2 supersedes B1's loud-fail semantics; or restore B1's check for existed_at_submit=True + missing + no tombstone and raise ONTOLOGY_VANISHED_MID_FLIGHT_ERROR instead of recreating in that specific subset (preserving B1's loud-fail for the population where the queue-veto failed but the operator never deliberately deleted). The latter is more conservative — the recreate path is only triggered when both queue-veto (A) and tombstone-check (B2) failed to fire, which is the worst signal-to-noise regime to silently recover from.

5. Tombstone-read failure tolerance (judgment call, not a defect)

Location: api/app/workers/ingestion_worker.py:81-89

The "log loudly, fall back to recreate" stance is defensible given the queue-veto is the upstream layer. If finding #3 lands and the veto becomes truly upstream-effective, this is fine as-is. If #3 does not land, this fallback is the third silent-recreate route in the system, after #1 and the executor-bypass in #3.

6. Tested defaults match migration defaults

A: queue-veto SELECT uses ANY(%s) over status + ontology lists, single statement, atomic snapshot of the jobs table. Good. Note status = 'pending' is on the non-terminal list — matches the initial INSERT in job_queue.enqueue.
C: launcher defaults (3 / 5) match migration 065 seed values and worker job_data.get(..., 3/5) fallbacks — three places stay in sync.
B1: route default True for ontology_existed_at_submit matches worker default True — pre-B1 jobs default to the safer behavior, as advertised.

7. Migration idempotency

065: INSERT … ON CONFLICT (key) DO NOTHING + INSERT INTO schema_migrations … ON CONFLICT (version) DO NOTHING. Re-runnable. Reversible by hand (DELETE FROM annealing_options WHERE key IN (...)).
066: CREATE TABLE IF NOT EXISTS + same migrations guard. Re-runnable. Drop is straightforward.
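The re-runnable seed pattern noted for 065 can be demonstrated with SQLite as a stand-in for the project's database. The `annealing_options` table and `key` column are named in this review; the `value` column and its text type are assumptions for the sketch.

```python
import sqlite3

# Seed statement in the style of migration 065. ON CONFLICT ... DO NOTHING
# leaves any row that already exists untouched.
SEED = """
INSERT INTO annealing_options (key, value) VALUES
    ('min_ontology_age_epochs', '3'),
    ('min_ontology_concept_count', '5')
ON CONFLICT (key) DO NOTHING
"""


def apply_seed(conn):
    conn.execute(SEED)
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annealing_options "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    apply_seed(conn)
    # An operator tunes one floor after the first run.
    conn.execute(
        "UPDATE annealing_options SET value = '7' "
        "WHERE key = 'min_ontology_age_epochs'"
    )
    apply_seed(conn)  # Re-running the migration must not reset the tuned row.
    return dict(conn.execute("SELECT key, value FROM annealing_options"))
```

After the second `apply_seed`, the tuned value is still in place, which is the property the review calls "re-runnable."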

8. Test file size

tests/unit/services/test_annealing_manager.py is 784 lines but exclusively test functions across nine logical groupings. Test files trade breadth for cohesion — flagging only because you asked. No action recommended.

Assessment

Defense-in-depth structure is correct, the commit decomposition is exemplary, and the per-defect tests carry their weight. Three findings would change behavior under load:

#1 (operator dissolve bypasses tombstone): same data-loss class as #402, different trigger. Should ship with this PR.
#3 (executor doesn't re-check veto): leaves a wide-open window in human-approval mode. Should ship with this PR.
#4 (dead error string): either resolution is fine; I'd prefer restoring the B1 raise so the recreate path narrows to "veto-missed AND no operator delete intent AND existed at submit."

Findings #2 (tombstone race window) and #5 (read-failure fallback) are smaller and depend on whether #3 lands as proposed.

AI-assisted review via Claude

…D raise (#402, #404 review)

PR-404 review findings #3, #4, #5: the previous worker gated tombstone lookup on "ont_node is None", which meant the operator-delete route could only safely write its tombstone *after* delete_ontology_node committed. A worker dequeue in that window saw a missing ontology with no tombstone yet and silently recreated.

Reordering the route to write the tombstone first would shift the bug, not close it: in the new window (tombstone present, node still present) the worker would proceed to write content into a graph the operator is removing.

Close both windows by checking the tombstone *before* the graph node. A tombstone present in either window (before or after the graph mutation) fails the in-flight ingest with ONTOLOGY_TOMBSTONED_ERROR rather than racing.

Also restore the VANISHED raise that B2 made dead code. With Defect A's queue veto and the proposal-executor re-check (separate commit) in place, existed_at_submit=True + missing target should not happen under normal operation. Reaching that branch indicates a real anomaly (rename without job migration, manual surgery, residual race), so surface it with a distinct error string rather than silently recreating. The recreate branch narrows to existed_at_submit=False (first-ever ingest into a new name), which IS positive operator intent.

Tombstone-read failure on a missing target now falls through to VANISHED rather than to silent recreate when existed_at_submit=True, which is the correct fallback when we cannot positively rule out an operator delete.

Tests updated: the "missing without tombstone triggers recreate" case is now "missing after existing raises VANISHED"; a new test covers the unconditional check (tombstone present + node still present → TOMBSTONED); the tombstone-read-failure path splits by existed_at_submit (True → VANISHED, False → create).
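The worker's decision order after this commit can be summarized as a small pure function. This is a sketch of the rules stated in the commit message, not the body of _validate_target_ontology; the `Outcome` enum and `decide` name are hypothetical, and the behavior when the tombstone read fails while the node is still present (treated here as "continue to the node checks") is an assumption.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PROCEED = "proceed"        # target exists and is writable
    CREATE = "create"          # first-ever ingest into a new name
    TOMBSTONED = "tombstoned"  # deliberately removed by an operator
    FROZEN = "frozen"          # existing frozen-rejection
    VANISHED = "vanished"      # existed at submit, missing at execute


def decide(tombstoned: Optional[bool], node_exists: bool, frozen: bool,
           existed_at_submit: bool = True) -> Outcome:
    """Classify an ingest job's target.

    `tombstoned` is None when the tombstone lookup itself failed.
    `existed_at_submit` defaults to True, the safer behavior for jobs
    that predate the flag.
    """
    # The tombstone is checked first, whether or not the node still exists.
    if tombstoned is True:
        return Outcome.TOMBSTONED
    if node_exists:
        return Outcome.FROZEN if frozen else Outcome.PROCEED
    # Node is missing. A failed tombstone read cannot rule out an operator
    # delete, so it takes the same branch as "no tombstone found".
    if existed_at_submit:
        return Outcome.VANISHED
    return Outcome.CREATE
```

The only path to `CREATE` is a missing node on a job whose target did not exist at submit time, which matches the narrowed recreate branch described above.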

…lve; clear on recreate (#402, #404 review)

PR-404 review findings #1, #4 (advisor-revised), #6 (advisor):

#1: Operator-initiated POST /ontology/{name}/dissolve previously wrote no tombstone. Same data-loss class as operator-delete: a racing ingest against the dissolved name would silently recreate it. Dissolve now pre-flights existence + lifecycle in the route (so a refused dissolve doesn't leave a stale tombstone), then writes the tombstone, then calls dissolve_ontology. Annealing's dissolve path still does NOT write a tombstone; that asymmetry is what distinguishes "deliberately removed by operator" (loud-fail) from "absorbed by background reorganization" (recoverable).

#4: Reordering the delete route's tombstone write is only safe in combination with the unconditional tombstone check now in the worker. With that worker change in place, the tombstone is written as the first graph-mutating step of the delete (before delete_ontology_node and the source/embedding cascade). A worker dequeue at any point during the multi-step delete sees the tombstone and fails the ingest TOMBSTONED rather than racing into a graph the operator is removing.

#6: Once tombstones are checked unconditionally, an operator who deletes X and then re-creates X via POST /ontology/ would otherwise hit a permanent "create succeeded, ingest fails forever" trap. The create route now clears any existing tombstone for the same name: explicit operator intent to revive supersedes the prior removal intent.

Extracted two helpers (_record_ontology_tombstone, _clear_ontology_tombstone) so the delete, dissolve, and create routes share a single implementation. Tombstone-write failures are logged but not raised: the tombstone is defense-in-depth on top of the queue veto + the worker's distinct-error raises, and aborting the operator's requested operation because the tombstone INSERT failed would be a worse failure mode.

Tests: dissolve route now requires get_ontology_node to return an active node before dissolve_ontology runs; a new test verifies the dissolve tombstone is written and references the absorption target; a new TestDeleteOntologyTombstone class verifies tombstone-before-delete_ontology_node ordering using a side-effect assertion on the helper; a new create-route test verifies tombstone clearing.
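The route-side ordering from findings #4 and #6 can be illustrated with in-memory stand-ins. `TombstoneStore`, `Graph`, and the two route functions below are hypothetical simplifications; the real routes use the _record_ontology_tombstone and _clear_ontology_tombstone helpers against the database.

```python
class TombstoneStore:
    """In-memory stand-in for kg_api.ontology_tombstones."""

    def __init__(self):
        self._rows = {}

    def record(self, name, reason):
        self._rows[name] = reason

    def clear(self, name):
        self._rows.pop(name, None)

    def has(self, name):
        return name in self._rows


class Graph:
    """In-memory stand-in for the ontology graph; logs each mutation."""

    def __init__(self):
        self.nodes = set()
        self.events = []


def delete_ontology(graph, tombstones, name):
    # The tombstone is the first mutating step, so a worker that dequeues
    # at any point during the delete already sees it.
    tombstones.record(name, reason="operator delete")
    graph.events.append(("tombstone", name))
    graph.nodes.discard(name)
    graph.events.append(("delete_node", name))


def create_ontology(graph, tombstones, name):
    # Explicit operator intent to (re)create supersedes a prior removal.
    tombstones.clear(name)
    graph.events.append(("clear_tombstone", name))
    graph.nodes.add(name)
    graph.events.append(("create_node", name))
```

Running create, delete, then create again for the same name leaves a live node and no tombstone, so later ingests are not trapped by the earlier removal.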

…later soft-skip (#402, #404 review)

PR-404 review finding #2 + advisor flag:

The annealing manager already vetoes demotion candidates with in-flight ingestion at *proposal creation* (Defect A). In autonomous mode the cycle-to-execute gap is small; in human-approval mode an approved proposal can sit for minutes-to-hours between creation and execution. A new ingest enqueued in that window would otherwise be silently dissolved out from under the operator. Re-check the queue at execute time, before calling dissolve_ontology.

The advisor caught the secondary failure mode: the existing worker treats any executor return with success=False as 'failed' and writes that status to annealing_proposals. A vetoed-at-execute proposal would therefore be permanently dead, requiring re-approval after the queue clears, and the queue veto becomes a footgun rather than a defense. Distinguish veto from real failure with a retry_later flag; the worker reverts the claim to 'approved' (not 'failed') so the proposal stays alive for the next cycle. Genuine failures (ontology gone, target invalid, etc.) still mark 'failed' as before.

Residual TOCTOU: the SELECT and the dissolve commit are not atomic, so a job enqueued between them slips through. Closing that fully needs an advisory lock on the ontology name in both job-enqueue and dissolve. For now, the ingestion worker's unconditional tombstone check + existed_at_submit=True + missing target → VANISHED raise remain the downstream backstop, and the residual window is the path that the worker-side raise was restored for.

Tests: TestExecuteDemotion adds a vetoed-by-inflight test that asserts retry_later is True, the error message names the blocking job IDs, and dissolve_ontology is NOT called. New test_proposal_execution_worker.py covers the worker-side soft-skip: retry_later=True reverts status to 'approved' (not 'failed'); genuine failures still mark 'failed'.
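The soft-skip boils down to a three-way mapping from executor result to proposal status. A minimal sketch, assuming the executor returns a dict with `success` and an optional `retry_later`; the function name and the `'executed'` success status are assumptions, since this PR only names `'approved'` and `'failed'`.

```python
def next_proposal_status(result: dict) -> str:
    """Map an executor result to the status the worker should write."""
    if result.get("success"):
        return "executed"  # assumed name for the success status
    if result.get("retry_later"):
        # Vetoed at execute time: keep the proposal alive for the next
        # dispatch instead of dead-ending it.
        return "approved"
    return "failed"
```

A veto result such as `{"success": False, "retry_later": True, "error": "..."}` therefore maps to `'approved'`, while a plain `{"success": False}` still maps to `'failed'`.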

aaronsb · 2026-05-24T05:38:26Z

Second-pass review — fresh eyes on the response

The three response commits address the original findings cleanly and the rationales in the commit messages are tight. The advisor-flagged fixes (worker-unconditional check, clear-on-create, retry_later soft-skip) are exactly the right shape. Two new findings surface from this pass, one of them parallel to #6.

Finding A (request changes): rename does not clear tombstones — same shape as #6

api/app/routes/ontology.py:1287-1362 (rename_ontology) does not interact with kg_api.ontology_tombstones. With the unconditional tombstone check now in the worker and clear-on-create only in POST /ontology/, a routine operator recovery flow dead-ends:

DELETE /ontology/X writes tombstone(X)
Operator decides to repurpose Y as X: POST /ontology/Y/rename {new_name: X}
Rename succeeds at the graph level; tombstone(X) still present
Any future ingest into X fails TOMBSTONED permanently

This is the same class of trap that finding #6 closed for POST /ontology/. The underlying invariant the response is reaching for — any operator action that successfully establishes an ontology under name N supersedes tombstone(N) — is enforced at only one of two entry points.

Suggested fix: after client.rename_ontology_node(...) succeeds (the Source-rename path already committed by then), call _clear_ontology_tombstone(client, request.new_name). Mirror the create-route comment so future readers see the invariant.

Finding B (request changes): dissolve leaves a stale tombstone on partial / TOCTOU failure

api/app/routes/ontology.py:1846-1868: the route now pre-flights existence + lifecycle, writes the tombstone, then calls client.dissolve_ontology. dissolve_ontology in api/app/lib/age_client/ontology_scoring.py:527 has additional failure modes the route pre-flight cannot cover:

source-listing query fails (DB error at line 581)
reassign_sources returns success=False (line 598)
TOCTOU on lifecycle between route pre-flight and dissolve_ontology's own pre-flight

In any of these, the tombstone has committed and the ontology node still exists. The delete route's docstring acknowledges the analogous "force=true on a nonexistent name" case (lines 1099-1104). The dissolve route's docstring (lines 1817-1825) does not — and dissolve failure is closer to "didn't actually happen" than "operator chose to remove non-existent thing," so leaning on the same recovery path is harder to justify.

Two options:

Catch dissolve failure and call _clear_ontology_tombstone on the rollback path. Cleanest semantically (failed dissolve ≠ operator removal intent). Subtlety: if reassign_sources was partially successful before failing, the ontology is in a half-state and clearing the tombstone is arguably also wrong — but it's no worse than the current "block forever" outcome.
Add an explicit docstring note matching delete's pattern and document the manual recovery (delete the row from kg_api.ontology_tombstones).

Option 1 is preferable; option 2 is acceptable if option 1 is judged to add too much branching.

Acknowledged: what the response got right

The reasoning in 8f6f0586's commit message for moving the tombstone check before get_ontology_node is exactly the load-bearing argument. The "tombstone present + node still present" test (test_tombstone_present_with_live_node_still_raises) pins the behavior the asymmetric route/worker ordering depends on. Good test surface.
Restoring the VANISHED raise + the new split on tombstone-read failure (existed_at_submit=True → VANISHED, False → create) is the correct read of the original review's intent — the response improved on what I originally asked for.
retry_later + worker reverting the claim to 'approved' instead of 'failed' is the right shape. The scheduler in api/app/services/job_scheduler.py:272-310 re-dispatches stranded approved proposals with a 5-minute reviewed_at floor on a default 1-hour cleanup cadence; combined with _claim_proposal's WHERE status='approved' atomic guard, this gives bounded retry without thrash and without double-execution. Mention this in the PR body's "retry semantics" if helpful.
The recreate-then-ingest sequence is clean: POST /ontology/X clears tombstone(X) before create_ontology_node, and DELETE already calls queue.delete_jobs_by_ontology so there's no pre-delete-queued ingest to dequeue against the recreated namespace.
The TOCTOU residual between the executor's veto SELECT and the dissolve commit is honestly documented at lines 225-232 of proposal_executor.py, and the worker's VANISHED raise is the right downstream backstop. The advisory-lock path you've flagged for future work is the correct direction.
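The "atomic guard" on claiming an approved proposal can be sketched with a conditional UPDATE. SQLite stands in for the real database here; the `annealing_proposals` table name comes from the commit message, while the `id` column, the `'executing'` claimed status, and both function names are assumptions for illustration.

```python
import sqlite3


def claim_proposal(conn, proposal_id):
    """Claim a proposal only if it is still 'approved'.

    The WHERE clause makes the claim atomic: of two racing workers,
    only one UPDATE matches a row, so only one gets rowcount == 1.
    """
    cur = conn.execute(
        "UPDATE annealing_proposals SET status = 'executing' "
        "WHERE id = ? AND status = 'approved'",
        (proposal_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def release_for_retry(conn, proposal_id):
    """Revert a claimed proposal to 'approved' after a retry_later result."""
    conn.execute(
        "UPDATE annealing_proposals SET status = 'approved' "
        "WHERE id = ? AND status = 'executing'",
        (proposal_id,),
    )
    conn.commit()
```

A second `claim_proposal` call on the same row returns False until `release_for_retry` puts it back, which is what prevents double-execution while still allowing a later re-dispatch.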

Test coverage note (informational, not a finding)

test_demotion_vetoed_by_inflight_ingestion overrides mock_cursor.fetchall on the shared cursor fixture to return job IDs. The assertion shape is sound (retry_later True, dissolve not called, job IDs in error), but the mock accepts any SQL — a regression that filtered by status='completed' or queried the wrong table would still pass. The behavior under test is right; the SQL surface is unverified. Documented inline in this comment for the record; not requesting a change since the SELECT is a short mirror of AnnealingManager._get_inflight_ingestion_targets whose own tests cover the SQL shape.

TestDeleteOntologyTombstone's side-effect ordering assertion (asserts delete_ontology_node.call_count == 0 inside the tombstone-record side effect) is a robust pattern — couldn't pass under reverse ordering at the Python call level.
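The side-effect ordering pattern generalizes to any "A must run before B" check. The sketch below uses `unittest.mock` with a toy route; `delete_route` and `check_tombstone_written_first` are hypothetical names, not the test code from this PR.

```python
from unittest.mock import MagicMock


def delete_route(client, record_tombstone, name):
    """Toy route: tombstone first, then the graph delete."""
    record_tombstone(name)
    client.delete_ontology_node(name)


def check_tombstone_written_first(route):
    """Raise AssertionError unless the tombstone is recorded before the delete."""
    client = MagicMock()

    def record_side_effect(name):
        # Runs at the moment the tombstone is recorded: the graph delete
        # must not have happened yet.
        assert client.delete_ontology_node.call_count == 0

    record = MagicMock(side_effect=record_side_effect)
    route(client, record, "X")
    record.assert_called_once_with("X")
    client.delete_ontology_node.assert_called_once_with("X")
```

Because the assertion fires inside the side effect, a route that calls `delete_ontology_node` first fails at the point of the tombstone write rather than relying on a post-hoc inspection of call order.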

Summary

Two real changes requested (rename + dissolve stale-tombstone), parallel shape to fixes the response already accepted. Everything else lands cleanly. The reasoning in the commit messages and inline comments is consistent and load-bearing — the asymmetric route/worker design is well-defended.

AI-assisted second-pass review via Claude Code

… tombstone (#402, #404 review-2)

Second-pass review of PR #404 surfaced two parallel-shape gaps to fixes already accepted from the first review:

Finding A (rename gap): POST /ontology/{name}/rename establishes an ontology under new_name. The unconditional tombstone check + create-clears-tombstone (advisor finding from review-1) enforces the same invariant at one entry point: any operator action that establishes an ontology under name N supersedes a prior tombstone(N). Rename must do the same, otherwise the delete-then-rename recovery flow fails: ingest into the renamed name dead-ends with TOMBSTONED. Mirror the create-route's _clear_ontology_tombstone(new_name) call after a successful rename.

Finding B (dissolve stale tombstone): dissolve route writes the tombstone before calling client.dissolve_ontology(), which has failure modes beyond the route-level pre-flight (source listing DB error, partial reassign failure, lifecycle TOCTOU between our get + its own re-check). On those, the tombstone has committed but the ontology still exists in the graph. Future ingests would fail TOMBSTONED against an extant ontology. Clear the tombstone on both failure paths (exception and structured success=False) before raising/handling the HTTPException.

Tests: rename-clears-tombstone verifies the DELETE is executed against the new_name; dissolve_failure_clears_tombstone exercises the structured-failure rollback path (verifies both INSERT and DELETE ran against the same name).
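The Finding B rollback has two failure paths, an exception and a structured `success=False`. A minimal sketch with the tombstone and dissolve operations passed in as callables; the function name and signature are hypothetical, and the real route additionally translates failures into an HTTPException.

```python
def dissolve_with_tombstone(name, record_tombstone, clear_tombstone, dissolve):
    """Write the tombstone, run the dissolve, and roll the tombstone back on failure."""
    record_tombstone(name)
    try:
        result = dissolve(name)
    except Exception:
        # Exception path: the ontology still exists, so the removal
        # intent was not carried out. Clear and re-raise.
        clear_tombstone(name)
        raise
    if not result.get("success"):
        # Structured-failure path: same reasoning, no exception raised.
        clear_tombstone(name)
    return result
```

On success the tombstone stays in place; on either failure path it is removed, so a later ingest into the still-existing ontology is not rejected as TOMBSTONED.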

aaronsb added 4 commits May 22, 2026 20:56

aaronsb added 3 commits May 24, 2026 00:22

aaronsb merged commit 01cf591 into main May 24, 2026
3 checks passed

aaronsb merged 8 commits into main from fix/issue-402-annealing-ingestion-race
