fix: annealing/ingestion race — layered defense (closes #402) #404

Merged
aaronsb merged 8 commits into main from fix/issue-402-annealing-ingestion-race on May 24, 2026
aaronsb commented May 23, 2026

Closes #402.

The annealing manager and the ingestion queue did not coordinate. When annealing dissolved an ontology that still had ingest jobs queued against it, the worker dequeued with a missing target and silently recreated the ontology under a fresh id. The operator saw "submitted," but the content never landed where they asked.

Four atomic commits, one per defect in #402, layered prevent → shrink → recover; then three follow-up commits responding to the review on this PR.

Summary

| Layer | Commit | Defect |
|---|---|---|
| Prevent | 4282768f | A — Annealing vetoes demotion candidates with in-flight ingestion. Queue-aware: queries kg_api.jobs for non-terminal ingestion jobs against each demotion candidate; vetoes structurally, logs the count + job_ids, counts vetoes in the cycle result. Promotions are not vetoed (they don't modify the source ontology). |
| Shrink window | 99cadd46 | C — Per-ontology cadence floors. Migration 065 adds min_ontology_age_epochs=3 and min_ontology_concept_count=5. Both demotion and promotion candidate selection consult them. Migration uses INSERT … ON CONFLICT DO NOTHING so operator-tuned rows stay put; existing keys are deliberately not changed. |
| Recover (loud) | f73737e5 | B1 — Worker fails loudly on missing target. Both ingest routes stamp ontology_existed_at_submit on the job; a new _validate_target_ontology helper raises with a structured, distinct error string when the operator's intent has been invalidated. Job-queue exception handler maps to status='failed' with the message on the error column; the job-list API already surfaces it. |
| Recover (intent) | 7fbe2498 | B2 — Tombstone-aware idempotent recreate. Migration 066 adds kg_api.ontology_tombstones. Operator-delete writes a tombstone; annealing dissolve does not. Worker consults it: missing + tombstoned → fail; missing + no tombstone → recreate with audit log (operator intent overrides background reorganization). |

Review response (PR-404, 5 findings)

Code review surfaced 5 findings; advisor reconciliation found that finding #4 as written would have introduced a worse bug (orphan content into a graph being deleted) unless paired with a worker-side change, and that finding #2 had a secondary footgun (vetoed-at-execute proposals permanently failed instead of soft-skipped). Three follow-up commits address all 5 plus both advisor flags.

| Commit | Findings addressed |
|---|---|
| 8f6f0586 (worker) | #3 restore VANISHED raise; #4 worker checks tombstone unconditionally (the precondition for safely reordering tombstone writes upstream); #5 tombstone-read fallback reassessed — existed_at_submit=True + read-failure now falls through to VANISHED, no longer silently recreates |
| c0cc5f2b (routes) | #1 dissolve writes tombstone (same data-loss class as delete); #4 tombstone is the first graph-mutating step of delete; #6 advisor-flagged — POST /ontology/ clears any prior tombstone so operator-recreate doesn't leave a "create succeeds, ingest fails forever" trap |
| 2a99e555 (executor) | #2 execute_demotion re-checks the queue veto before calling dissolve_ontology (closes the cycle-to-execute gap that's wide in human-approval mode); advisor-flagged footgun — vetoed proposals return retry_later=True and the worker reverts status to 'approved' (soft skip) instead of marking 'failed' (permanent dead-end) |

A residual TOCTOU exists between the executor's veto SELECT and the dissolve commit — closing it fully needs an advisory lock on the ontology name in both job-enqueue and dissolve. Documented inline; the worker's existed_at_submit=True + missing → VANISHED raise is the downstream backstop for that residual window.
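
As a rough illustration of the advisory-lock direction mentioned above (not something this PR implements), the sketch below derives a stable 64-bit key from an ontology name so that job-enqueue and dissolve could both take the same PostgreSQL transaction-level advisory lock. The helper name `advisory_lock_key` and the hashing scheme are assumptions made for this example; only `pg_advisory_xact_lock(bigint)` is an existing PostgreSQL function.

```python
import hashlib

# Hypothetical helper: map an ontology name onto a signed 64-bit integer,
# which is the key type pg_advisory_xact_lock(bigint) accepts.
def advisory_lock_key(ontology_name: str) -> int:
    digest = hashlib.sha256(ontology_name.encode("utf-8")).digest()
    # First 8 bytes, interpreted as a signed big-endian integer.
    return int.from_bytes(digest[:8], "big", signed=True)

# Both code paths would run this inside their own transaction before
# touching the ontology; the lock is released at commit or rollback.
LOCK_SQL = "SELECT pg_advisory_xact_lock(%s)"

key = advisory_lock_key("X")
```

Hashing keeps the key deterministic across processes, which matters because the enqueue path and the dissolve path would run in different workers.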

Definition of done

An operator running a batch ingest concurrent with annealing observes either successful ingestion or a clearly-failed job — never a silent drop.

  • A + executor re-check block the common race upstream at two points (proposal creation and proposal execution).
  • C shrinks the residual window for ontologies still settling.
  • B1 + B2 + worker-unconditional-tombstone turn the surviving cases into observable, structurally distinct failures (vanished / tombstoned / frozen) or honored recreations.
  • Dissolve route + create-clear-tombstone close the operator-initiated paths that B2 left exposed.

Test plan

  • pytest tests/unit/ + tests/api/test_ingest.py + tests/api/test_ontology_routes.py — 568 pass (full suite passes after review-response commits; new tests cover one observable per finding)
  • Migration 065 applied successfully on dev
  • Migration 066 applied successfully on dev
  • Manual smoke: queue an ingest job, force an annealing cycle, observe veto log + no proposal for that ontology
  • Manual smoke: delete an ontology via API, attempt re-ingest, observe status=failed with tombstone error
  • Manual smoke: approve a demotion proposal, enqueue an ingest for the same ontology, run the execution worker; verify the proposal status reverts to 'approved' (not 'failed') and dissolve did not run

Files

  • api/app/services/annealing_manager.py — veto logic + cadence floors
  • api/app/services/proposal_executor.py — execute-time veto re-check + retry_later
  • api/app/launchers/annealing.py, api/app/workers/annealing_worker.py — option plumbing
  • api/app/workers/ingestion_worker.py — _validate_target_ontology (unconditional tombstone check, VANISHED restored) + _ontology_tombstone
  • api/app/workers/proposal_execution_worker.py — retry_later soft-skip
  • api/app/routes/ingest.py — stamp ontology_existed_at_submit
  • api/app/routes/ontology.py — tombstone helpers (_record_ontology_tombstone, _clear_ontology_tombstone); delete/dissolve write tombstone before mutation; create clears tombstone
  • schema/migrations/065_annealing_cadence_floors.sql
  • schema/migrations/066_ontology_tombstones.sql
  • tests/unit/services/test_annealing_manager.py, tests/unit/services/test_proposal_executor.py, tests/unit/workers/test_ingestion_worker.py, tests/unit/workers/test_proposal_execution_worker.py, tests/api/test_ontology_routes.py

aaronsb added 4 commits May 22, 2026 20:56
… A)

Annealing's candidate selection was graph-state-aware but queue-state-
unaware. When the worker dissolved an ontology while operator-submitted
ingest jobs were queued against it, those jobs dequeued with a missing
target and silently never landed — accepted-then-dropped data loss.

Before proposing any demote / merge / decompose / dissolve mutation
against ontology X, the cycle now consults kg_api.jobs for non-terminal
ingestion jobs targeting X and vetoes that candidate this cycle (skip,
not defer) when any exist. Vetoes are logged structurally and counted in
the cycle result so they're observable, not silent.

Promotions are not vetoed: they create a new ontology and do not modify
the source in a way that invalidates queued ingest jobs against it.
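
The veto described in this commit message can be sketched as a pure function over a snapshot of the jobs table. This is a minimal illustration, not the code in annealing_manager.py: the `Job` shape, the function name `split_vetoed`, and the `"running"` status are assumptions (the review only confirms that `'pending'` is on the non-terminal list).

```python
from dataclasses import dataclass

# Assumed non-terminal statuses; only 'pending' is confirmed by the review.
NON_TERMINAL = {"pending", "running"}

@dataclass(frozen=True)
class Job:
    job_id: str
    job_type: str
    status: str
    ontology: str

def split_vetoed(candidates, jobs):
    """Return (allowed, vetoed); vetoed maps ontology name -> blocking job ids."""
    blocking = {}
    for job in jobs:
        if (job.job_type == "ingestion"
                and job.status in NON_TERMINAL
                and job.ontology in candidates):
            blocking.setdefault(job.ontology, []).append(job.job_id)
    # A vetoed candidate is skipped for this cycle only, not deferred.
    allowed = [name for name in candidates if name not in blocking]
    return allowed, blocking

jobs = [
    Job("j1", "ingestion", "pending", "X"),
    Job("j2", "ingestion", "completed", "Y"),
    Job("j3", "restore", "pending", "Z"),
]
allowed, vetoed = split_vetoed(["X", "Y", "Z"], jobs)
```

Returning the blocking job ids alongside the allowed list is what makes the veto loggable and countable rather than silent.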
… C)

Shipping cadence was too aggressive: cycles evaluated brand-new and
near-empty ontologies before they accumulated enough signal to be judged
fairly, wasting LLM calls and widening the ingestion race window that
Defect A blocks.

Two new floors gate per-ontology cycle eligibility (migration 065):

  min_ontology_age_epochs    = 3 — ontology must exist ≥3 epoch ticks
                                   before annealing can judge it
  min_ontology_concept_count = 5 — ontology must hold ≥5 concepts before
                                   annealing can judge it

Both demotion and promotion candidate selection consult the floors —
high-degree concepts in a sparse or brand-new source ontology haven't
earned an evaluation as natural new nuclei either.

Migration uses INSERT ... ON CONFLICT DO NOTHING so an operator-tuned
row is never overwritten. Existing keys (epoch_interval,
demotion_threshold, promotion_min_degree, max_proposals) are
intentionally not changed here.
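
A minimal sketch of the eligibility gate the two floors imply. The option names and default values come from migration 065 as described above; the `CadenceFloors` dataclass, the `is_eligible` function, and the way age is computed from epoch counters are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CadenceFloors:
    min_ontology_age_epochs: int = 3
    min_ontology_concept_count: int = 5

def is_eligible(created_epoch, current_epoch, concept_count,
                floors=CadenceFloors()):
    """True when the ontology is old enough and dense enough to be judged."""
    age = current_epoch - created_epoch
    return (age >= floors.min_ontology_age_epochs
            and concept_count >= floors.min_ontology_concept_count)
```

Both floors must pass; an old but near-empty ontology and a dense but brand-new one are each skipped.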
…B1)

If annealing dissolved a target ontology between job submit and execute,
the worker silently called ensure_ontology_exists, which recreated the
ontology under a fresh ontology_id. The operator saw "submitted" but
their content never landed under the namespace they targeted — silent
operator-visible data loss.

Both ingest routes now stamp ontology_existed_at_submit on the job. The
worker reads it through a new _validate_target_ontology helper:

- existed_at_submit=True + node missing → loud raise with a distinct
  "vanished mid-flight" error string. The job-queue exception handler
  marks the job status='failed' with the message on the job's error
  column, which the job-list API already surfaces.
- existed_at_submit=False + node missing → first-ever ingest, create
  it (that IS the operator's intent).
- node present + frozen → existing ADR-200 Phase 2 frozen-rejection.

Distinct error strings (ONTOLOGY_VANISHED_MID_FLIGHT_ERROR vs.
ONTOLOGY_FROZEN_ERROR) so operators and Defect B2's tombstone path can
distinguish the failure modes. Pre-B1 jobs without the flag default to
the safer "vanished" behavior.
…B2)

B1 made every missing-target ingest fail loudly. That over-corrected:
when annealing dissolves an ontology between queue and execute, the
operator-submitted ingest IS active operator intent that should override
the background reorganization. Failing it strands data that the operator
explicitly told us to write.

B2 narrows the loud-fail trigger to a positive operator-intent signal:
ontology tombstones (migration 066). The operator-delete route writes a
tombstone row; annealing's dissolve_ontology path does not. The
ingestion worker consults the tombstone when its target is missing:

  missing + tombstoned   → loud fail with ONTOLOGY_TOMBSTONED_ERROR
                           ("deliberately removed by an operator")
  missing + no tombstone → recreate via ensure_ontology_exists with an
                           audit log line naming job_id, actor, and
                           existed_at_submit so the recreate event is
                           traceable

Three distinct error strings (tombstoned / frozen / vanished) so the
job-list API surfaces structurally different failure modes that an
operator needs to distinguish. The tombstone-read failure path is
deliberately tolerant — falling back to recreate is recoverable;
failing every ingest when one query fails is not. Defect A's queue-veto
remains the upstream safety layer.

aaronsb commented May 23, 2026

Review: layered defense (#402)

What this changes: queue-aware veto (A), per-ontology cadence floors (C), and tombstone-aware loud-fail/recreate (B1+B2). Defense is well-layered and the commit-by-commit decomposition tells the story cleanly. Test coverage on the new logic looks solid.

The review below focuses on findings that change risk under load.


1. Operator-initiated dissolve bypasses the tombstone (load-bearing gap)

Location: api/app/routes/ontology.py:1740-1794 (POST /ontology/{name}/dissolve)

The commit message for B2 frames the distinction as "operator-delete writes a tombstone; annealing dissolve does not." But the operator-facing dissolve_ontology route is also operator intent and is also a missing-target producer, and it does not write a tombstone. Sequence:

  1. Operator A queues ingest job against X.
  2. Operator B calls POST /ontology/X/dissolve (intentional, deliberate).
  3. Worker dequeues A's job, target missing, no tombstone → recreate path runs with audit log line. Content lands under a fresh ontology_id named X that Operator B explicitly intended to retire.

This is the same silent-recreate failure mode #402 calls out, just triggered via the dissolve endpoint instead of the dissolve algorithm. Recommend writing a tombstone in this route too (with reason="operator-initiated dissolve via API" for distinguishability). The A veto does not cover this path — A only consults kg_api.jobs from within the annealing manager.


2. Tombstone write and graph delete are not atomic

Location: api/app/routes/ontology.py:1147-1184

delete_ontology_node runs first (its own implicit transaction); the tombstone INSERT runs afterward on a separately checked-out connection. Window:

  1. T0: delete_ontology_node commits — ontology gone from graph.
  2. T1: worker on another process dequeues, calls _validate_target_ontology, queries ontology_tombstones, finds none.
  3. T2: worker takes the recreate path.
  4. T3: route writes tombstone.

This is narrow but real. Two structural fixes worth considering: (a) wrap the graph delete and tombstone insert in a single connection/transaction so both succeed or both fail; (b) reorder — write the tombstone first (it costs nothing if the delete fails, and a stranded tombstone is recoverable by removing the row, whereas a stranded "ontology gone, no tombstone" window is the exact race that needs closing).

Option (b) is the simpler fix.

3. Proposal executor does not re-check the queue veto

Location: api/app/services/proposal_executor.py:166-201

The A veto blocks proposal creation, but execute_demotion only re-validates "ontology still exists" and "not pinned/frozen." Approved proposals that pre-date a newly-enqueued ingest job will still dissolve. With operator-approved (non-autonomous) cycles this can sit pending for arbitrary time.

Recommend execute_demotion re-run the same _get_inflight_ingestion_targets({name}) check before calling dissolve_ontology and return {success: False, error: "in-flight ingestion blocking demotion"} if any rows surface. In autonomous mode the window is short; in human-approval mode it's wide-open.

4. ONTOLOGY_VANISHED_MID_FLIGHT_ERROR is dead code after B2

Location: api/app/workers/ingestion_worker.py:30-37, 95-147

After B2, _validate_target_ontology raises only ONTOLOGY_TOMBSTONED_ERROR or ONTOLOGY_FROZEN_ERROR. The "vanished" string is defined and the comment claims it surfaces on "legacy/pre-B2 jobs where ontology_existed_at_submit was True but the tombstone path was not checked" — but I don't see a code path that emits it. The existed_at_submit parameter survived only as an audit-log field.

Two acceptable resolutions: delete the constant + the parameter and document that B2 supersedes B1's loud-fail semantics; or restore B1's check for existed_at_submit=True + missing + no tombstone and raise ONTOLOGY_VANISHED_MID_FLIGHT_ERROR instead of recreating in that specific subset (preserving B1's loud-fail for the population where the queue-veto failed but the operator never deliberately deleted). The latter is more conservative — the recreate path is only triggered when both queue-veto (A) and tombstone-check (B2) failed to fire, which is the worst signal-to-noise regime to silently recover from.

5. Tombstone-read failure tolerance (judgment call, not a defect)

Location: api/app/workers/ingestion_worker.py:81-89

The "log loudly, fall back to recreate" stance is defensible given the queue-veto is the upstream layer. If finding #3 lands and the veto becomes truly upstream-effective, this is fine as-is. If #3 does not land, this fallback is the third silent-recreate route in the system, after #1 and the executor-bypass in #3.

6. Tested defaults match migration defaults

  • A: queue-veto SELECT uses ANY(%s) over status + ontology lists, single statement, atomic snapshot of the jobs table. Good. Note status = 'pending' is on the non-terminal list — matches the initial INSERT in job_queue.enqueue.
  • C: launcher defaults (3 / 5) match migration 065 seed values and worker job_data.get(..., 3/5) fallbacks — three places stay in sync.
  • B1: route default True for ontology_existed_at_submit matches worker default True — pre-B1 jobs default to the safer behavior, as advertised.

7. Migration idempotency

  • 065: INSERT … ON CONFLICT (key) DO NOTHING + INSERT INTO schema_migrations … ON CONFLICT (version) DO NOTHING. Re-runnable. Reversible by hand (DELETE FROM annealing_options WHERE key IN (...)).
  • 066: CREATE TABLE IF NOT EXISTS + same migrations guard. Re-runnable. Drop is straightforward.
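
The re-runnable seed pattern noted for migration 065 can be demonstrated with a small stand-in. The sketch below uses Python's built-in sqlite3 (which accepts the same `ON CONFLICT (key) DO NOTHING` clause) rather than the project's PostgreSQL database; the `value` column and its text type are assumptions, and only the table name and key names come from the review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annealing_options (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
)

SEED = """
INSERT INTO annealing_options (key, value) VALUES (?, ?)
ON CONFLICT (key) DO NOTHING
"""

def apply_seed(conn):
    # Seeds defaults; rows that already exist are left untouched.
    conn.executemany(SEED, [
        ("min_ontology_age_epochs", "3"),
        ("min_ontology_concept_count", "5"),
    ])

apply_seed(conn)
# An operator tunes one value after the first run.
conn.execute(
    "UPDATE annealing_options SET value = '10' "
    "WHERE key = 'min_ontology_age_epochs'"
)
apply_seed(conn)  # Re-running the migration must not overwrite the tuning.
rows = dict(conn.execute("SELECT key, value FROM annealing_options"))
```

After the second run the operator-tuned value survives and the untouched key keeps its seeded default.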

8. Test file size

tests/unit/services/test_annealing_manager.py is 784 lines but exclusively test functions across nine logical groupings. Test files trade breadth for cohesion — flagging only because you asked. No action recommended.


Assessment

Defense-in-depth structure is correct, the commit decomposition is exemplary, and the per-defect tests carry their weight. Three findings would change behavior under load:

Findings #2 (tombstone race window) and #5 (read-failure fallback) are smaller and depend on whether #3 lands as proposed.


AI-assisted review via Claude

aaronsb added 3 commits May 24, 2026 00:22
…D raise (#402, #404 review)

PR-404 review findings #3, #4, #5: the previous worker gated tombstone
lookup on "ont_node is None", which meant the operator-delete route
could only safely write its tombstone *after* delete_ontology_node
committed — a worker dequeue in that window saw a missing ontology
with no tombstone yet and silently recreated. Reordering the route to
write the tombstone first would shift the bug, not close it: in the
new window (tombstone present, node still present) the worker would
proceed to write content into a graph the operator is removing.

Close both windows by checking the tombstone *before* the graph node.
A tombstone present in either window — before or after the graph
mutation — fails the in-flight ingest with ONTOLOGY_TOMBSTONED_ERROR
rather than racing.

Also restore the VANISHED raise that B2 made dead code. With Defect
A's queue veto and the proposal-executor re-check (separate commit)
in place, existed_at_submit=True + missing target should not happen
under normal operation. Reaching that branch indicates a real
anomaly (rename without job migration, manual surgery, residual
race) — surface it with a distinct error string rather than silently
recreating. The recreate branch narrows to existed_at_submit=False
(first-ever ingest into a new name), which IS positive operator
intent.

Tombstone-read failure on a missing target now falls through to
VANISHED rather than to silent recreate when existed_at_submit=True,
which is the correct fallback when we cannot positively rule out an
operator delete.

Tests updated: the "missing without tombstone triggers recreate" case
is now "missing after existing raises VANISHED"; a new test covers
the unconditional check (tombstone present + node still present →
TOMBSTONED); the tombstone-read-failure path splits by
existed_at_submit (True → VANISHED, False → create).
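
The decision order this commit describes can be summarized as a pure function. This is a sketch of the branching only, not the body of _validate_target_ontology: the `Outcome` enum and the `decide` signature are made up for the example, and treating a failed tombstone read as "no tombstone" when the node is still present is an assumption the commit message does not spell out.

```python
from enum import Enum

class Outcome(Enum):
    PROCEED = "proceed"
    CREATE = "create"
    TOMBSTONED = "tombstoned"
    FROZEN = "frozen"
    VANISHED = "vanished"

def decide(tombstoned, node_exists, frozen, existed_at_submit=True):
    """tombstoned is True/False, or None when the tombstone read failed."""
    # Tombstone is checked first, whether or not the node is still present.
    if tombstoned is True:
        return Outcome.TOMBSTONED
    if node_exists:
        return Outcome.FROZEN if frozen else Outcome.PROCEED
    # Target is missing. A job that saw the ontology at submit time must not
    # silently recreate it, including when the tombstone read failed.
    if existed_at_submit:
        return Outcome.VANISHED
    # First-ever ingest into a new name: creating it is the operator's intent.
    return Outcome.CREATE
```

The default `existed_at_submit=True` mirrors the stated behavior for jobs that pre-date the flag.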
…lve; clear on recreate (#402, #404 review)

PR-404 review findings #1, #4 (advisor-revised), #6 (advisor):

#1 — Operator-initiated POST /ontology/{name}/dissolve previously
wrote no tombstone. Same data-loss class as operator-delete: a
racing ingest against the dissolved name would silently recreate it.
Dissolve now pre-flights existence + lifecycle in the route (so a
refused dissolve doesn't leave a stale tombstone), then writes the
tombstone, then calls dissolve_ontology. Annealing's dissolve path
still does NOT write a tombstone — that asymmetry is what
distinguishes "deliberately removed by operator" (loud-fail) from
"absorbed by background reorganization" (recoverable).

#4 — Reordering the delete route's tombstone write is only safe in
combination with the unconditional tombstone check now in the worker.
With that worker change in place, the tombstone is written as the
first graph-mutating step of the delete (before delete_ontology_node
and the source/embedding cascade). A worker dequeue at any point
during the multi-step delete sees the tombstone and fails the ingest
TOMBSTONED rather than racing into a graph the operator is removing.

#6 — Once tombstones are checked unconditionally, an operator who
deletes X and then re-creates X via POST /ontology/ would otherwise
hit a permanent "create succeeded, ingest fails forever" trap. The
create route now clears any existing tombstone for the same name —
explicit operator intent to revive supersedes the prior removal
intent.

Extracted two helpers (_record_ontology_tombstone, _clear_ontology_
tombstone) so the delete, dissolve, and create routes share a single
implementation. Tombstone-write failures are logged but not raised:
the tombstone is defense-in-depth on top of the queue veto + the
worker's distinct-error raises, and aborting the operator's
requested operation because the tombstone INSERT failed would be a
worse failure mode.

Tests: dissolve route now requires get_ontology_node to return an
active node before dissolve_ontology runs; a new test verifies the
dissolve tombstone is written and references the absorption target; a
new TestDeleteOntologyTombstone class verifies tombstone-before-
delete_ontology_node ordering using a side-effect assertion on the
helper; a new create-route test verifies tombstone clearing.
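
To make the ordering concrete, here is a small in-memory sketch of the delete and create routes after this commit. `FakeStore` and the three functions are stand-ins invented for the example; the real helpers are _record_ontology_tombstone and _clear_ontology_tombstone, whose signatures are not shown in this PR text.

```python
import logging

log = logging.getLogger("ontology_routes_sketch")

class FakeStore:
    """In-memory stand-in for the graph plus the tombstone table."""
    def __init__(self):
        self.nodes = set()
        self.tombstones = {}
        self.events = []

def record_tombstone(store, name, reason):
    # Failures are logged, not raised: the tombstone is defense-in-depth
    # and should not abort the operator's requested operation.
    try:
        store.tombstones[name] = reason
        store.events.append(("tombstone", name))
    except Exception:
        log.exception("tombstone write failed for %s", name)

def delete_ontology(store, name):
    record_tombstone(store, name, "operator delete")  # first mutating step
    store.nodes.discard(name)
    store.events.append(("delete_node", name))

def create_ontology(store, name):
    store.tombstones.pop(name, None)  # explicit revive supersedes removal
    store.nodes.add(name)
    store.events.append(("create_node", name))

store = FakeStore()
store.nodes.add("X")
delete_ontology(store, "X")
```

Because the worker checks the tombstone before the graph node, a dequeue at any point after the first step of `delete_ontology` fails the ingest instead of racing the delete.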
…later soft-skip (#402, #404 review)

PR-404 review finding #2 + advisor flag:

The annealing manager already vetoes demotion candidates with
in-flight ingestion at *proposal creation* (Defect A). In autonomous
mode the cycle-to-execute gap is small; in human-approval mode an
approved proposal can sit for minutes-to-hours between creation and
execution. A new ingest enqueued in that window would otherwise be
silently dissolved out from under the operator. Re-check the queue
at execute time, before calling dissolve_ontology.

The advisor caught the secondary failure mode: the existing worker
treats any executor return with success=False as 'failed' and writes
that status to annealing_proposals. A vetoed-at-execute proposal
would therefore be permanently dead, requiring re-approval after the
queue clears — the queue veto becomes a footgun rather than a
defense. Distinguish veto from real failure with a retry_later flag;
the worker reverts the claim to 'approved' (not 'failed') so the
proposal stays alive for the next cycle. Genuine failures (ontology
gone, target invalid, etc.) still mark 'failed' as before.

Residual TOCTOU: the SELECT and the dissolve commit are not atomic —
a job enqueued between them slips through. Closing that fully needs
an advisory lock on the ontology name in both job-enqueue and
dissolve. For now, the ingestion worker's unconditional tombstone
check + existed_at_submit=True + missing target → VANISHED raise
remain the downstream backstop, and the residual window is the path
that the worker-side raise was restored for.

Tests: TestExecuteDemotion adds a vetoed-by-inflight test that asserts
retry_later is True, the error message names the blocking job IDs,
and dissolve_ontology is NOT called. New test_proposal_execution_
worker.py covers the worker-side soft-skip: retry_later=True reverts
status to 'approved' (not 'failed'); genuine failures still mark
'failed'.
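
The soft-skip can be sketched in two small functions, one per side. The result keys `success`, `retry_later`, and `error` follow the commit message; the function signatures, the injected `dissolve` callable, and the `"executed"` status name are assumptions for illustration.

```python
def execute_demotion(name, inflight_job_ids, dissolve):
    """Executor side: re-check the queue veto before dissolving."""
    if inflight_job_ids:
        return {
            "success": False,
            "retry_later": True,
            "error": "in-flight ingestion blocking demotion: "
                     + ", ".join(sorted(inflight_job_ids)),
        }
    dissolve(name)
    return {"success": True}

def next_status(result):
    """Worker side: map an executor result onto the proposal status."""
    if result.get("success"):
        return "executed"   # assumed name for the success status
    if result.get("retry_later"):
        return "approved"   # soft skip: proposal stays alive for a later run
    return "failed"         # genuine failure: permanent

calls = []
vetoed = execute_demotion("X", ["j1"], calls.append)
```

Keeping the veto distinct from a real failure is what lets an approved proposal survive until the queue clears without needing re-approval.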

aaronsb commented May 24, 2026

Second-pass review — fresh eyes on the response

The three response commits address the original findings cleanly and the rationales in the commit messages are tight. The advisor-flagged fixes (worker-unconditional check, clear-on-create, retry_later soft-skip) are exactly the right shape. Two new findings surface from this pass, one of them parallel to #6.


Finding A (request changes): rename does not clear tombstones — same shape as #6

api/app/routes/ontology.py:1287-1362 (rename_ontology) does not interact with kg_api.ontology_tombstones. With the unconditional tombstone check now in the worker and clear-on-create only in POST /ontology/, a routine operator recovery flow dead-ends:

  1. DELETE /ontology/X writes tombstone(X)
  2. Operator decides to repurpose Y as X: POST /ontology/Y/rename {new_name: X}
  3. Rename succeeds at the graph level; tombstone(X) still present
  4. Any future ingest into X fails TOMBSTONED permanently

This is the same class of trap that finding #6 closed for POST /ontology/. The underlying invariant the response is reaching for — any operator action that successfully establishes an ontology under name N supersedes tombstone(N) — is enforced at only one of two entry points.

Suggested fix: after client.rename_ontology_node(...) succeeds (the Source-rename path already committed by then), call _clear_ontology_tombstone(client, request.new_name). Mirror the create-route comment so future readers see the invariant.


Finding B (request changes): dissolve leaves a stale tombstone on partial / TOCTOU failure

api/app/routes/ontology.py:1846-1868: the route now pre-flights existence + lifecycle, writes the tombstone, then calls client.dissolve_ontology. dissolve_ontology in api/app/lib/age_client/ontology_scoring.py:527 has additional failure modes the route pre-flight cannot cover:

  • source-listing query fails (DB error at line 581)
  • reassign_sources returns success=False (line 598)
  • TOCTOU on lifecycle between route pre-flight and dissolve_ontology's own pre-flight

In any of these, the tombstone has committed and the ontology node still exists. The delete route's docstring acknowledges the analogous "force=true on a nonexistent name" case (lines 1099-1104). The dissolve route's docstring (lines 1817-1825) does not — and dissolve failure is closer to "didn't actually happen" than "operator chose to remove non-existent thing," so leaning on the same recovery path is harder to justify.

Two options:

  1. Catch dissolve failure and call _clear_ontology_tombstone on the rollback path. Cleanest semantically (failed dissolve ≠ operator removal intent). Subtlety: if reassign_sources was partially successful before failing, the ontology is in a half-state and clearing the tombstone is arguably also wrong — but it's no worse than the current "block forever" outcome.
  2. Add an explicit docstring note matching delete's pattern and document the manual recovery (delete the row from kg_api.ontology_tombstones).

Option 1 is preferable; option 2 is acceptable if option 1 is judged to add too much branching.


Acknowledged: what the response got right

  • The reasoning in 8f6f0586's commit message for moving the tombstone check before get_ontology_node is exactly the load-bearing argument. The "tombstone present + node still present" test (test_tombstone_present_with_live_node_still_raises) pins the behavior the asymmetric route/worker ordering depends on. Good test surface.
  • Restoring the VANISHED raise + the new split on tombstone-read failure (existed_at_submit=True → VANISHED, False → create) is the correct read of the original review's intent — the response improved on what I originally asked for.
  • retry_later + worker reverting the claim to 'approved' instead of 'failed' is the right shape. The scheduler in api/app/services/job_scheduler.py:272-310 re-dispatches stranded approved proposals with a 5-minute reviewed_at floor on a default 1-hour cleanup cadence; combined with _claim_proposal's WHERE status='approved' atomic guard, this gives bounded retry without thrash and without double-execution. Mention this in the PR body's "retry semantics" if helpful.
  • The recreate-then-ingest sequence is clean: POST /ontology/X clears tombstone(X) before create_ontology_node, and DELETE already calls queue.delete_jobs_by_ontology so there's no pre-delete-queued ingest to dequeue against the recreated namespace.
  • The TOCTOU residual between the executor's veto SELECT and the dissolve commit is honestly documented at lines 225-232 of proposal_executor.py, and the worker's VANISHED raise is the right downstream backstop. The advisory-lock path you've flagged for future work is the correct direction.

Test coverage note (informational, not a finding)

test_demotion_vetoed_by_inflight_ingestion overrides mock_cursor.fetchall on the shared cursor fixture to return job IDs. The assertion shape is sound (retry_later True, dissolve not called, job IDs in error), but the mock accepts any SQL — a regression that filtered by status='completed' or queried the wrong table would still pass. The behavior under test is right; the SQL surface is unverified. Documented inline in this comment for the record; not requesting a change since the SELECT is a short mirror of AnnealingManager._get_inflight_ingestion_targets whose own tests cover the SQL shape.

TestDeleteOntologyTombstone's side-effect ordering assertion (asserts delete_ontology_node.call_count == 0 inside the tombstone-record side effect) is a robust pattern — couldn't pass under reverse ordering at the Python call level.
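
For readers unfamiliar with the side-effect ordering pattern, a minimal standalone version using unittest.mock is sketched below. The route functions and the checker are invented for the example and are not the project's test code; only the idea of asserting `delete_ontology_node.call_count == 0` inside the tombstone-record side effect comes from the review.

```python
from unittest.mock import MagicMock

def delete_route(client, record_tombstone, name):
    record_tombstone(client, name)
    client.delete_ontology_node(name)

def reversed_route(client, record_tombstone, name):
    # Deliberately wrong ordering, used to show the check can fail.
    client.delete_ontology_node(name)
    record_tombstone(client, name)

def check_tombstone_written_first(route):
    client = MagicMock()

    def assert_not_deleted_yet(_client, _name):
        # Runs at the moment the tombstone is recorded.
        assert client.delete_ontology_node.call_count == 0

    record = MagicMock(side_effect=assert_not_deleted_yet)
    route(client, record, "X")
    record.assert_called_once_with(client, "X")
    client.delete_ontology_node.assert_called_once_with("X")
    return True
```

The assertion fires inside the mocked call itself, so it observes the state at that instant rather than reconstructing the order from call logs afterwards.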


Summary

Two real changes requested (rename + dissolve stale-tombstone), parallel shape to fixes the response already accepted. Everything else lands cleanly. The reasoning in the commit messages and inline comments is consistent and load-bearing — the asymmetric route/worker design is well-defended.


AI-assisted second-pass review via Claude Code

… tombstone (#402, #404 review-2)

Second-pass review of PR #404 surfaced two parallel-shape gaps to
fixes already accepted from the first review:

Finding A (rename gap) — POST /ontology/{name}/rename establishes an
ontology under new_name. The unconditional tombstone check + create-
clears-tombstone (advisor finding from review-1) enforces the same
invariant at one entry point: any operator action that establishes
an ontology under name N supersedes a prior tombstone(N). Rename
must do the same, otherwise the delete-then-rename recovery flow
fails: ingest into the renamed name dead-ends with TOMBSTONED. Mirror
the create-route's _clear_ontology_tombstone(new_name) call after a
successful rename.

Finding B (dissolve stale tombstone) — dissolve route writes the
tombstone before calling client.dissolve_ontology(), which has
failure modes beyond the route-level pre-flight (source listing DB
error, partial reassign failure, lifecycle TOCTOU between our get +
its own re-check). On those, the tombstone has committed but the
ontology still exists in the graph. Future ingests would fail
TOMBSTONED against an extant ontology. Clear the tombstone on both
failure paths (exception and structured success=False) before
raising/handling the HTTPException.

Tests: rename-clears-tombstone verifies the DELETE is executed
against the new_name; dissolve_failure_clears_tombstone exercises
the structured-failure rollback path (verifies both INSERT and
DELETE ran against the same name).
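
The rollback described for Finding B can be sketched as follows. The dict standing in for kg_api.ontology_tombstones, the `DissolveFailed` exception, and the function signature are assumptions for this example; the real route raises an HTTPException and calls _clear_ontology_tombstone.

```python
class DissolveFailed(Exception):
    """Stand-in for the HTTPException the real route raises."""

def dissolve_route(tombstones, name, target, dissolve):
    # Tombstone first, so a racing ingest fails instead of recreating.
    tombstones[name] = f"operator-initiated dissolve into {target}"
    try:
        result = dissolve(name, target)
    except Exception:
        # Dissolve did not happen: the removal intent was not realized.
        tombstones.pop(name, None)
        raise
    if not result.get("success"):
        tombstones.pop(name, None)
        raise DissolveFailed(result.get("error", "dissolve failed"))
    return result

def failing_dissolve(name, target):
    return {"success": False, "error": "reassign_sources failed"}

def raising_dissolve(name, target):
    raise RuntimeError("source listing query failed")

def working_dissolve(name, target):
    return {"success": True}
```

Both failure paths clear the tombstone before the error propagates, so a dissolve that never took effect cannot leave later ingests failing against an ontology that still exists.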
@aaronsb aaronsb merged commit 01cf591 into main May 24, 2026
3 checks passed
Development

Successfully merging this pull request may close these issues.

Race condition: annealing dissolves ontologies with in-flight ingestion jobs, causing silent data loss
