Skip to content

fix(hnsw): correct the vector dead-node lifecycle (record on delete, clear on re-insert)#9757

Open
shaunpatterson wants to merge 1 commit into
dgraph-io:mainfrom
shaunpatterson:sp/fix-hnsw-revive-deleted-vector
Open

fix(hnsw): correct the vector dead-node lifecycle (record on delete, clear on re-insert)#9757
shaunpatterson wants to merge 1 commit into
dgraph-io:mainfrom
shaunpatterson:sp/fix-hnsw-revive-deleted-vector

Conversation

@shaunpatterson

Copy link
Copy Markdown
Contributor

Summary

The HNSW vector index maintains a per-predicate dead-node list — a uid array stored in a single posting at DataKey(<pred>__vector_dead, 1) — that removeDeadNodes uses to strip edges pointing at deleted vectors. Its lifecycle has two correctness bugs that this PR fixes.

Bug A — deletes are never recorded

In addIndexMutations, the delete branch reads the deleted value from pl.AllValues(txn.StartTs):

data, _ := pl.AllValues(txn.StartTs)
if info.op == DEL && len(data) > 0 && data[0].Tid == types.VFloatID { /* record dead */ }

But addMutationHelper has already applied the delete before this runs, so AllValues(txn.StartTs) is empty (len(data)==0) and the dead node is never recorded. The value being deleted is already available as info.val and was simply ignored. Net effect: the dead list stays empty, removeDeadNodes is a no-op, and edges to deleted vectors are never cleaned.

Bug B — the list is append-only (never cleared on re-insert)

Re-inserting (re-embedding) a previously deleted uid never removed it from the dead list. Once A is fixed, a re-inserted vector would stay marked dead, get stripped from every neighbour's edge list, and end up orphaned — present but unreachable in search.

This is not hypothetical: a SET that overwrites an existing vector runs addIndexMutations with op=DEL (old value) then op=SET (new value) in a single mutation. So without B, every in-place re-embed would orphan its own node.

Fix

  • Record on delete using info.val (the value actually being deleted).
  • Clear on (re-)insert so a revived uid is no longer treated as dead.
  • Both go through updateDeadNodes, an atomic read-modify-write under the posting's write lock. A multi-edge mutation applies its edges in parallel goroutines sharing one txn (worker/draft.go), and the index rebuilder streams with 16 goroutines — all touching this single posting. The previous unlocked delete-path RMW could lose updates under that concurrency; the locked RMW serializes them and reads the txn's own pending writes, so a delete + re-insert within one mutation nets out correctly.
  • recordDeadNode is idempotent (no duplicate growth on repeated deletes / Raft re-apply).

Tests (posting/index_vector_dead_test.go)

All drive the real AddMutationWithIndexaddIndexMutations path:

  • TestDeletedVectorRecordedAsDead — a vfloat delete records the uid (bug A).
  • TestReinsertedVectorClearedFromDeadList — re-insert clears it (bug B).
  • TestReinsertedVectorIsSearchable — a deleted-then-reinserted vector is reachable in HNSW search (end-to-end).
  • TestDeadNodeConcurrentRecord — concurrent records don't lose updates (passes under -race).
  • TestDeadNodeSameTxnRecordThenClear — record + clear within one txn nets out (covers the SET-overwrite path).

go test ./posting/ ./tok/... passes; gofmt/go vet clean.

Note / future work

The dead list is a single JSON-blob posting. This PR makes its lifecycle correct and concurrency-safe, but the blob still grows O(distinct deletes) and is re-marshalled on each change. A more scalable representation (per-uid postings / a uid bitmap, enabling index.Remove(...)) is left as a follow-up — the existing // TODO ... index.Remove is preserved.

🤖 Generated with Claude Code

…clear on re-insert)

The HNSW vector index keeps a per-predicate "dead-node" list (a uid array at
DataKey(<pred>__vector_dead, 1)) of deleted vectors, which removeDeadNodes uses
to strip edges pointing at them. Two correctness bugs:

(A) Deletes were never recorded. addIndexMutations read the deleted value from
    pl.AllValues(txn.StartTs), but addMutationHelper has already applied the
    delete by then, so the read is empty (len(data)==0) and the dead node is
    never added. The value being deleted is already passed in as info.val and
    was ignored. Net effect: the dead list stays empty and removeDeadNodes is a
    no-op, so edges to deleted vectors are never cleaned up.

(B) The list was append-only: re-inserting (re-embedding) a previously deleted
    uid never cleared it. Once (A) is fixed, a re-inserted vector would stay
    marked dead, be stripped from every neighbour's edge list, and end up
    orphaned — present but unreachable in search. This matters in practice: a
    SET that overwrites an existing vector runs addIndexMutations with op=DEL
    (old value) then op=SET (new value) in one mutation, so without (B) every
    in-place re-embed would orphan its node.

Fix:
- Record on delete using info.val (the value actually being deleted).
- Clear on (re-)insert so a revived uid is no longer treated as dead.
- Both go through updateDeadNodes, an atomic read-modify-write under the
  posting's write lock. A multi-edge mutation applies edges in parallel
  goroutines sharing one txn (worker/draft.go) and the index rebuilder streams
  with 16 goroutines, all touching this single posting; the previous unlocked
  delete-path RMW could lose updates under that concurrency. The locked RMW also
  reads the txn's own pending writes, so a delete+re-insert within one mutation
  nets out correctly.
- recordDeadNode is idempotent (no duplicate growth on repeated deletes / Raft
  re-apply).

Tests: delete records the uid; re-insert clears it; a deleted-then-reinserted
vector stays searchable; concurrent records don't lose updates (-race);
record-then-clear within one txn nets out.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shaunpatterson shaunpatterson requested a review from a team as a code owner June 19, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant