fix(hnsw): de-duplicate dead-node UIDs on vector delete #9756
Open
shaunpatterson wants to merge 1 commit into
When a vfloat (vector) value is deleted, addIndexMutations records the uid in the HNSW dead-node list: a single posting (DataKey(<pred>__vector_dead, 1)) holding a JSON array that is read, appended to, and re-marshalled in full on every delete.
The uid was appended unconditionally, so a uid that is already recorded as dead (delete/reinsert churn, or a Raft re-apply of the same mutation) was appended again, growing the posting with duplicates and re-marshalling the whole blob for no effect. removeDeadNodes de-dupes on read, so the duplicates were invisible but still bloated the posting and its parse.
Skip the rewrite when the uid is already present (new addDeadNode helper). Behaviour is unchanged for the first delete of a uid; repeated deletes become a no-op instead of appending a duplicate.
This does not address the deeper design issue that the dead list grows unbounded with distinct deletions and is never garbage-collected; that needs the index.Remove() path the existing TODO calls for and is left as-is.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
When a vfloat (vector) value is deleted, addIndexMutations records the uid in the HNSW dead-node list (posting/index.go). That list is a single posting, DataKey(<pred>__vector_dead, 1), holding a JSON array, which is read, appended to, and re-marshalled in full on every delete.
The uid is appended unconditionally, so a uid already recorded as dead (delete/reinsert churn, or a Raft re-apply of the same mutation) is appended again. That grows the posting with duplicates and re-marshals the whole blob for no effect.
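To make the cost concrete, here is a minimal sketch of the append-always pattern described above. It is not the code from posting/index.go: the function name appendDeadNodeUnconditionally is hypothetical, and a plain byte slice stands in for the posting that holds the JSON array.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// appendDeadNodeUnconditionally is a hypothetical stand-in for the old
// behaviour: decode the whole JSON array, append the uid, re-encode it.
// The real code reads and writes a posting; here a byte slice plays that role.
func appendDeadNodeUnconditionally(blob []byte, uid uint64) ([]byte, error) {
	var dead []uint64
	if len(blob) > 0 {
		if err := json.Unmarshal(blob, &dead); err != nil {
			return nil, err
		}
	}
	dead = append(dead, uid) // no membership check, so repeats accumulate
	return json.Marshal(dead)
}

func main() {
	var blob []byte
	// The same uid is deleted three times, e.g. through delete/reinsert churn.
	for i := 0; i < 3; i++ {
		var err error
		if blob, err = appendDeadNodeUnconditionally(blob, 42); err != nil {
			panic(err)
		}
	}
	fmt.Println(string(blob)) // prints [42,42,42]
}
```

Each repeat costs a full decode and encode of the array and leaves it one element longer, even though the set of dead uids has not changed.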
removeDeadNodes de-dupes on read (it builds a set), so the duplicates were functionally invisible but still bloated the posting and its parse cost.
Fix
Skip the rewrite when the uid is already present, via a small addDeadNode helper that reports whether the list changed.
Behaviour is unchanged for the first delete of a uid; repeated deletes become a no-op instead of appending a duplicate and rewriting the posting.
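The shape of such a helper might look like the sketch below. Only the name addDeadNode and the "reports whether the list changed" contract come from the description; the signature, the []uint64 element type, and the recordDead caller are assumptions for illustration, and the actual diff may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addDeadNode is a sketch of the helper, with an assumed signature.
// It returns the (possibly extended) list and whether it changed.
func addDeadNode(dead []uint64, uid uint64) ([]uint64, bool) {
	for _, d := range dead {
		if d == uid {
			return dead, false // already recorded: nothing to do
		}
	}
	return append(dead, uid), true
}

// recordDead is a hypothetical caller: it decodes the JSON array, calls
// addDeadNode, and re-encodes only when the list actually changed.
func recordDead(blob []byte, uid uint64) ([]byte, bool, error) {
	var dead []uint64
	if len(blob) > 0 {
		if err := json.Unmarshal(blob, &dead); err != nil {
			return nil, false, err
		}
	}
	dead, changed := addDeadNode(dead, uid)
	if !changed {
		return blob, false, nil // skip the re-marshal and the write
	}
	out, err := json.Marshal(dead)
	if err != nil {
		return nil, false, err
	}
	return out, true, nil
}

func main() {
	var blob []byte
	// First insert, distinct append, then a repeat of the first uid.
	for _, uid := range []uint64{1, 2, 1} {
		var changed bool
		var err error
		if blob, changed, err = recordDead(blob, uid); err != nil {
			panic(err)
		}
		fmt.Printf("uid=%d changed=%v list=%s\n", uid, changed, blob)
	}
}
```

The linear scan keeps the sketch simple; since the whole array is decoded on every delete anyway, the membership check adds no extra read. The third call reports changed=false and returns the blob untouched, which is the no-op case.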
Scope
This intentionally does not address the deeper design issue that the dead list grows unbounded with distinct deletions and is never garbage-collected. That needs the index.Remove() path the existing "// TODO look into better alternatives" comment calls for, and is left as-is.
Test
TestAddDeadNode covers first-insert, distinct-append, and already-present (no-op) cases.
go test ./posting/: pass
go vet / gofmt: clean (no new findings)
🤖 Generated with Claude Code