Compaction: eliminating disk scan and updating PQ retrain#691

Open
dian-lun-lin wants to merge 1 commit into main from compaction-perf-fix

dian-lun-lin commented Jul 2, 2026

Contributor

computeLayerInfoFromSources called getNodes(0) on each source graph to count live nodes at level 0. getNodes(0) sequentially seeks through every node record on disk to filter out deleted entries. On a cold page cache this touches large amounts of source data before compaction even begins, significantly delaying the start of actual graph merging.

Since every live node is present at level 0 by the HNSW invariant, the count is simply liveNodes.get(s).cardinality() — an in-memory popcount requiring no I/O.
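
As a rough illustration of the counting change, the sketch below uses `java.util.BitSet` as a stand-in for the per-source live-node bitmaps. The actual bitset type behind `liveNodes` is not shown in this PR, and `LiveNodeCount`, `countLive`, and `countByScan` are hypothetical names. The loop stands in for a per-record scan that checks each ordinal; the popcount version gets the same answer from the bitmap alone.

```java
import java.util.BitSet;
import java.util.List;

final class LiveNodeCount {
    // Popcount approach: one in-memory call per source, no per-node work.
    // liveNodes is assumed to hold one bitmap per source graph.
    static int countLive(List<BitSet> liveNodes, int source) {
        return liveNodes.get(source).cardinality();
    }

    // Scan-style count for comparison: visits every ordinal and skips deleted ones.
    // In the on-disk case each visit would correspond to reading a node record.
    static int countByScan(BitSet live, int totalOrdinals) {
        int count = 0;
        for (int ord = 0; ord < totalOrdinals; ord++) {
            if (live.get(ord)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(8);
        live.set(0, 8);  // ordinals 0..7 start out live
        live.clear(3);   // mark ordinals 3 and 5 as deleted
        live.clear(5);
        System.out.println(countLive(List.of(live), 0));
        System.out.println(countByScan(live, 8));
    }
}
```

Both methods agree on the count; the point of the change is that the first one never has to visit individual nodes.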

Also switch PQ retraining from ProductQuantization.compute() (full k-means++ init) to basePQ.refine() (Lloyd's iterations only, warm-started from the existing codebook). The source codebooks are already trained on the same distribution, so warm-starting converges in far fewer passes. Recall is effectively unchanged: across the 18 runs below, fix recall differs from main recall by at most 0.0054 in either direction.
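
For readers unfamiliar with the distinction, here is a generic sketch of warm-started Lloyd's iterations for a single set of centroids. It is not the `ProductQuantization.refine()` implementation, whose internals are not shown in this PR; `WarmStartLloyd`, `refine`, and `nearest` are hypothetical names, and a real PQ would run this per subspace. The key point is that there is no seeding step: the existing centroids are copied and then only the assign/update loop runs.

```java
final class WarmStartLloyd {
    /** Runs Lloyd's iterations starting from existing centroids; returns a refined copy. */
    static float[][] refine(float[][] points, float[][] start, int iterations) {
        int k = start.length;
        int dim = start[0].length;
        float[][] centroids = new float[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = start[c].clone();  // warm start: reuse the old codebook
        }
        for (int it = 0; it < iterations; it++) {
            float[][] sums = new float[k][dim];
            int[] counts = new int[k];
            // Assignment step: each point goes to its nearest centroid.
            for (float[] p : points) {
                int best = nearest(p, centroids);
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue;  // leave an empty cluster's centroid where it was
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    /** Index of the centroid with the smallest squared Euclidean distance to p. */
    static int nearest(float[] p, float[][] centroids) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            float dist = 0f;
            for (int d = 0; d < p.length; d++) {
                float diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

When the starting centroids are already close to the cluster means, as is assumed here for codebooks trained on the same distribution, a small `iterations` value is typically enough.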

| Dataset | Workload | main cold | fix cold | Δ cold | main recall | fix recall |
|---|---|---:|---:|---:|---:|---:|
| cohere-1M | 2-TIERED | 2m56s | 1m07s | −62% | 0.6099 | 0.6091 |
| cohere-1M | 2-UNIFORM | 2m58s | 1m05s | −63% | 0.6199 | 0.6225 |
| cohere-1M | 4-UNIFORM | 3m06s | 1m10s | −63% | 0.6247 | 0.6257 |
| cohere-10M | 2-TIERED | 31m12s | 13m37s | −56% | 0.5647 | 0.5681 |
| cohere-10M | 2-UNIFORM | 27m23s | 6m31s | −76% | 0.5665 | 0.5699 |
| cohere-10M | 4-UNIFORM | 32m40s | 7m14s | −78% | 0.5698 | 0.5714 |
| cap-1M | 2-TIERED | 3m15s | 1m20s | −59% | 0.6672 | 0.6645 |
| cap-1M | 2-UNIFORM | 3m15s | 1m17s | −60% | 0.6683 | 0.6701 |
| cap-1M | 4-UNIFORM | 3m50s | 1m32s | −60% | 0.6739 | 0.6715 |
| cap-6M | 2-TIERED | 17m52s | 4m28s | −75% | 0.6303 | 0.6307 |
| cap-6M | 2-UNIFORM | 17m39s | 4m16s | −76% | 0.6375 | 0.6346 |
| cap-6M | 4-UNIFORM | 19m02s | 5m44s | −70% | 0.6405 | 0.6407 |
| dpr-gemma-1m | 2-TIERED | 2m49s | 56s | −67% | 0.7794 | 0.7810 |
| dpr-gemma-1m | 2-UNIFORM | 2m46s | 49s | −71% | 0.7770 | 0.7777 |
| dpr-gemma-1m | 4-UNIFORM | 2m55s | 1m06s | −63% | 0.7858 | 0.7858 |
| dpr-gemma-10m | 2-TIERED | 25m55s | 4m03s | −84% | 0.6932 | 0.6928 |
| dpr-gemma-10m | 2-UNIFORM | 26m01s | 4m11s | −84% | 0.6893 | 0.6947 |
| dpr-gemma-10m | 4-UNIFORM | 27m52s | 5m45s | −79% | 0.7048 | 0.7061 |
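
The PR does not state how the Δ cold column is derived. A plausible reading, sketched below as an assumption, is the relative change in wall-clock time, (fix − main) / main, rounded to a whole percent. `ColdDelta`, `parseSeconds`, and `deltaPercent` are hypothetical helpers for checking rows by hand, not part of the benchmark tooling.

```java
final class ColdDelta {
    /** Parses durations written like "2m56s" or "56s" into whole seconds. */
    static int parseSeconds(String text) {
        int m = text.indexOf('m');
        int minutes = (m < 0) ? 0 : Integer.parseInt(text.substring(0, m));
        // Seconds sit between the optional 'm' and the trailing 's'.
        int seconds = Integer.parseInt(text.substring(m + 1, text.length() - 1));
        return minutes * 60 + seconds;
    }

    /** Relative change from main to fix, as a rounded percentage (negative = faster). */
    static long deltaPercent(String main, String fix) {
        double before = parseSeconds(main);
        double after = parseSeconds(fix);
        return Math.round((after - before) / before * 100.0);
    }

    public static void main(String[] args) {
        // cohere-1M 2-TIERED row: 2m56s on main, 1m07s with the fix
        System.out.println(deltaPercent("2m56s", "1m07s"));
    }
}
```

Because the listed times are themselves rounded to the second, a row recomputed this way may differ from the listed value by a percentage point.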

dian-lun-lin marked this pull request as draft July 2, 2026 22:19

github-actions Bot commented Jul 2, 2026

Contributor

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you rebase your branch onto the latest main for regression testing and PR submission?
  • Did you trigger regression testing via Run Bench Main and review results?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?
  • Did you add documentation for this feature to the release notes directory?

If you did not complete any of these, then please explain below.

dian-lun-lin marked this pull request as ready for review July 4, 2026 08:05