Compaction: eliminating disk scan and updating PQ retrain by dian-lun-lin · Pull Request #691 · datastax/jvector

dian-lun-lin · 2026-07-02T22:18:59Z

computeLayerInfoFromSources called getNodes(0) on each source graph to count live nodes at level 0. getNodes(0) sequentially seeks through every node record on disk to filter out deleted entries. On a cold page cache this touches large amounts of source data before compaction even begins, significantly delaying the start of actual graph merging.

Since every live node is present at level 0 by the HNSW invariant, the count is simply liveNodes.get(s).cardinality() — an in-memory popcount requiring no I/O.
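A minimal sketch of the counting change, assuming each source's live set is held as a `java.util.BitSet` (jvector's actual bitmap type and the surrounding method may differ; `LiveNodeCount` and `level0Counts` are hypothetical names used only for illustration):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class LiveNodeCount {
    /**
     * Level-0 node count per source graph, taken directly from the
     * in-memory live-node bitmaps. No node records are read from disk.
     * Assumption: liveNodes.get(s) has one bit set per live node of source s.
     */
    static int[] level0Counts(List<BitSet> liveNodes) {
        int[] counts = new int[liveNodes.size()];
        for (int s = 0; s < liveNodes.size(); s++) {
            // cardinality() is a popcount over the bitmap's backing words
            counts[s] = liveNodes.get(s).cardinality();
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8);
        a.set(0, 8);   // ordinals 0..7 are live
        a.clear(3);    // ordinal 3 was deleted
        BitSet b = new BitSet(4);
        b.set(0, 4);   // ordinals 0..3 are live
        System.out.println(Arrays.toString(level0Counts(List.of(a, b))));
    }
}
```

The result depends only on the bitmaps, so its cost is independent of page-cache state.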

Also switch PQ retraining from ProductQuantization.compute() (full k-means++ init) to basePQ.refine() (Lloyd's iterations only, warm-started from the existing codebook). The source codebooks are already trained on the same distribution, so warm-starting converges in far fewer passes with no meaningful recall loss: in the benchmarks below, recall stays within roughly ±0.006 of main on every workload.
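To illustrate what warm-starting means here, the following is a generic sketch of Lloyd's iterations seeded from an existing codebook. It is not jvector's `refine()` implementation. The class `WarmStartLloyd`, its method signatures, and the use of plain `float[][]` arrays are assumptions made for the example, and it handles a single set of centroids rather than per-subspace PQ codebooks.

```java
import java.util.Arrays;

class WarmStartLloyd {
    /** Index of the centroid closest to p under squared Euclidean distance. */
    static int nearest(float[] p, float[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs a fixed number of Lloyd iterations starting from an existing
     * codebook instead of a fresh k-means++ initialization.
     * The input codebook is copied, not modified.
     */
    static float[][] refine(float[][] points, float[][] initial, int iterations) {
        int k = initial.length;
        int dim = initial[0].length;
        float[][] centroids = new float[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach each point to its nearest centroid
            for (float[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each non-empty centroid to the mean of its points
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its old position
                for (int i = 0; i < dim; i++) {
                    centroids[c][i] = (float) (sums[c][i] / counts[c]);
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        float[][] points = {{0f, 0f}, {0f, 2f}, {10f, 10f}, {10f, 12f}};
        float[][] oldCodebook = {{1f, 1f}, {9f, 9f}}; // already close to the data
        System.out.println(Arrays.deepToString(refine(points, oldCodebook, 1)));
        // prints [[0.0, 1.0], [10.0, 11.0]]
    }
}
```

Because the starting centroids already sit near the clusters, a single pass reaches the fixed point in this toy case; a random or k-means++ start would generally need more passes.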

| Dataset | Workload | main (cold) | fix (cold) | Δ cold | main recall | fix recall |
|---|---|---|---|---|---|---|
| cohere-1M | 2-TIERED | 2m56s | 1m07s | −62% | 0.6099 | 0.6091 |
| cohere-1M | 2-UNIFORM | 2m58s | 1m05s | −63% | 0.6199 | 0.6225 |
| cohere-1M | 4-UNIFORM | 3m06s | 1m10s | −63% | 0.6247 | 0.6257 |
| cohere-10M | 2-TIERED | 31m12s | 13m37s | −56% | 0.5647 | 0.5681 |
| cohere-10M | 2-UNIFORM | 27m23s | 6m31s | −76% | 0.5665 | 0.5699 |
| cohere-10M | 4-UNIFORM | 32m40s | 7m14s | −78% | 0.5698 | 0.5714 |
| cap-1M | 2-TIERED | 3m15s | 1m20s | −59% | 0.6672 | 0.6645 |
| cap-1M | 2-UNIFORM | 3m15s | 1m17s | −60% | 0.6683 | 0.6701 |
| cap-1M | 4-UNIFORM | 3m50s | 1m32s | −60% | 0.6739 | 0.6715 |
| cap-6M | 2-TIERED | 17m52s | 4m28s | −75% | 0.6303 | 0.6307 |
| cap-6M | 2-UNIFORM | 17m39s | 4m16s | −76% | 0.6375 | 0.6346 |
| cap-6M | 4-UNIFORM | 19m02s | 5m44s | −70% | 0.6405 | 0.6407 |
| dpr-gemma-1m | 2-TIERED | 2m49s | 56s | −67% | 0.7794 | 0.7810 |
| dpr-gemma-1m | 2-UNIFORM | 2m46s | 49s | −71% | 0.7770 | 0.7777 |
| dpr-gemma-1m | 4-UNIFORM | 2m55s | 1m06s | −63% | 0.7858 | 0.7858 |
| dpr-gemma-10m | 2-TIERED | 25m55s | 4m03s | −84% | 0.6932 | 0.6928 |
| dpr-gemma-10m | 2-UNIFORM | 26m01s | 4m11s | −84% | 0.6893 | 0.6947 |
| dpr-gemma-10m | 4-UNIFORM | 27m52s | 5m45s | −79% | 0.7048 | 0.7061 |
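
The PR does not state how Δ cold is computed. Assuming it is the relative change in cold-cache wall time, (fix − main) / main, rounded to the nearest percent, the sketch below reproduces the table's values for the rows checked. `ColdDelta` and its methods are hypothetical helpers written for this example.

```java
class ColdDelta {
    /** Parses durations written like "2m56s" or "56s" into seconds. */
    static int toSeconds(String d) {
        int m = d.indexOf('m');
        int minutes = m < 0 ? 0 : Integer.parseInt(d.substring(0, m));
        // Seconds sit between the 'm' (or the start) and the trailing 's'
        int seconds = Integer.parseInt(d.substring(m + 1, d.length() - 1));
        return minutes * 60 + seconds;
    }

    /** Relative change from main to fix, as a rounded percentage. */
    static long deltaPercent(String main, String fix) {
        double a = toSeconds(main);
        double b = toSeconds(fix);
        return Math.round(100.0 * (b - a) / a);
    }

    public static void main(String[] args) {
        System.out.println(deltaPercent("2m56s", "1m07s"));   // cohere-1M 2-TIERED: -62
        System.out.println(deltaPercent("31m12s", "13m37s")); // cohere-10M 2-TIERED: -56
        System.out.println(deltaPercent("2m49s", "56s"));     // dpr-gemma-1m 2-TIERED: -67
    }
}
```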


github-actions · 2026-07-02T22:19:10Z

dian-lun-lin requested review from MarkWolters, ashkrisk, jshook and tlwillke as code owners July 2, 2026 22:19

dian-lun-lin marked this pull request as draft July 2, 2026 22:19

dian-lun-lin marked this pull request as ready for review July 4, 2026 08:05

dian-lun-lin wants to merge 1 commit into main from compaction-perf-fix
