Cooperative embedding by jshook · Pull Request #690 · datastax/jvector

jshook · 2026-07-02T15:44:20Z

This includes some changes we need to make the embedding surface more robust around shared resource usage and scoping. I've shared it here for visibility and will move it from draft status once our integrated testing bears fruit.

This is now rebased on top of Aaron's compaction improvement fixes, so testing this branch covers both sets of changes.

Synopsis

Make the on-disk graph compactor embeddable (cooperative resource sharing). The interface types provided here were shaped to be useful to, and compatible with, an embedding system like Cassandra or OpenSearch, without being overly specific to any one of them. They were also chosen to be compatible with Java 11 onward.

Purpose

OnDiskGraphIndexCompactor currently runs on its own thread pool, is invisible while it works, and always writes to its own file. This PR adds a few small, optional extension points so a host system (e.g. a database's compaction pipeline) can drive the merge cooperatively — on the host's own threads, under the host's observation and throttling, writing straight into the host's own file. Everything is additive and @experimental; with nothing supplied, behavior and output are unchanged.
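
To illustrate what "on the host's own threads" can look like, here is a minimal sketch using only java.util.concurrent. WindowedRunner and runAll are hypothetical names for illustration, not the compactor's API; the sketch only assumes the general idea of an Executor plus a bound on in-flight tasks.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of jvector: runs tasks on any Executor
// while keeping at most windowSize tasks in flight.
final class WindowedRunner {
    static void runAll(Executor executor, int windowSize, List<Runnable> tasks) {
        Semaphore window = new Semaphore(windowSize);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        for (Runnable task : tasks) {
            window.acquireUninterruptibly(); // wait for a free slot in the window
            try {
                executor.execute(() -> {
                    try {
                        task.run();
                    } catch (Throwable t) {
                        failure.compareAndSet(null, t); // remember the first failure
                    } finally {
                        window.release(); // free the slot whether or not the task failed
                    }
                });
            } catch (RejectedExecutionException e) {
                window.release(); // the task never started, so give the slot back
                throw e;
            }
        }
        // Taking every permit means no task is still running.
        window.acquireUninterruptibly(windowSize);
        if (failure.get() != null) {
            throw new IllegalStateException("task failed", failure.get());
        }
    }
}
```

With a caller-runs executor such as Runnable::run, every task executes synchronously on the calling thread, so no extra pool is created. With a real pool, the semaphore bounds how much work is queued at once.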

Key elements

Bring-your-own executor. The compactor now takes any Executor plus an explicit taskWindowSize instead of requiring a ForkJoinPool. Passing a caller-runs executor (Runnable::run) runs the whole merge on the calling thread, so a host reuses its existing compaction threads with no extra pool.
Progress + throttling (ProgressLimiter). One small SPI the host installs to (a) observe merge progress and (b) block/pace the bytes the merge writes against a shared budget — without jvector knowing anything about the host's limiter. Ships with two ready-made, composable implementations:
- a leaky-bucket rate meter
- a logging wrapper
No-copy output (CompactionDestination / compact(Path, startOffset)). Write the compacted graph body directly into the host's container file after a reserved header — with a commit-on-success / discard-on-failure lifecycle — instead of writing a temp file and copying it. SeekableSink is the small primitive used to address that file region.
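
The no-copy idea above can be sketched with a plain FileChannel. This is not the CompactionDestination or SeekableSink implementation; RegionWriter and writeBodyAt are hypothetical names, and the sketch assumes the host has already written its reserved header and that the body region sits at the end of the file, so a failed write can be discarded by truncating back to the start offset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of jvector.
final class RegionWriter {
    /** Writes body at startOffset in an existing file and returns the number of bytes written. */
    static long writeBodyAt(Path container, long startOffset, byte[] body) {
        try (FileChannel ch = FileChannel.open(container,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            try {
                ByteBuffer buf = ByteBuffer.wrap(body);
                long pos = startOffset;
                while (buf.hasRemaining()) {
                    pos += ch.write(buf, pos); // positional write; the header bytes are untouched
                }
                ch.force(true); // commit on success: flush data and metadata
                return pos - startOffset;
            } catch (IOException | RuntimeException e) {
                ch.truncate(startOffset); // discard on failure: drop any partial body
                throw e;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the write is positional, the header region before startOffset is never read or rewritten, which is what removes the temp-file-then-copy step.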

Tests and docs/compaction.md cover the new surface.
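
For readers unfamiliar with leaky-bucket pacing, the rate meter mentioned under Key elements can be approximated as follows. This is a sketch of the general technique, not the ProgressLimiter SPI or the implementation shipped in this PR; the class name, constructor parameters, and reserve method are all assumptions made for illustration. The clock is passed in explicitly so the arithmetic is easy to check.

```java
// Hypothetical sketch of a leaky-bucket meter, not part of jvector.
final class LeakyBucket {
    private final double bytesPerSecond; // drain rate
    private final double capacityBytes;  // burst allowance
    private double level;                // bytes currently in the bucket
    private long lastNanos;              // time of the last update

    LeakyBucket(double bytesPerSecond, double capacityBytes, long startNanos) {
        this.bytesPerSecond = bytesPerSecond;
        this.capacityBytes = capacityBytes;
        this.lastNanos = startNanos;
    }

    /**
     * Records a write of the given size at time nowNanos and returns how many
     * nanoseconds the caller should pause so the bucket drains back to capacity.
     */
    synchronized long reserve(long bytes, long nowNanos) {
        long elapsed = Math.max(0L, nowNanos - lastNanos);
        // Drain the bucket for the time that has passed since the last call.
        level = Math.max(0.0, level - elapsed * bytesPerSecond / 1e9);
        lastNanos = Math.max(lastNanos, nowNanos);
        level += bytes;
        double overflow = level - capacityBytes;
        return overflow <= 0 ? 0L : (long) Math.ceil(overflow * 1e9 / bytesPerSecond);
    }
}
```

A blocking wrapper would call reserve(bytes, System.nanoTime()) before each write and park the thread for the returned duration. Sharing one bucket between the host's other writers and the merge is what gives a single shared budget.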

github-actions · 2026-07-02T15:44:37Z

…uteLayerInfoFromSources

computeLayerInfoFromSources called getNodes(0) on each source graph to count live nodes at level 0. getNodes(0) sequentially seeks through every node record on disk to filter out deleted entries. On a cold page cache this touches large amounts of source data before compaction even begins, significantly delaying the start of actual graph merging. Since every live node is present at level 0 by the HNSW invariant, the count is simply liveNodes.get(s).cardinality(), an in-memory popcount requiring no I/O.

Also switch PQ retraining from ProductQuantization.compute() (full k-means++ init) to basePQ.refine() (Lloyd's iterations only, warm-started from the existing codebook). The source codebooks are already trained on the same distribution, so warm-starting converges in far fewer passes with no recall loss.
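
The popcount shortcut described above can be illustrated with java.util.BitSet standing in for the live-node bitmaps. The actual bitmap type in jvector may differ, and LiveNodeCount is a hypothetical name used only for this sketch.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration, not jvector code: BitSet stands in for the
// per-source live-node bitmap.
final class LiveNodeCount {
    /** Sums the live-node counts of all sources using in-memory popcounts only. */
    static long totalLive(List<BitSet> liveNodes) {
        long total = 0;
        for (BitSet live : liveNodes) {
            total += live.cardinality(); // number of set bits; no disk access
        }
        return total;
    }
}
```

The result matches what a level-0 scan would count whenever every live node appears at level 0, which is the invariant the commit message relies on.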

dian-lun-lin and others added 3 commits July 2, 2026 22:14

support cooperative resource sharing when embedded

aeb61c0

additional refinements for type safety, thread pools, and embedding f…

22d1de3

…lexibility

jshook force-pushed the cooperative-embedding branch from fa4de43 to 22d1de3 Compare July 2, 2026 22:24

jshook wants to merge 3 commits into main from cooperative-embedding
