Cooperative embedding #690

Draft
jshook wants to merge 3 commits into main from cooperative-embedding

jshook (Contributor) commented Jul 2, 2026

This includes some changes we need to make the embedding surface more robust around shared resource usage and scoping. I've shared it here for visibility and will move it from draft status once our integrated testing bears fruit.

This is now rebased on top of Aaron's compaction improvement fixes, so testing this branch covers both sets of changes.

Synopsis

Make the on-disk graph compactor embeddable (cooperative resource sharing). The interface types provided here were chosen to be useful to, and compatible with, an embedding system such as Cassandra or OpenSearch without being overly specific to either. They were also chosen to be compatible with Java 11 onward.

Purpose

OnDiskGraphIndexCompactor currently runs on its own thread pool, is invisible while it works, and always writes to its own file. This PR adds a few small, optional extension points so a host system (e.g. a database's compaction pipeline) can drive the merge cooperatively — on the host's own threads, under the host's observation and throttling, writing straight into the host's own file. Everything is additive and @experimental; with nothing supplied, behavior and output are unchanged.

Key elements

  • Bring-your-own executor. The compactor now takes any Executor plus an explicit taskWindowSize instead of requiring a ForkJoinPool. Passing a caller-runs executor (Runnable::run) runs the whole merge on the calling thread, so a host reuses its existing compaction threads with no extra pool.
  • Progress + throttling (ProgressLimiter). One small SPI the host installs to (a) observe merge progress and (b) block/pace the bytes the merge writes against a shared budget — without jvector knowing anything about the host's limiter. Ships with two ready-made, composable implementations:
    • a leaky-bucket rate meter
    • a logging wrapper
  • No-copy output (CompactionDestination / compact(Path, startOffset)). Write the compacted graph body directly into the host's container file after a reserved header — with a commit-on-success / discard-on-failure lifecycle — instead of writing a temp file and copying it. SeekableSink is the small primitive used to address that file region.
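As a rough illustration of the bring-your-own-executor item above: a caller-runs executor is just `Runnable::run`, and a task window can be enforced by waiting on the oldest in-flight task before submitting another. The `WindowedRunner` class and its `runAll` method are hypothetical names invented for this sketch and are not the compactor's API; only `Executor`, `Runnable::run`, and the notion of a task window come from the description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch; not part of jvector.
class WindowedRunner {
    /**
     * Submits taskCount tasks to the executor, keeping at most windowSize in flight.
     * Each task records the name of the thread it ran on.
     */
    static List<String> runAll(Executor executor, int windowSize, int taskCount) {
        List<String> threadNames = Collections.synchronizedList(new ArrayList<>());
        Queue<CompletableFuture<Void>> inFlight = new ArrayDeque<>();
        for (int i = 0; i < taskCount; i++) {
            if (inFlight.size() >= windowSize) {
                inFlight.remove().join(); // wait for the oldest task before submitting more
            }
            inFlight.add(CompletableFuture.runAsync(
                    () -> threadNames.add(Thread.currentThread().getName()), executor));
        }
        while (!inFlight.isEmpty()) {
            inFlight.remove().join(); // drain whatever is still running
        }
        return threadNames;
    }

    public static void main(String[] args) {
        Executor callerRuns = Runnable::run; // every task executes on the submitting thread
        List<String> names = runAll(callerRuns, 4, 10);
        System.out.println(names.size() + " tasks ran on: " + names.get(0));
    }
}
```

With a caller-runs executor each task completes before `runAsync` returns, so the window never fills and no extra threads are involved. Passing a real pool instead makes the window the only bound on concurrency.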

Tests and docs/compaction.md cover the new surface.
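For readers unfamiliar with the leaky-bucket meter mentioned under ProgressLimiter, here is a minimal sketch of the pacing idea. It is not the implementation in this PR, and `LeakyBucket` and `acquire` are hypothetical names: the real SPI's method signatures are not shown in this description. The clock is injected so the pacing arithmetic can be checked without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch; not jvector's ProgressLimiter.
class LeakyBucket {
    private final double bytesPerNano;   // drain rate
    private final long capacityBytes;    // burst allowance before callers must wait
    private final LongSupplier nanoClock;
    private double level;                // bytes currently held in the bucket
    private long lastNanos;

    LeakyBucket(long bytesPerSecond, long capacityBytes, LongSupplier nanoClock) {
        this.bytesPerNano = bytesPerSecond / 1e9;
        this.capacityBytes = capacityBytes;
        this.nanoClock = nanoClock;
        this.lastNanos = nanoClock.getAsLong();
    }

    /** Records a write of the given size and returns how long the caller should pause, in nanoseconds. */
    synchronized long acquire(long bytes) {
        long now = nanoClock.getAsLong();
        level = Math.max(0.0, level - (now - lastNanos) * bytesPerNano); // drain since last call
        lastNanos = now;
        level += bytes;
        double overflow = level - capacityBytes;
        return overflow <= 0 ? 0L : (long) Math.ceil(overflow / bytesPerNano);
    }

    public static void main(String[] args) throws InterruptedException {
        // 1 MiB/s with a 64 KiB burst allowance, paced against the real clock.
        LeakyBucket bucket = new LeakyBucket(1 << 20, 64 << 10, System::nanoTime);
        for (int chunk = 0; chunk < 3; chunk++) {
            long waitNanos = bucket.acquire(64 << 10);
            System.out.println("chunk " + chunk + " waits " + waitNanos + " ns");
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

Returning a wait time rather than sleeping inside the meter leaves the blocking decision to the host, which is one plausible way to keep such a limiter independent of the host's threading model.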

github-actions bot (Contributor) commented Jul 2, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you rebase your branch onto the latest main for regression testing and PR submission?
  • Did you trigger regression testing via Run Bench Main and review results?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?
  • Did you add documentation for this feature to the release notes directory?

If you did not complete any of these, then please explain below.

dian-lun-lin and others added 3 commits July 2, 2026 22:14
…uteLayerInfoFromSources

computeLayerInfoFromSources called getNodes(0) on each source graph to count live nodes
at level 0. getNodes(0) sequentially seeks through every node record on disk to filter
out deleted entries. On a cold page cache this touches large amounts of source data before
compaction even begins, significantly delaying the start of actual graph merging.

Since every live node is present at level 0 by the HNSW invariant, the count is simply
liveNodes.get(s).cardinality() — an in-memory popcount requiring no I/O.
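
The difference between the two counting strategies can be sketched with `java.util.BitSet`. This is an assumption made for illustration: the actual type behind `liveNodes` is not shown in this commit message, and `LiveNodeCount` is a hypothetical class. The scan method stands in for visiting every node record; the cardinality method is the in-memory popcount.

```java
import java.util.BitSet;

// Hypothetical sketch; java.util.BitSet stands in for whatever type liveNodes uses.
class LiveNodeCount {
    /** Slow path: visit every ordinal and test liveness, analogous to scanning all node records. */
    static int countByScan(BitSet live, int maxOrdinal) {
        int count = 0;
        for (int ordinal = 0; ordinal < maxOrdinal; ordinal++) {
            if (live.get(ordinal)) {
                count++;
            }
        }
        return count;
    }

    /** Fast path: a population count over the bitset's backing words, with no per-node visit. */
    static int countByCardinality(BitSet live) {
        return live.cardinality();
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(1_000);
        live.set(0, 1_000);   // all nodes live
        live.clear(10, 20);   // mark ten as deleted
        System.out.println(countByScan(live, 1_000) + " == " + countByCardinality(live));
    }
}
```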

Also switch PQ retraining from ProductQuantization.compute() (full k-means++ init)
to basePQ.refine() (Lloyd's iterations only, warm-started from the existing codebook).
The source codebooks are already trained on the same distribution, so warm-starting
converges in far fewer passes with no recall loss.
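
A toy one-dimensional example of why warm-starting Lloyd's iterations tends to need fewer passes. This is not `ProductQuantization` or `basePQ.refine()`, whose signatures are not shown here; `LloydWarmStart` is a hypothetical class, and the "cold" start below uses a deliberately naive initialization rather than k-means++.

```java
import java.util.Arrays;

// Hypothetical sketch; a 1-D stand-in for codebook refinement.
class LloydWarmStart {
    /**
     * Runs Lloyd's iterations from the given centroids (updated in place) and returns
     * the number of passes taken until no centroid moves by more than eps.
     */
    static int refine(double[] points, double[] centroids, int maxPasses, double eps) {
        for (int pass = 1; pass <= maxPasses; pass++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c; // assign the point to its nearest centroid
                    }
                }
                sum[best] += p;
                count[best]++;
            }
            double maxShift = 0.0;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[c] / count[c];
                maxShift = Math.max(maxShift, Math.abs(updated - centroids[c]));
                centroids[c] = updated;
            }
            if (maxShift <= eps) {
                return pass;
            }
        }
        return maxPasses;
    }

    public static void main(String[] args) {
        double[] points = {0, 1, 2, 10, 11, 12};
        double[] warm = {1.1, 10.9}; // close to an existing, already-trained codebook
        double[] cold = {0.0, 1.0};  // naive initialization from the first two points
        int warmPasses = refine(points, warm, 100, 1e-9);
        int coldPasses = refine(points, cold, 100, 1e-9);
        System.out.println("warm: " + warmPasses + " passes -> " + Arrays.toString(warm));
        System.out.println("cold: " + coldPasses + " passes -> " + Arrays.toString(cold));
    }
}
```

Both starts reach the same centroids on this data; the warm start simply begins next to them. Whether recall is preserved on real codebooks is a claim of the commit message, not something this toy demonstrates.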
@jshook jshook force-pushed the cooperative-embedding branch from fa4de43 to 22d1de3 Compare July 2, 2026 22:24