Binary cache Bloom filter#488
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Bloom filter publication and consumption for binary caches: format, server endpoint and ETag handling, CLI generation, disk persistence, client conditional fetching and negative-lookup short-circuits, documentation, and functional tests. ChangesBloom Filter Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/libstore/nar-info-disk-cache.cc (1)
467-474: 💤 Low valuePotential integer overflow in blob offset cast for extremely large filters.
The cast
(int)(pos / 8)on line 469 could overflow ifmBitsexceeds ~17 billion bits (2GB blob). While practically unlikely given bloom filter sizes are typically much smaller, SQLite'ssqlite3_blob_readacceptsintoffset, which limits addressable blob size to ~2GB.This is acceptable for the intended use case of "relatively small and infrequently changed" caches per the PR objectives.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/libstore/nar-info-disk-cache.cc` around lines 467 - 474, The code casts (pos / 8) to int when calling sqlite3_blob_read which can overflow for extremely large bloom filters; before calling sqlite3_blob_read (in the loop over bitPositions inside nar-info-disk-cache.cc) check that (pos / 8) <= INT_MAX (or otherwise fits in an int) and if not, raise/throw a clear SQLiteError (or return false) so you never pass a truncated negative offset to sqlite3_blob_read; reference the loop over bitPositions and the sqlite3_blob_read call when adding the bounds check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@doc/manual/source/protocols/binary-cache-bloom-filter.md`:
- Around line 38-40: Add explicit language identifiers (e.g., ```text) to the
unlabeled fenced code blocks that contain the formulas so MD040 doesn't trigger;
specifically update the fence that wraps "pos = ((h1 + i * h2) mod 2^64) mod m"
and the fence that wraps the block starting with "m = ceil(-n * ln(p) / (ln
2)^2), rounded up to a multiple of 8" to use a language label such as text (or
math) instead of an unlabeled ``` fence.
In `@doc/manual/source/protocols/nix-cache-info.md`:
- Around line 46-49: The fenced code blocks containing the example protocol
snippets (the blocks that include lines like "BloomFilter: /bloom-filter",
"BloomFilter: https://filters.example.com/cache-abc.bloom", and the block
containing "StoreDir: /nix/store", "WantMassQuery: 1", etc.) need explicit
language tags to satisfy MD040; update both code fences to use a language token
such as "text" (i.e., change ``` to ```text) so the blocks are marked as plain
text in the markdown.
In `@src/libstore/bloom-filter.cc`:
- Around line 16-20: Validate and sanitize falsePositiveRate before computing
Bloom parameters: in the function that computes mBits/k (where
falsePositiveRate, mBits, k, and n are used), check that falsePositiveRate is
strictly between 0 and 1 and n > 0; if not, either clamp falsePositiveRate into
a safe open interval (e.g. min = 1e-9, max = 1 - 1e-9) or return/throw an error,
then compute mF/mBits/k as before and ensure mBits is at least 8 (or >0) to
avoid later modulo by zero; update the code around the existing mF, mBits, k
calculations to perform this validation/sanitization before using
falsePositiveRate.
In `@src/libstore/include/nix/store/binary-cache-store.hh`:
- Around line 88-94: The header comment on BinaryCacheStore's bloom-filter field
overstates support for absolute URLs; update the comment (near the field
documented and references in init() and probe initialization) to state that the
BloomFilter URL is either an absolute path on the same cache root or a path
relative to the cache root and that external HTTP/HTTPS absolute URLs are not
supported by the probe initialization, or alternatively implement full
absolute-URL fetch support in the probe initialization code path (adjusting
init() and any probe init logic accordingly); prefer the simpler change: change
the comment to explicitly say "path relative to the cache root; external
http/https absolute URLs are not supported" and reference init()/probe
initialization behavior.
---
Nitpick comments:
In `@src/libstore/nar-info-disk-cache.cc`:
- Around line 467-474: The code casts (pos / 8) to int when calling
sqlite3_blob_read which can overflow for extremely large bloom filters; before
calling sqlite3_blob_read (in the loop over bitPositions inside
nar-info-disk-cache.cc) check that (pos / 8) <= INT_MAX (or otherwise fits in an
int) and if not, raise/throw a clear SQLiteError (or return false) so you never
pass a truncated negative offset to sqlite3_blob_read; reference the loop over
bitPositions and the sqlite3_blob_read call when adding the bounds check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d30c0226-025b-4e6d-a546-03a368724fcd
📒 Files selected for processing (18)
doc/manual/source/SUMMARY.md.indoc/manual/source/protocols/binary-cache-bloom-filter.mddoc/manual/source/protocols/nix-cache-info.mdsrc/libstore-tests/nar-info-disk-cache.ccsrc/libstore/binary-cache-store.ccsrc/libstore/bloom-filter.ccsrc/libstore/http-binary-cache-store.ccsrc/libstore/include/nix/store/binary-cache-store.hhsrc/libstore/include/nix/store/bloom-filter.hhsrc/libstore/include/nix/store/http-binary-cache-store.hhsrc/libstore/include/nix/store/meson.buildsrc/libstore/include/nix/store/nar-info-disk-cache.hhsrc/libstore/meson.buildsrc/libstore/nar-info-disk-cache.ccsrc/nix/generate-bloom-filter.ccsrc/nix/meson.buildsrc/nix/serve.cctests/functional/binary-cache.sh
a9ab98b to
2cb2a8e
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/libstore/nar-info-disk-cache.cc (1)
443-452: 💤 Low valuePotential integer overflow in blob offset calculation for very large filters.
The
pos / 8value is cast tointon line 448, butposisuint64_tand can theoretically exceedINT_MAX * 8(~17GB). For realistic Bloom filter sizes this is unlikely to be an issue, but the cast could silently truncate and cause incorrect reads for pathologically large filters.Consider using an explicit size check or documenting the implicit limit:
+ // Bloom filters larger than ~2GB are not supported due to + // sqlite3_blob_read offset limitations. forEachBloomBitPosition(path, params->k, params->mBits, [&](uint64_t pos) { if (!allSet) return; unsigned char byte = 0; - if (sqlite3_blob_read(blob, &byte, 1, (int) (bloomFilterHeaderLen + pos / 8)) != SQLITE_OK) + auto offset = bloomFilterHeaderLen + pos / 8; + if (offset > INT_MAX) + SQLiteError::throw_(state->db, "bloom-filter blob too large"); + if (sqlite3_blob_read(blob, &byte, 1, static_cast<int>(offset)) != SQLITE_OK) SQLiteError::throw_(state->db, "reading bloom-filter blob");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/libstore/nar-info-disk-cache.cc` around lines 443 - 452, The loop in forEachBloomBitPosition reads a byte with sqlite3_blob_read using (int)(bloomFilterHeaderLen + pos / 8) which can overflow when pos (uint64_t) produces an offset > INT_MAX; add an explicit bounds check before the sqlite3_blob_read call: compute uint64_t offset = bloomFilterHeaderLen + pos / 8, verify offset <= INT_MAX (or otherwise handle/throw via SQLiteError::throw_), then call sqlite3_blob_read with a safe cast to int; update the code around forEachBloomBitPosition/pos/sqlite3_blob_read to use this checked offset so reads cannot be truncated silently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/libstore/nar-info-disk-cache.cc`:
- Around line 443-452: The loop in forEachBloomBitPosition reads a byte with
sqlite3_blob_read using (int)(bloomFilterHeaderLen + pos / 8) which can overflow
when pos (uint64_t) produces an offset > INT_MAX; add an explicit bounds check
before the sqlite3_blob_read call: compute uint64_t offset =
bloomFilterHeaderLen + pos / 8, verify offset <= INT_MAX (or otherwise
handle/throw via SQLiteError::throw_), then call sqlite3_blob_read with a safe
cast to int; update the code around
forEachBloomBitPosition/pos/sqlite3_blob_read to use this checked offset so
reads cannot be truncated silently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 61c25c05-7634-41aa-87b6-e9658dd8496b
📒 Files selected for processing (8)
src/libstore/binary-cache-store.ccsrc/libstore/bloom-filter.ccsrc/libstore/include/nix/store/binary-cache-store.hhsrc/libstore/include/nix/store/bloom-filter.hhsrc/libstore/include/nix/store/nar-info-disk-cache.hhsrc/libstore/nar-info-disk-cache.ccsrc/nix/generate-bloom-filter.cctests/functional/binary-cache.sh
💤 Files with no reviewable changes (1)
- src/nix/generate-bloom-filter.cc
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/functional/binary-cache.sh
- src/libstore/include/nix/store/binary-cache-store.hh
- src/libstore/binary-cache-store.cc
Add a /bloom-filter endpoint to `nix serve` that returns a Bloom filter of every valid store path's 32-character Nix32 hash part, so clients can rule out definite cache misses without issuing a .narinfo request per candidate path. The filter uses Kirsch-Mitzenmacher double hashing over the decoded 20-byte hash part (no extra hashing needed: the hash part is already cryptographic). Size and number of hash functions are derived from the valid-path count and a 1% target false-positive rate, compiled in as a constant for now. The URL is advertised through a new `BloomFilter` field in nix-cache-info (absolute or relative). The wire format is self-describing: an 8-byte "NixBloom" magic, then version, k, m and the bit array, all little-endian. Both the format and the new nix-cache-info field are documented under doc/manual/source/protocols. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read the cache's bloom filter (advertised via the new BloomFilter: field in nix-cache-info) and consult it from queryPathInfoUncached and isValidPathUncached. A "definitely missing" answer short-circuits the .narinfo round trip; "possibly present" falls through to the existing fetch. The filter is cached on disk in NarInfoDiskCache as a blob and probed via sqlite3_blob_read so we only touch ~k bytes per query. Reuse of the negative-narinfo TTL keeps the cadence familiar. ETag/304 is plumbed through a new BinaryCacheStore::getFileConditional hook (HTTP overrides it, file:// falls back to a plain GET). Concurrent first probes across processes are serialised on a $XDG_CACHE_HOME lockfile, mirroring the fetcher pattern in libfetchers. Fetch failures log a warning and disable the filter for the rest of the process. Bumps the on-disk cache file to v2 because both BinaryCaches gains a bloomFilterUrl column and a new BloomFilters table is added; v1 files are orphaned and refilled on first use. A debug log records every "definitely missing" decision so the behaviour is easy to inspect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The server-side filter builder in `nix serve` and the client-side probe in BinaryCacheStore were doing the same Nix32-decode + double hashing dance with two slightly different bodies. Hoist the iteration into a templated `forEachBloomBitPosition` in a new `nix/store/bloom-filter.hh` so the wire-format math lives in exactly one place; callers supply a body that either sets a bit or pushes a position. Pure refactor: the served filter bytes and the client probe results are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the bloom-filter blob builder out of `nix serve`'s TU and into `src/libstore/bloom-filter.cc` so the same code can drive a new undocumented CLI command. `nix store generate-bloom-filter` takes the store's valid path set (or a newline-separated file via `--from-file`) and writes the same wire-format blob the `/bloom-filter` endpoint serves. `--false-positive-rate` overrides the 1% default. The command refuses to write to a terminal (mirroring `nix nario export`) and reports the resulting size + path count to stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Have `startNixServe` redirect server output to `$TEST_ROOT/nix-serve.log` so tests can inspect what the server actually received, then add two assertions to the existing HTTP-cache section of binary-cache.sh: 1. A `--debug nix path-info` for a fake store path against `nix serve` should log "bloom filter for … ruled out …", proving the client short-circuited without a `.narinfo` round trip. 2. A second `--debug nix path-info` for a different fake path should reuse the disk-cached filter — the server should record exactly one `GET /bloom-filter` across the two probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A fixed fake hashpart can still hit a false positive against the deterministic test-fixture filter, so loop generating fakes (varying the last 6 chars of the hashpart, which is where Nix32 decode's reverse-iteration places the bytes that feed h1/h2) until one is ruled out. The disk-cache-reuse assertion folds in for free since the bloom is fetched at most once across all loop iterations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urce Replace the hand-rolled byte twiddling in buildBloomFilter with a StringSink and the `sink << n` idiom; replace the matching reader in BinaryCacheStore::isDefinitelyMissing with a StringSource and `source >> ...`. The wire format becomes uniformly little-endian u64 for all integer fields, growing the header from 24 to 32 bytes (negligible vs. the bit array). The Source-based parse also reports out-of-range values via SerialisationError rather than silently truncating. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is already handled correctly in HttpBinaryCacheStore::makeRequest(), so we can just drop the check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Correct the header size (32 bytes, not 24) in the buildBloomFilter docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The client used to cache the filter's k/mBits in an in-process
BloomState and compute bit positions from those cached values while
probing the current sqlite blob. After another process refreshed the
filter (possibly with different parameters) the cached k/mBits drifted
from the probed blob, risking false negatives.
Now:
- The disk cache stores the raw filter body (header + bit array) in a
single `blob` column; the k/mBits/bits columns are gone.
- A combined probeBloomFilter(uri, StorePath) -> optional<bool> reads
the header (for k/mBits) and the probed bits from the same blob in
one locked transaction, so they can't drift. nullopt means "no fresh
filter; (re)fetch". getBloomFilterETag replaces lookupBloomFilter for
the conditional-GET etag.
- Freshness compares the row timestamp against a fixed per-process
startTime (>=), so a filter we just (re)fetched stays fresh for the
rest of the run instead of being re-fetched on every query (which a
moving clock caused under --refresh).
- BloomState is reduced to a maybeDisable()-style per-process cooldown
that suppresses Bloom use after a failed fetch; it no longer caches
any filter parameters.
- Header parsing is factored into parseBloomFilterHeader (+ the
bloomFilterHeaderLen constant) in bloom-filter.{hh,cc}, shared by the
builder, the fetch-time validator, and the probe.
The binary-cache functional test gets a 1s sleep before the --refresh
step so the cached entry deterministically predates it (freshness is at
1-second resolution).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It was only used by fetchBloomFilter; make it a local `disable` lambda there and drop the member function and its declaration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lets a client opt out of consulting a cache's advertised Bloom filter. isDefinitelyMissing bails out early when it's false. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3f3513f to
acef6ce
Compare
acef6ce to
cfce945
Compare
Motivation
This adds the ability for binary caches to expose a Bloom filter that lets clients quickly decide whether a store path definitely does not exist in the binary cache.
nix servenow generates a Bloom filter, which it advertises to clients using a new field innix-cache-info:You can also generate the Bloom filter manually, e.g.
Clients cache the Bloom filter according to
narinfo-cache-negative-ttl. They revalidate the cached filter using its ETag, so revalidation should be quick if the Bloom filter doesn't change often. Thus, Bloom filters are primarily useful for binary caches that are not very big and don't change very often (likehttps://install.determinate.systems).Context
Summary by CodeRabbit
New Features
Documentation
Tests