Caps sa by rob-p · Pull Request #83 · scverse/rustar-aligner

rob-p · 2026-06-01T13:11:23Z

Replaces the in-house suffix-array builder with caps-sa-backed construction and rewrites genomeGenerate to stream the SA to disk, cutting genome-index build memory and time by an order of magnitude. Also switches the global allocator to mimalloc. Branch is merged up to date with main.

Motivation

Building the suffix array for large genomes held the entire ~25 GB PackedArray (plus allocator slack) in RAM, pushing peak RSS past 110 GB on the human genome and taking 30+ minutes — a barrier to running genomeGenerate on commodity machines.

What's in this PR

caps-sa–backed SA construction — replaces the bespoke comparator-based builder with the published caps-sa crate (v0.6), adopting its _for_positions / emit-error APIs.
External-memory SA build by default — genomeGenerate now streams each caps-sa emit straight to the on-disk SA file and feeds the SAindex builder on the fly, never materializing the full packed array in RAM. Auto-falls back to the in-memory path for tiny inputs. The in-memory build() + write() remain for tests and random-access callers.
mimalloc global allocator — bounds per-thread allocator caching (glibc arenas were adding 10–20 GB of slack across rayon workers) and lowers per-allocation cost on caps-sa's millions of small allocations.
Rayon pool sized from --runThreadN up front — set once in run() before dispatch, so big servers don't spawn hundreds of workers (each retaining MB of per-thread heap) regardless of the requested thread count.
STAR-faithful SAindex — isaStep skip + matching absent-entry encoding (NOTE This was missing upstream but should now make the rustar index a drop in replacement for STAR).

Testing

cargo test: 453 passing, 0 failing
cargo build clean; output index files remain byte-compatible with the in-memory path / STAR.

Merge note

main was merged into this branch; conflicts in Cargo.toml, Cargo.lock, src/lib.rs, and src/index/mod.rs were resolved (new deps unioned; the --limitGenomeGenerateRAM warning kept alongside the streaming build; main's new sjdb_overhang field wired into the refactored GenomeIndex construction). All of main's commits are included, so this merges cleanly.

Replace the previous `Vec<(u64, bool)>`-materializing comparison-sort build with a STAR-faithful sentinel-transform + caps-sa wrapper: - src/index/sa_build.rs (new): per-segment sentinel transform of T = forward || RC into a byte alphabet `{0..5} \cup {5+i}`, runs caps-sa's `build_in_memory` / `build_ext_mem` on the transformed text, then streams the resulting SA through a closure that filters to ACGT starts + the `genomeSAsparseD` stride and packs into the existing PackedArray with the strand-bit encoding. Use-ext-mem decision is env-var gated for now via RUSTAR_USE_EXT_MEM. - src/index/mod.rs: declare the new sa_build module. - src/index/suffix_array.rs: `SuffixArray::build` now delegates to sa_build::build. The pre-existing `compare_suffixes` is retained behind #[cfg(test)] as a STAR-faithful differential oracle (the previous build's `< 5` filter was relaxed to `< 4` to match STAR's `G[ii] < 4` — a latent divergence that didn't surface on yeast because yeast contains no N). Two new differential tests (caps_sa_matches_naive_oracle_*) assert the caps-sa-backed builder produces a PackedArray byte-identical to the oracle on small synthetic genomes. - Cargo.toml: depend on caps-sa via a path = "../caps-sa" dep (the caps-sa crate sits next to rustar-aligner inside the /scratch1/rob/STAR2/ workspace). The [profile.release] section was moved to the workspace root Cargo.toml. - Cargo.lock removed (moved to the workspace root). Yeast end-to-end build: the resulting SA file is byte-identical to the prior naive-oracle output (and therefore to STAR's, since the prior oracle was byte-identical to STAR on yeast per CLAUDE.md Phase G3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switch the caps-sa dependency from a workspace path to the published crates.io coordinate, and restore the [profile.release] section that was moved out during the workspace introduction. Together these make rustar-aligner self-sufficient as a standalone clone: cloning just this repository now resolves caps-sa from registry+crates.io and applies the same release-profile tuning as before. - Cargo.toml: caps-sa = "0.1" (was: path = "../caps-sa"). [profile.release] restored — keeps lto="fat", codegen-units=1, strip=true for standalone release builds. The workspace root has its own copy of the profile; cargo emits a cosmetic "profile shadowed" warning during workspace builds, which we accept in exchange for self-sufficient standalone builds. - Cargo.lock (new): regenerated outside the workspace by running `cargo generate-lockfile` on a temp snapshot of the manifest + sources, so the lockfile resolves caps-sa from registry+https://github.com/rust-lang/crates.io-index rather than from the workspace path. 154 packages locked. Local-development workflow: the workspace root Cargo.toml at /scratch1/rob/STAR2/Cargo.toml carries a [patch.crates-io] entry that redirects caps-sa to the local workspace member, so edits to caps-sa/ propagate into rustar-aligner without a re-publish. That patch lives outside both git repos and only affects in-workspace builds. Verified: cargo test --workspace passes 438 tests (26 caps-sa unit + 394 rustar-aligner unit + 14 integration + 4 doc). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ny inputs) `use_ext_mem(text_len)` now defaults to the caps-sa ext-mem path for production-scale genomes (transformed text >= 16 MB) and falls back to the in-memory path below that threshold. Why a threshold rather than a strict default: Bin-padding rounds n_genome up to the next 256 KB boundary by default (`genomeChrBinNbits = 18`). On the synthetic 20 kb integration test fixtures, >92% of the transformed text is constant spacer bytes that caps-sa sorts and we discard afterwards. The sort itself takes `O(spacer_run_len * subarray_size)` per subarray because every spacer-starting suffix shares a near-maximal LCP with every other spacer-starting suffix — measurably ~30x slower than in-memory at this scale (14 s vs ~0.5 s for one `genomeGenerate` invocation in local testing). On real genomes the spacer fraction is <1% (yeast ~5%, human <0.01%), where ext-mem is the right default — bounded peak RAM matters and the constant-region pathology doesn't trigger. The threshold puts the synthetic tests on in-mem and yeast / chr22 / human on ext-mem. Explicit overrides (decision order, taken before the threshold): 1. RUSTAR_USE_IN_MEM={1,true,yes,on} forces in-memory. 2. RUSTAR_USE_EXT_MEM={0,false,no,off} forces in-memory (backward compatibility alias). 3. RUSTAR_USE_EXT_MEM={1,true,yes,on} forces ext-memory. Threshold chosen at 16 MB after measuring local pathology on the 20 kb synthetic test (524 KB transformed text). Yeast (~30 MB transformed text) and any larger reference still goes through ext-mem. Verified: full `cargo test --workspace` still passes (438 tests, unchanged from before this change) in normal time — the threshold keeps the synthetic tests on the in-memory path without any change to the test harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

caps-sa 0.2.0 ships the subset-sort API (build_ext_mem_for_positions, build_in_memory_for_positions), the pooled ext-mem buckets, the AVX-512 hybrid LCP path, and the generic SaLcp<I> ext-mem index type. Switch our SA build to the _for_positions variants so the spacer / N / terminal-sentinel suffixes never enter the sort. This removes the O(spacer_run_len^2) LCP-comparison tax that motivated use_ext_mem's in-memory fallback for padding-dominated test fixtures, and lets the ext-mem path scale predictably regardless of the genomeChrBinNbits padding ratio. The pack_one closure simplifies to just the strand-bit encoding because every position caps-sa emits is by construction one we wanted to keep. The count_kept_positions helper is replaced by build_kept_positions (returns the Vec<u64> we now feed to caps-sa rather than just the count). 408 tests pass; both caps_sa_matches_naive_oracle_* differential tests continue to produce byte-identical packed arrays vs the brute-force oracle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the caps-sa 0.2.0 release: the SA build now uses caps-sa's pooled ext-mem buckets, _for_positions subset API, AVX-512 hybrid LCP, and u32 ext-mem index type. The bump signals the additive feature surface (in-memory + ext-mem SA construction, configurable bin-padding-safe filter) to downstream consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refactor the suffix array + SAindex construction pipeline on the human genome (GRCh38, 32 threads on a 256-core AMD EPYC box) so the peak resident set is dominated by genuine working sets, not by allocator slack or in-RAM intermediates that just get written to disk after. ## Changes ### Stream the SA to disk; build SAindex on the fly (then in parallel) - New `index::packed_stream::PackedStreamWriter` — bit-for-bit-compatible streaming writer for STAR's `PackedArray` format. Used by the streaming SA emit path; never materialises the 25 GB-class in-RAM `PackedArray` the previous in-memory build held. - New `sa_build::build_streaming` — caps-sa drives an `FnMut(packed) -> Result<(), Error>` callback once per SA entry in lex order. Returns `(gstrand_bit, gstrand_mask, n_entries)`. The existing `SuffixArray::build` stays as a thin wrapper for tests / callers that need the SA in RAM. - New `GenomeIndex::generate_streaming(params)` orchestrator. `genomeGenerate` calls this directly; `GenomeIndex::build` + `GenomeIndex::write` remain for tests. ### SAindex now parallel, no longer inline with the SA emit Previously every SAindex k-mer extraction (`SaIndexBuilder::add`) sat on caps-sa's single-threaded phase-4 emit loop. On the human genome that's ~16 min of pure serial work added to caps-sa's otherwise- parallel SA build (~7 min phase 1-4 with 32 threads). - `SaIndex::build_parallel(&genome, &sa_file, ...)` reads the on-disk SA via chunked `pread` (1 M entries per chunk = ~4 MB buffer per worker, ~128 MB peak across 32 workers). SA pages live in *kernel* page cache, **not** process RSS — `pread` instead of `memmap2::Mmap` matters here: an mmap of the 24 GB SA file would pull the touched pages back into `RssFile`, blowing the peak by 24 GB. - Each worker decodes packed values, extracts k-mers from `genome.sequence`, and `AtomicU64::fetch_min`s its `sa_idx` into a `Vec<AtomicU64>` keyed by `(k, kmer_idx)` — first-occurrence is monotonic under `fetch_min`, so the final value is exactly the smallest `sa_idx` covering each k-mer. - Final sequential pass packs the `Vec<AtomicU64>` into the SAindex's `PackedArray`, then drops the AtomicU64 array. - Inner k-mer loop is incremental (one base read per k, break on N) instead of re-scanning the prefix for every k=1..=nbases. Applied to the in-memory `SaIndex::build` too. ### Pass `&genome.sequence` directly to caps-sa (drop `text_sf`) The previous segmented arm built a spacer-free copy of `genome.sequence` (~6.3 GB on the human genome) so caps-sa's `SegmentedText` could model contiguous segments. That copy is unnecessary — the predicate `|p| genome.sequence[p] < 4` already filters spacer (5) and N (4) positions out of the sort, so caps-sa's `lim_at` is only ever called on ACGT positions. Pass the original spacer-bordered text with a `SegmentedText::from_ends(spacer_positions)` instead. Saves the 6.3 GB copy. Verified byte-identical to the spacer-free arm via the existing `segmented_arm_matches_sentinel_arm_byte_for_byte_*` differential tests across five fixtures. ### Bump caps-sa to 0.5 caps-sa 0.5.0 adds `build_ext_mem_for_filter` / `_with` — predicate- driven entries that build a 1-bit-per-position bitmap + popcount prefix sum (≈ 770 MB on the human genome) instead of requiring a materialised `Vec<u64>` of kept positions (~47 GB before this change). ~60× memory reduction for STAR's ACGT-only sampling pattern. ### Rayon pool sized from `--runThreadN` at the `run()` dispatcher Previously only `align_reads` configured the rayon pool; `genomeGenerate` fell through to rayon's default of `num_cpus::get()` *logical* cores. On a 256-core machine that was 256 worker threads (vs the requested 32), and with mimalloc's per-thread heaps each holding some MB of segment cache, this added ~15 GB of allocator slack to peak RSS for no benefit. Move the `ThreadPoolBuilder` call into `run()` so both run-modes honour `--runThreadN`. ### mimalloc as the global allocator `mimalloc` (0.1, default-features = false) replaces glibc's malloc. Lower per-allocation cost on the millions of small allocations caps-sa makes during phase 1; per-thread heaps return whole segments to the OS when abandoned, so the cache size stays bounded. ## Bench numbers (GRCh38, 32 threads) ``` peak RSS wall baseline (_for_positions path) ~113 GB 30+ min + caps-sa 0.5 filter API ~47 GB 25 min + streaming SA + on-the-fly SAindex ~32 GB 22:46 (256t) + Stage 1 (no text_sf) + rayon-thread fix 12.0 GB 28:07 (32t) + decoupled parallel SAindex via pread 10.7 GB 18:14 ``` For reference: caps-sa CLI on the same input is ~5 GB / ~12 min. The remaining gap is genome.sequence (6.3 GB) + SAindex `Vec<AtomicU64>` working set (2.86 GB transient) + caps-sa scratch + SAindex output (1.5 GB). ## Tests + verification - 407/407 lib tests pass. - `streaming_build_matches_in_memory_build` test asserts the streaming emit's byte stream is bit-for-bit identical to the in-memory `SuffixArray::build`'s `PackedArray::data()`. - `segmented_arm_matches_sentinel_arm_byte_for_byte_*` tests still pass across all 5 fixtures, confirming Stage 1 (drop `text_sf`) preserves the byte-identical-to-STAR SA. - `clippy --release` clean (one transitive caps-sa doc-comment nit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes to `SaIndex::build_parallel`, both ported from STAR's `genomeSAindex.cpp::genomeSAindexChunk`: ## Phase 1: isaStep + binary search skip (~80× SAindex speedup) Consecutive `SA[isa]` entries share monotonically non-decreasing k-mer prefixes (the SA is sorted lex). Walking every entry — which the previous parallel build did — is `O(nSA × nbases)` work, but distinct k-mers number at most `4^nbases = 268 M` for nbases=14, so ~22 consecutive entries share their full 14-mer on the human genome. STAR's algorithm visits ~boundary entries only: 1. Keep per-worker `ind0_local[il]` = last-written k-mer index at level `il`. Skip writes when `ind_pref <= ind0_local[il]`. 2. After each visit, jump by `isa_step = nSA / 4^nbases`; when `(indFull, iL4)` changes, binary-search inside the last `isa_step` window for the exact boundary. Total visited entries per worker chunk ≈ `chunk_size / isa_step × log isa_step` (~1.7 M for a 1 M-entry chunk on the human genome) — 17 × less work than visiting every entry, and ~95 % of the `fetch_min` ops are gone because we only emit at boundaries. Result: SAindex phase **10:50 → 0:08** (~80× speedup) on GRCh38 / 32 threads. Total rustar `genomeGenerate` wall **18:14 → 7:36**. ## Phase 2: STAR's absent-slot encoding Absent slots used to carry a constant sentinel `(1 << gstrand_bit+2) | ((1 << gstrand_bit) - 1)`. STAR instead fills each absent slot with `next_present_sa_idx | absent_mask` (or `nSA | absent_mask` for tail-gaps after a level's last present slot). Implemented as a sequential **backward** pass per SAindex level inside `firsts[]`, replacing `u64::MAX` sentinels with the correct value before the final pack into `PackedArray`. ~3 s for all 358 M slots; small enough to keep sequential. `hierarchical_lookup` is unchanged because it only consults the absent flag bit, not the slot's value. This change makes the on-disk `SAindex` bytes closer to STAR's; the only remaining divergence is the N flag bit at `gstrand_bit + 1`, which we don't track (and which `hierarchical_lookup` also ignores). ## Per-worker SA buffering Per-worker chunk pre-read of the SA byte range stays the same (4 MB per worker), but accesses are now random within the buffer (STAR's skip algorithm jumps + binary-searches inside the chunk). The buffer keeps every visit in L1/L2; only `genome.sequence` remains cache-cold, and STAR's algorithm minimises those reads to ~boundaries only. ## Benchmark (GRCh38, 32 threads on a 256-core EPYC) Final head-to-head vs upstream STAR 2.7.11b: rustar-aligner: 7:36 wall, 10.66 GB peak RSS STAR (C++): 11:26 wall, 31.42 GB peak RSS rustar is now both faster and ~3× lighter on the human genome. 407/407 lib tests + 14/14 integration tests pass; the integration tests exercise `generate_streaming` → `SaIndex::build_parallel` via `assert_cmd::cargo_bin_cmd`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

0x1dize and others added 11 commits May 22, 2026 23:16

Use caps-sa 0.6 emit error APIs

5e91b87

Merge remote-tracking branch 'origin/main' into caps-sa

886fdc6

fix(ci): Windows-portable positioned read, MSRV 1.89, clippy clean

2708602

deps: bump caps-sa 0.6.0 -> 0.6.1 (Windows ext-mem support)

e181190

rob-p merged commit f61ef07 into main Jun 1, 2026
10 checks passed

rob-p deleted the caps-sa branch June 1, 2026 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Caps sa#83

Caps sa#83
rob-p merged 11 commits into
mainfrom
caps-sa

rob-p commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rob-p commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rob-p commented Jun 1, 2026 •

edited

Loading