Skip to content

Caps sa#83

Merged
rob-p merged 11 commits into
mainfrom
caps-sa
Jun 1, 2026
Merged

Caps sa#83
rob-p merged 11 commits into
mainfrom
caps-sa

Conversation

@rob-p
Copy link
Copy Markdown
Contributor

@rob-p rob-p commented Jun 1, 2026

Replaces the in-house suffix-array builder with caps-sa-backed construction and rewrites genomeGenerate to stream the SA to disk, cutting genome-index build memory and time by an order of magnitude. Also switches the global allocator to mimalloc. Branch is merged up to date with main.

Motivation

Building the suffix array for large genomes held the entire ~25 GB PackedArray (plus allocator slack) in RAM, pushing peak RSS past 110 GB on the human genome and taking 30+ minutes — a barrier to running genomeGenerate on commodity machines.

What's in this PR

  • caps-sa–backed SA construction — replaces the bespoke comparator-based builder with the published caps-sa crate (v0.6), adopting its _for_positions / emit-error APIs.
  • External-memory SA build by default — genomeGenerate now streams each caps-sa emit straight to the on-disk SA file and feeds the SAindex builder on the fly, never materializing the full packed array in RAM. Auto-falls back to the in-memory path for tiny inputs. The in-memory build() + write() remain for tests and random-access callers.
  • mimalloc global allocator — bounds per-thread allocator caching (glibc arenas were adding 10–20 GB of slack across rayon workers) and lowers per-allocation cost on caps-sa's millions of small allocations.
  • Rayon pool sized from --runThreadN up front — set once in run() before dispatch, so big servers don't spawn hundreds of workers (each retaining MB of per-thread heap) regardless of the requested thread count.
  • STAR-faithful SAindex — isaStep skip + matching absent-entry encoding (NOTE This was missing upstream but should now make the rustar index a drop in replacement for STAR).

Testing

  • cargo test: 453 passing, 0 failing
  • cargo build clean; output index files remain byte-compatible with the in-memory path / STAR.

Merge note

main was merged into this branch; conflicts in Cargo.toml, Cargo.lock, src/lib.rs, and src/index/mod.rs were resolved (new deps unioned; the --limitGenomeGenerateRAM warning kept alongside the streaming build; main's new sjdb_overhang field wired into the refactored GenomeIndex construction). All of main's commits are included, so this merges cleanly.

0x1dize and others added 11 commits May 22, 2026 23:16
Replace the previous `Vec<(u64, bool)>`-materializing comparison-sort
build with a STAR-faithful sentinel-transform + caps-sa wrapper:

- src/index/sa_build.rs (new): per-segment sentinel transform of T =
  forward || RC into a byte alphabet `{0..5} \cup {5+i}`, runs
  caps-sa's `build_in_memory` / `build_ext_mem` on the transformed
  text, then streams the resulting SA through a closure that filters
  to ACGT starts + the `genomeSAsparseD` stride and packs into the
  existing PackedArray with the strand-bit encoding. Use-ext-mem
  decision is env-var gated for now via RUSTAR_USE_EXT_MEM.

- src/index/mod.rs: declare the new sa_build module.

- src/index/suffix_array.rs: `SuffixArray::build` now delegates to
  sa_build::build. The pre-existing `compare_suffixes` is retained
  behind #[cfg(test)] as a STAR-faithful differential oracle (the
  previous build's `< 5` filter was relaxed to `< 4` to match
  STAR's `G[ii] < 4` — a latent divergence that didn't surface on
  yeast because yeast contains no N). Two new differential tests
  (caps_sa_matches_naive_oracle_*) assert the caps-sa-backed builder
  produces a PackedArray byte-identical to the oracle on small
  synthetic genomes.

- Cargo.toml: depend on caps-sa via a path = "../caps-sa" dep (the
  caps-sa crate sits next to rustar-aligner inside the
  /scratch1/rob/STAR2/ workspace). The [profile.release] section was
  moved to the workspace root Cargo.toml.

- Cargo.lock removed (moved to the workspace root).

Yeast end-to-end build: the resulting SA file is byte-identical to
the prior naive-oracle output (and therefore to STAR's, since the
prior oracle was byte-identical to STAR on yeast per CLAUDE.md
Phase G3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the caps-sa dependency from a workspace path to the published
crates.io coordinate, and restore the [profile.release] section that
was moved out during the workspace introduction. Together these make
rustar-aligner self-sufficient as a standalone clone: cloning just
this repository now resolves caps-sa from registry+crates.io and
applies the same release-profile tuning as before.

- Cargo.toml: caps-sa = "0.1" (was: path = "../caps-sa").
  [profile.release] restored — keeps lto="fat", codegen-units=1,
  strip=true for standalone release builds. The workspace root has
  its own copy of the profile; cargo emits a cosmetic "profile
  shadowed" warning during workspace builds, which we accept in
  exchange for self-sufficient standalone builds.
- Cargo.lock (new): regenerated outside the workspace by running
  `cargo generate-lockfile` on a temp snapshot of the manifest +
  sources, so the lockfile resolves caps-sa from
  registry+https://github.com/rust-lang/crates.io-index rather than
  from the workspace path. 154 packages locked.

Local-development workflow: the workspace root Cargo.toml at
/scratch1/rob/STAR2/Cargo.toml carries a [patch.crates-io] entry that
redirects caps-sa to the local workspace member, so edits to caps-sa/
propagate into rustar-aligner without a re-publish. That patch lives
outside both git repos and only affects in-workspace builds.

Verified: cargo test --workspace passes 438 tests (26 caps-sa unit +
394 rustar-aligner unit + 14 integration + 4 doc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ny inputs)

`use_ext_mem(text_len)` now defaults to the caps-sa ext-mem path for
production-scale genomes (transformed text >= 16 MB) and falls back
to the in-memory path below that threshold.

Why a threshold rather than a strict default:

Bin-padding rounds n_genome up to the next 256 KB boundary by default
(`genomeChrBinNbits = 18`). On the synthetic 20 kb integration test
fixtures, >92% of the transformed text is constant spacer bytes that
caps-sa sorts and we discard afterwards. The sort itself takes
`O(spacer_run_len * subarray_size)` per subarray because every
spacer-starting suffix shares a near-maximal LCP with every other
spacer-starting suffix — measurably ~30x slower than in-memory at
this scale (14 s vs ~0.5 s for one `genomeGenerate` invocation in
local testing). On real genomes the spacer fraction is <1% (yeast ~5%,
human <0.01%), where ext-mem is the right default — bounded peak RAM
matters and the constant-region pathology doesn't trigger. The
threshold puts the synthetic tests on in-mem and yeast / chr22 / human
on ext-mem.

Explicit overrides (decision order, taken before the threshold):

  1. RUSTAR_USE_IN_MEM={1,true,yes,on}   forces in-memory.
  2. RUSTAR_USE_EXT_MEM={0,false,no,off} forces in-memory (backward
                                          compatibility alias).
  3. RUSTAR_USE_EXT_MEM={1,true,yes,on}  forces ext-memory.

Threshold chosen at 16 MB after measuring local pathology on the
20 kb synthetic test (524 KB transformed text). Yeast (~30 MB
transformed text) and any larger reference still goes through ext-mem.

Verified: full `cargo test --workspace` still passes (438 tests,
unchanged from before this change) in normal time — the threshold
keeps the synthetic tests on the in-memory path without any change to
the test harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
caps-sa 0.2.0 ships the subset-sort API (build_ext_mem_for_positions,
build_in_memory_for_positions), the pooled ext-mem buckets, the
AVX-512 hybrid LCP path, and the generic SaLcp<I> ext-mem index
type.

Switch our SA build to the _for_positions variants so the
spacer / N / terminal-sentinel suffixes never enter the sort. This
removes the O(spacer_run_len^2) LCP-comparison tax that motivated
use_ext_mem's in-memory fallback for padding-dominated test
fixtures, and lets the ext-mem path scale predictably regardless
of the genomeChrBinNbits padding ratio. The pack_one closure
simplifies to just the strand-bit encoding because every position
caps-sa emits is by construction one we wanted to keep.

The count_kept_positions helper is replaced by build_kept_positions
(returns the Vec<u64> we now feed to caps-sa rather than just the
count).

408 tests pass; both caps_sa_matches_naive_oracle_* differential
tests continue to produce byte-identical packed arrays vs the
brute-force oracle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the caps-sa 0.2.0 release: the SA build now uses
caps-sa's pooled ext-mem buckets, _for_positions subset API,
AVX-512 hybrid LCP, and u32 ext-mem index type. The bump signals
the additive feature surface (in-memory + ext-mem SA construction,
configurable bin-padding-safe filter) to downstream consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor the suffix array + SAindex construction pipeline on the human
genome (GRCh38, 32 threads on a 256-core AMD EPYC box) so the peak
resident set is dominated by genuine working sets, not by allocator
slack or in-RAM intermediates that just get written to disk after.

## Changes

### Stream the SA to disk; build SAindex on the fly (then in parallel)

- New `index::packed_stream::PackedStreamWriter` — bit-for-bit-compatible
  streaming writer for STAR's `PackedArray` format. Used by the
  streaming SA emit path; never materialises the 25 GB-class in-RAM
  `PackedArray` the previous in-memory build held.
- New `sa_build::build_streaming` — caps-sa drives an `FnMut(packed) ->
  Result<(), Error>` callback once per SA entry in lex order. Returns
  `(gstrand_bit, gstrand_mask, n_entries)`. The existing
  `SuffixArray::build` stays as a thin wrapper for tests / callers that
  need the SA in RAM.
- New `GenomeIndex::generate_streaming(params)` orchestrator.
  `genomeGenerate` calls this directly; `GenomeIndex::build` +
  `GenomeIndex::write` remain for tests.

### SAindex now parallel, no longer inline with the SA emit

Previously every SAindex k-mer extraction (`SaIndexBuilder::add`) sat
on caps-sa's single-threaded phase-4 emit loop. On the human genome
that's ~16 min of pure serial work added to caps-sa's otherwise-
parallel SA build (~7 min phase 1-4 with 32 threads).

- `SaIndex::build_parallel(&genome, &sa_file, ...)` reads the on-disk SA
  via chunked `pread` (1 M entries per chunk = ~4 MB buffer per worker,
  ~128 MB peak across 32 workers). SA pages live in *kernel* page
  cache, **not** process RSS — `pread` instead of `memmap2::Mmap`
  matters here: an mmap of the 24 GB SA file would pull the touched
  pages back into `RssFile`, blowing the peak by 24 GB.
- Each worker decodes packed values, extracts k-mers from
  `genome.sequence`, and `AtomicU64::fetch_min`s its `sa_idx` into a
  `Vec<AtomicU64>` keyed by `(k, kmer_idx)` — first-occurrence is
  monotonic under `fetch_min`, so the final value is exactly the
  smallest `sa_idx` covering each k-mer.
- Final sequential pass packs the `Vec<AtomicU64>` into the SAindex's
  `PackedArray`, then drops the AtomicU64 array.
- Inner k-mer loop is incremental (one base read per k, break on N)
  instead of re-scanning the prefix for every k=1..=nbases. Applied
  to the in-memory `SaIndex::build` too.

### Pass `&genome.sequence` directly to caps-sa (drop `text_sf`)

The previous segmented arm built a spacer-free copy of `genome.sequence`
(~6.3 GB on the human genome) so caps-sa's `SegmentedText` could model
contiguous segments. That copy is unnecessary — the predicate
`|p| genome.sequence[p] < 4` already filters spacer (5) and N (4)
positions out of the sort, so caps-sa's `lim_at` is only ever called
on ACGT positions. Pass the original spacer-bordered text with a
`SegmentedText::from_ends(spacer_positions)` instead. Saves the 6.3 GB
copy. Verified byte-identical to the spacer-free arm via the existing
`segmented_arm_matches_sentinel_arm_byte_for_byte_*` differential
tests across five fixtures.

### Bump caps-sa to 0.5

caps-sa 0.5.0 adds `build_ext_mem_for_filter` / `_with` — predicate-
driven entries that build a 1-bit-per-position bitmap + popcount
prefix sum (≈ 770 MB on the human genome) instead of requiring a
materialised `Vec<u64>` of kept positions (~47 GB before this change).
~60× memory reduction for STAR's ACGT-only sampling pattern.

### Rayon pool sized from `--runThreadN` at the `run()` dispatcher

Previously only `align_reads` configured the rayon pool;
`genomeGenerate` fell through to rayon's default of
`num_cpus::get()` *logical* cores. On a 256-core machine that was 256
worker threads (vs the requested 32), and with mimalloc's per-thread
heaps each holding some MB of segment cache, this added ~15 GB of
allocator slack to peak RSS for no benefit. Move the
`ThreadPoolBuilder` call into `run()` so both run-modes honour
`--runThreadN`.

### mimalloc as the global allocator

`mimalloc` (0.1, default-features = false) replaces glibc's malloc.
Lower per-allocation cost on the millions of small allocations
caps-sa makes during phase 1; per-thread heaps return whole segments
to the OS when abandoned, so the cache size stays bounded.

## Bench numbers (GRCh38, 32 threads)

```
                                            peak RSS     wall
baseline (_for_positions path)              ~113 GB     30+ min
+ caps-sa 0.5 filter API                    ~47 GB      25 min
+ streaming SA + on-the-fly SAindex         ~32 GB      22:46 (256t)
+ Stage 1 (no text_sf) + rayon-thread fix   12.0 GB     28:07 (32t)
+ decoupled parallel SAindex via pread      10.7 GB     18:14
```

For reference: caps-sa CLI on the same input is ~5 GB / ~12 min. The
remaining gap is genome.sequence (6.3 GB) + SAindex `Vec<AtomicU64>`
working set (2.86 GB transient) + caps-sa scratch + SAindex output
(1.5 GB).

## Tests + verification

- 407/407 lib tests pass.
- `streaming_build_matches_in_memory_build` test asserts the streaming
  emit's byte stream is bit-for-bit identical to the in-memory
  `SuffixArray::build`'s `PackedArray::data()`.
- `segmented_arm_matches_sentinel_arm_byte_for_byte_*` tests still
  pass across all 5 fixtures, confirming Stage 1 (drop `text_sf`)
  preserves the byte-identical-to-STAR SA.
- `clippy --release` clean (one transitive caps-sa doc-comment nit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to `SaIndex::build_parallel`, both ported from STAR's
`genomeSAindex.cpp::genomeSAindexChunk`:

## Phase 1: isaStep + binary search skip (~80× SAindex speedup)

Consecutive `SA[isa]` entries share monotonically non-decreasing
k-mer prefixes (the SA is sorted lex). Walking every entry — which
the previous parallel build did — is `O(nSA × nbases)` work, but
distinct k-mers number at most `4^nbases = 268 M` for nbases=14, so
~22 consecutive entries share their full 14-mer on the human genome.
STAR's algorithm visits ~boundary entries only:

  1. Keep per-worker `ind0_local[il]` = last-written k-mer index at
     level `il`. Skip writes when `ind_pref <= ind0_local[il]`.
  2. After each visit, jump by `isa_step = nSA / 4^nbases`; when
     `(indFull, iL4)` changes, binary-search inside the last
     `isa_step` window for the exact boundary.

Total visited entries per worker chunk ≈ `chunk_size / isa_step ×
log isa_step` (~1.7 M for a 1 M-entry chunk on the human genome) —
17 × less work than visiting every entry, and ~95 % of the
`fetch_min` ops are gone because we only emit at boundaries.

Result: SAindex phase **10:50 → 0:08** (~80× speedup) on GRCh38 /
32 threads. Total rustar `genomeGenerate` wall **18:14 → 7:36**.

## Phase 2: STAR's absent-slot encoding

Absent slots used to carry a constant sentinel
`(1 << gstrand_bit+2) | ((1 << gstrand_bit) - 1)`. STAR instead
fills each absent slot with `next_present_sa_idx | absent_mask`
(or `nSA | absent_mask` for tail-gaps after a level's last present
slot). Implemented as a sequential **backward** pass per SAindex
level inside `firsts[]`, replacing `u64::MAX` sentinels with the
correct value before the final pack into `PackedArray`. ~3 s for
all 358 M slots; small enough to keep sequential.

`hierarchical_lookup` is unchanged because it only consults the
absent flag bit, not the slot's value. This change makes the
on-disk `SAindex` bytes closer to STAR's; the only remaining
divergence is the N flag bit at `gstrand_bit + 1`, which we don't
track (and which `hierarchical_lookup` also ignores).

## Per-worker SA buffering

Per-worker chunk pre-read of the SA byte range stays the same
(4 MB per worker), but accesses are now random within the buffer
(STAR's skip algorithm jumps + binary-searches inside the chunk).
The buffer keeps every visit in L1/L2; only `genome.sequence`
remains cache-cold, and STAR's algorithm minimises those reads to
~boundaries only.

## Benchmark (GRCh38, 32 threads on a 256-core EPYC)

Final head-to-head vs upstream STAR 2.7.11b:

  rustar-aligner:  7:36 wall, 10.66 GB peak RSS
  STAR (C++):     11:26 wall, 31.42 GB peak RSS

rustar is now both faster and ~3× lighter on the human genome.

407/407 lib tests + 14/14 integration tests pass; the integration
tests exercise `generate_streaming` → `SaIndex::build_parallel` via
`assert_cmd::cargo_bin_cmd`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rob-p rob-p merged commit f61ef07 into main Jun 1, 2026
10 checks passed
@rob-p rob-p deleted the caps-sa branch June 1, 2026 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants