
perf: chan's parallel mean-var algorithm for dask-backed arrays (sparse/dense)#4143

Merged
flying-sheep merged 54 commits into main from ig/chan_mean_var_main
Jun 25, 2026

Conversation

@ilan-gold

@ilan-gold ilan-gold commented Jun 5, 2026

Contributor

See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Based on a discussion with @zboldyga in #4118 (comment)

This has two benefits: it lets us calculate the mean and variance in one pass instead of effectively two (the sum of squares and the squared sum), and it removes a numerical instability issue that @zboldyga found the solution to (see the removed comment).
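
For readers unfamiliar with the linked algorithm, here is a minimal NumPy sketch of the pairwise update from the Wikipedia section above. It is not the code in this PR: the helper names `chunk_stats`, `combine`, and `chunked_mean_var` are hypothetical, and plain row slices stand in for dask chunks. Each chunk yields a per-column `(count, mean, M2)` triple, where `M2` is the sum of squared deviations from the chunk mean, and triples are merged without revisiting the data.

```python
import numpy as np


def chunk_stats(chunk):
    """Per-column (count, mean, M2) for one 2D chunk (rows are observations)."""
    n = chunk.shape[0]
    mean = chunk.mean(axis=0)
    m2 = ((chunk - mean) ** 2).sum(axis=0)  # squared deviations from the chunk mean
    return n, mean, m2


def combine(a, b):
    """Merge two (count, mean, M2) partial results using the parallel update."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    m2 = m2_a + m2_b + delta**2 * (n_a * n_b / n)
    return n, mean, m2


def chunked_mean_var(x, chunk_rows, ddof=0):
    """Column-wise mean and variance of x, computed chunk by chunk."""
    parts = [
        chunk_stats(x[start : start + chunk_rows])
        for start in range(0, x.shape[0], chunk_rows)
    ]
    total = parts[0]
    for part in parts[1:]:
        total = combine(total, part)
    n, mean, m2 = total
    return mean, m2 / (n - ddof)


rng = np.random.default_rng(0)
# A large offset with a small spread is where E[x^2] - E[x]^2 loses precision.
x = 1e8 + rng.normal(size=(10_000, 3))

mean, var = chunked_mean_var(x, chunk_rows=1_024)
naive_var = (x**2).mean(axis=0) - x.mean(axis=0) ** 2  # naive formula, for comparison
```

Because `combine` is associative, the partial results could also be merged in a tree rather than sequentially, which is what makes the approach suit chunked or distributed arrays. Comparing `var` and `naive_var` against `x.var(axis=0)` illustrates the instability mentioned above.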

  • Closes #
  • Tests included or not required because:

@ilan-gold ilan-gold added this to the 1.12.2 milestone Jun 5, 2026
@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.71%. Comparing base (d96d91d) to head (7bf2db4).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scanpy/get/_aggregated.py | 95.83% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##           main    #4143       +/-   ##
=========================================
+ Coverage      0   79.71%   +79.71%     
=========================================
  Files         0      120      +120     
  Lines         0    12830    +12830     
=========================================
+ Hits          0    10227    +10227     
- Misses        0     2603     +2603     
| Flag | Coverage Δ |
|---|---|
| hatch-test.low-vers | 78.74% <39.58%> (?) |
| hatch-test.pre | 79.57% <95.83%> (?) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/scanpy/get/_aggregated.py | 92.94% <95.83%> (ø) |

... and 119 files with indirect coverage changes

@ilan-gold ilan-gold marked this pull request as ready for review June 5, 2026 15:08
@ilan-gold ilan-gold changed the title perf: chan's parallel mean-var algorithm for dask perf: chan's parallel mean-var algorithm for dask-backed arrays (sparse/dense) Jun 8, 2026
@zboldyga

zboldyga commented Jun 10, 2026

Contributor

@ilan-gold I added njit support, see #4153. This enables rank_genes_groups to use njit.

I integrated this into the rank_genes_groups PR and benchmarked both there and here; it gives a speedup in both at typical group × gene sizes.

@ilan-gold

Contributor Author

Nice! I commented on something there, but once you have the pre-commit fixed as well, I'll merge it into this.

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ilan-gold ilan-gold force-pushed the ig/chan_mean_var_main branch 2 times, most recently from 6e18c02 to 196e443 Compare June 24, 2026 08:03
@scverse-benchmark


Benchmark changes

| Change | Before [d96d91d] | After [7bf2db4] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 1.96G | 381M | 0.19 | preprocessing_counts.Agg.peakmem_agg('var', False, True) |
| - | 898±30ms | 736±5ms | 0.82 | preprocessing_counts.Agg.time_agg('count_nonzero', False, False) |
| - | 2.58±0.01s | 19.9±0.7ms | 0.01 | preprocessing_counts.Agg.time_agg('var', False, True) |
| - | 23.6±0.7ms | 16.3±0.3ms | 0.69 | preprocessing_counts.Agg.time_agg('var', True, True) |
| - | 8.48±0.1ms | 7.32±0.3ms | 0.86 | preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts-off-axis') |

Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..7bf2db4aa11c28d5e6ed01644453bb742bab6375
Last changed: Thu, 25 Jun 2026 11:24:00 +0000

More details: https://github.com/scverse/scanpy/pull/4143/checks?check_run_id=83408969507

@ilan-gold

Contributor Author

I have no idea why this CI job is failing, but the one in #4013 passes and contains this branch.

@flying-sheep flying-sheep merged commit bd7568e into main Jun 25, 2026
16 of 19 checks passed
@flying-sheep flying-sheep deleted the ig/chan_mean_var_main branch June 25, 2026 12:58
ilan-gold added a commit that referenced this pull request Jun 25, 2026
…gorithm for dask-backed arrays (sparse/dense)) (#4180)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

3 participants