Skip to content

perf: "two-pass" seurat hvg via scanpy.get.aggregate#4013

Merged
ilan-gold merged 91 commits into
mainfrom
ig/two_pass_hvg_v3
Jun 29, 2026
Merged

perf: "two-pass" seurat hvg via scanpy.get.aggregate#4013
ilan-gold merged 91 commits into
mainfrom
ig/two_pass_hvg_v3

Conversation

@ilan-gold

@ilan-gold ilan-gold commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

An idea that popped into my head for disk-bound datasets but likely also normal ones. This should, in theory, greatly improve on-disk access and produce speed ups for disk bound data by reducing the amount of i/o in the worst case, unordered scenario (while, I would guess, leaving in-memory datasets untocuhed or maybe improved thanks to memory access + more efficient mean/var).

Dependent on #4143

  • Closes #
  • Tests included or not required because:

@ilan-gold ilan-gold added this to the 1.12.1 milestone Mar 26, 2026
@ilan-gold ilan-gold changed the title perf: "two-pass" seurat hvg3 via scanpy.get.aggregate perf: "two-pass" seurat hvg via scanpy.get.aggregate Mar 26, 2026
@codecov

codecov Bot commented Mar 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@9c6d029). Learn more about missing BASE report.
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/scanpy/preprocessing/_highly_variable_genes.py 90.47% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4013   +/-   ##
=======================================
  Coverage        ?   79.73%           
=======================================
  Files           ?      120           
  Lines           ?    12852           
  Branches        ?        0           
=======================================
  Hits            ?    10248           
  Misses          ?     2604           
  Partials        ?        0           
Flag Coverage Δ
hatch-test.low-vers 78.80% <90.47%> (?)
hatch-test.pre 79.60% <90.47%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/scanpy/preprocessing/_highly_variable_genes.py 94.63% <90.47%> (ø)

@scverse-benchmark

scverse-benchmark Bot commented Mar 26, 2026

Copy link
Copy Markdown

Benchmark changes

Change Before [d96d91d] After [added47] Ratio Benchmark (Parameter)
- 2G 373M 0.19 preprocessing_counts.Agg.peakmem_agg('var', False, True)
- 2.66±0.01s 21.0±0.1ms 0.01 preprocessing_counts.Agg.time_agg('var', False, True)
- 25.0±0.4ms 16.3±0.2ms 0.65 preprocessing_counts.Agg.time_agg('var', True, True)
+ 7.15±0.1ms 8.26±0.03ms 1.16 preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts')

Warning

Some benchmarks failed

Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..added47416e86a6412a651f0ddad9e675491d977
Last changed: Thu, 25 Jun 2026 12:41:05 +0000

More details: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191

@flying-sheep flying-sheep modified the milestones: 1.12.1, 1.12.2 Apr 10, 2026
@ilan-gold

ilan-gold commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

The old seurat_v3 (on main) literally timed out with dask: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191 so the performance benefit is at least ~5x (since the timeout is 60s and this branch does 13s)

@ilan-gold ilan-gold requested a review from flying-sheep June 25, 2026 12:50
@ilan-gold ilan-gold changed the base branch from main to ig/chan_mean_var_main June 25, 2026 12:50
Base automatically changed from ig/chan_mean_var_main to main June 25, 2026 12:58

@flying-sheep flying-sheep left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very straightforward, nice idea!

Comment thread src/scanpy/preprocessing/_highly_variable_genes.py Outdated
@ilan-gold ilan-gold enabled auto-merge (squash) June 29, 2026 10:12
Comment thread src/scanpy/preprocessing/_highly_variable_genes.py Outdated
@ilan-gold ilan-gold merged commit 4c89274 into main Jun 29, 2026
13 of 14 checks passed
@ilan-gold ilan-gold deleted the ig/two_pass_hvg_v3 branch June 29, 2026 10:50
@ilan-gold

Copy link
Copy Markdown
Contributor Author

Thanks for the catch @flying-sheep !

ilan-gold added a commit that referenced this pull request Jun 29, 2026
…scanpy.get.aggregate`) (#4186)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
flying-sheep added a commit that referenced this pull request Jun 30, 2026
Co-authored-by: Zach Boldyga <zboldyga@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <flying-sheep@web.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants