perf: "two-pass" seurat hvg via `scanpy.get.aggregate` by ilan-gold · Pull Request #4013 · scverse/scanpy

ilan-gold · 2026-03-26T13:49:28Z

An idea that popped into my head for disk-bound datasets, but likely also useful for normal ones. This should, in theory, greatly improve on-disk access and produce speedups for disk-bound data by reducing the amount of I/O in the worst-case, unordered scenario (while, I would guess, leaving in-memory datasets untouched or maybe improved thanks to better memory access + more efficient mean/var).

Dependent on #4143
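
The dependency on #4143 (titled "perf: chan's parallel mean-var algorithm for dask-backed arrays (sparse/dense)" in the timeline below) hints at why unordered batches need not cost extra I/O: per-chunk moments can be merged per batch label, so each chunk is read only once. The following is a minimal pure-Python sketch of that merge rule, not the code from this PR; `Moments`, `chunk_moments`, `combine`, and `grouped_moments` are hypothetical names chosen for illustration, and the real implementation works on arrays rather than Python lists.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Moments:
    """Running summary of a set of values (hypothetical helper type)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def variance(self, ddof: int = 1) -> float:
        return self.m2 / (self.n - ddof)


def chunk_moments(values: list[float]) -> Moments:
    """Summarize one chunk that is already in memory."""
    if not values:
        return Moments()
    mean = fmean(values)
    return Moments(len(values), mean, sum((v - mean) ** 2 for v in values))


def combine(a: Moments, b: Moments) -> Moments:
    """Chan et al.'s pairwise update: merge two summaries without raw data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def grouped_moments(chunks) -> dict[str, Moments]:
    """Read each chunk once; rows may arrive in any batch order."""
    totals: dict[str, Moments] = {}
    for chunk in chunks:
        by_label: dict[str, list[float]] = {}
        for label, value in chunk:
            by_label.setdefault(label, []).append(value)
        for label, values in by_label.items():
            totals[label] = combine(
                totals.get(label, Moments()), chunk_moments(values)
            )
    return totals


# Two chunks with interleaved batch labels "a" and "b" for a single gene.
chunks = [
    [("a", 1.0), ("b", 10.0), ("a", 3.0)],
    [("b", 14.0), ("a", 5.0), ("b", 12.0)],
]
stats = grouped_moments(chunks)
```

Here batch "a" holds the values 1, 3, 5 and batch "b" holds 10, 14, 12, so both per-batch sample variances come out equal even though neither batch is contiguous in any chunk. The design point is that `combine` only needs the three numbers per group, which is what makes a chunk-at-a-time pass possible.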

Closes #
Tests included or not required because:

Release notes not necessary because:

codecov · 2026-03-26T14:11:05Z

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@9c6d029). Learn more about missing BASE report.
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scanpy/preprocessing/_highly_variable_genes.py | 90.47% | 2 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4013   +/-   ##
=======================================
  Coverage        ?   79.73%           
=======================================
  Files           ?      120           
  Lines           ?    12852           
  Branches        ?        0           
=======================================
  Hits            ?    10248           
  Misses          ?     2604           
  Partials        ?        0

| Flag | Coverage Δ |
|---|---|
| hatch-test.low-vers | `78.80% <90.47%> (?)` |
| hatch-test.pre | `79.60% <90.47%> (?)` |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| src/scanpy/preprocessing/_highly_variable_genes.py | `94.63% <90.47%> (ø)` |

scverse-benchmark · 2026-03-26T14:29:37Z

Benchmark changes

| Change | Before [`d96d91d`] | After [`added47`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 2G | 373M | 0.19 | preprocessing_counts.Agg.peakmem_agg('var', False, True) |
| - | 2.66±0.01s | 21.0±0.1ms | 0.01 | preprocessing_counts.Agg.time_agg('var', False, True) |
| - | 25.0±0.4ms | 16.3±0.2ms | 0.65 | preprocessing_counts.Agg.time_agg('var', True, True) |
| + | 7.15±0.1ms | 8.26±0.03ms | 1.16 | preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts') |

Warning

Some benchmarks failed

Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..added47416e86a6412a651f0ddad9e675491d977
Last changed: Thu, 25 Jun 2026 12:41:05 +0000

More details: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191

ilan-gold · 2026-06-25T12:47:44Z

The old seurat_v3 (on main) literally timed out with dask: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191, so the speedup is at least ~4.6x (the timeout is 60s and this branch takes 13s).

flying-sheep

Looks very straightforward, nice idea!


ilan-gold · 2026-06-29T10:55:08Z

Thanks for the catch @flying-sheep !

…scanpy.get.aggregate`) (#4186) Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

Co-authored-by: Zach Boldyga <zboldyga@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Philipp A. <flying-sheep@web.de>

ilan-gold added 2 commits March 26, 2026 14:44

perf: "two-pass" seurat hvg3 via scanpy.get.aggregate

a625c55

chore: hvg v3 benchmark

d839e98

ilan-gold added this to the 1.12.1 milestone Mar 26, 2026

ilan-gold added the benchmark label Mar 26, 2026

ilan-gold changed the title ~~perf: "two-pass" seurat hvg3 via scanpy.get.aggregate~~ perf: "two-pass" seurat hvg via scanpy.get.aggregate Mar 26, 2026

ilan-gold added 2 commits March 26, 2026 14:56

fix: use counts

86db499

fix: use a batch key

d5a6a78

fix: not again

fdc5653

ilan-gold added 3 commits April 8, 2026 15:54

fix: compute single pass!

8f0e426

Merge branch 'main' into ig/two_pass_hvg_v3

8ad893d

fix: unique

7e0390e

flying-sheep modified the milestones: 1.12.1, 1.12.2 Apr 10, 2026

Merge branch 'main' into ig/two_pass_hvg_v3

17be530

ilan-gold mentioned this pull request Apr 15, 2026

perf: numba based aggregations for sparse data #4062

Merged

3 tasks

ilan-gold added 10 commits April 16, 2026 17:01

Merge branch 'main' into ig/two_pass_hvg_v3

cc0d67e

chore: add new dask benchmark

96c16e9

Merge branch 'main' into ig/two_pass_hvg_v3

db4bc2c

fix: actually use dask lol

478af4a

chore: really do dask

54db31b

fix: layers support

4fe84c5

fix: no view check needed

35590a4

fix: no layers eeded

db81d6e

fix: reduce number of batches

b37444e

fix: a little bit more

cf65665

ilan-gold removed the benchmark label May 15, 2026

ilan-gold added 2 commits May 15, 2026 11:19

Merge branch 'main' into ig/two_pass_hvg_v3

8f4ef78

Merge branch 'main' into ig/two_pass_hvg_v3

a7b067d

ilan-gold added 4 commits June 24, 2026 10:07

fix: back to dask

31d42ba

Merge branch 'ig/chan_mean_var_main' into ig/two_pass_hvg_v3

83d8db7

fix: no defaults

7bf2db4

Merge branch 'ig/chan_mean_var_main' into ig/two_pass_hvg_v3

added47

ilan-gold removed the benchmark label Jun 25, 2026

fix: var space

06ecaa2

ilan-gold marked this pull request as ready for review June 25, 2026 11:49

ilan-gold mentioned this pull request Jun 25, 2026

perf: chan's parallel mean-var algorithm for dask-backed arrays (sparse/dense) #4143

Merged

3 tasks

chore: relnote

1302d26

ilan-gold requested a review from flying-sheep June 25, 2026 12:50

ilan-gold changed the base branch from main to ig/chan_mean_var_main June 25, 2026 12:50

Base automatically changed from ig/chan_mean_var_main to main June 25, 2026 12:58

Merge branch 'main' into ig/two_pass_hvg_v3

761f054

flying-sheep approved these changes Jun 25, 2026

View reviewed changes

Comment thread src/scanpy/preprocessing/_highly_variable_genes.py Outdated

ilan-gold and others added 4 commits June 26, 2026 09:33

Merge branch 'main' into ig/two_pass_hvg_v3

3c87db4

Update src/scanpy/preprocessing/_highly_variable_genes.py

5e096b4

Co-authored-by: Philipp A. <flying-sheep@web.de>

[pre-commit.ci] auto fixes from pre-commit.com hooks

552ffac

for more information, see https://pre-commit.ci

Merge branch 'main' into ig/two_pass_hvg_v3

c6ede63

ilan-gold enabled auto-merge (squash) June 29, 2026 10:12

ilan-gold added 2 commits June 29, 2026 12:31

Merge branch 'main' into ig/two_pass_hvg_v3

05446b7

fix: iteration

995288a

flying-sheep reviewed Jun 29, 2026

View reviewed changes

Comment thread src/scanpy/preprocessing/_highly_variable_genes.py Outdated

Update src/scanpy/preprocessing/_highly_variable_genes.py

e7225d1

ilan-gold merged commit 4c89274 into main Jun 29, 2026
13 of 14 checks passed

ilan-gold deleted the ig/two_pass_hvg_v3 branch June 29, 2026 10:50

meeseeksmachine mentioned this pull request Jun 29, 2026

Backport PR #4013 on branch 1.12.x (perf: "two-pass" seurat hvg via scanpy.get.aggregate) #4186

Merged

ilan-gold added a commit that referenced this pull request Jun 29, 2026

Backport PR #4013 on branch 1.12.x (perf: "two-pass" seurat hvg via `…

5f64ecb

…scanpy.get.aggregate`) (#4186) Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

ilan-gold merged 91 commits into main from ig/two_pass_hvg_v3


3 participants
