Define 0/0 fold change and percent change as 0.0 instead of NaN#75
Define 0/0 fold change and percent change as 0.0 instead of NaN#75LeonHafner wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the pdex library to ensure that features unexpressed in both groups (resulting in a 0/0 division) report a fold change and percent change of 0.0 instead of NaN. This behavior is documented across the codebase and verified with new unit tests. The reviewer feedback suggests optimizing the implementation of log2_fold_change and percent_change in src/pdex/_math.py by replacing array-wide operations and boolean indexing with explicit parallel loops using nb.prange to avoid intermediate array allocations and improve performance under Numba.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| lfc = np.log2((x + epsilon) / (y + epsilon)) | ||
| lfc[np.isnan(lfc)] = 0.0 | ||
| return lfc |
There was a problem hiding this comment.
Using array-wide operations like np.log2((x + epsilon) / (y + epsilon)) followed by boolean indexing lfc[np.isnan(lfc)] = 0.0 inside a @nb.njit(parallel=True) function triggers multiple intermediate array allocations (one for the division/log2, one for the boolean mask, and one for the indexing).\n\nSince this function is called repeatedly in a loop over all groups, we can optimize performance and memory usage by writing an explicit parallel loop using nb.prange. This allows Numba to compile the entire operation into a single, allocation-free parallel pass.
| lfc = np.log2((x + epsilon) / (y + epsilon)) | |
| lfc[np.isnan(lfc)] = 0.0 | |
| return lfc | |
| n = len(x) | |
| lfc = np.empty(n, dtype=np.float64) | |
| for i in nb.prange(n): # ty: ignore[not-iterable] | |
| val = np.log2((x[i] + epsilon) / (y[i] + epsilon)) | |
| lfc[i] = 0.0 if np.isnan(val) else val | |
| return lfc |
There was a problem hiding this comment.
The allocation observation is accurate: pc[np.isnan(pc)] = 0.0 allocates a temporary boolean mask and adds a second pass (the arithmetic itself is already fused into one parallel loop by ParallelAccelerator). The loop form is genuinely allocation-minimal, and it's semantically identical (0/0 → 0.0, one-sided zeros stay ±inf).
However, the efficiency win doesn't show up in practice. Benchmarking both on realistic sizes, they're within noise at every size, and the "speedup" swings both ways:
| n | vectorized | prange loop |
|---|---|---|
| 1 (on_target) | 21 µs | 22 µs |
| 2,000 | 21 µs | 25 µs |
| 20,000 | 29 µs | 21 µs |
| 200,000 | 24 µs | 26 µs |
The giveaway is that per-call cost stays ~20 µs even at 200k elements — so the cost is Numba's parallel-dispatch overhead, not allocations or the elementwise work. In pdex these run on gene-length vectors (a few thousand elements) once per group, and are dwarfed by the MWU test and pseudobulk, so it's not a hot path. (Minor: in on_target mode these are called with length-1 arrays, where spinning up a prange is pure overhead, and the explicit loop pins the output dtype to float64.)
| pc = (x - y) / (y + prior_count) | ||
| pc[np.isnan(pc)] = 0.0 | ||
| return pc |
There was a problem hiding this comment.
Similar to log2_fold_change, using array-wide division followed by boolean indexing pc[np.isnan(pc)] = 0.0 inside a @nb.njit(parallel=True) function results in multiple intermediate array allocations.\n\nRewriting this with an explicit parallel loop using nb.prange avoids these allocations and maximizes execution efficiency under Numba.
n = len(x)
pc = np.empty(n, dtype=np.float64)
for i in nb.prange(n): # ty: ignore[not-iterable]
val = (x[i] - y[i]) / (y[i] + prior_count)
pc[i] = 0.0 if np.isnan(val) else val
return pc
Summary
A feature that is unexpressed in both the target and reference groups produced
NaNin thelog2_fold_changeandpercent_changecolumns whenepsilon == 0(the default). Both metrics evaluate0 / 0in that case:log2_fold_change=log2((0 + 0) / (0 + 0))=log2(NaN)=NaNpercent_change=(0 - 0) / (0 + 0)=NaNA gene with zero expression on both sides shows no change, so
NaNis misleading and awkward to handle downstream (filtering, plotting, sorting). This PR defines the0/0case as0.0for both columns.Changes
src/pdex/_math.py— after computing each metric, replaceNaNwith0.0inside the numba-jitted helpers:Only
NaNis touched, so legitimate±infvalues from one-sided zeros (a gene expressed in one group but not the other) are preserved.pdex()docstring (theepsilonparameter and the Returns section),CLAUDE.md, andREADME.mdoutput-schema tables.Why this is safe
NaNcan only ever arise from the0/0case here: pseudobulk means are non-negative, so the ratio is always≥ 0, andlog2/ division of a non-negative value yields a finite number,±inf, orNaNonly when numerator and denominator are both zero. So replacingNaNtargets exactly the "unexpressed in both groups" case and nothing else.NaN0.0-inf(log2)-inf(unchanged)+inf(log2)+inf(unchanged)Behavior change
pdex(...)output with the defaultepsilon=0.0now returns0.0instead ofNaNfor genes unexpressed in both groups. Callers that special-cased or droppedNaNrows should be aware. A positiveepsilonalready produced0.0for these genes, so only theepsilon == 0path changes.Tests
tests/test_math.py— unit tests for both helpers:0/0 → 0.0, plus a mixed case confirming finite ratios and±infare left untouched.tests/test_pdex.py— newTestUnexpressedInBothGroupscovering all three modes (ref,all,on_target) acrosslog2_fold_change,fold_change, andpercent_change, plus a one-sided-zero regression test.All tests pass;
ruff format,ruff check, andty checkare clean.