Define 0/0 fold change and percent change as 0.0 instead of NaN by LeonHafner · Pull Request #75 · ArcInstitute/pdex

LeonHafner · 2026-06-01T23:35:32Z

Summary

A feature that is unexpressed in both the target and reference groups produced NaN in the log2_fold_change and percent_change columns when epsilon == 0 (the default). Both metrics evaluate 0 / 0 in that case:

log2_fold_change = log2((0 + 0) / (0 + 0)) = log2(NaN) = NaN
percent_change = (0 - 0) / (0 + 0) = NaN

A gene with zero expression on both sides shows no change, so NaN is misleading and awkward to handle downstream (filtering, plotting, sorting). This PR defines the 0/0 case as 0.0 for both columns.
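The failure mode described above can be reproduced outside pdex with plain NumPy. This is only a sketch: the arrays are made-up pseudobulk means, and the two expressions mirror the formulas quoted in this description rather than calling the actual jitted helpers.

```python
import numpy as np

# Hypothetical pseudobulk means for three genes; the first is unexpressed in both groups.
target = np.array([0.0, 0.0, 4.0])
ref = np.array([0.0, 2.0, 2.0])
epsilon = 0.0  # the default discussed in this PR

# Silence the warnings NumPy emits for x/0 and 0/0 so the output stays readable.
with np.errstate(divide="ignore", invalid="ignore"):
    lfc = np.log2((target + epsilon) / (ref + epsilon))
    pct = (target - ref) / (ref + epsilon)

print(lfc)  # first entry is NaN (0/0); second is -inf; third is 1.0
print(pct)  # first entry is NaN (0/0); the others are finite
```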

Changes

src/pdex/_math.py — after computing each metric, replace NaN with 0.0 inside the numba-jitted helpers:

lfc = np.log2((x + epsilon) / (y + epsilon))
lfc[np.isnan(lfc)] = 0.0

Only NaN is touched, so legitimate ±inf values from one-sided zeros (a gene expressed in one group but not the other) are preserved.
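A NumPy-only stand-in can illustrate why the NaN replacement leaves one-sided zeros alone. The function name `log2_fold_change_sketch` is hypothetical and the code is not Numba-jitted; it only mirrors the two lines shown above.

```python
import numpy as np

def log2_fold_change_sketch(x, y, epsilon=0.0):
    """NumPy-only stand-in for the jitted helper (hypothetical name)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        lfc = np.log2((x + epsilon) / (y + epsilon))
    # Only NaN entries are overwritten; +inf and -inf are not NaN and survive.
    lfc[np.isnan(lfc)] = 0.0
    return lfc

# Genes: both zero, target zero only, reference zero only, both expressed.
result = log2_fold_change_sketch([0.0, 0.0, 3.0, 8.0], [0.0, 5.0, 0.0, 2.0])
print(result)
```

The first entry comes back as 0.0, the second and third keep their infinite values, and the last is the ordinary finite log ratio.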

Documentation updated to match: pdex() docstring (the epsilon parameter and the Returns section), CLAUDE.md, and README.md output-schema tables.

Why this is safe

NaN can only arise from the 0/0 case here. Pseudobulk means are non-negative, so (assuming the means themselves are finite) each division yields a finite number or ±inf unless numerator and denominator are both zero, which is the only way to get NaN. The ratio passed to log2 is always ≥ 0, and log2 of a non-negative value is finite or ±inf, so it never introduces a new NaN. Replacing NaN therefore targets exactly the "unexpressed in both groups" case and nothing else.

target_mean	ref_mean	before	after
0	0	`NaN`	`0.0`
0	>0	`-inf` (log2)	`-inf` (unchanged)
>0	0	`+inf` (log2)	`+inf` (unchanged)
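The ±inf rows in the table above are marked "(log2)". For percent change, the same three cases can be checked with a small NumPy stand-in that follows the formula in this PR, `(x - y) / (y + prior_count)`. The function name is hypothetical and `prior_count` is assumed to default to 0.0 for the illustration.

```python
import numpy as np

def percent_change_sketch(x, y, prior_count=0.0):
    """NumPy-only stand-in for the percent-change helper (hypothetical name)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        pc = (x - y) / (y + prior_count)
    pc[np.isnan(pc)] = 0.0  # the 0/0 case
    return pc

# (target_mean, ref_mean) pairs matching the three table rows.
cases = {"both zero": (0.0, 0.0), "target zero": (0.0, 2.0), "ref zero": (2.0, 0.0)}
results = {label: percent_change_sketch([t], [r])[0] for label, (t, r) in cases.items()}
for label, value in results.items():
    print(label, value)
```

Under this formula the "target zero" case is a finite -1.0 for percent change, since the denominator is the non-zero reference mean; only the "ref zero" case is infinite.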

Behavior change

pdex(...) output with the default epsilon=0.0 now returns 0.0 instead of NaN for genes unexpressed in both groups. Callers that special-cased or dropped NaN rows should be aware. A positive epsilon already produced 0.0 for these genes, so only the epsilon == 0 path changes.
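For callers that previously used NaN to find genes unexpressed in both groups, one option is to test the means directly, because 0.0 is now also the value for genes that are expressed but unchanged. The sketch below uses plain arrays with made-up values; the names `target_mean` and `ref_mean` are borrowed from the table header above and are an assumption about the output schema.

```python
import numpy as np

# Hypothetical per-gene values; column names are assumed, not taken from the pdex schema.
target_mean = np.array([0.0, 0.0, 4.0, 2.0])
ref_mean = np.array([0.0, 2.0, 2.0, 2.0])
log2_fold_change = np.array([0.0, -np.inf, 1.0, 0.0])  # as returned after this change

# Previously: np.isnan(log2_fold_change) flagged the first gene.
# Now the first and last genes both report 0.0, so the means disambiguate them.
unexpressed_in_both = (target_mean == 0) & (ref_mean == 0)
expressed_unchanged = (log2_fold_change == 0) & ~unexpressed_in_both

print(unexpressed_in_both)
print(expressed_unchanged)
```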

Tests

tests/test_math.py — unit tests for both helpers: 0/0 → 0.0, plus a mixed case confirming finite ratios and ±inf are left untouched.
tests/test_pdex.py — new TestUnexpressedInBothGroups covering all three modes (ref, all, on_target) across log2_fold_change, fold_change, and percent_change, plus a one-sided-zero regression test.

All tests pass; ruff format, ruff check, and ty check are clean.

gemini-code-assist

Code Review

This pull request updates the pdex library to ensure that features unexpressed in both groups (resulting in a 0/0 division) report a fold change and percent change of 0.0 instead of NaN. This behavior is documented across the codebase and verified with new unit tests. The reviewer feedback suggests optimizing the implementation of log2_fold_change and percent_change in src/pdex/_math.py by replacing array-wide operations and boolean indexing with explicit parallel loops using nb.prange to avoid intermediate array allocations and improve performance under Numba.

gemini-code-assist · 2026-06-01T23:36:23Z

+    lfc = np.log2((x + epsilon) / (y + epsilon))
+    lfc[np.isnan(lfc)] = 0.0
+    return lfc


Using array-wide operations like np.log2((x + epsilon) / (y + epsilon)) followed by boolean indexing lfc[np.isnan(lfc)] = 0.0 inside a @nb.njit(parallel=True) function triggers multiple intermediate array allocations (one for the division/log2, one for the boolean mask, and one for the indexing).

Since this function is called repeatedly in a loop over all groups, we can optimize performance and memory usage by writing an explicit parallel loop using nb.prange. This allows Numba to compile the entire operation into a single, allocation-free parallel pass.

Suggested change

-    lfc = np.log2((x + epsilon) / (y + epsilon))
-    lfc[np.isnan(lfc)] = 0.0
-    return lfc
+    n = len(x)
+    lfc = np.empty(n, dtype=np.float64)
+    for i in nb.prange(n):  # ty: ignore[not-iterable]
+        val = np.log2((x[i] + epsilon) / (y[i] + epsilon))
+        lfc[i] = 0.0 if np.isnan(val) else val
+    return lfc

LeonHafner · Jun 1, 2026

The allocation observation is accurate: pc[np.isnan(pc)] = 0.0 allocates a temporary boolean mask and adds a second pass (the arithmetic itself is already fused into one parallel loop by ParallelAccelerator). The loop form is genuinely allocation-minimal, and it's semantically identical (0/0 → 0.0, one-sided zeros stay ±inf).

However, the efficiency win doesn't show up in practice. I benchmarked both versions on realistic sizes: they're within noise at every size, and the "speedup" swings both ways:

n	vectorized	prange loop
1 (on_target)	21 µs	22 µs
2,000	21 µs	25 µs
20,000	29 µs	21 µs
200,000	24 µs	26 µs

The giveaway is that per-call cost stays ~20 µs even at 200k elements — so the cost is Numba's parallel-dispatch overhead, not allocations or the elementwise work. In pdex these run on gene-length vectors (a few thousand elements) once per group, and are dwarfed by the MWU test and pseudobulk, so it's not a hot path. (Minor: in on_target mode these are called with length-1 arrays, where spinning up a prange is pure overhead, and the explicit loop pins the output dtype to float64.)
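A per-call timing comparison of this kind could be set up with a small harness like the one below. This is a sketch, not the script behind the numbers above: `per_call_microseconds` and `vectorized` are hypothetical names, and `vectorized` is a pure NumPy stand-in, so its timings will scale with n and will not match the Numba figures in the table. To compare the real helpers, the jitted functions would be passed in as `fn` instead.

```python
import timeit
import numpy as np

def per_call_microseconds(fn, n, repeats=5, calls=50):
    """Best-of-`repeats` mean time per call of fn(x, y) on length-n inputs."""
    rng = np.random.default_rng(0)
    x = rng.random(n)
    y = rng.random(n)
    fn(x, y)  # warm-up call so any one-time compilation cost is excluded
    best = min(timeit.repeat(lambda: fn(x, y), number=calls, repeat=repeats))
    return best / calls * 1e6

def vectorized(x, y):
    # NumPy stand-in for the array-wide form (not jitted).
    with np.errstate(divide="ignore", invalid="ignore"):
        lfc = np.log2(x / y)
    lfc[np.isnan(lfc)] = 0.0
    return lfc

for n in (1, 2_000, 20_000, 200_000):
    print(n, round(per_call_microseconds(vectorized, n), 1), "µs")
```

Taking the minimum over several repeats reduces the influence of scheduler noise, which matters when the differences being measured are a few microseconds.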

gemini-code-assist · 2026-06-01T23:36:23Z

+    pc = (x - y) / (y + prior_count)
+    pc[np.isnan(pc)] = 0.0
+    return pc


Similar to log2_fold_change, using array-wide division followed by boolean indexing pc[np.isnan(pc)] = 0.0 inside a @nb.njit(parallel=True) function results in multiple intermediate array allocations.

Rewriting this with an explicit parallel loop using nb.prange avoids these allocations and maximizes execution efficiency under Numba.

    n = len(x)
    pc = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):  # ty: ignore[not-iterable]
        val = (x[i] - y[i]) / (y[i] + prior_count)
        pc[i] = 0.0 if np.isnan(val) else val
    return pc

LeonHafner added 2 commits June 1, 2026 23:13

fix_zero_over_zero

1b2fb39

bump version

052f721

gemini-code-assist Bot reviewed Jun 1, 2026

LeonHafner requested a review from alexdobin June 2, 2026 00:00

LeonHafner self-assigned this Jun 2, 2026

LeonHafner wants to merge 2 commits into main from leonhafner/fix_zero_over_zero
