Define 0/0 fold change and percent change as 0.0 instead of NaN #75

Open
LeonHafner wants to merge 2 commits into main from leonhafner/fix_zero_over_zero

Conversation

@LeonHafner
Collaborator

Summary

A feature that is unexpressed in both the target and reference groups produced NaN in the log2_fold_change and percent_change columns when epsilon == 0 (the default). Both metrics evaluate 0 / 0 in that case:

  • log2_fold_change = log2((0 + 0) / (0 + 0)) = log2(NaN) = NaN
  • percent_change = (0 - 0) / (0 + 0) = NaN

A gene with zero expression on both sides shows no change, so NaN is misleading and awkward to handle downstream (filtering, plotting, sorting). This PR defines the 0/0 case as 0.0 for both columns.
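
The NaN described above can be reproduced in a few lines. This is a minimal sketch in plain NumPy, not the pdex helpers themselves; the array values are made up, and the formulas are copied from the two bullets with both epsilon and the percent-change offset set to 0.

```python
import numpy as np

# Pseudobulk means for three hypothetical genes (values are illustrative).
# Gene 0 is unexpressed in both groups; genes 1 and 2 are expressed.
target_mean = np.array([0.0, 2.0, 4.0])
ref_mean = np.array([0.0, 1.0, 4.0])
epsilon = 0.0  # the default named in the summary

# Suppress NumPy's "invalid value" / "divide by zero" RuntimeWarnings so the
# resulting values can be inspected directly.
with np.errstate(divide="ignore", invalid="ignore"):
    log2_fold_change = np.log2((target_mean + epsilon) / (ref_mean + epsilon))
    percent_change = (target_mean - ref_mean) / (ref_mean + epsilon)

print(log2_fold_change)  # the entry for gene 0 is nan
print(percent_change)    # the entry for gene 0 is nan
```

Only the gene with zeros on both sides is affected; the other two entries are ordinary finite values.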

Changes

  • src/pdex/_math.py: after computing each metric, replace NaN with 0.0 inside the numba-jitted helpers:

        lfc = np.log2((x + epsilon) / (y + epsilon))
        lfc[np.isnan(lfc)] = 0.0

Only NaN is touched, so legitimate ±inf values from one-sided zeros (a gene expressed in one group but not the other) are preserved.

  • Documentation updated to match: pdex() docstring (the epsilon parameter and the Returns section), CLAUDE.md, and README.md output-schema tables.
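
The replacement described in the first bullet can be sketched in plain NumPy to check that only the 0/0 entries change. The two functions below are stand-ins written for illustration, not the jitted pdex helpers; their names and default arguments are assumptions based on the snippets quoted in this PR.

```python
import numpy as np


def log2_fold_change(x, y, epsilon=0.0):
    """Plain-NumPy stand-in for the jitted helper (not the pdex source)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        lfc = np.log2((x + epsilon) / (y + epsilon))
    lfc[np.isnan(lfc)] = 0.0  # only NaN is rewritten; +/-inf pass through
    return lfc


def percent_change(x, y, prior_count=0.0):
    """Plain-NumPy stand-in for the jitted helper (not the pdex source)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        pc = (x - y) / (y + prior_count)
    pc[np.isnan(pc)] = 0.0  # only NaN is rewritten; +/-inf pass through
    return pc


# Hypothetical means: both zero, target-only zero, reference-only zero, both > 0.
x = np.array([0.0, 0.0, 3.0, 8.0])  # target means
y = np.array([0.0, 5.0, 0.0, 2.0])  # reference means

lfc = log2_fold_change(x, y)
pc = percent_change(x, y)
print(lfc)
print(pc)
```

The first entry of each result is 0.0, while the one-sided zeros still produce infinities in `lfc`.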

Why this is safe

NaN can only arise from the 0/0 case here. Pseudobulk means are non-negative (and assumed finite), so for log2_fold_change the ratio is either a finite value ≥ 0, +inf, or 0/0; log2 maps the first two to a finite number or ±inf, and only 0/0 gives NaN. For percent_change, a zero denominator gives +inf when the numerator is positive and NaN only when the numerator is also zero. Replacing NaN therefore targets exactly the "unexpressed in both groups" case and nothing else.
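
The case analysis can also be written out explicitly. The block below is a restatement under the assumptions of this section: x is the target mean, y is the reference mean, both are finite and non-negative, and epsilon (and the percent-change offset) is 0.

```latex
\[
\mathrm{log2\_fold\_change}(x, y) =
\begin{cases}
0 & x = 0,\ y = 0 \quad (\text{previously NaN}) \\
-\infty & x = 0,\ y > 0 \\
+\infty & x > 0,\ y = 0 \\
\log_2(x / y) & x > 0,\ y > 0
\end{cases}
\]
\[
\mathrm{percent\_change}(x, y) =
\begin{cases}
0 & x = 0,\ y = 0 \quad (\text{previously NaN}) \\
+\infty & x > 0,\ y = 0 \\
(x - y) / y & y > 0
\end{cases}
\]
```

Note that for percent_change the case x = 0, y > 0 is finite (it evaluates to -1), so only log2_fold_change produces -inf.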

| target_mean | ref_mean | before | after |
| --- | --- | --- | --- |
| 0 | 0 | NaN | 0.0 |
| 0 | >0 | -inf (log2) | -inf (unchanged) |
| >0 | 0 | +inf (log2) | +inf (unchanged) |

Behavior change

pdex(...) output with the default epsilon=0.0 now returns 0.0 instead of NaN for genes unexpressed in both groups. Callers that special-cased or dropped NaN rows should be aware. A positive epsilon already produced 0.0 for these genes, so only the epsilon == 0 path changes.
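
For callers that previously relied on NaN to spot these genes, one possible adjustment is to test the group means directly. This is a sketch only: the arrays below are hypothetical stand-ins for per-gene output columns, and the names `target_mean` and `ref_mean` are assumed from the table above rather than confirmed against the pdex output schema.

```python
import numpy as np

# Hypothetical per-gene values standing in for pdex output columns.
target_mean = np.array([0.0, 0.0, 3.0, 8.0])
ref_mean = np.array([0.0, 5.0, 0.0, 2.0])
log2_fold_change = np.array([0.0, -np.inf, np.inf, 2.0])  # values after this PR

# Before this PR, np.isnan(log2_fold_change) marked genes unexpressed in both
# groups. After it, NaN no longer marks them, so check the means instead.
unexpressed_in_both = (target_mean == 0) & (ref_mean == 0)

# Example filter: keep genes that are expressed somewhere and have a finite
# fold change.
keep = ~unexpressed_in_both & np.isfinite(log2_fold_change)
print(unexpressed_in_both)
print(keep)
```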

Tests

  • tests/test_math.py — unit tests for both helpers: 0/0 → 0.0, plus a mixed case confirming finite ratios and ±inf are left untouched.
  • tests/test_pdex.py — new TestUnexpressedInBothGroups covering all three modes (ref, all, on_target) across log2_fold_change, fold_change, and percent_change, plus a one-sided-zero regression test.

All tests pass; ruff format, ruff check, and ty check are clean.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the pdex library to ensure that features unexpressed in both groups (resulting in a 0/0 division) report a fold change and percent change of 0.0 instead of NaN. This behavior is documented across the codebase and verified with new unit tests. The reviewer feedback suggests optimizing the implementation of log2_fold_change and percent_change in src/pdex/_math.py by replacing array-wide operations and boolean indexing with explicit parallel loops using nb.prange to avoid intermediate array allocations and improve performance under Numba.


Comment thread src/pdex/_math.py
Comment on lines +121 to +123
lfc = np.log2((x + epsilon) / (y + epsilon))
lfc[np.isnan(lfc)] = 0.0
return lfc
Contributor


medium

Using array-wide operations like np.log2((x + epsilon) / (y + epsilon)) followed by boolean indexing lfc[np.isnan(lfc)] = 0.0 inside a @nb.njit(parallel=True) function triggers multiple intermediate array allocations (one for the division/log2, one for the boolean mask, and one for the indexing).

Since this function is called repeatedly in a loop over all groups, we can optimize performance and memory usage by writing an explicit parallel loop using nb.prange. This allows Numba to compile the entire operation into a single, allocation-free parallel pass.

Suggested change

    - lfc = np.log2((x + epsilon) / (y + epsilon))
    - lfc[np.isnan(lfc)] = 0.0
    - return lfc
    + n = len(x)
    + lfc = np.empty(n, dtype=np.float64)
    + for i in nb.prange(n):  # ty: ignore[not-iterable]
    +     val = np.log2((x[i] + epsilon) / (y[i] + epsilon))
    +     lfc[i] = 0.0 if np.isnan(val) else val
    + return lfc

Collaborator Author


The allocation observation is accurate: pc[np.isnan(pc)] = 0.0 allocates a temporary boolean mask and adds a second pass (the arithmetic itself is already fused into one parallel loop by ParallelAccelerator). The loop form is genuinely allocation-minimal, and it's semantically identical (0/0 → 0.0, one-sided zeros stay ±inf).

However, the efficiency win doesn't show up in practice. I benchmarked both versions on realistic sizes: they're within noise at every size, and the "speedup" swings both ways:

| n | vectorized | prange loop |
| --- | --- | --- |
| 1 (on_target) | 21 µs | 22 µs |
| 2,000 | 21 µs | 25 µs |
| 20,000 | 29 µs | 21 µs |
| 200,000 | 24 µs | 26 µs |

The giveaway is that per-call cost stays ~20 µs even at 200k elements — so the cost is Numba's parallel-dispatch overhead, not allocations or the elementwise work. In pdex these run on gene-length vectors (a few thousand elements) once per group, and are dwarfed by the MWU test and pseudobulk, so it's not a hot path. (Minor: in on_target mode these are called with length-1 arrays, where spinning up a prange is pure overhead, and the explicit loop pins the output dtype to float64.)
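
For anyone who wants to repeat a per-call comparison like this, a timing harness could look roughly like the sketch below. It is not the script behind the table above: `per_call_seconds` and `nan_to_zero_mask` are hypothetical names, the stand-in function is plain NumPy rather than a Numba-jitted helper, and the Poisson test data is an arbitrary choice. The jitted variants would be passed in as `fn` in the same way.

```python
import time

import numpy as np


def nan_to_zero_mask(x, y):
    """Vectorized stand-in: compute log2 ratio, then overwrite NaN via a mask."""
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.log2(x / y)
    out[np.isnan(out)] = 0.0
    return out


def per_call_seconds(fn, n, repeats=200, seed=0):
    """Median wall-clock time of one fn(x, y) call on length-n vectors."""
    rng = np.random.default_rng(seed)
    # Non-negative data containing some zeros, so the 0/0 path is exercised.
    x = rng.poisson(1.0, size=n).astype(np.float64)
    y = rng.poisson(1.0, size=n).astype(np.float64)
    fn(x, y)  # warm-up call; for a jitted function this absorbs compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x, y)
        samples.append(time.perf_counter() - start)
    return float(np.median(samples))


# Sizes taken from the table above.
for n in (1, 2_000, 20_000, 200_000):
    print(n, per_call_seconds(nan_to_zero_mask, n))
```

Taking the median over many repeats, after a warm-up call, keeps one-off costs such as JIT compilation out of the reported number.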

Comment thread src/pdex/_math.py
Comment on lines +141 to +143
pc = (x - y) / (y + prior_count)
pc[np.isnan(pc)] = 0.0
return pc
Contributor


medium

Similar to log2_fold_change, using array-wide division followed by boolean indexing pc[np.isnan(pc)] = 0.0 inside a @nb.njit(parallel=True) function results in multiple intermediate array allocations.

Rewriting this with an explicit parallel loop using nb.prange avoids these allocations and maximizes execution efficiency under Numba.

    n = len(x)
    pc = np.empty(n, dtype=np.float64)
    for i in nb.prange(n):  # ty: ignore[not-iterable]
        val = (x[i] - y[i]) / (y[i] + prior_count)
        pc[i] = 0.0 if np.isnan(val) else val
    return pc

@LeonHafner LeonHafner requested a review from alexdobin June 2, 2026 00:00
@LeonHafner LeonHafner self-assigned this Jun 2, 2026