Removed deep-copy data.table ops from the dataProcess pipeline by tonywu1999 · Pull Request #208 · Vitek-Lab/MSstats

tonywu1999 · 2026-05-14T15:24:06Z

User description

Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules
I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests

Description

Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests

Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d

File Walkthrough

Relevant files

Enhancement

dataProcess.R: Limit censored-value updates to matching rows (`R/dataProcess.R`, +12/-10)
- Replace `ifelse()` rewrites with targeted `:=` updates
- Only keep `predicted` on applicable censored rows
- Only overwrite `newABUNDANCE` where imputation applies

utils_checks.R: Avoid copies during input trimming and sorting (`R/utils_checks.R`, +13/-5)
- Drop unwanted columns via `data.table::set(..., NULL)`
- Avoid retained-column deep copies during preparation
- Sort rows in place with `data.table::setorder()`

utils_feature_selection.R: Collapse feature preprocessing into one pass (`R/utils_feature_selection.R`, +20/-22)
- Build the feature-selection working table once
- Handle optional `censored` values inline
- Compute `is_obs` without intermediate tables

utils_normalize.R: Use in-place cleanup and keyed fraction merges (`R/utils_normalize.R`, +23/-13)
- Remove normalization temp columns in place
- Return the modified input after cleanup
- Replace `merge()` with keyed `newRun` assignment
- Filter fractions by direct `(FEATURE, FRACTION)` lookup

Bug fix

utils_output.R: Streamline summary output and imputation joins (`R/utils_output.R`, +51/-27)
- Extract combined `predicted_survival` before finalization
- Free nested survival entries before binding summaries
- Replace merge-based imputation with keyed writes
- Guard join columns and update flags in place

Tests

test_memory_optimization_copies.R: Add memory regression tests for copy avoidance (`inst/tinytest/test_memory_optimization_copies.R`, +469/-0)
- Add address-based assertions for in-place behavior
- Test `.normalizeMedian` temp-column cleanup
- Test `.finalizeTMP` keyed matches and unmatched `NA`s
- Verify `MSstatsSummarizationOutput` list splitting behavior

Motivation and Context — Short summary of the solution

The dataProcess pipeline relied on copy-heavy data.table patterns (column subsetting that materializes new tables, merge-based joins, order-based rewrites, and whole-column ifelse assignments). This increased memory churn and caused regressions in the “predicted survival” imputation path.
This PR replaces those hotspots with in-place data.table operations (set(), setorder(), keyed which lookups, and targeted := updates), restructures summarization to extract and pass predicted_survival explicitly, and fixes the two regressions (join-key intersection with predicted_survival and retention of LABEL). It also adds memory/copy-avoidance tinytests to guard the behavior.

Detailed changes

General / pipeline-level
- Switch copy-heavy patterns to in-place data.table updates (data.table::set, targeted [i, j := v], and data.table::setorder()), including replacing merge-based filtering/temporary-column workflows.
- Summarization refactor: MSstatsSummarizationOutput now extracts predicted_survival from the nested per-subplot summarized output early, clears the nested survival slots, finalizes using predicted_survival, then rebuilds protein-level summaries only.
- Finalize refactor: .finalizeInput / .finalizeTMP now take predicted_survival directly (instead of a nested/summarized structure).
R/dataProcess.R
- In MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP, revise censored-value post-imputation handling:
  - Replace ifelse-based assignments with data.table conditional subset updates using :=.
  - For labeled-reference cases, set predicted to NA for non-censored relevant rows and write fitted predicted values only into newABUNDANCE for the censored subset (mirrored when is_labeled_reference == FALSE).
  - Adjust the labeled-reference fit_data subset expression (explicit negation form) while keeping survival table construction consistent with the updated predicted/newABUNDANCE state.
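The difference between a whole-column `ifelse()` rewrite and a targeted subset update can be sketched on a toy table. This is a minimal illustration, not the code from `R/dataProcess.R`: the fixture values are made up and the labeled-reference branch is left out.

```r
library(data.table)

# Toy fixture (hypothetical values): two censored rows, one observed row.
dt = data.table(censored = c(TRUE, FALSE, TRUE),
                ABUNDANCE = c(NA, 20, NA),
                predicted = c(15, 19, 16))

# Copy-heavy form: each ifelse() builds a full-length vector and
# overwrites the whole column.
old = copy(dt)
old[, predicted := ifelse(censored, predicted, NA)]
old[, newABUNDANCE := ifelse(censored, predicted, ABUNDANCE)]

# Targeted form: start from ABUNDANCE and write only the rows that change.
new = copy(dt)
new[, newABUNDANCE := ABUNDANCE]
new[censored == FALSE, predicted := NA_real_]
new[censored == TRUE, newABUNDANCE := predicted]

# Both forms end in the same state.
same_predicted = isTRUE(all.equal(old$predicted, new$predicted))
same_abundance = isTRUE(all.equal(old$newABUNDANCE, new$newABUNDANCE))
```

The targeted form touches only the matching rows, so no full-length temporary vector is needed for the update itself.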
R/utils_checks.R
- In .checkUnProcessedDataValidity:
  - Normalize/ensure required columns (e.g., add AnomalyScores as NA, coerce INTENSITY to numeric) via data.table::set-style updates.
  - Remove extra/unallowed columns by setting them to NULL rather than deferring to later column subsetting.
- In .prepareForDataProcess:
  - Compute derived columns (PEPTIDE, TRANSITION) and isotope label mapping via data.table::set-style updates.
- In .makeFactorColumns:
  - Replace input[order(...), ] with in-place data.table::setorder() using the same multi-key ordering.
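A short sketch of the two in-place operations named above, using `data.table::address()` to check that the table object is not replaced. The column names and values are a made-up fixture, not the package's input format.

```r
library(data.table)

# Hypothetical fixture with one column (EXTRA) that should be dropped.
input = data.table(PROTEIN = c("B", "A", "A"),
                   FEATURE = c("f2", "f1", "f3"),
                   ABUNDANCE = c(1, 2, 3),
                   EXTRA = 1:3)
keep_cols = c("PROTEIN", "FEATURE", "ABUNDANCE")
addr_before = address(input)

# Copying form: column subsetting materializes a new table.
subset_copy = input[, keep_cols, with = FALSE]

# In-place form: delete unwanted columns one by one, then sort by reference.
drop_cols = setdiff(colnames(input), keep_cols)
for (col in drop_cols) set(input, j = col, value = NULL)
setorder(input, PROTEIN, FEATURE)

same_object = address(input) == addr_before        # TRUE
copy_is_new = address(subset_copy) != addr_before  # TRUE
```

The trade-off is that any caller still holding `input` sees the dropped columns and the new row order as well.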
R/utils_feature_selection.R
- In .selectHighQualityFeatures:
  - Simplify optional censored handling by computing log2inty from ABUNDANCE with censored/NA treated as missing.
  - Derive is_obs solely from whether log2inty is non-NA.
  - Remove intermediate renames/dropping logic (censored → is_censored) and associated intermediate computations.
R/utils_normalize.R
- In .normalizeMedian / .normalizeGlobalStandards:
  - Remove helper columns in-place with data.table::set(..., value = NULL) instead of rebuilding tables via column-exclusion expressions.
- MSstatsMergeFractions non-TECHREPLICATE path:
  - Rework merged-run construction and assignment using keyed dcast-derived lookup + which-based indexed updates (data.table::set).
  - Filter out unobserved feature/fraction combinations (zero abundance) via keyed join-index mapping rather than merge + temporary-column filtering.
  - Renumber RUN based on newRun and clear helper columns in-place (removing the prior merge(all.x=TRUE) + tmp flow).
R/utils_output.R
- MSstatsSummarizationOutput:
  - Treat summarized == NULL as summarization failure (instead of relying on try-error inheritance).
  - Build predicted_survival as a combined data.table from summarized[[i]][[2]].
  - Clear nested survival entries (summarized[[i]][[2]] = NULL) before finalization.
  - Finalize via .finalizeInput(input, predicted_survival, ...), then rebuild summarized from protein-level parts only (summarized[[i]][[1]]).
  - Drop non-output columns from input in-place and return data.frame via data.table::setDF().
- .finalizeInput / .finalizeTMP:
  - Change to accept predicted_survival directly.
  - In .finalizeTMP(impute=TRUE):
    - Replace merge-based prediction application with indexed lookup + targeted writes to newABUNDANCE and predicted.
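One way to read "indexed lookup + targeted writes" is a join that returns row indices (`which = TRUE`) followed by `data.table::set()`. The sketch below assumes that reading; the fixture, the candidate key list, and the exact write rules are illustrative and are not copied from `.finalizeTMP`.

```r
library(data.table)

# Hypothetical fixtures.
input = data.table(RUN = c(1, 1, 2, 2),
                   FEATURE = c("f1", "f2", "f1", "f2"),
                   cen = c(TRUE, FALSE, TRUE, FALSE),
                   newABUNDANCE = c(NA, 20, NA, 22))
predicted_survival = data.table(RUN = 1, FEATURE = "f1", cen = TRUE,
                                predicted = 15.5)

# Join only on columns present in both tables, so a missing LABEL
# column cannot cause an error.
join_cols = intersect(c("LABEL", "RUN", "FEATURE", "cen"),
                      intersect(colnames(input), colnames(predicted_survival)))

addr_before = address(input)

# For each input row: index of its match in predicted_survival, or NA.
idx = predicted_survival[input, on = join_cols, which = TRUE, mult = "first"]

# Add `predicted` (NA where unmatched) and overwrite newABUNDANCE
# only on matched rows, all by reference.
set(input, j = "predicted", value = predicted_survival$predicted[idx])
hit = which(!is.na(idx))
set(input, i = hit, j = "newABUNDANCE", value = input$predicted[hit])

same_object = address(input) == addr_before
```

Unlike `merge(all.x = TRUE)`, this never builds a second full-width copy of `input`; the only new allocations are the index vector and the `predicted` column.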
Regressions fixed
- .finalizeTMP join regression: join columns are now explicitly intersected between input and predicted_survival, and the intersection set explicitly includes LABEL, preventing missing-key errors and preserving downstream assumptions.
- Survival-column selection regression: LABEL is retained in the selection/join-column intersection required by later logic/tests.
Documentation
- Update .finalizeInput and .finalizeTMP help topics to reflect the new predicted_survival parameter and removal of the previously documented summarized argument.

Unit tests added / modified

Added: inst/tinytest/test_memory_optimization_copies.R
- Asserts in-place behavior (via data.table::address()) for .normalizeMedian and .finalizeTMP on small and scaled fixtures.
- Verifies temporary helper columns (ABUNDANCE_RUN, ABUNDANCE_FRACTION, and other helpers like newRun) are removed while keeping required output columns.
- Validates .finalizeTMP imputation semantics:
  - matched (cen, RUN, FEATURE) keys update newABUNDANCE from predicted_survival
  - unmatched keys yield NA
  - predicted column is added
- Confirms the predicted_survival contract:
  - .finalizeTMP accepts a combined data.table (not nested list input)
  - MSstatsSummarizationOutput correctly decomposes a nested summarized_list into feature/protein outputs
- Includes a size comparison showing the combined predicted_survival table is smaller than the full nested summarized_list.
Added: inst/tinytest/test_MSstatsMergeFractions.R
- Characterizes MSstatsMergeFractions behavior:
  - single-fraction preservation and originalRUN handling
  - ABUNDANCE clamping before merging (including INTENSITY edge cases)
  - non-TECHREPLICATE merged-run collapse semantics, row dropping for unobserved feature/fraction combos, and factor level expectations
  - absence of helper newRun in outputs

Coding guidelines / potential violations

Caller-side mutation risk from in-place data.table writes
- Multiple changes intentionally mutate data.tables by reference for memory reasons; this can break expectations if the same input object is reused elsewhere.
- The review concern specifically targeted .selectHighQualityFeatures: the implementation now projects input into a reduced table (input = input[, list(...)]) before performing := updates, which confines mutations to the projected table rather than mutating the caller’s original input.
Internal signature/API contract drift
- Internal functions changed from operating on nested summarized structures to accepting predicted_survival directly; this is documented in man pages and guarded by new tinytests targeting the contract.

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-05-14T15:24:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 50a30da and 727d3ee.

📒 Files selected for processing (4)

R/utils_checks.R
R/utils_output.R
man/dot-finalizeInput.Rd
man/dot-finalizeTMP.Rd

✅ Files skipped from review due to trivial changes (2)

man/dot-finalizeTMP.Rd
man/dot-finalizeInput.Rd

🚧 Files skipped from review as they are similar to previous changes (2)

R/utils_checks.R
R/utils_output.R

📝 Walkthrough

Walkthrough

The PR refactors multiple MSstats data-processing paths to use in-place data.table updates, rewrites fraction merging and finalization around indexed lookups, and changes summarization output handling to pass predicted_survival directly. Tests and docs are updated to match the new behavior.

Changes

DataProcess Pipeline Memory Optimization and Output Refactoring

Layer / File(s)	Summary
Input normalization and feature selection `R/utils_checks.R`, `R/utils_feature_selection.R`	Input columns are normalized, filtered, coerced, and ordered in place, and high-quality feature selection now derives observation status from `ABUNDANCE` with optional censored handling.
Normalization cleanup and fraction merge rewrite `R/utils_normalize.R`, `inst/tinytest/test_MSstatsMergeFractions.R`, `inst/tinytest/test_memory_optimization_copies.R`	Temporary normalization columns are removed in place, fraction merging is rebuilt around join-index lookups and merged-run assignment, and tests cover merge behavior plus in-place normalization semantics.
Output assembly and predicted_survival plumbing `R/utils_output.R`, `man/dot-finalizeInput.Rd`, `man/dot-finalizeTMP.Rd`, `inst/tinytest/test_memory_optimization_copies.R`	Summarization output now extracts `predicted_survival`, finalizers consume it directly, output tables are trimmed in place, and tests/docs reflect the updated contract.
Censored-value imputation post-processing `R/dataProcess.R`	Linear and TMP summarizers replace column-wide `ifelse(...)` updates with conditional `data.table` subset assignments for censored-value post-processing and labeled-reference handling.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant SummarizationOutput as MSstatsSummarizationOutput
  participant FinalizeTMP as .finalizeTMP
  participant PredictedSurvival as predicted_survival
  participant InputTable as input
  SummarizationOutput->>PredictedSurvival: extract survival rows from summarized
  SummarizationOutput->>FinalizeTMP: pass predicted_survival into finalizer
  FinalizeTMP->>InputTable: update newABUNDANCE and predicted in place
  FinalizeTMP->>SummarizationOutput: return finalized feature/protein outputs
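The first step of the sequence above, splitting the nested per-protein list, can be sketched as follows. The inner column names (`Protein`, `LogIntensities`) are placeholders chosen for the example; only the `[[1]]` / `[[2]]` slot layout comes from the PR description.

```r
library(data.table)

# Hypothetical nested output: slot 1 = protein-level summary,
# slot 2 = survival predictions for that protein.
summarized = list(
    list(data.table(Protein = "P1", LogIntensities = 10),
         data.table(RUN = 1, FEATURE = "f1", predicted = 15)),
    list(data.table(Protein = "P2", LogIntensities = 11),
         data.table(RUN = 2, FEATURE = "f2", predicted = 16))
)

# Combine all survival slots into one table up front.
predicted_survival = rbindlist(lapply(summarized, function(x) x[[2]]),
                               fill = TRUE)

# Drop the survival slots so the nested copies can be garbage-collected
# before finalization runs.
for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL

# Only protein-level parts remain in the list.
protein_level = rbindlist(lapply(summarized, function(x) x[[1]]),
                          fill = TRUE)
```

After the loop each list element has length 1, so the finalizer receives `predicted_survival` as a flat table rather than digging through the nested structure.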

Possibly related PRs

Vitek-Lab/MSstats#174: Shares the same linear/TMP summarization and censored-imputation path in R/dataProcess.R.
Vitek-Lab/MSstats#192: Touches the labeled-reference and is_labeled_reference handling used by the TMP summarization flow.
Vitek-Lab/MSstats#193: Also changes label-aware censored post-processing around predicted and newABUNDANCE.

Poem

🐇 I hopped through tables, neat and bright,
No copied columns left in sight.
I plucked survival from the heap,
Then let the finalizers leap.
With one small hop, the data sang.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: replacing deep-copy data.table operations in the dataProcess pipeline with in-place updates.
Description check	✅ Passed	The description covers motivation, detailed changes, testing, and includes the required checklist items, so it is mostly complete.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


github-actions · 2026-05-14T15:28:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Input mutation
`MSstatsPrepareForDataProcess` now removes columns from `input` in place and then adds derived columns on the same object. If the caller passes a `data.table` they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy.

drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)
input$PEPTIDE = paste(input$PEPTIDESEQUENCE, input$PRECURSORCHARGE, sep = "_")
input$TRANSITION = paste(input$FRAGMENTION,

Output mutation
`MSstatsSummarizationOutput` now mutates its `input` argument by reference by dropping columns and converting it with `setDF()`. Because `input` is a `data.table`, the caller's object will be stripped down to output columns and lose its `data.table` class after this call, which can break any downstream code that reuses the same table.

drop_cols = setdiff(colnames(input), output_cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)
if (is.element("remove", colnames(processed))) {
    processed = processed[(remove),
                          intersect(output_cols, colnames(processed)), with = FALSE]
    input = rbind(input, processed, fill = TRUE)
}
data.table::setDF(input)
data.table::setDF(rqall)
list(FeatureLevelData = input,

github-actions · 2026-05-14T15:33:19Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue (impact: Medium): Avoid mutating caller objects

`data.table::setDF()` mutates by reference, so this function now converts the caller's `input` object into a plain `data.frame`. That is a behavioral regression from `as.data.frame(...)` and can break later code that still expects `input` to be a `data.table`.

R/utils_output.R [101-105]

-data.table::setDF(input)
-data.table::setDF(rqall)
-list(FeatureLevelData = input,
-     ProteinLevelData = rqall,
+feature_output = data.table::copy(input)
+protein_output = data.table::copy(rqall)
+data.table::setDF(feature_output)
+data.table::setDF(protein_output)
+list(FeatureLevelData = feature_output,
+     ProteinLevelData = protein_output,
      SummaryMethod = method)

Suggestion importance[1-10]: 7
Why: This is correct and important for API behavior: `data.table::setDF()` mutates `input` and `rqall` by reference, unlike the previous `as.data.frame(...)` conversion. Copying before `setDF()` preserves the optimization while avoiding surprising side effects on caller-visible `data.table` objects.

Possible issue (impact: Low): Guard nested summary extraction

Validate `summarized` before indexing `x[[2]]`. The new code dereferences every element immediately, so any `try-error` or short result now crashes here and bypasses the function's existing failure handling.

R/utils_output.R [41-44]

-predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]),
-                                           fill = TRUE)
+invalid_summary = vapply(
+    summarized,
+    function(x) inherits(x, "try-error") || length(x) < 2L,
+    logical(1)
+)
+if (any(invalid_summary)) {
+    stop("`summarized` contains failed or incomplete protein summaries.")
+}
+predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L),
+                                           fill = TRUE)
 for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL
 input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol)

Suggestion importance[1-10]: 6
Why: This is a valid robustness concern: `MSstatsSummarizationOutput()` now dereferences `x[[2]]` before the later `inherits(summarized, "try-error")` check, so malformed entries in `summarized` can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.
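The caller-side effect of `setDF()` described in the first suggestion can be reproduced on a toy table. The two helper functions below are hypothetical stand-ins written for this example, not MSstats functions.

```r
library(data.table)

# Hypothetical helper: converts in place, so the caller's object changes too.
to_df_in_place = function(input) {
    data.table::setDF(input)
    input
}

# Hypothetical helper: converts a deep copy, leaving the caller's table alone
# at the cost of one extra copy in memory.
to_df_on_copy = function(input) {
    out = data.table::copy(input)
    data.table::setDF(out)
    out
}

dt1 = data.table(RUN = 1:3, ABUNDANCE = c(10, 11, 12))
dt2 = data.table(RUN = 1:3, ABUNDANCE = c(10, 11, 12))

res1 = to_df_in_place(dt1)
res2 = to_df_on_copy(dt2)

caller1_still_dt = is.data.table(dt1)  # FALSE: class was changed by reference
caller2_still_dt = is.data.table(dt2)  # TRUE
```

Whether the extra copy is acceptable depends on whether any caller reuses the table after the call; the sketch only shows that the two forms differ in what the caller observes.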

mstaniak · 2026-05-15T10:14:47Z

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?

mstaniak

Hi,
thanks again for this update. I have a few minor comments.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

inst/tinytest/test_memory_optimization_copies.R (1)
328-369: ⚡ Quick win

Add a mixed-LABEL fixture to this contract test.

These assertions only exercise LABEL = "L", so a regression that drops LABEL from the survival projection or join keys would still pass here. A small L/H fixture with duplicated (RUN, FEATURE) values would cover the regression this stack is guarding against.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The
test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN,
FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys:
modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a
factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column
to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so
MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but
ensure the fixture includes those mixed-label cases to catch regressions that
drop LABEL from survival projection or join keys.
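Why a single-label fixture cannot catch this regression can be seen with a two-row example. The tables below are made up for illustration; they show that once `LABEL` is dropped from the join keys, heavy and light rows with the same (RUN, FEATURE) resolve to the same prediction.

```r
library(data.table)

# Hypothetical mixed-label fixture: same RUN and FEATURE, different LABEL.
input = data.table(LABEL = c("L", "H"), RUN = 1, FEATURE = "f1")
pred = data.table(LABEL = c("L", "H"), RUN = 1, FEATURE = "f1",
                  predicted = c(10, 20))

# Joining with LABEL: each input row finds its own prediction.
with_label = pred[input, on = c("LABEL", "RUN", "FEATURE"),
                  which = TRUE, mult = "first"]

# Joining without LABEL: both rows resolve to the first matching prediction.
without_label = pred[input, on = c("RUN", "FEATURE"),
                     which = TRUE, mult = "first"]
```

With only `LABEL = "L"` in the fixture both joins would return the same indices, which is why the test needs an L/H mix.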

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.

In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.


ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 0764b34.

📒 Files selected for processing (6)

R/dataProcess.R
R/utils_checks.R
R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
inst/tinytest/test_memory_optimization_copies.R

mstaniak · 2026-06-06T22:18:42Z

Hi,
thanks for the updates, I'll review these changes on Monday

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 0764b34 and 27d5f7c.

📒 Files selected for processing (6)

R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
inst/tinytest/test_MSstatsMergeFractions.R
man/dot-finalizeInput.Rd
man/dot-finalizeTMP.Rd

✅ Files skipped from review due to trivial changes (1)

man/dot-finalizeTMP.Rd

🚧 Files skipped from review as they are similar to previous changes (3)

R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R

Rudhik1904 · 2026-06-09T08:22:52Z

@mstaniak

For issue: #208 (comment)

I looked into this but think the current input[, list(...)] is actually the right call here, because of the following:

In-place := on input would corrupt the caller. .selectHighQualityFeatures is called from MSstatsSelectFeatures, which reuses the same input immediately afterwards:

features_quality = .selectHighQualityFeatures(input, min_feature_count)
input = merge(input, features_quality,
              by.x = c("LABEL", "PROTEIN", "FEATURE", "originalRUN"),
              by.y = c("label", "protein", "feature", "run"))

Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.

It's already a minimal projection, not a full-table copy. list(...) materializes only the 7 columns the feature-selection step needs — it's not the kind of whole-table duplication this PR is targeting.

Happy to reevaluate this decision if I have missed something.
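The by-reference behavior described in this comment can be checked with a small sketch. Both functions are hypothetical simplifications written for the example; they are not the body of `.selectHighQualityFeatures`.

```r
library(data.table)

# Hypothetical variant that renames and adds columns directly on its argument.
mutate_by_ref = function(input) {
    setnames(input, "FEATURE", "feature")  # also renames the caller's column
    input[, is_obs := !is.na(ABUNDANCE)]
    input
}

# Hypothetical variant that first projects into a new, smaller table.
project_then_mutate = function(input) {
    input = input[, list(feature = FEATURE, log2inty = ABUNDANCE)]
    input[, is_obs := !is.na(log2inty)]
    input
}

caller_a = data.table(FEATURE = c("f1", "f2"), ABUNDANCE = c(10, NA))
caller_b = copy(caller_a)

res_a = mutate_by_ref(caller_a)
res_b = project_then_mutate(caller_b)

names_a = names(caller_a)  # caller's table was changed
names_b = names(caller_b)  # caller's table is untouched
```

In the first case a later `merge(..., by.x = "FEATURE")` on the caller's table would fail because the column no longer exists under that name; in the second case only the local projection is modified.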

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
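The type difference behind this comment is easy to confirm. The sketch below uses placeholder column names rather than `AnomalyScores` itself.

```r
library(data.table)

dt = data.table(RUN = 1:3)

# Plain NA is a logical constant, so the new column is logical.
set(dt, j = "score_logical", value = NA)

# NA_real_ creates a double column, matching later numeric use.
set(dt, j = "score_numeric", value = NA_real_)

class_logical = class(dt$score_logical)  # "logical"
class_numeric = class(dt$score_numeric)  # "numeric"
```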

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 27d5f7c and 50a30da.

📒 Files selected for processing (3)

R/dataProcess.R
R/utils_checks.R
inst/tinytest/test_memory_optimization_copies.R

🚧 Files skipped from review as they are similar to previous changes (2)

R/dataProcess.R
inst/tinytest/test_memory_optimization_copies.R

tonywu1999 · 2026-06-15T13:41:44Z

+                         log2inty = ABUNDANCE,
+                         is_censored = if (has_censored) censored else FALSE)]
+    # Censored or missing intensities are not observations.
+    input[is_censored | is.na(log2inty), log2inty := NA]


Should this be is_censored & is.na(log2inty)? Because the negation of !(is.na(ABUNDANCE) | is_censored) switches the OR to an AND

old code was log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored), ABUNDANCE, NA)
log2inty gets the actual ABUNDANCE only for measurements that are both present and not censored;

new code:
input[is_censored | is.na(log2inty), log2inty := NA]
Any measurement that is censored or missing is not a real observation, so blank out its intensity.

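So it's doing the same thing: the old `ifelse()` returned NA exactly on the rows where `is.na(ABUNDANCE) | is_censored` was TRUE, and that is the condition the new code uses, so the OR is correct.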
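The equivalence can be checked over all four combinations of missing and censored. The fixture is made up for the check; the `&` variant is included only to show where it would differ.

```r
library(data.table)

# One row per combination: observed, missing, censored, missing and censored.
dt = data.table(ABUNDANCE = c(10, NA, 12, NA),
                is_censored = c(FALSE, FALSE, TRUE, TRUE))

# Old form: keep ABUNDANCE only when neither missing nor censored.
old = ifelse(!(is.na(dt$ABUNDANCE) | dt$is_censored), dt$ABUNDANCE, NA)

# New form: copy ABUNDANCE, then blank rows that are censored OR missing.
new_or = copy(dt)[, log2inty := ABUNDANCE]
new_or[is_censored | is.na(log2inty), log2inty := NA]

# The AND variant would leave the censored-but-present value (12) in place.
new_and = copy(dt)[, log2inty := ABUNDANCE]
new_and[is_censored & is.na(log2inty), log2inty := NA]

or_matches_old = isTRUE(all.equal(old, new_or$log2inty))    # TRUE
and_matches_old = isTRUE(all.equal(old, new_and$log2inty))  # FALSE
```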

github-actions Bot added the Review effort 4/5 label May 14, 2026

Rudhik1904 added 2 commits May 15, 2026 00:11

Updating the comments.

b64b004

Updating some more comments.

e2bd771

mstaniak requested changes May 16, 2026

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated

Comment thread R/utils_feature_selection.R Outdated

Comment thread R/utils_normalize.R Outdated

Rudhik1904 added 2 commits June 2, 2026 00:39

Responding to the comments.

65df2b5

Responding to comments- 2

0764b34

coderabbitai Bot reviewed Jun 2, 2026

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated

Comment thread R/utils_output.R Outdated

Comment thread R/utils_output.R Outdated

Rudhik1904 requested review from mstaniak June 3, 2026 00:05

tonywu1999 commented Jun 5, 2026

Comment thread R/utils_feature_selection.R Outdated

Comment thread R/utils_feature_selection.R Outdated

tonywu1999 commented Jun 5, 2026

Comment thread R/utils_normalize.R Outdated

Comment thread R/utils_normalize.R

tonywu1999 commented Jun 5, 2026

Comment thread R/utils_output.R Outdated

Comment thread R/utils_output.R Outdated

Vitek-Lab deleted a comment from coderabbitai Bot Jun 8, 2026

mstaniak requested changes Jun 8, 2026

Comment thread R/dataProcess.R

Comment thread R/dataProcess.R

Comment thread R/utils_checks.R Outdated

Comment thread R/utils_feature_selection.R Outdated

Comment thread R/utils_output.R Outdated

Responding to Tony's suggestions

27d5f7c

coderabbitai Bot reviewed Jun 9, 2026

Comment thread man/dot-finalizeInput.Rd Outdated

Rudhik1904 mentioned this pull request Jun 9, 2026

Also it should be converted to a series of ":=" I think #210

Open

Adressing Matt's comments

50a30da

coderabbitai Bot reviewed Jun 9, 2026

Comment thread R/utils_checks.R Outdated

tonywu1999 commented Jun 15, 2026

Addressing review comments: NA_real_ for AnomalyScores + doc sync

727d3ee
