
Removed deep-copy data.table ops from the dataProcess pipeline #208

Open
tonywu1999 wants to merge 8 commits into devel from MSstats/work/20260514_avoid-deep-copy-ops

tonywu1999 commented May 14, 2026

Contributor

User description

  • Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
  • Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
  • Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
  • Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
  • Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
  • Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
  • Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
  • Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
  • Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
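
The first two bullets (dropping columns with data.table::set() and sorting with data.table::setorder()) can be illustrated with a minimal sketch. The table and column names below are a hypothetical toy fixture, not the package's real input, and the snippet only assumes that data.table is installed.

```r
library(data.table)

# Hypothetical toy table; column names are illustrative only.
input <- data.table(
  PROTEIN   = c("P2", "P1", "P1"),
  FEATURE   = c("f3", "f2", "f1"),
  INTENSITY = c(10, 20, 30),
  EXTRA     = c("a", "b", "c")
)
addr_before <- address(input)

# Drop every column that is not in the keep list, without copying the table.
keep_cols <- c("PROTEIN", "FEATURE", "INTENSITY")
drop_cols <- setdiff(colnames(input), keep_cols)
for (col in drop_cols) set(input, j = col, value = NULL)

# Reorder rows by reference instead of input = input[order(...), ].
setorder(input, PROTEIN, FEATURE)

# The table is still the same object we started with.
identical(address(input), addr_before)
```

By contrast, input[, keep_cols, with = FALSE] and input[order(...), ] each return a new table, so the old and new tables coexist in memory until the old one is garbage-collected.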

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests


Description

  • Replace full-table copies with in-place updates

  • Use keyed lookups for joins

  • Split survival outputs before finalization

  • Add memory-regression pipeline tests


Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d

File Walkthrough

Relevant files
Enhancement
dataProcess.R
Limit censored-value updates to matching rows                       

R/dataProcess.R

  • Replace ifelse() rewrites with targeted := updates
  • Only keep predicted on applicable censored rows
  • Only overwrite newABUNDANCE where imputation applies
+12/-10 
utils_checks.R
Avoid copies during input trimming and sorting                     

R/utils_checks.R

  • Drop unwanted columns via data.table::set(..., NULL)
  • Avoid retained-column deep copies during preparation
  • Sort rows in place with data.table::setorder()
+13/-5   
utils_feature_selection.R
Collapse feature preprocessing into one pass                         

R/utils_feature_selection.R

  • Build the feature-selection working table once
  • Handle optional censored values inline
  • Compute is_obs without intermediate tables
+20/-22 
utils_normalize.R
Use in-place cleanup and keyed fraction merges                     

R/utils_normalize.R

  • Remove normalization temp columns in place
  • Return the modified input after cleanup
  • Replace merge() with keyed newRun assignment
  • Filter fractions by direct (FEATURE, FRACTION) lookup
+23/-13 
Bug fix
utils_output.R
Streamline summary output and imputation joins                     

R/utils_output.R

  • Extract combined predicted_survival before finalization
  • Free nested survival entries before binding summaries
  • Replace merge-based imputation with keyed writes
  • Guard join columns and update flags in place
+51/-27 
Tests
test_memory_optimization_copies.R
Add memory regression tests for copy avoidance                     

inst/tinytest/test_memory_optimization_copies.R

  • Add address-based assertions for in-place behavior
  • Test .normalizeMedian temp-column cleanup
  • Test .finalizeTMP keyed matches and unmatched NAs
  • Verify MSstatsSummarizationOutput list splitting behavior
+469/-0 

Motivation and Context — Short summary of the solution

The dataProcess pipeline relied on copy-heavy data.table patterns (column subsetting that materializes new tables, merge-based joins, order-based rewrites, and whole-column ifelse assignments), which increased memory churn. The first version of this optimization also introduced two regressions in the “predicted survival” imputation path.
This PR replaces those hotspots with in-place data.table operations (set(), setorder(), keyed which lookups, and targeted := updates), restructures summarization to extract and pass predicted_survival explicitly, and fixes the two regressions (join-key intersection with predicted_survival and retention of LABEL). It also adds memory/copy-avoidance tinytests to guard the behavior.
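
As a rough illustration of the "whole-column ifelse versus targeted :=" point, the sketch below compares the two styles on a toy table. The fitted column stands in for model predictions and is an assumption made for the example; the real summarizers compute these values differently.

```r
library(data.table)

dt <- data.table(
  ABUNDANCE = c(10, NA, 12, NA),
  censored  = c(FALSE, TRUE, FALSE, TRUE),
  fitted    = c(9.5, 8.0, 11.5, 7.0)  # hypothetical model predictions
)

# Copy-heavy style: ifelse() builds a full-length vector for each column.
old <- copy(dt)
old[, predicted := ifelse(censored, fitted, NA_real_)]
old[, newABUNDANCE := ifelse(censored, fitted, ABUNDANCE)]

# Targeted style: initialise once, then write only the censored rows.
new <- copy(dt)
new[, predicted := NA_real_]
new[, newABUNDANCE := ABUNDANCE]
new[censored == TRUE, `:=`(predicted = fitted, newABUNDANCE = fitted)]

# Both styles should produce the same table.
all.equal(old, new)
```

The targeted form only touches the rows selected by i, so the amount of temporary memory scales with the number of censored rows rather than with the full column length.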

Detailed changes

  • General / pipeline-level

    • Switch copy-heavy patterns to in-place data.table updates (data.table::set, targeted [i, j := v], and data.table::setorder()), including replacing merge-based filtering/temporary-column workflows.
    • Summarization refactor: MSstatsSummarizationOutput now extracts predicted_survival from the nested per-protein summarized output early, clears the nested survival slots, finalizes using predicted_survival, then rebuilds protein-level summaries only.
    • Finalize refactor: .finalizeInput / .finalizeTMP now take predicted_survival directly (instead of a nested/summarized structure).
  • R/dataProcess.R

    • In MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP, revise censored-value post-imputation handling:
      • Replace ifelse-based assignments with data.table conditional subset updates using :=.
      • For labeled-reference cases, set predicted to NA for non-censored relevant rows and write fitted predicted values only into newABUNDANCE for the censored subset (mirrored when is_labeled_reference == FALSE).
      • Adjust the labeled-reference fit_data subset expression (explicit negation form) while keeping survival table construction consistent with the updated predicted/newABUNDANCE state.
  • R/utils_checks.R

    • In .checkUnProcessedDataValidity:
      • Normalize/ensure required columns (e.g., add AnomalyScores as NA, coerce INTENSITY to numeric) via data.table::set-style updates.
      • Remove extra/unallowed columns by setting them to NULL rather than deferring to later column subsetting.
    • In .prepareForDataProcess:
      • Compute derived columns (PEPTIDE, TRANSITION) and isotope label mapping via data.table::set-style updates.
    • In .makeFactorColumns:
      • Replace input[order(...), ] with in-place data.table::setorder() using the same multi-key ordering.
  • R/utils_feature_selection.R

    • In .selectHighQualityFeatures:
      • Simplify optional censored handling by computing log2inty from ABUNDANCE with censored/NA treated as missing.
      • Derive is_obs solely from whether log2inty is non-NA.
      • Remove intermediate renames/dropping logic (censored to is_censored) and associated intermediate computations.
  • R/utils_normalize.R

    • In .normalizeMedian / .normalizeGlobalStandards:
      • Remove helper columns in-place with data.table::set(..., value = NULL) instead of rebuilding tables via column-exclusion expressions.
    • MSstatsMergeFractions non-TECHREPLICATE path:
      • Rework merged-run construction and assignment using keyed dcast-derived lookup + which-based indexed updates (data.table::set).
      • Filter out unobserved feature/fraction combinations (zero abundance) via keyed join-index mapping rather than merge + temporary-column filtering.
      • Renumber RUN based on newRun and clear helper columns in-place (removing the prior merge(all.x=TRUE) + tmp flow).
  • R/utils_output.R

    • MSstatsSummarizationOutput:
      • Treat summarized == NULL as summarization failure (instead of relying on try-error inheritance).
      • Build predicted_survival as a combined data.table from summarized[[i]][[2]].
      • Clear nested survival entries (summarized[[i]][[2]] = NULL) before finalization.
      • Finalize via .finalizeInput(input, predicted_survival, ...), then rebuild summarized from protein-level parts only (summarized[[i]][[1]]).
      • Drop non-output columns from input in-place and return data.frame via data.table::setDF().
    • .finalizeInput / .finalizeTMP:
      • Change to accept predicted_survival directly.
      • In .finalizeTMP(impute=TRUE):
        • Replace merge-based prediction application with indexed lookup + targeted writes to newABUNDANCE and predicted.
  • Regressions fixed

    • .finalizeTMP join regression: join columns are now explicitly intersected between input and predicted_survival, and the intersection set explicitly includes LABEL, preventing missing-key errors and preserving downstream assumptions.
    • Survival-column selection regression: LABEL is retained in the selection/join-column intersection required by later logic/tests.
  • Documentation

    • Update .finalizeInput and .finalizeTMP help topics to reflect the new predicted_survival parameter and removal of the previously documented summarized argument.
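
The "indexed lookup + targeted writes" pattern described for .finalizeTMP can be sketched as follows. This is not the package's implementation: the tables are toy fixtures, and the candidate key list c("LABEL", "RUN", "FEATURE") is an assumption chosen to mirror the join-column intersection mentioned under "Regressions fixed".

```r
library(data.table)

input <- data.table(
  RUN          = c(1L, 1L, 2L, 2L),
  FEATURE      = c("f1", "f2", "f1", "f2"),
  newABUNDANCE = c(10, NA, 12, NA)
)
# Hypothetical lookup table of predictions; only one key matches.
predicted_survival <- data.table(RUN = 1L, FEATURE = "f2", predicted = 8.5)

# Join only on keys that exist in both tables (LABEL is absent here).
join_cols <- intersect(
  c("LABEL", "RUN", "FEATURE"),
  intersect(colnames(input), colnames(predicted_survival))
)

# For each input row: the matching row number in predicted_survival, or NA.
idx <- predicted_survival[input, on = join_cols, which = TRUE, mult = "first"]

# Write the looked-up values by reference instead of merge(all.x = TRUE).
set(input, j = "predicted", value = predicted_survival$predicted[idx])
hit <- which(!is.na(idx) & is.na(input$newABUNDANCE))
set(input, i = hit, j = "newABUNDANCE", value = input$predicted[hit])
```

Rows without a match keep NA in both predicted and newABUNDANCE, which is the behaviour a merge(all.x = TRUE) would also give, but without building a second full-width table.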

Unit tests added / modified

  • Added: inst/tinytest/test_memory_optimization_copies.R

    • Asserts in-place behavior (via data.table::address()) for .normalizeMedian and .finalizeTMP on small and scaled fixtures.
    • Verifies temporary helper columns (ABUNDANCE_RUN, ABUNDANCE_FRACTION, and other helpers like newRun) are removed while keeping required output columns.
    • Validates .finalizeTMP imputation semantics:
      • matched (cen, RUN, FEATURE) keys update newABUNDANCE from predicted_survival
      • unmatched keys yield NA
      • predicted column is added
    • Confirms the predicted_survival contract:
      • .finalizeTMP accepts a combined data.table (not nested list input)
      • MSstatsSummarizationOutput correctly decomposes a nested summarized_list into feature/protein outputs
    • Includes a size comparison showing the combined predicted_survival table is smaller than the full nested summarized_list.
  • Added: inst/tinytest/test_MSstatsMergeFractions.R

    • Characterizes MSstatsMergeFractions behavior:
      • single-fraction preservation and originalRUN handling
      • ABUNDANCE clamping before merging (including INTENSITY edge cases)
      • non-TECHREPLICATE merged-run collapse semantics, row dropping for unobserved feature/fraction combos, and factor level expectations
      • absence of helper newRun in outputs
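
An address-based assertion of the kind described above might look like the sketch below. drop_helpers_inplace is a hypothetical stand-in, not .normalizeMedian, and plain stopifnot() is used so the snippet runs without tinytest; in the real test file the same checks would presumably be written with tinytest expectations.

```r
library(data.table)

# Hypothetical stand-in for a cleanup step; not the package's .normalizeMedian.
drop_helpers_inplace <- function(dt, helpers) {
  for (col in intersect(helpers, colnames(dt))) set(dt, j = col, value = NULL)
  dt
}

fixture <- data.table(
  RUN           = 1:3,
  ABUNDANCE     = c(1, 2, 3),
  ABUNDANCE_RUN = c(2, 2, 2)
)
addr_before <- address(fixture)
result <- drop_helpers_inplace(fixture, c("ABUNDANCE_RUN", "ABUNDANCE_FRACTION"))

# In-place behaviour: same object, helper gone, required columns kept.
stopifnot(identical(address(result), addr_before))
stopifnot(!"ABUNDANCE_RUN" %in% colnames(result))
stopifnot(all(c("RUN", "ABUNDANCE") %in% colnames(result)))

# A copying variant returns a table at a different address.
copied <- fixture[, c("RUN", "ABUNDANCE"), with = FALSE]
stopifnot(!identical(address(copied), addr_before))
```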

Coding guidelines / potential violations

  • Caller-side mutation risk from in-place data.table writes
    • Multiple changes intentionally mutate data.tables by reference for memory reasons; this can break expectations if the same input object is reused elsewhere.
    • The review concern specifically targeted .selectHighQualityFeatures: the implementation now projects input into a reduced table (input = input[, list(...)]) before performing := updates, which confines mutations to the projected table rather than mutating the caller’s original input.
  • Internal signature/API contract drift
    • Internal functions changed from operating on nested summarized structures to accepting predicted_survival directly; this is documented in man pages and guarded by new tinytests targeting the contract.

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 14, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab7707c4-6327-4337-9631-ed8c2ad0ce3d

📥 Commits

Reviewing files that changed from the base of the PR and between 50a30da and 727d3ee.

📒 Files selected for processing (4)
  • R/utils_checks.R
  • R/utils_output.R
  • man/dot-finalizeInput.Rd
  • man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (2)
  • man/dot-finalizeTMP.Rd
  • man/dot-finalizeInput.Rd
🚧 Files skipped from review as they are similar to previous changes (2)
  • R/utils_checks.R
  • R/utils_output.R

📝 Walkthrough

Walkthrough

The PR refactors multiple MSstats data-processing paths to use in-place data.table updates, rewrites fraction merging and finalization around indexed lookups, and changes summarization output handling to pass predicted_survival directly. Tests and docs are updated to match the new behavior.

Changes

DataProcess Pipeline Memory Optimization and Output Refactoring

| Layer | File(s) | Summary |
| --- | --- | --- |
| Input normalization and feature selection | R/utils_checks.R, R/utils_feature_selection.R | Input columns are normalized, filtered, coerced, and ordered in place, and high-quality feature selection now derives observation status from ABUNDANCE with optional censored handling. |
| Normalization cleanup and fraction merge rewrite | R/utils_normalize.R, inst/tinytest/test_MSstatsMergeFractions.R, inst/tinytest/test_memory_optimization_copies.R | Temporary normalization columns are removed in place, fraction merging is rebuilt around join-index lookups and merged-run assignment, and tests cover merge behavior plus in-place normalization semantics. |
| Output assembly and predicted_survival plumbing | R/utils_output.R, man/dot-finalizeInput.Rd, man/dot-finalizeTMP.Rd, inst/tinytest/test_memory_optimization_copies.R | Summarization output now extracts predicted_survival, finalizers consume it directly, output tables are trimmed in place, and tests/docs reflect the updated contract. |
| Censored-value imputation post-processing | R/dataProcess.R | Linear and TMP summarizers replace column-wide ifelse(...) updates with conditional data.table subset assignments for censored-value post-processing and labeled-reference handling. |

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant SummarizationOutput as MSstatsSummarizationOutput
  participant FinalizeTMP as .finalizeTMP
  participant PredictedSurvival as predicted_survival
  participant InputTable as input
  SummarizationOutput->>PredictedSurvival: extract survival rows from summarized
  SummarizationOutput->>FinalizeTMP: pass predicted_survival into finalizer
  FinalizeTMP->>InputTable: update newABUNDANCE and predicted in place
  FinalizeTMP->>SummarizationOutput: return finalized feature/protein outputs

Possibly related PRs

  • Vitek-Lab/MSstats#174: Shares the same linear/TMP summarization and censored-imputation path in R/dataProcess.R.
  • Vitek-Lab/MSstats#192: Touches the labeled-reference and is_labeled_reference handling used by the TMP summarization flow.
  • Vitek-Lab/MSstats#193: Also changes label-aware censored post-processing around predicted and newABUNDANCE.

Poem

🐇 I hopped through tables, neat and bright,
No copied columns left in sight.
I plucked survival from the heap,
Then let the finalizers leap.
With one small hop, the data sang.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: replacing deep-copy data.table operations in the dataProcess pipeline with in-place updates. |
| Description check | ✅ Passed | The description covers motivation, detailed changes, testing, and includes the required checklist items, so it is mostly complete. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Input mutation

MSstatsPrepareForDataProcess now removes columns from input in place and then adds derived columns on the same object. If the caller passes a data.table they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy.

drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,
                      input$PRECURSORCHARGE, sep = "_")
input$TRANSITION = paste(input$FRAGMENTION, 

Output mutation

MSstatsSummarizationOutput now mutates its input argument by reference by dropping columns and converting it with setDF(). Because input is a data.table, the caller's object will be stripped down to output columns and lose its data.table class after this call, which can break any downstream code that reuses the same table.

drop_cols = setdiff(colnames(input), output_cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

if (is.element("remove", colnames(processed))) {
    processed = processed[(remove),
                          intersect(output_cols,
                                    colnames(processed)), with = FALSE]
    input = rbind(input, processed, fill = TRUE)
}
data.table::setDF(input)
data.table::setDF(rqall)
list(FeatureLevelData = input,

@github-actions


PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Avoid mutating caller objects

data.table::setDF() mutates by reference, so this function now converts the caller's
input object into a plain data.frame. That is a behavioral regression from
as.data.frame(...) and can break later code that still expects input to be a
data.table.

R/utils_output.R [101-105]

-data.table::setDF(input)
-data.table::setDF(rqall)
-list(FeatureLevelData = input,
-     ProteinLevelData = rqall,
+feature_output = data.table::copy(input)
+protein_output = data.table::copy(rqall)
+data.table::setDF(feature_output)
+data.table::setDF(protein_output)
+list(FeatureLevelData = feature_output,
+     ProteinLevelData = protein_output,
      SummaryMethod = method)
Suggestion importance[1-10]: 7

__

Why: This is correct and important for API behavior: data.table::setDF() mutates input and rqall by reference, unlike the previous as.data.frame(...) conversion. Copying before setDF() preserves the optimization while avoiding surprising side effects on caller-visible data.table objects.
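
The by-reference behaviour of data.table::setDF() discussed in this suggestion can be reproduced with a small sketch. The two functions below are hypothetical and much simpler than MSstatsSummarizationOutput; they only show what happens to the caller's object in each variant.

```r
library(data.table)

# Hypothetical variant that converts its argument in place.
summarize_inplace <- function(input) {
  setDF(input)
  list(FeatureLevelData = input)
}

# Hypothetical variant that converts a copy and leaves the argument alone.
summarize_copy <- function(input) {
  out <- copy(input)
  setDF(out)
  list(FeatureLevelData = out)
}

caller_a <- data.table(x = 1:3)
res_a <- summarize_inplace(caller_a)
is.data.table(caller_a)  # FALSE: the caller's object was converted

caller_b <- data.table(x = 1:3)
res_b <- summarize_copy(caller_b)
is.data.table(caller_b)  # TRUE: only the copy was converted
```

The copy() variant gives back the old caller-side semantics, at the cost of the full copy that the in-place conversion was meant to avoid, so which one is preferable depends on whether any caller reuses the table afterwards.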

Medium
Guard nested summary extraction

Validate summarized before indexing x[[2]]. The new code dereferences every element
immediately, so any try-error or short result now crashes here and bypasses the
function's existing failure handling.

R/utils_output.R [41-44]

-predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]),
-                                            fill = TRUE)
+invalid_summary = vapply(
+    summarized,
+    function(x) inherits(x, "try-error") || length(x) < 2L,
+    logical(1)
+)
+if (any(invalid_summary)) {
+    stop("`summarized` contains failed or incomplete protein summaries.")
+}
+predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L),
+                                           fill = TRUE)
 for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL
 input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol)
Suggestion importance[1-10]: 6

__

Why: This is a valid robustness concern: MSstatsSummarizationOutput() now dereferences x[[2]] before the later inherits(summarized, "try-error") check, so malformed entries in summarized can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.
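
A minimal base-R sketch of the validation step proposed here: it flags failed or incomplete entries before anything indexes x[[2]]. The summarized list is a toy fixture, and dropping the invalid entries is only one possible reaction; the suggestion above stops with an error instead.

```r
# Toy nested output: one valid (protein summary, survival table) pair
# and one failed protein represented by a try-error object.
summarized <- list(
  list(
    data.frame(Protein = "P1", LogIntensities = 10),
    data.frame(RUN = 1L, FEATURE = "f1", predicted = 9.5)
  ),
  try(stop("model failed"), silent = TRUE)
)

# Flag entries that cannot safely be indexed with [[2]].
invalid <- vapply(
  summarized,
  function(x) inherits(x, "try-error") || length(x) < 2L,
  logical(1)
)

# Only unpack the entries that passed the check.
valid <- summarized[!invalid]
predicted_survival <- do.call(rbind, lapply(valid, `[[`, 2L))
```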

Low

@mstaniak

Contributor

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?

@mstaniak mstaniak left a comment


Hi,
thanks again for this update. I have a few minor comments.

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated
Comment thread R/utils_feature_selection.R Outdated
Comment thread R/utils_normalize.R Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
inst/tinytest/test_memory_optimization_copies.R (1)

328-369: ⚡ Quick win

Add a mixed-LABEL fixture to this contract test.

These assertions only exercise LABEL = "L", so a regression that drops LABEL from the survival projection or join keys would still pass here. A small L/H fixture with duplicated (RUN, FEATURE) values would cover the regression this stack is guarding against.
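
To see why a single-label fixture cannot catch this regression, consider the following sketch with a hypothetical two-row L/H fixture that shares one (RUN, FEATURE) pair. It only demonstrates the join behaviour and does not call .finalizeTMP.

```r
library(data.table)

# Hypothetical mixed-label fixture: same RUN and FEATURE, different LABEL.
input <- data.table(
  LABEL   = c("L", "H"),
  RUN     = c(1L, 1L),
  FEATURE = c("f1", "f1")
)
pred <- data.table(
  LABEL     = c("L", "H"),
  RUN       = c(1L, 1L),
  FEATURE   = c("f1", "f1"),
  predicted = c(5, 50)
)

# Without LABEL in the key, both input rows resolve to the same prediction row.
idx_no_label <- pred[input, on = c("RUN", "FEATURE"),
                     which = TRUE, mult = "first"]

# With LABEL in the key, each input row finds its own prediction.
idx_label <- pred[input, on = c("LABEL", "RUN", "FEATURE"),
                  which = TRUE, mult = "first"]

pred$predicted[idx_no_label]
pred$predicted[idx_label]  # 5 50
```

With LABEL = "L" only, the two lookups would return identical results, which is why the existing assertions would still pass if LABEL were dropped from the join keys.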

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The
test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN,
FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys:
modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a
factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column
to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so
MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but
ensure the fixture includes those mixed-label cases to catch regressions that
drop LABEL from survival projection or join keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.

In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.

---

Nitpick comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 328-369: The test only uses LABEL = "L", so add mixed LABEL values
and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to
exercise join keys: modify finalize_input_4$LABEL to contain a small mixture
(e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels,
and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate
RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep
result_4 assertions but ensure the fixture includes those mixed-label cases to
catch regressions that drop LABEL from survival projection or join keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2610ce77-e553-465d-8347-26a794f1b249

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 0764b34.

📒 Files selected for processing (6)
  • R/dataProcess.R
  • R/utils_checks.R
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_memory_optimization_copies.R

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated
Comment thread R/utils_output.R Outdated
Comment thread R/utils_output.R Outdated
@Rudhik1904 Rudhik1904 requested review from mstaniak June 3, 2026 00:05
Comment thread R/utils_feature_selection.R Outdated
Comment thread R/utils_feature_selection.R Outdated
Comment thread R/utils_normalize.R Outdated
Comment thread R/utils_normalize.R
Comment thread R/utils_output.R Outdated
Comment thread R/utils_output.R Outdated
@mstaniak

mstaniak commented Jun 6, 2026

Contributor

Hi,
thanks for the updates, I'll review these changes on Monday

@Vitek-Lab Vitek-Lab deleted a comment from coderabbitai Bot Jun 8, 2026
Comment thread R/dataProcess.R
Comment thread R/dataProcess.R
Comment thread R/utils_checks.R Outdated
Comment thread R/utils_feature_selection.R Outdated
Comment thread R/utils_output.R Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb8c80d9-00e4-4549-b689-e83302de9aac

📥 Commits

Reviewing files that changed from the base of the PR and between 0764b34 and 27d5f7c.

📒 Files selected for processing (6)
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_MSstatsMergeFractions.R
  • man/dot-finalizeInput.Rd
  • man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (1)
  • man/dot-finalizeTMP.Rd
🚧 Files skipped from review as they are similar to previous changes (3)
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R

Comment thread man/dot-finalizeInput.Rd Outdated
@Rudhik1904

Rudhik1904 commented Jun 9, 2026


@mstaniak

For issue: #208 (comment)

I looked into this but think the current input[, list(...)] is actually the right call here, because of the following:

  1. In-place := on input would corrupt the caller. .selectHighQualityFeatures is called from MSstatsSelectFeatures, which reuses the same input immediately afterwards:
features_quality = .selectHighQualityFeatures(input, min_feature_count)
input = merge(input, features_quality,
              by.x = c("LABEL", "PROTEIN", "FEATURE", "originalRUN"),
              by.y = c("label", "protein", "feature", "run"))

Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.

  2. It's already a minimal projection, not a full-table copy. list(...) materializes only the 7 columns the feature-selection step needs, so it is not the kind of whole-table duplication this PR is targeting.

Happy to reevaluate if I missed something.
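
The caller-side effect described in point 1 can be reproduced with a small sketch. Both functions are hypothetical simplifications of the two options being discussed, not the actual .selectHighQualityFeatures code.

```r
library(data.table)

# Option A (hypothetical): rename by reference inside the function.
rename_inplace <- function(input) {
  setnames(input, c("PROTEIN", "FEATURE"), c("protein", "feature"))
  input
}

# Option B (hypothetical): project into a new table, then modify that table.
project_then_modify <- function(input) {
  out <- input[, list(protein = PROTEIN, feature = FEATURE,
                      log2inty = ABUNDANCE)]
  out[, is_obs := !is.na(log2inty)]
  out
}

caller <- data.table(PROTEIN = "P1", FEATURE = c("f1", "f2"),
                     ABUNDANCE = c(10, NA))
res <- project_then_modify(caller)
colnames(caller)   # unchanged: PROTEIN, FEATURE, ABUNDANCE

caller2 <- copy(caller)
invisible(rename_inplace(caller2))
colnames(caller2)  # renamed in the caller: protein, feature, ABUNDANCE
```

After option A, a later merge(..., by.x = c("PROTEIN", "FEATURE")) on the caller's table would fail because those columns no longer exist under those names.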

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
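
The type difference behind this comment is easy to check in isolation. The column names below are hypothetical; the point is only that a bare NA creates a logical column while NA_real_ creates a numeric one.

```r
library(data.table)

dt <- data.table(INTENSITY = c(1.5, 2.5))

# Plain NA is logical, so the new column is logical.
set(dt, j = "score_logical", value = NA)

# NA_real_ is a double, so the new column is numeric.
set(dt, j = "score_numeric", value = NA_real_)

vapply(dt, class, character(1))
```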

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9ac4369-d3ff-416f-8dc1-4fdc7f912e4e

📥 Commits

Reviewing files that changed from the base of the PR and between 27d5f7c and 50a30da.

📒 Files selected for processing (3)
  • R/dataProcess.R
  • R/utils_checks.R
  • inst/tinytest/test_memory_optimization_copies.R
🚧 Files skipped from review as they are similar to previous changes (2)
  • R/dataProcess.R
  • inst/tinytest/test_memory_optimization_copies.R

Comment thread R/utils_checks.R Outdated
log2inty = ABUNDANCE,
is_censored = if (has_censored) censored else FALSE)]
# Censored or missing intensities are not observations.
input[is_censored | is.na(log2inty), log2inty := NA]

Contributor Author


Should this be is_censored & is.na(log2inty)? Because the negation of !(is.na(ABUNDANCE) | is_censored) switches the OR to an AND


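The old code was log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored), ABUNDANCE, NA):
log2inty keeps the actual ABUNDANCE only for measurements that are both present and not censored, and is NA otherwise.

The new code is:
input[is_censored | is.na(log2inty), log2inty := NA]
Any measurement that is censored or missing is not a real observation, so its intensity is blanked out.

Both versions do the same thing. The old condition selected the rows to keep and the new one selects the rows to blank, so the new condition is the old one with the outer negation removed; the OR stays an OR and no De Morgan flip applies.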
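
A quick way to check this equivalence on a toy table is sketched below. It assumes is_censored contains no NA values; the & variant raised in the question is included to show how it would differ.

```r
library(data.table)

dt <- data.table(
  ABUNDANCE   = c(10, NA, 12, NA),
  is_censored = c(FALSE, FALSE, TRUE, TRUE)
)

# Old formulation: keep ABUNDANCE where the value is present and not censored.
old <- ifelse(!(is.na(dt$ABUNDANCE) | dt$is_censored), dt$ABUNDANCE, NA)

# New formulation: copy ABUNDANCE, then blank censored or missing rows.
new <- copy(dt)
new[, log2inty := ABUNDANCE]
new[is_censored | is.na(log2inty), log2inty := NA]

# Variant with & from the question: only rows that are both censored
# and missing are blanked, so the censored row with a value keeps it.
with_and <- copy(dt)
with_and[, log2inty := ABUNDANCE]
with_and[is_censored & is.na(log2inty), log2inty := NA]

all.equal(old, new$log2inty)
with_and$log2inty
```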

3 participants