Removed deep-copy data.table ops from the dataProcess pipeline #208
tonywu1999 wants to merge 8 commits into
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
* Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes, which avoids deep-copying the whole table.
* Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
* Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
* Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
* Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
* Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL, because a downstream test in test_dataProcess.R relies on LABEL being kept.
* Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4: 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Co-Authored-By: Claude <noreply@anthropic.com>
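As a rough illustration of the first two bullets, the sketch below drops columns with `data.table::set(..., value = NULL)` and reorders rows with `data.table::setorder()`, then compares `data.table::address()` before and after. The table contents and the `keep_cols` vector are made up for the example and are not taken from the package.

```r
library(data.table)

# Toy input; column names are illustrative only.
input <- data.table(
  PROTEIN   = c("P2", "P1", "P1"),
  RUN       = c(1L, 2L, 1L),
  INTENSITY = c(3, 2, 1),
  Unused1   = 1:3,
  Unused2   = letters[1:3]
)
keep_cols <- c("PROTEIN", "RUN", "INTENSITY")  # hypothetical set of columns to keep

addr_before <- address(input)

# Drop every other column by reference instead of input[, keep_cols, with = FALSE].
for (col in setdiff(names(input), keep_cols)) {
  set(input, j = col, value = NULL)
}

# Reorder rows by reference instead of input[order(PROTEIN, RUN), ].
setorder(input, PROTEIN, RUN)

addr_after <- address(input)
```

If the operations really are in place, `addr_before` and `addr_after` refer to the same object, which is the kind of check an address-based test can make.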
📝 Walkthrough
The PR refactors multiple MSstats data-processing paths to use in-place `data.table` operations.
Changes: DataProcess Pipeline Memory Optimization and Output Refactoring
Estimated code review effort: 4 (Complex) | ~60 minutes
Sequence Diagram(s)
sequenceDiagram
participant SummarizationOutput as MSstatsSummarizationOutput
participant FinalizeTMP as .finalizeTMP
participant PredictedSurvival as predicted_survival
participant InputTable as input
SummarizationOutput->>PredictedSurvival: extract survival rows from summarized
SummarizationOutput->>FinalizeTMP: pass predicted_survival into finalizer
FinalizeTMP->>InputTable: update newABUNDANCE and predicted in place
FinalizeTMP->>SummarizationOutput: return finalized feature/protein outputs
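The first two arrows of the diagram (extract the survival rows, then pass them on) might look roughly like the sketch below. The nested `summarized` list here is a hypothetical stand-in with slot 1 holding a protein-level summary and slot 2 holding survival predictions; the real objects in MSstats may carry different columns.

```r
library(data.table)

# Hypothetical nested per-protein output:
# slot 1 = protein-level summary, slot 2 = survival predictions.
summarized <- list(
  list(data.table(Protein = "P1", LogIntensities = 20.1),
       data.table(RUN = 1:2, FEATURE = "f1", predicted = c(18.2, 18.9))),
  list(data.table(Protein = "P2", LogIntensities = 21.4),
       data.table(RUN = 1L, FEATURE = "f2", predicted = 19.5))
)

# Combine every second slot once, up front.
predicted_survival <- rbindlist(lapply(summarized, function(x) x[[2]]))

# Remove the nested copies so they can be freed before finalization.
for (i in seq_along(summarized)) {
  summarized[[i]][[2]] <- NULL
}

# Only the protein-level parts remain to be combined afterwards.
protein_summaries <- rbindlist(lapply(summarized, function(x) x[[1]]))
```

Assigning `NULL` to a list element removes it, so each inner list shrinks to its protein-level slot and the survival rows live only in the combined table.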
Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?
mstaniak left a comment
Hi,
thanks again for this update. I have a few minor comments.
Actionable comments posted: 3
🧹 Nitpick comments (1)
inst/tinytest/test_memory_optimization_copies.R (1)
328-369: ⚡ Quick win: Add a mixed-`LABEL` fixture to this contract test.
These assertions only exercise `LABEL = "L"`, so a regression that drops `LABEL` from the survival projection or join keys would still pass here. A small `L`/`H` fixture with duplicated `(RUN, FEATURE)` values would cover the regression this stack is guarding against.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.
In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.
---
Nitpick comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 328-369: The test only uses LABEL = "L", so add mixed LABEL values
and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to
exercise join keys: modify finalize_input_4$LABEL to contain a small mixture
(e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels,
and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate
RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep
result_4 assertions but ensure the fixture includes those mixed-label cases to
catch regressions that drop LABEL from survival projection or join keys.
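The `setDF()` concern raised above for `R/utils_output.R` can be reproduced in isolation. The sketch below uses two hypothetical helper functions (not package code) to contrast converting by reference with converting a copy.

```r
library(data.table)

# Hypothetical helpers, for illustration only.
convert_in_place <- function(x) {
  setDF(x)          # by reference: the caller's object stops being a data.table
  x
}

convert_copy <- function(x) {
  as.data.frame(x)  # returns a new data.frame; the caller's object is untouched
}

caller_a <- data.table(RUN = 1:2)
out_a <- convert_in_place(caller_a)
caller_a_is_dt <- is.data.table(caller_a)

caller_b <- data.table(RUN = 1:2)
out_b <- convert_copy(caller_b)
caller_b_is_dt <- is.data.table(caller_b)
```

Whether the by-reference behaviour is acceptable depends on whether callers reuse the object afterwards; the trade-off is one extra copy against a possible surprise for the caller.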
📒 Files selected for processing (6)
- R/dataProcess.R
- R/utils_checks.R
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
- inst/tinytest/test_memory_optimization_copies.R
Hi, |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.
📒 Files selected for processing (6)
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
- inst/tinytest/test_MSstatsMergeFractions.R
- man/dot-finalizeInput.Rd
- man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (1)
- man/dot-finalizeTMP.Rd
🚧 Files skipped from review as they are similar to previous changes (3)
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
Regarding #208 (comment): I looked into this, but I think the current `input[, list(...)]` is actually the right call here, for the following reason:
Since data.table `:=`/`setnames` modify by reference across function boundaries, renaming the columns to lowercase (or adding `log2inty`/`is_obs`) inside the function would mutate the caller's table and break that merge (its `by.x` keys would no longer exist). The `input[, list(...)]` form avoids this by returning a new table and only rebinding the local name.
Happy to reevaluate my decision if I missed something.
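To make the by-reference argument concrete, here is a small sketch with two hypothetical functions (neither is package code): one renames a column with `setnames()` inside the function, the other projects with `input[, list(...)]` first and only then uses `:=`.

```r
library(data.table)

# Hypothetical helpers, for illustration only.
rename_in_place <- function(input) {
  # setnames() works by reference: the caller's table is renamed too.
  setnames(input, "ABUNDANCE", "log2inty")
  invisible(input)
}

project_then_modify <- function(input) {
  # input[, list(...)] builds a new table; only the local name is rebound.
  input <- input[, list(feature = FEATURE, log2inty = ABUNDANCE)]
  input[, is_obs := !is.na(log2inty)]  # := now touches the projection only
  input
}

caller_table <- data.table(FEATURE = c("f1", "f2"), ABUNDANCE = c(10.5, NA))

projected <- project_then_modify(caller_table)
names_after_projection <- names(caller_table)

rename_in_place(caller_table)
names_after_rename <- names(caller_table)
```

After the projecting function runs, the caller's column names are unchanged, so a later merge keyed on the original names would still find them; after the renaming function runs, `ABUNDANCE` no longer exists in the caller's table.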
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
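The type difference behind this comment is easy to check on a toy table. The column names below are invented for the example; only the `data.table::set()` behaviour is the point.

```r
library(data.table)

dt <- data.table(INTENSITY = c("1.5", "2.5"))

set(dt, j = "ScoreLogical", value = NA)        # plain NA creates a logical column
set(dt, j = "ScoreNumeric", value = NA_real_)  # typed NA creates a double column
set(dt, j = "INTENSITY", value = as.numeric(dt$INTENSITY))  # coerce in place

column_types <- vapply(dt, typeof, character(1))
```

Starting with `NA_real_` keeps the column numeric from the outset, so later code that expects a numeric column does not have to rely on implicit coercion.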
📒 Files selected for processing (3)
- R/dataProcess.R
- R/utils_checks.R
- inst/tinytest/test_memory_optimization_copies.R
🚧 Files skipped from review as they are similar to previous changes (2)
- R/dataProcess.R
- inst/tinytest/test_memory_optimization_copies.R
    log2inty = ABUNDANCE,
    is_censored = if (has_censored) censored else FALSE)]
    # Censored or missing intensities are not observations.
    input[is_censored | is.na(log2inty), log2inty := NA]
Should this be `is_censored & is.na(log2inty)`? The negation of `!(is.na(ABUNDANCE) | is_censored)` switches the OR to an AND.
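The old code was `log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored), ABUNDANCE, NA)`: `log2inty` gets the actual `ABUNDANCE` only for measurements that are both present and not censored.
The new code is `input[is_censored | is.na(log2inty), log2inty := NA]`: any measurement that is censored or missing is not a real observation, so its intensity is blanked out.
So it is doing the same thing. The old condition selects the rows to keep, while the new condition selects the rows to blank. `is_censored | is.na(log2inty)` is exactly the negation of the old keep-condition, so the OR stays an OR.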
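A quick way to convince oneself of the equivalence is to run both forms on a toy table that covers all the relevant cases (observed, missing, censored). The data below are made up for the check.

```r
library(data.table)

input <- data.table(
  ABUNDANCE   = c(10, NA, 12, 13),
  is_censored = c(FALSE, FALSE, TRUE, FALSE)
)

# Old form: keep ABUNDANCE only when it is present and not censored.
old_log2inty <- ifelse(!(is.na(input$ABUNDANCE) | input$is_censored),
                       input$ABUNDANCE, NA)

# New form: start from ABUNDANCE, then blank the complementary rows in place.
input[, log2inty := ABUNDANCE]
input[is_censored | is.na(log2inty), log2inty := NA_real_]

same_result <- identical(old_log2inty, input$log2inty)

# The variant with & would only blank rows that are both censored and missing,
# leaving the censored-but-present value in row 3 untouched.
and_rows <- which(input$is_censored & is.na(input$ABUNDANCE))
```

On this fixture the `&` variant selects no rows at all, which shows why it would not reproduce the old behaviour.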
User description
- Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
- Replaced `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
- Replaced the `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
PR Type
Enhancement, Bug fix, Tests
Description
Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests
Diagram Walkthrough
File Walkthrough
dataProcess.R
Limit censored-value updates to matching rows (R/dataProcess.R)
- `ifelse()` rewrites with targeted `:=` updates
- `predicted` on applicable censored rows
- `newABUNDANCE` where imputation applies

utils_checks.R
Avoid copies during input trimming and sorting (R/utils_checks.R)
- `data.table::set(..., NULL)`
- `data.table::setorder()`

utils_feature_selection.R
Collapse feature preprocessing into one pass (R/utils_feature_selection.R)
- `censored` values inline
- `is_obs` without intermediate tables

utils_normalize.R
Use in-place cleanup and keyed fraction merges (R/utils_normalize.R)
- `merge()` with keyed `newRun` assignment
- `(FEATURE, FRACTION)` lookup

utils_output.R
Streamline summary output and imputation joins (R/utils_output.R)
- `predicted_survival` before finalization

test_memory_optimization_copies.R
Add memory regression tests for copy avoidance (inst/tinytest/test_memory_optimization_copies.R)
- `.normalizeMedian` temp-column cleanup
- `.finalizeTMP` keyed matches and unmatched `NA`s
- `MSstatsSummarizationOutput` list splitting behavior

Motivation and Context: Short summary of the solution
The dataProcess pipeline relied on copy-heavy `data.table` patterns (column subsetting that materializes new tables, merge-based joins, order-based rewrites, and whole-column `ifelse` assignments), which increased memory churn.

This PR replaces those hotspots with in-place `data.table` operations (`set()`, `setorder()`, keyed `which` lookups, and targeted `:=` updates) and restructures summarization to extract and pass `predicted_survival` explicitly. It also fixes two regressions in the "predicted survival" imputation path that the original commit of this PR introduced (join-key intersection with `predicted_survival` and retention of `LABEL`), and adds memory/copy-avoidance tinytests to guard the behavior.

Detailed changes
General / pipeline-level
- `data.table` updates (`data.table::set`, targeted `[i, j := v]`, and `data.table::setorder()`), including replacing merge-based filtering/temporary-column workflows.
- `MSstatsSummarizationOutput` now extracts `predicted_survival` from the nested per-subplot `summarized` output early, clears the nested survival slots, finalizes using `predicted_survival`, then rebuilds protein-level summaries only.
- `.finalizeInput` / `.finalizeTMP` now take `predicted_survival` directly (instead of a nested/summarized structure).

R/dataProcess.R
- In `MSstatsSummarizeSingleLinear` and `MSstatsSummarizeSingleTMP`, revise censored-value post-imputation handling:
  - Replace `ifelse`-based assignments with `data.table` conditional subset updates using `:=`.
  - Set `predicted` to `NA` for non-censored relevant rows and write fitted `predicted` values only into `newABUNDANCE` for the censored subset (mirrored when `is_labeled_reference == FALSE`).
  - Change the `fit_data` subset expression (explicit negation form) while keeping survival table construction consistent with the updated `predicted` / `newABUNDANCE` state.

R/utils_checks.R
- `.checkUnProcessedDataValidity`:
  - Initialize `AnomalyScores` as `NA` and coerce `INTENSITY` to numeric via `data.table::set`-style updates.
  - Set unneeded columns to `NULL` rather than deferring to later column subsetting.
- `.prepareForDataProcess`:
  - `PEPTIDE`, `TRANSITION`, and isotope label mapping via `data.table::set`-style updates.
- `.makeFactorColumns`:
  - Replace `input[order(...), ]` with in-place `data.table::setorder()` using the same multi-key ordering.

R/utils_feature_selection.R
- `.selectHighQualityFeatures`:
  - Handle `censored` inline by computing `log2inty` from `ABUNDANCE` with censored/NA treated as missing.
  - Compute `is_obs` solely from whether `log2inty` is non-NA.
  - The `censored` → `is_censored` step and associated intermediate computations.

R/utils_normalize.R
- `.normalizeMedian` / `.normalizeGlobalStandards`:
  - Clean up temporary columns with `data.table::set(..., value = NULL)` instead of rebuilding tables via column-exclusion expressions.
- `MSstatsMergeFractions` non-`TECHREPLICATE` path:
  - `dcast`-derived lookup + `which`-based indexed updates (`data.table::set`).
  - Update `RUN` based on `newRun` and clear helper columns in place (removing the prior `merge(all.x=TRUE) + tmp` flow).

R/utils_output.R
- `MSstatsSummarizationOutput`:
  - Treat `summarized == NULL` as summarization failure (instead of relying on `try-error` inheritance).
  - Extract `predicted_survival` as a combined `data.table` from `summarized[[i]][[2]]`.
  - Clear the nested survival slots (`summarized[[i]][[2]] = NULL`) before finalization.
  - Call `.finalizeInput(input, predicted_survival, ...)`, then rebuild `summarized` from protein-level parts only (`summarized[[i]][[1]]`).
  - Convert `input` in place and return `data.frame` via `data.table::setDF()`.
- `.finalizeInput` / `.finalizeTMP`:
  - Take `predicted_survival` directly.
- `.finalizeTMP(impute=TRUE)`:
  - Update `newABUNDANCE` and `predicted`.

Regressions fixed
- `.finalizeTMP` join regression: join columns are now explicitly intersected between `input` and `predicted_survival`, and the intersection set explicitly includes `LABEL`, preventing missing-key errors and preserving downstream assumptions.
- `LABEL` is retained in the selection/join-column intersection required by later logic/tests.

Documentation
- Updated the `.finalizeInput` and `.finalizeTMP` help topics to reflect the new `predicted_survival` parameter and removal of the previously documented `summarized` argument.

Unit tests added / modified
Added:
inst/tinytest/test_memory_optimization_copies.R
- Address checks (`data.table::address()`) for `.normalizeMedian` and `.finalizeTMP` on small and scaled fixtures.
- Temporary columns (`ABUNDANCE_RUN`, `ABUNDANCE_FRACTION`, and other helpers like `newRun`) are removed while keeping required output columns.
- `.finalizeTMP` imputation semantics:
  - `(cen, RUN, FEATURE)` keys update `newABUNDANCE` from `predicted_survival`
  - unmatched rows: `NA`
  - `predicted` column is added
- `predicted_survival` contract:
  - `.finalizeTMP` accepts a combined `data.table` (not nested list input)
  - `MSstatsSummarizationOutput` correctly decomposes a nested `summarized_list` into feature/protein outputs
  - `predicted_survival` table is smaller than the full nested `summarized_list`.

Added:
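inst/tinytest/test_MSstatsMergeFractions.R
- `MSstatsMergeFractions` behavior:
  - `originalRUN` handling
  - `TECHREPLICATE` merged-run collapse semantics, row dropping for unobserved feature/fraction combos, and factor level expectations
  - `newRun` in outputs

Coding guidelines / potential violations
- `data.table` writes: `data.table`s are modified by reference for memory reasons; this can break expectations if the same `input` object is reused elsewhere.
- `.selectHighQualityFeatures`: the implementation now projects `input` into a reduced table (`input = input[, list(...)]`) before performing `:=` updates, which confines mutations to the projected table rather than mutating the caller's original input.
- `predicted_survival` is passed directly; this is documented in man pages and guarded by new tinytests targeting the contract.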
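
The join-column intersection described under "Regressions fixed" can be sketched as follows. This is not the package's `.finalizeTMP` code: the fixtures, the `candidate_cols` vector, and the use of an ad hoc `on =` join with `which = TRUE` in place of a keyed lookup are all assumptions made for the illustration.

```r
library(data.table)

input <- data.table(
  LABEL        = c("L", "L", "H"),
  RUN          = c(1L, 2L, 1L),
  FEATURE      = c("f1", "f1", "f1"),
  newABUNDANCE = c(NA, 15.0, NA)
)
# LABEL is deliberately absent here to show why the intersection matters.
predicted_survival <- data.table(RUN = 1L, FEATURE = "f1", predicted = 14.2)

candidate_cols <- c("LABEL", "RUN", "FEATURE")  # hypothetical key candidates
join_cols <- Reduce(intersect,
                    list(candidate_cols, names(input), names(predicted_survival)))

# For each input row, the row number of the matching prediction (NA if none).
match_idx <- predicted_survival[input, on = join_cols, which = TRUE, mult = "first"]

# Write only into rows that are both missing and matched; no merge(), no copy of input.
to_fill <- which(!is.na(match_idx) & is.na(input$newABUNDANCE))
set(input, i = to_fill, j = "newABUNDANCE",
    value = predicted_survival$predicted[match_idx[to_fill]])
```

Joining on a fixed `c("LABEL", "RUN", "FEATURE")` would fail here because `predicted_survival` has no `LABEL` column; intersecting first avoids that error. The flip side, visible in this fixture, is that without `LABEL` in the key both the `L` and `H` rows for the same run and feature receive the same prediction.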
data.tablewritesdata.tables by reference for memory reasons; this can break expectations if the sameinputobject is reused elsewhere..selectHighQualityFeatures: the implementation now projectsinputinto a reduced table (input = input[, list(...)]) before performing:=updates, which confines mutations to the projected table rather than mutating the caller’s original input.predicted_survivaldirectly; this is documented in man pages and guarded by new tinytests targeting the contract.