Untargeted Metabolomics Quad-Workflow Integrated Pipeline
非靶向代谢组学四工作流集成分析系统
MetaboFlow is an R-based end-to-end pipeline for untargeted LC-MS metabolomics, integrating feature extraction, normalization, differential analysis, metabolite annotation, and four parallel pathway enrichment workflows (3× ORA + 1× QEA) for cross-validation. All figures are rendered to Nature publication standards.
MetaboFlow 是一套基于R语言的非靶向LC-MS代谢组学全流程分析系统,集成了特征提取、归一化、差异分析、代谢物注释,以及四个并行通路富集工作流(3个ORA + 1个QEA)用于交叉验证。所有图表按Nature论文投稿标准渲染。
| Workflow | Type | Engine | Database | Strength |
|---|---|---|---|---|
| WF1: SMPDB Pathway Enrichment | ORA | tidymass enrich_hmdb |
SMPDB/HMDB | Seamless tidymass integration |
| WF2: MSEA | ORA | MetaboAnalystR | SMPDB metabolite sets | Hypergeometric test, curated sets |
| WF3: KEGG Pathway Enrichment | ORA | KEGGREST + Fisher | KEGG species-specific | Real-time KEGG data, hypergeometric test |
| WF4: QEA Quantitative Enrichment | QEA | globaltest | SMPDB pathway | Full concentration matrix, GlobalTest |
- macOS 12+ (Monterey or later)
- Ubuntu 20.04+ / Debian 11+ / CentOS 8+
- Windows 10/11
| Component | Requirement | Notes |
|---|---|---|
| R | ≥ 4.5.0 | tidymass requires R ≥ 4.5; recommended: R 4.5.3 |
| RStudio | ≥ 2024.04 | Recommended for interactive use |
| BiocManager | ≥ 1.30.22 | Auto-installed by MetaboFlow |
| Resource | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB+ (large datasets) |
| Disk | 5 GB (for R packages) | 10 GB+ |
| CPU | 4 cores | 8+ cores (parallel peak extraction) |
| Internet | Required for first run | Package download + KEGG API |
MetaboFlow auto-installs all dependencies on first run. The complete dependency list:
CRAN packages:
tidyverse,openxlsx,ggrepel,pheatmap,ggpubr,ggsci,patchworkremotes,Cairo,qs,survival
Bioconductor packages:
- Core:
Biobase,limma,KEGGREST,globaltest - XCMS pipeline:
xcms,MSnbase,BiocParallel - Additional:
impute,pcaMethods,preprocessCore,genefilter,sva,KEGGgraph,multtest,RBGL,Rgraphviz,edgeR,fgsea
GitHub/GitLab packages:
tidymass(GitLab: tidymass/tidymass) — WF1 + upstream processingMetaboAnalystR(GitHub: xia-lab/MetaboAnalystR) — WF2
Option 1: conda (recommended)
# Create environment with R 4.5.3
conda create -n metaboflow -c conda-forge r-base=4.5.3
conda activate metaboflow
# Launch R and run MetaboFlow — packages auto-install on first run
Rscript MetaboFlow_v1.rOption 2: System R Download and install R ≥ 4.5.0 from CRAN. Open RStudio, then run MetaboFlow_v1.r. All packages will be auto-installed.
Option 3: Docker (coming soon)
⚠️ Important: First run may take 20-40 minutes for package installation. Subsequent runs start immediately.
- One input, four outputs — single mzXML input triggers all four pathway analyses
- Auto-installation — all dependencies (CRAN, Bioconductor, tidymass, MetaboAnalystR) auto-detected and installed
- Graceful degradation — if tidymass or MetaboAnalystR fails to install, available workflows still run
- Nature-quality figures — NPG color palette, Arial font, dual PDF vector + TIFF 300 DPI output
- HMDB ID normalization — automatically converts old 7-digit to new 11-digit format
- Smart P-value selection — uses FDR-corrected
adj.P.Valwhen possible; falls back to rawP.Valuefor small sample sizes (n ≤ 3) - Bilingual comments — all code comments in both Chinese and English
- Non-specific pathway filter — auto-removes overly broad pathways (e.g., "Metabolic pathways", "ABC transporters"); outputs both full and filtered results
- Sub-figure mode —
SUBFIG_MODE=TRUEincreases font size for small paper panels (3.5×3 in);TOP_N_PATHWAYScontrols how many pathways to show
- Place all
.mzXMLfiles in one directory, following the naming convention below - Open
MetaboFlow_v1.rin RStudio - Modify the User Parameters section (line ~130):
WORK_DIR <- "path/to/your/data" POLARITY <- "positive" # or "negative" ORGANISM <- "dre" # dre=zebrafish, hsa=human, mmu=mouse LOGFC_CUTOFF <- 0.176 # 1.5-fold change ALPHA <- 0.05 # FDR threshold CONTROL_GROUP <- "control" # must match filename prefix MODEL_GROUP <- "DrugA" # must match filename prefix DB_DIR <- "path/to/your/inhouse_database"
- Run the entire script (
Ctrl+Alt+Rin RStudio, orRscript MetaboFlow_v1.r)
[GroupName][Number].mzXML
Examples:
control1.mzXML,control2.mzXML,control3.mzXMLTreatmentA1.mzXML,TreatmentA2.mzXML,TreatmentA3.mzXML
⚠️ Group names must NOT contain digits. Do not useDose10mg1.mzXML.
Result/
├── PCA_scores.pdf/tiff # PCA scores plot
├── 所有代谢物.csv # All annotations
├── 差异代谢峰/ # Differential features
│ ├── *差异峰.csv # Feature lists
│ └── *差异代谢峰.pdf/tiff # Volcano plots
├── 差异代谢物/ # Pathway analysis
│ ├── smpdb_*.xlsx/pdf/tiff # WF1: SMPDB enrichment (ORA)
│ ├── msea_*.xlsx/pdf/tiff # WF2: MSEA enrichment (ORA)
│ ├── kegg_*.xlsx # WF3: KEGG enrichment (ORA)
│ ├── kegg_metabolome_view_*.pdf # WF3: Enrichment bubble plot
│ ├── qea_*.xlsx/pdf/tiff # WF4: Quantitative enrichment (QEA)
│ ├── *_filtered.xlsx # Filtered results (non-specific pathways removed)
│ ├── heatmap_*.pdf/tiff # Clustered heatmaps
│ └── summary_*.xlsx # Parameter summary
└── Boxplot/ # Per-metabolite boxplots
| Parameter | Default | Description |
|---|---|---|
LOGFC_CUTOFF |
0.176 | log₁₀ FC threshold (0.176 = 1.5×, 0.301 = 2×) |
ALPHA |
0.05 | FDR-adjusted p-value threshold |
ORGANISM |
"dre" | KEGG organism code (full list) |
MS1_PPM |
15 | MS1 mass tolerance (ppm) |
PEAK_WIDTH |
c(5,30) | Chromatographic peak width range (sec) |
SN_THRESH |
5 | Signal-to-noise ratio threshold |
NOISE_LEVEL |
500 | Instrument noise level |
MIN_FRACTION |
0.5 | Minimum fraction of samples a peak must appear in |
NORM_METHOD |
"median" | Normalization method ("median", "mean", "sum", "pqn") |
CONTROL_GROUP |
"control" | Control group name (must match filename prefix) |
MODEL_GROUP |
"MethiocarbA" | Model/treatment group name (must match filename prefix) |
TOP_N_PATHWAYS |
0 | Max pathways in enrichment plots (0 = all significant) |
PATHWAY_FIG_W |
7 | Pathway figure width (inches) |
PATHWAY_FIG_H |
5 | Pathway figure height (inches) |
SUBFIG_MODE |
FALSE | Small sub-figure mode: larger text + compact layout for paper panels |
FILTER_NONSPECIFIC |
TRUE | Auto-filter non-specific pathways (e.g., "Metabolic pathways"). Set FALSE to disable |
All figures follow Nature submission guidelines:
- Font: Arial, 7–9 pt body text
- Format: Dual output — PDF (vector) + TIFF (300 DPI, LZW compression)
- Colors: NPG palette (
#E64B35red,#3C5488blue,#4DBBD5cyan,#00A087green) - Dimensions: Single-column 5×5 in, double-column 7×5 in
ERROR: this R is version X.X.X, package 'masstools' requires R >= 4.5
Solution: Upgrade R to ≥ 4.5.0. With conda: conda install -c conda-forge r-base=4.5.3
Error: promise already under evaluation: recursive default argument reference
Solution: Already fixed in MetaboFlow v1.0. The script passes default.dpi = 72 explicitly.
If WF3 hangs during KEGG pathway download, check your internet connection. MetaboFlow sets a 120-second timeout automatically.
This is common with small sample sizes (n ≤ 3 per group). MetaboFlow automatically detects this and falls back to raw P.Value for volcano plots and differential screening.
If you use MetaboFlow in your research, please cite:
- tidymass: Shen X, et al. TidyMass an object-oriented reproducible analysis framework for LC-MS data. Nature Communications, 2022.
- MetaboAnalystR: Pang Z, et al. MetaboAnalystR 3.0. Metabolites, 2020.
- limma: Ritchie ME, et al. limma powers differential expression analyses. Nucleic Acids Research, 2015.
- globaltest: Goeman JJ, et al. A global test for groups of genes. Bioinformatics, 2004.
- KEGGREST: Tenenbaum D, Maintainer B. KEGGREST: Client-side REST access to KEGG. Bioconductor.
The example_results/ directory contains a complete set of analysis outputs from a Methiocarb B zebrafish toxicology experiment:
example_results/
├── 01_differential_peaks/ # Volcano plots + differential peak CSV
├── 02_pathway_enrichment/ # Four workflow results
│ ├── WF1_SMPDB/ # SMPDB enrichment (ORA, tidymass)
│ ├── WF2_MSEA/ # MSEA enrichment (ORA, MetaboAnalystR)
│ ├── WF3_KEGG/ # KEGG enrichment (ORA, KEGGREST+Fisher)
│ │ ├── kegg_*.xlsx # All pathways
│ │ └── kegg_*_filtered.xlsx # Non-specific pathways removed
│ ├── WF4_QEA/ # Quantitative enrichment (GlobalTest)
│ └── subfig_examples/ # Small sub-figure versions (3.5×3 in)
├── 03_heatmap/ # Clustered heatmaps
├── 04_PCA/ # PCA scores plots
├── 05_boxplot/ # Representative metabolite boxplots (10 of 183)
└── 06_summary/ # Run parameter summary
MIT License