MTNBCPred is a transcriptomics-based machine learning framework developed for predicting lymph node metastasis in Triple Negative Breast Cancer (TNBC) patients using gene expression profiles.
The study focuses on identifying diagnostic biomarkers capable of distinguishing:
- Metastatic TNBC patients
- Non-metastatic TNBC patients
The system uses transcriptomic signatures together with machine learning techniques for prediction and prognostic assessment. DOI: https://doi.org/10.5281/zenodo.20196235
Triple Negative Breast Cancer (TNBC) is one of the most aggressive breast cancer subtypes and is associated with:
- High metastatic potential
- Increased recurrence
- Poor prognosis
- Lack of targeted therapies
Unlike other breast cancer types, TNBC lacks expression of:
- Estrogen Receptor (ER)
- Progesterone Receptor (PR)
- HER2 receptor
Because of this, early prediction of lymph node metastasis becomes critically important for treatment planning and survival improvement.
The major objectives of this work were:
- Identify transcriptomic biomarkers for TNBC metastasis
- Develop machine learning models for lymph node metastasis prediction
- Evaluate biomarkers across multiple transcriptomic platforms
- Perform prognostic survival analysis
- Develop a publicly accessible prediction server
The study used transcriptomic data from TCGA consisting of:
| Dataset Type | Number of Samples |
|---|---|
| Metastatic TNBC | 53 |
| Non-metastatic TNBC | 104 |
| Total Samples | 157 |
Independent validation datasets were obtained from GEO, covering three microarray platforms:
- Affymetrix platform
- Illumina platform
- Agilent platform
The following preprocessing techniques were applied:
- CPM normalization using edgeR TMM method
- log2 CPM transformation using limma voom()
- Background correction
- Quantile normalization
- log2 transformation
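
The RNA-seq part of the preprocessing above can be illustrated with a simplified sketch. It computes plain counts per million followed by a log2 transform using NumPy. It does not reproduce edgeR's TMM scaling factors or the precision weights from limma's voom(); the function name `log2_cpm`, the offsets, and the toy count matrix are assumptions made for illustration only.

```python
import numpy as np

def log2_cpm(counts, prior_count=0.5):
    """Return log2 counts-per-million for a genes x samples count matrix.

    Simplified illustration: library sizes are raw column sums (no TMM
    scaling), and small offsets avoid taking log2 of zero.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total reads per sample (column)
    cpm = (counts + prior_count) / (lib_sizes + 1.0) * 1e6
    return np.log2(cpm)

# Toy matrix: 3 genes (rows) x 2 samples (columns); values are made up.
toy_counts = np.array([[10, 20],
                       [0, 5],
                       [990, 1975]])
toy_log_cpm = log2_cpm(toy_counts)
print(toy_log_cpm.round(2))
```

In a real pipeline the library sizes would first be adjusted by normalization factors, which is the step edgeR's TMM method provides.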
After preprocessing:
- 20,531 genes were initially analyzed
- Low variance genes were removed
- 17,297 genes remained for further analysis
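
A minimal sketch of a low-variance filter is shown below. The variance threshold used in the study is not stated here, so `min_variance=0.1`, the function name `filter_low_variance`, and the placeholder gene IDs are all hypothetical.

```python
import numpy as np

def filter_low_variance(expr, gene_ids, min_variance=0.1):
    """Keep genes whose variance across samples exceeds min_variance.

    expr is a genes x samples matrix; min_variance is an assumed cutoff.
    """
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=1)  # one variance per gene (row)
    keep = variances > min_variance
    kept_ids = [g for g, k in zip(gene_ids, keep) if k]
    return expr[keep], kept_ids

# Toy data with placeholder gene names.
toy_expr = np.array([[5.0, 5.0, 5.1, 5.0],   # nearly constant, removed
                     [2.0, 6.0, 3.0, 7.0],   # variable, kept
                     [1.0, 1.0, 1.0, 1.0]])  # constant, removed
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
filtered_expr, filtered_genes = filter_low_variance(toy_expr, toy_genes)
print(filtered_genes)
```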
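Differential expression analysis between metastatic and non-metastatic samples identified the following differentially expressed genes (DEGs):
| DEG Type | Number of Genes |
|---|---|
| Upregulated genes | 643 |
| Downregulated genes | 367 |
| Total DEGs | 1,010 |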
The top 15 genes were selected by ranking genes with logistic regression according to their prediction performance:
- DHRS7
- BAIAP3
- ZNRF2
- ETFDH
- HBG1
- RIOK2
- TCEAL4
- TCF21
- FRZB
- POU4F1
- COL24A1
- TRPA1
- IBSP
- VIL1
- PSAT1
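
One way such a ranking could be produced is sketched below with scikit-learn: each gene is scored by a one-gene logistic regression and the genes are sorted by score. The exact metric and validation scheme used for the ranking are not given here, so the use of 5-fold cross-validated AUC, the function name `rank_genes_by_auc`, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def rank_genes_by_auc(X, y, gene_names, n_splits=5, seed=0):
    """Rank genes by the cross-validated AUC of a one-gene logistic regression."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for j, gene in enumerate(gene_names):
        model = LogisticRegression(max_iter=1000)
        # X[:, [j]] keeps the 2-D shape scikit-learn expects for one feature.
        auc = cross_val_score(model, X[:, [j]], y, cv=cv, scoring="roc_auc").mean()
        scores.append((gene, float(auc)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Synthetic data: 100 samples x 4 placeholder genes; only "G1" carries signal.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 0] += 2.0 * y_demo
ranking = rank_genes_by_auc(X_demo, y_demo, ["G1", "G2", "G3", "G4"])
print(ranking[:2])
```

Taking the first 15 entries of such a ranking would give a candidate signature of the same size as the list above.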
The best-performing single gene biomarkers were:
| Gene | AUC |
|---|---|
| DHRS7 | 0.689 |
| ZNRF2 | 0.689 |
These genes showed balanced sensitivity and specificity for predicting lymph node-metastatic TNBC (LN-TNBC).
The following machine learning classifiers were implemented:
- Gaussian Naive Bayes (GNB)
- Logistic Regression (LR)
- Random Forest (RF)
- Decision Tree (DT)
- Support Vector Classifier (SVC)
- K-Nearest Neighbors (KNN)
- eXtreme Gradient Boosting (XGB)
The Gaussian Naive Bayes (GNB) classifier achieved the best performance using the 15-gene signature.
| Metric | Value |
|---|---|
| Sensitivity (%) | 75.00 |
| Specificity (%) | 79.17 |
| Accuracy (%) | 78.12 |
| AUC | 0.81 |
| MCC | 0.50 |
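
For reference, the metrics in this table can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not the study's confusion matrix. Values are returned as fractions, so multiply by 100 to compare with the percentages above.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts chosen only to exercise the formulas.
demo_metrics = classification_metrics(tp=30, fn=10, tn=45, fp=15)
print(demo_metrics)
```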
The dataset was split using:
- 80% training dataset
- 20% validation dataset
A 5-fold cross-validation strategy was applied for robust model evaluation.
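
The evaluation setup described above could look roughly like the following scikit-learn sketch. It assumes a stratified 80/20 split and that the 5-fold cross-validation is run on the training portion; neither detail is confirmed here. The data are random numbers shaped like a 157-sample, 15-gene matrix, and only two of the listed classifiers are shown to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: class sizes follow the TCGA table (104 vs 53),
# but the expression values are random and carry an artificial signal.
rng = np.random.default_rng(42)
y_all = np.array([0] * 104 + [1] * 53)
X_all = rng.normal(size=(157, 15)) + 0.8 * y_all[:, None]

# 80% training / 20% validation split (stratification is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# 5-fold cross-validation on the training portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {"GNB": GaussianNB(), "LR": LogisticRegression(max_iter=1000)}
cv_auc = {
    name: float(cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean())
    for name, model in models.items()
}

# Refit one model on all training data and score the held-out validation set.
final_model = GaussianNB().fit(X_train, y_train)
val_auc = roc_auc_score(y_val, final_model.predict_proba(X_val)[:, 1])
print(cv_auc, round(val_auc, 3))
```

Keeping the validation set out of the cross-validation loop means the final score is computed on samples the model never saw during model selection.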
The model was validated on multiple GEO microarray datasets.
However, performance decreased significantly across different platforms due to opposite gene regulation trends.
Some genes were upregulated in TCGA but downregulated in the GEO datasets.
This highlighted challenges in cross-platform transcriptomic validation.
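
A simple check for this kind of disagreement is to compare the sign of each gene's log fold change between two cohorts. The sketch below assumes log2-scale expression matrices (genes x samples) and boolean metastasis labels; the helper names `log_fold_change` and `discordant_genes`, the placeholder gene IDs, and the toy values are all hypothetical.

```python
import numpy as np

def log_fold_change(expr, is_metastatic):
    """Mean log2 expression in metastatic minus non-metastatic samples, per gene."""
    expr = np.asarray(expr, dtype=float)
    mask = np.asarray(is_metastatic, dtype=bool)
    return expr[:, mask].mean(axis=1) - expr[:, ~mask].mean(axis=1)

def discordant_genes(lfc_a, lfc_b, gene_ids):
    """Return genes whose fold-change sign differs between two cohorts."""
    return [g for g, a, b in zip(gene_ids, lfc_a, lfc_b) if a * b < 0]

# Toy cohorts: 2 placeholder genes x 4 samples (first two samples metastatic).
genes = ["GENE_X", "GENE_Y"]
cohort_a = np.array([[8.0, 8.2, 6.0, 6.1],   # GENE_X up in metastatic
                     [3.0, 3.1, 5.0, 5.2]])  # GENE_Y down in metastatic
cohort_b = np.array([[5.0, 5.1, 7.0, 7.2],   # GENE_X down in metastatic
                     [2.0, 2.2, 4.0, 4.1]])  # GENE_Y down in metastatic
labels = [True, True, False, False]

flipped = discordant_genes(
    log_fold_change(cohort_a, labels),
    log_fold_change(cohort_b, labels),
    genes,
)
print(flipped)
```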
The following genes were associated with poor survival:
| Gene | Hazard Ratio |
|---|---|
| ZNRF2 | 2.711 |
| FRZB | 2.395 |
| TCEAL4 | 2.254 |
The following genes were associated with better survival:
| Gene | Hazard Ratio |
|---|---|
| PSAT1 | 0.41 |
| TRPA1 | 0.329 |
| VIL1 | 0.217 |
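
As a reading aid, a hazard ratio (HR) above 1 indicates a higher hazard of the event in the group of interest and an HR below 1 a lower hazard. The small sketch below converts the reported ratios into percent changes in hazard. Which groups are being compared (for example high versus low expression) is not stated here, so that part is left open; the helper name `hazard_change_percent` is hypothetical.

```python
def hazard_change_percent(hazard_ratio):
    """Percent change in hazard implied by a hazard ratio: (HR - 1) * 100."""
    return (hazard_ratio - 1.0) * 100.0

# Hazard ratios copied from the two tables above.
reported_hr = {
    "ZNRF2": 2.711, "FRZB": 2.395, "TCEAL4": 2.254,
    "PSAT1": 0.41, "TRPA1": 0.329, "VIL1": 0.217,
}

for gene, hr in reported_hr.items():
    direction = "higher" if hr > 1 else "lower"
    change = abs(hazard_change_percent(hr))
    print(f"{gene}: HR={hr} -> {change:.1f}% {direction} hazard")
```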
Enrichment and interaction analysis showed involvement of selected genes in:
- Neurotransmitter transport regulation
- Cell differentiation
- Osteoblast signaling
- Interleukin-11 signaling
- Serine metabolism
- Vitamin B6 metabolism
The MTNBCPred webserver contains two components. The first allows users to:
- Upload expression profiles
- Predict metastatic status
- Identify LN-metastatic or non-metastatic TNBC
The second allows:
- Single gene analysis
- Biomarker investigation
- Expression-based interpretation
The study demonstrated that:
- Transcriptomic biomarkers can predict TNBC metastasis
- Clinical factors like tumor stage and lymph node status strongly affect prognosis
- Machine learning models can assist in early metastasis prediction
- Cross-platform variability remains a major challenge
Technologies and methods used:
- Transcriptomics
- Machine Learning
- Logistic Regression
- Gaussian Naive Bayes
- Random Forest
- Support Vector Classifier
- Python sklearn
- R Bioconductor
- Survival Analysis
- t-SNE Visualization
MTNBCPred provides a machine learning-based framework for predicting lymph node metastasis in Triple Negative Breast Cancer patients using transcriptomic signatures.
The study identified:
- A 15-gene diagnostic biomarker signature
- Prognostic survival markers
- Cross-platform validation challenges
The developed model achieved an AUC of 0.81 on TCGA data with the 15-gene signature and provides a useful resource for TNBC metastasis prediction research.
Email: raghava@iiitd.ac.in
Address:
Indraprastha Institute of Information Technology Delhi
This project is intended for academic and research purposes only.