This project aims to identify significant differentially expressed genes, transcription factors, prognostic biomarkers and hub genes through transcriptomic profiling of older (≥ 65 years) sarcoma patients.
Sarcoma is a rare type of cancer that is more frequently found among children (<18 years) and older adults (OA; age at diagnosis ≥ 65 years). However, these populations are less frequently included in clinical studies, and their survival rates are poorer than those of younger adults (YA; age at diagnosis 18-65 years). The tumor microenvironment in OA cancer patients also differs from that in YA patients. Hence, in this study we use the TCGA-SARC RNA-seq dataset to identify differentially regulated genes, transcription factors, prognostic biomarkers and hub genes. We further perform functional enrichment analysis, along with a literature survey of the identified genes, to understand the dysregulated pathways in OA.
We use the R programming language along with tools such as Cytoscape, STRING and ShinyGO, and the databases TCGA, DoRothEA and TRRUST, for the analysis.
Steps to set up the project locally:
- Clone the repository:
`git clone https://github.com/vidhya2205/Transcriptomic-Profiling-of-Old-Age-Sarcoma-Patients-using-TCGA-RNA-seq-data.git`
- Navigate to the Code directory:
`cd Transcriptomic-Profiling-of-Old-Age-Sarcoma-Patients-using-TCGA-RNA-seq-data/Code`
Code directory: this is the folder within the project that contains the code. Run all subsequent commands from within this directory to avoid issues with file paths or configurations.
- Install R and the R package dependencies:
R version 4.4.0 (2024-04-24 ucrt) was used.
- CRAN packages -
`install.packages(c("dplyr", "tidyr", "ggplot2", "gplots", "tidyverse", "reshape2", "svglite", "survminer", "survival", "forestplot"))`
- Bioconductor packages -
`if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")`
`BiocManager::install(c("TCGAbiolinks", "SummarizedExperiment", "EnhancedVolcano", "org.Hs.eg.db", "dorothea", "enrichR", "DESeq2"))`
- Additional tools used:
- Cytoscape - Download Cytoscape (Version 3.10.2 was used)
- ShinyGO - ShinyGO 0.81 web tool.
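Before opening the notebook, it can help to confirm that the packages listed above are actually installed. The following is a minimal sketch of such a check; `missing_packages` is a hypothetical helper written for this README and is not part of the repository.

```r
# Packages named in the installation steps above
cran_pkgs <- c("dplyr", "tidyr", "ggplot2", "gplots", "tidyverse", "reshape2",
               "svglite", "survminer", "survival", "forestplot")
bioc_pkgs <- c("TCGAbiolinks", "SummarizedExperiment", "EnhancedVolcano",
               "org.Hs.eg.db", "dorothea", "enrichR", "DESeq2")

# Hypothetical helper: returns the entries of `pkgs` that are not installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_cran <- missing_packages(cran_pkgs)
missing_bioc <- missing_packages(bioc_pkgs)

# Report what still needs to be installed, if anything
message("Missing CRAN packages: ",
        if (length(missing_cran)) paste(missing_cran, collapse = ", ") else "none")
message("Missing Bioconductor packages: ",
        if (length(missing_bioc)) paste(missing_bioc, collapse = ", ") else "none")
```

Any names reported as missing can be passed to `install.packages()` or `BiocManager::install()` as shown in the installation steps.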
This project is built around an R Notebook (`code.Rmd`) that contains multiple sections to perform different tasks. Follow the steps below to use it effectively:
- Open the `code.Rmd` file -
Use RStudio or any R-compatible IDE to open `code.Rmd`.
- Notebook structure -
The notebook is organized into the following sections:
- Load the libraries needed
Section 1: Preliminary Survival analysis, Data extraction and preprocessing -
Description:
This section obtains the clinical and RNA-seq data for SARC (sarcoma patients) from the TCGA database. The samples are stratified by age at diagnosis into OA (≥ 65 years) and YA (18-65 years). Survival analysis is then performed using Cox regression and the log-rank test. A bubble plot representing the subtypes included in the study is plotted. The RNA-seq data is preprocessed, and lowly expressed genes are filtered out using a quantile cutoff of 0.25.
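The age stratification and the quantile-based filtering can be illustrated on toy data in base R. This is only a sketch under stated assumptions: the patient IDs, ages and expression values are invented, patients aged exactly 65 are assumed to belong to OA, and the 0.25 cutoff is assumed to apply to each gene's mean expression across samples. The notebook itself may implement these steps differently.

```r
# Toy clinical table; patient IDs and ages are made up for illustration
clinical <- data.frame(
  patient = paste0("P", 1:8),
  age_at_diagnosis = c(17, 25, 44, 64, 65, 70, 82, 58)
)

# Assumption: a patient aged exactly 65 is assigned to OA, so YA covers [18, 65)
clinical$age_group <- cut(
  clinical$age_at_diagnosis,
  breaks = c(18, 65, Inf),
  labels = c("YA", "OA"),
  right = FALSE
)

# Patients under 18 fall outside both intervals and get NA; drop them
clinical <- clinical[!is.na(clinical$age_group), ]
print(table(clinical$age_group))

# Toy expression matrix: 20 genes x 7 samples, each gene's mean equals its index
expr <- matrix(
  rep(1:20, times = 7), nrow = 20,
  dimnames = list(paste0("gene", 1:20), clinical$patient)
)

# Assumption: the 0.25 cutoff is applied to the per-gene mean expression
gene_means <- rowMeans(expr)
threshold <- quantile(gene_means, probs = 0.25)

# Keep only genes whose mean expression exceeds the 25th percentile
expr_filtered <- expr[gene_means > threshold, , drop = FALSE]
print(dim(expr_filtered))
```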
Outputs:
This section produces 2 images -
Section 2: Differential Gene Expression analysis (DGEA) and Functional Enrichment analysis (FEA) -
Description:
DGEA comparing the OA and YA samples is performed using the edgeR methodology. Significant differentially regulated genes (sig-DEGs) are selected based on |logFC| > 1.5 and p-value < 0.005. A volcano plot and a heatmap representing the up- and down-regulated genes are generated. Further, FEA of the sig-DEGs is performed to obtain the top 5 significant GO terms associated with them.
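The selection thresholds can be sketched on a small results table. The gene names and values below are invented, and the column names `logFC` and `PValue` are assumptions chosen for illustration; the actual DGEA output in the notebook may use different names.

```r
# Toy DGEA results; genes and values are invented for illustration
dge <- data.frame(
  gene   = c("G1", "G2", "G3", "G4", "G5"),
  logFC  = c(2.3, -1.8, 0.4, 1.6, -2.5),
  PValue = c(0.0001, 0.002, 0.0003, 0.02, 0.004)
)

# Thresholds stated in the Section 2 description
lfc_cutoff <- 1.5
p_cutoff   <- 0.005

# Keep genes whose absolute log fold change and p-value both pass the cutoffs
sig_degs <- dge[abs(dge$logFC) > lfc_cutoff & dge$PValue < p_cutoff, ]

# Label the direction of regulation (OA relative to YA)
sig_degs$direction <- ifelse(sig_degs$logFC > 0, "up", "down")
print(sig_degs)
```

Here G3 is dropped for its small fold change and G4 for its p-value, which shows that both conditions must hold at once.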
Outputs:
This section produces 4 images and 2 CSV files -
Section 3: Transcription Factor Enrichment Analysis -
Description:
Using the DoRothEA and TRRUST transcription factor-target interaction databases, this section identifies significant transcription factors (sig-TFs), as illustrated in Flowchart_TFEA. The Cytoscape app and the STRING network database are then used to visualize the interactions of the sig-TFs. FEA of the sig-TFs is performed using the ShinyGO web-based tool, and the top 5 GO terms are visualized in R.
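The exact procedure is defined in Flowchart_TFEA and is not reproduced here. As a rough illustration of the general idea, the sketch below tests whether a transcription factor's targets are over-represented among a gene list using Fisher's exact test. This is a stand-in technique, not necessarily the one the notebook uses; the regulon table, gene universe, gene list and the helper `tf_enrichment` are all hypothetical.

```r
# Hypothetical TF-target table (real databases provide many more interactions)
regulons <- data.frame(
  tf     = c("TF_A", "TF_A", "TF_A", "TF_A", "TF_B", "TF_B", "TF_B"),
  target = c("G1", "G2", "G3", "G4", "G5", "G6", "G7")
)

# Hypothetical background of all tested genes and a list of significant genes
universe  <- paste0("G", 1:20)
sig_genes <- c("G1", "G2", "G3", "G4", "G8")

# Hypothetical helper: one-sided Fisher's exact test for target over-representation
tf_enrichment <- function(targets, sig_genes, universe) {
  targets <- intersect(targets, universe)
  sig_target    <- length(intersect(targets, sig_genes))
  sig_other     <- length(setdiff(sig_genes, targets))
  notsig_target <- length(setdiff(targets, sig_genes))
  notsig_other  <- length(universe) - sig_target - sig_other - notsig_target
  # Rows: target / non-target; columns: significant / not significant
  tab <- matrix(c(sig_target, sig_other, notsig_target, notsig_other), nrow = 2)
  fisher.test(tab, alternative = "greater")$p.value
}

# One p-value per transcription factor
p_values <- sapply(
  split(regulons$target, regulons$tf),
  tf_enrichment,
  sig_genes = sig_genes,
  universe = universe
)
print(p_values)
```

In this toy setup all four TF_A targets are in the gene list while none of the TF_B targets are, so only TF_A would pass a conventional significance cutoff.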
Outputs:
This section produces 4 images and 4 CSV files -
Section 4: Gene Specific Survival analysis (Prognostic markers) -
Description:
This section performs a gene-specific survival analysis exclusively on the OA sarcoma patients, comparing samples with high (expression > median) and low (expression < median) expression of each gene to identify genes significantly associated with lower survival, as illustrated in Flowchart_GSSA. Results from Cox regression and the Kaplan-Meier (KM) log-rank test are used to select the significant survival-associated genes (sig-Surv). Functional enrichment analysis of these genes is performed using the enrichR package. Further, a forest plot representing the sig-Surv genes, their expression strata (high/low) and their hazard ratios (HRs) is plotted.
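For a single gene, the median split followed by Cox regression and a log-rank test can be sketched with the `survival` package. The data below are simulated, and samples with expression exactly equal to the median are assumed to fall in the low group; the notebook's own handling (see Flowchart_GSSA) may differ.

```r
library(survival)

# Simulated data for one gene across 60 patients (not real TCGA values)
set.seed(42)
n <- 60
expression <- rnorm(n, mean = 8, sd = 2)

# Simulated survival times in which higher expression shortens survival
time   <- rexp(n, rate = 0.05 * exp(0.4 * (expression - 8)))
status <- rbinom(n, size = 1, prob = 0.8)  # 1 = event observed, 0 = censored

# Median split; assumption: values equal to the median go to "low"
group <- factor(
  ifelse(expression > median(expression), "high", "low"),
  levels = c("low", "high")
)

# Cox regression: hazard ratio of high vs low expression
cox_fit <- coxph(Surv(time, status) ~ group)
hazard_ratio <- exp(coef(cox_fit))

# Log-rank test comparing the two Kaplan-Meier curves
logrank <- survdiff(Surv(time, status) ~ group)
logrank_p <- 1 - pchisq(logrank$chisq, df = length(logrank$n) - 1)

print(hazard_ratio)
print(logrank_p)
```

Repeating this per gene and keeping the genes that pass both tests would give a candidate list comparable in spirit to the sig-Surv set described above.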
Outputs:
This section produces 4 images and 2 CSV files -
- DGEA of significant survival associated genes
- FEA of significant survival associated genes
- KM Plot for the significant survival associated genes
- Forest Plot (Cox) for the survival associated genes
- Gene Specific Survival Analysis all genes.csv
- Significant differentially expressed genes associated with survival(sig-Survival)
The authors would like to express their gratitude to Adewale Ogunleye and Richard Agyekum from the Hackbio team for their mentorship in completing this research project.