Detecting cancer subtypes with machine learning.
This repository contains the data, code, and manuscript accompanying the preprint:
WF Flynn, S Namburi, CA Paisie, HV Reddi, S Li, KRK Murthy, J Georgy. "Trace the cancer of unknown primary origin and molecular subtype via machine learning." Submitted, 2018.
The preprint is currently available at bioRxiv.
The code in this repository is free for academic and non-commercial use,
subject to the following License (also available in docx format).
This project is organized using a subset of the Cookiecutter Data Science project structure.
All data and results, and most visualizations, can be generated from scratch
using the make command. A full build of the project can be done with:
make requirements
make data
make models
make viz
Note: I've run into a problem building the R portion of the environment
on machines that have existing R installations. Running make requirements
may corrupt your existing R installation. See
this conda issue
for more info. I am looking into a workaround.
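For scripted or repeated builds, the four make targets above can be driven from Python. The following is a minimal sketch and is not part of the repository: the names TARGETS and run_build and the injectable runner argument are hypothetical, and it assumes that make is on the PATH, that the script is started from the project root, and that each target depends on the ones before it.

```python
import subprocess

# Build targets in the order listed in this README.
TARGETS = ["requirements", "data", "models", "viz"]


def run_build(targets=TARGETS, runner=subprocess.run):
    """Run `make <target>` for each target in order, stopping at the first failure.

    `runner` is injectable so the sequence can be exercised without invoking make.
    Returns the list of targets that completed successfully.
    """
    completed = []
    for target in targets:
        result = runner(["make", target])
        if result.returncode != 0:
            # Assumption: later targets need the earlier ones, so stop here.
            break
        completed.append(target)
    return completed
```

Calling run_build() from the project root would attempt the full build and return the targets that finished, which makes it easy to see where a build stopped.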
In order to produce the models and visualizations, this project requires
conda, through which R and Python 3.6 will be installed along with their
needed modules/packages.
Running make requirements will:
- Create and activate a conda environment named tcga_subtype_classification.
- Install R and Python 3.6 along with the packages listed in requirements.txt and requirements_conda.txt.
- Test these installations.
If you do not have conda installed, you can get a minimal
installation through miniconda.
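Before running make requirements, it may help to check for the two conditions mentioned above: whether conda is available, and whether an R installation already exists that the build could affect. The sketch below is an illustration only. The function name preflight is hypothetical, and looking for executables on the PATH is a heuristic that will miss an R installation located elsewhere.

```python
import shutil


def preflight(which=shutil.which):
    """Return a list of warnings to review before running `make requirements`.

    `which` is injectable for testing; by default it searches the PATH.
    """
    warnings = []
    if which("conda") is None:
        warnings.append("conda not found on PATH; install miniconda first.")
    if which("R") is not None:
        warnings.append(
            "An existing R installation is on PATH; "
            "`make requirements` may corrupt it."
        )
    return warnings


for message in preflight():
    print("WARNING:", message)
```

An empty result does not guarantee a safe build; it only means neither condition was detected on the PATH.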
Figures present in the manuscript preprint can be generated automatically using
make viz or interactively using the notebooks (symlinked) in the /notebooks/
root directory.
We've also included a simple interactive web visualization that is currently
hosted at Pan Cancer Classification Portal. You
can host your own version locally using the code in the /app/ root directory:
cd app/
python3 run_flask.py [--host IP] [--port PORT]
Project based on the cookiecutter data science project template. #cookiecutterdatascience
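The optional --host and --port flags of run_flask.py shown above could be parsed along the following lines. This is a hypothetical sketch using the standard-library argparse module, not the repository's actual script; the function name parse_args and the default values (127.0.0.1 and port 5000) are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the optional --host and --port flags from the usage line."""
    parser = argparse.ArgumentParser(description="Serve the visualization locally.")
    parser.add_argument("--host", default="127.0.0.1", metavar="IP",
                        help="interface to bind (assumed default: localhost)")
    parser.add_argument("--port", default=5000, type=int, metavar="PORT",
                        help="TCP port to listen on (assumed default: 5000)")
    return parser.parse_args(argv)


args = parse_args(["--host", "0.0.0.0", "--port", "8080"])
print(args.host, args.port)
```

The parsed values would then typically be handed to a Flask application's run method, for example app.run(host=args.host, port=args.port).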