Input data format

scTCR-Guide uses bundle directories for training, evaluation, and inference. A bundle stores a count matrix, split tables, and metadata in a simple on-disk layout.

Bundle structure

bundle/
  metadata.json
  rna.npy
  train.parquet
  val.parquet        # required for training and threshold tuning
  test.parquet       # required for internal test evaluation

Inference requires only one split table. The default split name is train, so train.parquet is read unless another split is specified.
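A quick layout check can catch a missing file before a run starts. The following is a minimal sketch, not part of scTCR-Guide: the helper name `missing_bundle_files` is hypothetical, and it assumes each split table is stored as `<split>.parquet`, as in the listing above.

```python
import tempfile
from pathlib import Path


def missing_bundle_files(bundle_dir, splits=("train",)):
    """Return the expected file names that are absent from bundle_dir."""
    # Assumption: each split table is stored as <split>.parquet.
    required = ["metadata.json", "rna.npy"]
    required += [f"{split}.parquet" for split in splits]
    root = Path(bundle_dir)
    return [name for name in required if not (root / name).is_file()]


# Toy inference bundle made of empty placeholder files.
demo_bundle = Path(tempfile.mkdtemp()) / "bundle"
demo_bundle.mkdir()
for name in ("metadata.json", "rna.npy", "train.parquet"):
    (demo_bundle / name).touch()

# Inference with the default split needs only train.parquet.
missing_for_inference = missing_bundle_files(demo_bundle)

# Training plus internal test evaluation also needs val and test.
missing_for_full_run = missing_bundle_files(
    demo_bundle, splits=("train", "val", "test")
)
```

The helper only checks that files exist. It does not open them or inspect their contents.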

rna.npy

rna.npy is a NumPy array with shape:

n_cells x n_genes

Rows are cells. Columns are genes. Values should be raw UMI counts or count-like expression values before log normalization.
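As an illustration, a count matrix can be checked and saved with NumPy as sketched below. The helper `check_rna_matrix`, the float32 dtype, and the finite and non-negative checks are assumptions made for this example, not requirements stated by scTCR-Guide.

```python
import tempfile
from pathlib import Path

import numpy as np


def check_rna_matrix(rna):
    """Raise ValueError if the array does not look like a cells x genes count matrix."""
    if rna.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {rna.ndim} dimension(s)")
    # Assumed sanity checks: counts should be finite and non-negative.
    if not np.isfinite(rna).all():
        raise ValueError("matrix contains NaN or infinite values")
    if (rna < 0).any():
        raise ValueError("matrix contains negative values")


# Toy data: 4 cells x 3 genes of raw UMI counts (dtype is an assumption).
counts = np.array(
    [[0, 3, 1], [2, 0, 0], [5, 1, 0], [0, 0, 7]], dtype=np.float32
)
check_rna_matrix(counts)

bundle_dir = Path(tempfile.mkdtemp()) / "bundle"
bundle_dir.mkdir()
np.save(bundle_dir / "rna.npy", counts)

# Reload and read off the two dimensions: rows are cells, columns are genes.
loaded = np.load(bundle_dir / "rna.npy")
n_cells, n_genes = loaded.shape
```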

metadata.json

The metadata file must contain:

{
  "gene_names": ["CD8A", "NKG7", "GZMB"]
}

gene_names must match the column order of rna.npy: the entry at position j names column j, so the list has exactly n_genes entries.
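A length check is the cheapest way to detect a metadata file that has drifted from its matrix. The sketch below uses a hypothetical helper, `check_gene_names`, and takes the column count as a plain integer. It cannot verify the order itself, only that the number of names equals the number of columns.

```python
import json


def check_gene_names(metadata, n_genes):
    """Return gene_names after checking it has one entry per matrix column."""
    names = metadata.get("gene_names")
    if names is None:
        raise ValueError("metadata.json has no 'gene_names' key")
    if len(names) != n_genes:
        raise ValueError(
            f"gene_names has {len(names)} entries but rna.npy has {n_genes} columns"
        )
    return names


# Toy metadata matching a 3-column matrix.
metadata = json.loads('{"gene_names": ["CD8A", "NKG7", "GZMB"]}')
gene_names = check_gene_names(metadata, n_genes=3)

# Position j in gene_names names column j of rna.npy.
column_of_gene = {name: j for j, name in enumerate(gene_names)}
```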

Training bundles should also contain:

{
  "filtered_gene_names": ["CD8A", "NKG7", "GZMB"],
  "filtered_gene_indices": [0, 1, 2],
  "clone_state_to_id": {"Low": 0, "High": 1},
  "asinh_train_mean": [0.0, 0.0, 0.0],
  "asinh_train_std": [1.0, 1.0, 1.0]
}

For released-model inference, these training statistics are loaded from models/scTCR-Guide-CD8/preprocessing.json, not from the user bundle.
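The field names asinh_train_mean and asinh_train_std suggest per-gene statistics of asinh-transformed training counts over the filtered genes. That reading is an inference from the names, not something this document states, so the following is only a sketch of how such statistics could be computed and applied. The function `asinh_standardize` and the zero-variance guard are hypothetical.

```python
import numpy as np


def asinh_standardize(counts, gene_indices, mean, std):
    """Select genes, apply asinh, then z-score with the supplied statistics."""
    x = np.arcsinh(counts[:, gene_indices].astype(np.float64))
    # Assumed guard: leave zero-variance genes unscaled to avoid dividing by zero.
    safe_std = np.where(std > 0, std, 1.0)
    return (x - mean) / safe_std


# Toy training counts: 4 cells x 3 genes.
train_counts = np.array(
    [[0, 3, 1], [2, 0, 0], [5, 1, 0], [0, 0, 7]], dtype=np.float32
)
filtered_gene_indices = [0, 2]

# Per-gene statistics computed on the training cells only (assumed procedure).
train_asinh = np.arcsinh(train_counts[:, filtered_gene_indices].astype(np.float64))
asinh_train_mean = train_asinh.mean(axis=0)
asinh_train_std = train_asinh.std(axis=0)

z = asinh_standardize(
    train_counts, filtered_gene_indices, asinh_train_mean, asinh_train_std
)
```

Under this reading, the same mean and std would be reused unchanged for validation, test, and inference data, which is consistent with released-model inference loading them from preprocessing.json.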

Split tables

Split tables are Parquet files. The required column is:

| Column | Required | Description |
| --- | --- | --- |
| rna_index | yes | Row index into rna.npy |

Recommended optional columns:

| Column | Description |
| --- | --- |
| cell_id | Cell identifier |
| barcode | Original cell barcode |
| sample_id | Sample identifier |
| donor_id | Donor identifier |
| study_id | Study or cohort identifier |
| source | Data source name |

Labeled bundles used for training and evaluation must also contain:

| Column | Description |
| --- | --- |
| clone_state | Low or High |
| clone_state_id | 0 for Low, 1 for High |
| clone_size | Observed clonotype size when paired scTCR-seq is available |
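The column requirements above can be checked on a pandas DataFrame before the table is written to Parquet. This is a minimal sketch: the helper `check_split_table` is hypothetical, rna_index is assumed to be zero-based, and clone_size is not enforced because it depends on paired scTCR-seq being available.

```python
import pandas as pd

# Mapping taken from the clone_state_id description: 0 for Low, 1 for High.
CLONE_STATE_TO_ID = {"Low": 0, "High": 1}


def check_split_table(table, n_cells, labeled=False):
    """Raise ValueError if the split table breaks the documented column rules."""
    if "rna_index" not in table.columns:
        raise ValueError("split table has no 'rna_index' column")
    idx = table["rna_index"]
    # Assumption: rna_index is a zero-based row index into rna.npy.
    if ((idx < 0) | (idx >= n_cells)).any():
        raise ValueError("rna_index contains values outside the rows of rna.npy")
    if labeled:
        missing = {"clone_state", "clone_state_id"} - set(table.columns)
        if missing:
            raise ValueError(f"labeled table is missing columns: {sorted(missing)}")
        expected = table["clone_state"].map(CLONE_STATE_TO_ID)
        if not expected.eq(table["clone_state_id"]).all():
            raise ValueError("clone_state and clone_state_id disagree")


# Toy labeled split for a matrix with 4 cells.
train_table = pd.DataFrame(
    {
        "rna_index": [0, 1, 3],
        "cell_id": ["c0", "c1", "c3"],
        "clone_state": ["Low", "High", "Low"],
        "clone_state_id": [0, 1, 0],
        "clone_size": [1, 12, 2],
    }
)
check_split_table(train_table, n_cells=4, labeled=True)
```

A table that passes can then be written with `train_table.to_parquet("bundle/train.parquet")`, which requires a Parquet engine such as pyarrow or fastparquet to be installed.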

CD8 cell requirement

The released model is intended for CD8 T cells. Apply standard single-cell quality control, annotate T cells, and subset to high-confidence CD8 T cells before running inference.