Input data format

scTCR-Guide uses bundle directories for training, evaluation, and inference. A bundle stores a count matrix, split tables, and metadata in a simple on-disk layout.

Bundle structure

bundle/
  metadata.json
  rna.npy
  train.parquet
  val.parquet        # required for training and threshold tuning
  test.parquet       # required for internal test evaluation

For inference, only one split table is required. The default split name is train.
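The layout rules above can be checked before a run. The following is a minimal sketch, not part of scTCR-Guide: the function name `missing_bundle_files` and the mode labels `"train"`, `"test"`, and `"inference"` are assumptions made for illustration. It assumes that training needs both `train.parquet` and `val.parquet`.

```python
from pathlib import Path

# Files every bundle carries, following the layout above.
ALWAYS_REQUIRED = ["metadata.json", "rna.npy"]

# Split tables per mode. The mode labels are hypothetical names for this sketch.
SPLITS_BY_MODE = {
    "train": ["train.parquet", "val.parquet"],
    "test": ["test.parquet"],
}


def missing_bundle_files(bundle_dir, mode="inference", split="train"):
    """Return the names of expected files that are absent from bundle_dir."""
    expected = list(ALWAYS_REQUIRED)
    if mode == "inference":
        # Inference needs a single split table; the default split name is "train".
        expected.append(f"{split}.parquet")
    else:
        expected.extend(SPLITS_BY_MODE[mode])
    root = Path(bundle_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file is present, for example `missing_bundle_files("bundle", mode="train")`.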

`rna.npy`

rna.npy is a two-dimensional NumPy array with shape (n_cells, n_genes).

Rows are cells. Columns are genes. Values should be raw UMI counts or count-like expression values before log normalization.
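As an illustration of the requirements above, here is a small sketch that validates a dense cells-by-genes matrix and writes it with `np.save`. The helper name `save_counts` is hypothetical, and the rejection of negative values is an assumption used as a rough signal that the matrix was already normalized or scaled.

```python
import numpy as np


def save_counts(counts, path):
    """Validate a dense cells-by-genes count matrix and write it with np.save."""
    counts = np.asarray(counts)
    if counts.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got {counts.ndim} dimension(s)")
    if not np.issubdtype(counts.dtype, np.number):
        raise ValueError("counts must be numeric")
    if (counts < 0).any():
        # Raw counts are never negative; this check is a heuristic for the sketch.
        raise ValueError("negative values suggest the matrix is not raw counts")
    np.save(path, counts)
    return counts.shape  # (n_cells, n_genes)
```

A sparse matrix would need to be converted to a dense array first, since `np.asarray` does not densify SciPy sparse matrices.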

`metadata.json`

The metadata file must contain:

{
  "gene_names": ["CD8A", "NKG7", "GZMB"]
}

gene_names must list one name per column of rna.npy, in the same order as the columns.
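A quick consistency check can catch a mismatched gene list early. The sketch below uses a hypothetical helper, `check_gene_names`, and only compares lengths. It cannot confirm that the names are in the right order, which still has to be guaranteed when the bundle is built.

```python
import json

import numpy as np


def check_gene_names(metadata_path, rna_path):
    """Raise ValueError if gene_names cannot label the columns of rna.npy."""
    with open(metadata_path) as fh:
        gene_names = json.load(fh)["gene_names"]
    # mmap_mode avoids reading the whole matrix just to inspect its shape.
    n_genes = np.load(rna_path, mmap_mode="r").shape[1]
    if len(gene_names) != n_genes:
        raise ValueError(f"{len(gene_names)} gene names for {n_genes} columns")
    return gene_names
```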

Training bundles should also contain:

{
  "filtered_gene_names": ["CD8A", "NKG7", "GZMB"],
  "filtered_gene_indices": [0, 1, 2],
  "clone_state_to_id": {"Low": 0, "High": 1},
  "asinh_train_mean": [0.0, 0.0, 0.0],
  "asinh_train_std": [1.0, 1.0, 1.0]
}

For released-model inference, these training statistics are loaded from models/scTCR-Guide-CD8/preprocessing.json, not from the user bundle.
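This page does not spell out how the training statistics are computed. Judging only from the key names, one plausible reading is a per-gene mean and standard deviation of arcsinh-transformed counts over the training cells, restricted to the filtered genes. The sketch below follows that assumption. The function `asinh_train_stats` and the zero-variance guard are illustrative and are not taken from the scTCR-Guide code.

```python
import numpy as np


def asinh_train_stats(rna, train_rows, gene_indices):
    """Per-gene mean and std of arcsinh-transformed counts over training cells.

    Assumed procedure, inferred from the metadata key names only.
    """
    sub = rna[np.asarray(train_rows)][:, np.asarray(gene_indices)]
    x = np.arcsinh(sub.astype(np.float64))
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Replace zero std with 1.0 so later division is safe (a choice for this sketch).
    std = np.where(std > 0, std, 1.0)
    return mean.tolist(), std.tolist()
```

Here `train_rows` would be the `rna_index` values from the training split table and `gene_indices` the `filtered_gene_indices` list.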

Split tables

Split tables are Parquet files. The required column is:

| Column | Required | Description |
| --- | --- | --- |
| `rna_index` | yes | Row index into `rna.npy` |

Recommended optional columns:

| Column | Description |
| --- | --- |
| `cell_id` | Cell identifier |
| `barcode` | Original cell barcode |
| `sample_id` | Sample identifier |
| `donor_id` | Donor identifier |
| `study_id` | Study or cohort identifier |
| `source` | Data source name |

Labeled bundles used for training and evaluation must also contain:

| Column | Description |
| --- | --- |
| `clone_state` | `Low` or `High` |
| `clone_state_id` | `0` for Low, `1` for High |
| `clone_size` | Observed clonotype size when paired scTCR-seq is available |
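The column rules for split tables can be verified on a pandas DataFrame before it is written to Parquet. This is a minimal sketch with a hypothetical helper, `validate_split_table`. It checks `rna_index` bounds and the agreement between `clone_state` and `clone_state_id`, and deliberately leaves `clone_size` unchecked because that column depends on paired scTCR-seq being available.

```python
import pandas as pd

# Mapping taken from the clone_state_to_id example in metadata.json.
CLONE_STATE_TO_ID = {"Low": 0, "High": 1}
LABEL_COLUMNS = ["clone_state", "clone_state_id"]


def validate_split_table(df, n_cells, labeled=False):
    """Return a list of problems found in a split table; empty means it passed."""
    if "rna_index" not in df.columns:
        return ["missing required column: rna_index"]
    problems = []
    idx = df["rna_index"]
    if not pd.api.types.is_integer_dtype(idx):
        problems.append("rna_index must be an integer column")
    elif ((idx < 0) | (idx >= n_cells)).any():
        problems.append("rna_index out of range for rna.npy")
    if labeled:
        missing = [c for c in LABEL_COLUMNS if c not in df.columns]
        problems.extend(f"missing label column: {c}" for c in missing)
        if not missing:
            # Unknown states map to NaN and therefore count as disagreements.
            expected = df["clone_state"].map(CLONE_STATE_TO_ID)
            if not expected.eq(df["clone_state_id"]).all():
                problems.append("clone_state and clone_state_id disagree")
    return problems
```

Once a table passes, it can be written with `df.to_parquet("bundle/train.parquet", index=False)`, which requires a Parquet engine such as pyarrow to be installed.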

CD8 cell requirement

The released model is intended for CD8 T cells. Apply standard single-cell quality control, annotate T cells, and subset to high-confidence CD8 T cells before running inference.
