scTCR-Guide uses bundle directories for training, evaluation, and inference. A bundle stores a count matrix, split tables, and metadata in a simple on-disk layout:

    bundle/
      metadata.json
      rna.npy
      train.parquet
      val.parquet     # required for training and threshold tuning
      test.parquet    # required for internal test evaluation

For inference, only one split table is required. The default split name is train.
rna.npy is a NumPy array with shape (n_cells, n_genes).
Rows are cells. Columns are genes. Values should be raw UMI counts or count-like expression values before log normalization.
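A quick sanity check before writing rna.npy can catch a matrix that was already log-normalized. The sketch below is not part of scTCR-Guide: the helper name `check_count_matrix` and the "mostly fractional values" heuristic are assumptions made for illustration. Because count-like values are also accepted, treat the last message as a warning rather than an error.

```python
import numpy as np


def check_count_matrix(rna: np.ndarray) -> list[str]:
    """Return a list of problems found in a candidate rna.npy array (hypothetical helper)."""
    problems = []
    if rna.ndim != 2:
        problems.append(f"expected a 2-D n_cells x n_genes array, got {rna.ndim} dimension(s)")
        return problems
    if not np.all(np.isfinite(rna)):
        problems.append("matrix contains NaN or infinite values")
    elif np.any(rna < 0):
        problems.append("matrix contains negative values")
    # Heuristic (assumption): raw UMI counts are whole numbers, so a matrix whose
    # nonzero entries are mostly fractional was probably normalized or log-transformed.
    nonzero = rna[rna != 0]
    if nonzero.size and np.mean(nonzero != np.round(nonzero)) > 0.5:
        problems.append("most nonzero values are fractional; matrix may already be normalized")
    return problems


# Toy example: 4 cells x 3 genes of raw counts.
counts = np.array([[0, 3, 1], [2, 0, 0], [5, 1, 0], [0, 0, 7]], dtype=np.float32)
print(check_count_matrix(counts))            # raw counts
print(check_count_matrix(np.log1p(counts)))  # log-transformed copy of the same matrix
```

Once the checks pass, `np.save("bundle/rna.npy", counts)` writes the array into the bundle directory.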
The metadata file must contain:

    {
      "gene_names": ["CD8A", "NKG7", "GZMB"]
    }

gene_names must match the column order of rna.npy.
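A mismatch between gene_names and the matrix width is easy to detect before running anything. The following is a minimal sketch using only the standard library; `check_gene_names` is a hypothetical helper, not a scTCR-Guide function. It can confirm only that the number of names equals the number of columns. The order itself still has to be carried over from the same source that produced rna.npy.

```python
import json


def check_gene_names(metadata: dict, n_genes: int) -> None:
    """Raise ValueError if gene_names cannot describe a matrix with n_genes columns."""
    names = metadata.get("gene_names")
    if not isinstance(names, list) or not all(isinstance(g, str) for g in names):
        raise ValueError("metadata must contain gene_names as a list of strings")
    if len(names) != n_genes:
        raise ValueError(
            f"gene_names has {len(names)} entries but rna.npy has {n_genes} columns"
        )


metadata = {"gene_names": ["CD8A", "NKG7", "GZMB"]}
check_gene_names(metadata, n_genes=3)  # passes silently for a 3-column matrix

# Serialized form that would be written to bundle/metadata.json.
metadata_text = json.dumps(metadata, indent=2)
```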
Training bundles should also contain:

    {
      "filtered_gene_names": ["CD8A", "NKG7", "GZMB"],
      "filtered_gene_indices": [0, 1, 2],
      "clone_state_to_id": {"Low": 0, "High": 1},
      "asinh_train_mean": [0.0, 0.0, 0.0],
      "asinh_train_std": [1.0, 1.0, 1.0]
    }

For released-model inference, these training statistics are loaded from models/scTCR-Guide-CD8/preprocessing.json, not from the user bundle.
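The field names suggest that expression values are restricted to the filtered genes, passed through an inverse hyperbolic sine transform, and then standardized with the training-set mean and standard deviation. The text does not spell out the exact transform, so the sketch below is an assumption rather than the project's implementation: the function name `asinh_standardize`, the absence of a scaling cofactor, and the epsilon guard are all illustrative.

```python
import numpy as np


def asinh_standardize(rna, metadata, eps=1e-8):
    """Assumed preprocessing: select filtered genes, apply asinh, then z-score."""
    idx = np.asarray(metadata["filtered_gene_indices"], dtype=int)
    mean = np.asarray(metadata["asinh_train_mean"], dtype=np.float64)
    std = np.asarray(metadata["asinh_train_std"], dtype=np.float64)
    if not (len(idx) == len(mean) == len(std)):
        raise ValueError("filtered gene indices and training statistics must have equal length")
    x = np.arcsinh(rna[:, idx].astype(np.float64))
    # Guard against zero standard deviations (illustrative choice, not from the source).
    return (x - mean) / np.maximum(std, eps)


metadata = {
    "filtered_gene_indices": [0, 1, 2],
    "asinh_train_mean": [0.0, 0.0, 0.0],
    "asinh_train_std": [1.0, 1.0, 1.0],
}
rna = np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32)
features = asinh_standardize(rna, metadata)  # shape (2, 3)
```

With a mean of 0 and a standard deviation of 1, as in the example metadata, the result is simply the asinh-transformed counts.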
Split tables are Parquet files. The required column is:
| Column | Required | Description |
|---|---|---|
| `rna_index` | yes | Row index into rna.npy |
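Because rna_index points into rna.npy, an out-of-range value only surfaces when rows are gathered. The sketch below checks a split table up front with pandas. `check_split_table` is a hypothetical helper, and zero-based indexing is an assumption based on the zero-based filtered_gene_indices shown in the metadata example.

```python
import numpy as np
import pandas as pd


def check_split_table(split: pd.DataFrame, n_cells: int) -> None:
    """Raise ValueError if rna_index cannot index an rna.npy with n_cells rows."""
    if "rna_index" not in split.columns:
        raise ValueError("split table is missing the required rna_index column")
    idx = split["rna_index"]
    if not pd.api.types.is_integer_dtype(idx):
        raise ValueError("rna_index must be an integer column")
    if idx.min() < 0 or idx.max() >= n_cells:
        raise ValueError("rna_index values must lie in [0, n_cells)")


rna = np.zeros((4, 3), dtype=np.float32)  # stand-in for np.load("bundle/rna.npy")
split = pd.DataFrame({"rna_index": [0, 2, 3], "cell_id": ["c0", "c2", "c3"]})
check_split_table(split, n_cells=rna.shape[0])

# Rows of the count matrix that belong to this split.
cells = rna[split["rna_index"].to_numpy()]
```

Writing the table with `split.to_parquet("bundle/train.parquet")` requires a Parquet engine such as pyarrow or fastparquet to be installed alongside pandas.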
Recommended optional columns:
| Column | Description |
|---|---|
| `cell_id` | Cell identifier |
| `barcode` | Original cell barcode |
| `sample_id` | Sample identifier |
| `donor_id` | Donor identifier |
| `study_id` | Study or cohort identifier |
| `source` | Data source name |
Split tables in labeled bundles used for training and evaluation must also contain:
| Column | Description |
|---|---|
| `clone_state` | Low or High |
| `clone_state_id` | 0 for Low, 1 for High |
| `clone_size` | Observed clonotype size when paired scTCR-seq is available |
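Since clone_state_id is fully determined by clone_state, deriving one from the other avoids inconsistent labels. The following is a minimal sketch with pandas, not the project's own code. `add_clone_state_id` is a hypothetical helper, and the mapping repeats the clone_state_to_id values from the metadata example.

```python
import pandas as pd

# Same mapping as clone_state_to_id in the training metadata.
CLONE_STATE_TO_ID = {"Low": 0, "High": 1}


def add_clone_state_id(split: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the split table with clone_state_id derived from clone_state."""
    unknown = set(split["clone_state"]) - set(CLONE_STATE_TO_ID)
    if unknown:
        raise ValueError(f"unexpected clone_state values: {sorted(map(str, unknown))}")
    out = split.copy()
    out["clone_state_id"] = out["clone_state"].map(CLONE_STATE_TO_ID).astype("int64")
    return out


labeled = pd.DataFrame(
    {"rna_index": [0, 1, 2], "clone_state": ["Low", "High", "Low"]}
)
labeled = add_clone_state_id(labeled)
print(labeled)
```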
The released model is intended for CD8 T cells. Apply standard single-cell quality control, annotate T cells, and subset to high-confidence CD8 T cells before running inference.