pathotypr implements a custom Random Forest classifier that operates directly on sparse feature vectors — no dense matrix conversion required. This eliminates the O(N × 1M) memory overhead that a library like smartcore would require for N samples with 1M feature buckets.
Each tree is a binary classifier built with the CART algorithm (Classification and Regression Trees):

```rust
enum TreeNode {
    Split { feature: u32, threshold: f32, left: u32, right: u32 },
    Leaf { class: u32 },
}
```

Nodes are stored in a flat arena (`Vec<TreeNode>`) with index-based child pointers. This gives cache-friendly traversal and compact serialization.
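A flat arena means children are plain indices into the same `Vec` rather than boxed pointers. A minimal sketch of what such an arena looks like (the `build_demo_tree` helper is illustrative, not part of pathotypr; the real arena is built during CART training):

```rust
#[allow(dead_code)]
enum TreeNode {
    Split { feature: u32, threshold: f32, left: u32, right: u32 },
    Leaf { class: u32 },
}

/// Build a depth-1 demo tree: root split at index 0, leaves at 1 and 2.
fn build_demo_tree() -> Vec<TreeNode> {
    vec![
        TreeNode::Split { feature: 42, threshold: 3.5, left: 1, right: 2 },
        TreeNode::Leaf { class: 0 },
        TreeNode::Leaf { class: 1 },
    ]
}

fn main() {
    let arena = build_demo_tree();
    // Child links are u32 indices into the same Vec, so the whole tree
    // serializes as one contiguous buffer with no pointer fixup.
    if let TreeNode::Split { left, right, .. } = &arena[0] {
        assert!(matches!(arena[*left as usize], TreeNode::Leaf { class: 0 }));
        assert!(matches!(arena[*right as usize], TreeNode::Leaf { class: 1 }));
    }
}
```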
Input: Sparse feature vectors [(bucket_idx, count)], class labels, bootstrap sample indices.
Algorithm (iterative DFS via explicit stack):

- Start with all bootstrap samples at the root
- For each node:
  - Count class frequencies in the current sample set
  - Check stopping conditions: pure node (single class), `depth >= max_depth`, or `n_samples <= 2 × min_samples_leaf`
  - If not stopped, find the best split (see below)
  - Partition samples by the split and push children onto the stack

Feature subsampling: Collect all features present in the current node's data. If more than `sqrt(n_features)` features are present, randomly sample `sqrt(n_features)` of them via partial Fisher-Yates shuffle.
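A partial Fisher-Yates shuffle selects k distinct features in O(k) swaps, with no need to shuffle the whole list. A minimal sketch, using a tiny xorshift generator so the example has no external dependencies (the RNG and the `sample_features` helper are illustrative, not pathotypr's actual API):

```rust
// Stand-in xorshift64 RNG, used here only to keep the sketch self-contained.
struct XorShift(u64);
impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Partial Fisher-Yates: after k swap steps, the first k slots hold a
/// uniform random sample of k distinct elements.
fn sample_features(mut features: Vec<u32>, k: usize, rng: &mut XorShift) -> Vec<u32> {
    let n = features.len();
    for i in 0..k.min(n) {
        let j = i + (rng.next() as usize) % (n - i); // pick from the unsampled tail
        features.swap(i, j);
    }
    features.truncate(k.min(n));
    features
}

fn main() {
    let feats: Vec<u32> = (0..100).collect();
    let k = (feats.len() as f64).sqrt() as usize; // sqrt(n_features) = 10
    let mut rng = XorShift(42);
    let sampled = sample_features(feats, k, &mut rng);
    assert_eq!(sampled.len(), 10);
    // All sampled features are distinct.
    let mut s = sampled.clone();
    s.sort();
    s.dedup();
    assert_eq!(s.len(), 10);
}
```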
For each candidate feature:

- Extract `(value, label)` pairs via binary search on each sample's sparse row
- Sort by value
- Scan sorted values to find the threshold that maximizes Gini impurity reduction
- Only consider splits where both sides have ≥ `min_samples_leaf` samples

Gini impurity: `1 - Σ(p_i²)`, where `p_i` is the proportion of class i.

Threshold: Midpoint between consecutive distinct values at the best split point.
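The Gini criterion above can be sketched directly from the formula (the `gini` helper name is illustrative; impurity reduction for a split is parent impurity minus the size-weighted child impurities):

```rust
/// Gini impurity: 1 - Σ p_i² over class proportions.
fn gini(class_counts: &[usize]) -> f64 {
    let total: usize = class_counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let t = total as f64;
    1.0 - class_counts
        .iter()
        .map(|&c| {
            let p = c as f64 / t;
            p * p
        })
        .sum::<f64>()
}

fn main() {
    assert_eq!(gini(&[10, 0]), 0.0); // pure node: zero impurity
    assert!((gini(&[5, 5]) - 0.5).abs() < 1e-12); // maximally mixed, 2 classes

    // Impurity reduction = parent - weighted average of child impurities.
    let parent = gini(&[6, 4]);
    let reduction = parent - (0.6 * gini(&[6, 0]) + 0.4 * gini(&[0, 4]));
    assert!((reduction - parent).abs() < 1e-12); // a perfect split removes all impurity
}
```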
Traversal from root to leaf in O(tree_depth):

```rust
fn predict_one(tree: &[TreeNode], row: &[(u32, f32)]) -> u32 {
    let mut node = &tree[0]; // root is stored at index 0
    loop {
        match node {
            TreeNode::Leaf { class } => return *class,
            TreeNode::Split { feature, threshold, left, right } => {
                let val = sparse_lookup(row, *feature); // O(log nnz)
                node = if val <= *threshold {
                    &tree[*left as usize]
                } else {
                    &tree[*right as usize]
                };
            }
        }
    }
}
```

The `sparse_lookup()` function uses binary search on the sorted sparse row, returning 0.0 for absent features.
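Such a lookup falls out of the standard library's `binary_search_by_key`; a sketch (the exact signature is an assumption):

```rust
/// Binary search a sorted sparse row for a feature index; absent features
/// read as 0.0. (Sketch of the sparse_lookup() described above.)
fn sparse_lookup(row: &[(u32, f32)], feature: u32) -> f32 {
    match row.binary_search_by_key(&feature, |&(idx, _)| idx) {
        Ok(i) => row[i].1,
        Err(_) => 0.0,
    }
}

fn main() {
    let row = [(3, 1.0), (17, 4.0), (902, 2.0)]; // sorted by bucket index
    assert_eq!(sparse_lookup(&row, 17), 4.0);
    assert_eq!(sparse_lookup(&row, 5), 0.0); // absent feature → 0.0
}
```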
- 100 trees trained in parallel with rayon
- Each tree gets a bootstrap sample: N samples drawn with replacement from N total (each sample appears in ~63% of trees)
- Deterministic seeds: A master RNG (seed=42) generates one seed per tree. Each tree's bootstrap is reproducible from its seed.
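The seeding scheme can be sketched as follows. The SplitMix64 step is a stand-in for whatever generator pathotypr actually uses; what matters is that each tree's bootstrap is a pure function of its seed:

```rust
// SplitMix64 step as a stand-in RNG (illustrative only).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Draw n indices with replacement: one bootstrap sample, fully determined
/// by its seed, so it can be regenerated later for OOB evaluation.
fn bootstrap_indices(seed: u64, n: usize) -> Vec<usize> {
    let mut state = seed;
    (0..n).map(|_| (splitmix64(&mut state) as usize) % n).collect()
}

fn main() {
    // Master RNG (seed = 42) hands one seed to each of the 100 trees.
    let mut master = 42u64;
    let tree_seeds: Vec<u64> = (0..100).map(|_| splitmix64(&mut master)).collect();

    let n = 1000;
    let boot = bootstrap_indices(tree_seeds[0], n);
    assert_eq!(boot, bootstrap_indices(tree_seeds[0], n)); // reproducible from the seed

    // Roughly 1 - 1/e ≈ 63% of samples land in a given bootstrap.
    let mut seen = vec![false; n];
    for &i in &boot {
        seen[i] = true;
    }
    let in_bag = seen.iter().filter(|&&s| s).count();
    assert!(in_bag > 550 && in_bag < 700);
}
```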
For prediction, all 100 trees vote. The class with the most votes wins:
confidence = winner_votes / total_trees
margin = (winner_votes - runner_up_votes) / total_trees
Both metrics are reported alongside the predicted class.
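The vote tally reduces to counting and comparing the top two classes; a minimal sketch (the `tally` helper is illustrative, not pathotypr's actual function):

```rust
/// Aggregate per-tree votes into (winning class, confidence, margin).
fn tally(votes: &[u32], n_classes: usize) -> (usize, f64, f64) {
    let mut counts = vec![0usize; n_classes];
    for &v in votes {
        counts[v as usize] += 1;
    }
    let total = votes.len() as f64;
    let winner = (0..n_classes).max_by_key(|&c| counts[c]).unwrap();
    let runner_up = (0..n_classes)
        .filter(|&c| c != winner)
        .map(|c| counts[c])
        .max()
        .unwrap_or(0);
    let confidence = counts[winner] as f64 / total;
    let margin = (counts[winner] - runner_up) as f64 / total;
    (winner, confidence, margin)
}

fn main() {
    // 100 trees: 70 vote class 0, 20 vote class 1, 10 vote class 2.
    let votes: Vec<u32> = std::iter::repeat(0).take(70)
        .chain(std::iter::repeat(1).take(20))
        .chain(std::iter::repeat(2).take(10))
        .collect();
    let (class, confidence, margin) = tally(&votes, 3);
    assert_eq!(class, 0);
    assert!((confidence - 0.70).abs() < 1e-12); // 70 / 100
    assert!((margin - 0.50).abs() < 1e-12);     // (70 - 20) / 100
}
```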
Each tree's bootstrap sample excludes ~37% of training data. These "out-of-bag" samples provide a free accuracy estimate:
- For each tree, regenerate its bootstrap using the stored seed
- Identify OOB samples (not in the bootstrap)
- Predict OOB samples using only that tree
- Aggregate: for each training sample, collect votes from all trees where it was OOB
- Compare majority vote to true label
Why this matters: OOB accuracy is nearly unbiased and requires no held-out test set. It's always computed, even when using CV for accuracy estimation.
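The OOB identification step is just the complement of the regenerated bootstrap; a sketch under the assumptions above (helper name is illustrative):

```rust
/// Identify out-of-bag sample indices given a tree's regenerated bootstrap.
fn oob_indices(bootstrap: &[usize], n: usize) -> Vec<usize> {
    let mut in_bag = vec![false; n];
    for &i in bootstrap {
        in_bag[i] = true;
    }
    // Everything the bootstrap never drew is out-of-bag for this tree.
    (0..n).filter(|&i| !in_bag[i]).collect()
}

fn main() {
    // 5 samples; bootstrap drew {0, 0, 2, 4, 4}, so 1 and 3 are OOB.
    let oob = oob_indices(&[0, 0, 2, 4, 4], 5);
    assert_eq!(oob, vec![1, 3]);
}
```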
Split-count importance: for each feature (bucket), count how many times it was used as a split across all 100 trees. Features that appear in many splits are more discriminative.
importance(feature_j) = Σ_{trees} count_of_splits_on_feature_j
The top 500 features are exported with:
- Their k-mer sequences (reverse-mapped from the hashing trick)
- Genomic coordinates (sequence index + position) in the training data
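Split-count importance is a single pass over every node of every tree, reusing the `TreeNode` layout from above (the `split_count_importance` helper is a sketch, not pathotypr's exact code):

```rust
use std::collections::HashMap;

#[allow(dead_code)]
enum TreeNode {
    Split { feature: u32, threshold: f32, left: u32, right: u32 },
    Leaf { class: u32 },
}

/// Count how many times each feature is used as a split across all trees.
fn split_count_importance(forest: &[Vec<TreeNode>]) -> HashMap<u32, usize> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for tree in forest {
        for node in tree {
            if let TreeNode::Split { feature, .. } = node {
                *counts.entry(*feature).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    // One small tree that splits twice on feature 7.
    let tree = vec![
        TreeNode::Split { feature: 7, threshold: 0.5, left: 1, right: 2 },
        TreeNode::Leaf { class: 0 },
        TreeNode::Split { feature: 7, threshold: 2.5, left: 3, right: 4 },
        TreeNode::Leaf { class: 1 },
        TreeNode::Leaf { class: 0 },
    ];
    let importance = split_count_importance(&[tree]);
    assert_eq!(importance[&7], 2);
}
```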
| Parameter | Default | Effect |
|---|---|---|
| `n_trees` | 100 | More trees → smoother predictions; diminishing returns past ~100 |
| `max_depth` | 20 | Limits tree complexity; prevents memorization of training noise |
| `min_samples_leaf` | 5 | Regularization; prevents splits that isolate individual samples |
| `max_features` | √(n_features) | Standard RF: decorrelates trees by forcing different feature subsets |
| `bootstrap_seed` | 42 (master) | Ensures reproducibility; each tree derives its seed from this |
| Operation | Time | Space |
|---|---|---|
| Train one tree | O(n_bootstrap × sqrt(F) × depth) | O(nodes) ≈ O(2^depth) |
| Train ensemble | Above / num_threads (rayon) | O(100 × nodes) |
| Predict one sample | O(100 × depth × log(nnz)) | O(1) |
| OOB accuracy | O(N × 100 × depth) | O(N × n_classes) |