
Random Forest on Sparse Vectors

Overview

pathotypr implements a custom Random Forest classifier that operates directly on sparse feature vectors — no dense matrix conversion required. This eliminates the O(N × 1M) memory overhead that a library like smartcore would require for N samples with 1M feature buckets.

Decision Tree (CART)

Each tree is a binary classifier built with the CART algorithm (Classification and Regression Trees):

Node Types

enum TreeNode {
    Split { feature: u32, threshold: f32, left: u32, right: u32 },
    Leaf { class: u32 },
}

Nodes are stored in a flat arena (Vec<TreeNode>) with index-based child pointers. This gives cache-friendly traversal and compact serialization.

Tree Construction

Input: Sparse feature vectors [(bucket_idx, count)], class labels, bootstrap sample indices.

Algorithm (iterative DFS via explicit stack):

  1. Start with all bootstrap samples at the root
  2. For each node:
    • Count class frequencies in the current sample set
    • Stopping conditions: pure node (single class), depth >= max_depth, or n_samples <= 2 × min_samples_leaf
    • If not stopped, find the best split (see below)
    • Partition samples by the split and push children onto the stack
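The stopping test in step 2 can be sketched as follows (an illustrative function, not the actual pathotypr code; the name should_stop is an assumption):

```rust
// Sketch of the stopping conditions from step 2: a node stops growing when
// it is pure, has hit the depth limit, or is too small to yield two valid leaves.
fn should_stop(labels: &[u32], depth: u32, max_depth: u32, min_samples_leaf: usize) -> bool {
    let pure = labels.windows(2).all(|w| w[0] == w[1]); // single class present
    pure || depth >= max_depth || labels.len() <= 2 * min_samples_leaf
}

fn main() {
    assert!(should_stop(&[1, 1, 1], 0, 20, 5)); // pure node
    assert!(should_stop(&[0, 1], 20, 20, 5)); // depth limit reached
    assert!(should_stop(&[0, 1, 0, 1], 3, 20, 5)); // 4 <= 2 * 5 samples
    assert!(!should_stop(&[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], 3, 20, 5)); // keep splitting
}
```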

Finding the Best Split

  1. Feature subsampling: Collect all features present in the current node's data. If more than sqrt(n_features) features are present, randomly sample sqrt(n_features) of them via a partial Fisher-Yates shuffle.

  2. For each candidate feature:

    • Extract (value, label) pairs via binary search on each sample's sparse row
    • Sort by value
    • Scan sorted values to find the threshold that maximizes Gini impurity reduction
    • Only consider splits where both sides have ≥ min_samples_leaf samples
  3. Gini impurity: 1 - Σ(p_i²) where p_i is the proportion of class i.

  4. Threshold: Midpoint between consecutive distinct values at the best split point.
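Steps 3 and 4 amount to a few lines of arithmetic; a minimal sketch (function names are illustrative):

```rust
// Gini impurity: 1 - sum(p_i^2) over class proportions p_i.
fn gini(counts: &[usize]) -> f32 {
    let n: usize = counts.iter().sum();
    if n == 0 {
        return 0.0;
    }
    let sum_sq: f32 = counts
        .iter()
        .map(|&c| {
            let p = c as f32 / n as f32;
            p * p
        })
        .sum();
    1.0 - sum_sq
}

// Threshold: midpoint between consecutive distinct feature values.
fn midpoint(lo: f32, hi: f32) -> f32 {
    0.5 * (lo + hi)
}

fn main() {
    assert_eq!(gini(&[5, 0]), 0.0); // pure node: no impurity
    assert_eq!(gini(&[5, 5]), 0.5); // 50/50 split: 1 - (0.25 + 0.25)
    assert_eq!(midpoint(2.0, 4.0), 3.0);
}
```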

Prediction

Traversal from root to leaf in O(tree_depth):

fn predict_one(nodes: &[TreeNode], row: &[(usize, f32)]) -> u32 {
    let mut idx = 0; // root is node 0 in the arena
    loop {
        match &nodes[idx] {
            TreeNode::Leaf { class } => return *class,
            TreeNode::Split { feature, threshold, left, right } => {
                let val = sparse_lookup(row, *feature as usize); // O(log nnz)
                idx = if val <= *threshold { *left as usize } else { *right as usize };
            }
        }
    }
}

The sparse_lookup() function uses binary search on the sorted sparse row, returning 0.0 for absent features.
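A possible shape of that lookup, using the standard-library binary search on the sorted row (a sketch, not necessarily the exact pathotypr signature):

```rust
// Binary search a sorted (feature, value) sparse row; absent features
// are implicit zeros.
fn sparse_lookup(row: &[(usize, f32)], feature: usize) -> f32 {
    match row.binary_search_by_key(&feature, |&(f, _)| f) {
        Ok(i) => row[i].1,
        Err(_) => 0.0, // feature not stored => implicit zero
    }
}

fn main() {
    let row = [(3, 1.0), (17, 2.5), (42, 4.0)]; // sorted by bucket index
    assert_eq!(sparse_lookup(&row, 17), 2.5);
    assert_eq!(sparse_lookup(&row, 5), 0.0); // absent bucket
}
```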

Ensemble (Bagging)

Training

  • 100 trees trained in parallel with rayon
  • Each tree gets a bootstrap sample: N samples drawn with replacement from N total (each sample appears in ~63% of trees)
  • Deterministic seeds: A master RNG (seed=42) generates one seed per tree. Each tree's bootstrap is reproducible from its seed.
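The seed-derivation scheme can be sketched with any deterministic generator; SplitMix64 stands in here as an example (the actual RNG used by pathotypr is not specified in this document):

```rust
// SplitMix64 step: advances the master state and returns one derived seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

fn main() {
    // Master RNG seeded with 42 hands one seed to each of the 100 trees.
    let mut master = 42u64;
    let seeds: Vec<u64> = (0..100).map(|_| splitmix64(&mut master)).collect();

    // Re-deriving from the same master seed reproduces every tree seed.
    let mut master2 = 42u64;
    let again: Vec<u64> = (0..100).map(|_| splitmix64(&mut master2)).collect();
    assert_eq!(seeds, again);
}
```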

Majority Voting

For prediction, all 100 trees vote. The class with the most votes wins:

confidence = winner_votes / total_trees
margin = (winner_votes - runner_up_votes) / total_trees

Both metrics are reported alongside the predicted class.
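The vote aggregation above can be sketched as follows (names are illustrative; tie-breaking is not specified in the text, so this version picks the lowest class index, and at least two classes are assumed):

```rust
// Tally tree votes; return (winner, confidence, margin).
fn aggregate(votes: &[usize], n_classes: usize) -> (usize, f32, f32) {
    let mut counts = vec![0usize; n_classes];
    for &v in votes {
        counts[v] += 1;
    }
    // Argmax with ties broken by lowest class index.
    let mut winner = 0;
    for c in 1..n_classes {
        if counts[c] > counts[winner] {
            winner = c;
        }
    }
    let mut sorted = counts.clone();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending vote counts
    let total = votes.len() as f32;
    let confidence = sorted[0] as f32 / total;
    let margin = (sorted[0] - sorted[1]) as f32 / total;
    (winner, confidence, margin)
}

fn main() {
    // 60 of 100 trees vote class 0, the other 40 vote class 1.
    let votes: Vec<usize> = std::iter::repeat(0)
        .take(60)
        .chain(std::iter::repeat(1).take(40))
        .collect();
    let (winner, conf, margin) = aggregate(&votes, 2);
    assert_eq!(winner, 0);
    assert!((conf - 0.6).abs() < 1e-6);
    assert!((margin - 0.2).abs() < 1e-6);
}
```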

Out-of-Bag (OOB) Accuracy

Each tree's bootstrap sample excludes ~37% of training data. These "out-of-bag" samples provide a free accuracy estimate:

  1. For each tree, regenerate its bootstrap using the stored seed
  2. Identify OOB samples (not in the bootstrap)
  3. Predict OOB samples using only that tree
  4. Aggregate: for each training sample, collect votes from all trees where it was OOB
  5. Compare majority vote to true label
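Step 2 reduces to a set difference; a minimal sketch (in pathotypr the bootstrap indices come from the tree's stored seed, hard-coded here for illustration):

```rust
// Samples never drawn into the bootstrap are out-of-bag for that tree.
fn oob_indices(bootstrap: &[usize], n: usize) -> Vec<usize> {
    let mut in_bag = vec![false; n];
    for &i in bootstrap {
        in_bag[i] = true;
    }
    (0..n).filter(|&i| !in_bag[i]).collect()
}

fn main() {
    // N = 6 samples; bootstrap of size 6 drawn with replacement.
    let bootstrap = [0, 0, 2, 3, 3, 5];
    assert_eq!(oob_indices(&bootstrap, 6), vec![1, 4]); // never drawn => OOB
}
```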

Why this matters: OOB accuracy is nearly unbiased and requires no held-out test set. It's always computed, even when cross-validation (CV) is used for accuracy estimation.

Feature Importance

Split-count importance: for each feature (bucket), count how many times it was used as a split across all 100 trees. Features that appear in many splits are more discriminative.

importance(feature_j) = Σ_{trees} count_of_splits_on_feature_j
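This tally can be sketched by walking each tree's node arena (TreeNode mirrors the enum shown earlier; the function name is an assumption):

```rust
use std::collections::HashMap;

// Flat-arena node layout from the Decision Tree section.
enum TreeNode {
    Split { feature: u32, threshold: f32, left: u32, right: u32 },
    Leaf { class: u32 },
}

// Split-count importance: number of splits on each feature across the forest.
fn importance(forest: &[Vec<TreeNode>]) -> HashMap<u32, usize> {
    let mut counts = HashMap::new();
    for tree in forest {
        for node in tree {
            if let TreeNode::Split { feature, .. } = node {
                *counts.entry(*feature).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let tree = vec![
        TreeNode::Split { feature: 7, threshold: 1.5, left: 1, right: 2 },
        TreeNode::Leaf { class: 0 },
        TreeNode::Split { feature: 7, threshold: 3.5, left: 3, right: 4 },
        TreeNode::Leaf { class: 1 },
        TreeNode::Leaf { class: 0 },
    ];
    let imp = importance(&[tree]);
    assert_eq!(imp[&7], 2); // bucket 7 used in two splits
}
```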

The top 500 features are exported with:

  • Their k-mer sequences (reverse-mapped from the hashing trick)
  • Genomic coordinates (sequence index + position) in the training data

Hyperparameters

| Parameter | Default | Effect |
|---|---|---|
| n_trees | 100 | More trees → smoother predictions; diminishing returns past ~100 |
| max_depth | 20 | Limits tree complexity; prevents memorization of training noise |
| min_samples_leaf | 5 | Regularization: prevents splits that isolate individual samples |
| max_features | √(n_features) | Standard RF: decorrelates trees by forcing different feature subsets |
| bootstrap_seed | 42 (master) | Ensures reproducibility; each tree derives its seed from this |

Complexity

| Operation | Time | Space |
|---|---|---|
| Train one tree | O(n_bootstrap × sqrt(F) × depth) | O(nodes) ≈ O(2^depth) |
| Train ensemble | above ÷ num_threads (rayon) | O(100 × nodes) |
| Predict one sample | O(100 × depth × log(nnz)) | O(1) |
| OOB accuracy | O(N × 100 × depth) | O(N × n_classes) |