ENH prototype histogram splitter for RandomForest benchmarks #6

Open
cakedev0 wants to merge 10 commits into main from rfopt/hist-best-splitter

cakedev0 commented May 19, 2026

Summary

This is an experimental branch for investigating whether a histogram-style splitter can close the RandomForest fit-time gap with sklearnex / oneDAL on CPU.

It adds:

  • an opt-in max_bins parameter (default None) on DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, and RandomForestRegressor, for dense input only;
  • a dense HistBestSplitter for supported criteria (gini, entropy / log_loss, squared_error);
  • per-feature binning into up to max_bins ordered bins, using exact value bins when n_unique <= max_bins and quantile-derived thresholds otherwise;
  • RandomForest-level bin-code precomputation so all newly grown trees reuse the same binned representation;
  • support for sample weights and missing values;
  • explicit errors for unsupported cases: sparse input, random splitter, multi-output, monotonic constraints, absolute_error, poisson, and friedman_mse;
  • a compact benchmark/profiling harness under benchmarks/rf_intelex/ and Markdown reports under reports/rf_intelex/.
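
The per-feature binning rule in the list above can be sketched in NumPy as follows. This is an illustration only, not the branch's Cython code: bin_feature is a hypothetical helper, and the midpoint thresholds for the exact-value case and the trailing bin reserved for missing values are assumptions that the real HistBestSplitter may not share.

```python
import numpy as np


def bin_feature(values, max_bins=255):
    """Return (codes, thresholds) for one dense feature column.

    Hypothetical helper: a sketch of "exact bins when n_unique <= max_bins,
    quantile-derived thresholds otherwise".
    """
    values = np.asarray(values, dtype=np.float64)
    missing = np.isnan(values)
    unique = np.unique(values[~missing])

    if unique.size <= max_bins:
        # Exact value bins: one bin per distinct value.
        # Assumption: thresholds sit at midpoints between neighbours.
        thresholds = (unique[:-1] + unique[1:]) / 2.0
    else:
        # Quantile-derived thresholds: at most max_bins - 1 interior cuts.
        qs = np.linspace(0.0, 1.0, max_bins + 1)[1:-1]
        thresholds = np.unique(np.quantile(values[~missing], qs))

    # code k means thresholds[k - 1] < value <= thresholds[k], so the
    # split "code <= k" is equivalent to "value <= thresholds[k]".
    codes = np.searchsorted(thresholds, values, side="left").astype(np.int32)
    # Assumption: missing values go to one extra bin after the last real bin.
    codes[missing] = thresholds.size + 1
    return codes, thresholds


codes, thresholds = bin_feature([3.0, 1.0, 2.0, 1.0, np.nan], max_bins=4)
print(codes, thresholds)
```

Because the codes are ordered, a split on a bin code maps back to an ordinary threshold on the raw feature, which is what lets the fitted tree keep the usual threshold representation.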

How We Got Here

The initial benchmark suite compared scikit-learn RandomForest fit time against sklearnex.ensemble while avoiding unsupported/fallback sklearnex configurations. The suite keeps n_estimators=20, n_jobs=1, per-fit timings under 10s, and the retained suite under 3 minutes.

Profiling and source inspection suggested two main sklearnex advantages:

  1. high-cardinality dense data benefits from a precomputed indexed/binned feature representation;
  2. low-cardinality data avoids repeated per-node sorting by accumulating split statistics over compact bins.
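
To make the second point concrete, here is a minimal sketch of a histogram split search for gini on a single binned feature. best_gini_split is a hypothetical name and the code is plain NumPy rather than the branch's splitter; it only shows why the per-node cost becomes one pass over samples plus one pass over bins, with no sort.

```python
import numpy as np


def best_gini_split(codes, y, n_bins, n_classes):
    """Best split of the form 'code <= k' for one binned feature.

    Illustrative only; assumes n_bins >= 2, integer codes in [0, n_bins)
    and integer class labels in [0, n_classes).
    """
    # One pass over the node's samples: class counts per bin.
    hist = np.zeros((n_bins, n_classes), dtype=np.float64)
    np.add.at(hist, (codes, y), 1.0)

    total = hist.sum(axis=0)
    n = total.sum()

    # One pass over bins: cumulative counts give the left child for each k.
    left = np.cumsum(hist, axis=0)[:-1]
    right = total - left
    n_left = left.sum(axis=1)
    n_right = n - n_left
    valid = (n_left > 0) & (n_right > 0)

    # n_child * gini(child) simplifies to n_child - sum(counts**2) / n_child.
    with np.errstate(divide="ignore", invalid="ignore"):
        impurity = (n_left - (left**2).sum(axis=1) / n_left) + (
            n_right - (right**2).sum(axis=1) / n_right
        )
    # Weighted average child impurity; empty children are never selected
    # unless every candidate is empty.
    impurity = np.where(valid, impurity, np.inf) / n

    k = int(np.argmin(impurity))
    return k, float(impurity[k])


codes = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 0, 0, 1, 1])
print(best_gini_split(codes, y, n_bins=3, n_classes=2))
```

Sample weights would fit the same structure by adding each sample's weight to the histogram instead of 1.0.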

The branch first explored global sorted indices, then moved to a histogram-style splitter. Important iterations included:

  • changing max_bins semantics to mean actual feature binning, similar in spirit to HistGradientBoosting, rather than only enabling exact low-cardinality bins;
  • moving bin-code computation to the RandomForest fit level to avoid recomputing codes per tree;
  • using C pointers in hot histogram loops rather than generic memoryview indexing;
  • removing dead sorted-index setup inherited from BestSplitter.init in the histogram path;
  • clearing only criterion-relevant histogram workspaces with memset.

Performance

Final retained-suite benchmark with max_bins=255, n_estimators=20, n_jobs=1, a 30s warm-up, and 10 timed repeats. var is (max - min) / median across repeats.

| Case | branch (s) | sklearnex (s) | ratio (branch / sklearnex) | branch var | sklearnex var |
|---|---:|---:|---:|---:|---:|
| clf_12f_full_deep | 0.768 | 0.576 | 1.33x | 0.043 | 0.049 |
| clf_12f_shallow_bootstrap | 0.386 | 0.304 | 1.27x | 0.075 | 0.034 |
| clf_24f_low_card | 0.747 | 0.511 | 1.46x | 0.110 | 0.163 |
| clf_96f_sqrt_leaf8 | 0.477 | 0.287 | 1.66x | 0.161 | 0.040 |
| reg_12f_full_deep | 1.772 | 1.196 | 1.48x | 0.039 | 0.118 |
| reg_12f_full_f64 | 1.154 | 0.789 | 1.46x | 0.034 | 0.047 |
| reg_12f_shallow_bootstrap | 0.390 | 0.287 | 1.36x | 0.047 | 0.028 |
| reg_1f_deep_full | 0.144 | 0.104 | 1.38x | 0.010 | 0.016 |
| reg_24f_low_card | 1.765 | 1.319 | 1.34x | 0.215 | 0.127 |
| reg_80f_sqrt_leaf8 | 0.310 | 0.165 | 1.88x | 0.149 | 0.128 |
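
For readers who want to recompute the ratio and var columns above from raw timings, the arithmetic is small enough to sketch. The function names are hypothetical and are not taken from the benchmark harness, and the repeats list is made-up data used only to exercise the formula.

```python
import statistics


def relative_spread(timings):
    """(max - min) / median over repeated fit timings, all in the same unit."""
    return (max(timings) - min(timings)) / statistics.median(timings)


def slowdown_ratio(branch_s, reference_s):
    """Branch time divided by reference time; above 1 means the branch is slower."""
    return branch_s / reference_s


# Hypothetical repeats for one case, in seconds (not taken from the reports).
repeats = [0.90, 1.00, 1.10, 0.95, 1.05]
print(f"var = {relative_spread(repeats):.3f}")

# First row of the table: 0.768 s on the branch vs 0.576 s with sklearnex.
print(f"ratio = {slowdown_ratio(0.768, 0.576):.2f}x")
```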

A follow-up run forcing every generated X to float64 led to the same conclusion: every retained case stayed under 2x the sklearnex fit time, with a worst ratio of 1.84x.

Raw benchmark outputs are committed under reports/rf_intelex/results_max_bins_255_warmup30_repeats10/ and reports/rf_intelex/results_max_bins_255_warmup30_repeats10_xfloat64/.

Validation

Local checks run during development:

  • pytest sklearn/tree/tests/test_tree.py -q
  • pytest sklearn/ensemble/tests/test_forest.py -k 'max_bins' -q
  • pre-commit run ruff-check --files benchmarks/rf_intelex/bench_rf_fit.py sklearn/ensemble/_forest.py sklearn/tree/_classes.py
  • pre-commit run ruff-format --files benchmarks/rf_intelex/bench_rf_fit.py sklearn/ensemble/_forest.py sklearn/tree/_classes.py
  • pre-commit run cython-lint --files sklearn/tree/_splitter.pyx

The commit hooks also passed when creating the final benchmark commit.

Notes / Follow-ups

This is intentionally a prototype branch, not a polished upstream-ready API proposal. Likely follow-ups:

  • compact bin-code dtype (uint8 / uint16) instead of int32;
  • more specialized histogram update kernels for regression/classification;
  • sparse workspace clearing / touched-bin lists;
  • possible auto-detection of low-cardinality features when max_bins=None;
  • broader criterion support if this direction is pursued.
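
As a rough illustration of the first follow-up, the smallest unsigned dtype that can hold every bin code can be chosen with NumPy. compact_code_dtype is a hypothetical helper, and reserving one extra code for missing values is an assumption about the layout rather than something the branch is known to do.

```python
import numpy as np


def compact_code_dtype(max_bins, missing_bin=True):
    """Smallest unsigned integer dtype able to store every bin code."""
    # Assumption: one extra code is reserved for missing values.
    n_codes = max_bins + (1 if missing_bin else 0)
    # Codes run from 0 to n_codes - 1.
    return np.min_scalar_type(n_codes - 1)


for max_bins in (255, 256, 1024):
    dtype = compact_code_dtype(max_bins)
    # Memory for a hypothetical 100_000 x 12 bin-code matrix.
    compact = np.zeros((100_000, 12), dtype=dtype).nbytes
    baseline = np.zeros((100_000, 12), dtype=np.int32).nbytes
    print(max_bins, dtype, f"{compact / baseline:.2f} of the int32 size")
```

With max_bins=255 the codes fit in uint8 even with a missing-value code, which would cut the bin-code matrix to a quarter of its int32 size.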

Branch vs main benchmark

I also reran the retained benchmark suite comparing this branch directly against main (commit ffc6cdc20b8d5eb58e38042fd90a2aeecc33dfb8). The branch uses max_bins=255; main uses the standard exact RandomForest implementation. Both runs used n_estimators=20, n_jobs=1, 30s warm-up, and 10 timed repeats. var is (max - min) / median across repeats.

| Case | branch (s) | main (s) | speedup vs main | branch var | main var |
|---|---:|---:|---:|---:|---:|
| clf_12f_full_deep | 1.083 | 5.590 | 5.16x | 0.088 | 0.035 |
| clf_12f_shallow_bootstrap | 0.549 | 2.951 | 5.37x | 0.141 | 0.092 |
| clf_24f_low_card | 1.103 | 2.465 | 2.23x | 0.178 | 0.012 |
| clf_96f_sqrt_leaf8 | 0.585 | 1.551 | 2.65x | 0.129 | 0.041 |
| reg_12f_full_deep | 1.936 | 4.754 | 2.46x | 0.238 | 0.263 |
| reg_12f_full_f64 | 1.607 | 2.943 | 1.83x | 0.108 | 0.041 |
| reg_12f_shallow_bootstrap | 0.424 | 2.116 | 4.99x | 0.181 | 0.082 |
| reg_1f_deep_full | 0.181 | 1.667 | 9.23x | 0.210 | 0.037 |
| reg_24f_low_card | 2.519 | 3.901 | 1.55x | 0.085 | 0.040 |
| reg_80f_sqrt_leaf8 | 0.427 | 1.062 | 2.49x | 0.445 | 0.125 |

This branch is faster than main on every retained case in this run, with speedups ranging from 1.55x to 9.23x. The wide/sqrt regression case (reg_80f_sqrt_leaf8) has high branch variability (var 0.445), so its 2.49x point estimate should be treated with caution.
