WIP: Add stratified split feature to model_selection.train_test_split by chauhankaranraj · Pull Request #635 · dask/dask-ml

chauhankaranraj · 2020-04-03T23:49:33Z

I took a stab at implementing a solution for issue #535

Adding a WIP label because the stratified split is currently not completely lazy for dask arrays (compute_chunk_sizes is called here). Nonetheless, I think it works fine for dask series and dataframes.

Any feedback would be appreciated :)

TomAugspurger

Can you say a bit about the high-level strategy here, and the challenges?

Stepping back, the idea behind stratify is to get approximately the same frequency of each class in the output splits as in the input? So do we absolutely require computing? At the very least, I think we'll need a full pass over the data to compute the frequencies in stratify, since the data may not be shuffled ahead of time. But can that pass be delayed until .compute time rather than when we construct the graph?

austinzh · 2020-06-18T14:57:59Z

My two cents.

1. classes can be optional, because computing the classes from an out-of-core dataset costs the same whether it is done inside or outside train_test_split.
2. If we split class by class, does that mean the returned train and test datasets are ordered by class?

TomAugspurger · 2020-06-18T15:32:21Z

I don't think we would want to order by class. Does scikit-learn do that?


austinzh · 2020-06-18T21:18:41Z

Yes. But if we look at the for ci in classes: loop, we find that we split class by class and then concatenate the pieces back together.
That implies the returned array, for example the train set, looks like
[randomized_classA, randomized_classB, randomized_classC], meaning that in this PR's implementation, rows of the same class stick together.

But if we pass the same parameters to scikit-learn's train_test_split, the output is shuffled.

For example, I ran this on the unshuffled iris.csv.
output1 is the output of sklearn's train, test = ms.train_test_split(df, test_size=0.2, random_state=0, shuffle=True, stratify=df['species']).
output2 is the output of this PR. I only print the species column.

output1:

setosa
setosa
setosa
setosa
versicolor
setosa
virginica
virginica
versicolor
virginica
virginica
versicolor
setosa
versicolor
virginica
virginica
setosa
versicolor
versicolor
setosa
virginica
setosa
setosa
virginica
virginica
versicolor
versicolor
setosa
virginica
virginica
versicolor
versicolor
setosa
virginica
virginica
versicolor
virginica
versicolor
virginica
versicolor
versicolor
versicolor
setosa
setosa
versicolor
versicolor
virginica
virginica
versicolor
setosa
virginica
virginica
setosa
setosa
versicolor
versicolor
setosa
setosa
versicolor
virginica
setosa
setosa
versicolor
versicolor
virginica
versicolor
virginica
setosa
setosa
virginica
versicolor
versicolor
setosa
setosa
virginica
versicolor
virginica
setosa
versicolor
virginica
virginica
versicolor
virginica
setosa
versicolor
setosa
setosa
virginica
virginica
versicolor
virginica
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
virginica
setosa
virginica
setosa
virginica
setosa
versicolor
versicolor
versicolor
versicolor
setosa
virginica
virginica
setosa
versicolor
versicolor
virginica
setosa
virginica
virginica
virginica

output2:

setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
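
The ordering difference between the two outputs can be reproduced without dask. The following is a minimal NumPy sketch, not the PR's code: it assumes a toy label array with three contiguous classes of 50 rows each (like an unshuffled iris.csv) and a hypothetical 20% test fraction. Splitting class by class and concatenating leaves the classes in contiguous runs, while one extra permutation over the concatenated indices interleaves them without changing the per-class counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, unshuffled labels: three classes stored contiguously.
labels = np.repeat(["setosa", "versicolor", "virginica"], 50)

train_parts = []
for cls in np.unique(labels):
    idx = np.flatnonzero(labels == cls)
    rng.shuffle(idx)                  # shuffle within this class only
    n_test = int(round(0.2 * idx.size))
    train_parts.append(idx[n_test:])  # keep the remaining rows for training

# Concatenating the per-class pieces keeps each class in one contiguous run.
train_idx = np.concatenate(train_parts)
grouped = labels[train_idx]

# A final permutation interleaves the classes; the per-class counts, and
# therefore the stratification, are unchanged.
shuffled = labels[rng.permutation(train_idx)]
```

In this sketch, grouped corresponds to the shape of output2 and shuffled to the shape of output1.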


austinzh · 2020-06-19T02:06:49Z

+                train_test_pairs.append(
+                    [dd.concat(arr_train_slices), dd.concat(arr_test_slices)]
+                )


For the output not to be ordered by class, I think we need to add some kind of shuffle here.

Suggested change

-                train_test_pairs.append(
-                    [dd.concat(arr_train_slices), dd.concat(arr_test_slices)]
-                )
+                train = dd.concat(arr_train_slices)
+                test = dd.concat(arr_test_slices)
+                train = train.shuffle(train.index)
+                test = test.shuffle(test.index)
+                # concat all train subdfs as 1 train df, same for test
+                train_test_pairs.append([train, test])

hsteinshiromoto

Hi,

I am a new guy and I had a question / suggestion to improve the code.

Best,

hsteinshiromoto · 2021-03-23T09:29:08Z


    types = set(type(arr) for arr in arrays)

+    if stratify is not None:


Quick question: why are you not using if stratify: ?
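
The question is not answered in this thread. One likely reason, offered here as an assumption and not as the author's stated rationale, is that stratify is an array-like object, and asking an array with more than one element for a single truth value is ambiguous. A minimal sketch with a NumPy array standing in for the stratify argument:

```python
import numpy as np

# Stand-in for the stratify argument (assumed to be array-like).
stratify = np.array([0, 1, 0, 1])

# `if stratify:` asks NumPy for one truth value for the whole array, which is
# ambiguous for more than one element, so NumPy raises ValueError.
try:
    if stratify:
        pass
    truthiness_ok = True
except ValueError:
    truthiness_ok = False

# `is not None` only checks whether the argument was supplied at all.
stratify_given = stratify is not None
```

A pandas Series with more than one element raises the same kind of error, so the identity check against None is the safer test for an optional array argument.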

bpsut · 2026-05-12T17:30:42Z

What is the current status of this? It's 2026 and this appears to still be open, but it would be REALLY nice to have.

chauhankaranraj · 2026-05-27T03:48:37Z

Hi dask community,

Apologies for being MIA, had a busy stretch at my day job (we were launching a new AWS product, S3 Files). Finally got some time to work on this PR this weekend with some LLM help. Would really appreciate a re-review whenever you have cycles! 🙏

> Can you say a bit about the high-level strategy here

The high-level approach is as follows:

1. Count how many rows of each class live in each block.
2. Sum these counts to get the global class distribution. Use test_size to decide how many test rows each class should contribute overall (keeping in mind sklearn's rule of at least one row of every class in both train and test).
3. For each class, split its "test rows budget" across blocks in proportion to how many rows of that class each block holds.
4. Each block picks that many rows at random per class. Slice every input array by those indices and concatenate the pieces.

Note that everything stays lazy until .compute(), with one exception: the small (n_blocks, n_classes)-shaped row-count matrix is computed up front and brought into memory, which to me seems like a fair trade-off for the split accuracy.

Please lmk what y'all think!
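
The four steps described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the code in this PR: the helper names allocate_test_rows and pick_test_rows are hypothetical, blocks are ordinary in-memory arrays (not dask chunks), every class is assumed to have at least two rows, and leftover rows are distributed with a largest-remainder rule, which is one reasonable choice that the description does not specify.

```python
import numpy as np


def allocate_test_rows(counts, test_size):
    """Split each class's test budget across blocks.

    counts: (n_blocks, n_classes) integer matrix of per-block class counts.
    Returns an integer matrix of the same shape.
    """
    counts = np.asarray(counts)
    totals = counts.sum(axis=0)
    # Per-class test budget, clipped so every class keeps at least one row
    # in both train and test (assumes each class has at least two rows).
    budget = np.clip(np.rint(test_size * totals).astype(int), 1, totals - 1)

    # Proportional share per block, rounded down.
    exact = counts * (budget / totals)
    alloc = np.floor(exact).astype(int)
    # Hand the leftover rows to the blocks with the largest fractional parts.
    for c in range(counts.shape[1]):
        leftover = budget[c] - alloc[:, c].sum()
        order = np.argsort(-(exact[:, c] - alloc[:, c]), kind="stable")
        alloc[order[:leftover], c] += 1
    return alloc


def pick_test_rows(block_labels, block_alloc, classes, rng):
    """Return a boolean test mask for one block."""
    mask = np.zeros(block_labels.size, dtype=bool)
    for c, cls in enumerate(classes):
        idx = np.flatnonzero(block_labels == cls)
        mask[rng.permutation(idx)[: block_alloc[c]]] = True
    return mask


# Usage on two toy blocks with two classes.
rng = np.random.default_rng(0)
classes = np.array([0, 1])
blocks = [np.array([0] * 6 + [1] * 2), np.array([0] * 4 + [1] * 8)]
counts = np.array([[np.sum(b == c) for c in classes] for b in blocks])
alloc = allocate_test_rows(counts, test_size=0.2)
masks = [pick_test_rows(b, alloc[i], classes, rng) for i, b in enumerate(blocks)]
```

Only the small counts matrix is needed to decide the allocation; the per-block selection depends on nothing outside its own block, which is what would let that part stay lazy in a dask setting.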

chauhankaranraj mentioned this pull request Apr 4, 2020

No support for stratified split in dask_ml.model_selection.train_test_split #535

Open

TomAugspurger reviewed Apr 6, 2020

Comment thread dask_ml/model_selection/_split.py Outdated


austinzh reviewed Jun 19, 2020

Base automatically changed from master to main February 2, 2021 03:43

hsteinshiromoto reviewed Mar 23, 2021

chauhankaranraj force-pushed the master branch from 195d074 to a5b3b35 Compare May 27, 2026 03:17

Add stratified splitting to train_test_split.

ab4168e

Squashed: - Fix linting errors.

chauhankaranraj force-pushed the master branch from a5b3b35 to ab4168e Compare May 27, 2026 03:36

chauhankaranraj wants to merge 1 commit into dask:main from chauhankaranraj:master


5 participants
