Fix stale subset metadata in SampleDataset by haoyu-haoyu · Pull Request #925 · sunlabuiuc/PyHealth

haoyu-haoyu · 2026-03-31T21:53:16Z

When SampleDataset.subset() creates a smaller view, it updates the underlying LitData region of interest but leaves patient_to_index and record_to_index copied from the parent dataset. That means downstream code can still see metadata entries that point outside the subset, which is especially confusing after splitting a dataset and then consulting those lookup tables.
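To make the failure mode concrete, here is a minimal sketch using plain lists and dicts as a stand-in. It does not use PyHealth's actual classes; the sample keys `patient_id` and `record_id` and the helper `find_out_of_range` are assumptions made for illustration only.

```python
# Toy stand-in for a parent dataset; NOT PyHealth's SampleDataset.
parent_samples = [
    {"patient_id": "p0", "record_id": "r0"},
    {"patient_id": "p0", "record_id": "r1"},
    {"patient_id": "p1", "record_id": "r2"},
    {"patient_id": "p2", "record_id": "r3"},
]
parent_patient_to_index = {"p0": [0, 1], "p1": [2], "p2": [3]}

# Take a subset holding only the last two samples.
subset_indices = [2, 3]
subset_samples = [parent_samples[i] for i in subset_indices]

# The behavior described above: the lookup table is copied from the
# parent unchanged, so it still uses parent numbering.
stale_patient_to_index = dict(parent_patient_to_index)


def find_out_of_range(index_map, n_samples):
    """Hypothetical helper: return entries whose indices fall outside the subset."""
    bad = {}
    for key, idxs in index_map.items():
        outside = [i for i in idxs if i >= n_samples]
        if outside:
            bad[key] = outside
    return bad


out_of_range = find_out_of_range(stale_patient_to_index, len(subset_samples))
```

In this toy case the entries for `p1` and `p2` point past the end of the two-sample subset, and the entry for `p0` is in range but resolves to a sample belonging to a different patient.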

This change rebuilds both lookup maps from the subset contents after slicing, for both the streaming and in-memory dataset implementations. I also added regression coverage to make sure list-based subsetting and slice-based subsetting both renumber the metadata against the subset rather than the original dataset.
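The rebuild step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: `rebuild_index_maps` is a hypothetical helper, and the samples are assumed to be dicts carrying `patient_id` and `record_id` keys.

```python
from collections import defaultdict


def rebuild_index_maps(samples):
    """Hypothetical helper: build both lookup maps from the subset's own samples.

    Indices are positions within `samples`, not positions in the parent.
    """
    patient_to_index = defaultdict(list)
    record_to_index = defaultdict(list)
    for new_idx, sample in enumerate(samples):
        patient_to_index[sample["patient_id"]].append(new_idx)
        record_to_index[sample["record_id"]].append(new_idx)
    return dict(patient_to_index), dict(record_to_index)


# Toy parent data (assumed shape, for illustration only).
parent_samples = [
    {"patient_id": "p0", "record_id": "r0"},
    {"patient_id": "p0", "record_id": "r1"},
    {"patient_id": "p1", "record_id": "r2"},
    {"patient_id": "p2", "record_id": "r3"},
]

# List-based subsetting: pick arbitrary parent positions.
list_subset = [parent_samples[i] for i in [3, 1]]
list_patient_map, list_record_map = rebuild_index_maps(list_subset)

# Slice-based subsetting: pick a contiguous range.
slice_subset = parent_samples[1:3]
slice_patient_map, slice_record_map = rebuild_index_maps(slice_subset)
```

Both paths end up with maps numbered from zero against the subset, which is the property the regression tests described above are meant to check.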

I checked the patch with git diff --check and python -m py_compile on the touched files. I also tried running tests/core/test_sample_dataset.py locally, but my environment has a pre-existing mismatch between NumPy 2.x and a compiled dependency, so the run fails before the test module finishes importing.

fix: rebuild subset index mappings in SampleDataset

6d25637

haoyu-haoyu wants to merge 1 commit into sunlabuiuc:master from haoyu-haoyu:fix/sampledataset-subset-metadata
