Fix stale subset metadata in SampleDataset #925

Draft
haoyu-haoyu wants to merge 1 commit into sunlabuiuc:master from haoyu-haoyu:fix/sampledataset-subset-metadata

Conversation

@haoyu-haoyu
Contributor

When `SampleDataset.subset()` creates a smaller view, it updates the underlying LitData region of interest but leaves `patient_to_index` and `record_to_index` as copies from the parent dataset. Downstream code can therefore still see metadata entries that point outside the subset, which is especially confusing when those lookup tables are consulted after splitting a dataset.

This change rebuilds both lookup maps from the subset contents after slicing, for both the streaming and in-memory dataset implementations. I also added regression coverage to make sure list-based subsetting and slice-based subsetting both renumber the metadata against the subset rather than the original dataset.
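To make the rebuild step concrete, here is a minimal sketch, not the code in this patch. It assumes samples are plain dicts carrying `patient_id` and `record_id` keys and that each lookup map stores lists of sample positions; the helper names `rebuild_index_maps` and `subset_samples` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple, Union

Sample = Dict[str, Any]
IndexMap = Dict[Any, List[int]]


def rebuild_index_maps(samples: Sequence[Sample]) -> Tuple[IndexMap, IndexMap]:
    """Build patient/record lookup maps numbered against `samples` itself."""
    patient_to_index: Dict[Any, List[int]] = defaultdict(list)
    record_to_index: Dict[Any, List[int]] = defaultdict(list)
    for new_idx, sample in enumerate(samples):
        # Assumed keys; samples without them are simply skipped.
        if "patient_id" in sample:
            patient_to_index[sample["patient_id"]].append(new_idx)
        if "record_id" in sample:
            record_to_index[sample["record_id"]].append(new_idx)
    return dict(patient_to_index), dict(record_to_index)


def subset_samples(
    samples: Sequence[Sample], indices: Union[slice, Sequence[int]]
) -> Tuple[List[Sample], IndexMap, IndexMap]:
    """Select samples by slice or index list, then rebuild the maps.

    The maps are derived from the selected samples rather than copied
    from the parent, so every stored position is valid in the subset.
    """
    if isinstance(indices, slice):
        selected = list(samples[indices])
    else:
        selected = [samples[i] for i in indices]
    patient_to_index, record_to_index = rebuild_index_maps(selected)
    return selected, patient_to_index, record_to_index
```

The point of rebuilding rather than filtering the parent maps is that positions are renumbered from zero, so a patient whose sample sat at parent index 3 maps to its position within the subset.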

I checked the patch with `git diff --check` and `python -m py_compile` on the touched files. I also tried running `tests/core/test_sample_dataset.py` locally, but my environment has a pre-existing mismatch between NumPy 2.x and a compiled dependency that fails before the test module can finish importing.
