Fix stale subset metadata in SampleDataset #925
Draft
haoyu-haoyu wants to merge 1 commit into sunlabuiuc:master
When SampleDataset.subset() creates a smaller view, it updates the underlying LitData region of interest but leaves patient_to_index and record_to_index copied from the parent dataset. That means downstream code can still see metadata entries that point outside the subset, which is especially confusing after splitting a dataset and then consulting those lookup tables.
This change rebuilds both lookup maps from the subset contents after slicing, for both the streaming and in-memory dataset implementations. I also added regression coverage to make sure list-based subsetting and slice-based subsetting both renumber the metadata against the subset rather than the original dataset.
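To illustrate the idea, here is a minimal sketch of rebuilding the two maps from a subset's contents. It is not the code in this patch: `rebuild_index_maps` is a hypothetical helper, and it assumes each sample is a plain dict with `patient_id` and `record_id` keys and that each map goes from an ID to a list of sample positions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def rebuild_index_maps(
    samples: Sequence[Dict[str, Any]],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Hypothetical helper: build ID -> positions maps from the samples given.

    Positions are relative to `samples` itself, so calling this on a subset
    yields indices that are valid inside that subset.
    """
    patient_to_index: Dict[str, List[int]] = defaultdict(list)
    record_to_index: Dict[str, List[int]] = defaultdict(list)
    for new_idx, sample in enumerate(samples):
        patient_id = sample.get("patient_id")  # assumed key name
        record_id = sample.get("record_id")    # assumed key name
        if patient_id is not None:
            patient_to_index[patient_id].append(new_idx)
        if record_id is not None:
            record_to_index[record_id].append(new_idx)
    return dict(patient_to_index), dict(record_to_index)


# Toy parent dataset with four samples across three patients.
parent = [
    {"patient_id": "p0", "record_id": "r0"},
    {"patient_id": "p0", "record_id": "r1"},
    {"patient_id": "p1", "record_id": "r2"},
    {"patient_id": "p2", "record_id": "r3"},
]
parent_patient_map, _ = rebuild_index_maps(parent)

# List-based subset: keep parent positions 1 and 3.
list_subset = [parent[i] for i in [1, 3]]
list_patient_map, list_record_map = rebuild_index_maps(list_subset)

# Slice-based subset: keep parent positions 1 and 2.
slice_subset = parent[1:3]
slice_patient_map, slice_record_map = rebuild_index_maps(slice_subset)

# Reusing the parent's map would be stale: it still lists p1 and points
# p2 at position 3, which does not exist in a two-sample subset.
stale_positions = [i for idxs in parent_patient_map.values() for i in idxs]
```

In this toy example the copied parent map contains positions up to 3, while the rebuilt maps only contain positions 0 and 1 and only the IDs that are actually present in each subset.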
I checked the patch with git diff --check and python -m py_compile on the touched files. I also tried running tests/core/test_sample_dataset.py locally, but my environment has a pre-existing mismatch between NumPy 2.x and a compiled dependency, so the run fails before the test module can finish importing.