fix(haps): gate dosage by var_fields + lazy info loading (#191)#193
Open
d-laub wants to merge 12 commits into
Open
fix(haps): gate dosage by var_fields + lazy info loading (#191)#193d-laub wants to merge 12 commits into
d-laub wants to merge 12 commits into
Conversation
Two-phase fix: (1) gate dosage output by var_fields to fix the broadcast_records crash; (2) plumb var_fields into Haps.from_path + _Variants.from_table so loading honors what the user requested and available_var_fields reflects the file schema, not just what loaded. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Dosage was unconditionally added to RaggedVariants whenever dosages.npy existed on disk, causing ak.broadcast_records errors downstream when a consumer expected a schema without dosage. Gate the field on 'dosage' in self.var_fields, matching how ref/ilen/info fields work. Also list 'dosage' in available_var_fields when the file is present. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_query.py:110 calls `o.squeeze(0)` positionally on each reconstructed
output, but RaggedVariants.squeeze was kwargs-only — broke ds[idx, sample]
on any variants-output dataset. Latent since PR6's _query.py extraction;
no existing test combined indexing with with_seqs("variants").
Discovered while validating the dosage fix for #191.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #191.
Two related changes:
Correctness: Dosage field was unconditionally added to
RaggedVariantswheneverdosages.npyexisted on disk, regardless of the user'svar_fields. This causedak.broadcast_recordscrashes downstream when consumers expected a schema without dosage. Now gated on'dosage' in self.var_fields.Lazy loading:
Dataset.openaccepts a newvar_fieldsparameter. BothDataset.open(var_fields=...)andDataset.with_settings(var_fields=...)now honor the user's request — non-default info columns and the dosages memmap are only loaded when asked for.available_var_fieldsis computed from a schema peek so it reflects what the file could provide, not what was actually loaded.Includes a small companion fix to
RaggedVariants.squeeze(accepts a positional axis arg) — latent since PR6's_query.pyextraction and necessary fords[idx, sample]to work on variants-output datasets.Test plan
pixi run -e dev test(pytest + cargo)pixi run -e dev ruff check python/cleanpixi run -e dev typecheck(pyrefly) — baseline preservedtests/dataset/test_issue_191_var_fields.py(15 tests)🤖 Generated with Claude Code