
fix(haps): gate dosage by var_fields + lazy info loading (#191) #193

Open. d-laub wants to merge 12 commits into main from fix/issue-191-var-fields-loading

Conversation

d-laub (Collaborator) commented May 24, 2026

Summary

Closes #191.

Two related changes:

  1. Correctness: The dosage field was unconditionally added to RaggedVariants whenever dosages.npy existed on disk, regardless of the user's var_fields. This caused ak.broadcast_records crashes downstream when consumers expected a schema without dosage. The field is now gated on 'dosage' in self.var_fields.

  2. Lazy loading: Dataset.open accepts a new var_fields parameter. Both Dataset.open(var_fields=...) and Dataset.with_settings(var_fields=...) now honor the user's request — non-default info columns and the dosages memmap are only loaded when asked for. available_var_fields is computed from a schema peek so it reflects what the file could provide, not what was actually loaded.
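The gating in item 1 can be sketched roughly as follows. This is a minimal illustration, not the project's code: `select_fields` and the toy `on_disk` dictionary are hypothetical names, and plain lists stand in for the real arrays.

```python
from typing import Dict, Sequence


def select_fields(on_disk: Dict[str, list], var_fields: Sequence[str]) -> Dict[str, list]:
    """Keep only the fields the caller asked for (hypothetical helper)."""
    requested = set(var_fields)
    # Presence on disk is necessary but not sufficient:
    # a field is emitted only if it was also requested.
    return {name: values for name, values in on_disk.items() if name in requested}


# Toy data standing in for what exists on disk, including a dosage array.
on_disk = {"alt": ["A", "T"], "ilen": [0, -1], "dosage": [0.9, 1.7]}

# Dosage exists on disk but is not requested, so it is left out of the output.
without_dosage = select_fields(on_disk, ["alt", "ilen"])

# Dosage is requested explicitly, so it is included.
with_dosage = select_fields(on_disk, ["alt", "dosage"])
```

With this shape, a consumer that expects a schema without dosage never sees the extra field unless it opts in.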

Includes a small companion fix to RaggedVariants.squeeze (accepts a positional axis arg) — latent since PR6's _query.py extraction and necessary for ds[idx, sample] to work on variants-output datasets.
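To illustrate why the signature matters, here is a small sketch using toy classes. `KeywordOnlySqueeze` and `PositionalSqueeze` are hypothetical stand-ins, not the real RaggedVariants, and the squeeze logic is deliberately simplified to axis 0.

```python
from typing import List, Optional


class KeywordOnlySqueeze:
    """Toy container whose squeeze only accepts axis as a keyword."""

    def __init__(self, rows: List[list]):
        self.rows = rows

    def squeeze(self, *, axis: Optional[int] = None):
        # Simplified: drop the leading axis when it has length 1.
        return self.rows[0] if len(self.rows) == 1 else self.rows


class PositionalSqueeze:
    """Toy container whose squeeze accepts axis positionally or by keyword."""

    def __init__(self, rows: List[list]):
        self.rows = rows

    def squeeze(self, axis: Optional[int] = None):
        # Simplified: drop the leading axis when it has length 1.
        return self.rows[0] if len(self.rows) == 1 else self.rows


def call_positionally(obj):
    """Mimic a caller that passes the axis positionally, as in o.squeeze(0)."""
    try:
        return obj.squeeze(0)
    except TypeError:
        return "TypeError"


keyword_only_result = call_positionally(KeywordOnlySqueeze([[1, 2, 3]]))
positional_result = call_positionally(PositionalSqueeze([[1, 2, 3]]))
```

The keyword-only variant rejects the positional call with a TypeError, while the positional variant returns the squeezed row.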

Test plan

  • pixi run -e dev test (pytest + cargo)
  • pixi run -e dev ruff check python/ clean
  • pixi run -e dev typecheck (pyrefly) — baseline preserved
  • New regression tests in tests/dataset/test_issue_191_var_fields.py (15 tests)

🤖 Generated with Claude Code

d-laub and others added 12 commits May 24, 2026 00:02
Two-phase fix: (1) gate dosage output by var_fields to fix the
broadcast_records crash; (2) plumb var_fields into Haps.from_path
+ _Variants.from_table so loading honors what the user requested and
available_var_fields reflects the file schema, not just what loaded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
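A minimal sketch of the loading side described in this commit, under stated assumptions: `LazyFields`, `make_loader`, and the field name `AF` are hypothetical, and zero-argument callables stand in for reading columns or memory-mapping files. The idea shown is that the list of available fields comes from the field names alone, while only requested fields are actually loaded.

```python
from typing import Callable, Dict, List, Sequence


class LazyFields:
    """Load only the requested fields; report availability from names alone."""

    def __init__(self, loaders: Dict[str, Callable[[], list]], var_fields: Sequence[str]):
        # "Schema peek": record which fields the file could provide
        # without invoking any loader.
        self.available_var_fields: List[str] = list(loaders)
        missing = [name for name in var_fields if name not in loaders]
        if missing:
            raise KeyError(f"fields not present in the file: {missing}")
        # Only the requested fields are materialized.
        self.loaded = {name: loaders[name]() for name in var_fields}


load_log: List[str] = []


def make_loader(name: str, values: list) -> Callable[[], list]:
    """Build a loader that records when it is actually called."""

    def load() -> list:
        load_log.append(name)
        return values

    return load


loaders = {
    "alt": make_loader("alt", ["A", "T"]),
    "AF": make_loader("AF", [0.1, 0.4]),
    "dosage": make_loader("dosage", [0.9, 1.7]),
}

fields = LazyFields(loaders, var_fields=["alt"])
```

After construction, `load_log` contains only `"alt"`, while `fields.available_var_fields` still lists all three names.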
Dosage was unconditionally added to RaggedVariants whenever dosages.npy
existed on disk, causing ak.broadcast_records errors downstream when a
consumer expected a schema without dosage. Gate the field on
'dosage' in self.var_fields, matching how ref/ilen/info fields work.

Also list 'dosage' in available_var_fields when the file is present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_query.py:110 calls `o.squeeze(0)` positionally on each reconstructed
output, but RaggedVariants.squeeze was kwargs-only — broke ds[idx, sample]
on any variants-output dataset. Latent since PR6's _query.py extraction;
no existing test combined indexing with with_seqs("variants").

Discovered while validating the dosage fix for #191.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Dosage field attached to Variants source even when not requested (regression?)
