mcvickerlab · d-laub · May 24, 2026 · May 24, 2026 · May 24, 2026 · May 24, 2026
diff --git a/docs/superpowers/plans/2026-05-24-issue-191-var-fields-loading.md b/docs/superpowers/plans/2026-05-24-issue-191-var-fields-loading.md
diff --git a/docs/superpowers/specs/2026-05-24-issue-191-var-fields-loading-design.md b/docs/superpowers/specs/2026-05-24-issue-191-var-fields-loading-design.md
@@ -0,0 +1,170 @@
+# Issue #191 — var_fields loading + dosage correctness fix
+
+**Status:** Spec ready
+**Date:** 2026-05-24
+**Issue:** [#191](https://github.com/mcvickerlab/GenVarLoader/issues/191) — Dosage field attached to Variants source even when not requested
+**Scope:** `python/genvarloader/_dataset/_haps.py`, no public API changes
+
+## Problem
+
+Opening a `Dataset` with a `RaggedVariants` output and a SVAR backend that has dosages causes a downstream `ak.broadcast_records` crash:
+
+```
+ValueError: cannot broadcast records because fields don't match:
+    alt, ref, start
+    alt, dosage, ref, start
+```
+
+Root cause: in `Haps._get_variants`, `dosage` is added to the output record unconditionally whenever `dosages.npy` exists on disk — independent of the user's `var_fields`. Every other field (`ref`, `ilen`, info columns) is correctly gated by `var_fields`; dosage is the lone exception.
+
+A secondary problem: `_Variants.from_table` eagerly loads every numeric column from the variants table into `self.info` via `to_numpy()`, regardless of what the user asked for. Not a correctness bug, but wasteful on large SVAR files where info columns can be GB-scale.
+
+## Goals
+
+1. **Correctness:** the output `RaggedVariants` record schema matches the user's `var_fields` exactly. No phantom dosage field.
+2. **Discoverability:** `Dataset.available_var_fields` lists every field the on-disk dataset *could* provide (including `dosage` when present), not just what was loaded.
+3. **Lazy loading:** info columns and dosages are only loaded when `var_fields` requests them. Schema discovery does not require loading data.
+4. **`with_settings(var_fields=...)` honors lazy loading:** expanding the field set after open lazily loads the newly requested columns, not the full set.
+
+## Non-goals
+
+- No public API additions (no new methods, no new args beyond what already exists on `Dataset.open` and `Dataset.with_settings`).
+- No `__all__` change.
+- No on-disk format change.
+- No rename of `available_var_fields` / `active_var_fields`.
+- Default `var_fields` stays `["alt", "ilen", "start"]` — the minimum useful set.
+
+## Design
+
+### Phase 1 — correctness (the visible bug)
+
+In `_dataset/_haps.py`:
+
+1. **Gate dosage output by `var_fields`.** In `_get_variants` (~L597):
+   ```python
+   if self.dosages is not None and "dosage" in self.var_fields:
+       ...
+       fields["dosage"] = Ragged(ak.to_packed(dosages))
+   ```
+
+2. **List dosage as available.** In `Haps.__post_init__`, when `self.dosages is not None`, add `"dosage"` to `available_var_fields`.
+
+That's the minimum to fix the user's crash.
+
+### Phase 2 — lazy loading
+
+In `_dataset/_haps.py`:
+
+1. **`_Variants.from_table` accepts an `info_fields` filter:**
+   ```python
+   @classmethod
+   def from_table(
+       cls,
+       path: str | Path,
+       one_based: bool = True,
+       info_fields: set[str] | None = None,  # None = load all (back-compat)
+   ): ...
+   ```
+   When `info_fields` is set, only those numeric columns are `.to_numpy()`'d into `self.info`. POS/ILEN/ALT always load; REF loads if present in schema.
+
+2. **Schema-peek helper:**
+   ```python
+   @staticmethod
+   def available_info_fields(path: str | Path) -> list[str]:
+       """Return numeric column names without loading any data."""
+       schema = pl.scan_ipc(path).collect_schema()
+       return [k for k, v in schema.items()
+               if v.is_numeric() and k not in {"POS", "ILEN"}]
+   ```
+   Used by `Haps` to compute `available_var_fields` from the file's schema rather than from `self.variants.info.keys()`.
+
+3. **Lazy info loading on extension:**
+   ```python
+   def load_info(self, fields: Iterable[str]) -> None:
+       """Add missing numeric fields to self.info by re-reading from path."""
+       missing = [f for f in fields if f not in self.info]
+       if not missing:
+           return
+       # re-read just the missing columns
+       df = pl.read_ipc(self.path, columns=missing, memory_map=False)
+       for f in missing:
+           self.info[f] = df[f].to_numpy()
+   ```
+   Called by `with_settings` when the user expands `var_fields`.
+
+4. **`Haps.available_var_fields`** computed from schema peek + reference presence + dosage presence — not from `self.variants.info.keys()`. Done in `__post_init__` with an extra path argument or by re-peeking inside.
+
+5. **`Haps.from_path(..., var_fields=None)`** new parameter:
+   - When provided, filters which info columns load and gates the dosages memmap on `"dosage" in var_fields`.
+   - When `None`, current behavior: load nothing extra (`var_fields` defaults to the minimum set `["alt", "ilen", "start"]`).
+   - Important: `var_fields` is propagated to both branches of `from_path` (SVAR and legacy `variants.arrow`).
+
+6. **`Dataset.open` plumbs `var_fields` to `Haps.from_path`** rather than relying on the post-construction `replace(haps, var_fields=...)`. The schema-peek `available_var_fields` is computed inside `Haps.__post_init__` so validation in `_impl.py:288` continues to work — but now reflects what the file truly offers, not what got loaded.
+
+7. **`Dataset.with_settings(var_fields=...)`:** if the new set adds fields not yet loaded, call `haps.variants.load_info(new_fields)` and, if `"dosage"` was added, lazily memmap dosages. Then `replace(haps, var_fields=new_var_fields)` as today.
+
+### Default-field semantics
+
+- `Dataset.open()` without `var_fields=` → `["alt", "ilen", "start"]` (minimum useful set). No dosage. No optional info columns. No ref unless the user asks.
+- `Dataset.available_var_fields` lists everything the file could provide.
+- `Dataset.active_var_fields` lists what is currently configured.
+
+This matches the current implicit contract — only the *loading* changes from eager to lazy.
+
+## File-by-file changes
+
+- `python/genvarloader/_dataset/_haps.py`
+  - `_Variants.from_table`: add `info_fields` parameter.
+  - `_Variants`: add `available_info_fields` staticmethod and `load_info` method.
+  - `Haps.__post_init__`: compute `available_var_fields` from schema peek + dosage presence.
+  - `Haps.from_path`: add `var_fields` parameter; plumb to `from_table` and dosage memmap.
+  - `Haps._get_variants`: gate `fields["dosage"]` on `"dosage" in self.var_fields`.
+
+- `python/genvarloader/_dataset/_impl.py`
+  - `Dataset.open`: pass `var_fields` (the user's, or the default) into `Haps.from_path` rather than handling via post-construction `replace`.
+  - `Dataset.with_settings`: when `var_fields` expands the active set, call `load_info` (and memmap dosages if needed) before `replace`.
+
+- `python/genvarloader/_dataset/_open.py`
+  - Update the `OpenRequest` stage that builds `Haps` to forward `var_fields` to `from_path`.
+
+## Testing
+
+In `tests/dataset/`:
+
+1. **Dosage gating (the bug):**
+   - Open a dataset with SVAR-dosages. Set `with_settings(var_fields=["alt", "ref", "start"])`. Fetch a batch. Assert `"dosage"` not in `batch.fields`.
+   - Same dataset, `with_settings(var_fields=["alt", "ref", "start", "dosage"])`. Assert `"dosage"` is in `batch.fields`.
+
+2. **`available_var_fields` discovery:**
+   - SVAR-with-dosages: assert `"dosage"` in `dataset.available_var_fields`.
+   - SVAR-without-dosages: assert `"dosage"` not in `dataset.available_var_fields`.
+
+3. **Lazy loading:**
+   - Open a dataset, assert `dataset._seqs.variants.info` does not contain numeric columns that weren't in `var_fields`. (Internal-state test; OK because the file is internal.)
+   - `with_settings(var_fields=[..., "AF"])` on a dataset that didn't initially request AF — assert `"AF"` is now in `dataset._seqs.variants.info`.
+
+4. **Default-field shape:**
+   - `Dataset.open(...)` without `var_fields` — assert `dataset.active_var_fields == ["alt", "ilen", "start"]`.
+
+5. **Existing test suite stays green** (parity check; the bulk of dataset tests exercises every code path).
+
+## Risk
+
+- Low for Phase 1 (3-line correctness fix + 1-line discoverability fix).
+- Medium for Phase 2 (touches `_Variants`, `Haps.from_path`, `Dataset.open`, `Dataset.with_settings`). Mitigated by: (a) `info_fields=None` keeps legacy behavior for any caller that doesn't pass it; (b) schema-peek is read-only and trivially testable; (c) the existing test suite exercises the load paths.
+
+## Verification
+
+Per project conventions (CLAUDE.md):
+
+- `pixi run -e dev test` — pytest + cargo green.
+- `pixi run -e dev ruff check python/` clean.
+- `pixi run -e dev typecheck` (pyrefly) — baseline preserved, no new errors.
+- `skills/genvarloader/SKILL.md` review: this PR does NOT change public API surface, so no skill update needed. (Confirm during implementation.)
+
+## Out of scope (captured for later)
+
+- Renaming `available_var_fields` / `active_var_fields`.
+- Unified "available_fields" property covering both variants and tracks.
+- Lazy loading for tracks data (separate concern).
+- Caching schema peeks across multiple `Dataset.open` calls on the same path.
diff --git a/python/genvarloader/_dataset/_haps.py b/python/genvarloader/_dataset/_haps.py
@@ -93,17 +93,24 @@ class _Variants:
     info: dict[str, NDArray[np.number]]
 
     @classmethod
-    def from_table(cls, path: str | Path, one_based: bool = True):
+    def from_table(
+        cls,
+        path: str | Path,
+        one_based: bool = True,
+        info_fields: set[str] | None = None,
+    ):
         """
         Loads variant info from a table. Must always have POS, ILEN, and ALT.
-        Any numeric columns will be loaded as info.
 
         Parameters
         ----------
         path : str | Path
-            The path to the variants table or a polars DataFrame.
+            The path to the variants table.
         one_based : bool, optional
             Whether the variants are one-based, by default False.
+        info_fields
+            Optional whitelist of numeric column names to load as info.
+            If ``None`` (default), load every numeric column except POS/ILEN.
         """
         path = Path(path).resolve()
         variants = pl.read_ipc(path, memory_map=False)
@@ -126,7 +133,9 @@ def from_table(cls, path: str | Path, one_based: bool = True):
         info = {
             k: variants[k].to_numpy()
             for k, v in variants.schema.items()
-            if v.is_numeric() and k not in {"POS", "ILEN"}
+            if v.is_numeric()
+            and k not in {"POS", "ILEN"}
+            and (info_fields is None or k in info_fields)
         }
 
         ref = (
@@ -147,6 +156,32 @@ def from_table(cls, path: str | Path, one_based: bool = True):
     def __len__(self) -> int:
         return len(self.start)
 
+    @staticmethod
+    def available_info_fields(path: str | Path) -> list[str]:
+        """Return numeric column names that would be loaded as info, without
+        materializing any data.
+
+        ``POS`` and ``ILEN`` are excluded — they're positional, not info.
+        """
+        schema = pl.scan_ipc(path).collect_schema()
+        return [
+            k for k, v in schema.items() if v.is_numeric() and k not in {"POS", "ILEN"}
+        ]
+
+    def load_info(self, fields) -> None:
+        """Lazily load additional numeric info columns from ``self.path``.
+
+        Fields already present in ``self.info`` are skipped. Unknown numeric
+        columns silently no-op (the caller should validate against
+        :meth:`available_info_fields` first).
+        """
+        missing = [f for f in fields if f not in self.info]
+        if not missing:
+            return
+        df = pl.read_ipc(self.path, columns=missing, memory_map=False)
+        for f in missing:
+            self.info[f] = df[f].to_numpy()
+
 
 _H = TypeVar("_H", RaggedSeqs, RaggedAnnotatedHaps, RaggedVariants)
 _NewH = TypeVar("_NewH", RaggedSeqs, RaggedAnnotatedHaps, RaggedVariants)
@@ -176,20 +211,48 @@ class Haps(Reconstructor[_H]):
 
     def __post_init__(self):
         self.n_variants = ak.num(self.genotypes, -1).to_numpy()
+
+        # Discover available info fields from the on-disk schema, not from the
+        # (possibly-filtered) loaded info dict. This way the user can see every
+        # field they could request, even if only a subset was loaded. Fall back
+        # to whatever was loaded if the variants path isn't a readable file
+        # (e.g. synthetic in-memory _Variants used by the dummy dataset).
+        if self.variants.path.is_file():
+            schema_info_fields = _Variants.available_info_fields(self.variants.path)
+        else:
+            schema_info_fields = list(self.variants.info.keys())
+        has_dosage_file = self._has_dosage_file_on_disk()
+
         self.available_var_fields = (
             ["alt", "ilen", "start"]
-            + list(self.variants.info.keys())
+            + schema_info_fields
             + (["ref"] if self.variants.ref is not None else [])
+            + (["dosage"] if has_dosage_file else [])
         )
 
         if (
             self.min_af is not None or self.max_af is not None
-        ) and "AF" not in self.variants.info:
+        ) and "AF" not in schema_info_fields:
             raise RuntimeError(
                 "Either this dataset is not backed by an SVAR file, or the SVAR file has not had AFs cached yet."
                 + "Doing this automatically is not yet supported."
             )
 
+    def _has_dosage_file_on_disk(self) -> bool:
+        """True iff the linked SVAR contains a dosages.npy.
+
+        Returns False for non-SVAR datasets (no dosage path).
+        """
+        # If we already loaded dosages, we definitely had the file.
+        if self.dosages is not None:
+            return True
+        # Otherwise inspect the SVAR directory next to the variants table.
+        # _Variants.path is set to <svar_dir>/index.arrow for SVAR datasets,
+        # or <gvl>/genotypes/variants.arrow for legacy. We treat "next-to
+        # variants table" as "is dosage possible here".
+        candidate = self.variants.path.parent / "dosages.npy"
+        return candidate.exists()
+
     @classmethod
     def from_path(
         cls: type[Haps[RaggedVariants]],
@@ -204,7 +267,17 @@ def from_path(
         min_af: float | None = None,
         max_af: float | None = None,
         filter: Literal["exonic"] | None = None,
+        var_fields: list[str] | None = None,
     ) -> Haps[RaggedVariants]:
+        # Default var_fields for loading. var_fields=None means "use the default
+        # set" — we resolve it here so we know exactly which info columns to load.
+        if var_fields is None:
+            var_fields = ["alt", "ilen", "start"]
+        # Which numeric info columns to eagerly load: those in var_fields that
+        # aren't built-ins. (alt/ilen/start/ref/dosage are handled separately.)
+        builtin = {"alt", "ilen", "start", "ref", "dosage"}
+        info_fields = {f for f in var_fields if f not in builtin}
+
         svar_meta_path = path / "genotypes" / "svar_meta.json"
         dosages = None
 
@@ -254,20 +327,23 @@ def from_path(
             rag_shape = (*shape[1:], None)
             genotypes = Ragged.from_offsets(v_idxs, rag_shape, offsets.reshape(2, -1))
 
-            if dosage_path.exists():
-                dosages = np.memmap(dosage_path, dtype=DOSAGE_TYPE, mode="r")
+            if "dosage" in var_fields and dosage_path.exists():
+                dosages_mm = np.memmap(dosage_path, dtype=DOSAGE_TYPE, mode="r")
                 dosages = Ragged.from_offsets(
-                    dosages, rag_shape, offsets.reshape(2, -1)
+                    dosages_mm, rag_shape, offsets.reshape(2, -1)
                 )
 
             logger.info("Loading variant data.")
-            variants = _Variants.from_table(svar_path / "index.arrow")
+            variants = _Variants.from_table(
+                svar_path / "index.arrow", info_fields=info_fields
+            )
         else:
             logger.info("Loading variant data.")
             variants = _Variants.from_table(
                 path / "genotypes" / "variants.arrow",
                 one_based=version is not None
                 and version >= SemanticVersion.parse("0.18.0"),
+                info_fields=info_fields,
             )
             v_idxs = np.memmap(
                 path / "genotypes" / "variant_idxs.npy",
@@ -290,6 +366,7 @@ def from_path(
             filter=filter,
             min_af=min_af,
             max_af=max_af,
+            var_fields=var_fields,
         )
 
     def _haplotype_ilens(
@@ -594,7 +671,7 @@ def _get_variants(
                 self.variants.ilen[v_idxs], genos.shape, genos.offsets
             )
 
-        if self.dosages is not None:
+        if self.dosages is not None and "dosage" in self.var_fields:
             # guaranteed to have same shape as genotypes but need to make it contiguous/copy the data
             dosages = self.dosages[r, s]
             if _keep is not None: