Dosage field attached to Variants source even when not requested (regression?) #191

@d-laub

Symptom

When running an autoresearch experiment in gvf-germ-som that constructs a gvl.Dataset with reference + haplotype + annotated variant sources (no dosage requested), the data loader crashes on the first batch with:

[autores] crashed: ValueError: cannot broadcast records because fields don't match:
    alt, ref, start
    alt, dosage, ref, start

Repro context (5-min smoke from gvf-germ-som@main autoresearch harness):

✨ Pixi task (autores-train in cu126): python -m gvf_germ_som.autoresearch
INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO: [rank: 0] Seed set to 42
2026-05-23 23:01:30 | INFO | genvarloader._dataset._reconstruct:from_path:304 - Loading variant data.
2026-05-23 23:01:30 | INFO | genvarloader._dataset._impl:open:335 - Opened dataset:
Unspliced GVL dataset at /carter/users/dlaub/projects/gvf-germ-som/data/gvl/F2097152_W16777216.gvl
Is subset: False
# of regions: 213
# of samples: 699
Output length: ragged
Jitter: 0 (max: 0)
Deterministic: True
Sequence type: reference [haplotypes] annotated variants
Active tracks: cnv
Tracks available: cnv

<...attrs warning omitted...>
0rows [00:00, ?rows/s]
1047rows [00:00, 283647.87rows/s]
[autores] crashed: ValueError: cannot broadcast records because fields don't match:
    alt, ref, start
    alt, dosage, ref, start

Hypothesis

We previously believed GVL was attaching the dosage field to Variants unconditionally (i.e. not respecting whatever was passed to .open / .with_settings), and that a PR fixing this had landed. The smoke crash above suggests that the fix either never landed, regressed, or doesn't cover this code path (haplotype + annotated variants combined).
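For illustration only, here is a minimal sketch of the behavior we expect: optional fields such as dosage are attached only when explicitly requested. This is not GVL's implementation. `CORE_FIELDS`, `select_fields`, and the dict-of-lists record layout are hypothetical stand-ins, with the field names taken from the error message above.

```python
# Hypothetical sketch, NOT GVL code: a record is modeled as a plain dict of
# field name -> list of values. Field names come from the error message.
CORE_FIELDS = ("alt", "ref", "start")


def select_fields(record, requested=()):
    """Keep the core fields plus only the optional fields that were requested."""
    keep = set(CORE_FIELDS) | set(requested)
    return {name: values for name, values in record.items() if name in keep}


# A record that carries dosage even though the caller never asked for it.
raw = {"alt": ["A"], "ref": ["G"], "start": [100], "dosage": [0.5]}

default_view = select_fields(raw)                        # dosage dropped
dosage_view = select_fields(raw, requested=["dosage"])   # dosage kept
```

If every Variants source went through a filter like this, all of them would expose the same fields for a given set of requested options.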

The two records in the broadcast error ({alt, ref, start} vs {alt, dosage, ref, start}) look like two Variants sources that disagree on schema (one with dosage, one without), and GVL is trying to broadcast them together.
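A small diagnostic along these lines could confirm which source carries the extra field before the broadcast happens. It is a sketch under assumptions: `field_mismatches` is a hypothetical helper, records are modeled as plain dicts, and `source_a` / `source_b` are placeholders, since we don't yet know which of the real sources has dosage.

```python
# Hypothetical diagnostic, NOT GVL code: report, per source, the fields that
# are not shared by every other source.
def field_mismatches(sources):
    """Map each source name to its sorted list of fields missing elsewhere."""
    if not sources:
        return {}
    field_sets = {name: set(record) for name, record in sources.items()}
    common = set.intersection(*field_sets.values())
    return {
        name: sorted(fields - common)
        for name, fields in field_sets.items()
        if fields - common
    }


# Placeholder records mirroring the two schemas in the error message.
sources = {
    "source_a": {"alt": ["A"], "ref": ["G"], "start": [100]},
    "source_b": {"alt": ["A"], "ref": ["G"], "start": [100], "dosage": [0.5]},
}

mismatches = field_mismatches(sources)
```

With these placeholder inputs, `mismatches` names only `source_b`, with `dosage` as the unshared field. Running the same check on the real sources just before they are combined would show whether the hypothesis holds.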

Notes

  • Filing now to track; @d-laub will dig in after the in-flight gvl internal refactor lands.
  • Reproducible by running pixi r -e cu126 autores-train on gvf-germ-som@main (or any branch downstream of autoresearch/may23 that doesn't include the submodule bumps in commits b18bb4b, f476431, 79a5892).
