Symptom
When running an autoresearch experiment in gvf-germ-som that constructs a gvl.Dataset with reference + haplotype + annotated variant sources (no dosage requested), the data loader crashes on the first batch with:
[autores] crashed: ValueError: cannot broadcast records because fields don't match:
alt, ref, start
alt, dosage, ref, start
Repro context (5-min smoke from gvf-germ-som@main autoresearch harness):
✨ Pixi task (autores-train in cu126): python -m gvf_germ_som.autoresearch
INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO: [rank: 0] Seed set to 42
2026-05-23 23:01:30 | INFO | genvarloader._dataset._reconstruct:from_path:304 - Loading variant data.
2026-05-23 23:01:30 | INFO | genvarloader._dataset._impl:open:335 - Opened dataset:
Unspliced GVL dataset at /carter/users/dlaub/projects/gvf-germ-som/data/gvl/F2097152_W16777216.gvl
Is subset: False
# of regions: 213
# of samples: 699
Output length: ragged
Jitter: 0 (max: 0)
Deterministic: True
Sequence type: reference [haplotypes] annotated variants
Active tracks: cnv
Tracks available: cnv
<...attrs warning omitted...>
0rows [00:00, ?rows/s]
1047rows [00:00, 283647.87rows/s]
[autores] crashed: ValueError: cannot broadcast records because fields don't match:
alt, ref, start
alt, dosage, ref, start
Hypothesis
We believed we had already established that GVL was attaching the dosage field to Variants unconditionally (i.e. ignoring whatever was passed to .open / .with_settings) and that a PR fixing this had landed. The smoke crash above suggests that the fix either never landed, has regressed, or does not cover this code path (haplotype + annotated variants combined).
The two field lists in the broadcast error ({alt, ref, start} vs {alt, dosage, ref, start}) look like two Variants sources that disagree on schema, one with dosage and one without, which GVL then tries to broadcast together.
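To make the suspected mismatch concrete, here is a minimal sketch that diffs the two field lists from the error message and shows the schemas lining up once the extra field is dropped. Plain dicts stand in for the record arrays, and source_a, source_b, field_diff, and drop_extra_fields are hypothetical names for illustration, not GVL API. Which source actually carries dosage is not known yet. If the underlying records turn out to be Awkward Arrays (the error wording suggests so, but that is an assumption), ak.fields should give the equivalent field lists.

```python
def field_diff(left_fields, right_fields):
    """Return (only_in_left, only_in_right) as sorted lists of field names."""
    left, right = set(left_fields), set(right_fields)
    return sorted(left - right), sorted(right - left)


def drop_extra_fields(record, keep):
    """Return a copy of `record` restricted to the field names in `keep`."""
    return {name: values for name, values in record.items() if name in keep}


# Hypothetical stand-ins for the two Variants sources; the field names are
# the ones printed in the error message, the values are made up.
source_a = {"alt": ["A"], "ref": ["G"], "start": [100]}
source_b = {"alt": ["A"], "dosage": [1.0], "ref": ["G"], "start": [100]}

only_a, only_b = field_diff(source_a, source_b)
print(only_a, only_b)  # [] ['dosage']

# Dropping the unrequested field makes the two schemas agree.
harmonized = drop_extra_fields(source_b, keep=set(source_a))
print(field_diff(source_a, harmonized))  # ([], [])
```

If the hypothesis holds, logging a diff like this at the point where the two sources are combined should show which code path still attaches dosage.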
Notes
- Filing now to track; @d-laub will dig in after the in-flight gvl internal refactor lands.
- Reproducible by running pixi r -e cu126 autores-train on gvf-germ-som@main (or any branch downstream of autoresearch/may23 that doesn't include the submodule bumps in commits b18bb4b, f476431, 79a5892).