Skip to content

refactor(dataset): split _reconstruct + extract _query via QueryView (PR6)#190

Merged
d-laub merged 2 commits into
mainfrom
refactor/pr6-file-splits
May 24, 2026
Merged

refactor(dataset): split _reconstruct + extract _query via QueryView (PR6)#190
d-laub merged 2 commits into
mainfrom
refactor/pr6-file-splits

Conversation

@d-laub
Copy link
Copy Markdown
Collaborator

@d-laub d-laub commented May 24, 2026

Summary

Big architectural split. Decomposes the heavyweight modules along the seams that the campaign's earlier PRs (factory in PR1, OpenRequest in PR4, ReconstructionRequest in PR5) made obvious.

Reconstructor splits:

File Lines Notes
_dataset/_protocol.py 34 (new) Reconstructor Protocol. Lives alone to break the import cycle between leaf modules and the dispatcher
_dataset/_ref.py 76 (new) Ref
_dataset/_haps.py 870 (new) Haps, _Variants, ReconstructionRequest
_dataset/_tracks.py 379 → 736 Tracks + TrackType merged in alongside the existing numba kernels (class lives next to the helpers it dispatches to)
_dataset/_reconstruct.py 1539 → 310 Slimmed to dispatcher only — RefTracks, HapsTracks, _build_reconstructor, plus re-exports so callers (_impl.py, _open.py, _dummy.py) keep working

Dataset.__getitem__ extracted to _query.py via composition:

  • New _dataset/_query.py (403 lines) houses a typed QueryView frozen dataclass over the 9 Dataset fields the query path actually touches (idxer, sp_idxer, full_regions, rng, recon, output_length, jitter, deterministic, rc_neg).
  • The query logic moves to free functions taking QueryView: getitem, _getitem_unspliced, _getitem_spliced, build_recon_splice_plan, reverse_complement_ragged (was Dataset._rc), pad (was Dataset._pad), _regroup. Each is independently testable with a synthetic view.
  • Dataset.__getitem__ becomes a ~15-line facade. The inner_ds = self.with_len(\"ragged\") trick in the spliced path is replaced by passing output_length=\"ragged\" directly to the recon call — no inner Dataset construction.
  • _impl.py: 2015 → 1712 lines (−303).

Audit-driven design. Before implementation, an audit surfaced two wrinkles the original spec underspecified:

  1. Existing _tracks.py was numba-only — not a class home. Resolved by merging the Tracks class in alongside its kernels (semantically coherent).
  2. The _query.py extraction is not a mechanical move — the query cluster touches ~9 Dataset state fields. Resolved with QueryView composition, matching the PR4/PR5 idiom. The spec is updated inline.

Side fix: Reconstructor.__call__ Protocol now declares splice_plan: SplicePlan | None = None, matching what every implementer already accepts. The Protocol previously underspecified its contract; pyrefly caught it.

Test plan

  • pixi run -e dev test (488 pytest + cargo) — all pass
  • pixi run -e dev ruff check python/ — clean
  • pixi run -e dev pyrefly check — 0 errors (baseline unchanged)
  • Public API surface unchanged: Dataset, Dataset.open, Dataset.__getitem__, __all__ — no changes
  • Internal callers (_dummy.py) still import from _reconstruct.py (re-exports preserved)

Part of the refactor campaign tracked in docs/superpowers/specs/2026-05-23-refactor-campaign-design.md.

🤖 Generated with Claude Code

d-laub and others added 2 commits May 23, 2026 22:57
…yView (PR6)

Big move. Splits the heavyweight modules along architectural seams.

Reconstructor splits:

- New _dataset/_protocol.py (34 lines): houses the Reconstructor Protocol.
  Pulled into its own module to break the import cycle that arises when leaf
  modules each import the Protocol and the dispatcher module (_reconstruct.py)
  imports the leaves.
- New _dataset/_ref.py (76 lines): the Ref reconstructor.
- New _dataset/_haps.py (870 lines): the Haps reconstructor along with its
  supporting value types (_Variants, ReconstructionRequest).
- _dataset/_tracks.py (379 -> 736 lines): the Tracks class and TrackType enum
  merged in alongside the existing numba kernels. Class lives next to the
  helpers it dispatches to.
- _dataset/_reconstruct.py (1539 -> 310 lines): slimmed to the dispatcher
  pieces only -- RefTracks, HapsTracks, _build_reconstructor -- plus
  re-exports so callers that imported moved symbols from this module keep
  working (_impl.py, _open.py, _dummy.py).

QueryView composition:

- New _dataset/_query.py (403 lines): houses a typed QueryView frozen
  dataclass over the 9 Dataset fields that the query path needs (idxer,
  sp_idxer, full_regions, rng, recon, output_length, jitter, deterministic,
  rc_neg). The query logic moves to free functions (getitem,
  _getitem_unspliced, _getitem_spliced, build_recon_splice_plan,
  reverse_complement_ragged [was Dataset._rc], pad [was Dataset._pad],
  _regroup) that take QueryView as their first arg. Each stage is testable
  in isolation against a synthetic view.
- Dataset.__getitem__ becomes a ~15-line facade that builds a QueryView and
  delegates to _query.getitem. The "inner_ds = self.with_len('ragged')" trick
  in _getitem_spliced is replaced by passing output_length="ragged" directly
  to the recon call -- no inner Dataset construction needed.
- _impl.py: 2015 -> 1712 lines (-303).

Other:

- Reconstructor.__call__ now declares splice_plan: SplicePlan | None = None,
  matching what all implementers already accept. Previously the Protocol
  underspecified its contract; pyrefly caught it during the QueryView
  rewrite.
- Campaign spec updated inline to record audit findings before
  implementation (existing _tracks.py was numba-only, not a class home; the
  query extraction is not mechanical and benefits from composition) and
  the as-shipped design.

Behavior preserving; kernel calls unchanged; full test suite (488 pytest +
cargo) green; pyrefly + ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@d-laub d-laub merged commit 6b1ac74 into main May 24, 2026
6 checks passed
d-laub added a commit that referenced this pull request May 24, 2026
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant