refactor(dataset): split _reconstruct + extract _query via QueryView (PR6)#190
Merged
Conversation
…yView (PR6)
Big move. Splits the heavyweight modules along architectural seams.
Reconstructor splits:
- New _dataset/_protocol.py (34 lines): houses the Reconstructor Protocol.
Pulled into its own module to break the import cycle that arises when leaf
modules each import the Protocol and the dispatcher module (_reconstruct.py)
imports the leaves.
- New _dataset/_ref.py (76 lines): the Ref reconstructor.
- New _dataset/_haps.py (870 lines): the Haps reconstructor along with its
supporting value types (_Variants, ReconstructionRequest).
- _dataset/_tracks.py (379 -> 736 lines): the Tracks class and TrackType enum
merged in alongside the existing numba kernels. Class lives next to the
helpers it dispatches to.
- _dataset/_reconstruct.py (1539 -> 310 lines): slimmed to the dispatcher
pieces only -- RefTracks, HapsTracks, _build_reconstructor -- plus
re-exports so callers that imported moved symbols from this module keep
working (_impl.py, _open.py, _dummy.py).
QueryView composition:
- New _dataset/_query.py (403 lines): houses a typed QueryView frozen
dataclass over the 9 Dataset fields that the query path needs (idxer,
sp_idxer, full_regions, rng, recon, output_length, jitter, deterministic,
rc_neg). The query logic moves to free functions (getitem,
_getitem_unspliced, _getitem_spliced, build_recon_splice_plan,
reverse_complement_ragged [was Dataset._rc], pad [was Dataset._pad],
_regroup) that take QueryView as their first arg. Each stage is testable
in isolation against a synthetic view.
- Dataset.__getitem__ becomes a ~15-line facade that builds a QueryView and
delegates to _query.getitem. The "inner_ds = self.with_len('ragged')" trick
in _getitem_spliced is replaced by passing output_length="ragged" directly
to the recon call -- no inner Dataset construction needed.
- _impl.py: 2015 -> 1712 lines (-303).
Other:
- Reconstructor.__call__ now declares splice_plan: SplicePlan | None = None,
matching what all implementers already accept. Previously the Protocol
underspecified its contract; pyrefly caught it during the QueryView
rewrite.
- Campaign spec updated inline to record audit findings before
implementation (existing _tracks.py was numba-only, not a class home; the
query extraction is not mechanical and benefits from composition) and
the as-shipped design.
Behavior preserving; kernel calls unchanged; full test suite (488 pytest +
cargo) green; pyrefly + ruff clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
d-laub
added a commit
that referenced
this pull request
May 24, 2026
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Big architectural split. Decomposes the heavyweight modules along the seams that the campaign's earlier PRs (factory in PR1, OpenRequest in PR4, ReconstructionRequest in PR5) made obvious.
Reconstructor splits:
_dataset/_protocol.pyReconstructorProtocol. Lives alone to break the import cycle between leaf modules and the dispatcher_dataset/_ref.pyRef_dataset/_haps.pyHaps,_Variants,ReconstructionRequest_dataset/_tracks.pyTracks+TrackTypemerged in alongside the existing numba kernels (class lives next to the helpers it dispatches to)_dataset/_reconstruct.pyRefTracks,HapsTracks,_build_reconstructor, plus re-exports so callers (_impl.py,_open.py,_dummy.py) keep workingDataset.__getitem__extracted to_query.pyvia composition:_dataset/_query.py(403 lines) houses a typedQueryViewfrozen dataclass over the 9 Dataset fields the query path actually touches (idxer, sp_idxer, full_regions, rng, recon, output_length, jitter, deterministic, rc_neg).QueryView:getitem,_getitem_unspliced,_getitem_spliced,build_recon_splice_plan,reverse_complement_ragged(wasDataset._rc),pad(wasDataset._pad),_regroup. Each is independently testable with a synthetic view.Dataset.__getitem__becomes a ~15-line facade. Theinner_ds = self.with_len(\"ragged\")trick in the spliced path is replaced by passingoutput_length=\"ragged\"directly to the recon call — no inner Dataset construction._impl.py: 2015 → 1712 lines (−303).Audit-driven design. Before implementation, an audit surfaced two wrinkles the original spec underspecified:
_tracks.pywas numba-only — not a class home. Resolved by merging theTracksclass in alongside its kernels (semantically coherent)._query.pyextraction is not a mechanical move — the query cluster touches ~9 Dataset state fields. Resolved withQueryViewcomposition, matching the PR4/PR5 idiom. The spec is updated inline.Side fix:
Reconstructor.__call__Protocol now declaressplice_plan: SplicePlan | None = None, matching what every implementer already accepts. The Protocol previously underspecified its contract; pyrefly caught it.Test plan
pixi run -e dev test(488 pytest + cargo) — all passpixi run -e dev ruff check python/— cleanpixi run -e dev pyrefly check— 0 errors (baseline unchanged)Dataset,Dataset.open,Dataset.__getitem__,__all__— no changes_dummy.py) still import from_reconstruct.py(re-exports preserved)Part of the refactor campaign tracked in
docs/superpowers/specs/2026-05-23-refactor-campaign-design.md.🤖 Generated with Claude Code