Skip to content

refactor(encoding): ADR 0001 — all 6 phases#30

Open
dfa1 wants to merge 27 commits into
mainfrom
worktree-adr-0001-phase0
Open

refactor(encoding): ADR 0001 — all 6 phases#30
dfa1 wants to merge 27 commits into
mainfrom
worktree-adr-0001-phase0

Conversation

@dfa1

@dfa1 dfa1 commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

Lands every phase of ADR 0001 as 7 commits on one branch. ADR 0001 proposed splitting the read and write runtimes out of core; this PR delivers the structural foundation + exemplars for the per-family lifts.

Commit Phase What
b6437ee 0 ReadRegistry + WriteRegistry delegating facades over Registry.
1750aa1 1 EncodingDecoder + EncodingEncoder interfaces. Encoding becomes a marker that extends both.
ed297cd 2 BoolEncoding decoder lifted into reader/decode/BoolEncodingDecoder. Registry.standaloneDecoders map + SL load. Standalone decoders take precedence over bifunctional fallback.
f21e5ab 3 Mirror: BoolEncoding encoder lifted into writer/encode/BoolEncodingEncoder. Registry.standaloneEncoders map.
71e39c6 4 VortexHandle.decodeFlatSegment(...) typed accessor. ScanIterator.readFlat and VortexInspectorTui dictionary preview migrated off raw slice().
a772b00 5 ExtensionEncoder interface. Extension extends ExtensionEncoder.
9eb2d32 6 docs/compatibility.md documents the read-only deployment artifact subset (./mvnw -pl core,reader,inspector verify is the verified subset).

Tests

./mvnw verify green after every commit. 938 unit + 243 integration tests pass, including the Rust cross-language round-trip suite.

What's left as mechanical follow-up

Phases 2 and 3 each lifted one encoding family (BoolEncoding) as exemplar. The remaining ~29 encoding families follow the same shape:

  • Per family: create *EncodingDecoder in reader/decode/ and *EncodingEncoder in writer/encode/; add lines to the respective META-INF/services/ manifests; remove the corresponding decode() / encode() methods from the core *Encoding class once both sides are lifted; eventually delete the core *Encoding file entirely.

Phase 4's typed accessor migration is partial — 2 of the 6 cross-package slice() callers were migrated. The remaining 4 (ScanIterator.readFlatStats, InspectorTree.peek ×2, VortexInspectorTui hex peek, integration test inspector walk) need their own typed accessors before slice() can be removed from the public VortexHandle interface. Each is a small focused PR.

Phase 5's Extension split is also a foundation — concrete spec-extension encoders (Date/Time/Timestamp/Uuid) are not yet lifted into the writer module. Same mechanical follow-up shape as Phase 2-3.

Test plan

  • ./mvnw verify green after every commit
  • ./mvnw -pl core,reader,inspector verify builds the documented read-only subset
  • Reviewer: confirm the standaloneDecoders.getOrDefault(id, encodings.get(id)) precedence rule
  • Reviewer: agree on the per-family commit shape for the remaining ~29 encoding lifts
  • Reviewer: confirm the decodeFlatSegment typed accessor signature

🤖 Generated with Claude Code

@dfa1 dfa1 changed the title refactor(encoding): ADR 0001 Phases 0–1 foundation refactor(encoding): ADR 0001 Phases 0-3 Jun 11, 2026
@dfa1 dfa1 changed the title refactor(encoding): ADR 0001 Phases 0-3 refactor(encoding): ADR 0001 — all 6 phases Jun 11, 2026
@dfa1

dfa1 commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

Honest accounting after attempt at deletion cleanup

Tried to do the deletion pass tonight. Reverted because it would have broken ~30 test files that exercise decode logic by directly constructing the bifunctional Encoding class (e.g. new NullEncoding().decode(ctx)).

Current state of PR:

  • 29 standalone *EncodingDecoder in reader/decode/ and 29 *EncodingEncoder in writer/encode/
  • Runtime dispatch via Registry.decode prefers standalone (per Phase 2 commit's standaloneDecoders.getOrDefault(id, encodings.get(id)))
  • 4 unlifted (Alp, AlpRd, Bitpacked, Pco) — their bifunctional *Encoding.java in core/ is hit at runtime
  • Original *Encoding.java files in core/ unchanged — decode() / encode() methods still there for the 29 lifted families. They're dead at runtime (standalone wins) but on disk.

Net effect: PR is pure code addition (~12K LOC). No deletions. Lift is duplication, not migration.

Why the deletion didn't happen:

  1. Encoding extends EncodingDecoder, EncodingEncoder — bifunctional. Removing decode() from the 29 lifted core files requires the interface to drop the decode requirement.
  2. Doable: change to Encoding extends EncodingEncoder only; make 4 unlifted classes implements Encoding, EncodingDecoder explicitly.
  3. But ~30 test files do new NullEncoding().decode(ctx) — direct method call. Removing the method from the class breaks compile.
  4. Real fix: migrate tests to either (a) use standalone decoder via new NullEncodingDecoder() — but that's in reader/ so needs reader as a core/test dependency, or (b) construct a Registry and call registry.decode(ctx) per-test — workable but ~30 file rewrites.

Both paths are multi-hour follow-up work. Stopping here rather than landing half-broken changes.

Recommended next PRs (each independent):

  • Migrate core/src/test decode-side tests off direct new XxxEncoding().decode(ctx) calls. Either move them to reader/src/test (test moves, not just rewrites — about 25 files) or rewrite them to dispatch through Registry.decode.
  • Once tests no longer reference XxxEncoding.decode, bulk-delete the decode() methods + inner Decoder classes from the 29 lifted core files. Same for encode() + Encoder inner classes once consumer fallback chains (MaskedEncoding's INNER_FALLBACK, FixedSizeListEncoding's FALLBACK, etc.) route through WriteRegistry.lookupEncoder instead of new PrimitiveEncoding().
  • Lift the remaining 4 (Alp, AlpRd, Bitpacked, Pco) into the same pattern.
  • Eventually: delete the *Encoding.java files in core entirely.

The structural foundation (interfaces, facades, ServiceLoader manifests, standalone dispatch) is solid. Bulk deletion is mechanical once the test-side dependency is broken.

dfa1 and others added 27 commits June 11, 2026 20:09
…[Phase 0/6]

ADR 0001 Phase 0 lays the foundation for the read/write runtime split
by adding two new facade types that delegate to the existing Registry.
No call sites change yet — current readers and writers continue to use
Registry directly. Later phases progressively migrate dispatch and
runtime through these facades.

- ReadRegistry: exposes only decode + lookup + isAllowUnknown, the
  surface a read-only consumer needs.
- WriteRegistry: exposes only lookup, mirroring the encoder-lookup
  surface a writer needs; the encode() entry point is added in a
  later phase once Encoding is split into Decoder + Encoder.

Both facades carry an asRegistry() escape hatch so callers that still
need both surfaces can unwrap the underlying Registry during the
transition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…[Phase 1/6]

Encoding is now a marker that extends EncodingDecoder + EncodingEncoder.
The two narrower interfaces carry the read-side and write-side method
signatures (encodingId + accepts + decode for decoders; encodingId +
accepts + encode + encodeCascade for encoders).

Every existing Encoding implementation continues to satisfy both
contracts without code changes, so the migration is transparent at this
phase. Read-only call sites (ReadRegistry, FlatSegmentDecoder) can
already type-narrow to EncodingDecoder; write-only call sites
(VortexWriter, CascadingCompressor) can type-narrow to EncodingEncoder.

Later phases lift one encoding family at a time into separate
*EncodingDecoder and *EncodingEncoder implementations that live in
the reader and writer modules respectively; once the last family is
lifted, the bifunctional Encoding interface is removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…2/6]

First encoding family lifted as exemplar. BoolEncodingDecoder is a
standalone EncodingDecoder living in reader/src/main, registered via a
new META-INF/services/io.github.dfa1.vortex.encoding.EncodingDecoder
manifest in the reader module.

Registry gains a parallel standaloneDecoders map (loaded via
ServiceLoader<EncodingDecoder>) plus a Builder.register(EncodingDecoder)
method. Registry.decode and Registry.decodeAsSegment now consult the
standalone map first and fall back to the bifunctional Encoding map.
Standalone decoders that are themselves Encoding implementations
(i.e. the bifunctional ones not yet lifted) are skipped in the
ServiceLoader walk so they aren't double-registered.

BoolEncoding in core remains unchanged. Its decode method still works
as the fallback path, but at runtime the new BoolEncodingDecoder in
reader takes precedence. Once every encoding is lifted, the decode
method on the bifunctional Encoding interface is removed and
core's *Encoding files become encoder-only.

This commit demonstrates the pattern; lifting the remaining ~29
encoding families is mechanical follow-up work that lands one family
per commit on this branch.

./mvnw verify — all unit + integration tests pass (incl. Rust
round-trips), including BoolEncodingTest which now exercises the
standalone decoder path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…3/6]

Mirrors Phase 2 on the write side. BoolEncodingEncoder is a standalone
EncodingEncoder living in writer/src/main, registered via a new
META-INF/services/io.github.dfa1.vortex.encoding.EncodingEncoder
manifest in the writer module.

Registry gains a parallel standaloneEncoders map (loaded via
ServiceLoader<EncodingEncoder>) plus a Builder.register(EncodingEncoder)
method. WriteRegistry exposes a lookupEncoder(EncodingId) entry point
that consults standalone encoders first and falls back to the
bifunctional Encoding map.

BoolEncoding's encode method remains in core for now; once every
encoding is lifted, the encode methods on the bifunctional Encoding
interface are removed and the core *Encoding files are deleted
entirely (read decoders live in reader, write encoders live in writer).

This commit demonstrates the symmetric pattern; lifting the remaining
~29 encoding families on the write side is mechanical follow-up work,
one family per commit.

./mvnw verify — all unit + integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
VortexHandle gains a decodeFlatSegment(SegmentSpec, DType, long, Arena)
typed accessor that does the most common pattern previously open-coded
across ScanIterator, the inspector, and the TUI:

    MemorySegment seg = handle.slice(spec.offset(), spec.length());
    return new FlatSegmentDecoder(registry)
            .decode(seg, handle.footer().arraySpecs(), dtype, rowCount, arena);

becomes a single call:

    return handle.decodeFlatSegment(spec, dtype, rowCount, arena);

VortexReader and VortexHttpReader both implement it; the HTTP variant
fetches the range first then runs the same FlatSegmentDecoder dispatch.

Two call sites migrated as exemplars in this commit:

- ScanIterator.readFlat: the hot scan path now goes through the typed
  accessor; the FlatSegmentDecoder import is dropped from the file.
- VortexInspectorTui dictionary preview: 5 lines collapse to 1.

Remaining slice() call sites kept as escape hatch for now:

- ScanIterator.readFlatStats: reads the trailing fbLen of the stats
  segment, slices a smaller stats flatbuffer region, parses ArrayStats.
  Migration requires a typed readSegmentStats(SegmentSpec) accessor.
- InspectorTree.peek (×2): runs peekFlatRoot(seg, arraySpecs) which
  parses the FlatBuffer Array root for inspector metadata. Returning
  the inspector-specific Peek type from VortexHandle would leak the
  inspector module's types into reader; a typed peekFlatRoot accessor
  that returns just the FlatBuffer Array would still leak fbs types.
  Needs its own design discussion.
- VortexInspectorTui hex peek: reads the first N bytes of a segment
  for hex display. Migration requires a readBytes(SegmentSpec, int)
  typed accessor.
- PcoFixtureInspectionIntegrationTest: walks the layout for
  diagnostic dumping. Test-only; migration low priority.

slice() stays @deprecated(forRemoval = true) on the VortexHandle
interface. The next pass adds the remaining typed accessors and then
drops slice() entirely.

./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rface [Phase 5/6]

ExtensionEncoder carries the write-side surface (extensionId + dtype +
encodeAll). Extension extends ExtensionEncoder, so every existing
implementation (DateExtension, TimeExtension, TimestampExtension,
UuidExtension) satisfies both contracts unchanged. Read-only consumers
can already type-narrow to either Extension or its parent without
depending on encoder logic.

This mirrors Phase 1's Encoding split. Subsequent commits lift each
spec extension's encodeAll override into a standalone
*ExtensionEncoder living in the writer module, registered via
META-INF/services. Once all four spec extensions are lifted the
bifunctional Extension interface is removed and Extension becomes
read-only.

Extension's decode side does not need a parallel ExtensionDecoder
interface because there is no decode method on Extension itself —
decode behaviour lives on each concrete implementation as typed
decodeAll(Array) methods, exposed for callers via the static
Extension.findKnown(DType.Extension) lookup.

./mvnw verify — all unit + integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hase 6/6]

Verifies and documents that vortex-reader (+ optional vortex-inspector)
build with no transitive dependency on vortex-writer. Read-only
consumers — analytics engines, the inspector personality of the CLI,
remote-storage diagnostic tooling — can pull only the read-side jars
without inheriting the encoder surface (FSST builders, ALP encoders,
zstd-jni, etc.).

`./mvnw -pl core,reader,inspector verify` builds the read-only subset
cleanly. ServiceLoader<EncodingDecoder> in the read-only artifact
resolves only the standalone decoders shipped in reader (today:
BoolEncodingDecoder, the Phase 2 exemplar), falling back to the
bifunctional Encoding implementations in core for encoding families
not yet lifted.

This is documentation only; the structural property already exists in
the current pom layout. Subsequent per-family lifts (Phases 2-3 for
the remaining ~29 encodings) progressively shrink the
bifunctional-Encoding fallback footprint to zero, at which point the
read-only artifact carries no encoder code at all.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Phase 2-3 batch]

Lifts three more encoding families into separate decoder/encoder modules
following the BoolEncoding exemplar:

- NullEncoding -> reader/decode/NullEncodingDecoder + writer/encode/NullEncodingEncoder
- ByteBoolEncoding -> reader/decode/ByteBoolEncodingDecoder + writer/encode/ByteBoolEncodingEncoder
- PrimitiveEncoding -> reader/decode/PrimitiveEncodingDecoder + writer/encode/PrimitiveEncodingEncoder

Manifests updated in both modules. Standalone decoders take precedence
over the bifunctional Encoding fallback at runtime; the core *Encoding
classes remain in place during the transition and their decode/encode
methods are no longer hit on the dispatch path.

Side change: ArrayNode, KnownArrayNode, UnknownArrayNode promoted from
package-private to public so decoders living outside the
io.github.dfa1.vortex.encoding package can call
ctx.node().children() etc. The validity-child handling in
PrimitiveEncodingDecoder exercises this path. The sealed interface
contract is preserved — only the existing two permitted subtypes
remain.

./mvnw verify — all unit + integration tests pass (incl. Rust round-trips).

Remaining encoding families to lift (~29):
  Alp, AlpRd, Bitpacked, Chunked, Constant, DateTimeParts,
  DecimalByteParts, Decimal, Delta, Dict, Ext, FixedSizeList,
  FrameOfReference, Fsst, List, ListView, Masked, Patched, Pco,
  Rle, RunEnd, Sequence, Sparse, Struct, VarBin, VarBinView,
  Variant, ZigZag, Zstd.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…batch]

Continues the per-family lift. Two more families now have standalone
decoder/encoder implementations registered via ServiceLoader:

- ZigZagEncoding -> ZigZagEncodingDecoder + ZigZagEncodingEncoder
- SequenceEncoding -> SequenceEncodingDecoder + SequenceEncodingEncoder

Both encoders reuse the existing public helpers in core
(PTypeIO, SegmentBroadcast); no further core API surface changes.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: Bool, Null, ByteBool, Primitive, ZigZag, Sequence (6 of ~33).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tch]

Adds FrameOfReferenceEncodingDecoder (reader) and
FrameOfReferenceEncodingEncoder (writer). Encoder retains the cascade
path via encodeCascade(...) returning a ChildSlot for the residuals
buffer the compressor bitpacks.

PTypeIO.set promoted from package-private to public so the writer-side
encoder can reach it. Already-public methods (LE_SHORT, LE_INT, LE_LONG,
LE_FLOAT, LE_DOUBLE accessors used by Phase 2-3 encoders) were
unaffected.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 7 of ~33 (Bool, Null, ByteBool, Primitive, ZigZag,
Sequence, FrameOfReference).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…[Phase 2-3 batch]

Four more encoding families lifted. Pattern matches earlier batches:
standalone *EncodingDecoder in reader/decode, standalone
*EncodingEncoder in writer/encode, registered via the respective
META-INF/services manifests.

- ExtEncoding: extension storage wrapper. Encoder reuses the bifunctional
  PrimitiveEncoding / FixedSizeListEncoding / MaskedEncoding classes
  from core for the storage child during the transition.
- MaskedEncoding: NullableData payload + validity. Encoder reuses the
  same core fallback list.
- VarBinEncoding: variable-length binary / UTF-8. Self-contained.
- VarBinViewEncoding: Arrow StringView format. Self-contained.

Registry.decode promoted from package-private to public so standalone
decoders in the reader module can recursively call back into the
registry (ExtEncodingDecoder.decode does this for the storage child).

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 11 of ~33 (Bool, Null, ByteBool, Primitive, ZigZag,
Sequence, FrameOfReference, Ext, Masked, VarBin, VarBinView).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds ConstantEncodingDecoder + ConstantEncodingEncoder. The decoder
covers all dtypes (Primitive, Null, Bool, Utf8/Binary, Decimal,
Extension via storage-dtype recursion). The encoder keeps the
isConstant guard + cascading-aware encodeCascade path.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 12 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds DeltaEncodingDecoder + DeltaEncodingEncoder. The FastLanes
transpose helpers (FL_CHUNK_SIZE, FL_ORDER, transposeIndex,
iterateIndex, lanes, typeBits, typeMask, fromLongs) were promoted from
package-private to public on DeltaEncoding so both module-local
encoders/decoders can share the same implementation rather than
duplicating ~50 lines of bit-twiddling math.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 13 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…encodings [Phase 2-3 batch]

Three more families lifted. All self-contained or reusing public core
encodings during the transition.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 16 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two more structural families lifted.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 18 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…batch]

Both currently decode-only (the bifunctional Encoding throws
"not yet implemented" on encode); the lifted encoders preserve that
behaviour and the decoders carry their full logic.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 20 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…se 2-3 batch]

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 22 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…atch]

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 24 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 25 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rle's FL_CHUNK_SIZE constant promoted from package-private to public so
the lifted reader-side decoder and writer-side encoder share the same
value.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 27 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 28 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Promoted PTypeIO.copyArray from package-private to public so the
lifted Dict encoder can build the unique-values primitive segment.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted: 29 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Promoted AlpEncoding's package-private constants (F10_F64, IF10_F64,
F10_F32, IF10_F32, I64_DTYPE, I32_DTYPE) to public so the lifted
reader-side decoder shares the same precomputed tables.

Encoder side stays in core for now.

./mvnw verify — all unit + integration tests pass.

Cumulative lifted decoders: 30 of ~33.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move standalone decoder/encoder counterparts for vortex.alprd to reader/
and writer/ modules. Add Alp encoder under writer/ (decoder previously
lifted). Promote AlpRdEncoding U16/U32/U64 DType constants to public for
cross-module reuse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move standalone decoder/encoder counterparts for fastlanes.bitpacked to
reader/ and writer/ modules. Promote BitpackedEncoding.FL_ORDER to public
for cross-module reuse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ncoding]

Move standalone decoder/encoder counterparts for vortex.pco to reader/
and writer/ modules. Writer side is a stub (encode not yet implemented,
matching prior behavior). Promote PcoEncoding format/offset-bits
constants, PcoBin record, PcoTansDecoder class + decode methods, and
LeBitReader class + read methods to public for cross-module reuse.

This completes the lift phase: all 33 encodings now have standalone
EncodingDecoder/EncodingEncoder implementations registered via
ServiceLoader.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port the null/empty-buffer fallback from BitpackedEncoding (main).
Proto3 elides default-valued fields, so a BitPackedMetadata(0, 0, null)
payload serialises to 0 bytes — the writer skips the empty vector
entirely. The lifted decoder must treat absent metadata as all-defaults,
otherwise nested bit_width=0 paths (constant residuals under FoR / FSST
cascade) regress with "missing metadata".

CsvImporterTest.respectsSchemaOverride caught this after rebase against
main's FSST cascade work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dfa1 dfa1 force-pushed the worktree-adr-0001-phase0 branch from 2a858df to 6f73c9e Compare June 11, 2026 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant