Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
358 changes: 358 additions & 0 deletions docs/adr/0001-split-read-and-write-runtimes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,358 @@
# ADR 0001: Split read and write runtimes out of `core`

- **Status:** Proposed
- **Date:** 2026-06-11
- **Deciders:** project maintainer
- **Supersedes:** —
- **Superseded by:** —

## Context

The current module layout collapses the file-format model, the read runtime,
and the write runtime into a single `core` module. The other modules
(`reader`, `writer`, `inspector`, `cli`, `csv`, `jdbc`, `parquet`) are thin
orchestration layers that call back into `core` for every meaningful operation.

### What `core` currently contains

```
io.github.dfa1.vortex.core — DType, PType, Footer, Layout,
VortexException, VortexFormat,
Array hierarchy, ArrayStats
io.github.dfa1.vortex.encoding — Encoding (encode + decode on one type),
Registry (read + write dispatch),
DecodeContext, EncodeContext,
FlatSegmentDecoder, ArrayNode,
30+ concrete *Encoding.java classes
io.github.dfa1.vortex.extension — Extension interface, ExtensionId,
4 spec extension impls
io.github.dfa1.vortex.proto — generated proto records (in-tree codec)
io.github.dfa1.vortex.fbs — generated flatbuffer types
```

### The smell

1. **`Encoding` is bifunctional.** Every encoding implements both
`encode(DType, Object, EncodeContext)` and `decode(DecodeContext)`. A
read-only consumer (the most common deployment shape — analytical engines
reading columnar files) pulls in the full write path, including Zstd
compression libraries, dictionary builders, and stats sketchers.

2. **`Registry` is a dual dispatcher.** It maps `EncodingId` to a single
`Encoding` instance and exposes both read (`decode`, `decodeAsSegment`) and
write surfaces. Read-only callers carry write-side identity even when no
write code path will ever fire.

3. **`reader` is a shell, not a runtime.** `VortexReader` memory-maps the
file, parses the trailer/postscript/footer/layout, then hands off to
`FlatSegmentDecoder` — which lives in `core`. From that point on every
meaningful operation (per-buffer slicing, `Registry.decode` dispatch,
`Encoding.decode` call) happens inside `core`. The `reader` module
contributes ~3 KLOC of orchestration around ~30 KLOC of decode runtime that
lives in `core`.

4. **The `slice()` escape hatch is forced by this layout.**
[PR #27](https://github.com/dfa1/vortex-java/pull/27) wraps untrusted
`MemorySegment.asSlice` calls in a `BoundedSegment` type, but ended up
shipping 33 `unwrapForSubParser(...)` sites because the consumers
(`Encoding.decode`, `FlatSegmentDecoder`, `Registry`) live in `core` while
the byte-producing handle (`VortexHandle.slice`) lives in `reader`. They
cannot share package-private access; the cross-module API must be public;
the public API must expose raw segments because that is what the consumers
take. Every typed wrapper added to plug the gap (`BoundedSegment`,
`unwrapForSubParser`, `MemorySegments.slice`) is a workaround for the fact
that the read runtime should not be in a different module than the byte
source it consumes.

5. **Concrete consequences of (1)–(4):**
- **`Registry.decodeAsSegment`** exists purely so `ScanIterator` (in
`reader`) can decode a child node back into a raw `MemorySegment` —
an inversion of the natural data-flow direction.
- **`DecodeContext.segmentBuffers`** had to become `BoundedSegment[]` so
the security contract survives the cross-module hand-off; if decoders
lived alongside `FlatSegmentDecoder`, package-private `MemorySegment[]`
would have sufficed.
- **Read-only jars cannot be smaller than the full reactor.** The CLI
uber-jar pulls every encoder, every writer dependency (`zstd-jni`,
`air-compressor`, etc.) even if the binary only reads files.
- **`Extension`** mirrors the same problem at a smaller scale —
`encodeAll` and decode helpers live on one interface, pulling
write-side dependencies into read-only consumers.

## Decision

Split `core` into three logical surfaces along the read/write axis:

```
core/ — only the model
Encoding — id() + accepts(); no encode/decode methods
EncodingId
DType, PType, ArrayStats, Footer, Layout
NullableData — the one Array-shaped type both sides need
BoundedSegment — the byte-access primitive
proto, fbs — generated code (already pure data)

reader/ — read runtime
VortexReader, VortexHandle, VortexHttpReader, ScanIterator
Array hierarchy — BoolArray, IntArray, LongArray, VarBinArray,
StructArray, ... (the read-only data exchange
format; writer never touches these)
ReadRegistry — Map<EncodingId, EncodingDecoder>
EncodingDecoder — Array decode(DecodeContext)
DecodeContext, FlatSegmentDecoder
BitpackedDecoder, AlpDecoder, PcoDecoder, ... (30+ files)

writer/ — write runtime
VortexWriter
WriteRegistry — Map<EncodingId, EncodingEncoder>
EncodingEncoder — EncodeResult encode(DType, Object, EncodeContext)
EncodeContext
BitpackedEncoder, AlpEncoder, PcoEncoder, ... (30+ files)
```

### How the registry attaches at the API surface

`Registry` splits into two distinct types — `ReadRegistry` and
`WriteRegistry` — not a generic `Registry<T>`. Each is passed alongside
its corresponding entry point, **not** folded into an options record:

```java
ReadRegistry rr = ReadRegistry.builder().registerServiceLoaded().build();
VortexReader.open(path, rr); // ReadRegistry directly

WriteRegistry wr = WriteRegistry.builder().registerServiceLoaded().build();
VortexWriter.builder(path, schema)
.registry(wr) // WriteRegistry directly
.options(WriteOptions.defaults()) // tuning knobs only
.build();
```

Two design choices feed this shape.

**Distinct types, not `Registry<T>`.** Reasons:
- Different builder ergonomics: the read side has no cascade chain to
configure; the write side does (`cascadeCodecs`, allowed-cascading depth).
A generic type would carry irrelevant builder methods on both sides.
- `ServiceLoader` manifests are already separate
(`META-INF/services/...EncodingDecoder` vs `...EncodingEncoder`), so
type-level separation matches the runtime story.
- Mistakes like passing a write registry to `VortexReader.open` become
compile errors, not runtime errors.

**Alongside the options, not inside them.** Reasons:
- `WriteOptions` is a record (an immutable value). `WriteRegistry` is a
configured map with lifecycle: typically built once at app startup and
reused across many file writes. Mixing forces re-creating options every
time you want a new file with the same registry.
- Records work badly for fields with non-trivial equality semantics
(`Registry.equals`?).
- Today the registry already lives on the method signature; keeping that
split is the lowest-migration shape.

The same applies to read-side configuration. There is no `ReadOptions`
record today (the reader takes `ScanOptions` per-scan instead); the
proposal keeps that as-is. `ReadRegistry` is the file-open parameter;
`ScanOptions` is the per-scan parameter.

Effect on caller code:
- Read-only callers (analytics engines, inspector, CLI inspector)
construct only `ReadRegistry`. No transitive dependency on writer
encoders — the `writer` module isn't on their classpath at all.
- Write-only callers (CSV importer, JDBC importer) construct only
`WriteRegistry`.
- Tools that do both (integration tests, parquet bridge) construct both.

### What changes structurally

- `Encoding` becomes a small metadata-only interface in `core`. It carries
`EncodingId` and `accepts(DType)` and nothing else. No bifunctional decode
+ encode methods.
- Each encoding's `Decoder` static inner class becomes a top-level
`EncodingDecoder` implementation in `reader`. The `Encoder` inner class
becomes `EncodingEncoder` in `writer`. CLAUDE.md already documents this
split via private inner classes; the migration largely lifts those into
separate compilation units across modules.
- `Registry` splits into `ReadRegistry` and `WriteRegistry`. Each registry
exposes only the dispatch surface its side needs. The `decodeAsSegment`
escape hatch is deleted; the corresponding adapter logic lives in
`FlatSegmentDecoder` (in `reader`) instead.
- `DecodeContext` moves to `reader`; `EncodeContext` moves to `writer`.
- `FlatSegmentDecoder` moves to `reader`, into the same package as
`VortexReader`. The `slice()` method on `VortexHandle` becomes
package-private — `FlatSegmentDecoder`, `Trailer`, and `PostscriptParser`
are its only callers, all co-resident in `reader/io`.
- `unwrapForSubParser` and the corresponding audit trail collapse to the
minority of decoders that genuinely call into a sub-parser (`ProtoReader`)
with a raw `MemorySegment`. Cross-module byte hand-offs disappear.
- `Extension` similarly splits into `ExtensionDecoder` + `ExtensionEncoder`,
or keeps a single interface with read-only and write-only sub-types.

### Effect on the `slice()` problem

The motivating problem disappears as a side effect:

- `VortexHandle.slice(long, long)` → package-private. External callers cannot
see it; cross-module consumers (`ScanIterator`, `InspectorTree`) move into
the same module so the package-private access works.
- `BoundedSegment` stays in `core` as the primitive, but no longer needs to
travel through public API surfaces. Most internal uses can drop back to
raw `MemorySegment` because they live in the same package as the byte
source and the trust boundary is now spatially local.
- The 33 `unwrapForSubParser` sites from PR #27 are mostly eliminated —
not because we wrote more wrappers, but because the wrappers are no longer
needed once read code stops crossing module boundaries to reach its bytes.

## Migration phases

Each phase is a separate PR, lands independently green, and keeps the old
shape running side-by-side until cut-over.

**Phase 0 — preparation (≈0.5 day)**
- Land this ADR.
- Add `Encoding` metadata-only interface in `core` (extends the existing
one for now). Verify all current `Encoding` impls already implement
`id()` and `accepts(DType)`.
- Introduce `ReadRegistry` and `WriteRegistry` skeletons that for now
delegate to the existing `Registry`. No call-site changes yet.

**Phase 1 — split `DecodeContext` and the read registry (≈1 day)**
- Move `DecodeContext`, `ArrayNode`, `FlatSegmentDecoder` to `reader`.
- `ReadRegistry` becomes the canonical read dispatcher; `Registry.decode`
forwards to it during transition.
- `ScanIterator` uses `ReadRegistry` directly.
- `decodeAsSegment` deleted; `FlatSegmentDecoder` gains the equivalent
package-private helper.

**Phase 2 — lift `*Decoder` impls into `reader` (≈1 day per family, ≈3 days)**
- Pick one encoding family at a time (Fastlanes, ALP, Pco, …).
- For each: extract the `Decoder` inner class into a new
`*EncodingDecoder` in `reader/encoding`; register via
`META-INF/services/...EncodingDecoder`; delete the `decode(...)` method
from the old `*Encoding` in `core`.
- After all families lifted, `Encoding` in `core` no longer has a `decode`
method. `Registry` (the old dual) no longer has a read surface.

**Phase 3 — repeat for the write side (≈3 days)**
- Mirror Phase 2 for writers. `Encoding` in `core` becomes the
metadata-only shape promised in the Decision section.

**Phase 4 — `VortexHandle.slice` to package-private (≈0.5 day)**
- Drop `slice()` from the public `VortexHandle` interface. All remaining
callers are now in `reader` and use a package-private accessor on the
concrete `VortexReader` / `VortexHttpReader` types.
- Inspector and CLI inspector code that today calls `handle.slice(...)`
receives a new typed accessor instead (e.g.
`FlatSegmentInspector.peek(handle, spec)`).
- The 33 `unwrapForSubParser` sites from PR #27 are deleted at the same
time; the corresponding decoders take raw `MemorySegment` again because
they live in the same package as the byte source.

**Phase 5 — `Extension` split (≈0.5 day)**
- `ExtensionDecoder` and `ExtensionEncoder` in their respective modules.
- Confirm the four spec extensions (`Date`, `Time`, `Timestamp`, `Uuid`)
ride through the split cleanly.

**Phase 6 — read-only jar artifact (≈0.5 day)**
- Verify the CLI's "read-only" personality (the inspector) can be built
without the writer module on the classpath. Document in `compatibility.md`.

Cumulative effort estimate: ~9 person-days of focused work, plus ~3 days
of CI / integration-test fallout, plus reviewer time. Not a weekend.

## Consequences

### Positive

- **Public API never exposes raw `MemorySegment`** for the read path.
`VortexHandle.slice` disappears from the public surface. The SECURITY.md
contract is enforced architecturally, not by audit-trail convention.
- **PR #27's 33 `unwrapForSubParser` sites collapse to a handful** —
only the decoders that genuinely call a sub-parser
(`ProtoReader`-bound decoders: Constant, Pco, Sparse, plus Zstd's
native lib hand-off) retain a documented trust transfer.
- **`Registry.decodeAsSegment` deleted.** The current adapter exists only
because cross-module dispatch needs a raw-segment escape; once decoders
are co-resident with the byte source, the adapter is no longer needed.
- **Read-only deployments shrink.** No transitive pull on `zstd-jni`
encode paths, FSST dictionary builders, ALP encoders, etc. The CLI
inspector becomes a true read-only artifact.
- **Dependency direction matches data flow.** `reader` depends on `core`;
`writer` depends on `core`; neither depends on the other. Today both
live inside `core` and the dependency direction is invisible.

### Negative

- **Multi-day refactor.** ~9 person-days plus CI iteration. Cannot land in
a single PR; must be staged carefully so each phase runs green.
- **Encoding impls double in file count** during transition. `BitpackedEncoding`
becomes `BitpackedDecoder` (in `reader`) + `BitpackedEncoder` (in `writer`).
Test files split similarly.
- **CLAUDE.md updates** — the "three touch-points for adding an encoding"
rule becomes "decoder side + encoder side + EncodingId enum constant",
each in its own module.
- **CHANGELOG breaking-changes section grows.** External users (none today,
but any future ones) see `Encoding`, `Registry`, and `DecodeContext`
moved. Probably worth bundling under a 0.7.0 release boundary.
- **Two `ServiceLoader` manifests per encoding** instead of one.
- **Integration tests need re-routing.** Tests that today construct a
`Registry` and call `decode` directly will need to construct a
`ReadRegistry` instead — mechanical but pervasive.

### Risks to manage

- **Side-by-side period drift.** Phases 1–3 leave both the old `Registry`
and the new `ReadRegistry`/`WriteRegistry` registered for each encoding
during transition. Risk: divergent behaviour if a bug fix lands on one
side and not the other. Mitigation: integration tests run against both
paths during the transition; the old `Registry` becomes a thin forwarder
early in Phase 1.
- **Extension split.** `Extension` carries the same encode/decode tension
as `Encoding`; the migration plan assumes a parallel split. If the
extension API has tighter user-facing constraints (it does — see
`DateExtension.decodeAll`), Phase 5 may need a separate ADR.
- **JMH benchmarks.** `RustVsJavaReadBenchmark` and friends construct
`Registry` + `DecodeContext` directly. They live in the `performance`
module, which depends on `reader`. The benchmarks need re-wiring at the
end of Phase 2.

## Alternatives considered

- **Keep `core` as-is, hide `slice()` via Java modules (JPMS).** Drops
PR #27's escape-hatch noise but does not address the underlying smell —
`core` still hosts the read runtime, `reader` still calls back into
`core` for every operation, `Registry` still dispatches both sides,
and read-only deployments still pull the writer surface. Rejected as
cosmetic.
- **Move `FlatSegmentDecoder` alone into `reader`, leave everything else.**
Solves the immediate `slice()` problem at the cost of a circular module
dependency: `Encoding.decode` (in `core`) would call
`Registry.decode` (in `core`) which would route into
`FlatSegmentDecoder` (in `reader`). Rejected as architecturally worse
than the current state.
- **Adopt an existing pluggable codec framework (e.g. Arrow's
`CompressionCodec` SPI shape).** Considered briefly. Vortex's
cascading-encoding model has tighter requirements than Arrow's flat
codec model; an external SPI does not fit. Rejected.
- **Status quo + documentation.** Document that `core` is the read runtime
and `reader` is a shell. Cheapest. Rejected because every future
feature that needs cross-module byte access reintroduces the same
escape-hatch problem.

## Decision drivers

- The 33 `unwrapForSubParser` sites in PR #27 are a strong proxy signal:
every one of them documents a place where read code needs bytes that
live in another module.
- A genuinely read-only deployment (inspector + scan) should be possible
without pulling Zstd encoders or FSST builders. Today it is not.
- The `Encoding` interface bifunctional shape blocks ahead-of-time
pruning of the write surface; the refactor is the only path to a
smaller read-only artifact.

## References

- [PR #27 — `sec(parser): BoundedSegment + audit trail for untrusted asSlice`](https://github.com/dfa1/vortex-java/pull/27)
- [Phase 1–4 commits — BoundedSegment introduction and migration](https://github.com/dfa1/vortex-java/pull/27/commits)
- [SECURITY.md — the contract this work hardens](../../SECURITY.md)
- [CLAUDE.md — current "three touch-points" rule for adding an encoding](../../CLAUDE.md)
- [TODO.md — parser hardening backlog](../../TODO.md)
Loading
Loading