diff --git a/docs/adr/0001-split-read-and-write-runtimes.md b/docs/adr/0001-split-read-and-write-runtimes.md new file mode 100644 index 00000000..19cc721f --- /dev/null +++ b/docs/adr/0001-split-read-and-write-runtimes.md @@ -0,0 +1,358 @@ +# ADR 0001: Split read and write runtimes out of `core` + +- **Status:** Proposed +- **Date:** 2026-06-11 +- **Deciders:** project maintainer +- **Supersedes:** — +- **Superseded by:** — + +## Context + +The current module layout collapses the file-format model, the read runtime, +and the write runtime into a single `core` module. The other modules +(`reader`, `writer`, `inspector`, `cli`, `csv`, `jdbc`, `parquet`) are thin +orchestration layers that call back into `core` for every meaningful operation. + +### What `core` currently contains + +``` +io.github.dfa1.vortex.core — DType, PType, Footer, Layout, + VortexException, VortexFormat, + Array hierarchy, ArrayStats +io.github.dfa1.vortex.encoding — Encoding (encode + decode on one type), + Registry (read + write dispatch), + DecodeContext, EncodeContext, + FlatSegmentDecoder, ArrayNode, + 30+ concrete *Encoding.java classes +io.github.dfa1.vortex.extension — Extension interface, ExtensionId, + 4 spec extension impls +io.github.dfa1.vortex.proto — generated proto records (in-tree codec) +io.github.dfa1.vortex.fbs — generated flatbuffer types +``` + +### The smell + +1. **`Encoding` is bifunctional.** Every encoding implements both + `encode(DType, Object, EncodeContext)` and `decode(DecodeContext)`. A + read-only consumer (the most common deployment shape — analytical engines + reading columnar files) pulls in the full write path, including Zstd + compression libraries, dictionary builders, and stats sketchers. + +2. **`Registry` is a dual dispatcher.** It maps `EncodingId` to a single + `Encoding` instance and exposes both read (`decode`, `decodeAsSegment`) and + write surfaces. Read-only callers carry write-side identity even when no + write code path will ever fire. + +3. 
**`reader` is a shell, not a runtime.** `VortexReader` memory-maps the + file, parses the trailer/postscript/footer/layout, then hands off to + `FlatSegmentDecoder` — which lives in `core`. From that point on every + meaningful operation (per-buffer slicing, `Registry.decode` dispatch, + `Encoding.decode` call) happens inside `core`. The `reader` module + contributes ~3 KLOC of orchestration around ~30 KLOC of decode runtime that + lives in `core`. + +4. **The `slice()` escape hatch is forced by this layout.** + [PR #27](https://github.com/dfa1/vortex-java/pull/27) wraps untrusted + `MemorySegment.asSlice` calls in a `BoundedSegment` type, but ended up + shipping 33 `unwrapForSubParser(...)` sites because the consumers + (`Encoding.decode`, `FlatSegmentDecoder`, `Registry`) live in `core` while + the byte-producing handle (`VortexHandle.slice`) lives in `reader`. They + cannot share package-private access; the cross-module API must be public; + the public API must expose raw segments because that is what the consumers + take. Every typed wrapper added to plug the gap (`BoundedSegment`, + `unwrapForSubParser`, `MemorySegments.slice`) is a workaround for the fact + that the read runtime should not be in a different module than the byte + source it consumes. + +5. **Concrete consequences of (1)–(4):** + - **`Registry.decodeAsSegment`** exists purely so `ScanIterator` (in + `reader`) can decode a child node back into a raw `MemorySegment` — + an inversion of the natural data-flow direction. + - **`DecodeContext.segmentBuffers`** had to become `BoundedSegment[]` so + the security contract survives the cross-module hand-off; if decoders + lived alongside `FlatSegmentDecoder`, package-private `MemorySegment[]` + would have sufficed. + - **Read-only jars cannot be smaller than the full reactor.** The CLI + uber-jar pulls every encoder, every writer dependency (`zstd-jni`, + `air-compressor`, etc.) even if the binary only reads files. 
+ - **`Extension`** mirrors the same problem at a smaller scale — + `encodeAll` and decode helpers live on one interface, pulling + write-side dependencies into read-only consumers. + +## Decision + +Split `core` into three logical surfaces along the read/write axis: + +``` +core/ — only the model + Encoding — id() + accepts(); no encode/decode methods + EncodingId + DType, PType, ArrayStats, Footer, Layout + NullableData — the one Array-shaped type both sides need + BoundedSegment — the byte-access primitive + proto, fbs — generated code (already pure data) + +reader/ — read runtime + VortexReader, VortexHandle, VortexHttpReader, ScanIterator + Array hierarchy — BoolArray, IntArray, LongArray, VarBinArray, + StructArray, ... (the read-only data exchange + format; writer never touches these) + ReadRegistry — Map + EncodingDecoder — Array decode(DecodeContext) + DecodeContext, FlatSegmentDecoder + BitpackedDecoder, AlpDecoder, PcoDecoder, ... (30+ files) + +writer/ — write runtime + VortexWriter + WriteRegistry — Map + EncodingEncoder — EncodeResult encode(DType, Object, EncodeContext) + EncodeContext + BitpackedEncoder, AlpEncoder, PcoEncoder, ... (30+ files) +``` + +### How the registry attaches at the API surface + +`Registry` splits into two distinct types — `ReadRegistry` and +`WriteRegistry` — not a generic `Registry`. Each is passed alongside +its corresponding entry point, **not** folded into an options record: + +```java +ReadRegistry rr = ReadRegistry.builder().registerServiceLoaded().build(); +VortexReader.open(path, rr); // ReadRegistry directly + +WriteRegistry wr = WriteRegistry.builder().registerServiceLoaded().build(); +VortexWriter.builder(path, schema) + .registry(wr) // WriteRegistry directly + .options(WriteOptions.defaults()) // tuning knobs only + .build(); +``` + +Two design choices feed this shape. 
+ +**Distinct types, not `Registry`.** Reasons: +- Different builder ergonomics: the read side has no cascade chain to + configure; the write side does (`cascadeCodecs`, allowed-cascading depth). + A generic type would carry irrelevant builder methods on both sides. +- `ServiceLoader` manifests are already separate + (`META-INF/services/...EncodingDecoder` vs `...EncodingEncoder`), so + type-level separation matches the runtime story. +- Mistakes like passing a write registry to `VortexReader.open` become + compile errors, not runtime errors. + +**Alongside the options, not inside them.** Reasons: +- `WriteOptions` is a record (an immutable value). `WriteRegistry` is a + configured map with lifecycle: typically built once at app startup and + reused across many file writes. Mixing forces re-creating options every + time you want a new file with the same registry. +- Records work badly for fields with non-trivial equality semantics + (`Registry.equals`?). +- Today the registry already lives on the method signature; keeping that + split is the lowest-migration shape. + +The same applies to read-side configuration. There is no `ReadOptions` +record today (the reader takes `ScanOptions` per-scan instead); the +proposal keeps that as-is. `ReadRegistry` is the file-open parameter; +`ScanOptions` is the per-scan parameter. + +Effect on caller code: +- Read-only callers (analytics engines, inspector, CLI inspector) + construct only `ReadRegistry`. No transitive dependency on writer + encoders — the `writer` module isn't on their classpath at all. +- Write-only callers (CSV importer, JDBC importer) construct only + `WriteRegistry`. +- Tools that do both (integration tests, parquet bridge) construct both. + +### What changes structurally + +- `Encoding` becomes a small metadata-only interface in `core`. It carries + `EncodingId` and `accepts(DType)` and nothing else. No bifunctional decode + + encode methods. 
+- Each encoding's `Decoder` static inner class becomes a top-level + `EncodingDecoder` implementation in `reader`. The `Encoder` inner class + becomes `EncodingEncoder` in `writer`. CLAUDE.md already documents this + split via private inner classes; the migration largely lifts those into + separate compilation units across modules. +- `Registry` splits into `ReadRegistry` and `WriteRegistry`. Each registry + exposes only the dispatch surface its side needs. The `decodeAsSegment` + escape hatch is deleted; the corresponding adapter logic lives in + `FlatSegmentDecoder` (in `reader`) instead. +- `DecodeContext` moves to `reader`; `EncodeContext` moves to `writer`. +- `FlatSegmentDecoder` moves to `reader`, into the same package as + `VortexReader`. The `slice()` method on `VortexHandle` becomes + package-private; `FlatSegmentDecoder`, `Trailer`, and `PostscriptParser` + are its only callers, all co-resident in `reader/io`. +- `unwrapForSubParser` and the corresponding audit trail collapse to the + minority of decoders that genuinely call into a sub-parser (`ProtoReader`) + with a raw `MemorySegment`. Cross-module byte hand-offs disappear. +- `Extension` similarly splits into `ExtensionDecoder` + `ExtensionEncoder`, + or keeps a single interface with read-only and write-only sub-types. + +### Effect on the `slice()` problem + +The motivating problem disappears as a side effect: + +- `VortexHandle.slice(long, long)` → package-private. External callers cannot + see it; cross-module consumers (`ScanIterator`, `InspectorTree`) move into + the same module, and into the same package, because Java's package-private + access is scoped to the package rather than the module. +- `BoundedSegment` stays in `core` as the primitive, but no longer needs to + travel through public API surfaces. Most internal uses can drop back to + raw `MemorySegment` because they live in the same package as the byte + source and the trust boundary is now spatially local. 
+- The 33 `unwrapForSubParser` sites from PR #27 are mostly eliminated — + not because we wrote more wrappers, but because the wrappers are no longer + needed once read code stops crossing module boundaries to reach its bytes. + +## Migration phases + +Each phase is a separate PR, lands independently green, and keeps the old +shape running side-by-side until cut-over. + +**Phase 0 — preparation (≈0.5 day)** +- Land this ADR. +- Add `Encoding` metadata-only interface in `core` (extends the existing + one for now). Verify all current `Encoding` impls already implement + `id()` and `accepts(DType)`. +- Introduce `ReadRegistry` and `WriteRegistry` skeletons that for now + delegate to the existing `Registry`. No call-site changes yet. + +**Phase 1 — split `DecodeContext` and the read registry (≈1 day)** +- Move `DecodeContext`, `ArrayNode`, `FlatSegmentDecoder` to `reader`. +- `ReadRegistry` becomes the canonical read dispatcher; `Registry.decode` + forwards to it during transition. +- `ScanIterator` uses `ReadRegistry` directly. +- `decodeAsSegment` deleted; `FlatSegmentDecoder` gains the equivalent + package-private helper. + +**Phase 2 — lift `*Decoder` impls into `reader` (≈1 day per family, ≈3 days)** +- Pick one encoding family at a time (Fastlanes, ALP, Pco, …). +- For each: extract the `Decoder` inner class into a new + `*EncodingDecoder` in `reader/encoding`; register via + `META-INF/services/...EncodingDecoder`; delete the `decode(...)` method + from the old `*Encoding` in `core`. +- After all families lifted, `Encoding` in `core` no longer has a `decode` + method. `Registry` (the old dual) no longer has a read surface. + +**Phase 3 — repeat for the write side (≈3 days)** +- Mirror Phase 2 for writers. `Encoding` in `core` becomes the + metadata-only shape promised in the Decision section. + +**Phase 4 — `VortexHandle.slice` to package-private (≈0.5 day)** +- Drop `slice()` from the public `VortexHandle` interface. 
All remaining + callers are now in `reader` and use a package-private accessor on the + concrete `VortexReader` / `VortexHttpReader` types. +- Inspector and CLI inspector code that today calls `handle.slice(...)` + receives a new typed accessor instead (e.g. + `FlatSegmentInspector.peek(handle, spec)`). +- The 33 `unwrapForSubParser` sites from PR #27 are deleted at the same + time; the corresponding decoders take raw `MemorySegment` again because + they live in the same package as the byte source. + +**Phase 5 — `Extension` split (≈0.5 day)** +- `ExtensionDecoder` and `ExtensionEncoder` in their respective modules. +- Confirm the four spec extensions (`Date`, `Time`, `Timestamp`, `Uuid`) + ride through the split cleanly. + +**Phase 6 — read-only jar artifact (≈0.5 day)** +- Verify the CLI's "read-only" personality (the inspector) can be built + without the writer module on the classpath. Document in `compatibility.md`. + +Cumulative effort estimate: ~9 person-days of focused work, plus ~3 days +of CI / integration-test fallout, plus reviewer time. Not a weekend. + +## Consequences + +### Positive + +- **Public API never exposes raw `MemorySegment`** for the read path. + `VortexHandle.slice` disappears from the public surface. The SECURITY.md + contract is enforced architecturally, not by audit-trail convention. +- **PR #27's 33 `unwrapForSubParser` sites collapse to a handful** — + only the decoders that genuinely call a sub-parser + (`ProtoReader`-bound decoders: Constant, Pco, Sparse, plus Zstd's + native lib hand-off) retain a documented trust transfer. +- **`Registry.decodeAsSegment` deleted.** The current adapter exists only + because cross-module dispatch needs a raw-segment escape; once decoders + are co-resident with the byte source, the adapter is no longer needed. +- **Read-only deployments shrink.** No transitive pull on `zstd-jni` + encode paths, FSST dictionary builders, ALP encoders, etc. The CLI + inspector becomes a true read-only artifact. 
+- **Dependency direction matches data flow.** `reader` depends on `core`; + `writer` depends on `core`; neither depends on the other. Today both + live inside `core` and the dependency direction is invisible. + +### Negative + +- **Multi-day refactor.** ~9 person-days plus CI iteration. Cannot land in + a single PR; must be staged carefully so each phase runs green. +- **Encoding impls double in file count** during transition. `BitpackedEncoding` + becomes `BitpackedDecoder` (in `reader`) + `BitpackedEncoder` (in `writer`). + Test files split similarly. +- **CLAUDE.md updates** — the "three touch-points for adding an encoding" + rule becomes "decoder side + encoder side + EncodingId enum constant", + each in its own module. +- **CHANGELOG breaking-changes section grows.** External users (none today, + but any future ones) see `Encoding`, `Registry`, and `DecodeContext` + moved. Probably worth bundling under a 0.7.0 release boundary. +- **Two `ServiceLoader` manifests per encoding** instead of one. +- **Integration tests need re-routing.** Tests that today construct a + `Registry` and call `decode` directly will need to construct a + `ReadRegistry` instead — mechanical but pervasive. + +### Risks to manage + +- **Side-by-side period drift.** Phases 1–3 leave both the old `Registry` + and the new `ReadRegistry`/`WriteRegistry` registered for each encoding + during transition. Risk: divergent behaviour if a bug fix lands on one + side and not the other. Mitigation: integration tests run against both + paths during the transition; the old `Registry` becomes a thin forwarder + early in Phase 1. +- **Extension split.** `Extension` carries the same encode/decode tension + as `Encoding`; the migration plan assumes a parallel split. If the + extension API has tighter user-facing constraints (it does — see + `DateExtension.decodeAll`), Phase 5 may need a separate ADR. 
+- **JMH benchmarks.** `RustVsJavaReadBenchmark` and friends construct + `Registry` + `DecodeContext` directly. They live in the `performance` + module, which depends on `reader`. The benchmarks need re-wiring at the + end of Phase 2. + +## Alternatives considered + +- **Keep `core` as-is, hide `slice()` via Java modules (JPMS).** Drops + PR #27's escape-hatch noise but does not address the underlying smell: + `core` still hosts the read runtime, `reader` still calls back into + `core` for every operation, `Registry` still dispatches both sides, + and read-only deployments still pull the writer surface. Rejected as + cosmetic. +- **Move `FlatSegmentDecoder` alone into `reader`, leave everything else.** + Solves the immediate `slice()` problem at the cost of a circular module + dependency: `Encoding.decode` (in `core`) would call + `Registry.decode` (in `core`) which would route into + `FlatSegmentDecoder` (in `reader`). Rejected as architecturally worse + than the current state. +- **Adopt an existing pluggable codec framework (e.g. Arrow's + `CompressionCodec` SPI shape).** Considered briefly. Vortex's + cascading-encoding model has tighter requirements than Arrow's flat + codec model; an external SPI does not fit. Rejected. +- **Status quo + documentation.** Document that `core` is the read runtime + and `reader` is a shell. Cheapest. Rejected because every future + feature that needs cross-module byte access reintroduces the same + escape-hatch problem. + +## Decision drivers + +- The 33 `unwrapForSubParser` sites in PR #27 are a strong proxy signal: + every one of them documents a place where read code needs bytes that + live in another module. +- A genuinely read-only deployment (inspector + scan) should be possible + without pulling Zstd encoders or FSST builders. Today it is not. +- The bifunctional shape of the `Encoding` interface blocks ahead-of-time + pruning of the write surface; the refactor is the only path to a + smaller read-only artifact. 
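As a rough illustration of the type-level split these drivers point to, the sketch below models a metadata-only `Encoding`, separate decoder and encoder interfaces, and two distinct registry types. Everything in it is a simplified, hypothetical stand-in: the `int[]`/`byte[]` signatures, the `decoderFor`/`encoderFor` lookups, the toy one-byte-per-value encoding, and the choice to have the two SPIs extend `Encoding` are assumptions for illustration, not the project's actual API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the types named in this ADR.
final class RegistrySplitSketch {

    record EncodingId(String value) {}

    // Metadata-only surface that would stay in `core`.
    interface Encoding { EncodingId id(); }

    // Read-side SPI (would live in `reader`); decodes to int[] for brevity.
    interface EncodingDecoder extends Encoding { int[] decode(byte[] bytes); }

    // Write-side SPI (would live in `writer`).
    interface EncodingEncoder extends Encoding { byte[] encode(int[] values); }

    // Read dispatch only: no encode surface is reachable from this type.
    static final class ReadRegistry {
        private final Map<EncodingId, EncodingDecoder> decoders = new HashMap<>();
        ReadRegistry register(EncodingDecoder d) { decoders.put(d.id(), d); return this; }
        EncodingDecoder decoderFor(EncodingId id) {
            EncodingDecoder d = decoders.get(id);
            if (d == null) throw new IllegalArgumentException("no decoder for " + id);
            return d;
        }
    }

    // Write dispatch only.
    static final class WriteRegistry {
        private final Map<EncodingId, EncodingEncoder> encoders = new HashMap<>();
        WriteRegistry register(EncodingEncoder e) { encoders.put(e.id(), e); return this; }
        EncodingEncoder encoderFor(EncodingId id) {
            EncodingEncoder e = encoders.get(id);
            if (e == null) throw new IllegalArgumentException("no encoder for " + id);
            return e;
        }
    }

    static final EncodingId BYTES = new EncodingId("sketch.bytes");

    // Toy encoding: one byte per value (only meaningful for values in -128..127).
    static final class ByteDecoder implements EncodingDecoder {
        public EncodingId id() { return BYTES; }
        public int[] decode(byte[] bytes) {
            int[] out = new int[bytes.length];
            for (int i = 0; i < bytes.length; i++) out[i] = bytes[i];
            return out;
        }
    }

    static final class ByteEncoder implements EncodingEncoder {
        public EncodingId id() { return BYTES; }
        public byte[] encode(int[] values) {
            byte[] out = new byte[values.length];
            for (int i = 0; i < values.length; i++) out[i] = (byte) values[i];
            return out;
        }
    }

    // Stand-ins for the reader and writer entry points; each accepts only its own registry type.
    static int[] read(byte[] data, ReadRegistry rr) { return rr.decoderFor(BYTES).decode(data); }
    static byte[] write(int[] values, WriteRegistry wr) { return wr.encoderFor(BYTES).encode(values); }

    static int[] roundTrip(int[] values) {
        WriteRegistry wr = new WriteRegistry().register(new ByteEncoder());
        ReadRegistry rr = new ReadRegistry().register(new ByteDecoder());
        // read(bytes, wr) would be rejected by the compiler: a WriteRegistry is not a ReadRegistry.
        return read(write(values, wr), rr);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new int[] {1, 2, 3})));
    }
}
```

In this sketch a read-only caller references only `ReadRegistry` and `EncodingDecoder`, so nothing on the encode side needs to be on its classpath once the types sit in separate modules.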
+ +## References + +- [PR #27 — `sec(parser): BoundedSegment + audit trail for untrusted asSlice`](https://github.com/dfa1/vortex-java/pull/27) +- [Phase 1–4 commits — BoundedSegment introduction and migration](https://github.com/dfa1/vortex-java/pull/27/commits) +- [SECURITY.md — the contract this work hardens](../../SECURITY.md) +- [CLAUDE.md — current "three touch-points" rule for adding an encoding](../../CLAUDE.md) +- [TODO.md — parser hardening backlog](../../TODO.md) diff --git a/docs/adr/0002-pluggable-dtype-layout-compute.md b/docs/adr/0002-pluggable-dtype-layout-compute.md new file mode 100644 index 00000000..5d363ebf --- /dev/null +++ b/docs/adr/0002-pluggable-dtype-layout-compute.md @@ -0,0 +1,208 @@ +# ADR 0002: Runtime pluggability of DType, Layout, and Compute + +- **Status:** Deferred — awaiting a real downstream consumer use case +- **Date:** 2026-06-11 +- **Deciders:** project maintainer +- **Related:** [ADR 0001 — Split read and write runtimes](0001-split-read-and-write-runtimes.md), [TODO.md §Audit runtime pluggability vs Rust impl](../../TODO.md) + +## Context + +The Rust reference implementation +([vortex-data/vortex](https://github.com/vortex-data/vortex)) supports +**runtime registration** of four extension axes: + +| Axis | Rust SPI | Java status today | +|-----------|---------------------------------------------------------------------------------------------------------------------------|-------------------| +| Encodings | `VortexSession::register_encoding`, plus `allow_unknown()` passthrough | ✅ Pluggable via `ServiceLoader` + `EncodingRegistry.Builder.register()`; `allowUnknown()` matches Rust | +| DTypes | `VortexSession::register_dtype` | ❌ `sealed interface DType permits ...` — no user-extensible type | +| Layouts | `VortexSession::register_layout` | ❌ Fixed set: `Flat`, `Chunked`, `Zoned`, `Struct` (sealed) | +| Compute | `VortexSession::register_compute` | ❌ No compute layer yet | + +The maintainer flagged this gap on 2026-06-04. 
The existing TODO entry +(line 215) carries explicit guidance: **"short design note weighing +sealed-vs-pluggable for DType + Layout; revisit when Java impl has a +real downstream consumer asking for it. Don't pre-open these without +a use case."** + +This ADR is that design note. It does **not** propose changing the +current sealed shape. It records the trade-offs so that a future +maintainer faced with a real consumer ask can act quickly, and records +the criteria under which the deferral should be reconsidered. + +## Decision + +**Keep `DType` and `Layout` sealed for now. Defer the compute layer +entirely until the reader is feature-complete and at least one +downstream JVM consumer has filed a concrete request.** + +Re-evaluate when *any* of the following is true: + +1. A downstream JVM consumer (analytics engine, ETL framework, OLAP + query layer) files a concrete request to register a custom logical + type or layout, and provides the wire-format spec they need. +2. The Rust impl ships a new `DType` variant or layout that the Java + reader needs to interoperate with — at which point the cost of + adding it via the sealed hierarchy versus exposing an SPI becomes + a comparable conversation. +3. The compute layer is added on the Rust side in a way that the Java + reader needs to mirror for full file-level compatibility (e.g. + pruning predicates serialized into the layout tree). + +Under any of those conditions, this ADR is superseded by a follow-up +that picks a concrete shape from §"Shapes considered" below. + +## Trade-offs + +### Why sealed today is the right default + +- **Exhaustive `switch` patterns.** Every place the reader pattern-matches + on `DType` (`PostscriptParser.convertDType`, every encoding's + `accepts(DType)`, the inspector's column formatter) compiles to a + fast exhaustive switch with no default-case fallthrough. 
Adding a + user-defined variant requires either (a) breaking exhaustiveness or + (b) every consumer gracefully handling unknown DTypes, a real + expressiveness loss. +- **Equality and hashing are free.** Records give `equals`/`hashCode` + by structural comparison. A user-registered DType subtype with + arbitrary fields cannot guarantee stable equality across module + boundaries, leading to caching bugs in layers like `ColumnStats` + and dictionary lookup. +- **No premature abstraction.** The current sealed set covers every + encoding the project supports and every type the Rust reference + emits as of v0.74.0. Opening the hierarchy would pay a real cost for + flexibility that no current Java consumer uses. +- **CLAUDE.md alignment.** The project's tone explicitly prefers + sealed types + pattern matching as a stylistic norm. Opening the + hierarchy fights that tone for a hypothetical benefit. + +### When sealed becomes the wrong default + +- **Downstream consumer with a domain-specific logical type.** A team + building on top of vortex-java for, say, time-series IP-address + columns might want a `DType.IpAddress(byte version)` carrying its + own validity and parsing semantics, instead of stuffing it into + `Extension`. Today they can use `Extension` with a custom id; if + that ergonomic gap becomes noisy in their codebase, a real SPI + request follows. +- **Rust adds a new variant Java cannot read.** Today the only Rust + variant Java does not handle is `DType::Union` (added in Rust 0.71.0, + see `compatibility.md`). The current sealed hierarchy makes adding + it a 3-line patch. The cost-balance flips only if Rust starts + shipping experimental DTypes at a faster cadence than Java can + follow with sealed-hierarchy additions. + +### Why compute is deferred entirely + +- The Java reader does not yet have a query / filter pushdown layer. + Adding a compute SPI before there is anything to plug into is the + textbook overengineering case the TODO guidance is meant to prevent. 
+- The Rust compute layer drives layout-tree pruning via statistic + predicates. Java's `ZonedEncoding` does the equivalent today by + reading min/max stats directly out of the layout tree — a + hard-coded special case rather than a general predicate engine. + General-purpose pruning is a separate, large piece of work that + warrants its own design discussion. + +## Shapes considered (for the future un-deferral) + +Recorded so that the deferral can be lifted quickly. None are +recommended today. + +### Shape A — Open the sealed hierarchies via SPI, mirror Rust + +```java +public interface DType { // no longer sealed + EncodingId encodingHint(); + boolean nullable(); +} + +public final class DTypeRegistry { + public static DTypeRegistry builder() { ... } + public DTypeRegistry register(DTypeSpec spec) { ... } + public DType resolve(io.github.dfa1.vortex.proto.DType proto) { ... } +} +``` + +Pros: closest to Rust's `VortexSession::register_dtype`; downstream +extensions don't need an `Extension` wrapper hack. Cons: every +`switch (dtype)` in the project becomes non-exhaustive; equality +contract fragile across module boundaries. + +### Shape B — Keep sealed for spec types, allow nominal Extension expansion + +```java +public sealed interface DType permits ... { + record Extension(String id, DType storage, ByteBuffer meta, boolean nullable) + implements DType { ... } +} + +public interface ExtensionPlugin { + String id(); + Object decode(Extension dtype, Array storage); + Object encode(Extension dtype, Object value); +} +``` + +This is roughly the shape vortex-java already ships for the four spec +extensions (`Date`, `Time`, `Timestamp`, `Uuid`). The shift would be +to elevate the plugin interface to a public SPI rather than +package-private. Pros: doesn't break exhaustive switching on the +top-level hierarchy; downstream consumers register through a narrower +API; matches the "extension type" wire-format already in the proto +schema. 
Cons: not as expressive as Shape A — extensions piggy-back on +a storage type, no new physical layouts. + +### Shape C — Defer to JDK 25+ pattern-matching with default fallthroughs + +If at some point the language adds first-class "open sealed" support +or improved default-case ergonomics, Shape A becomes cheaper. Not a +present option; recorded for completeness. + +## Decision drivers (what would un-defer this) + +A future PR that wants to lift the deferral should be able to point to +all of the following, in writing: + +1. **A named downstream consumer.** Not a hypothetical "someone might + want X." A concrete project / team with a name and a use case. +2. **A spec for the new variant.** Wire format, serialisation, + round-trip semantics. Not just "register a custom type" in the + abstract. +3. **Confirmation the existing `Extension` mechanism does not fit.** + Most user-defined logical types belong in `Extension` already; an + SPI is only justified if `Extension` is the wrong shape for the + request. +4. **A statement on the equality + switch-exhaustiveness implications.** + How will reader-internal pattern matching adapt? Is the project + willing to add `default ->` branches across the codebase? + +Until those four boxes can be ticked, the deferral stands. + +## Consequences of deferring + +- **Positive.** No premature abstraction. Existing sealed hierarchies + keep their performance and ergonomic benefits. Compatibility-doc + follow-ups (e.g. adding `DType::Union`) remain 3-line patches. +- **Negative.** Downstream consumers with custom logical types must + go through `Extension` and tolerate its limitations (storage type + must be one of the built-in DTypes). Java cannot read Rust files + whose authors registered custom DTypes via `VortexSession::register_dtype` + — though no such files have been observed in practice. 
+- **Risk.** If the deferral becomes the default reflex ("we'll do it + when someone asks"), the project may miss a window where an + emerging Java analytics ecosystem (Polars-on-JVM, DuckDB JVM + bindings, etc.) lands and the SPI gap blocks adoption. Mitigation: + a maintainer should set a calendar reminder to revisit this ADR + every 6 months. + +## References + +- [TODO.md §"Audit runtime pluggability vs Rust impl"](../../TODO.md) + (line ~215, dated 2026-06-04) +- [ADR 0001 — Split read and write runtimes out of `core`](0001-split-read-and-write-runtimes.md) +- [docs/compatibility.md — Known wire-format gaps](../compatibility.md) + (notes `DType::Union` as one new variant Java does not yet decode) +- Rust upstream: + [`VortexSession`](https://docs.rs/vortex/latest/vortex/session/struct.VortexSession.html) + exposes `register_encoding`, `register_dtype`, `register_layout`, + `register_compute`.
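For readers weighing the sealed-versus-open trade-off recorded in this ADR, the following minimal sketch shows what exhaustive switching and record equality look like on a miniature sealed hierarchy. It assumes Java 21 or later (pattern matching for `switch`); the three variants, the `describe` helper, and the `example.ipv4` extension id are hypothetical and far smaller than the project's real `DType` set.

```java
// Hypothetical miniature of a sealed DType hierarchy; the real variant set differs.
final class SealedDTypeSketch {

    sealed interface DType permits Primitive, Utf8, Extension {}

    record Primitive(String ptype, boolean nullable) implements DType {}

    record Utf8(boolean nullable) implements DType {}

    // An extension piggy-backs on a built-in storage type, as in Shape B.
    record Extension(String id, DType storage, boolean nullable) implements DType {}

    static String describe(DType dtype) {
        // No `default` branch: the compiler rejects this switch if a permitted
        // variant is not covered, which is the property an open hierarchy gives up.
        return switch (dtype) {
            case Primitive p -> p.ptype() + (p.nullable() ? "?" : "");
            case Utf8 u -> "utf8" + (u.nullable() ? "?" : "");
            case Extension e ->
                "ext<" + e.id() + ">(" + describe(e.storage()) + ")" + (e.nullable() ? "?" : "");
        };
    }

    public static void main(String[] args) {
        DType ip = new Extension("example.ipv4", new Primitive("u32", false), true);
        System.out.println(describe(ip));
        // Records compare structurally, so equal shapes are equal values.
        System.out.println(new Utf8(true).equals(new Utf8(true)));
    }
}
```

Adding a fourth record to the `permits` clause in this sketch would turn `describe` into a compile error until a matching `case` is written, whereas an unsealed `DType` would require a `default` branch in every such switch.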