Skip to content

Commit 29d7535

Browse files
committed
revise decision framework
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent 4fd0b69 commit 29d7535

1 file changed

Lines changed: 60 additions & 56 deletions

File tree

proposed/0029-types.md

Lines changed: 60 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -9,15 +9,11 @@ This RFC formalizes the Vortex type system by grounding `DType` as a quotient ty
99
encodings and establishes a decision framework (based on refinement types) for when new `DType`
1010
variants are justified.
1111

12-
## Motivation
13-
14-
Many of the Vortex maintainers have a good understanding of how the Vortex type system works: we
15-
define a set of logical types, each of which can represent many physical data encodings. We
16-
additionally define a set of `Canonical` encodings that represent the different targets that arrays
17-
can decompress into.
18-
1912
## Overview
2013

14+
Vortex defines a set of logical types, each of which can represent many physical data encodings, and
15+
a set of `Canonical` encodings that represent the different targets that arrays can decompress into.
16+
2117
### Logical vs. Physical Types
2218

2319
A **logical type** (`DType`) describes what the data means, independent of how it is stored (e.g.,
@@ -45,7 +41,7 @@ Vortex's built-in set of logical types will not cover every use case. Extension
4541
consumers to define their own logical types on top of existing `DType`s without modifying the core
4642
type system. See [RFC 0005](./0005-extension.md) for the full design.
4743

48-
### Formalization
44+
## Motivation
4945

5046
This definition has mostly worked well for us. However, several recent discussions have revealed
5147
that this loose definition may be insufficient.
@@ -169,6 +165,10 @@ variable-length list data. Both are valid sections (since both pick a representa
169165
equivalence class), and both satisfy `π(s(d)) = d`. The current system in Vortex simply hardcodes
170166
one particular section.
171167

168+
Note that even dictionary encoding or run-end encoding are theoretically valid sections (they
169+
satisfy `π(s(d)) = d`). The fact that we choose flat, uncompressed forms as canonical is a design
170+
choice optimized for compute, not a theoretical requirement.
171+
172172
### The Church-Rosser Property (Confluence)
173173

174174
A rewriting system has the **Church-Rosser property** (or is **confluent**) if, whenever a term can
@@ -187,9 +187,11 @@ approach is to define **separate reduction relations**, each of which is interna
187187

188188
For example, instead of one global set of reduction rules, you define two strategies: strategy A
189189
always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy
190-
satisfies Church-Rosser independently, the only difference is which normal form they target.
190+
satisfies Church-Rosser independently, the only difference is which normal form they target. This
191+
is relevant for future work: if Vortex were to support multiple canonical targets (e.g., both `List`
192+
and `ListView`), each target would define an internally confluent reduction strategy.
191193

192-
### Refinement Types and the DType Decision Framework
194+
### Refinement Types
193195

194196
A **refinement type** `{ x : T | P(x) }` is a type `T` restricted to values satisfying a predicate
195197
`P`. Refinement types express subtypes without changing the underlying representation, instead they
@@ -207,65 +209,67 @@ predicate `valid_utf8` is what justifies the separate `DType` variant: it gates
207209
Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both would be
208210
redundant.
209211

210-
This gives us a concrete decision tree for whether a new `DType` variant is justified:
212+
## What justifies a new type?
211213

212-
```
213-
Does it gate different query operations?
214-
215-
Yes ───────────────┼─────────────── No
216-
│ │
217-
Does Vortex core Is it structurally distinct
218-
own those ops? from an existing DType?
219-
│ │
220-
Yes ──────┼────── No Yes ──────┼────── No
221-
│ │ │ │
222-
Add to DType Extension Add to DType Model as canonical
223-
(refinement type) type (new structure) form or encoding
224-
```
214+
With the formalizations above, we have a framework that gives us a set of questions to guide whether
215+
a new `DType` variant is justified:
216+
217+
1. **Does it gate different query operations?** If yes, does Vortex core own those operations? If
218+
so, it should be a first-class `DType` (refinement type). If only external consumers need them,
219+
an extension type suffices (see [RFC 0005](./0005-extension.md)).
220+
2. **Is it structurally distinct from an existing `DType`?** If yes, there may be a case for a new
221+
`DType` variant. However, this depends on whether the structural difference provides enough
222+
practical benefit to justify the added complexity.
223+
3. If neither, it should just be a new physical encoding.
225224

226-
As described in the [overview](#extension-types), the "Yes" branch distinguishes between
227-
first-class `DType` variants and extension types based on who owns the gated operations. If Vortex
228-
core provides kernels that require the predicate, it belongs in `DType`. If only external consumers
229-
need it, an extension type suffices (see [RFC 0005](./0005-extension.md)).
225+
These decisions are mostly design choices rather than strict theoretical requirements. For example,
226+
`Utf8` could theoretically be an extension type over `Binary` with a `valid_utf8` predicate. The
227+
main reason it is a first-class `DType` is because we want to have optimized string kernels in the
228+
core Vortex library.
230229

231-
## Should `FixedSizeBinary` Be a `DType`?
230+
### Should `FixedSizeList` be a type?
232231

233-
Applying the decision framework above to `FixedSizeBinary<n>` vs. `FixedSizeList<u8, n>`:
232+
We can apply this framework to an existing type, `FixedSizeList`:
234233

235-
### The Case For (Refinement / Nominal Argument)
234+
**Does it gate different query operations?** No. A fixed-size list is a list with the additional
235+
constraint that every element has the same length `n`, but `scalar_at`, `filter`, `take`, etc. all
236+
behave identically regardless of whether the list is fixed-size.
236237

237-
`FixedSizeBinary<n>` could be justified as a refinement type if it carries semantic distinctions
238-
that `FixedSizeList<u8, n>` does not:
238+
**Is it structurally distinct from an existing `DType`?** Yes. `FixedSizeList` has a different
239+
physical layout (no offsets buffer, since all elements have the same size). However, this does not
240+
automatically justify a new `DType` variant. A `List` whose offsets are a constant stride would
241+
compress extremely well (a `SequenceArray`), and encodings in Vortex are designed to exploit exactly
242+
this kind of redundancy.
239243

240-
- **Intent signaling.** `FixedSizeBinary<n>` says "this is opaque binary data" (UUIDs, hashes, IP
241-
addresses), while `FixedSizeList<u8, n>` says "this is a list of bytes that happens to have a
242-
fixed length."
243-
- **Schema compatibility.** Arrow, Parquet, and other formats distinguish these types. A
244-
`FixedSizeBinary` `DType` makes round-tripping schemas lossless.
245-
- **Potential invariant.** `FixedSizeBinary` could carry the invariant that elements are not
246-
individually addressable or meaningful. This is weaker than `valid_utf8` but still semantic.
244+
The argument for keeping `FixedSizeList` as its own `DType` is that it makes the fixed-size
245+
invariant explicit at the type level, which simplifies downstream consumers that want to rely on
246+
it (e.g., fixed-shape tensors). The argument against is that it adds a variant to the `DType` enum
247+
that is logically equivalent to `List` with a constraint.
247248

248-
Under this reading, `FixedSizeBinary` is a lightweight refinement type, and the nominal distinction
249-
earns its place in `DType`.
249+
Ultimately, we decided in the past that the structural difference merits its own `DType` variant,
250+
but an argument can be made that it does not warrant one.
250251

251-
### The Case Against (Canonical Form / Section Argument)
252+
### Should `FixedSizeBinary` be a type?
252253

253-
`FixedSizeBinary` could instead be modeled as a canonical form (section target) of
254-
`FixedSizeList<u8, n>`, or as extension type metadata:
254+
A similar question applies to `FixedSizeBinary<n>` vs. `FixedSizeList<u8, n>`:
255255

256-
- **No gating predicate.** If no operations are meaningful on `FixedSizeBinary` that are not also
257-
meaningful on `FixedSizeList<u8, n>`, then the predicate is empty and the refinement is trivial.
258-
- **Leaky abstraction risk.** Every query engine function that handles list types would need to
259-
additionally handle `FixedSizeBinary`, or we would need coercion rules. If the handling is always
260-
identical, the `DType` distinction adds complexity without semantic payoff.
261-
- **Schema mapping as metadata.** The information "this came from a Parquet `FixedSizeBinary`
262-
column" could live in extension type metadata rather than in the core `DType` enum, keeping the
263-
logical layer minimal.
264-
- **Complexity.** Adding yet another variant to the `DType` enum has a large surface area of change.
256+
**Does it gate different query operations?** This is unclear. `FixedSizeBinary<n>` signals "opaque
257+
binary data" (UUIDs, hashes, IP addresses) whereas `FixedSizeList<u8, n>` signals "a list of bytes
258+
that happens to have a fixed length." One could argue that `FixedSizeBinary` carries the invariant
259+
that individual bytes are not independently meaningful, which would gate byte-level list operations.
260+
However, this invariant is weaker than something like `valid_utf8`, and it is not obvious that the
261+
core Vortex library would ship any operations gated by it.
265262

266-
Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a logical type.
263+
If the answer is **yes** (it gates operations), then the next question is whether Vortex core owns
264+
those operations. If so, `FixedSizeBinary` is a first-class refinement type. If not, it should be
265+
an extension type.
267266

267+
If the answer is **no** (it does not gate operations), then: **is it structurally distinct from an
268+
existing `DType`?** `FixedSizeBinary<n>` has the same physical layout as `FixedSizeList<u8, n>` (a
269+
flat buffer of `n`-byte elements), so the answer is no. By the decision framework, this means it
270+
should be modeled as a canonical form or extension type metadata rather than a new `DType` variant.
268271

272+
This question remains unresolved. See [Unresolved Questions](#unresolved-questions).
269273

270274
## Prior Art
271275

@@ -279,7 +283,7 @@ Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a lo
279283
## Unresolved Questions
280284

281285
- Should `FixedSizeBinary<n>` be a `DType` variant (refinement type) or extension type metadata?
282-
See the [analysis above](#should-fixedsizebinary-be-a-dtype) for the case for and against. It is
286+
See the [analysis above](#should-fixedsizebinary-be-a-type) for the case for and against. It is
283287
not so easy to claim one argument here is better than the other. Comments would be appreciated!
284288

285289
## Future Possibilities

0 commit comments

Comments
 (0)