@@ -9,15 +9,11 @@ This RFC formalizes the Vortex type system by grounding `DType` as a quotient ty
99encodings and establishes a decision framework (based on refinement types) for when new ` DType `
1010variants are justified.
1111
12- ## Motivation
13-
14- Many of the Vortex maintainers have a good understanding of how the Vortex type system works: we
15- define a set of logical types, each of which can represent many physical data encodings. We
16- additionally define a set of ` Canonical ` encodings that represent the different targets that arrays
17- can decompress into.
18-
1912## Overview
2013
14+ Vortex defines a set of logical types, each of which can represent many physical data encodings, and
15+ a set of ` Canonical ` encodings that represent the different targets that arrays can decompress into.
16+
2117### Logical vs. Physical Types
2218
2319A ** logical type** (` DType ` ) describes what the data means, independent of how it is stored (e.g.,
@@ -45,7 +41,7 @@ Vortex's built-in set of logical types will not cover every use case. Extension
4541consumers to define their own logical types on top of existing ` DType ` s without modifying the core
4642type system. See [ RFC 0005] ( ./0005-extension.md ) for the full design.
4743
48- ### Formalization
44+ ## Motivation
4945
5046This definition has mostly worked well for us. However, several recent discussions have revealed
5147that this loose definition may be insufficient.
@@ -169,6 +165,10 @@ variable-length list data. Both are valid sections (since both pick a representa
169165equivalence class), and both satisfy ` π(s(d)) = d ` . The current system in Vortex simply hardcodes
170166one particular section.
171167
168+ Note that even dictionary encoding or run-end encoding are theoretically valid sections (they
169+ satisfy ` π(s(d)) = d ` ). The fact that we choose flat, uncompressed forms as canonical is a design
170+ choice optimized for compute, not a theoretical requirement.
171+
172172### The Church-Rosser Property (Confluence)
173173
174174A rewriting system has the ** Church-Rosser property** (or is ** confluent** ) if, whenever a term can
@@ -187,9 +187,11 @@ approach is to define **separate reduction relations**, each of which is interna
187187
188188For example, instead of one global set of reduction rules, you define two strategies: strategy A
189189always reduces to normal form X, and strategy B always reduces to normal form Y. Each strategy
190- satisfies Church-Rosser independently, the only difference is which normal form they target.
190+ satisfies Church-Rosser independently, the only difference is which normal form they target. This
191+ is relevant for future work: if Vortex were to support multiple canonical targets (e.g., both ` List `
192+ and ` ListView ` ), each target would define an internally confluent reduction strategy.
191193
192- ### Refinement Types and the DType Decision Framework
194+ ### Refinement Types
193195
194196A ** refinement type** ` { x : T | P(x) } ` is a type ` T ` restricted to values satisfying a predicate
195197` P ` . Refinement types express subtypes without changing the underlying representation, instead they
@@ -207,65 +209,67 @@ predicate `valid_utf8` is what justifies the separate `DType` variant: it gates
207209Without this predicate, ` Utf8 ` and ` Binary ` would be the same type, and maintaining both would be
208210redundant.
209211
210- This gives us a concrete decision tree for whether a new ` DType ` variant is justified:
212+ ## What justifies a new type?
211213
212- ```
213- Does it gate different query operations?
214- │
215- Yes ───────────────┼─────────────── No
216- │ │
217- Does Vortex core Is it structurally distinct
218- own those ops? from an existing DType?
219- │ │
220- Yes ──────┼────── No Yes ──────┼────── No
221- │ │ │ │
222- Add to DType Extension Add to DType Model as canonical
223- (refinement type) type (new structure) form or encoding
224- ```
214+ With the formalizations above, we have a framework that gives us a set of questions to guide whether
215+ a new ` DType ` variant is justified:
216+
217+ 1 . ** Does it gate different query operations?** If yes, does Vortex core own those operations? If
218+ so, it should be a first-class ` DType ` (refinement type). If only external consumers need them,
219+ an extension type suffices (see [ RFC 0005] ( ./0005-extension.md ) ).
220+ 2 . ** Is it structurally distinct from an existing ` DType ` ?** If yes, there may be a case for a new
221+ ` DType ` variant. However, this depends on whether the structural difference provides enough
222+ practical benefit to justify the added complexity.
223+ 3 . If neither, it should just be a new physical encoding.
225224
226- As described in the [ overview ] ( #extension-types ) , the "Yes" branch distinguishes between
227- first-class ` DType ` variants and extension types based on who owns the gated operations. If Vortex
228- core provides kernels that require the predicate, it belongs in ` DType ` . If only external consumers
229- need it, an extension type suffices (see [ RFC 0005 ] ( ./0005-extension.md ) ) .
225+ These decisions are mostly design choices rather than strict theoretical requirements. For example,
226+ ` Utf8 ` could theoretically be an extension type over ` Binary ` with a ` valid_utf8 ` predicate. The
227+ main reason it is a first-class ` DType ` is because we want to have optimized string kernels in the
228+ core Vortex library .
230229
231- ## Should ` FixedSizeBinary ` Be a ` DType ` ?
230+ ### Should ` FixedSizeList ` be a type ?
232231
233- Applying the decision framework above to ` FixedSizeBinary<n> ` vs. ` FixedSizeList<u8, n> ` :
232+ We can apply this framework to an existing type, ` FixedSizeList ` :
234233
235- ### The Case For (Refinement / Nominal Argument)
234+ ** Does it gate different query operations?** No. A fixed-size list is a list with the additional
235+ constraint that every element has the same length ` n ` , but ` scalar_at ` , ` filter ` , ` take ` , etc. all
236+ behave identically regardless of whether the list is fixed-size.
236237
237- ` FixedSizeBinary<n> ` could be justified as a refinement type if it carries semantic distinctions
238- that ` FixedSizeList<u8, n> ` does not:
238+ ** Is it structurally distinct from an existing ` DType ` ?** Yes. ` FixedSizeList ` has a different
239+ physical layout (no offsets buffer, since all elements have the same size). However, this does not
240+ automatically justify a new ` DType ` variant. A ` List ` whose offsets are a constant stride would
241+ compress extremely well (a ` SequenceArray ` ), and encodings in Vortex are designed to exploit exactly
242+ this kind of redundancy.
239243
240- - ** Intent signaling.** ` FixedSizeBinary<n> ` says "this is opaque binary data" (UUIDs, hashes, IP
241- addresses), while ` FixedSizeList<u8, n> ` says "this is a list of bytes that happens to have a
242- fixed length."
243- - ** Schema compatibility.** Arrow, Parquet, and other formats distinguish these types. A
244- ` FixedSizeBinary ` ` DType ` makes round-tripping schemas lossless.
245- - ** Potential invariant.** ` FixedSizeBinary ` could carry the invariant that elements are not
246- individually addressable or meaningful. This is weaker than ` valid_utf8 ` but still semantic.
244+ The argument for keeping ` FixedSizeList ` as its own ` DType ` is that it makes the fixed-size
245+ invariant explicit at the type level, which simplifies downstream consumers that want to rely on
246+ it (e.g., fixed-shape tensors). The argument against is that it adds a variant to the ` DType ` enum
247+ that is logically equivalent to ` List ` with a constraint.
247248
248- Under this reading, ` FixedSizeBinary ` is a lightweight refinement type, and the nominal distinction
249- earns its place in ` DType ` .
249+ Ultimately, we decided in the past that the structural difference merits its own ` DType ` variant,
250+ but an argument can be made that it does not warrant one .
250251
251- ### The Case Against (Canonical Form / Section Argument)
252+ ### Should ` FixedSizeBinary ` be a type?
252253
253- ` FixedSizeBinary ` could instead be modeled as a canonical form (section target) of
254- ` FixedSizeList<u8, n> ` , or as extension type metadata:
254+ A similar question applies to ` FixedSizeBinary<n> ` vs. ` FixedSizeList<u8, n> ` :
255255
256- - ** No gating predicate.** If no operations are meaningful on ` FixedSizeBinary ` that are not also
257- meaningful on ` FixedSizeList<u8, n> ` , then the predicate is empty and the refinement is trivial.
258- - ** Leaky abstraction risk.** Every query engine function that handles list types would need to
259- additionally handle ` FixedSizeBinary ` , or we would need coercion rules. If the handling is always
260- identical, the ` DType ` distinction adds complexity without semantic payoff.
261- - ** Schema mapping as metadata.** The information "this came from a Parquet ` FixedSizeBinary `
262- column" could live in extension type metadata rather than in the core ` DType ` enum, keeping the
263- logical layer minimal.
264- - ** Complexity.** Adding yet another variant to the ` DType ` enum has a large surface area of change.
256+ ** Does it gate different query operations?** This is unclear. ` FixedSizeBinary<n> ` signals "opaque
257+ binary data" (UUIDs, hashes, IP addresses) whereas ` FixedSizeList<u8, n> ` signals "a list of bytes
258+ that happens to have a fixed length." One could argue that ` FixedSizeBinary ` carries the invariant
259+ that individual bytes are not independently meaningful, which would gate byte-level list operations.
260+ However, this invariant is weaker than something like ` valid_utf8 ` , and it is not obvious that the
261+ core Vortex library would ship any operations gated by it.
265262
266- Under this reading, ` FixedSizeBinary ` is an encoding or canonical form, not a logical type.
263+ If the answer is ** yes** (it gates operations), then the next question is whether Vortex core owns
264+ those operations. If so, ` FixedSizeBinary ` is a first-class refinement type. If not, it should be
265+ an extension type.
267266
267+ If the answer is ** no** (it does not gate operations), then: ** is it structurally distinct from an
268+ existing ` DType ` ?** ` FixedSizeBinary<n> ` has the same physical layout as ` FixedSizeList<u8, n> ` (a
269+ flat buffer of ` n ` -byte elements), so the answer is no. By the decision framework, this means it
270+ should be modeled as a canonical form or extension type metadata rather than a new ` DType ` variant.
268271
272+ This question remains unresolved. See [ Unresolved Questions] ( #unresolved-questions ) .
269273
270274## Prior Art
271275
@@ -279,7 +283,7 @@ Under this reading, `FixedSizeBinary` is an encoding or canonical form, not a lo
279283## Unresolved Questions
280284
281285- Should ` FixedSizeBinary<n> ` be a ` DType ` variant (refinement type) or extension type metadata?
282- See the [ analysis above] ( #should-fixedsizebinary-be-a-dtype ) for the case for and against. It is
286+ See the [ analysis above] ( #should-fixedsizebinary-be-a-type ) for the case for and against. It is
283287 not so easy to claim one argument here is better than the other. Comments would be appreciated!
284288
285289## Future Possibilities
0 commit comments