docs: add chunk-layout design doc #4040
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main    #4040   +/-   ##
=======================================
  Coverage   93.53%   93.53%
=======================================
  Files          88       88
  Lines       11894    11894
=======================================
  Hits        11125    11125
  Misses        769      769
### Sharding

- The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim.unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
+ The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim._unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
I don't think we want to commit to creating a full Python object per shard just to model the chunk layout. We should probably say that the sharding codec models shards as having a particular chunk grid.
@dataclass(frozen=True, kw_only=True)
class RegularChunkLayout(ChunkLayout):
    """All chunks at this level share one uniform shape."""

    chunk_shape: tuple[int, ...]

    @property
    def ndim(self) -> int:
        return len(self.chunk_shape)


@dataclass(frozen=True, kw_only=True)
class RectilinearChunkLayout(ChunkLayout):
    """Per-dimension chunk specs: a bare int (uniform size) or explicit edge lengths.

    Mirrors ``RectilinearChunkGridMetadata.chunk_shapes``, including the
    bare-int shorthand, so the distillation round-trips faithfully.
    """

    chunk_shapes: tuple[int | tuple[int, ...], ...]

    @property
    def ndim(self) -> int:
        return len(self.chunk_shapes)
Given that all regular chunk grids are also rectilinear grids, I don't think representing these as two peer classes will hold up. This is why we went with just one chunk grid class. The two grid types differ in metadata, but internally we always need to use the same representation. So instead of using nominal typing here, what if we reuse the `is_regular` property on the chunk grid?
How will that approach extend to future chunk grid types? Will we add an 'is_x'?
Which future chunk grid types are you thinking of? My understanding was that the question we are answering here is "are all the chunks the same, or not". This seems resolvable for any future chunk grid.
I do not have one specifically in mind right now. My thinking is that we should expect to need to extend any component of the zarr spec that is extensible via the zarr extensions repository. I don't want us to need to do a large refactor or change the API if a new chunk grid is added.
"regular grid" vs "irregular grid" is a property of an abstract grid of numbers. It's not coupled to zarr specs. I thought we were deliberately not exposing details of the exact zarr metadata here, for this very reason: chunk_grid: {"name": "rectilinear", ...} and chunk_grid: {"name": "regular", ...} can describe the exact same abstract grid.
Version: 1

Design document for adding a public, typed chunk-structure introspection API to **zarr-python**: a `ChunkLayout` object that distills the chunk grid metadata and sharding codec configuration of an array.
the observed chunk layout depends on all the codecs, not just the sharding codec
1. **Is this array sharded?** ([#4036](https://github.com/zarr-developers/zarr-python/issues/4036)) — `.shards is not None` works for regular chunk grids, but `.shards` raises `NotImplementedError` for rectilinear grids, so sharded-rectilinear arrays require `try/except`.
2. **What kind of chunk grid does this array use?** ([#4035](https://github.com/zarr-developers/zarr-python/issues/4035)) — requires either catching `NotImplementedError` from `.chunks` or `isinstance(array.metadata.chunk_grid, RegularChunkGridMetadata)` with an import from `zarr.core.metadata.v3`, which is private.
I don't think array consumers care if the array uses the sharding codec in particular, or if the array uses chunk grid metadata A or B.
Consumers more likely need to know the partition sizes for reading and writing. This is more abstract than the presence / configuration of the sharding codec, or the specific chunk grid metadata.
For example, if the codecs look like `[sharding_codec(chunk_shape=(10,10)), gzipcodec()]`, then the effective read and write partitions are defined entirely by the chunk grid and not the sharding configuration. Because every whole shard is gzip-encoded, you can't do subchunk reads or writes. But if we just report the configuration of the sharding codec here, we would give a misleading representation of the inner chunk shape. This is a useless application of the sharding codec but it's perfectly valid array metadata.
And if the storage backend supports partial writes, and if the array -> bytes and bytes -> bytes codecs are array-selection preserving (e.g., `codecs=["bytes"]`), then the effective read and write partition is `(1, ...)`, because every array scalar is individually addressable by byte range in the output chunk, regardless of the chunk grid!
at the same time, it's also important to know the capacity of every stored object, even if you can read and write at a finer granularity. So ultimately users need access to:
- the size of each chunk (per-file partition)
- the size of each subchunk (inside-a-file partitions, recursive)
- the write granularity (bounded by chunk size)
- the read granularity (bounded by chunk size)
I think we need to add some capabilities to our codecs to ensure that we can distill all this information.
I mean chunk grid in the sense of the spec-level grid type, not our internal classes. So, as of today, regular vs rectilinear. But there could be others in the future since it's an extension point.
Xarray needs this because it reads an array from one store and writes it to another (or back to the same one) without changing its structure. It can't just copy the metadata document because the data round-trips through xarray's data model, where the layout survives as an encoding dict that becomes create_array(chunks=..., shards=...) kwargs on write. So it needs the layout configuration through public API, as values it can pass back to create_array. Effective read/write granularities are the right abstraction for access patterns, but they're lossy for reconstruction. This design doc is about the reconstruction use case.
I don't understand the reconstruction thing. The whole point of our array API is that we abstract over differences in array metadata. You can declare the exact same chunk grid with regular or rectilinear metadata. Why does xarray need to care about that?
This PR adds a design doc to decide a public API for #4036 and #4035. It also updates the ChunkGrid design doc to match what landed in Zarr-Python.
TODO:
- docs/user-guide/*.md
- changes/