docs: add chunk-layout design doc by maxrjones · Pull Request #4040 · zarr-developers/zarr-python

maxrjones · 2026-06-05T17:50:58Z

This PR adds a design doc to decide a public API for #4036 and #4035. It also updates the ChunkGrid design doc to match what landed in Zarr-Python.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)


codecov · 2026-06-05T17:59:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.53%. Comparing base (b871a22) to head (30eb7a8).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4040   +/-   ##
=======================================
  Coverage   93.53%   93.53%           
=======================================
  Files          88       88           
  Lines       11894    11894           
=======================================
  Hits        11125    11125           
  Misses        769      769


d-v-b · 2026-06-06T09:36:48Z

 ### Sharding

-The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim.unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
+The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim._unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).


i don't think we want to commit to creating a full python object per shard just to model the chunk layout. We should probably say that the sharding codec models shards as having a particular chunk grid.
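The divisibility check described in the diff above can be sketched without building any per-shard object. This is a minimal illustration only: `validate_shard_divisibility` and its parameters are hypothetical names, not part of zarr-python, and the per-dimension edge lengths are assumed to be available as plain iterables of ints.

```python
from collections.abc import Iterable, Sequence


def validate_shard_divisibility(
    edge_lengths_per_dim: Sequence[Iterable[int]],
    inner_chunk_shape: Sequence[int],
) -> None:
    """Raise ValueError if an outer edge length is not a multiple of the inner chunk size.

    Hypothetical helper for illustration; not the zarr-python implementation.
    """
    if len(edge_lengths_per_dim) != len(inner_chunk_shape):
        raise ValueError("edge lengths and inner chunk shape differ in dimensionality")
    for axis, (edges, inner) in enumerate(zip(edge_lengths_per_dim, inner_chunk_shape)):
        # Only the unique edge lengths need checking, so deduplicate first.
        for edge in sorted(set(edges)):
            if edge % inner != 0:
                raise ValueError(
                    f"axis {axis}: edge length {edge} is not divisible by "
                    f"inner chunk size {inner}"
                )


# A fixed dimension contributes one edge length; a varying one contributes several.
validate_shard_divisibility([(100,), (20, 40, 20)], (10, 20))
```

For a fixed dimension the set holds a single value, which matches the O(1) case mentioned in the doc text.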

d-v-b · 2026-06-06T09:43:33Z

+@dataclass(frozen=True, kw_only=True)
+class RegularChunkLayout(ChunkLayout):
+    """All chunks at this level share one uniform shape."""
+
+    chunk_shape: tuple[int, ...]
+
+    @property
+    def ndim(self) -> int:
+        return len(self.chunk_shape)
+
+
+@dataclass(frozen=True, kw_only=True)
+class RectilinearChunkLayout(ChunkLayout):
+    """Per-dimension chunk specs: a bare int (uniform size) or explicit edge lengths.
+
+    Mirrors ``RectilinearChunkGridMetadata.chunk_shapes``, including the
+    bare-int shorthand, so the distillation round-trips faithfully.
+    """
+
+    chunk_shapes: tuple[int | tuple[int, ...], ...]
+
+    @property
+    def ndim(self) -> int:
+        return len(self.chunk_shapes)


given that all regular chunk grids are also rectilinear grids, I don't think representing these as two peer classes will hold up. This is why we went with just 1 chunk grid class. The two grid types are different in metadata, but internally we need to always use the same representation. so instead of using nominal typing here, what if we re-use the is_regular property on the chunk grid?

How will that approach extend to future chunk grid types? Will we add an 'is_x'?

which future chunk grid types are you thinking of? my understanding was that the question we are answering here is "are all the chunks the same shape, or not". This seems resolvable for any future chunk grid.

I do not have one specifically in mind right now. My thinking is that we should expect to need to extend any component of the zarr spec that is extensible via the zarr extensions repository. I don't want us to need to do a large refactor or change the API if a new chunk grid is added.

"regular grid" vs "irregular grid" is a property of an abstract grid of numbers. It's not coupled to zarr specs. I thought we were deliberately not exposing details of the exact zarr metadata here, for this very reason: chunk_grid: {"name": "rectilinear", ...} and chunk_grid: {"name": "regular", ...} can describe the exact same abstract grid.

d-v-b · 2026-06-06T11:55:23Z

+
+Version: 1
+
+Design document for adding a public, typed chunk-structure introspection API to **zarr-python**: a `ChunkLayout` object that distills the chunk grid metadata and sharding codec configuration of an array.


the observed chunk layout depends on all the codecs, not just the sharding codec

d-v-b · 2026-06-06T12:00:36Z

+1. **Is this array sharded?** ([#4036](https://github.com/zarr-developers/zarr-python/issues/4036)) — `.shards is not None` works for regular chunk grids, but `.shards` raises `NotImplementedError` for rectilinear grids, so sharded-rectilinear arrays require `try/except`.
+2. **What kind of chunk grid does this array use?** ([#4035](https://github.com/zarr-developers/zarr-python/issues/4035)) — requires either catching `NotImplementedError` from `.chunks` or `isinstance(array.metadata.chunk_grid, RegularChunkGridMetadata)` with an import from `zarr.core.metadata.v3`, which is private.
+


I don't think array consumers care if the array uses the sharding codec in particular, or if the array uses chunk grid metadata A or B.

Consumers more likely need to know the partition sizes for reading and writing. This is more abstract than the presence / configuration of the sharding codec, or the specific chunk grid metadata.

for example, if the codecs look like [sharding_codec(chunk_shape=(10,10)), gzipcodec()], then the effective read and write partitions are defined entirely by the chunk grid and not the sharding configuration. Because every whole shard is gzip-encoded, you can't do subchunk reads or writes. But if we just report the configuration of the sharding codec here, we would give a misleading representation of the inner chunk shape. This is a useless application of the sharding codec but it's perfectly valid array metadata.

And if the storage backend supports partial writes, and if the array -> bytes and bytes -> bytes codecs are array-selection preserving (e.g., codecs=["bytes"]), then the effective read and write partition is (1, ...), because every array scalar is individually addressable by byte range in the output chunk, regardless of the chunk grid!

at the same time, it's also important to know the capacity of every stored object, even if you can read and write at a finer granularity. So ultimately users need access to:

- the size of each chunk (per-file partition)
- the size of each subchunk (inside-a-file partitions, recursive)
- the write granularity (bounded by chunk size)
- the read granularity (bounded by chunk size)

i think we need to add some capabilities to our codecs to ensure that we can distill all this information.

I mean chunk grid in the sense of the spec-level grid type, not our internal classes. So, as of today, regular vs rectilinear. But there could be others in the future since it's an extension point.

Xarray needs this because it reads an array from one store and writes it to another (or back to the same one) without changing its structure. It can't just copy the metadata document because the data round-trips through xarray's data model, where the layout survives as an encoding dict that becomes create_array(chunks=..., shards=...) kwargs on write. So it needs the layout configuration through public API, as values it can pass back to create_array. Effective read/write granularities are the right abstraction for access patterns, but they're lossy for reconstruction. This design doc is about the reconstruction use case.
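A minimal sketch of the reconstruction path described above, assuming the layout survives as a dict with `"chunks"` and an optional `"shards"` entry. The function name and the dict keys are hypothetical and are not xarray's actual encoding schema; only the `chunks=` and `shards=` keyword names come from the comment.

```python
from typing import Any


def create_array_kwargs(encoding: dict[str, Any]) -> dict[str, Any]:
    """Turn a stored layout dict back into chunks=/shards= keyword arguments.

    Hypothetical helper: the "chunks" and "shards" keys are assumptions made
    for this sketch.
    """
    kwargs: dict[str, Any] = {"chunks": tuple(encoding["chunks"])}
    shards = encoding.get("shards")
    if shards is not None:
        # Only pass shards= when the source array declared a shard shape.
        kwargs["shards"] = tuple(shards)
    return kwargs


sharded = create_array_kwargs({"chunks": [10, 10], "shards": [100, 100]})
unsharded = create_array_kwargs({"chunks": [10, 10]})
```

The point of the sketch is that every value must be recoverable through public API in a form that can be passed straight back on write.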

i don't understand the reconstruction thing. the whole point of our array API is that we abstract over differences in array metadata. you can declare the exact same chunk grid with regular or rectilinear metadata. why does xarray need to care about that?

maxrjones added 2 commits June 5, 2026 13:41

docs: update chunk-grid design doc to v7 to match implementation

293f731

docs: add chunk-layout design doc for public chunk-structure introspe…

58bc27c

…ction

github-actions Bot added the `needs release notes` label (automatically applied to PRs which haven't added release notes) Jun 5, 2026

maxrjones changed the title ~~docs: update chunk-grid design doc to v7 to match implementation~~ docs: add chunk-layout design doc Jun 5, 2026

d-v-b reviewed Jun 6, 2026


Merge branch 'main' into docs/design-chunks

fcd4d1b

d-v-b reviewed Jun 6, 2026


docs: V2 design

30eb7a8


