docs: add chunk-layout design doc #4040

Open
maxrjones wants to merge 4 commits into zarr-developers:main from maxrjones:docs/design-chunks

@maxrjones
Member

This PR adds a design doc to decide a public API for #4036 and #4035. It also updates the ChunkGrid design doc to match what landed in Zarr-Python.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions Bot added the `needs release notes` label Jun 5, 2026
@maxrjones maxrjones changed the title from "docs: update chunk-grid design doc to v7 to match implementation" to "docs: add chunk-layout design doc" Jun 5, 2026
@codecov

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.53%. Comparing base (b871a22) to head (30eb7a8).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4040   +/-   ##
=======================================
  Coverage   93.53%   93.53%           
=======================================
  Files          88       88           
  Lines       11894    11894           
=======================================
  Hits        11125    11125           
  Misses        769      769           

Comment thread design/chunk-grid.md
### Sharding

The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim.unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
The `ShardingCodec` constructs a `ChunkGrid` per shard using the shard shape as extent and the subchunk shape as `FixedDimension`. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using `dim._unique_edge_lengths` for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
Contributor

I don't think we want to commit to creating a full Python object per shard just to model the chunk layout. We should probably say that the sharding codec models shards as having a particular chunk grid.

Comment thread design/chunk-layout.md Outdated
Comment on lines +93 to +116
@dataclass(frozen=True, kw_only=True)
class RegularChunkLayout(ChunkLayout):
    """All chunks at this level share one uniform shape."""

    chunk_shape: tuple[int, ...]

    @property
    def ndim(self) -> int:
        return len(self.chunk_shape)


@dataclass(frozen=True, kw_only=True)
class RectilinearChunkLayout(ChunkLayout):
    """Per-dimension chunk specs: a bare int (uniform size) or explicit edge lengths.

    Mirrors ``RectilinearChunkGridMetadata.chunk_shapes``, including the
    bare-int shorthand, so the distillation round-trips faithfully.
    """

    chunk_shapes: tuple[int | tuple[int, ...], ...]

    @property
    def ndim(self) -> int:
        return len(self.chunk_shapes)
Contributor

Given that all regular chunk grids are also rectilinear grids, I don't think representing these as two peer classes will hold up. This is why we went with just one chunk grid class. The two grid types differ in metadata, but internally we always need to use the same representation. So instead of using nominal typing here, what if we reuse the `is_regular` property on the chunk grid?
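
A minimal sketch of the single-class idea, under stated assumptions: `Grid` and its `edges` field are hypothetical stand-ins, not the zarr-python `ChunkGrid` implementation, and only the `is_regular` name comes from this discussion. It ignores clipping of the final chunk at the array boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """Hypothetical stand-in: one representation for every chunk grid."""

    # One tuple of chunk edge lengths per array dimension.
    edges: tuple[tuple[int, ...], ...]

    @property
    def is_regular(self) -> bool:
        # Regular means every dimension uses a single edge length.
        return all(len(set(dim)) <= 1 for dim in self.edges)


uniform = Grid(edges=((10, 10, 10), (5, 5)))
varying = Grid(edges=((10, 20, 30), (5, 5)))
```

Here `uniform.is_regular` is true and `varying.is_regular` is false, while both values share one type, so callers branch on a property rather than on `isinstance`.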

Member Author

How will that approach extend to future chunk grid types? Will we add an 'is_x'?

Contributor

Which future chunk grid types are you thinking of? My understanding was that the question we are answering here is "are all the chunks the same shape, or not". This seems resolvable for any future chunk grid.

Member Author

I do not have one specifically in mind right now. My thinking is that we should expect to need to extend any component of the zarr spec that is extensible via the zarr extensions repository. I don't want us to need to do a large refactor or change the API if a new chunk grid is added.

Contributor

"regular grid" vs "irregular grid" is a property of an abstract grid of numbers. It's not coupled to zarr specs. I thought we were deliberately not exposing details of the exact zarr metadata here, for this very reason: chunk_grid: {"name": "rectilinear", ...} and chunk_grid: {"name": "regular", ...} can describe the exact same abstract grid.

Comment thread design/chunk-layout.md Outdated

Version: 1

Design document for adding a public, typed chunk-structure introspection API to **zarr-python**: a `ChunkLayout` object that distills the chunk grid metadata and sharding codec configuration of an array.
Contributor

The observed chunk layout depends on all the codecs, not just the sharding codec.

Comment thread design/chunk-layout.md
Comment on lines +22 to +24
1. **Is this array sharded?** ([#4036](https://github.com/zarr-developers/zarr-python/issues/4036)) — `.shards is not None` works for regular chunk grids, but `.shards` raises `NotImplementedError` for rectilinear grids, so sharded-rectilinear arrays require `try/except`.
2. **What kind of chunk grid does this array use?** ([#4035](https://github.com/zarr-developers/zarr-python/issues/4035)) — requires either catching `NotImplementedError` from `.chunks` or `isinstance(array.metadata.chunk_grid, RegularChunkGridMetadata)` with an import from `zarr.core.metadata.v3`, which is private.

Contributor

I don't think array consumers care if the array uses the sharding codec in particular, or if the array uses chunk grid metadata A or B.

Consumers more likely need to know the partition sizes for reading and writing. This is more abstract than the presence / configuration of the sharding codec, or the specific chunk grid metadata.

Contributor

For example, if the codecs look like `[sharding_codec(chunk_shape=(10,10)), gzipcodec()]`, then the effective read and write partitions are defined entirely by the chunk grid and not the sharding configuration. Because every whole shard is gzip-encoded, you can't do subchunk reads or writes. But if we just report the configuration of the sharding codec here, we would give a misleading representation of the inner chunk shape. This is a useless application of the sharding codec, but it's perfectly valid array metadata.

And if the storage backend supports partial writes, and if the array -> bytes and bytes -> bytes codecs are array-selection preserving (e.g., `codecs=["bytes"]`), then the effective read and write partition is `(1, ...)`, because every array scalar is individually addressable by byte range in the output chunk, regardless of the chunk grid!

Contributor

@d-v-b d-v-b Jun 6, 2026

At the same time, it's also important to know the capacity of every stored object, even if you can read and write at a finer granularity. So ultimately users need access to:

  • the size of each chunk (per-file partition)
  • the size of each subchunk (inside-a-file partitions, recursive)
  • the write granularity (bounded by chunk size)
  • the read granularity (bounded by chunk size)

Contributor

I think we need to add some capabilities to our codecs to ensure that we can distill all this information.

Member Author

I mean chunk grid in the sense of the spec-level grid type, not our internal classes. So, as of today, regular vs rectilinear. But there could be others in the future since it's an extension point.

Xarray needs this because it reads an array from one store and writes it to another (or back to the same one) without changing its structure. It can't just copy the metadata document because the data round-trips through xarray's data model, where the layout survives as an encoding dict that becomes create_array(chunks=..., shards=...) kwargs on write. So it needs the layout configuration through public API, as values it can pass back to create_array. Effective read/write granularities are the right abstraction for access patterns, but they're lossy for reconstruction. This design doc is about the reconstruction use case.

Contributor

I don't understand the reconstruction thing. The whole point of our array API is that we abstract over differences in array metadata. You can declare the exact same chunk grid with regular or rectilinear metadata. Why does xarray need to care about that?
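
The claim that both metadata spellings can describe the same grid can be checked with a short sketch. `edges_from_metadata` is a hypothetical helper, and the rectilinear dictionary is a simplified assumption (explicit edge lengths only, no bare-int shorthand), not a faithful copy of the extension's metadata format.

```python
import math


def edges_from_metadata(
    chunk_grid: dict, shape: tuple[int, ...]
) -> tuple[tuple[int, ...], ...]:
    """Hypothetical helper: expand chunk grid metadata into per-dimension edge lengths."""
    config = chunk_grid["configuration"]
    if chunk_grid["name"] == "regular":
        # Nominal chunk size repeated until it covers the array extent.
        return tuple(
            (size,) * math.ceil(extent / size)
            for size, extent in zip(config["chunk_shape"], shape)
        )
    if chunk_grid["name"] == "rectilinear":
        # Simplified: explicit edge lengths only.
        return tuple(tuple(dim) for dim in config["chunk_shapes"])
    raise ValueError(f"unknown chunk grid: {chunk_grid['name']}")


shape = (30, 10)
regular = {"name": "regular", "configuration": {"chunk_shape": [10, 5]}}
rectilinear = {
    "name": "rectilinear",
    "configuration": {"chunk_shapes": [[10, 10, 10], [5, 5]]},
}
same_grid = edges_from_metadata(regular, shape) == edges_from_metadata(rectilinear, shape)
```

Both dictionaries expand to the same per-dimension edges for this array shape, which is the sense in which the abstract grid is independent of the metadata name.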

Labels

needs release notes Automatically applied to PRs which haven't added release notes

2 participants