Add sparse-read primitives: `shards_initialized` and `read_regions` by espg · Pull Request #4028 · zarr-developers/zarr-python

espg · 2026-06-03T19:21:12Z

Related to / closes #3929 (first of two PRs)

Summary

Adds two composable, public functions for efficiently reading sparse arrays — arrays where most chunks are empty and resolve to the fill value:

zarr.shards_initialized(array, *, strategy="auto") — discover which shards (or chunks, when unsharded) actually exist in the store.
zarr.read_regions(array, regions=None, *, concurrency=None) — concurrently read and decode array regions — by default only the populated ones — yielding each (region, data) pair spatially resolved to its location in the array.

Both are available synchronously (zarr.*, zarr.api.synchronous) and asynchronously (zarr.api.asynchronous); the async read_regions is a generator that streams each region as soon as its data is available. Nothing about the existing arr[:] path changes — these are additive.

Motivation

On a sparse array, arr[:] pays a store round-trip + codec call for every chunk, including empty ones. In the issue's 49,152-chunk HEALPix example (~3% populated), ~150 s of the 173 s wall time is spent iterating empty chunks with zero useful I/O.

These primitives let callers touch only the populated chunks, so cost scales with the populated count rather than the total count.

Design

This follows the direction from the discussion in #3929: rather than mutable state on the array that changes how __getitem__ behaves, expose plain, composable functions. The work decomposes into two pieces:

1. Discover the chunks that exist (shards_initialized). Results are reported at the granularity of stored objects (shard keys for sharded arrays, chunk keys otherwise) because that is what physically exists in the store and is what a single list_prefix returns. Two strategies plus an automatic default, selected by strategy=:
   - "list": one store.list_prefix, filtered to this array's shard grid (ignores zarr.json and any other objects sharing the prefix).
   - "probe": concurrent per-key exists() checks; avoids listing a prefix that may hold many unrelated objects, and is faster when there are few possible keys.
   - "auto" (default): probe for small grids, list otherwise.
2. Read + decode those chunks, spatially resolved (read_regions). Keyed on array regions (a tuple of slices) rather than key strings, on the assumption that regions are the more reusable handle. Reads concurrently and yields (region, data) in completion order. For sharded arrays it yields whole shard regions; empty inner chunks within a populated shard are still skipped efficiently by the existing ShardingCodec partial-decode path.
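To make the difference between the discovery strategies concrete, here is a minimal sketch against a toy in-memory store. ToyStore, candidate_keys, discover, the "c/i/j" key layout, and the probe_limit threshold are all assumptions made for illustration; they are not the PR's implementation or zarr's Store API.

```python
import asyncio
from itertools import product


class ToyStore:
    """Hypothetical in-memory stand-in; it is not zarr's Store class."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    async def list_prefix(self, prefix: str) -> list[str]:
        return [key for key in self._objects if key.startswith(prefix)]

    async def exists(self, key: str) -> bool:
        return key in self._objects


def candidate_keys(prefix: str, grid_shape: tuple[int, ...]) -> list[str]:
    """Every key the chunk grid could hold, assuming a 'c/i/j' key layout."""
    return [
        prefix + "c/" + "/".join(str(i) for i in coords)
        for coords in product(*(range(n) for n in grid_shape))
    ]


async def discover(store, prefix, grid_shape, strategy="auto", probe_limit=16):
    candidates = candidate_keys(prefix, grid_shape)
    if strategy == "auto":
        # Assumed heuristic; the real threshold is not stated in the PR text.
        strategy = "probe" if len(candidates) <= probe_limit else "list"
    if strategy == "list":
        # One listing call, then keep only keys that belong to the grid.
        listed = await store.list_prefix(prefix)
        found = set(candidates).intersection(listed)
    elif strategy == "probe":
        # One existence check per possible key, issued concurrently.
        hits = await asyncio.gather(*(store.exists(k) for k in candidates))
        found = {k for k, hit in zip(candidates, hits) if hit}
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sorted(found)


store = ToyStore({
    "array/zarr.json": b"{}",
    "array/foo": b"blablabla",  # non-chunk object sharing the prefix
    "array/c/0/1": b"\x01",
    "array/c/3/3": b"\x01",
})
results = {
    s: asyncio.run(discover(store, "array/", (4, 4), strategy=s))
    for s in ("list", "probe", "auto")
}
```

In this toy setup all three strategies return the same two chunk keys, and the metadata file and the stray "array/foo" object are filtered out because they are not members of the grid.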

The "pack N decoded chunks into one contiguous array" step that arr[:] performs is deliberately not forced here; pipelines that operate per chunk skip it for a further performance win. A pack/read_sparse convenience will follow in a second PR under zarr.experimental.
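The completion-order streaming and the "process per region, skip the pack step" idea can be sketched with plain asyncio. This is an illustration only: fetch_region, stream_regions, running_total, and the simulated delays are hypothetical stand-ins, not the PR's read_regions.

```python
import asyncio
from collections.abc import AsyncIterator

Region = tuple[slice, ...]


async def fetch_region(region: Region, delay: float) -> list[int]:
    """Stand-in for 'get + decode': returns the indices a 1-D region covers."""
    await asyncio.sleep(delay)  # simulated store latency
    return list(range(region[0].start, region[0].stop))


async def stream_regions(
    jobs: list[tuple[Region, float]], concurrency: int = 4
) -> AsyncIterator[tuple[Region, list[int]]]:
    limit = asyncio.Semaphore(concurrency)  # cap on in-flight reads

    async def read_one(region: Region, delay: float) -> tuple[Region, list[int]]:
        async with limit:
            return region, await fetch_region(region, delay)

    tasks = [asyncio.create_task(read_one(r, d)) for r, d in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # completion order, not submission order


async def running_total(jobs):
    total, order = 0, []
    async for region, data in stream_regions(jobs):
        total += sum(data)  # per-region work; nothing is packed into one array
        order.append(region)
    return total, order


jobs = [((slice(0, 4),), 0.05), ((slice(8, 12),), 0.01)]
total, order = asyncio.run(running_total(jobs))
```

Here the second job has the shorter simulated latency, so its region is yielded first even though it was submitted second. A reduction such as a running sum never needs the full output array to exist.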

Implementation notes

A single private discovery core (_initialized_shards) returns (coords, key) pairs; shards_initialized projects it to keys and read_regions projects it to regions, so neither has to reverse-parse the other's output. This mirrors the existing _nchunks_initialized → nchunks_initialized and _iter_* core/wrapper pattern in array.py.
The pre-existing private _shards_initialized (used by nchunks_initialized / nshards_initialized / info) now delegates to that same core, removing duplicated list_prefix-and-intersect logic and incidentally fixing an O(grid×objects) membership check (list → set).
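The core/wrapper split described above can be sketched as one generator of (coords, key) pairs with two thin projections. The function names, the "c/i/j" key layout, and the signatures below are assumptions for illustration and do not reproduce the code in array.py.

```python
from itertools import product


def _initialized(stored_keys, grid_shape):
    """Core: yield (coords, key) for each grid position whose key is stored."""
    # A set gives O(1) membership; testing against a list would cost
    # O(grid x objects) overall.
    stored = set(stored_keys)
    for coords in product(*(range(n) for n in grid_shape)):
        key = "c/" + "/".join(str(i) for i in coords)
        if key in stored:
            yield coords, key


def initialized_keys(stored_keys, grid_shape):
    """Wrapper 1: project the core output to store keys."""
    return tuple(key for _, key in _initialized(stored_keys, grid_shape))


def initialized_regions(stored_keys, grid_shape, chunk_shape, array_shape):
    """Wrapper 2: project the core output to array regions (tuples of slices)."""
    return tuple(
        tuple(
            slice(c * n, min((c + 1) * n, size))  # clip the last chunk to the array edge
            for c, n, size in zip(coords, chunk_shape, array_shape)
        )
        for coords, _ in _initialized(stored_keys, grid_shape)
    )


stored = ["zarr.json", "c/0/1", "c/1/0"]
keys = initialized_keys(stored, grid_shape=(2, 2))
regions = initialized_regions(stored, (2, 2), chunk_shape=(4, 4), array_shape=(8, 6))
```

Because both wrappers consume coordinates directly, neither has to parse key strings back into grid positions.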

API

import numpy as np
import zarr

# 1. Which shards/chunks actually exist in the store?
keys = zarr.shards_initialized(arr)                  # auto strategy
keys = zarr.shards_initialized(arr, strategy="probe")

# 2. Read only the populated regions, each paired with its location
for region, data in zarr.read_regions(arr):
    ...                                              # region: tuple[slice, ...]

# Reproduce arr[:] without touching empty chunks
out = np.full(arr.shape, arr.fill_value, dtype=arr.dtype)
for region, data in zarr.read_regions(arr):
    out[region] = np.asarray(data)

# Async: stream each region as soon as it is decoded
import zarr.api.asynchronous as za
async for region, data in za.read_regions(arr):
    ...

Benchmarks

bench/empty_chunks.py sweeps chunk count at ~3% sparsity, comparing stock arr[:] against read_regions + pack and a per-region stream:

store           n_chunks  populated  arr[:] (s)  pack (s)  stream (s)  pack x  stream x
MemoryStore         1024         32     0.0458    0.0111      0.0100    4.1x     4.6x
LocalStore          1024         32     0.2863    0.0236      0.0218   12.1x    13.2x
MemoryStore         4096        128     0.1780    0.0281      0.0458    6.3x     3.9x
LocalStore          4096        128     1.0605    0.0934      0.1008   11.4x    10.5x
MemoryStore        16384        512     0.8202    0.1699      0.1442    4.8x     5.7x
LocalStore         16384        512     5.2325    0.4762      0.4040   11.0x    13.0x
MemoryStore        49152       1536     2.7726    0.6218      0.5360    4.5x     5.2x
LocalStore         49152       1536    13.4691    1.2704      1.2380   10.6x    10.9x

LocalStore plateaus at roughly 10–13×; remote object stores see much larger speedups (~64× in the issue's S3 report) because each skipped empty chunk avoids a network round-trip.

Testing

tests/test_chunk_access.py (memory + local stores; unsharded, sharded, 2-D; all-empty / all-populated / sparse layouts):

all three strategies agree, with hand-known populated counts;
the "list" strategy ignores non-chunk objects sharing the prefix;
packing read_regions output reproduces arr[:] byte-for-byte;
default region count matches shards_initialized;
explicit regions and concurrency=1 paths;
async streaming yields the same set as the sync wrapper.

Existing test_array / test_api (incl. the sync/async docstring-match test) and test_zarr pass unchanged.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

AI Disclosure

This PR contains AI-generated content.
- I have tested any AI-generated content in my PR.
- I take responsibility for any AI-generated content in my PR. Tools: Claude Code

d-v-b · 2026-06-03T19:22:59Z

@@ -0,0 +1,157 @@
+"""Benchmark for sparse-array reads via the chunk-access primitives.


not sure we want this checked in -- we have a benchmarks directory already, could you see if these code paths are already exercised there? Those benchmarks get run in CI, which is nice.

codecov · 2026-06-03T19:29:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.55%. Comparing base (b871a22) to head (c7b557a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4028      +/-   ##
==========================================
+ Coverage   93.53%   93.55%   +0.02%     
==========================================
  Files          88       88              
  Lines       11894    11932      +38     
==========================================
+ Hits        11125    11163      +38     
  Misses        769      769

Files with missing lines	Coverage Δ
src/zarr/__init__.py	`100.00% <ø> (ø)`
src/zarr/api/asynchronous.py	`94.05% <ø> (ø)`
src/zarr/api/synchronous.py	`93.82% <100.00%> (+0.86%)`	⬆️
src/zarr/core/array.py	`97.93% <100.00%> (+0.05%)`	⬆️


d-v-b · 2026-06-03T19:36:06Z

I'm not sure this approach would be useful, but we could also frame the question "how should we store our knowledge that a chunk is missing" as a caching problem, and express this in the storage layer by caching missing keys. I'm not sure if our experimental storage cache does this already.

espg · 2026-06-04T21:10:10Z

@d-v-b I had a look through the cache_store.py module; I'm still getting my head wrapped around it, but it looks like it caches present values only and doesn't do any sort of 'negative caching' of missing chunks at all. On a miss it deletes any stale entry and stores nothing (cache_store.py:264-268), and it doesn't cache list_prefix/exists either (those aren't overridden, so they pass straight through to the source).

My understanding is that arr[:] on a CacheStore-backed array will serve the chunks that exist from the cache, but will also issue a get() for every missing chunk (each one misses, caches nothing, and re-pays the round trip next time). Reading through read_regions on a CacheStore backend should cache the populated keys and avoid the get() calls for empty chunks.

espg · 2026-06-04T22:39:53Z

we could also frame the question "how should we store our knowledge that a chunk is missing" as a caching problem

I don't think it's an either/or; it probably makes sense to have both populated shard/chunk discovery and some sort of caching for which regions/shards/chunks are empty. It's a bit hard for me to see the proper design pattern for this ... sparse arrays are often created with plans to revisit and fill them. So if we are caching regions that were previously empty, is the proper path to start async shard discovery while beginning to read from the cached keys? Or is it fully on the caller to update the cache status and mapping?

d-v-b · 2026-06-05T16:05:27Z

but it looks like it caches present values only and doesn't do any sort of 'negative caching' of missing chunks at all.

exactly, we would need to modify the cache store to remember misses, and evict the cached miss when we write to that object. I feel like someone raised an issue about this a while back...

d-v-b · 2026-06-05T16:22:54Z

+@pytest.mark.parametrize("store", ["memory", "memory_get_latency"], indirect=["store"])
+@pytest.mark.parametrize("shards", sparse_shards, ids=str)
+@pytest.mark.parametrize("reader", ["full", "read_regions"], ids=str)
+def test_sparse_read(


I'm not really sure we need this benchmark -- it basically proves that reading fewer chunks is faster than reading more chunks? I think just the tests confirming that the chunk discovery routine worked are sufficient.

d-v-b · 2026-06-05T16:28:59Z

+
+
+@pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
+def test_list_strategy_ignores_non_chunk_objects(store: Store) -> None:


shouldn't this test insert some non-chunk objects in the store? and I don't think you need any real chunks present to test this, and I don't think it needs to be parametrized over different stores. Just use memory storage, create an array (don't write any chunks), and set b"blablabla" to key "array/foo", and ensure that the initialized shards are reported to be empty

d-v-b · 2026-06-05T16:38:34Z

+@pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"])
+@pytest.mark.parametrize(
+    ("setup_name", "expected_count"),
+    [
+        ("sparse_1d", 2),
+        ("dense_1d", 4),
+        ("sparse_2d", 2),
+        ("sharded_sparse", 2),
+        ("all_empty", 0),
+        ("all_populated", 4),
+    ],
+)
+def test_shards_initialized_counts(store: Store, setup_name: str, expected_count: int) -> None:
+    arr, _ = _CA_SETUPS[setup_name](store)
+    assert len(zarr.shards_initialized(arr)) == expected_count


I think these tests can be a lot simpler. I would start with a tuple of regions (parametrize over different tuples of regions), then create the array, then write the regions, then check that the initialized regions are exactly the ones you wrote. This will remove the need for a few of these test functions.
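The shape of the suggested test (start from a tuple of regions, write them, check that discovery reports exactly those) can be sketched without zarr at all. Everything below is a toy: chunks_touched, write_region, discover, and the "c/i/j" key layout are hypothetical helpers, and a real suite would presumably use pytest.mark.parametrize over CASES instead of a loop.

```python
from itertools import product


def chunks_touched(region, chunk_shape):
    """Grid coords of every chunk a region (explicit start/stop slices) overlaps."""
    spans = [
        range(s.start // n, (s.stop - 1) // n + 1)
        for s, n in zip(region, chunk_shape)
    ]
    return set(product(*spans))


def write_region(store, region, chunk_shape):
    """Toy write: mark each overlapped chunk as stored under a 'c/i/j' key."""
    for coords in chunks_touched(region, chunk_shape):
        store["c/" + "/".join(str(i) for i in coords)] = b"\x01"


def discover(store, grid_shape):
    """Toy discovery: grid coords whose key is present in the store."""
    return {
        coords
        for coords in product(*(range(n) for n in grid_shape))
        if "c/" + "/".join(str(i) for i in coords) in store
    }


CASES = [
    (),                                                        # nothing written
    ((slice(0, 4), slice(0, 4)),),                             # one chunk
    ((slice(0, 4), slice(4, 8)), (slice(4, 8), slice(0, 4))),  # sparse
    ((slice(0, 8), slice(0, 8)),),                             # fully populated
]

outcomes = []
for regions in CASES:
    store = {"zarr.json": b"{}"}  # metadata object that discovery must ignore
    for region in regions:
        write_region(store, region, chunk_shape=(4, 4))
    expected = set().union(*(chunks_touched(r, (4, 4)) for r in regions))
    outcomes.append(discover(store, grid_shape=(2, 2)) == expected)
```

One parametrized test of this form covers the all-empty, sparse, and all-populated layouts that the current suite spells out as separate named setups.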

d-v-b · 2026-06-05T16:50:05Z

If you're OK with me pushing to this branch I'd be happy to address some of my concerns about test organization.

I think the core functionality is good but I want some general approval from other devs before we commit to new public API. We might need a little bikeshedding over the function names, for example.

This PR adds shards_initialized, initialized_regions, and read_regions functions. Should the form be x_initialized or initialized_x? Since these routines are scoped to arrays, should we indicate that with the name, e.g. initialized_array_regions, initialized_array_shards (redundant but consistent), and read_array_regions?

We also need to ensure that people understand that these functions don't introspect the contents of shards, so a shard file that has no subchunks written will appear as an initialized region.

@zarr-developers/python-core-devs please have a look. I'd like feedback from at least one person other than me before committing to the addition of these new routines.

espg · 2026-06-05T17:30:40Z

@d-v-b feel free to push to the branch and get things better lined up for a merge. Very open on the naming conventions!

exactly, we would need to modify the cache store to remember misses, and evict the cached miss when we write to that object.

Happy to tackle a prototype for this in another PR

espg added 2 commits June 2, 2026 18:37

populated shards primative

2887142

delegation consistency for shards

82bdbf8

d-v-b reviewed Jun 3, 2026

Comment thread src/zarr/core/array.py Outdated

d-v-b reviewed Jun 3, 2026

Comment thread src/zarr/core/array.py Outdated

d-v-b reviewed Jun 3, 2026

Comment thread tests/test_chunk_access.py Outdated

d-v-b reviewed Jun 3, 2026

Comment thread tests/test_chunk_access.py Outdated

espg added 4 commits June 3, 2026 16:47

easy fixes (tests, file paths, etc)

a027be5

switching to caller to provide regions

aec3260

linting fixes

cc94ab8

bumping codecov to include final three lines...

b8d1d8c

Merge branch 'main' into feat/chunk-access-primitives

c7b557a

d-v-b reviewed Jun 5, 2026


espg mentioned this pull request Jun 5, 2026

[WIP] prototype for negative caching in StoreCache #4042

Open

6 tasks
