refactor: one array class#4034

Open
d-v-b wants to merge 27 commits into
zarr-developers:mainfrom
d-v-b:one-array-class
@d-v-b
Contributor

@d-v-b d-v-b commented Jun 4, 2026

This PR makes 2 fundamental changes to our top-level Array class:

  • It adds Array._runner: Runner to the Array. Runner is a protocol; its default implementation, SyncRunner, looks like this:

    class SyncRunner:
        """The default `Runner`. Runs coroutines on Zarr's shared background event
        loop via `sync`.
        """
    
        def run(self, coro: Coroutine[Any, Any, T]) -> T:
            return sync(coro)

    The runner parameter lets a caller provide their own event loop, which the array uses when blocking on the execution of a coroutine. If the user doesn't supply a runner, we use a house default, which is just sync. So if you don't request a different runner, everything behaves as before.

  • It adds an async method for every sync method. Each sync method calls self._runner.run(self.do_thing_async()) to run its async counterpart.
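
A minimal sketch of how the two changes fit together, using only the standard library. `Runner` here is an assumed shape for the protocol (a single blocking `run` method, inferred from `SyncRunner` above), and `ThreadLoopRunner` and `ToyArray` are hypothetical names for illustration, not part of this PR.

```python
import asyncio
import threading
from collections.abc import Coroutine
from typing import Any, Protocol, TypeVar

T = TypeVar("T")


class Runner(Protocol):
    """Assumed protocol shape: anything with a blocking `run(coro)` method."""

    def run(self, coro: Coroutine[Any, Any, T]) -> T: ...


class ThreadLoopRunner:
    """Hypothetical custom runner that owns an event loop on a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, T]) -> T:
        # Schedule the coroutine on our loop and block the caller on its result.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self) -> None:
        # Stop the loop from outside its thread, then release its resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


class ToyArray:
    """Toy stand-in for an array exposing a sync method and its async twin."""

    def __init__(self, data: list[int], runner: Runner) -> None:
        self._data = data
        self._runner = runner

    async def total_async(self) -> int:
        await asyncio.sleep(0)  # stands in for real I/O
        return sum(self._data)

    def total(self) -> int:
        # The sync method is a thin wrapper: run the async twin on the runner.
        return self._runner.run(self.total_async())


runner = ThreadLoopRunner()
arr = ToyArray([1, 2, 3], runner=runner)
result = arr.total()
runner.close()
```

In this toy, swapping the runner changes only where the coroutine executes; the sync and async methods share one implementation.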

This means the AsyncArray class has no use and can be phased out. NOTE: it is not removed.

The goal here is no breaking changes. Removing the AsyncArray class can happen at its own pace. If you find any breaking changes in this PR, we can fix them.

d-v-b and others added 20 commits June 3, 2026 21:00
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocol

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Array no longer wraps an AsyncArray. It owns metadata, store_path, config,
codec_pipeline, _chunk_grid, and a pluggable _runner (defaulting to
SyncRunner). Adds Array._from_async_array and a deprecated async_array
property. External Array(async_array) construction sites are converted to
Array._from_async_array. Fixes downstream typing fallout from removing the
_async_array attribute.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…roperty

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…runner

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…unner

Routes resize/append/update_attributes/nchunks_initialized/nbytes_stored/
info_complete through self._runner.run(self.*_async(...)), which mutate the
live Array. Fixes resize/append not updating array state. Array no longer
delegates to the deprecated async_array property.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hods

Adds get/set_{orthogonal,mask,coordinate,block}_selection_async to Array and
migrates tests off the deprecated async_array property where an Array async
equivalent now exists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Eliminates duplicated indexer construction and coordinate value-validation
by routing each sync selection method through self._runner.run of its
*_async twin. Adds get/set_basic_selection_async for a complete surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- update_attributes (sync) returns a fresh Array, preserving the prior contract
- from_array docstring example uses a public construction path
- align SupportsArrayState._iter_shard_keys signature with the real methods
- restore AsyncArray coverage in test_get_shape_chunks
- extract shared sharding-codec helper to dedup Array/AsyncArray properties

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 99.24812% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.50%. Comparing base (b9d3964) to head (1a344f9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/api/synchronous.py | 93.33% | 1 Missing ⚠️ |
| src/zarr/core/group.py | 92.30% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4034      +/-   ##
==========================================
- Coverage   93.53%   93.50%   -0.04%     
==========================================
  Files          88       88              
  Lines       11894    12007     +113     
==========================================
+ Hits        11125    11227     +102     
- Misses        769      780      +11     

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/zarr/core/array.py | 97.14% <100.00%> (-0.74%) | ⬇️ |
| src/zarr/core/attributes.py | 96.15% <100.00%> (ø) | |
| src/zarr/core/sync.py | 94.64% <100.00%> (+0.30%) | ⬆️ |
| src/zarr/metadata/migrate_v3.py | 98.36% <100.00%> (ø) | |
| src/zarr/testing/stateful.py | 36.46% <ø> (ø) | |
| src/zarr/api/synchronous.py | 92.95% <93.33%> (ø) | |
| src/zarr/core/group.py | 95.20% <92.30%> (ø) | |

Softens the constructor break: Array(async_array) still works but emits a
DeprecationWarning, constructing from the async array's metadata/store_path/
config. The new Array(metadata, store_path, ...) form is preferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b changed the title one array class refactor: one array class Jun 4, 2026
d-v-b and others added 4 commits June 4, 2026 13:47
The docs build guards that every python block declares exec/test; the new
custom-runner example was a bare fence. Mark it exec="true" (it constructs
an Array with a custom runner, which runs cleanly) and drop the unused Runner
import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…compressor/filters)

Closes coverage gaps introduced by the Array unification: the store_path-required
TypeError, __eq__ NotImplemented path, the sharded read_chunk_sizes/_chunk_grid_shape
branch, and the v2/v3 compressor and v2 filters property branches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, iterators, async_array

- legacy Array(async_array) raises if store_path/config also supplied
- update_attributes_async returns a fresh Array, consistent with the sync form
- align Array._iter_shard_coords signature with sibling iterators
- async_array property left uncached: resize/append replace metadata so caching would be stale

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@d-v-b
Contributor Author

d-v-b commented Jun 4, 2026

The main breaking change I was worried about here is Array.__init__, which previously took a single AsyncArray argument, e.g. Array(my_async_array). This still works on this branch, but it's deprecated. The preferred constructor for Array is much more user-friendly now: you pass metadata, a store_path, etc.
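
A rough sketch of that constructor pattern, using toy classes rather than zarr's. `LegacyHandle` and `ToyArray` are hypothetical stand-ins, and raising `TypeError` when both forms are mixed is an assumption; the commit message only says the legacy form "raises" in that case.

```python
import warnings


class LegacyHandle:
    """Stand-in for an AsyncArray-like object (hypothetical, not zarr's class)."""

    def __init__(self, metadata, store_path, config=None):
        self.metadata = metadata
        self.store_path = store_path
        self.config = config


class ToyArray:
    def __init__(self, metadata, store_path=None, config=None):
        if isinstance(metadata, LegacyHandle):
            # Legacy single-argument form: ToyArray(handle).
            if store_path is not None or config is not None:
                # Assumed exception type; mixing both forms is ambiguous.
                raise TypeError("pass either a legacy handle or explicit fields, not both")
            warnings.warn(
                "ToyArray(handle) is deprecated; pass metadata and store_path",
                DeprecationWarning,
                stacklevel=2,
            )
            handle = metadata
            metadata, store_path, config = handle.metadata, handle.store_path, handle.config
        elif store_path is None:
            raise TypeError("store_path is required")
        self.metadata = metadata
        self.store_path = store_path
        self.config = config


handle = LegacyHandle({"shape": (4,)}, "root/a")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = ToyArray(handle)  # still works, but warns
preferred = ToyArray({"shape": (4,)}, "root/a")  # preferred explicit form
```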

@d-v-b d-v-b marked this pull request as ready for review June 4, 2026 12:58
@d-v-b d-v-b requested a review from mkitti June 4, 2026 12:58
@d-v-b d-v-b requested a review from dcherian June 5, 2026 07:48
Member

@maxrjones maxrjones left a comment


Why did you choose this approach over using AsyncArray as the single source of truth, and making Array a facade holding a reference to it? The regression section details the most critical issue with this PR, and would be prevented by having AsyncArray as the single source of truth. Using AsyncArray as the single source of truth would reduce the amount of code needed by a few hundred lines (see duplication section). A facade Array can still expose every *_async method and the runner-dispatched sync wrappers by delegating to the held AsyncArray, so the user-facing API of this PR is identical either way.

The regression and the code duplication below are two symptoms of the same root cause: after this PR, Array and AsyncArray each independently own array state and derive behavior from it. Caching the handle would fix the regression and delegating to the shared helpers would fix the method duplication. Even with those changes, the state and the derived properties remain mirrored across both classes, and every future change must land on both sides. The facade removes the second source of truth instead of patching its symptoms one at a time.

Also, what's the use-case for a synchronous Array having a custom runner? I would think that most people who want to bring an event loop would just use the async methods. It seems like a YAGNI case, and not worth exposing publicly in this PR since it's not usable via zarr.open_array, zarr.create_array, etc.

regression

On main, Array.async_array returned the shared backing _async_array, and all Array properties read through it — so arr.async_array.resize((N,)) updated arr.shape too. On this branch the property constructs a fresh, throwaway AsyncArray(self.metadata, self.store_path, self.config) on every access. Three consequences, all verified:

  • arr.async_array.resize((N,)) — a previously working pattern — now runs _resize against the detached temporary: the store metadata is updated (array.py:6683) but arr.metadata/arr.shape stay stale. Subsequent arr[...] reads/writes index against the old shape while the store has the new one.
  • aa = arr.async_array; arr.resize(...) leaves aa holding the pre-resize metadata, since resize rebinds arr.metadata via object.__setattr__ on arr only (array.py:4634, 6686).
  • The expression `arr.async_array is arr.async_array` now evaluates to False, breaking identity-based caching.

The changelog frames async_array as "deprecated but still works for now"; for mutating operations it does not work — it silently desynchronizes the handle from the array.
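
The desynchronization can be reproduced in miniature with toy classes. This is only an illustration of the pattern described above: `Meta`, `Handle`, `SharedOwner`, and `DetachedOwner` are hypothetical names, the methods are synchronous for brevity, and a plain dict stands in for the store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    shape: tuple[int, ...]


class Handle:
    """Stand-in for an async-array handle that keeps its own metadata reference."""

    def __init__(self, meta: Meta, store: dict) -> None:
        self.meta = meta
        self.store = store

    def resize(self, shape: tuple[int, ...]) -> None:
        # Persist to the store, then rebind metadata on *this* handle only.
        self.store["shape"] = shape
        self.meta = Meta(shape)


class SharedOwner:
    """main-like: one shared handle, and every property reads through it."""

    def __init__(self, meta: Meta, store: dict) -> None:
        self._handle = Handle(meta, store)

    @property
    def handle(self) -> Handle:
        return self._handle

    @property
    def shape(self) -> tuple[int, ...]:
        return self._handle.meta.shape


class DetachedOwner:
    """branch-like: owns its metadata and builds a fresh handle per access."""

    def __init__(self, meta: Meta, store: dict) -> None:
        self.meta = meta
        self.store = store

    @property
    def handle(self) -> Handle:
        return Handle(self.meta, self.store)

    @property
    def shape(self) -> tuple[int, ...]:
        return self.meta.shape


shared = SharedOwner(Meta((10,)), {"shape": (10,)})
shared.handle.resize((20,))  # owner sees the new shape

detached = DetachedOwner(Meta((10,)), {"shape": (10,)})
detached.handle.resize((20,))  # store updated, owner left stale
```

After these calls, `shared.shape` follows the resize, while `detached.shape` still reports the old shape even though `detached.store` holds the new one.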

code duplication avoided by using AsyncArray as the single source of truth

AsyncArray's selection methods are one-line delegations to shared module-level helpers:

# AsyncArray.get_orthogonal_selection (array.py:1544)
return await _get_orthogonal_selection(
    self.store_path, self.metadata, self.codec_pipeline, self.config, self._chunk_grid,
    selection, out=out, fields=fields, prototype=prototype,
)

The new Array.get_orthogonal_selection_async re-inlines the body of that same helper instead of calling it:

# Array.get_orthogonal_selection_async (array.py:2865)
if prototype is None:
    prototype = default_buffer_prototype()
indexer = OrthogonalIndexer(selection, self.shape, self._chunk_grid)
return await self._get_selection(indexer=indexer, out=out, fields=fields, prototype=prototype)

# _get_orthogonal_selection — the existing shared helper (array.py:6336)
if prototype is None:
    prototype = default_buffer_prototype()
indexer = OrthogonalIndexer(selection, metadata.shape, chunk_grid)
return await _get_selection(..., indexer=indexer, out=out, fields=fields, prototype=prototype)

The same pattern repeats for the mask, coordinate, and block selection getters and setters. Beyond the methods:

  • Array._info (array.py:1954) is a byte-for-byte copy of AsyncArray._info (array.py:1825) — a 20-line ArrayInfo construction duplicated verbatim.
  • The __init__ state-derivation block (parse_array_metadata / ChunkGrid.from_metadata / create_codec_pipeline) is duplicated between AsyncArray.__init__ (array.py:415–426) and Array.__init__ (array.py:1892–1902).
  • The derived properties (order, read_only, filters, serializer, compressors, nchunks, …) are mirrored across both classes.

With the docstrings these twins carry, that accounts for the few hundred lines. Each pair is a place where a future fix can land on one side and silently miss the other, so that sync and async selection silently return different results for the same call. A facade Array delegating to its held AsyncArray would have exactly one copy of each.
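
For concreteness, the facade shape could look roughly like the sketch below. The classes are toy stand-ins (`ToyAsyncArray` and `ToyFacade` are hypothetical, not zarr's `AsyncArray` or `Array`), and `asyncio.run` plays the role of the runner.

```python
import asyncio


class ToyAsyncArray:
    """Single owner of state and behavior (toy stand-in)."""

    def __init__(self, data):
        self.data = list(data)

    @property
    def shape(self):
        return (len(self.data),)

    async def getitem(self, sel: slice):
        await asyncio.sleep(0)  # stands in for store I/O
        return self.data[sel]

    async def resize(self, n: int):
        await asyncio.sleep(0)
        # Pad with zeros or truncate to the requested length.
        self.data = (self.data + [0] * n)[:n]


class ToyFacade:
    """Holds one ToyAsyncArray and delegates everything to it."""

    def __init__(self, async_array, run=asyncio.run):
        self._async_array = async_array
        self._run = run  # plays the role of the runner

    @property
    def async_array(self):
        return self._async_array  # the same object on every access

    @property
    def shape(self):
        return self._async_array.shape  # derived state is read through

    async def getitem_async(self, sel: slice):
        return await self._async_array.getitem(sel)

    def getitem(self, sel: slice):
        return self._run(self.getitem_async(sel))

    def resize(self, n: int):
        self._run(self._async_array.resize(n))


facade = ToyFacade(ToyAsyncArray([1, 2, 3]))
first_two = facade.getitem(slice(0, 2))
facade.resize(5)
shape_after_sync_resize = facade.shape
asyncio.run(facade.async_array.resize(2))  # mutate through the held handle
shape_after_handle_resize = facade.shape
```

Because the facade stores no state of its own, a resize through either entry point is visible through the other.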

Comment thread src/zarr/core/array.py
Comment on lines +206 to +210
"""Return the array's sharding codec, or `None` if the array is not sharded.

An array is considered sharded when its metadata declares exactly one codec
and that codec is a `ShardingCodec`.
"""

Suggested change
- """Return the array's sharding codec, or `None` if the array is not sharded.
- An array is considered sharded when its metadata declares exactly one codec
- and that codec is a `ShardingCodec`.
- """
+ """Return the array's sole sharding codec, or `None`.
+ The gate used by the chunk/shard accessors: sharding is reported only
+ when the sharding codec is the only declared codec, because any other
+ codec (e.g. ``codecs=[sharding, gzip]``) makes the inner chunks not
+ independently addressable. Not a general sharded-ness predicate; see #4036.
+ """

I think it's important to document that this won't always return the sharding codec to prevent misuse.
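
A small sketch of the gate as the docstrings describe it, to show the case that could be misused. The codec classes here are plain dataclass stand-ins, not zarr's codecs, and `sole_sharding_codec` is a hypothetical name for the helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardingCodec:
    """Stand-in for a sharding codec (not zarr's class)."""

    chunk_shape: tuple[int, ...]


@dataclass(frozen=True)
class GzipCodec:
    """Stand-in for any other codec in the pipeline."""

    level: int = 5


def sole_sharding_codec(codecs):
    """Return the sharding codec only when it is the single declared codec."""
    if len(codecs) == 1 and isinstance(codecs[0], ShardingCodec):
        return codecs[0]
    return None


sharding = ShardingCodec(chunk_shape=(8, 8))
only_sharding = sole_sharding_codec([sharding])
sharding_plus_gzip = sole_sharding_codec([sharding, GzipCodec()])
no_sharding = sole_sharding_codec([GzipCodec()])
```

Here `sharding_plus_gzip` is `None` even though a sharding codec is present, which is the behavior the suggested docstring warns about.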



2 participants