refactor: one array class #4034
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ocol Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Array no longer wraps an AsyncArray. It owns metadata, store_path, config, codec_pipeline, _chunk_grid, and a pluggable _runner (defaulting to SyncRunner). Adds Array._from_async_array and a deprecated async_array property. External Array(async_array) construction sites are converted to Array._from_async_array. Fixes downstream typing fallout from removing the _async_array attribute. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…roperty Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…runner Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…unner Routes resize/append/update_attributes/nchunks_initialized/nbytes_stored/info_complete through self._runner.run(self.*_async(...)), which mutate the live Array. Fixes resize/append not updating array state. Array no longer delegates to the deprecated async_array property. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hods
Adds get/set_{orthogonal,mask,coordinate,block}_selection_async to Array and
migrates tests off the deprecated async_array property where an Array async
equivalent now exists.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Eliminates duplicated indexer construction and coordinate value-validation by routing each sync selection method through self._runner.run of its *_async twin. Adds get/set_basic_selection_async for a complete surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- update_attributes (sync) returns a fresh Array, preserving the prior contract
- from_array docstring example uses a public construction path
- align SupportsArrayState._iter_shard_keys signature with the real methods
- restore AsyncArray coverage in test_get_shape_chunks
- extract shared sharding-codec helper to dedup Array/AsyncArray properties
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4034 +/- ##
==========================================
- Coverage 93.53% 93.50% -0.04%
==========================================
Files 88 88
Lines 11894 12007 +113
==========================================
+ Hits 11125 11227 +102
- Misses 769 780 +11
Softens the constructor break: Array(async_array) still works but emits a DeprecationWarning, constructing from the async array's metadata/store_path/config. The new Array(metadata, store_path, ...) form is preferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The docs build guards that every python block declares exec/test; the new custom-runner example was a bare fence. Mark it exec="true" (it constructs an Array with a custom runner, which runs cleanly) and drop the unused Runner import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…compressor/filters) Closes coverage gaps introduced by the Array unification: the store_path-required TypeError, __eq__ NotImplemented path, the sharded read_chunk_sizes/_chunk_grid_shape branch, and the v2/v3 compressor and v2 filters property branches. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, iterators, async_array
- legacy Array(async_array) raises if store_path/config also supplied
- update_attributes_async returns a fresh Array, consistent with the sync form
- align Array._iter_shard_coords signature with sibling iterators
- async_array property left uncached: resize/append replace metadata so caching would be stale
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
the main breaking change I was worried about here is
maxrjones
left a comment
Why did you choose this approach over using AsyncArray as the single source of truth, and making Array a facade holding a reference to it? The regression section details the most critical issue with this PR, and would be prevented by having AsyncArray as the single source of truth. Using AsyncArray as the single source of truth would reduce the amount of code needed by a few hundred lines (see duplication section). A facade Array can still expose every *_async method and the runner-dispatched sync wrappers by delegating to the held AsyncArray, so the user-facing API of this PR is identical either way.
The regression and the code duplication below are two symptoms of the same root cause: after this PR, Array and AsyncArray each independently own array state and derive behavior from it. Caching the handle would fix the regression and delegating to the shared helpers would fix the method duplication. Even with those changes, the state and the derived properties remain mirrored across both classes, and every future change must land on both sides. The facade removes the second source of truth instead of patching its symptoms one at a time.
Also, what's the use-case for a synchronous Array having a custom runner? I would think that most people who want to bring an event loop would just use the async methods. It seems like a YAGNI case, and not worth exposing publicly in this PR since it's not usable via zarr.open_array, zarr.create_array, etc.
regression
On main, Array.async_array returned the shared backing _async_array, and all Array properties read through it — so arr.async_array.resize((N,)) updated arr.shape too. On this branch the property constructs a fresh, throwaway AsyncArray(self.metadata, self.store_path, self.config) on every access. Three consequences, all verified:
- arr.async_array.resize((N,)), a previously working pattern, now runs _resize against the detached temporary: the store metadata is updated (array.py:6683) but arr.metadata / arr.shape stay stale. Subsequent arr[...] reads/writes index against the old shape while the store has the new one.
- aa = arr.async_array; arr.resize(...) leaves aa holding the pre-resize metadata, since resize rebinds arr.metadata via object.__setattr__ on arr only (array.py:4634, 6686).
- The expression arr.async_array is arr.async_array is now False, breaking identity-based caching.
The changelog frames async_array as "deprecated but still works for now"; for mutating operations it does not work — it silently desynchronizes the handle from the array.
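The desynchronization described above can be reproduced with toy classes. This is only a sketch: `Handle`, `SharedHandleArray`, and `FreshHandleArray` are hypothetical stand-ins (synchronous, with a dict for metadata), not zarr code. They contrast a property that returns one shared backing handle with a property that builds a new handle on every access.

```python
class Handle:
    """Toy stand-in for AsyncArray: holds its own reference to metadata."""

    def __init__(self, metadata: dict) -> None:
        self.metadata = metadata

    def resize(self, shape: tuple) -> None:
        # Rebind metadata on *this* object only, as the review describes.
        self.metadata = {**self.metadata, "shape": shape}


class SharedHandleArray:
    """Behaviour described for main: one backing handle, read through it."""

    def __init__(self, shape: tuple) -> None:
        self._handle = Handle({"shape": shape})

    @property
    def handle(self) -> Handle:
        return self._handle  # same object on every access

    @property
    def shape(self) -> tuple:
        return self._handle.metadata["shape"]


class FreshHandleArray:
    """Behaviour described for the branch: a new handle per access."""

    def __init__(self, shape: tuple) -> None:
        self.metadata = {"shape": shape}

    @property
    def handle(self) -> Handle:
        return Handle(self.metadata)  # detached temporary

    @property
    def shape(self) -> tuple:
        return self.metadata["shape"]


old = SharedHandleArray((10,))
old.handle.resize((20,))
print(old.shape)                 # (20,)
print(old.handle is old.handle)  # True

new = FreshHandleArray((10,))
new.handle.resize((20,))
print(new.shape)                 # (10,), the array never saw the resize
print(new.handle is new.handle)  # False
```

In the toy, the resize on the fresh handle succeeds against the temporary and is then lost, which mirrors the "store updated, array stale" symptom.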
code duplication avoided by using AsyncArray as the single source of truth
AsyncArray's selection methods are one-line delegations to shared module-level helpers:
# AsyncArray.get_orthogonal_selection (array.py:1544)
return await _get_orthogonal_selection(
    self.store_path, self.metadata, self.codec_pipeline, self.config, self._chunk_grid,
    selection, out=out, fields=fields, prototype=prototype,
)

The new Array.get_orthogonal_selection_async re-inlines the body of that same helper instead of calling it:

# Array.get_orthogonal_selection_async (array.py:2865)
if prototype is None:
    prototype = default_buffer_prototype()
indexer = OrthogonalIndexer(selection, self.shape, self._chunk_grid)
return await self._get_selection(indexer=indexer, out=out, fields=fields, prototype=prototype)

# _get_orthogonal_selection, the existing shared helper (array.py:6336)
if prototype is None:
    prototype = default_buffer_prototype()
indexer = OrthogonalIndexer(selection, metadata.shape, chunk_grid)
return await _get_selection(..., indexer=indexer, out=out, fields=fields, prototype=prototype)

The same pattern repeats for the mask, coordinate, and block selection getters and setters. Beyond the methods:
- Array._info (array.py:1954) is a byte-for-byte copy of AsyncArray._info (array.py:1825): a 20-line ArrayInfo construction duplicated verbatim.
- The __init__ state-derivation block (parse_array_metadata / ChunkGrid.from_metadata / create_codec_pipeline) is duplicated between AsyncArray.__init__ (array.py:415–426) and Array.__init__ (array.py:1892–1902).
- The derived properties (order, read_only, filters, serializer, compressors, nchunks, …) are mirrored across both classes.
With the docstrings these twins carry, that's the few hundred lines. Each pair is a place where a future fix can land on one side and silently miss the other, leaving sync and async selection silently returning different results for the same call. A facade Array delegating to its held AsyncArray would have exactly one copy of each.
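A minimal sketch of the facade arrangement argued for here, under stated assumptions: `ToyAsyncArray` and `ToyArray` are hypothetical, a Python list stands in for stored chunks, and `asyncio.run` stands in for whatever mechanism blocks on the coroutine. The point is only that state and selection logic live in one class while the facade delegates.

```python
import asyncio


class ToyAsyncArray:
    """Hypothetical single owner of state and of the selection logic."""

    def __init__(self, data: list[int]) -> None:
        self.data = data

    @property
    def shape(self) -> tuple[int, ...]:
        return (len(self.data),)

    async def get_selection(self, indices: list[int]) -> list[int]:
        await asyncio.sleep(0)  # placeholder for chunk I/O
        return [self.data[i] for i in indices]

    async def resize(self, n: int) -> None:
        await asyncio.sleep(0)  # placeholder for a metadata write
        self.data = (self.data + [0] * n)[:n]  # zero-pad or truncate to n


class ToyArray:
    """Facade: holds one ToyAsyncArray and owns no array state itself."""

    def __init__(self, async_array: ToyAsyncArray) -> None:
        self._async_array = async_array

    @property
    def async_array(self) -> ToyAsyncArray:
        return self._async_array  # the same object on every access

    @property
    def shape(self) -> tuple[int, ...]:
        return self._async_array.shape  # derived once, read through

    async def get_selection_async(self, indices: list[int]) -> list[int]:
        return await self._async_array.get_selection(indices)

    def get_selection(self, indices: list[int]) -> list[int]:
        return asyncio.run(self.get_selection_async(indices))


arr = ToyArray(ToyAsyncArray([1, 2, 3, 4]))
asyncio.run(arr.async_array.resize(6))  # mutate through the held handle
print(arr.shape)                        # (6,)
print(arr.get_selection([0, 5]))        # [1, 0]
```

Because the facade reads `shape` through the held object, a resize through `async_array` is visible on the facade without any synchronization step.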
"""Return the array's sharding codec, or `None` if the array is not sharded.

An array is considered sharded when its metadata declares exactly one codec
and that codec is a `ShardingCodec`.
"""

Suggested change:
- """Return the array's sharding codec, or `None` if the array is not sharded.
- An array is considered sharded when its metadata declares exactly one codec
- and that codec is a `ShardingCodec`.
- """
+ """Return the array's sole sharding codec, or `None`.
+ The gate used by the chunk/shard accessors: sharding is reported only
+ when the sharding codec is the only declared codec, because any other
+ codec (e.g. ``codecs=[sharding, gzip]``) makes the inner chunks not
+ independently addressable. Not a general sharded-ness predicate; see #4036.
+ """
I think it's important to document that this won't always return the sharding codec to prevent misuse.
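The gate described in the suggested docstring can be sketched as follows. The `ShardingCodec` and `GzipCodec` dataclasses and the function name `sole_sharding_codec` are hypothetical stand-ins for illustration, not the real zarr classes or the helper added in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardingCodec:
    """Hypothetical stand-in for the real sharding codec class."""

    chunk_shape: tuple[int, ...]


@dataclass(frozen=True)
class GzipCodec:
    """Hypothetical stand-in for any non-sharding codec."""

    level: int = 5


def sole_sharding_codec(codecs):
    """Return the sharding codec only if it is the single declared codec."""
    if len(codecs) == 1 and isinstance(codecs[0], ShardingCodec):
        return codecs[0]
    return None  # covers both "no sharding" and "sharding plus other codecs"


sharding = ShardingCodec(chunk_shape=(4,))
print(sole_sharding_codec([sharding]))               # the codec itself
print(sole_sharding_codec([sharding, GzipCodec()]))  # None: not the only codec
print(sole_sharding_codec([GzipCodec()]))            # None: no sharding at all
```

The second call is the case the review wants documented: a sharding codec is present, yet the gate returns `None`.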
This PR makes 2 fundamental changes to our top-level Array class:

- It adds Array._runner: Runner to the Array. Runner is a protocol that looks like this:

  The runner parameter allows a caller to provide their own event loop that the array will use when blocking on the execution of a coroutine. If the user doesn't declare a runner, we use a house default, which is just sync. So if you don't request a different runner, everything is the same.
- It adds async methods for every sync method. The sync methods use self.runner.run(self.do_thing()) to run the async version.

This means the AsyncArray class has no use and can be phased out. NOTE: it is not removed.

The goal here is no breaking changes. Removing the AsyncArray class can happen at its own pace. If you find any breaking changes in this PR, we can fix them.
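The runner dispatch described above might look roughly like the sketch below. Only the `self._runner.run(self.*_async(...))` pattern comes from the commit messages. The protocol shape (a single `run` method), the `LoopRunner` default built on `asyncio.run`, and `ToyArray` are assumptions made for illustration, and the PR's actual `Runner` and `SyncRunner` may differ.

```python
from __future__ import annotations

import asyncio
from collections.abc import Coroutine
from typing import Any, Protocol, TypeVar

T = TypeVar("T")


class Runner(Protocol):
    """Hypothetical protocol shape: anything that can block on a coroutine."""

    def run(self, coro: Coroutine[Any, Any, T]) -> T: ...


class LoopRunner:
    """Hypothetical default runner: blocks by running a fresh event loop."""

    def run(self, coro: Coroutine[Any, Any, T]) -> T:
        return asyncio.run(coro)


class ToyArray:
    """Stand-in for Array: each sync method delegates to its async twin."""

    def __init__(self, shape: tuple[int, ...], runner: Runner | None = None) -> None:
        self.shape = shape
        # Fall back to the default runner when the caller supplies none.
        self._runner: Runner = runner if runner is not None else LoopRunner()

    async def resize_async(self, new_shape: tuple[int, ...]) -> None:
        await asyncio.sleep(0)  # placeholder for store I/O
        self.shape = new_shape  # mutates the live object

    def resize(self, new_shape: tuple[int, ...]) -> None:
        # The sync method is a thin wrapper over the async one.
        self._runner.run(self.resize_async(new_shape))


arr = ToyArray((10,))
arr.resize((20,))
print(arr.shape)  # (20,)
```

A caller who wants a different blocking strategy would pass any object with a compatible `run` method as `runner`; the sync methods need no other change.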