perf: phased codecpipeline#3885

Open
d-v-b wants to merge 53 commits into
zarr-developers:mainfrom
d-v-b:perf/prepared-write-v2

Conversation

@d-v-b
Contributor

@d-v-b d-v-b commented Apr 8, 2026

This PR defines a new codec pipeline class called PhasedCodecPipeline that enables much higher performance for chunk encoding and decoding than the current BatchedCodecPipeline.

The approach here is to completely ignore how the v3 spec defines array -> bytes codecs 😆. Instead of treating codecs as functions that mix IO and compute, we treat codec encoding and decoding as a sequence:

  1. Preparatory IO (async). Fetch exactly what we need to fetch from storage, given the codecs we have. So if there's a sharding codec in the first array -> bytes position, the codec pipeline knows it must fetch the shard index, then fetch the involved subchunks, before passing them to compute.
  2. Pure compute (sync). Apply filters and compressors. Safe to parallelize over chunks.
  3. Final IO (async, only when writing). Reconcile the in-memory compressed chunks against our model of the stored chunk, then write out the bytes.

Basically, we use the first array -> bytes codec to figure out what kind of preparatory IO and final IO we need to perform, and the rest of the codecs to figure out what kind of chunk encoding we need to do. Separating IO from compute in different phases makes things simpler and faster.

Happy to chat more about this direction. IMO the spec should be re-written with this framing, because it makes much more sense than trying to shoe-horn sharding in as a codec.

I don't want to make our benchmarking suite any bigger, but on my laptop this codec pipeline is 2-5x faster than BatchedCodecPipeline for a lot of workloads. I can include some of those benchmarks later.

This was mostly written by Claude, based on previous work in #3719. All these changes should be non-breaking, so I think this is in principle safe for us to play around with in a patch or minor release.

Edit: this PR depends on changes submitted in #3907 and #3908

d-v-b added 4 commits April 7, 2026 10:38
`PreparedWrite` models a set of per-chunk changes that would be applied to a stored chunk. `SupportsChunkPacking`
is a protocol for array -> bytes codecs that can use `PreparedWrite` objects to update an existing chunk.
@github-actions github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 8, 2026
@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 87.36413% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.25%. Comparing base (b871a22) to head (11bba96).

Files with missing lines                Patch %   Lines
src/zarr/core/codec_pipeline.py         84.96%    43 Missing ⚠️
src/zarr/codecs/sharding.py             90.36%    32 Missing ⚠️
src/zarr/abc/store.py                   74.28%    9 Missing ⚠️
src/zarr/codecs/numcodecs/_codecs.py    83.33%    4 Missing ⚠️
src/zarr/core/array.py                  78.57%    3 Missing ⚠️
src/zarr/storage/_local.py              93.75%    1 Missing ⚠️
src/zarr/storage/_memory.py             94.73%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3885      +/-   ##
==========================================
- Coverage   93.53%   93.25%   -0.29%     
==========================================
  Files          88       88              
  Lines       11894    12526     +632     
==========================================
+ Hits        11125    11681     +556     
- Misses        769      845      +76     
Files with missing lines                Coverage Δ
src/zarr/codecs/_v2.py                  94.11% <100.00%> (+0.50%) ⬆️
src/zarr/core/config.py                 100.00% <ø> (ø)
src/zarr/storage/_local.py              96.99% <93.75%> (-0.26%) ⬇️
src/zarr/storage/_memory.py             96.57% <94.73%> (-0.18%) ⬇️
src/zarr/core/array.py                  97.62% <78.57%> (-0.26%) ⬇️
src/zarr/codecs/numcodecs/_codecs.py    95.45% <83.33%> (-0.94%) ⬇️
src/zarr/abc/store.py                   92.61% <74.28%> (-3.82%) ⬇️
src/zarr/codecs/sharding.py             90.94% <90.36%> (-0.58%) ⬇️
src/zarr/core/codec_pipeline.py         88.34% <84.96%> (-3.83%) ⬇️

... and 1 file with indirect coverage changes

@d-v-b
Contributor Author

d-v-b commented Apr 9, 2026

@TomAugspurger how would this design work with CUDA codecs?

@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from 5d3064e to b67a5a0 Compare April 15, 2026 09:51
@github-actions github-actions Bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch 2 times, most recently from a84a15a to 68a7cdc Compare April 17, 2026 10:41
Comment thread src/zarr/core/codec_pipeline.py Outdated
Comment on lines +943 to +962
# Phase 1: fetch all chunks (IO, sequential)
raw_buffers: list[Buffer | None] = [
    bg.get_sync(prototype=cs.prototype)  # type: ignore[attr-defined]
    for bg, cs, *_ in batch
]

# Phase 2: decode (compute, optionally threaded)
def _decode_one(raw: Buffer | None, chunk_spec: ArraySpec) -> NDBuffer | None:
    if raw is None:
        return None
    return transform.decode_chunk(raw, chunk_spec)

specs = [cs for _, cs, *_ in batch]
if n_workers > 0 and len(batch) > 1:
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        decoded_list = list(pool.map(_decode_one, raw_buffers, specs))
else:
    decoded_list = [
        _decode_one(raw, spec) for raw, spec in zip(raw_buffers, specs, strict=True)
    ]
Contributor

@ilan-gold ilan-gold Apr 17, 2026


Why isn't this all multi-threaded i.e., the I/O as well?

Contributor Author


I should benchmark this, but my expectation was that IO against memory storage and local storage is not compute-limited, so threads wouldn't remove a real bottleneck. For memory storage I'm sure this is true; I'm not sure about local storage, though.

d-v-b and others added 6 commits April 17, 2026 22:51
Adds a SupportsSetRange protocol to zarr.abc.store for stores that
allow overwriting a byte range within an existing value. Implementations
are added for LocalStore (using file-handle seek+write) and MemoryStore
(in-memory bytearray slice assignment).

This is the prerequisite for the partial-shard write fast path in
ShardingCodec, which can patch individual inner-chunk slots without
rewriting the entire shard blob when the inner codec chain is fixed-size.
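A minimal sketch of what such a protocol and an in-memory implementation might look like. The method signature `set_range(key, value, start)` and the `ToyMemoryStore` class are assumptions made for illustration; the PR's actual signatures may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRange(Protocol):
    # Hypothetical signature, chosen for this sketch only.
    def set_range(self, key: str, value: bytes, start: int) -> None: ...


class ToyMemoryStore:
    """Stand-in for an in-memory store; values are mutable bytearrays."""

    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def get(self, key: str) -> bytes:
        return bytes(self._data[key])

    def set_range(self, key: str, value: bytes, start: int) -> None:
        buf = self._data[key]
        end = start + len(value)
        if start < 0 or end > len(buf):
            raise ValueError("range must lie within the existing value")
        # Same-length slice assignment overwrites in place without resizing.
        buf[start:end] = value
```

Because the protocol is runtime-checkable, a pipeline could use `isinstance(store, SupportsSetRange)` to decide whether the byte-range path is available.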

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2Codec, BytesCodec, BloscCodec, etc. previously only implemented the
async _decode_single / _encode_single methods. Add their sync
counterparts (_decode_sync / _encode_sync) so that the upcoming
SyncCodecPipeline can dispatch through them without spinning up an
event loop.

For codecs that wrap external compressors (numcodecs.Zstd, numcodecs.Blosc,
the V2 fallback chain), the sync versions just call the underlying
compressor's blocking API directly instead of routing through
asyncio.to_thread.
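The relationship between the two entry points can be illustrated as follows. `ToyZlibCodec` is a hypothetical class that only borrows the method names from the commit message; it is not zarr's codec base class and it works on plain bytes rather than buffers.

```python
import asyncio
import zlib


class ToyZlibCodec:
    """Hypothetical codec exposing both sync and async entry points."""

    def _encode_sync(self, data: bytes) -> bytes:
        # Call the compressor's blocking API directly.
        return zlib.compress(data)

    def _decode_sync(self, data: bytes) -> bytes:
        return zlib.decompress(data)

    async def _encode_single(self, data: bytes) -> bytes:
        # Async path: hand the same blocking work to a worker thread.
        return await asyncio.to_thread(self._encode_sync, data)

    async def _decode_single(self, data: bytes) -> bytes:
        return await asyncio.to_thread(self._decode_sync, data)
```

A sync pipeline can call `_decode_sync` from any thread without an event loop, while existing async callers keep working through `_decode_single`.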

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arallelism

Adds SyncCodecPipeline alongside BatchedCodecPipeline. The new pipeline
runs codecs through their sync entry points (_decode_sync / _encode_sync)
and dispatches per-chunk work to a module-level thread pool sized by
the codec_pipeline.max_workers config (default = os.cpu_count()).

Each chunk's full lifecycle (fetch + decode + scatter for reads;
get-existing + merge + encode + set/delete for writes) runs as one
pool task — overlapping IO of one chunk with compute of another.
Scatter into the shared output buffer is thread-safe because chunks
have non-overlapping output selections.
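As a rough illustration of the read side, the sketch below runs fetch, decode, and scatter for each chunk as a single pool task. The dict store, the chunk size, and the flat `bytearray` output are assumptions for the example; the real pipeline works on N-dimensional buffers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_NBYTES = 4
N_CHUNKS = 8
# Hypothetical store: chunk index -> compressed chunk bytes.
store = {i: zlib.compress(bytes([i]) * CHUNK_NBYTES) for i in range(N_CHUNKS)}
out = bytearray(N_CHUNKS * CHUNK_NBYTES)  # shared output buffer


def read_one(i: int) -> None:
    raw = store[i]  # fetch
    decoded = zlib.decompress(raw)  # decode
    start = i * CHUNK_NBYTES
    # Scatter: each task writes a disjoint slice of the shared buffer.
    out[start:start + CHUNK_NBYTES] = decoded


with ThreadPoolExecutor(max_workers=4) as pool:
    # list() drains the iterator so any worker exception is re-raised here.
    list(pool.map(read_one, range(N_CHUNKS)))
```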

The async wrappers (read/write) detect SupportsGetSync/SupportsSetSync
stores and dispatch to the sync fast path, passing the configured
max_workers. Other stores fall through to the async path, which still
uses the asyncio-based concurrent_map helper, limited by async.concurrency.

Notes on perf:
- Default (None → cpu_count) is tuned for chunks ≥ ~512 KB.
- Small chunks (≤ 64 KB) regress 1.5-3x because pool dispatch overhead
  (~30-50 µs/task) dominates per-chunk work. Workaround:
  zarr.config.set({"codec_pipeline.max_workers": 1}).
- For large chunks on local/memory stores, IO+compute parallelism
  yields 1.7-2.5x over BatchedCodecPipeline on direct-API reads and
  ~2.5x on roundtrip.

ChunkTransform encapsulates the sync codec chain. It caches resolved
ArraySpecs across calls with the same chunk_spec — combined with the
constant-ArraySpec optimization in indexing, hot-path overhead is
minimized.

Includes test scaffolding for the new pipeline (test_sync_codec_pipeline)
and config plumbing for the max_workers key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds _encode_partial_sync and _decode_partial_sync to ShardingCodec.
For fixed-size inner codec chains and stores that implement
SupportsSetRange, partial writes patch individual inner-chunk slots
in-place instead of rewriting the whole shard:

  - Reads existing shard index (one byte-range get).
  - For each affected inner chunk: decodes the slot, merges the new
    region, re-encodes.
  - Writes each modified slot at its deterministic byte offset, then
    rewrites just the index.

For variable-size inner codecs (e.g. with compression) or stores that
don't support byte-range writes, falls through to a full-shard rewrite
matching BatchedCodecPipeline semantics.
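The fixed-size case can be sketched on a plain `bytearray`. This assumes an uncompressed shard whose inner chunks each occupy exactly `CHUNK_NBYTES` bytes and sit at `rank * CHUNK_NBYTES`; the function name is hypothetical and the index rewrite is omitted.

```python
CHUNK_NBYTES = 4  # fixed encoded size of every inner chunk (no compression)


def patch_inner_chunk(
    shard: bytearray, rank: int, offset_in_chunk: int, new_bytes: bytes
) -> None:
    """Update part of one inner chunk without rewriting the whole shard."""
    start = rank * CHUNK_NBYTES  # deterministic byte offset of the slot
    slot = bytearray(shard[start:start + CHUNK_NBYTES])  # "decode" the slot
    slot[offset_in_chunk:offset_in_chunk + len(new_bytes)] = new_bytes  # merge
    if len(slot) != CHUNK_NBYTES:
        raise ValueError("merged region does not fit inside the slot")
    shard[start:start + CHUNK_NBYTES] = slot  # write back only this slot
```

With a variable-size inner codec the slot size would change on re-encode, which is why the commit falls back to a full-shard rewrite in that case.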

The partial-decode path computes a ReadPlan from the shard index and
issues one byte-range get per overlapping chunk, decoding only what
the read selection touches.

Both paths are dispatched from SyncCodecPipeline via the existing
supports_partial_decode / supports_partial_encode protocol checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new test files:

  test_codec_invariants — asserts contract-level properties that every
  codec / shard / buffer combination must satisfy: round-trip exactness,
  prototype propagation, fill-value handling, all-empty shard handling.

  test_pipeline_parity — exhaustive matrix asserting that
  SyncCodecPipeline and BatchedCodecPipeline produce semantically
  identical results across codec configs, layouts (including
  nested sharding), write sequences, and write_empty_chunks settings.
  Three checks per cell:
    1. Same array contents on read.
    2. Same set of store keys after writes.
    3. Each pipeline reads the other's output identically (catches
       layout-divergence bugs).

These tests pinned the design throughout the SyncCodecPipeline +
partial-shard development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds .gitignore entries for .claude/, CLAUDE.md, and docs/superpowers/
so local IDE/agent planning artifacts don't get committed by accident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b force-pushed the perf/prepared-write-v2 branch from aa111a2 to 1be5563 Compare April 17, 2026 21:04
selected = decoded[chunk_selection]
if drop_axes:
    selected = selected.squeeze(axis=drop_axes)
out[out_selection] = selected
Contributor

@ilan-gold ilan-gold Apr 18, 2026


It might be worth experimenting with moving this setting operation out[out_selection] = selected outside the threadpool execution since, IIRC, it holds the GIL and is probably non-trivial time-wise.

Contributor


The memory usage will probably go up a bit though....
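One way to picture the suggested variant: workers only decode, and the calling thread performs the assignment into the output buffer. This is a toy sketch on flat bytes with `zlib`, not the pipeline's code, and it makes no claim about which variant is faster.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_NBYTES = 4
encoded = [zlib.compress(bytes([i]) * CHUNK_NBYTES) for i in range(8)]
out = bytearray(len(encoded) * CHUNK_NBYTES)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Workers decode only; pool.map yields results in input order.
    for i, decoded in enumerate(pool.map(zlib.decompress, encoded)):
        start = i * CHUNK_NBYTES
        out[start:start + CHUNK_NBYTES] = decoded  # scatter on the calling thread
```

The memory concern shows up here as well: decoded chunks that finish early stay alive until the calling thread gets around to scattering them.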

@ilan-gold ilan-gold self-requested a review May 6, 2026 15:19
@maxrjones maxrjones added the performance Potential issues with Zarr performance (I/O, memory, etc.) label May 11, 2026
Contributor

@ilan-gold ilan-gold left a comment


The first thing that jumps out at me is the potential for a performance regression because there is no whole-shard special casing in the new fused codec pipeline. I guess your benchmarks cover that, but they also might not.

As you point out in #3925, that PR will then bring nested concurrency because the pipeline will have "outer" concurrency (that also controls the decompression) while there will be some inner concurrency from coalesced ranges.

I'd like to understand how these two PRs will fit together!

Contributor

@ilan-gold ilan-gold left a comment


Hi @d-v-b maybe in the interest of pruning this a bit, I don't actually see a dependency on set_range_sync being used in the pipeline - could it be removed from this PR?

In fact, if we want to be able to do what zarrs does for write performance, we will likely need #3826 plus special-casing for unordered first, since this would then be combined with set_range_sync to achieve the "first to compress, first to write" paradigm that appears in zarrs for unordered subchunk writing:

https://github.com/zarrs/zarrs/blob/82876988f0ef25cb45950320c75f9ef591a4359c/zarrs/src/array/codec/array_to_bytes/sharding/sharding_codec.rs#L822-L871

Otherwise, you need to hold the shard in-memory AFAICT to be able to create the ordering ahead of time (i.e., morton) to write.

@d-v-b
Contributor Author

d-v-b commented May 20, 2026

First, apologies for the messy state; second, yes, consider the set_range stuff extra credit. I do want to ensure that we can support range writes, but as long as we are confident that we aren't blocking that path, it's totally fine to slim this PR down to "whatever it takes to speed up local + memory storage".

@d-v-b
Contributor Author

d-v-b commented May 20, 2026

and you should absolutely feel free to push ad libitum to this branch. I'm not actively working on it, and I'm confident that the twin constraints of our test suite and the benchmarks can keep us sane

d-v-b and others added 24 commits May 30, 2026 09:19
)

In ShardingCodec._encode_partial_sync's full-shard-rewrite loop, a scalar
broadcast value produces byte-for-byte identical results for every complete
inner chunk (same fill, same empty-check, same encoded bytes). Compute that
outcome once and reuse it across all complete chunks instead of re-merging,
re-checking write_empty_chunks, and re-encoding tens of thousands of identical
chunks. Incomplete edge chunks still merge against their own data individually.
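The idea can be sketched in a few lines. The function names and the use of `zlib` as the inner codec are assumptions for illustration; the real code also handles the fill-value and empty-chunk checks mentioned above.

```python
import zlib


def encode_complete_chunks_naive(n_chunks: int, chunk_len: int, value: int) -> list[bytes]:
    # Re-encode an identical chunk once per inner chunk.
    return [zlib.compress(bytes([value]) * chunk_len) for _ in range(n_chunks)]


def encode_complete_chunks_shared(n_chunks: int, chunk_len: int, value: int) -> list[bytes]:
    # A scalar broadcast gives every complete chunk the same content,
    # so encode it once and reuse the result.
    encoded_once = zlib.compress(bytes([value]) * chunk_len)
    return [encoded_once] * n_chunks
```

Both functions return equal byte strings for every chunk; the second does a single compression call regardless of how many complete chunks the shard holds.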

Target case (fused, memory, chunks=100/shards=1M, no compression):
write 92.26ms -> 21.59ms (4.3x). Pipeline parity (byte-identical to batched)
and 956 tests pass under the fused pipeline; adversarial partial-overwrite/
edge/compression/2D/aliasing checks pass.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#3826, partial-read opt zarr-developers#3004, _ShardIndex refactor zarr-developers#3975)

Resolves conflicts in sharding.py (kept FusedCodecPipeline sync methods +
main's _subchunk_order_iter / _load_partial_shard_maybe; fixed _ShardIndex
construction to main's 2-arg signature), array.py (took main's cached
regular_chunk_spec), test_codec_pipeline.py (kept the dual-pipeline suite +
main's evolve test), .gitignore (union).

423 codec/sharding/parity + 807 codecs/indexing tests pass under both pipelines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-merge

Two things, both scoped to the sync sharding read path:

1. Fix: main's zarr-developers#3975 made _ShardIndex a 2-field NamedTuple (chunks_per_shard,
   offsets_and_lengths), but the Fused sync methods still constructed it with one
   arg, erroring on every Fused sharded read. Pass chunks_per_shard through in
   _decode_shard_index_sync and the byte-range write path.

2. Perf: _decode_full_shard_bulk + _ShardIndex.is_dense. A whole-shard read of a
   dense, fixed-size, uncompressed shard is reconstructed by reshaping/scattering
   the data section in bulk, replacing the per-chunk decode/index/projection loop
   (~78% of a full read). Chunk positions are read from the stored index, so it is
   correct for any subchunk_write_order. Falls through to the per-chunk path for
   compression/filters, non-dense shards, and any read whose output shape != the
   shard shape (strided/partial/fancy).

Full read (memory, 10000 chunks/shard, uint8): ~291ms -> ~21ms (13.9x vs Batched).
Verified: 0 new test failures vs the merge baseline; full reads correct across
dtypes and 2D; partial/strided/gzip fall through. (Pre-existing Fused x
subchunk_write_order gaps remain, tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…al reads

Three integration gaps surfaced when the Fused pipeline met main's new
subchunk_write_order (zarr-developers#3826), partial-read coalescing (zarr-developers#3004), and _ShardIndex
refactor. Under Fused these caused 25 sharding/parity failures (data was
correct in the partial-read cases; the failures were write-order layout +
IO-pattern divergence). Fixes:

1. Write order: _encode_shard_dict_sync laid out chunks in hardcoded morton
   order, ignoring subchunk_write_order. Now iterates
   _subchunk_order_iter(self.subchunk_write_order), matching the async
   _encode_shard_dict. Fixes lexicographic/colexicographic/unordered storage.

2. Coalesced sync partial reads: add Store.get_ranges_sync (a synchronous,
   coalescing counterpart of get_ranges, reusing coalesce_ranges) and
   ShardingCodec._load_partial_shard_maybe_sync; route _decode_partial_sync's
   partial branch through it. Sync stores now get zarr-developers#3004's byte-range coalescing
   without an event loop (fewer, merged reads).

3. Non-sync fallback: FusedCodecPipeline.read now routes non-sync stores (e.g.
   ZipStore) through the async partial-decode path when the AB codec supports
   it, instead of _async_read_fallback's whole-shard get(). Matches Batched's
   IO behavior; avoids over-reading whole shards on partial reads.
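Byte-range coalescing, as used in item 2, can be sketched as follows. The function below is a hypothetical stand-in written for this example; it does not reproduce the signature or the exact merge policy of zarr's `coalesce_ranges`.

```python
def coalesce(ranges: list[tuple[int, int]], max_gap: int) -> list[tuple[int, int]]:
    """Merge half-open (start, end) byte ranges separated by at most max_gap bytes."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            # Extend the previous range; max() handles fully contained ranges.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, `coalesce([(0, 4), (4, 8), (20, 24), (9, 12)], max_gap=1)` turns four requests into two, at the cost of reading one unneeded byte between offsets 8 and 9.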

Tests: the zarr-developers#3004 partial-read tests are made pipeline-aware (assert the active
method family: get/get_ranges vs get_sync/get_ranges_sync, gated on store sync
support). 573 sharding+parity+pipeline+indexing and 657 codec tests pass under
BOTH pipelines (was 25 failing under Fused).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HIGH (sharding.py, byte-range write fast path): derived each chunk's physical
slot from self.subchunk_write_order instead of hardcoded morton order, and
excluded 'unordered' (no recoverable rank -> falls through to the index-driven
full-rewrite path). A partial write into a dense shard first written with a
non-default order no longer corrupts data via wrong byte offsets.

HIGH (sharding.py, _decode_full_shard_bulk): build the read-view dtype from the
BytesCodec's endian (as BytesCodec._decode_sync does), not the dtype's native
endianness. A big-endian shard read on a little-endian host (or vice versa) now
decodes correctly instead of silently reinterpreting bytes.
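The endianness issue can be shown with the standard library alone. This sketch uses `struct` on unsigned 16-bit values and a hypothetical `decode_u16` helper; the actual fix builds a dtype from the codec's endian setting.

```python
import struct


def decode_u16(raw: bytes, endian: str) -> tuple[int, ...]:
    """Decode uint16 values using the codec's byte order, not the host's."""
    prefix = {"little": "<", "big": ">"}[endian]
    return struct.unpack(f"{prefix}{len(raw) // 2}H", raw)


# A shard payload written with a big-endian bytes codec.
raw = struct.pack(">3H", 1, 2, 3)
```

Decoding `raw` with `endian="big"` recovers `(1, 2, 3)`, while decoding it as little-endian silently yields `(256, 512, 768)`, which is the kind of reinterpretation the fix prevents.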

MEDIUM (sharding.py, _decode_full_shard_bulk): the bulk fast path now requires
the inner chain to be exactly one BytesCodec, excluding crc-bearing shards. The
bulk path can't verify per-chunk checksums, so crc shards fall through to the
per-chunk path and keep their corruption detection.

LOW (codec_pipeline.py, ChunkTransform._resolve_specs): key the resolved-spec
cache on the frozen, hashable ArraySpec value instead of (shape, id()), which
could collide after id reuse.
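A sketch of value-keyed caching, assuming a frozen dataclass as a stand-in for ArraySpec. `ToySpec`, `SpecCache`, and the shape-reversing "resolution" are all hypothetical and exist only to show the cache key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    shape: tuple[int, ...]
    dtype: str


class SpecCache:
    def __init__(self) -> None:
        self._cache: dict[ToySpec, ToySpec] = {}
        self.misses = 0

    def resolve(self, spec: ToySpec) -> ToySpec:
        # Keyed on the spec's value (hash and equality of its fields), so two
        # equal specs share an entry and a recycled id() can never alias.
        try:
            return self._cache[spec]
        except KeyError:
            self.misses += 1
            resolved = ToySpec(shape=spec.shape[::-1], dtype=spec.dtype)  # stand-in work
            self._cache[spec] = resolved
            return resolved
```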

LOW (codec_pipeline.py, _get_pool): don't shutdown(wait=False) the old pool on
grow — a concurrent in-flight pool.map could hit 'cannot schedule new futures
after shutdown'. The orphaned pool drains and is GC'd.

Tests: extended test_pipeline_parity with big-endian + crc32c codec configs and
a dedicated subchunk_write_order x index_location parity test (asserts identical
contents always, identical bytes for deterministic orders). Verified each new
test fails when its corresponding fix is reverted. 1219 tests pass under both
pipelines; mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mpute separation

The class docstring claimed it 'separates IO from compute', then immediately
said the ShardingCodec does IO internally — self-contradictory and misleading.
The actual win is replacing per-chunk ASYNC scheduling with synchronous,
batched/coalesced execution; the sharding codec still owns its storage IO
(the zarrs model, unlike tensorstore's storage-free codecs). Rewrite the
docstring to state this plainly and note that a storage-free codec is a
possible future direction, not what this pipeline does. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merging zarr-developers#4011 (which made 'unordered' deterministic and warns callers not
to rely on its layout), drop the two places my earlier fixes special-cased it by
name:

- Byte-range write fast path: remove the 'subchunk_write_order != unordered'
  gate. The rank map is derived from _subchunk_order_iter(self.subchunk_write_
  order), which is the single source of truth for physical layout — correct for
  every order without a name check. _subchunk_order_iter is the only place that
  knows a given order's layout.
- Parity test: assert byte-equality across pipelines for ALL orders, not just
  'deterministic' ones. The check verifies the two pipelines AGREE (they share
  _subchunk_order_iter), which holds whatever an order resolves to; it makes no
  assumption about what 'unordered' means.

540 parity+sharding and 862 codec/indexing tests pass under both pipelines; mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Flip codec_pipeline.path default from BatchedCodecPipeline to FusedCodecPipeline.
Fused runs codec compute synchronously/in bulk and gives large speedups on
sharded workloads (up to ~24x write / ~14x read on many-chunks-per-shard, more
with compression) and no regressions on compute-bound cases; it falls back to
the async path for non-sync stores. Batched remains selectable via config.

Test fallout from the flip (all behavior, not stale-assertion churn):
- test_config_defaults_set: expected default path updated.
- test_config_codec_implementation: the mock codec now also overrides
  _encode_sync, so it records a call regardless of which pipeline is default
  (Fused uses the sync entry point).
- StoreExpectingTestBuffer (zarr.testing.buffer): added set_sync/get_sync that
  mirror the async buffer-type guards, so the 'all buffers are TestBuffer'
  invariant is checked on the sync write path too. Verified Fused correctly
  threads a custom BufferPrototype (sharded writes store TestBuffer instances) —
  the test simply wasn't exercising the sync path before.

Full suite: 6346 passed, 0 failed under the new default.

NOTE: changelog fragment filename is a PLACEHOLDER — rename
changes/PLACEHOLDER-fused-default.feature.md to changes/<PR#>.feature.md once the
PR number is known (towncrier keys fragments by issue/PR number).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… hard-coded Batched

The codec_pipeline property hard-coded BatchedCodecPipeline.from_codecs(). main
resolves it against the registry via get_pipeline_class() (zarr-developers#2179); the branch
carried an older hard-coded version and the main-merge kept the branch side.
With FusedCodecPipeline now the default this left the inner sub-chunk pipeline
stuck on Batched while the outer array used Fused — an inconsistency, and stale
relative to main. Restore get_pipeline_class().from_codecs(), matching the rest
of this module (which already uses get_pipeline_class elsewhere).

Verified: sharding + parity + pipeline (596) and codecs+array+indexing+properties
(2161) pass; nested sharding roundtrips correctly under both pipelines; no
functional BatchedCodecPipeline references remain in sharding.py. mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…red CodecPipelineTests

HIGH-2: FusedCodecPipeline.decode()/encode() (the async fallback for non-sync
stores) reused one flat chunk_spec across every codec stage instead of evolving
it per codec via resolve_metadata. Spec-changing array->array codecs broke:
TransposeCodec crashed on read (could not broadcast (2,2) into (2,4));
cast_value/scale_offset would silently corrupt. Reachable on the DEFAULT
pipeline for every non-sync store (S3/GCS/fsspec/zip).

Fix, without re-duplicating spec logic (the duplication caused the bug):
- Extract resolve_aa_specs(): single source of truth for per-stage spec
  evolution (forward-thread resolve_metadata over the AA codecs). Pure metadata.
- Add AsyncChunkTransform: per-chunk ASYNC mirror of ChunkTransform, driving the
  codecs' async _decode_single/_encode_single with the correct per-stage spec.
  No mini-batch concept (that stays a BatchedCodecPipeline concern).
- ChunkTransform._resolve_specs delegates to resolve_aa_specs.
- Fused.decode()/encode() loop per chunk through AsyncChunkTransform.
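Per-stage spec evolution can be sketched as below. `ToySpec` and `ToyTranspose` are hypothetical stand-ins, and this `resolve_aa_specs` only borrows the name from the commit message; its signature and return value are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySpec:
    shape: tuple[int, ...]


@dataclass(frozen=True)
class ToyTranspose:
    order: tuple[int, ...]

    def resolve_metadata(self, spec: ToySpec) -> ToySpec:
        # A transpose changes the shape that the next codec sees.
        return replace(spec, shape=tuple(spec.shape[i] for i in self.order))


def resolve_aa_specs(codecs: list[ToyTranspose], spec: ToySpec) -> tuple[list[ToySpec], ToySpec]:
    """Return the input spec for each codec, plus the spec after the last one."""
    per_stage: list[ToySpec] = []
    for codec in codecs:
        per_stage.append(spec)  # the spec this codec must be called with
        spec = codec.resolve_metadata(spec)  # thread it forward
    return per_stage, spec
```

Reusing one flat spec for every stage corresponds to passing `per_stage[0]` to every codec, which is wrong as soon as any codec changes the shape.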

Also harden the sharding byte-range WRITE fast path: take chunk offsets from the
stored shard index, not from the live subchunk_write_order (which is not
recoverable on reopen by design).

New tests/test_codec_pipeline_suite.py: xUnit CodecPipelineTests base run as
TestBatchedPipeline and TestFusedPipeline over a sync (MemoryStore) AND a
non-sync (LatencyStore) store axis. Reproduces HIGH-2 automatically. 140 pass;
mypy clean; original ZipStore+transpose crash now roundtrips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts suite

The shared suite runs every pipeline-agnostic behavior test against BOTH
pipelines x both store paths, so per-file copies of the same behavior are
redundant. Remove confirmed duplicates; keep tests that exercise something the
suite does not.

- Strengthen the suite's write_empty_chunks tests to also assert chunk-key
  presence/absence (absorbing the old _no_store / _persists coverage).
- test_codec_pipeline.py: drop the 8 behavior duplicates now in the suite. KEEP
  test_read_returns_get_results (low-level pipeline.read GetResult API),
  test_write_empty_chunks_false_no_store (store-key shape), and
  test_codec_pipeline_threads_dtype_through_evolve (zarr-developers#3937 regression).
- test_fused_pipeline.py: drop the array-level streaming read/write tests and
  test_partial_shard_write_roundtrip_correctness (array behavior, suite-covered).
  KEEP all pipeline-API / Fused-internal tests (construction, evolve, low-level
  write/read(_sync) roundtrips, sync-write/async-read interop, ChunkTransform
  encode/decode, set_range, inner_codecs_fixed_size, byte-range fast path).

740 pass across suite + codec_pipeline + fused + sync + invariants + parity +
sharding; ruff + mypy clean. No coverage removed without a verified equivalent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…etrized test

The bulk of CodecPipelineTests followed one shape: create an array, apply some
writes, optionally assert which chunk keys exist, then assert reads come back
correct. Capture those variables in a frozen Scenario dataclass (array_kwargs,
writes, reads, keys_present/absent) and drive them all through a single
parametrized test_scenario. Correctness is checked against a numpy reference the
scenario derives from its own writes, so cases don't hand-maintain expected
values. 18 scenarios cover the same matrix (layouts, gzip, transpose
spec-evolution, nested sharding, partial-shard overwrite, write_empty key
presence/absence) x both pipelines x sync/async stores.
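The shape of such a scenario can be sketched as follows. The field names, the 1-D list reference (in place of numpy), and `ToyChunkedArray` are assumptions made for this example; the real suite drives zarr arrays through a parametrized pytest test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    length: int
    fill: int
    writes: tuple[tuple[int, int, int], ...]  # (start, stop, value)
    reads: tuple[tuple[int, int], ...]  # (start, stop)

    def reference(self) -> list[int]:
        # Expected contents are derived from the scenario's own writes.
        ref = [self.fill] * self.length
        for start, stop, value in self.writes:
            ref[start:stop] = [value] * (stop - start)
        return ref


class ToyChunkedArray:
    """Stand-in for the array under test: fixed-size chunks in a dict."""

    def __init__(self, chunk: int, fill: int) -> None:
        self.chunk, self.fill, self.chunks = chunk, fill, {}

    def write(self, start: int, stop: int, value: int) -> None:
        for i in range(start, stop):
            block = self.chunks.setdefault(i // self.chunk, [self.fill] * self.chunk)
            block[i % self.chunk] = value

    def read(self, start: int, stop: int) -> list[int]:
        empty = [self.fill] * self.chunk
        return [self.chunks.get(i // self.chunk, empty)[i % self.chunk] for i in range(start, stop)]


def check(scenario: Scenario) -> None:
    arr = ToyChunkedArray(chunk=4, fill=scenario.fill)
    for write in scenario.writes:
        arr.write(*write)
    ref = scenario.reference()
    for start, stop in scenario.reads:
        assert arr.read(start, stop) == ref[start:stop], scenario.name
```

Adding a case then means adding one `Scenario` value rather than a new test function with hand-written expected output.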

Kept as separate focused tests the two cases that don't fit the shape:
test_read_missing_chunks_false_raises (asserts an exception) and
test_partial_write_after_reopen_is_correct (has an extra reopen step).

Verified the parametrized form keeps its regression-guard value: reverting the
HIGH-2 spec-evolution fix still fails test_scenario[async-transpose]. 670 pass
across pipeline + sharding suites; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…core

The Fused test file had accumulated tests that either duplicated the
pipeline-agnostic CodecPipelineTests suite or were misfiled. Triage:

- async roundtrip / missing-chunk-fill / partial-shard-write dups: removed;
  the shared test_scenario covers these across both pipelines x sync/async
  stores. Added float32 and zstd Scenarios first so the dtype/codec coverage
  the dups carried transfers to the shared matrix (no net coverage loss).
- store set_range / SupportsSetRange tests: already covered (more thoroughly,
  parametrized) in tests/test_store/test_memory.py; removed as dups.
- ShardingCodec._inner_codecs_fixed_size tests: moved to
  tests/test_codecs/test_sharding_unit.py where the sharding internals live.

What stays is genuinely Fused-only and cannot be pipeline-agnostic: the
synchronous API (write_sync / read_sync / _sync_transform) which Batched has
no equivalent of, and the byte-range fast-path assertions (set_range_sync
fires / falls back) which test a Fused-only optimization.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The "invariants" file grouped tests by their shared motivation (a design
doc) rather than by what they test, which is the wrong axis -- it mixed
pipeline-agnostic behavior, Fused-only internals, and a per-codec property
into one file. Sorted each test into the home its subject implies:

Pipeline-agnostic behavior -> CodecPipelineTests (runs on BOTH pipelines x
sync/async stores via the existing fixtures):
- S2 empty-chunk skipping under default config -> a Scenario (keys_absent).
- S2 shard deleted after overwrite-to-fill -> a base-class method (it needs
  a mid-sequence key assertion the Scenario shape can't express).
- C3 no isinstance(ShardingCodec) branching in read/write -> a base-class
  method that resolves the subclass's configured pipeline and source-scans it.

Fused-only (byte-range fast path / ChunkTransform internals) ->
test_fused_pipeline.py:
- S3 fast path skipped when write_empty_chunks=False (the unique complement
  of the existing uses-set-range test; the write_empty_chunks=True case was a
  dup and is dropped).
- B1 byte-range path copies read-only LocalStore buffers before mutating.
- C2 ChunkTransform passes each codec the runtime chunk_spec prototype.

Per-codec contract -> tests/test_codecs/test_codecs.py:
- C1 resolve_metadata only mutates shape (prototype/dtype/fill_value/config
  stable across the chain) -- a property of individual codecs, no pipeline.

Dropped as a pure duplicate (already in test_store/test_memory.py):
- test_supports_set_range_is_runtime_checkable.

No coverage lost: every kept test moved, and the two genuinely-shared
behaviors now run on both pipelines instead of only whichever was default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o shared suite

test_pipeline_read_parity checked Fused vs Batched partial reads against
*each other*. The shared CodecPipelineTests suite already reads partial/strided
selections from sharded arrays against a numpy reference on BOTH pipelines --
which is strictly stronger (it would catch both pipelines diverging from the
spec in the same way, which a pipeline-vs-pipeline check cannot).

The one case test_pipeline_read_parity covered that the shared suite did not was
scalar single-element reads from a sharded array (the sharding codec's
partial-decode path). Added two Scenarios (sharded-scalar-reads-1d / -2d) to
capture it.
Verified they exercise the partial-decode path on both pipelines: the default
Fused pipeline routes a scalar sharded read through _decode_partial_sync, the
Batched pipeline through _decode_partial_single -- so both variants are now
checked against numpy, not just against each other.

Kept in test_pipeline_parity.py the two checks the per-pipeline suite cannot
express, because its two subclasses run in isolation and never see each other's
output:
- test_pipeline_parity: cross-read interop (write under A, read whole under B)
  + cross-pipeline store-key-set equality.
- test_pipeline_parity_subchunk_write_order: byte-identical shard output across
  pipelines for every subchunk_write_order x index_location.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ross-file dup

The file named test_sync_codec_pipeline.py tested no pipeline -- it is the unit
test suite for ChunkTransform (the per-chunk synchronous codec chain that
FusedCodecPipeline uses internally). "sync codec pipeline" was an earlier name
for the Fused pipeline; the filename had outlived it. Renamed to
test_chunk_transform.py (git mv preserves history) and added a module docstring
naming what it actually covers.

Also removed test_sync_transform_encode_decode_roundtrip from
test_fused_pipeline.py: it was a weaker cross-file duplicate of this file's
test_encode_decode_roundtrip (which covers the same encode->decode->compare over
five codec chains rather than just bytes-only). Its one extra assertion -- that
evolve_from_array_spec populates _sync_transform -- is already covered by
test_evolve_from_array_spec in the Fused file.

test_codec_pipeline.py left as-is: all three tests are correctly placed and
cover things the Scenario suite can't (the low-level pipeline.read GetResult
API, a plain dict store, and the zarr-developers#3937 cast_value dtype-threading regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The byte-range-write machinery works, but the right store interface for it is
still undecided, so it is removed from this PR and will return once that
interface is settled.

Removed:
- SupportsSetRange protocol (abc/store.py) and its __all__ export.
- MemoryStore.set_range / set_range_sync / _set_range_impl and the
  SupportsSetRange base (storage/_memory.py).
- LocalStore.set_range / set_range_sync, the _put_range helper, and the
  SupportsSetRange base (storage/_local.py).
- The sharding codec's byte-range-write fast path in _encode_partial_sync;
  partial shard writes now always take the full-shard-rewrite path (identical
  to BatchedCodecPipeline, verified by the pipeline-parity suite). Also dropped
  the now-dead _chunk_byte_offset helper it relied on.
- changes/3907.feature.md (the byte-range-writes changelog note). The
  byte-range-READ changelog (3004) is unrelated and kept.

Byte-range READS (ByteRequest, get(byte_range=), get_ranges coalescing,
the read-side bulk shard decode) are untouched -- this only removes writes.

The known-good tests that exercise byte-range writes are commented out (not
deleted) in test_store/test_memory.py, test_store/test_local.py, and
test_fused_pipeline.py, to restore once the store design is settled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added module-level helper in array.py with zero callers — an ArraySpec-reuse
optimization that was never wired up. Plain function, no protocol role, safe to
drop. Verified: no references anywhere in src/ or tests/, and the full
array/sharding/pipeline suites stay green.

Note: ShardingCodec._encode_sync, though never *called*, is NOT dead — it is a
required member of the runtime_checkable SupportsSyncCodec protocol. Removing it
drops ShardingCodec from SupportsSyncCodec and breaks the sync read-fallback
routing (16 test failures), so it stays.
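
A minimal sketch of the mechanism, assuming a stand-in protocol. `SupportsSyncCodecSketch` and both classes are hypothetical, and the member names are illustrative rather than a copy of zarr's actual protocol:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSyncCodecSketch(Protocol):
    """Stand-in protocol; member names are assumptions for illustration."""

    def _decode_sync(self, chunk: bytes) -> bytes: ...
    def _encode_sync(self, chunk: bytes) -> bytes: ...


class WithEncode:
    def _decode_sync(self, chunk: bytes) -> bytes:
        return chunk

    def _encode_sync(self, chunk: bytes) -> bytes:
        # Never called in this sketch, but its presence is what isinstance checks.
        return chunk


class WithoutEncode:
    def _decode_sync(self, chunk: bytes) -> bytes:
        return chunk


has_both = isinstance(WithEncode(), SupportsSyncCodecSketch)
missing_one = isinstance(WithoutEncode(), SupportsSyncCodecSketch)
```

A runtime-checkable protocol only checks that every member exists, so deleting an uncalled method still flips the `isinstance` result, and with it any routing that branches on that result.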

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The docstring claimed _encode_sync "iterates inner chunks in Morton order —
that's the canonical layout the shard index expects", which is wrong and a
latent footgun: it implies the method imposes a Morton physical layout. It does
not. The Morton iteration only populates an intermediate dict whose key order is
immaterial; the on-disk layout is decided downstream by the subchunk_write_order
loop in _encode_shard_dict_sync (same as the async _encode_single sibling).
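
A toy sketch of that separation, not the sharding codec itself. `morton_key` and `layout` are hypothetical helpers, and the bit-interleaving convention (which dimension takes the lowest bit) is an assumption that may differ from zarr's:

```python
from itertools import product


def morton_key(coord, bits=8):
    """Interleave the bits of each coordinate into one integer (Z-order)."""
    key = 0
    ndim = len(coord)
    for bit in range(bits):
        for dim, value in enumerate(coord):
            key |= ((value >> bit) & 1) << (bit * ndim + dim)
    return key


def layout(chunks, write_order):
    """Concatenate chunk bytes in write_order; dict insertion order is never used."""
    return b"".join(chunks[coord] for coord in write_order)


coords = list(product(range(2), range(2)))      # lexicographic order
morton = sorted(coords, key=morton_key)         # Z-order under the assumed convention
payload = {c: bytes([10 * c[0] + c[1]]) for c in coords}

# Same contents, populated in two different iteration orders.
filled_in_morton = {c: payload[c] for c in morton}
filled_in_lex = {c: payload[c] for c in coords}

same_bytes = layout(filled_in_morton, coords) == layout(filled_in_lex, coords)
order_matters = layout(filled_in_lex, morton) != layout(filled_in_lex, coords)
```

How the dict was filled has no effect on the output bytes; only the order passed to the downstream write loop does.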

Also clarified that this method IS reached — via nested sharding, where an inner
ShardingCodec is encoded through the outer codec's ChunkTransform. (It is not
called for top-level sharded writes, which route through _encode_partial_sync.)

Verified empirically: routing through nested _encode_sync, all three
subchunk_write_order values roundtrip correctly AND morton vs lexicographic
produce physically different bytes — i.e. the order is honored, not ignored.
Behavior unchanged; docstring only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR-added thin wrapper (`_load_shard_index_maybe(...) or _ShardIndex.create_empty(...)`)
with zero invocations anywhere in src/ or tests/. Unlike _encode_sync, this is
genuinely removable: confirmed it is NOT a member of any runtime_checkable
protocol or ABC (no reference in src/zarr/abc/, not a base-class override) and is
reached by no dynamic dispatch (no getattr / string reference). main has no
_load_shard_index* methods at all, so it was introduced and left unused by this
PR. The _maybe and _maybe_sync variants it wrapped remain and are used.

Verified: full sharding + nested-sharding + parity + pipeline suites stay green,
ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ring

The FusedCodecPipeline class docstring still described sharded writes as using
"byte-range writes via set_range_sync" — but byte-range-write support was removed
from this PR (set_range_sync / SupportsSetRange are gone). Sharded writes now take
the codec's synchronous full-shard-rewrite path. Docstring only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This branch's docstrings/comments had introduced RST-style ``double-backtick``
inline literals, which this project does not use (plain single backticks only —
no RST roles or double-backticks). Converted the 25 occurrences across the
sharding codec, codec_pipeline, and fsspec store docstrings/comments to single
backticks. Style only; no behavior change.
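
One way such a conversion could be scripted, shown as a sketch only; the commit does not say how the change was made, and `to_single_backticks` is a hypothetical helper:

```python
import re

# Matches ``literal`` on a single line, where the literal contains no backtick.
DOUBLE_BACKTICK = re.compile(r"``([^`\n]+)``")


def to_single_backticks(text: str) -> str:
    """Rewrite RST-style ``literal`` spans as single-backtick `literal` spans."""
    return DOUBLE_BACKTICK.sub(r"`\1`", text)


sample = "Uses ``set_range_sync`` when `store` allows it."
converted = to_single_backticks(sample)
```

Spans that already use single backticks are left alone, because the pattern requires two backticks on each side.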

Also confirmed (via git blame, this-branch lines only) there are no remaining
references to removed/outdated designs: the byte-range-write (set_range) mentions
and the "separating IO from compute" framing were already corrected earlier in
this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Labels

performance Potential issues with Zarr performance (I/O, memory, etc.)
