Add sparse-read primitives: shards_initialized and read_regions#4028
Add sparse-read primitives: shards_initialized and read_regions#4028espg wants to merge 7 commits into
shards_initialized and read_regions#4028Conversation
| @@ -0,0 +1,157 @@ | |||
| """Benchmark for sparse-array reads via the chunk-access primitives. | |||
There was a problem hiding this comment.
not sure we want this checked in -- we have a benchmarks directory already, could you see if these code paths are already exercised there? Those benchmarks get run in CI, which is nice.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4028 +/- ##
==========================================
+ Coverage 93.53% 93.55% +0.02%
==========================================
Files 88 88
Lines 11894 11932 +38
==========================================
+ Hits 11125 11163 +38
Misses 769 769
🚀 New features to boost your workflow:
|
|
I'm not sure this approach would be useful, but we could also frame the question "how should we store our knowledge that a chunk is missing" as a caching problem, and express this in the storage layer by caching missing keys. I'm not sure if our experimental storage cache does this already. |
|
@d-v-b I had a look through the My understanding is that passing |
I don't think it's an either or-- it probably makes sense to have both populated shard/chunk discovery, and enable some sort of caching for which regions/shards/chunks are empty. It's a bit hard for me to see the proper design pattern for this ... sparse arrays often are realized with plans to revisit and fill them. So if we are caching regions that were previously empty, is the proper path to start run async shard discovery while starting to read from the cached keys? Or is this fully on the caller to update the cache status and mapping? |
exactly, we would need to modify the cache store to remember misses, and evict the cached miss when we write to that object. I feel like someone raised an issue about this a while back... |
| @pytest.mark.parametrize("store", ["memory", "memory_get_latency"], indirect=["store"]) | ||
| @pytest.mark.parametrize("shards", sparse_shards, ids=str) | ||
| @pytest.mark.parametrize("reader", ["full", "read_regions"], ids=str) | ||
| def test_sparse_read( |
There was a problem hiding this comment.
I'm not really sure we need this benchmark -- it basically proves that reading fewer chunks is faster than reading more chunks? I think just the tests confirming that the chunk discovery routine worked are sufficient.
|
|
||
|
|
||
| @pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"]) | ||
| def test_list_strategy_ignores_non_chunk_objects(store: Store) -> None: |
There was a problem hiding this comment.
shouldn't this test insert some non-chunk objects in the store? and I don't think you need any real chunks present to test this, and I don't think it needs to be parametrized over different stores. Just use memory storage, create an array (dont write any chunks), and set b"blablabla" to key "array/foo", and ensure that the initialized shards are reported to be empty
| @pytest.mark.parametrize("store", ["local", "memory"], indirect=["store"]) | ||
| @pytest.mark.parametrize( | ||
| ("setup_name", "expected_count"), | ||
| [ | ||
| ("sparse_1d", 2), | ||
| ("dense_1d", 4), | ||
| ("sparse_2d", 2), | ||
| ("sharded_sparse", 2), | ||
| ("all_empty", 0), | ||
| ("all_populated", 4), | ||
| ], | ||
| ) | ||
| def test_shards_initialized_counts(store: Store, setup_name: str, expected_count: int) -> None: | ||
| arr, _ = _CA_SETUPS[setup_name](store) | ||
| assert len(zarr.shards_initialized(arr)) == expected_count |
There was a problem hiding this comment.
I think these tests can be a lot simpler. I would start with a tuple of regions (parametrize over different tuples of regions), then create the array, then write the regions, then check that the initialized regions are exactly the ones you wrote. This will remove the need for a few of these test functions.
|
If your OK with me pushing to this branch I'd be happy addressing some of my concerns about test organization. I think the core functionality is good but I want some general approval from other devs before we commit to new public API. We might need a little bikeshedding over the function names, for example. This PR adds We also need to ensure that people understand that these functions don't introspect the contents of shards, so a shard file that has no subchunks written will appear as an initialized region. @zarr-developers/python-core-devs please have a look. I'd like feedback from at someone other than me before committing to the addition of these new routines. |
|
@d-v-b feel free to push to the branch and get things better lined up for a merge. Very open on the naming conventions!
Happy to tackle a prototype for this in another PR |
Related to / closes #3929 (first of two PRs)
Summary
Adds two composable, public functions for efficiently reading sparse arrays — arrays where most chunks are empty and resolve to the fill value:
zarr.shards_initialized(array, *, strategy="auto")— discover which shards (or chunks, when unsharded) actually exist in the store.zarr.read_regions(array, regions=None, *, concurrency=None)— concurrently read and decode array regions — by default only the populated ones — yielding each(region, data)pair spatially resolved to its location in the array.Both are available synchronously (
zarr.*,zarr.api.synchronous) and asynchronously (zarr.api.asynchronous); the asyncread_regionsis a generator that streams each region as soon as its data is available. Nothing about the existingarr[:]path changes — these are additive.Motivation
On a sparse array,
arr[:]pays a store round-trip + codec call for every chunk, including empty ones. In the issue's 49,152-chunk HEALPix example (~3% populated), ~150 s of the 173 s wall time is spent iterating empty chunks with zero useful I/O.These primitives let callers touch only the populated chunks, so cost scales with the populated count rather than the total count.
Design
This follows the direction from the discussion in #3929: rather than mutable state on the array that changes how
__getitem__behaves, expose plain, composable functions -- decomposes into two pieces:Discover the chunks that exist (
shards_initialized). Reported at the granularity of stored objects — shard keys for sharded arrays, chunk keys otherwise — because that is what physically exists in the store and is what a singlelist_prefixreturns. Two strategies, selected bystrategy=:"list"— onestore.list_prefix, filtered to this array's shard grid (ignoreszarr.jsonand any other objects sharing the prefix)."probe"— concurrent per-keyexists()checks; avoids listing a prefix that may hold many unrelated objects, and is faster when there are few possible keys."auto"(default) — probe for small grids, list otherwise.Read + decode those chunks, spatially resolved (
read_regions). Keyed on array regions (a tuple of slices) rather than key strings, on the assumption that regions are the more reusable handle. Reads concurrently and yields(region, data)in completion order. For sharded arrays it yields whole shard regions; empty inner chunks within a populated shard are still skipped efficiently by the existingShardingCodecpartial-decode path.The "pack N decoded chunks into one contiguous array" step that
arr[:]performs is deliberately not forced here — pipelines that operate per chunk skip it for a further performance win. Apack/read_sparseconvenience will follow in a second PR underzarr.experimental.Implementation notes
_initialized_shards) returns(coords, key)pairs;shards_initializedprojects it to keys andread_regionsprojects it to regions, so neither has to reverse-parse the other's output. This mirrors the existing_nchunks_initialized→nchunks_initializedand_iter_*core/wrapper pattern inarray.py._shards_initialized(used bynchunks_initialized/nshards_initialized/info) now delegates to that same core, removing duplicatedlist_prefix-and-intersect logic and incidentally fixing an O(grid×objects) membership check (list → set).API
Benchmarks
bench/empty_chunks.pysweeps chunk count at ~3% sparsity, comparing stockarr[:]againstread_regions+ pack and a per-region stream:LocalStore plateaus around ~10–13×; remote object stores see much more (~64× in the issue's S3 report) because each skipped empty chunk avoids a network round-trip.
Testing
tests/test_chunk_access.py(memory + local stores; unsharded, sharded, 2-D; all-empty / all-populated / sparse layouts):"list"strategy ignores non-chunk objects sharing the prefix;read_regionsoutput reproducesarr[:]byte-for-byte;shards_initialized;concurrency=1paths;Existing
test_array/test_api(incl. the sync/async docstring-match test) andtest_zarrpass unchanged.TODO:
docs/user-guide/*.mdchanges/AI Disclosure