-
-
Notifications
You must be signed in to change notification settings - Fork 412
docs: add AGENTS.md for AI-assisted development #4041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
maxrjones
wants to merge
2
commits into
zarr-developers:main
Choose a base branch
from
maxrjones:docs/agents
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+108
−0
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| # AGENTS.md | ||
|
|
||
| Guidance for AI coding agents working in this repository. Human contributors should also read `docs/contributing.md`, especially the **AI-assisted contributions** section: the human submitting the PR must understand and being able to explain every change, and PR descriptions / review responses must be in the human's own words. Keep diffs small and reviewable. | ||
|
|
||
| ## Project overview | ||
|
|
||
| `zarr-python` (PyPI package `zarr`) implements chunked, compressed, N-dimensional arrays for Python. This is the 3.x line, which reads and writes both Zarr format v2 and v3 data. Requires Python >= 3.12. The public API is re-exported from `src/zarr/__init__.py` (`Array`, `Group`, `create_array`, `open`, etc.). | ||
| ## Related projects | ||
|
|
||
| `zarr-python` depends on the contents of the Zarr [v2](https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html) and Zarr [v3](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html) storage specifications. We are committed to compliance with the specs, and also consistency with other Zarr implementations, namely: | ||
| - [TensorStore](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html) (C++ / Python) | ||
| - [Zarrs](https://github.com/zarrs/zarrs) (Rust) | ||
| - [Zarrita](https://github.com/manzt/zarrita.js/) (Javascript) | ||
| ## Environment and common commands | ||
|
|
||
| Development uses **hatch** (with `uv` as the installer) for managed environments, and `uv` directly for ad-hoc commands. The canonical test environments are named `test.py3.{12,13,14}-{minimal,optional}`; `optional` pulls in remote stores (fsspec, obstore, s3fs), the CLI, and universal-pathlib. | ||
|
|
||
| ```bash | ||
| # Run the full test suite in a managed env (benchmarks excluded by default) | ||
| hatch env run --env test.py3.12-optional run | ||
|
|
||
| # Run with coverage (XML report); coverage must reach 100% for CI to pass | ||
| hatch env run --env test.py3.12-optional run-coverage | ||
|
|
||
| # Ad-hoc test runs with uv (faster iteration than spinning up a hatch env). | ||
| # Prefer uv run pytest for narrow runs; reach for hatch envs for full/coverage runs. | ||
| uv run pytest tests/test_array.py # one test file | ||
| uv run pytest tests/test_array.py::test_name # one test function | ||
| uv run pytest tests/test_array.py -k "expr" # tests matching a -k expression | ||
| uv run pytest "tests/test_array.py::test_name[param-id]" # one parametrized case | ||
| uv run pytest tests/test_array.py -x --lf # stop on first failure, rerun last-failed | ||
|
|
||
| # Run tests in parallel across CPUs with pytest-xdist (-n). Big speedup for the | ||
| # full suite; for a handful of tests the worker startup cost usually isn't worth it. | ||
| uv run pytest -n auto tests/ # auto = one worker per core | ||
| uv run pytest -n 4 tests/test_codecs/ # fixed worker count | ||
|
|
||
| # Type-check (strict mypy over src + tests) — this is what CI's Lint job runs | ||
| uv run --frozen mypy | ||
|
|
||
| # Hypothesis property tests (slow; opt in) | ||
| hatch env run --env test.py3.12-optional run-hypothesis | ||
|
|
||
| # Docs: live-reloading server / strict build | ||
| hatch --env docs run serve | ||
| hatch --env docs run check | ||
| ``` | ||
|
|
||
| Note: `pytest` `testpaths` include `src`, `tests`, and `docs/user-guide`, and `--doctest-modules` is on — docstrings in `src/` are executed as tests, and the user-guide markdown is doctested. `xfail_strict = true` and `filterwarnings = ["error", ...]` mean an unexpected pass or an unfiltered warning fails the suite. | ||
|
|
||
| ## Linting and pre-commit | ||
|
|
||
| The project uses [`prek`](https://github.com/j178/prek) (a drop-in pre-commit runner) against `.pre-commit-config.yaml`. Ruff (lint + format, line length 100) and `mypy` run here, plus codespell, numpydoc validation, towncrier-check, and zizmor. | ||
|
|
||
| ```bash | ||
| prek run --all-files # all hooks | ||
| prek run --last-commit # only files changed in the last commit | ||
| ``` | ||
|
|
||
| Two local rules worth knowing because they will reject otherwise-valid code: | ||
| - **No `.lstrip("...")` / `.rstrip("...")` with multi-char string args** (`ban-lstrip-rstrip`) — these are character-set operations and almost always a bug where `removeprefix`/`removesuffix` was intended. | ||
| - **numpydoc validation** is enforced on docstrings (Parameters/Returns sections must match signatures). | ||
|
|
||
| ## Changelog (required for most PRs) | ||
|
|
||
| Every user-facing change needs a news fragment in `changes/` named `{issue-or-pr-number}.{type}.md`, where type is one of `feature`, `bugfix`, `doc`, `removal`, `misc`. Generated into `docs/release-notes.md` by towncrier at release time. `towncrier create` scaffolds one. The `towncrier-check` pre-commit hook will flag PRs missing a fragment. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ### Async core with a generated sync facade | ||
|
|
||
| Every storage and array operation is fundamentally **async**. The two-layer pattern is pervasive and important to preserve: | ||
|
|
||
| - `AsyncArray` / `AsyncGroup` (`core/array.py`, `core/group.py`) hold the real async implementations. | ||
| - `Array` / `Group` are thin synchronous wrappers that drive the async methods through `sync()` (`core/sync.py`), which runs coroutines on a single dedicated background event loop (a module-global loop in its own thread). Never call `asyncio.run` or create new loops in library code — route through `sync()`. | ||
| - The user-facing top-level functions split the same way: `api/asynchronous.py` (async) and `api/synchronous.py` (sync wrappers). `__init__.py` re-exports the synchronous ones. | ||
|
|
||
| When adding functionality, implement it on the `Async*` class / in `api/asynchronous.py` first, then add the sync wrapper. | ||
|
|
||
| ### Store abstraction | ||
|
|
||
| `abc/store.py` defines the abstract `Store` (async `get`/`set`/`delete`/`list`/`exists`, partial reads via `RangeByteRequest`/`OffsetByteRequest`/`SuffixByteRequest`). Concrete stores live in `storage/`: `_local.py`, `_memory.py`, `_zip.py`, `_fsspec.py` (any fsspec filesystem), `_obstore.py` (Rust `obstore` backend), plus `_wrapper.py` and `_logging.py` decorators. `StorePath` (`storage/_common.py`) couples a store with a key prefix and is what arrays/groups actually hold. | ||
|
|
||
| ### Codecs and the codec pipeline | ||
|
|
||
| Reading/writing a chunk runs it through an ordered pipeline of codecs, typed by what they transform (`abc/codec.py`): `ArrayArrayCodec` → `ArrayBytesCodec` (exactly one, e.g. `bytes`) → `BytesBytesCodec` (e.g. compressors). `core/codec_pipeline.py` (`BatchedCodecPipeline`) orchestrates encode/decode and batches/concurrency-limits chunk I/O. Built-in codecs are in `codecs/` (`blosc`, `gzip`, `zstd`, `crc32c`, `transpose`, `sharding`, `vlen_utf8`, `scale_offset`, ...); `codecs/numcodecs/` and `codecs/_v2.py` bridge to numcodecs for v2 filters/compressors. **Sharding** (`codecs/sharding.py`) is itself an `ArrayBytesCodec` that packs a grid of sub-chunks into one store object — the most intricate codec. | ||
|
|
||
| ### Metadata and v2/v3 duality | ||
|
|
||
| `AsyncArray` is generic over `ArrayV2Metadata | ArrayV3Metadata` (`core/metadata/`). Most of the codebase branches on format version through this metadata type rather than ad-hoc version flags. v2 uses `.zarray`/`.zattrs`/`.zgroup`; v3 uses `zarr.json`. `metadata/migrate_v3.py` and `core/metadata/` hold the conversion and parsing logic. When touching read/write paths, check both metadata classes. | ||
|
|
||
| ### dtype system | ||
|
|
||
| `core/dtype/` is a pluggable dtype layer (not raw NumPy dtypes) that maps Zarr data types to/from NumPy and handles fill values, endianness, and v2-vs-v3 dtype naming. `dtype.py` at the package root re-exports it. | ||
|
|
||
| ### Registry and config | ||
|
|
||
| `registry.py` is the extension mechanism: codecs, pipelines, buffers, ndbuffers, and chunk-key-encodings are registered by name and discovered via entry points (`zarr.codecs`, `zarr.codec_pipelines`, etc.), so third-party packages can plug in. `core/config.py` uses `donfig` (the `zarr.config` object) for runtime settings — default codecs, concurrency limits, the active buffer/pipeline implementation. Resolving "which class implements X" generally goes through config → registry. | ||
|
|
||
| ### Buffers and the GPU path | ||
|
|
||
| `abc/buffer.py` + `core/buffer/` abstract `Buffer` (1-D bytes) and `NDBuffer` (N-D array) so the same code runs on CPU (`core/buffer/cpu.py`, NumPy) or GPU (`core/buffer/gpu.py`, CuPy). GPU-only tests are marked `@pytest.mark.gpu`. Don't hardcode `np.ndarray`; go through the buffer prototype. | ||
|
|
||
| ## Conventions | ||
|
|
||
| - **Public API surface:** anything added to `__all__` in `__init__.py` / `api/synchronous.py` is public and must have a numpydoc docstring and an entry under `docs/api/*.md`. New user-facing behavior also belongs in `docs/user-guide/`. | ||
| - **Experimental features** go under `src/zarr/experimental/`, are documented in `docs/user-guide/experimental.md`, and carry no stability guarantees (may be removed in any release). The team aims to promote or remove them within ~6 months. | ||
| - **Versioning is EffVer, not SemVer** — breaking changes are possible in minor (and rarely patch) releases, judged by upgrade effort. Prefer backwards-compatible changes and deprecation warnings over hard breaks. | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.