`_decode_cf_datetime_dtype` triggers 2 store reads per time variable — expensive on remote stores #11303

## Problem

`_decode_cf_datetime_dtype` (`xarray/coding/times.py:339-371`) reads the **first and last element** of every time-encoded variable during `open_dataset` / `open_datatree` to infer the decoded dtype. On local backends these reads are memory-mapped and free; on remote stores (zarr + S3, zarr + icechunk) each read is a round-trip through `zarr.sync()` → asyncio, **even when chunks are already cached in the backend**.

Cost scales as `2 × N_time_variables` reads per open.

**Downstream impact:** this hits any workflow that opens many time-encoded groups from remote zarr — radar (CfRadial2: 1 time per sweep, 100+ sweeps per volume), CMIP model archives, satellite time-series, Pangeo-style DataTrees. Interactive notebooks, dashboards, and tile servers pay the cost on every open.

## Impact — cProfile on `main`, 107-group DataTree (`nexrad-arco/KLOT` icechunk, 106 time variables)

```
total open_datatree time: 77.56s
```

| Function | cumtime | share |
|---|---|---|
| `_decode_cf_datetime_dtype` | **35.17s** | **45%** |
|   └─ `first_n_items` | 25.16s | 32% |
|   └─ `last_item` | 9.51s | 12% |
|   └─ `decode_cf_datetime` (actual decode) | 0.09s | 0.1% |

The CPU cost of CF decoding is negligible (0.09s). Nearly all of the 35 seconds (`first_n_items` + `last_item` = 34.67s) is **I/O round-trip overhead**.
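As a rough sanity check, the figures in the table can be turned into a per-read cost. This is only back-of-the-envelope arithmetic on the numbers reported above; it assumes each of the 106 time variables triggers exactly one `first_n_items` call and one `last_item` call, which the profile itself does not confirm.

```python
# Figures copied from the cProfile table above.
n_time_variables = 106
decode_dtype_cumtime_s = 35.17   # _decode_cf_datetime_dtype
first_n_items_cumtime_s = 25.16
last_item_cumtime_s = 9.51

# Assumption: two store reads (first + last element) per time variable.
reads_per_open = 2 * n_time_variables
mean_seconds_per_read = decode_dtype_cumtime_s / reads_per_open

# Fraction of the dtype-inference time spent in the two read helpers.
io_share = (first_n_items_cumtime_s + last_item_cumtime_s) / decode_dtype_cumtime_s

print(f"reads per open:     {reads_per_open}")
print(f"mean time per read: {mean_seconds_per_read * 1000:.0f} ms")
print(f"share spent in reads: {io_share:.1%}")
```

Under that assumption, each read costs on the order of a network round trip rather than a CPU-bound decode, which is consistent with the diagnosis above.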

## Reproducer (public icechunk store, anonymous S3)

```python
import icechunk, xarray as xr
from time import time

storage = icechunk.s3_storage(
    bucket="nexrad-arco", prefix="KLOT", region="us-east-1", anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

start = time()
dtree = xr.open_datatree(session.store, engine="zarr", zarr_format=3,
                         consolidated=False, chunks={})
print(f"open_datatree: {time() - start:.1f}s")
```
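To get a per-function breakdown like the table in the Impact section, the open call can be wrapped in `cProfile`. The helper below is a minimal sketch using only the standard library; `cumtime_by_name` is a hypothetical name introduced here, not part of xarray, and it sums cumulative time over every profiled function with a matching name regardless of module.

```python
import cProfile
import pstats


def cumtime_by_name(func, *args, names=(), **kwargs):
    """Run func under cProfile; return (result, {function name: cumulative seconds})."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler)

    totals = {name: 0.0 for name in names}
    # stats.stats maps (filename, lineno, funcname) ->
    # (primitive calls, total calls, tottime, cumtime, callers).
    for (_file, _line, funcname), entry in stats.stats.items():
        if funcname in totals:
            totals[funcname] += entry[3]  # cumtime
    return result, totals
```

With the reproducer above, it could be called as `cumtime_by_name(xr.open_datatree, session.store, engine="zarr", zarr_format=3, consolidated=False, chunks={}, names=("_decode_cf_datetime_dtype", "first_n_items", "last_item"))`. Absolute timings will vary with network conditions.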

## Why the reads exist

The comment at `times.py:346` (2018) explains: *"Verify that at least the first and last date can be decoded successfully. Otherwise, tracebacks end up swallowed by `Dataset.__repr__`."*

Empirically, this is no longer reproducible on current xarray: the modern repr returns `"..."` for lazy arrays without triggering a decode.

But the reads serve a second, undocumented purpose. With `use_cftime=None` and chunked data, `decode_cf_datetime` may return either `datetime64[ns]` or `object` (cftime), depending on whether the values overflow the pandas nanosecond range. Dask's `map_blocks(func, array, dtype=dtype)` casts output to the declared dtype, so a wrong declaration silently corrupts overflow values.
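The dtype decision described here can be illustrated with a simplified stand-in. This is not xarray's code: `sniff_decoded_dtype` is a hypothetical helper that handles only whole-day offsets on the standard calendar, uses Python's `datetime` (years 1 to 9999) instead of NumPy or cftime, and samples only the first and last values, as the function under discussion does.

```python
from datetime import datetime, timedelta

# Bounds of the int64-nanosecond range behind datetime64[ns],
# rounded inward to whole microseconds.
NS_MIN = datetime(1677, 9, 21, 0, 12, 43, 145225)
NS_MAX = datetime(2262, 4, 11, 23, 47, 16, 854775)


def sniff_decoded_dtype(first_days, last_days, reference):
    """Guess the decoded dtype from the two endpoint values only.

    Returns "datetime64[ns]" if both endpoints fit the nanosecond range,
    otherwise "object" (the cftime fallback described above).
    """
    for offset in (first_days, last_days):
        date = reference + timedelta(days=offset)
        if not (NS_MIN <= date <= NS_MAX):
            return "object"
    return "datetime64[ns]"


# "days since 2000-01-01", one year of data: fits the nanosecond range.
print(sniff_decoded_dtype(0, 365, datetime(2000, 1, 1)))
# "days since 0001-01-01": the first value already falls outside the range.
print(sniff_decoded_dtype(0, 365, datetime(1, 1, 1)))
```

The sketch shows why the answer depends on data values and not only on the `units` attribute: the same encoding can yield either dtype, so any replacement for the two reads would need another way to pick the declared dtype safely.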

