_decode_cf_datetime_dtype triggers 2 store reads per time variable — expensive on remote stores #11303

@aladinor

Problem

_decode_cf_datetime_dtype (xarray/coding/times.py:339-371) reads the first and last element of every time-encoded variable during open_dataset / open_datatree to infer the decoded dtype. On local backends these reads are memory-mapped and free; on remote stores (zarr + S3, zarr + icechunk) each read is a round-trip through zarr.sync() → asyncio, even when chunks are already cached in the backend.

Cost scales as 2 × N_time_variables reads per open.
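To make the scaling concrete, here is a minimal toy model of the read pattern, not xarray's actual code. `CountingArray`, `probe_first_and_last`, and `count_probe_reads` are hypothetical names introduced for illustration; each indexing call stands in for one store round-trip.

```python
class CountingArray:
    """Toy stand-in for a lazily loaded remote array; counts element reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def __getitem__(self, index):
        self.reads += 1  # each indexing call models one store round-trip
        return self._values[index]


def probe_first_and_last(array):
    """Hypothetical probe: read the first and last element, as described above."""
    return array[0], array[-1]


def count_probe_reads(n_time_variables):
    """Total reads issued when every time variable is probed once."""
    arrays = [CountingArray(range(10)) for _ in range(n_time_variables)]
    for array in arrays:
        probe_first_and_last(array)
    return sum(array.reads for array in arrays)


print(count_probe_reads(106))  # 2 reads per variable -> 212
```

For the 106 time variables in the profile below, this model gives 212 reads per open, independent of array size.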

Downstream impact: this hits any workflow that opens many time-encoded groups from remote zarr — radar (CfRadial2: 1 time per sweep, 100+ sweeps per volume), CMIP model archives, satellite time-series, Pangeo-style DataTrees. Interactive notebooks, dashboards, and tile servers pay the cost on every open.

Impact — cProfile on main, 107-group DataTree (nexrad-arco/KLOT icechunk, 106 time variables)

total open_datatree time: 77.56s

Function                                    cumtime   share
_decode_cf_datetime_dtype                   35.17s    45%
  └─ first_n_items                          25.16s    32%
  └─ last_item                               9.51s    12%
  └─ decode_cf_datetime (actual decode)      0.09s    0.1%

The CPU cost of CF decoding is negligible. Nearly all of the 35 seconds (25.16s + 9.51s = 34.67s) is spent in first_n_items and last_item, i.e. I/O round-trip overhead.
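As a rough back-of-the-envelope check, the profile numbers can be turned into an average per-read latency. This assumes exactly two reads per time variable and that the cumulative times of first_n_items and last_item are entirely read latency; the variable names are illustrative.

```python
# Figures copied from the cProfile table above.
probe_s = 35.17          # _decode_cf_datetime_dtype cumulative time
first_s = 25.16          # first_n_items
last_s = 9.51            # last_item
n_time_variables = 106

reads = 2 * n_time_variables        # assumed: one first + one last read each
io_s = first_s + last_s             # time spent in the two element reads
per_read_s = io_s / reads           # average latency of a single read
io_fraction = io_s / probe_s        # share of the probe spent reading

print(f"{reads} reads, {per_read_s * 1000:.0f} ms per read, "
      f"{io_fraction:.1%} of probe time")
```

Under these assumptions each read averages roughly 160 ms, and the two reads account for about 98.6% of the probe's cumulative time.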

Reproducer (public icechunk store, anonymous S3)

import icechunk, xarray as xr
from time import time

storage = icechunk.s3_storage(
    bucket="nexrad-arco", prefix="KLOT", region="us-east-1", anonymous=True,
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

start = time()
dtree = xr.open_datatree(session.store, engine="zarr", zarr_format=3,
                         consolidated=False, chunks={})
print(f"open_datatree: {time() - start:.1f}s")
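A profile like the one in the Impact section can be collected by wrapping the open call in cProfile. The helper below is a minimal sketch using only the standard library; `profile_call` is a hypothetical name, not part of xarray.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)  # keep only the top rows
    return result, stream.getvalue()


# Usage with the reproducer above (requires network access):
# dtree, report = profile_call(
#     xr.open_datatree, session.store, engine="zarr", zarr_format=3,
#     consolidated=False, chunks={},
# )
# print(report)
```

Sorting by cumulative time surfaces the callers whose subtree dominates, which is how `_decode_cf_datetime_dtype` stands out even though its own CPU time is small.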

Why the reads exist

The comment at times.py:346 (2018) explains: "Verify that at least the first and last date can be decoded successfully. Otherwise, tracebacks end up swallowed by Dataset.__repr__."

Empirically, the original failure is no longer reproducible on current xarray: the modern repr returns "..." for lazy arrays without triggering a decode. But the reads serve a second, undocumented purpose. With use_cftime=None and chunked data, decode_cf_datetime may return either datetime64[ns] or object (cftime), depending on whether the values overflow the pandas ns range. dask's map_blocks(func, array, dtype=dtype) casts output to the declared dtype, so a wrong declaration silently corrupts overflow values.
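The dtype ambiguity can be illustrated with a simplified model of the overflow check. This is a sketch, not xarray's logic: it assumes integer offsets from 1970-01-01, a standard calendar, and nanosecond resolution only, and `fits_datetime64_ns` and `probe_decoded_dtype` are hypothetical helpers.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63) + 1  # -2**63 is reserved for NaT in datetime64

NS_PER_UNIT = {
    "seconds": 10**9,
    "hours": 3600 * 10**9,
    "days": 86400 * 10**9,
}


def fits_datetime64_ns(value, unit):
    """True if `value` units since 1970-01-01 fits in int64 nanoseconds."""
    ns = value * NS_PER_UNIT[unit]
    return INT64_MIN <= ns <= INT64_MAX


def probe_decoded_dtype(first, last, unit):
    """Guess the decoded dtype from the first and last encoded values."""
    if fits_datetime64_ns(first, unit) and fits_datetime64_ns(last, unit):
        return "datetime64[ns]"
    return "object"  # would fall back to cftime objects


print(probe_decoded_dtype(0, 106751, "days"))  # within the ns range
print(probe_decoded_dtype(0, 106752, "days"))  # one day past the ns range
```

Two arrays with identical encoding attributes can therefore decode to different dtypes, which is why the dtype cannot be declared from metadata alone in this model.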
