
Support external Zarr stores as data sources (no local ingest) #46

@turban

Overview

Not all climate data needs to be downloaded and ingested into our own storage. Many high-quality, analysis-ready Zarr datasets are already publicly available on cloud object storage and can be read directly. We should support connecting to these external Zarr stores as first-class data sources in the Climate API.

A concrete example: dynamical.org hosts open, analysis-ready weather and climate forecasts as Zarr stores on S3 — directly queryable with xarray/zarr without any local copy.
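To illustrate the access pattern, a sketch of opening such a store lazily with xarray (the function name and URL below are placeholders, not an agreed API or a real store path):

```python
from typing import Optional

import xarray as xr


def open_external_store(url: str, storage_options: Optional[dict] = None) -> xr.Dataset:
    """Open a remote Zarr store lazily; no data is copied locally.

    consolidated=True reads the single .zmetadata document instead of
    probing every array in the store individually.
    """
    return xr.open_zarr(url, storage_options=storage_options, consolidated=True)


# e.g. a public bucket read anonymously (placeholder URL, not a real store):
# ds = open_external_store("s3://example-bucket/ecmwf-ifs-ens.zarr",
#                          storage_options={"anon": True})
```

Reads only pull the chunks a query actually touches, which is what makes the no-local-ingest model viable.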

Related: #40

Proposed behaviour

  • The API can query an external Zarr store at its remote URL in the same way it queries locally ingested data
  • No data is downloaded or stored — reads happen directly against the remote store at query time
  • Two tiers of external sources:
    • Pre-configured — a curated list of well-known public datasets bundled with the API (e.g. dynamical.org ECMWF IFS ENS)
    • User-defined — users can register their own external Zarr URL, with optional credentials (e.g. private S3 bucket, authenticated endpoint)
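One way to model the two tiers (a sketch; class and field names are illustrative, not an agreed API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalZarrSource:
    """A remote Zarr store registered as a data source."""
    id: str
    url: str
    label: str
    # e.g. fsspec-style options: credentials, region, anonymous access
    storage_options: Optional[dict] = None
    preconfigured: bool = False


# Tier 1: curated sources bundled with the API (URL is a placeholder).
PRECONFIGURED = {
    "dynamical-ecmwf-ifs-ens": ExternalZarrSource(
        id="dynamical-ecmwf-ifs-ens",
        url="s3://dynamical-…",  # placeholder, see table below
        label="dynamical.org ECMWF IFS ENS",
        preconfigured=True,
    ),
}


def register_user_source(registry: dict, source: ExternalZarrSource) -> None:
    """Tier 2: user-defined sources, rejected on an ID collision."""
    if source.id in registry or source.id in PRECONFIGURED:
        raise ValueError(f"source id already registered: {source.id}")
    registry[source.id] = source
```

Keeping both tiers behind the same `ExternalZarrSource` shape means the query path never has to care which tier a source came from.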

Why this is valuable

  • Avoids duplicating large global datasets that are already maintained upstream
  • Numerical weather prediction (NWP) forecast data is updated continuously; consuming it directly removes the need for a recurring download/ingest pipeline
  • Enables access to datasets we would never host ourselves (resolution, size, licensing)
  • Complements our ingested datasets: use external sources for global/forecast context, local ingest for bias-corrected or region-specific data

Implementation sketch

  • Abstract the data access layer so both local Zarr stores and remote Zarr URLs implement the same interface
  • For pre-configured sources, ship a registry (YAML/JSON) mapping a dataset ID to its Zarr store URL + variable/dimension metadata
  • For user-defined sources, add a registration endpoint (or config block) accepting a Zarr URL, optional storage options (S3 credentials, region), and a human-readable label
  • Validate on registration: open the Zarr store, check expected variables/dimensions are present, surface any access errors early
  • Consider caching consolidated metadata (.zmetadata) locally to avoid repeated round-trips on every request
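The pre-configured registry could be as simple as a YAML file like this (the schema and field names are a suggestion, and the URL stays a placeholder):

```yaml
# registry.yaml — curated external Zarr sources (illustrative schema)
sources:
  dynamical-ecmwf-ifs-ens:
    label: "dynamical.org ECMWF IFS ENS"
    url: "s3://dynamical-…"          # placeholder; pin the exact store path
    variables: [temperature_2m, precipitation]   # checked on startup
    dimensions: [time, latitude, longitude]
```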
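Registration-time validation can run against the store's consolidated metadata before the source is accepted. A minimal sketch over the raw `.zmetadata` JSON (the helper name is made up; in Zarr v2 consolidated metadata, each array appears as a `<name>/.zarray` key under the `metadata` mapping):

```python
import json


def validate_consolidated_metadata(zmetadata: str, expected_vars: set) -> None:
    """Raise early if expected variables are absent from a .zmetadata document.

    `zmetadata` is the raw JSON text of the store's consolidated metadata.
    """
    meta = json.loads(zmetadata)["metadata"]
    present = {key[: -len("/.zarray")] for key in meta if key.endswith("/.zarray")}
    missing = expected_vars - present
    if missing:
        raise ValueError(f"store is missing expected variables: {sorted(missing)}")
```

Because `.zmetadata` is one small JSON object, this check (and the local cache of it suggested above) costs a single round-trip rather than one per array.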

Open questions

  • How do we handle latency / availability — should we cache tiles or analysis results for external sources?
  • Do we need a STAC-based discovery step to find the right variable in a remote store, or is direct URL + variable name enough?
  • Should pre-configured sources be versioned/pinned (store URL may change upstream)?
  • How should we manage and store credentials for private external stores?

Example pre-configured sources to consider

| Source | URL pattern | Notes |
| --- | --- | --- |
| dynamical.org | `s3://dynamical-…` | ECMWF IFS ENS — open, frequently updated |
