
Investigate cloud-native processing close to GeoZarr data #49

@turban

Background

The Climate API supports S3-compatible object storage as a backend for GeoZarr stores. When data lives in the cloud, running processing on a local server means transferring large volumes of data across the network before any computation can happen. For large datasets (30 years of daily CHIRPS, ERA5 reanalysis) this is slow and expensive.

The natural solution is to move the processing to where the data is — compute close to storage.

What we want to investigate

1. Serverless functions (FaaS)

Trigger a function in the same region as the object store. The function reads a spatial/temporal slice of the Zarr store, processes it, and writes results back.

  • AWS Lambda + S3, Google Cloud Functions + GCS, Cloudflare Workers + R2
  • Relevant for lightweight per-tile or per-timestep operations
  • Cold start latency may be a concern for interactive queries
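To make the per-tile access pattern concrete, here is a minimal sketch of the key-mapping step such a function would perform: translating a requested index slice into the Zarr chunk objects it must fetch from the store. It assumes Zarr v2 chunk naming (`<array>/<i>.<j>.<k>`); the array name `precip` and the shapes are illustrative only.

```python
from itertools import product

def chunk_keys_for_slice(shape, chunks, slices, prefix="precip"):
    """Map an index slice of a Zarr array onto the chunk object keys
    a serverless function would fetch (Zarr v2 naming: 'precip/i.j.k').
    Only the chunks intersecting the slice are listed, so the function
    reads a small fraction of the store."""
    ranges = []
    for dim_len, chunk_len, sl in zip(shape, chunks, slices):
        start, stop, _ = sl.indices(dim_len)
        first = start // chunk_len          # first chunk touched
        last = (stop - 1) // chunk_len      # last chunk touched
        ranges.append(range(first, last + 1))
    return [f"{prefix}/" + ".".join(map(str, idx)) for idx in product(*ranges)]

# ~30 years of daily data, chunked roughly monthly in time, 256x256 spatially
keys = chunk_keys_for_slice(
    shape=(10957, 2048, 2048),
    chunks=(31, 256, 256),
    slices=(slice(0, 31), slice(0, 256), slice(256, 512)),
)
print(keys)  # → ['precip/0.0.1']
```

Because the request touches a single chunk per dimension, the function downloads one object instead of the whole array, which is what makes lightweight per-tile FaaS invocations viable.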

2. Managed Dask / distributed computing

Spin up a Dask cluster co-located with the object store and submit a computation graph. Dask natively understands Zarr chunking and can parallelise across chunks without moving data to a central node.

  • Coiled (managed Dask-as-a-service, S3/GCS/Azure backends)
  • Dask Gateway on Kubernetes (self-hosted, fits sovereign deployment model)
  • Natural fit since the codebase already uses Xarray, which integrates directly with Dask
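The chunk-parallel reduction Dask performs over a Zarr store can be sketched with the stdlib alone: each worker produces a partial result from its own chunk, and only the small partials travel to the coordinating node. The chunk data below is a stand-in for Zarr chunks on S3; with Xarray the equivalent would be along the lines of `xr.open_zarr(store)` followed by `.mean().compute()`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-chunk values, standing in for Zarr chunks in object storage.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

def partial_sum(chunk):
    # In a co-located Dask deployment each worker reads its chunk
    # directly from the object store; raw chunks never move to a
    # central node, only the (sum, count) partials do.
    return (sum(chunk), len(chunk))

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, chunks))

total, count = map(sum, zip(*partials))
print(total / count)  # → 3.5
```

Dask builds exactly this kind of graph automatically from the Zarr chunk layout, which is why it needs no changes to the data format.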

3. OGC API Processes with remote execution units

Define processing steps as OGC API Processes (step 2 of the roadmap) but allow execution units to run remotely — either as a managed job on a cloud provider or via a worker that has direct access to the object store.

  • Keeps the API surface standard regardless of where execution happens
  • Execution location becomes a deployment configuration, not an API concern
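For illustration, a hypothetical execute request (`POST /processes/zonal-mean/execution`) in the OGC API Processes style might look like the following; the process id `zonal-mean` and the input names are assumptions, not an existing endpoint:

```json
{
  "inputs": {
    "data": "s3://climate/chirps.zarr",
    "bbox": [28.8, -2.9, 30.9, -1.0],
    "datetime": "1994-01-01/2023-12-31"
  },
  "outputs": {
    "result": { "format": { "mediaType": "application/json" } }
  }
}
```

The request body is identical whether the execution unit runs next to the object store or on a local worker, which is the point: execution location stays a deployment concern.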

4. openEO back-ends

openEO defines a standard API for Earth Observation processing that maps directly onto Zarr-backed collections. Several cloud providers run openEO back-ends. The STAC catalogue from step 1 of the roadmap gives openEO clients a natural entry point via load_stac().

  • No server-side implementation needed if we rely on existing back-ends
  • Worth evaluating whether sovereign deployments can connect to a shared openEO back-end for processing, then write results back to local storage
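As a sketch of what a client would submit, here is a minimal openEO process graph that loads a collection via `load_stac` and reduces over time; the STAC URL and dimension name are illustrative, and a real back-end may expose them differently:

```json
{
  "process_graph": {
    "load": {
      "process_id": "load_stac",
      "arguments": {
        "url": "https://example.org/stac/collections/chirps",
        "temporal_extent": ["1994-01-01", "2023-12-31"]
      }
    },
    "mean": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {"from_node": "load"},
        "dimension": "t",
        "reducer": {"process_graph": {"m": {
          "process_id": "mean",
          "arguments": {"data": {"from_parameter": "data"}},
          "result": true
        }}}
      },
      "result": true
    }
  }
}
```

The graph is evaluated entirely on the back-end, next to the data; only the reduced result comes back to the client.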

5. Lithops / serverless Xarray

Lithops is a Python framework for running workloads across serverless backends (AWS Lambda, IBM Cloud Functions, etc.) without managing infrastructure. Serverless Xarray/Zarr array processing builds on it — for example, Cubed uses Lithops as an executor for chunked array computation.
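Lithops is configured declaratively, so switching compute/storage backends is a configuration change rather than a code change. A sketch of such a configuration is below; the exact keys should be checked against the Lithops documentation, and the account IDs and bucket name are placeholders:

```yaml
lithops:
  backend: aws_lambda      # compute backend, co-located with the data
  storage: aws_s3          # intermediate storage for partial results

aws:
  region: eu-west-1        # same region as the GeoZarr bucket

aws_lambda:
  execution_role: arn:aws:iam::123456789012:role/lithops-role  # placeholder
  runtime_memory: 1769

aws_s3:
  storage_bucket: climate-zarr   # placeholder bucket name
```

Pointing the same workflow at a different provider would mean swapping the `backend`/`storage` pair, which is attractive for the sovereign-deployment question below.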

Key questions

  • Which cloud providers are realistic for DHIS2 national deployments?
  • Is the processing model pull (Climate API requests computation) or push (external tool triggers processing via the API)?
  • Should cloud processing be transparent to the API consumer, or an explicit deployment option?
  • How do we handle sovereign deployment constraints — some countries cannot use US/EU cloud providers?

Relationship to the roadmap

This feeds into step 2 (data processing) and step 3 (workflows). The storage backend decision (step 1) should not foreclose the processing options investigated here — ideally the same Zarr store on S3 can be read by both a local Dask worker and a cloud-native function without any changes to the data format or the API.
