Background
The Climate API supports S3-compatible object storage as a backend for GeoZarr stores. When data lives in the cloud, running processing on a local server means transferring large volumes of data across the network before any computation can happen. For large datasets (30 years of daily CHIRPS, ERA5 reanalysis) this is slow and expensive.
The natural solution is to move the processing to where the data is — compute close to storage.
What we want to investigate
1. Serverless functions (FaaS)
Trigger a function in the same region as the object store. The function reads a spatial/temporal slice of the Zarr store, processes it, and writes results back.
- AWS Lambda + S3, Google Cloud Functions + GCS, Cloudflare Workers + R2
- Relevant for lightweight per-tile or per-timestep operations
- Cold start latency may be a concern for interactive queries
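The per-invocation unit of work can be sketched as a small handler: select a slice, reduce it, return the result. This is a hedged sketch, not a tested deployment; the event fields, store URL, and variable names are placeholders.

```python
# Hypothetical FaaS handler: read a spatial/temporal slice of a Zarr store
# and reduce it. The slicing logic is factored out so it can be tested
# against any xarray Dataset.
import xarray as xr


def handle_slice(ds: xr.Dataset, var: str, lat, lon, time) -> float:
    """Select a lat/lon/time slice and compute its mean -- the per-invocation unit of work."""
    sliced = ds[var].sel(lat=slice(*lat), lon=slice(*lon), time=slice(*time))
    return float(sliced.mean())


def lambda_handler(event, context):
    # In a real deployment event["store_url"] would point at the co-located
    # bucket (e.g. "s3://<bucket>/chirps.zarr" opened via s3fs); placeholder here.
    ds = xr.open_zarr(event["store_url"])
    return {
        "mean": handle_slice(ds, event["var"], event["lat"], event["lon"], event["time"])
    }
```

Keeping `handle_slice` pure means the same function body can later be reused by a Dask worker or an OGC Processes execution unit.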
2. Managed Dask / distributed computing
Spin up a Dask cluster co-located with the object store and submit a computation graph. Dask natively understands Zarr chunking and can parallelise across chunks without moving data to a central node.
- Coiled (managed Dask-as-a-service, S3/GCS/Azure backends)
- Dask Gateway on Kubernetes (self-hosted, fits sovereign deployment model)
- Natural fit since the codebase already uses Xarray, which integrates directly with Dask
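The appeal of this option is that the Xarray code does not change between local and cluster execution: `.chunk()` turns the arrays into Dask arrays (one task per chunk), and `.compute()` runs the graph on whatever scheduler is attached. A minimal local sketch, using a synthetic dataset in place of the S3-backed store:

```python
# Dask model in miniature: build a lazy graph over chunks, then compute.
# With Coiled or Dask Gateway, only the attached client/scheduler changes;
# this exact code would run unchanged against the remote cluster.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"precip": (("time", "lat", "lon"), np.ones((365, 10, 10)))},
)

# One task per chunk: here, five chunks of 73 timesteps each.
chunked = ds.chunk({"time": 73})

# Nothing is computed yet; this only builds the task graph.
annual_mean = chunked["precip"].mean("time")

result = annual_mean.compute()  # executes on the attached scheduler
```

Against the real store, the `xr.Dataset` construction would be replaced by `xr.open_zarr("s3://...")`, which preserves the store's native chunking.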
3. OGC API Processes with remote execution units
Define processing steps as OGC API Processes (step 2 of the roadmap) but allow execution units to run remotely — either as a managed job on a cloud provider or via a worker that has direct access to the object store.
- Keeps the API surface standard regardless of where execution happens
- Execution location becomes a deployment configuration, not an API concern
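Concretely, the process description a client retrieves would be identical wherever execution happens. A minimal sketch following the OGC API Processes description shape; the process id, inputs, and schemas are hypothetical:

```json
{
  "id": "zonal-mean",
  "version": "0.1.0",
  "jobControlOptions": ["async-execute"],
  "inputs": {
    "collection": {"schema": {"type": "string"}},
    "bbox": {"schema": {"type": "array", "items": {"type": "number"}}}
  },
  "outputs": {
    "result": {"schema": {"type": "string", "contentMediaType": "application/json"}}
  }
}
```

Whether the job behind this description runs on a local worker or a cloud-managed one is invisible to the client, which only polls the job status endpoint.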
4. openEO back-ends
openEO defines a standard API for Earth Observation processing that maps directly onto Zarr-backed collections. Several cloud providers run openEO back-ends. The STAC catalogue from step 1 of the roadmap gives openEO clients a natural entry point via load_stac().
- No server-side implementation needed if we rely on existing back-ends
- Worth evaluating whether sovereign deployments can connect to a shared openEO back-end for processing, then write results back to local storage
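The client-side flow can be sketched with the openEO Python client; the back-end URL, STAC URL, extents, and band name below are placeholders, not real endpoints:

```python
# Hedged sketch: an openEO client loads a collection via the STAC catalogue
# (roadmap step 1), reduces it on the back-end, and downloads only the result.
import openeo

connection = openeo.connect("https://<openeo-backend>/openeo/1.2")

# load_stac() points the back-end at the STAC collection; computation stays
# next to the data on the provider's side.
cube = connection.load_stac(
    "https://<stac-catalogue>/collections/chirps-daily",
    spatial_extent={"west": 28.8, "south": -2.9, "east": 30.9, "north": -1.0},
    temporal_extent=["1994-01-01", "2023-12-31"],
    bands=["precipitation"],
)

mean = cube.reduce_dimension(dimension="t", reducer="mean")
mean.download("chirps_mean.tif")
```

Only the final reduced raster crosses the network, which is the whole point of compute-close-to-storage.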
5. Lithops / serverless Xarray
Lithops is a general-purpose Python framework for running functions across serverless backends (AWS Lambda, IBM Cloud Functions, Kubernetes, etc.) without managing infrastructure. It is a good fit for cloud-native Xarray/Zarr workflows, where each chunk maps naturally onto one function invocation.
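The Lithops model can be sketched as mapping a function over chunk indices, with each invocation running as a serverless function next to the bucket. This is a hedged sketch: the bucket, store, and variable names are placeholders, and the backend comes from the Lithops configuration file rather than the code.

```python
# Hypothetical Lithops workflow: one serverless invocation per time chunk.
import lithops


def chunk_mean(time_index):
    # Each worker opens only the slice it needs; zarr + s3fs resolve the
    # byte-range reads against the co-located object store.
    import xarray as xr
    ds = xr.open_zarr("s3://<bucket>/chirps.zarr")  # placeholder URL
    return float(ds["precip"].isel(time=time_index).mean())


# The backend (Lambda, IBM Cloud Functions, ...) is chosen by the Lithops
# config, not by this code.
fexec = lithops.FunctionExecutor()
futures = fexec.map(chunk_mean, range(12))
means = fexec.get_result(futures)
```

The programming model is deliberately close to `concurrent.futures`, which keeps the workflow portable across serverless providers.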
Key questions
- Which cloud providers are realistic for DHIS2 national deployments?
- Is the processing model pull (Climate API requests computation) or push (external tool triggers processing via the API)?
- Should cloud processing be transparent to the API consumer, or an explicit deployment option?
- How do we handle sovereign deployment constraints — some countries cannot use US/EU cloud providers?
Relationship to the roadmap
This feeds into step 2 (data processing) and step 3 (workflows). The storage backend decision (step 1) should not foreclose the processing options investigated here — ideally the same Zarr store on S3 can be read by both a local Dask worker and a cloud-native function without any changes to the data format or the API.