
Investigate cloud-native processing close to GeoZarr data #49

@turban

Background

The Climate API supports S3-compatible object storage as a backend for GeoZarr stores. When data lives in the cloud, running processing on a local server means transferring large volumes of data across the network before any computation can happen. For large datasets (30 years of daily CHIRPS, ERA5 reanalysis) this is slow and expensive.

The natural solution is to move the processing to where the data is — compute close to storage.

What we want to investigate

1. Serverless functions (FaaS)

Trigger a function in the same region as the object store. The function reads a spatial/temporal slice of the Zarr store, processes it, and writes results back.

  • AWS Lambda + S3, Google Cloud Functions + GCS, Cloudflare Workers + R2
  • Relevant for lightweight per-tile or per-timestep operations
  • Cold start latency may be a concern for interactive queries
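To make the per-tile access pattern concrete, here is a minimal sketch of the key-mapping step such a function would perform: translating a requested index slice into the Zarr chunk objects it must fetch from the store. It assumes Zarr v2 chunk naming (`<array>/<i>.<j>.<k>`); the array name `precip` and the shapes are illustrative only.

```python
from itertools import product

def chunk_keys_for_slice(shape, chunks, slices, prefix="precip"):
    """Map an index slice of a Zarr array onto the chunk object keys
    a serverless function would fetch (Zarr v2 naming: 'precip/i.j.k').
    Only the chunks intersecting the slice are listed, so the function
    reads a small fraction of the store."""
    ranges = []
    for dim_len, chunk_len, sl in zip(shape, chunks, slices):
        start, stop, _ = sl.indices(dim_len)
        first = start // chunk_len          # first chunk touched
        last = (stop - 1) // chunk_len      # last chunk touched
        ranges.append(range(first, last + 1))
    return [f"{prefix}/" + ".".join(map(str, idx)) for idx in product(*ranges)]

# ~30 years of daily data, chunked roughly monthly in time, 256x256 spatially
keys = chunk_keys_for_slice(
    shape=(10957, 2048, 2048),
    chunks=(31, 256, 256),
    slices=(slice(0, 31), slice(0, 256), slice(256, 512)),
)
print(keys)  # → ['precip/0.0.1']
```

Because the request touches a single chunk per dimension, the function downloads one object instead of the whole array, which is what makes lightweight per-tile FaaS invocations viable.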

2. Managed Dask / distributed computing

Spin up a Dask cluster co-located with the object store and submit a computation graph. Dask natively understands Zarr chunking and can parallelise across chunks without moving data to a central node.

  • Coiled (managed Dask-as-a-service, S3/GCS/Azure backends)
  • Dask Gateway on Kubernetes (self-hosted, fits sovereign deployment model)
  • Natural fit since the codebase already uses Xarray, which integrates directly with Dask
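The chunk-parallel reduction Dask performs over a Zarr store can be sketched with the stdlib alone: each worker produces a partial result from its own chunk, and only the small partials travel to the coordinating node. The chunk data below is a stand-in for Zarr chunks on S3; with Xarray the equivalent would be along the lines of `xr.open_zarr(store)` followed by `.mean().compute()`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-chunk values, standing in for Zarr chunks in object storage.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

def partial_sum(chunk):
    # In a co-located Dask deployment each worker reads its chunk
    # directly from the object store; raw chunks never move to a
    # central node, only the (sum, count) partials do.
    return (sum(chunk), len(chunk))

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, chunks))

total, count = map(sum, zip(*partials))
print(total / count)  # → 3.5
```

Dask builds exactly this kind of graph automatically from the Zarr chunk layout, which is why it needs no changes to the data format.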

3. OGC API Processes with remote execution units

Define processing steps as OGC API Processes (step 2 of the roadmap) but allow execution units to run remotely — either as a managed job on a cloud provider or via a worker that has direct access to the object store.

  • Keeps the API surface standard regardless of where execution happens
  • Execution location becomes a deployment configuration, not an API concern
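For illustration, a hypothetical execute request (`POST /processes/zonal-mean/execution`) in the OGC API Processes style might look like the following; the process id `zonal-mean` and the input names are assumptions, not an existing endpoint:

```json
{
  "inputs": {
    "data": "s3://climate/chirps.zarr",
    "bbox": [28.8, -2.9, 30.9, -1.0],
    "datetime": "1994-01-01/2023-12-31"
  },
  "outputs": {
    "result": { "format": { "mediaType": "application/json" } }
  }
}
```

The request body is identical whether the execution unit runs next to the object store or on a local worker, which is the point: execution location stays a deployment concern.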

4. openEO back-ends

openEO defines a standard API for Earth Observation processing that maps directly onto Zarr-backed collections. Several cloud providers run openEO back-ends. The STAC catalogue from step 1 of the roadmap gives openEO clients a natural entry point via load_stac().

  • No server-side implementation needed if we rely on existing back-ends
  • Worth evaluating whether sovereign deployments can connect to a shared openEO back-end for processing, then write results back to local storage
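As a sketch of what a client would submit, here is a minimal openEO process graph that loads a collection via `load_stac` and reduces over time; the STAC URL and dimension name are illustrative, and a real back-end may expose them differently:

```json
{
  "process_graph": {
    "load": {
      "process_id": "load_stac",
      "arguments": {
        "url": "https://example.org/stac/collections/chirps",
        "temporal_extent": ["1994-01-01", "2023-12-31"]
      }
    },
    "mean": {
      "process_id": "reduce_dimension",
      "arguments": {
        "data": {"from_node": "load"},
        "dimension": "t",
        "reducer": {"process_graph": {"m": {
          "process_id": "mean",
          "arguments": {"data": {"from_parameter": "data"}},
          "result": true
        }}}
      },
      "result": true
    }
  }
}
```

The graph is evaluated entirely on the back-end, next to the data; only the reduced result comes back to the client.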

5. Lithops / serverless Xarray

Lithops is a Python framework for running workloads across serverless backends (AWS Lambda, IBM Cloud Functions, etc.) without managing infrastructure. Serverless Xarray/Zarr array processing builds on it — for example, Cubed uses Lithops as an executor for chunked array computation.
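Lithops is configured declaratively, so switching compute/storage backends is a configuration change rather than a code change. A sketch of such a configuration is below; the exact keys should be checked against the Lithops documentation, and the account IDs and bucket name are placeholders:

```yaml
lithops:
  backend: aws_lambda      # compute backend, co-located with the data
  storage: aws_s3          # intermediate storage for partial results

aws:
  region: eu-west-1        # same region as the GeoZarr bucket

aws_lambda:
  execution_role: arn:aws:iam::123456789012:role/lithops-role  # placeholder
  runtime_memory: 1769

aws_s3:
  storage_bucket: climate-zarr   # placeholder bucket name
```

Pointing the same workflow at a different provider would mean swapping the `backend`/`storage` pair, which is attractive for the sovereign-deployment question below.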

Key questions

  • Which cloud providers are realistic for DHIS2 national deployments?
  • Is the processing model pull (Climate API requests computation) or push (external tool triggers processing via the API)?
  • Should cloud processing be transparent to the API consumer, or an explicit deployment option?
  • How do we handle sovereign deployment constraints — some countries cannot use US/EU cloud providers?

Relationship to the roadmap

This feeds into step 2 (data processing) and step 3 (workflows). The storage backend decision (step 1) should not foreclose the processing options investigated here — ideally the same Zarr store on S3 can be read by both a local Dask worker and a cloud-native function without any changes to the data format or the API.
