Installable package foundation, Python client, and User-facing docs #59

Open
turban wants to merge 93 commits into main from CLIM-683

@turban turban commented May 5, 2026

Why

The Climate API was designed from the start for a single deployment scenario: clone the repo, edit files in place, run. This worked for early development but creates real problems as we move toward production deployments and want users to be able to install the package with pip install climate-api:

  • Instance configuration (extent, custom datasets) was stored inside the repository, making upgrades destructive.
  • Built-in dataset templates were found by walking directory paths relative to source files — a technique that breaks when the package is installed into site-packages/ because the project root is no longer accessible.
  • There was no Python client for discovering and opening datasets without constructing raw URLs.
  • Documentation assumed you had already cloned the repo and knew the internal structure.

This PR addresses all of these to make the package usable outside a source checkout.

What changed

Instance configuration via CLIMATE_API_CONFIG (closes #61)

A new CLIMATE_API_CONFIG environment variable points to a YAML file that lives outside the repository. This separates instance-specific configuration from the package itself, so the package can be upgraded without overwriting local config.

# climate-api.yaml — lives outside the repo, not committed
extent:
  id: rwa
  name: Rwanda
  bbox: [28.8, -2.9, 30.9, -1.0]
  country_code: RWA

datasets_dir: ./my-datasets/   # optional — merged on top of built-ins

The extent is a single block per instance (not a list). The GET /extent endpoint returns it, or 404 if not configured. Dataset templates from datasets_dir are merged with the built-ins — a custom template with the same id overrides the built-in one.
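The override rule can be sketched as a merge keyed on id. This is a minimal illustration, not the package's actual loader: merge_templates, the source field, and the ids used here are assumptions made for the example.

```python
from typing import Any

Template = dict[str, Any]


def merge_templates(builtin: list[Template], custom: list[Template]) -> dict[str, Template]:
    """Index templates by id; later (custom) entries replace earlier (built-in) ones."""
    merged: dict[str, Template] = {}
    for template in [*builtin, *custom]:
        merged[template["id"]] = template
    return merged


# Hypothetical templates; "source" exists only to show which copy wins.
builtin = [{"id": "chirps3", "source": "built-in"}, {"id": "worldpop", "source": "built-in"}]
custom = [{"id": "chirps3", "source": "custom"}, {"id": "my_rainfall", "source": "custom"}]

merged = merge_templates(builtin, custom)
print(sorted(merged))               # ['chirps3', 'my_rainfall', 'worldpop']
print(merged["chirps3"]["source"])  # custom
```

Loading the custom templates last is what makes a shared id resolve to the custom definition.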

Built-in dataset templates bundled inside the package

Previously, the built-in YAML templates (chirps3.yaml, era5_land.yaml, worldpop.yaml) lived in data/datasets/ at the project root and were located by walking four directory levels up from the source file. This breaks when the package is installed with pip install, because the package ends up in site-packages/ with no path to the original project root.

The YAMLs are now bundled inside the package at src/climate_api/data/datasets/ and loaded via importlib.resources, which resolves the correct location regardless of how the package was installed.
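For readers unfamiliar with importlib.resources, the sketch below shows the general pattern. It builds a throwaway package named demo_pkg so that it runs anywhere; demo_pkg and read_bundled_templates are stand-ins for illustration and do not reflect the real climate_api code.

```python
import sys
import tempfile
from importlib.resources import files
from pathlib import Path

# Build a throwaway package on disk so the example is self-contained.
# "demo_pkg" and its layout are stand-ins, not the real climate_api package.
root = Path(tempfile.mkdtemp())
datasets_dir = root / "demo_pkg" / "data" / "datasets"
datasets_dir.mkdir(parents=True)
(root / "demo_pkg" / "__init__.py").write_text("", encoding="utf-8")
(datasets_dir / "chirps3.yaml").write_text("id: chirps3\n", encoding="utf-8")
(datasets_dir / "worldpop.yaml").write_text("id: worldpop\n", encoding="utf-8")
sys.path.insert(0, str(root))


def read_bundled_templates(package: str) -> dict[str, str]:
    """Return {filename: raw text} for every YAML file under <package>/data/datasets."""
    folder = files(package) / "data" / "datasets"
    return {
        entry.name: entry.read_text(encoding="utf-8")
        for entry in folder.iterdir()
        if entry.name.endswith(".yaml")
    }


templates = read_bundled_templates("demo_pkg")
print(sorted(templates))  # ['chirps3.yaml', 'worldpop.yaml']
```

The lookup starts from the package name rather than from `__file__`, so it does not depend on where the installer placed the files.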

Coordinate normalisation at write time

All Zarr datasets are now written with canonical coordinate names (time, latitude, longitude) regardless of what the upstream source uses (valid_time, lat/lon, x/y). This is enforced in build_dataset_zarr() for both flat and pyramid outputs.

Every downstream consumer — the client, the user guide, the OGC API — can now use ds.latitude, ds.longitude, ds.time without dataset-specific branching.
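One way to picture the normalisation step is a lookup table from upstream names to canonical ones. The sketch below is an assumption about the approach, not the code in build_dataset_zarr(); canonical_renames is a hypothetical helper, and mapping x/y to longitude/latitude only makes sense for grids in geographic coordinates.

```python
# Aliases taken from the description above; the lookup-table approach is an assumption.
CANONICAL_NAMES = {
    "valid_time": "time",
    "lat": "latitude",
    "lon": "longitude",
    "y": "latitude",   # assumes a geographic (lon/lat) grid
    "x": "longitude",  # assumes a geographic (lon/lat) grid
}


def canonical_renames(names: list[str]) -> dict[str, str]:
    """Map each non-canonical coordinate name to its canonical replacement."""
    renames = {name: CANONICAL_NAMES[name] for name in names if name in CANONICAL_NAMES}
    # Refuse renames that would collide with another coordinate.
    targets = list(renames.values()) + [name for name in names if name not in renames]
    if len(targets) != len(set(targets)):
        raise ValueError(f"Renaming {names} would produce duplicate coordinate names")
    return renames


print(canonical_renames(["valid_time", "lat", "lon"]))
print(canonical_renames(["time", "latitude", "longitude"]))  # {} (already canonical)
```

With xarray, a mapping like this could be passed to `Dataset.rename()` before the store is written.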

Python client for dataset discovery and access (closes #60)

A new climate_api.client module makes it possible to discover and open datasets without constructing URLs manually:

from climate_api.client import Client

api = Client("http://127.0.0.1:8000")
datasets = api.catalog()          # list published datasets
ds = api.open(datasets[0]["id"]) # open as xarray.Dataset

Module-level functions (list_datasets, open_dataset) fall back to the CLIMATE_API_BASE_URL environment variable, so scripts work without hardcoding a URL.
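The fallback could look roughly like the following. resolve_base_url is a hypothetical helper written for this example; the real module may resolve the URL differently, and stripping the trailing slash is an assumption.

```python
import os
from typing import Optional


def resolve_base_url(base_url: Optional[str] = None) -> str:
    """Prefer an explicit URL, then CLIMATE_API_BASE_URL; fail loudly if neither is set."""
    url = base_url or os.environ.get("CLIMATE_API_BASE_URL")
    if not url:
        raise ValueError("Pass base_url or set the CLIMATE_API_BASE_URL environment variable")
    return url.rstrip("/")


os.environ["CLIMATE_API_BASE_URL"] = "http://127.0.0.1:8000/"
print(resolve_base_url())                          # falls back to the environment variable
print(resolve_base_url("http://example.org/api"))  # explicit argument wins
```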

create_app() factory function

The FastAPI application is now created via a create_app() factory, making it straightforward to embed the API in a larger application:

from climate_api.main import create_app
app = create_app()

CORS credentials flag corrected

allow_credentials was incorrectly set to True alongside allow_origins=["*"]. The CORS specification does not allow a wildcard origin on credentialed requests, so browsers reject such responses. It is now set to False, which is correct for a public data API that does not use cookies or session tokens.
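The rule is simple enough to express as a guard. check_cors_settings below is a hypothetical helper, not part of this PR or of FastAPI; it only illustrates which combination is invalid.

```python
def check_cors_settings(allow_origins: list[str], allow_credentials: bool) -> None:
    """Reject the wildcard-origin plus credentials combination."""
    if allow_credentials and "*" in allow_origins:
        raise ValueError(
            'allow_credentials=True cannot be combined with allow_origins=["*"]; '
            "list explicit origins or disable credentials"
        )


check_cors_settings(["*"], allow_credentials=False)                    # public API: fine
check_cors_settings(["https://example.org"], allow_credentials=True)  # explicit origin: fine
```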

Dataset template field renamed: cache_info → ingestion

The cache_info block in dataset template YAMLs is renamed to ingestion. The ingestion.eo_function field is now required for all sync kinds, not just temporal ones.

Documentation

  • docs/setup_guide.md — step-by-step instance setup from install to first ingestion
  • docs/user_guide.md — consumer guide: STAC discovery, opening with xarray, subsetting
  • docs/adding_custom_datasets.md — how to write a custom dataset template and wire it up
  • examples/stac_discover_and_open.py and examples/zarr_direct_access.py — runnable examples using the client

Migration note

Existing datasets must be deleted and re-ingested. Coordinate normalisation only applies to newly written Zarr stores. Zarr files written before this PR will retain their original source coordinate names.

Rename cache_info: to ingestion: in any custom dataset YAML templates.
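For instances with many custom templates, the rename can be scripted. The sketch below works on an already-parsed template (a plain dict) and leaves YAML reading and writing out; migrate_template and the example values are hypothetical.

```python
import copy
from typing import Any


def migrate_template(template: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the template with cache_info renamed to ingestion."""
    migrated = copy.deepcopy(template)
    if "cache_info" in migrated:
        if "ingestion" in migrated:
            raise ValueError("Template defines both cache_info and ingestion")
        migrated["ingestion"] = migrated.pop("cache_info")
    # eo_function is required for every sync kind after this PR.
    ingestion = migrated.get("ingestion")
    if not isinstance(ingestion, dict) or not ingestion.get("eo_function"):
        raise ValueError(f"Template {migrated.get('id')!r} needs ingestion.eo_function")
    return migrated


# Hypothetical pre-migration template.
old = {"id": "my_rainfall", "sync_kind": "static", "cache_info": {"eo_function": "download_rainfall"}}
new = migrate_template(old)
print(new["ingestion"])  # {'eo_function': 'download_rainfall'}
```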

Test plan

  • make run starts the API without errors
  • uv run examples/stac_discover_and_open.py lists published datasets and prints dataset info
  • uv run examples/zarr_direct_access.py opens a Zarr store and prints a spatial mean time series
  • from climate_api.client import Client; print(Client("http://127.0.0.1:8000").catalog()) works in a Python session
  • A fresh instance configured with only climate-api.yaml serves the correct extent and built-in datasets
  • datasets_dir with a custom YAML adds that dataset alongside the built-ins
  • Setup guide is followable end to end for a new country
  • make test passes

Remove DHIS2 connection string references from setup section, add
/extents and /datasets to endpoint table, and expand STAC example to
show catalog discovery before opening a dataset with xarray.
@turban turban marked this pull request as draft May 5, 2026 13:34
turban added 5 commits May 5, 2026 15:44
uv run uvicorn resolves the uvicorn binary via PATH, which picks up
conda's uvicorn when the base environment is active. Using python -m
uvicorn forces the venv's interpreter and avoids the module not found
error in the reload subprocess.

Add docs/user_guide.md covering STAC-based dataset discovery and
xarray access, two runnable example scripts in examples/, and update
implementation-status.md to reflect PRs #51, #54, and #55 as merged.

Datasets use x/y dimension names not latitude/longitude. Direct access
example now reads open_kwargs from the STAC collection rather than
hardcoding consolidated=False, which fails for Zarr v3 stores.

Step-by-step guide covering extent configuration, environment setup,
first ingestion, and ERA5-Land DestinE authentication. Links added
from README and user_guide.md.

- datasets: validate id is a non-empty string before using it as a dict key
- datasets: require ingestion.eo_function for all sync_kind values including
  static — static datasets still need an initial ingestion, so the download
  path cannot safely omit the block
- config: validate YAML root is a mapping and raise a clear ValueError if not,
  rather than letting dict() crash with a low-signal TypeError
- tests: add ingestion.eo_function to static template fixtures in test_config.py
  and test_dataset_registry.py to match the now-enforced contract
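The two validation rules above might look like this in isolation. Both function names and the error wording are assumptions made for the example, not the code in this commit.

```python
from typing import Any


def validate_config_root(data: Any, source: str) -> dict[str, Any]:
    """Ensure the parsed YAML document is a mapping before treating it as config."""
    if not isinstance(data, dict):
        raise ValueError(
            f"{source}: expected a YAML mapping at the top level, got {type(data).__name__}"
        )
    return data


def validate_template_id(template: dict[str, Any], source: str) -> str:
    """Ensure the template id can safely be used as a registry key."""
    template_id = template.get("id")
    if not isinstance(template_id, str) or not template_id.strip():
        raise ValueError(f"{source}: 'id' must be a non-empty string, got {template_id!r}")
    return template_id


config = validate_config_root({"extent": {"id": "rwa"}}, "climate-api.yaml")
print(validate_template_id({"id": "chirps3"}, "chirps3.yaml"))  # chirps3
```

Checking the shape up front means a file containing, say, a bare list fails with a message that names the file instead of a TypeError from deep inside the loader.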

turban added 2 commits May 6, 2026 12:17
- datasets: validate datasets_dir is a str/Path before Path concatenation
- extents: validate extent.id and extent.bbox in get_extent() and raise
  clear ValueErrors pointing to CLIMATE_API_CONFIG rather than letting
  callers hit KeyError/TypeError with no context
- client: use catalog.get('links') and validate it is a list before
  iterating, so a non-STAC response raises a clear ValueError instead
  of a KeyError
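The client-side check can be sketched as follows. child_links is a hypothetical helper, and taking the dataset id from the last path segment of the href is an assumption made for this example.

```python
from typing import Any


def child_links(catalog: Any) -> list[dict[str, Any]]:
    """Return child links from a STAC catalog dict, each copied and tagged with an id."""
    links = catalog.get("links") if isinstance(catalog, dict) else None
    if not isinstance(links, list):
        raise ValueError("Response is not a STAC catalog: 'links' is missing or not a list")
    children = []
    for link in links:
        if not isinstance(link, dict) or link.get("rel") != "child":
            continue
        href = link.get("href")
        if not isinstance(href, str) or not href.strip("/"):
            raise ValueError(f"STAC child link has no usable href: {link!r}")
        # Assumption: the dataset id is the last path segment of the href.
        children.append({**link, "id": href.rstrip("/").rsplit("/", 1)[-1]})
    return children


catalog = {
    "links": [
        {"rel": "self", "href": "http://127.0.0.1:8000/stac"},
        {"rel": "child", "href": "http://127.0.0.1:8000/stac/collections/chirps3"},
    ]
}
children = child_links(catalog)
print(children[0]["id"])  # chirps3
```

Each returned link is a new dict, so the parsed response is left unchanged.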

…stance

- Add 30s timeout to both httpx.get() calls in client.py to prevent
  indefinite hangs on network issues
- Set allow_credentials=False in CORSMiddleware; combining allow_origins=["*"]
  with allow_credentials=True is a CORS spec violation and a security footgun
- Use isinstance(x, (str, Path)) instead of the str | Path union syntax;
  isinstance() only accepts X | Y unions on Python 3.10+, while the tuple
  form works on all Python versions

…x plural in docs

- Validate href in each STAC child link before slicing the id from it
- Check that assets is a dict before calling .get("zarr") to avoid
  AttributeError on malformed STAC responses
- Fix "Confirm configured extents" heading to singular in managed data guide

turban added 2 commits May 6, 2026 14:26
Previously, built-in dataset YAMLs were located by walking four directory
levels up from datasets.py and appending data/datasets/. This works in a
source checkout or editable install but fails in a wheel install: the
package lands in site-packages/ and the project-root data/ directory is
never included in the wheel, causing list_datasets() to crash with
"Path is not a directory".

Move the YAMLs into the package at src/climate_api/data/datasets/ and
load them via importlib.resources.files(). importlib.resources is
package-aware and resolves correctly whether the package is an unpacked
directory or a zip inside a wheel.

User-provided datasets_dir (from CLIMATE_API_CONFIG) continues to use
regular Path objects via _load_from_dir() — that path is always on disk.
…ts, safer conftest teardown

- Raise ValueError (not KeyError) when the Zarr asset is missing or not a
  dict — all other error paths in open_dataset raise ValueError, so callers
  catch one exception type
- Inject id into a copy of the link dict instead of mutating the parsed JSON
  object in-place
- Use os.environ.pop() instead of del in conftest session fixture teardown
  to avoid KeyError if the env var was already removed by a test's monkeypatch
- Replace next() generator in setup guide with an explicit list so an empty
  catalog gives an IndexError with clear context rather than StopIteration
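The single-exception-type rule for the Zarr asset could be sketched like this. zarr_asset is a hypothetical helper and the collection contents are invented for the example; only the "zarr" asset key comes from the commit messages above.

```python
from typing import Any


def zarr_asset(collection: dict[str, Any]) -> dict[str, Any]:
    """Return the 'zarr' asset of a STAC collection, raising ValueError on malformed input."""
    assets = collection.get("assets")
    if not isinstance(assets, dict):
        raise ValueError(f"Collection {collection.get('id')!r} has no 'assets' mapping")
    asset = assets.get("zarr")
    if not isinstance(asset, dict):
        raise ValueError(f"Collection {collection.get('id')!r} has no usable 'zarr' asset")
    return asset


# Hypothetical collection payload.
collection = {"id": "chirps3", "assets": {"zarr": {"href": "http://127.0.0.1:8000/zarr/chirps3"}}}
print(zarr_asset(collection)["href"])
```

Callers can then wrap the whole open path in one `except ValueError` clause.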
@turban turban requested a review from Copilot May 6, 2026 12:43
@turban turban requested a review from abyot May 6, 2026 12:51
@turban turban marked this pull request as ready for review May 6, 2026 12:51

…ative path

Walking __file__ four levels up to find data/downloads/ fails when the package
is installed with pip because __file__ lands in site-packages/ and the project
root is not accessible. The directory may also be non-writable.

Default to $XDG_DATA_HOME/climate-api/downloads (~/.local/share/climate-api/downloads
if XDG_DATA_HOME is unset), which is always user-writable. The existing
CACHE_OVERRIDE env var continues to work and takes precedence, keeping Docker
and dev deployments unchanged.
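The resolution order described in this commit can be sketched as below. default_download_dir is a hypothetical name, and treating CACHE_OVERRIDE as the full download path (rather than a parent directory) is an assumption.

```python
import os
from pathlib import Path


def default_download_dir() -> Path:
    """Resolve the download directory: CACHE_OVERRIDE, then XDG_DATA_HOME, then ~/.local/share."""
    override = os.environ.get("CACHE_OVERRIDE")
    if override:
        # Assumption: the override is used as-is, with nothing appended.
        return Path(override)
    xdg_data_home = os.environ.get("XDG_DATA_HOME")
    base = Path(xdg_data_home) if xdg_data_home else Path.home() / ".local" / "share"
    return base / "climate-api" / "downloads"


print(default_download_dir())
```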

Development

Successfully merging this pull request may close these issues.

  • Installable package: climate-api as a configurable dependency
  • Python client: climate_api.open() convenience function for dataset access
