From 646f45e00729ddf473115a97567c7e1c20fd95c1 Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Wed, 1 Apr 2026 07:36:06 -0700 Subject: [PATCH 1/2] Add caveat and assumption admonitions to docs (#1133) New user guide page (caveats.rst) collecting the cross-cutting assumptions: WGS84 hardcoded in geodesic path, float32 output, NaN-as-nodata, pixel-unit proximity, dask chunk requirements, GPU memory, and dimension ordering. Admonition blocks added to existing reference pages: surface, proximity, focal, classification, hydrology, terrain_metrics, and the data_types guide. --- docs/source/reference/classification.rst | 6 + docs/source/reference/focal.rst | 12 + docs/source/reference/hydrology.rst | 7 + docs/source/reference/proximity.rst | 14 ++ docs/source/reference/surface.rst | 14 ++ docs/source/reference/terrain_metrics.rst | 6 + docs/source/user_guide/caveats.rst | 253 ++++++++++++++++++++++ docs/source/user_guide/data_types.rst | 8 +- docs/source/user_guide/index.rst | 1 + 9 files changed, 319 insertions(+), 2 deletions(-) create mode 100644 docs/source/user_guide/caveats.rst diff --git a/docs/source/reference/classification.rst b/docs/source/reference/classification.rst index 85c24743..5263ebb6 100644 --- a/docs/source/reference/classification.rst +++ b/docs/source/reference/classification.rst @@ -4,6 +4,12 @@ Classification ************** +.. warning:: + + Classification functions silently set NaN and infinite input values + to NaN in the output. Clean infinities before classifying if you + want every cell assigned to a bin. + Equal Interval ============== .. autosummary:: diff --git a/docs/source/reference/focal.rst b/docs/source/reference/focal.rst index c6ab56bf..d57b75e4 100644 --- a/docs/source/reference/focal.rst +++ b/docs/source/reference/focal.rst @@ -4,6 +4,18 @@ Focal ***** +.. caution:: + + Focal operations use ``dask.array.map_overlap``. Each chunk dimension + must be **larger than the kernel depth** (typically 1 cell for a 3x3 + kernel). Chunks smaller than the depth produce wrong results. + +.. note:: + + Focal functions output **float32**. NaN handling varies: ``mean`` + excludes NaN neighbours from the average, while ``hotspots`` propagates + NaN from any neighbour. + Apply ===== .. autosummary:: diff --git a/docs/source/reference/hydrology.rst b/docs/source/reference/hydrology.rst index b7b0769e..3d6a512d 100644 --- a/docs/source/reference/hydrology.rst +++ b/docs/source/reference/hydrology.rst @@ -4,6 +4,13 @@ Hydrology ********* +.. warning:: + + NaN cells act as **impassable barriers** in all hydrology functions. + Flow cannot cross them. If your DEM has NaN holes (e.g. water bodies + masked out), fill or interpolate them first, or expect disconnected + drainage networks. + Flow Direction (D8) =================== .. autosummary:: diff --git a/docs/source/reference/proximity.rst b/docs/source/reference/proximity.rst index 8c68a29c..418238ab 100644 --- a/docs/source/reference/proximity.rst +++ b/docs/source/reference/proximity.rst @@ -4,6 +4,20 @@ Proximity ********* +.. warning:: + + ``proximity()`` returns distances in **pixel units** by default + (``distance_metric='EUCLIDEAN'``). Multiply by cell size to get + real-world units, or use ``'GREAT_CIRCLE'`` for lat/lon data + (returns kilometres). + +.. caution:: + + With Dask, ``proximity()`` expands each chunk by ``max_distance`` cells. + If ``max_distance`` is infinite (the default), the whole array is loaded + into a single chunk. Set a finite ``max_distance`` to keep memory + bounded. + Allocation ========== .. autosummary:: diff --git a/docs/source/reference/surface.rst b/docs/source/reference/surface.rst index 24fdbbfb..04de4874 100644 --- a/docs/source/reference/surface.rst +++ b/docs/source/reference/surface.rst @@ -4,6 +4,20 @@ Surface ******* +.. danger:: + + ``slope()``, ``aspect()``, ``curvature()``, and ``hillshade()`` with + ``method='geodesic'`` assume the **WGS84 ellipsoid** and require + coordinates in **degrees** (geographic CRS). Passing projected + coordinates (metres) to the geodesic method produces wrong results. + Use ``method='planar'`` (the default) for projected data. + +.. note:: + + All surface functions output **float32** regardless of input dtype. + Edge cells within the 3x3 kernel radius are NaN by default + (``boundary='nan'``). + Aspect ====== .. autosummary:: diff --git a/docs/source/reference/terrain_metrics.rst b/docs/source/reference/terrain_metrics.rst index 10a5a9bf..8d807d51 100644 --- a/docs/source/reference/terrain_metrics.rst +++ b/docs/source/reference/terrain_metrics.rst @@ -4,6 +4,12 @@ Terrain Metrics *************** +.. note:: + + Terrain metrics use a 3x3 kernel and output **float32**. Edge cells + are NaN by default. These functions use the planar (Horn) algorithm + and assume the input is on a regular grid with uniform cell spacing. + Roughness ========= .. autosummary:: diff --git a/docs/source/user_guide/caveats.rst b/docs/source/user_guide/caveats.rst new file mode 100644 index 00000000..f361c805 --- /dev/null +++ b/docs/source/user_guide/caveats.rst @@ -0,0 +1,253 @@ +.. _user_guide.caveats: + +*********************** +Caveats & Assumptions +*********************** + +xarray-spatial makes several assumptions about input data that can produce +wrong or confusing results if they aren't met. This page collects them in +one place. Each section maps to a Sphinx admonition level so you can gauge +severity at a glance: + +.. danger:: marks assumptions that cause **silently wrong output**. + +.. warning:: marks gotchas that **change the meaning** of results. + +.. caution:: marks **performance or memory** pitfalls. + +.. tip:: gives practical workarounds. + +.. note:: provides context that clarifies non-obvious behaviour. + + +.. contents:: On this page + :local: + :depth: 1 + + +Geodesic terrain functions assume WGS84 +======================================== + +.. danger:: + + ``slope()``, ``aspect()``, ``curvature()``, and ``hillshade()`` with + ``method='geodesic'`` use the WGS84 ellipsoid (a = 6 378 137 m, + b = 6 356 752 m). There is no parameter to select a different + ellipsoid. + +If your elevation data references a different ellipsoid (e.g. Clarke 1866 or +Bessel 1841), the geodesic path will still run, but the ECEF projection and +curvature correction will be slightly off. For most modern data this +doesn't matter because WGS84 and GRS80 are functionally identical, but +older survey-quality datasets on local datums can differ by tens of metres +in the geoid. + + +Projected coordinates + geodesic = wrong results +================================================= + +.. danger:: + + Passing a raster whose coordinates are in metres (UTM, state-plane, etc.) + to ``method='geodesic'`` will produce garbage. The geodesic path expects + latitude/longitude in degrees. + +The library does validate that coordinate values fall within geographic +ranges (latitude in [-90, 90], longitude in [-180, 360]) and will raise a +``ValueError`` if they don't. But this check is not foolproof: coordinates +in small projected grids that happen to look like valid lat/lon values will +slip through. + +.. tip:: + + Use ``method='planar'`` (the default) for projected CRS data. Use + ``method='geodesic'`` only when your coordinates are in degrees on a + geographic CRS. + + +All output is float32 +===================== + +.. warning:: + + Every analytical function casts its output to ``float32`` regardless of + input dtype. If you pass ``float64`` elevation data, the result is + ``float32``. + +Float32 gives about 7 significant digits, which is more than enough for +slope, NDVI, and similar metrics. But if you're chaining many operations +or working with data that needs sub-centimetre precision (e.g. LiDAR +heights in a small range), the truncation can matter. + +See :ref:`user_guide.data_types` for the full rationale. + +.. tip:: + + If you need float64 output, cast after the function returns: + + .. code-block:: python + + result = slope(agg).astype('float64') + + +NaN is the nodata sentinel +========================== + +.. warning:: + + xarray-spatial uses ``NaN`` as its nodata value everywhere. There is + no parameter to choose a different sentinel (e.g. -9999). + +Functions generally **propagate** NaN: if any cell in a 3x3 kernel is NaN, +the output cell is NaN. The exact behaviour varies by function: + +- **Slope, aspect, hillshade** -- any NaN neighbour produces a NaN output + cell. +- **Focal mean/std** -- NaN neighbours are excluded from the average; the + cell itself is still computed. +- **Zonal stats** -- NaN values are excluded from the aggregation. An + all-NaN zone returns NaN, not zero. +- **Hydrology (flow direction)** -- NaN cells are treated as impassable + barriers. Flow cannot cross them. +- **Pathfinding (A*)** -- NaN cells are impassable, same as barriers. + +.. note:: + + Edge cells (within the kernel radius of the raster boundary) default to + NaN when ``boundary='nan'``. Use ``boundary='nearest'`` or + ``boundary='reflect'`` if you need values at the edges. + + +Proximity defaults to pixel units +================================== + +.. warning:: + + ``proximity()`` returns distances in **pixel units** when + ``distance_metric='EUCLIDEAN'`` (the default). It does not automatically + convert to metres or kilometres. + +To get distances in real-world units with Euclidean mode, multiply the +result by your cell size. Or switch to ``distance_metric='GREAT_CIRCLE'`` +if your data is in degrees, which returns kilometres. + +.. tip:: + + .. code-block:: python + + from xrspatial.proximity import proximity + + # Pixel-unit result + dist_px = proximity(raster) + + # Convert to metres (assumes square cells) + cell_m = 30.0 # e.g. Landsat 30 m resolution + dist_m = dist_px * cell_m + + +Haversine uses the semi-major axis +==================================== + +.. note:: + + Great-circle distances in ``proximity()`` and ``surface_distance()`` + use the WGS84 **semi-major axis** (6 378 137 m) as the Earth radius, + not the mean radius (6 371 km). + +The difference is about 0.1 %. This is negligible for most use cases but +worth knowing if you're comparing against a reference that uses the mean +radius or the IUGG value. + + +Dask chunk sizes and ``map_overlap`` +===================================== + +.. caution:: + + Focal, slope, aspect, and other kernel-based operations use + ``dask.array.map_overlap``. If any chunk dimension is **smaller than + the kernel depth** (typically 1 cell for a 3x3 kernel), the overlap + will fail or produce wrong results. + +.. caution:: + + ``proximity()`` with Dask expands each chunk by ``max_distance`` cells. + If ``max_distance`` is infinite (the default), the entire array is loaded + into one chunk. Set a finite ``max_distance`` to keep memory bounded. + +.. tip:: + + A good rule of thumb for focal operations: keep chunks at least 64 cells + on each side. For proximity, set ``max_distance`` to the largest + distance you actually care about. + + .. code-block:: python + + import dask.array as da + import xarray as xr + + # Rechunk to safe sizes before running slope + agg = xr.DataArray( + da.from_array(big_array, chunks=(512, 512)) + ) + + +GPU memory is not managed automatically +======================================== + +.. caution:: + + CuPy-backed operations allocate the full output array on the GPU. + There is no automatic tiling or fallback to CPU if the array exceeds + VRAM. + +If you hit ``cuda.cudadrv.driver.CudaAPIError`` or ``cupy.cuda.memory +.OutOfMemoryError``, the array is too large for your GPU. Use Dask + CuPy +to process in chunks, or fall back to NumPy. + +.. tip:: + + Complex geodesic kernels (slope, aspect with ``method='geodesic'``) + use many float64 registers, which limits thread-block occupancy. + xarray-spatial already reduces thread blocks to 16x16 for these + kernels, but very old GPUs (compute capability < 6.0) may still + hit register spill. + + +Dimension ordering +================== + +.. note:: + + xarray-spatial assumes the last two dimensions of a DataArray are + ``(y, x)`` in that order. It does not inspect ``attrs['crs']`` or + CF conventions to verify this. + +If your data has ``(x, y)`` ordering (rare, but it happens with some +netCDF conventions), transpose before passing to any function: + +.. code-block:: python + + agg = agg.transpose('y', 'x') + + +Classification and NaN/Inf +=========================== + +.. warning:: + + ``equal_interval()``, ``quantile()``, ``natural_breaks()``, and other + classification functions set NaN and infinite input values to NaN in the + output. They do not raise an error. + +This means a raster with a few stray ``inf`` values (e.g. from a division +by zero upstream) will silently produce NaN bins for those cells. + +.. tip:: + + Clean infinities before classifying: + + .. code-block:: python + + import numpy as np + agg = agg.where(np.isfinite(agg)) diff --git a/docs/source/user_guide/data_types.rst b/docs/source/user_guide/data_types.rst index e6775f6c..77df9340 100644 --- a/docs/source/user_guide/data_types.rst +++ b/docs/source/user_guide/data_types.rst @@ -11,8 +11,12 @@ Overview xarray-spatial standardizes on **float32** (32-bit floating-point) for output data in most analytical functions. This design decision provides a balance between computational precision, memory efficiency, and performance that is well-suited for raster analysis tasks. -.. note:: - All multispectral indices, terrain analysis functions, and most other operations in xarray-spatial output float32 data, regardless of input data type. +.. warning:: + + All multispectral indices, terrain analysis functions, and most other + operations in xarray-spatial output **float32** data, regardless of input + data type. If you pass float64 data, the output is float32. Cast the + result with ``.astype('float64')`` if you need higher precision. Why Float32? diff --git a/docs/source/user_guide/index.rst b/docs/source/user_guide/index.rst index e887fa78..b28978c6 100644 --- a/docs/source/user_guide/index.rst +++ b/docs/source/user_guide/index.rst @@ -7,6 +7,7 @@ User Guide .. toctree:: :maxdepth: 1 + caveats data_types classification fire From e71f4c4e1619981505d34d957396f5b709551999 Mon Sep 17 00:00:00 2001 From: Brendan Collins Date: Wed, 1 Apr 2026 07:54:40 -0700 Subject: [PATCH 2/2] Fix review findings: proximity units, terrain dtype, missing pages (#1133) - Proximity EUCLIDEAN returns coordinate-unit distances, not pixels - Great-circle returns metres, not kilometres - Terrain metrics output float64, not float32; drop Horn attribution - Add admonitions to multispectral, zonal, and pathfinding pages - Clarify haversine section title --- docs/source/reference/multispectral.rst | 7 +++++ docs/source/reference/pathfinding.rst | 11 ++++++++ docs/source/reference/proximity.rst | 8 +++--- docs/source/reference/terrain_metrics.rst | 6 ++--- docs/source/reference/zonal.rst | 5 ++++ docs/source/user_guide/caveats.rst | 32 +++++++++++------------ 6 files changed, 46 insertions(+), 23 deletions(-) diff --git a/docs/source/reference/multispectral.rst b/docs/source/reference/multispectral.rst index 0303bc44..7a9f101c 100644 --- a/docs/source/reference/multispectral.rst +++ b/docs/source/reference/multispectral.rst @@ -4,6 +4,13 @@ Multispectral ************* +.. note:: + + All spectral indices output **float32**. Division by zero (e.g. + NDVI where NIR + Red = 0) produces NaN or inf silently. Clean + the result with ``xr.where(np.isfinite(result), result, np.nan)`` + if needed. + Atmospherically Resistant Vegetation Index (ARVI) ================================================= .. autosummary:: diff --git a/docs/source/reference/pathfinding.rst b/docs/source/reference/pathfinding.rst index d0e4075e..e50c4b5b 100644 --- a/docs/source/reference/pathfinding.rst +++ b/docs/source/reference/pathfinding.rst @@ -4,6 +4,17 @@ Pathfinding *********** +.. caution:: + + A* allocates about 65 bytes per pixel and will raise ``MemoryError`` + if the required memory exceeds 80 % of available RAM. Use Dask or + set ``search_radius`` to limit the search area for large rasters. + +.. warning:: + + NaN and non-positive friction values are treated as impassable + barriers. Cells must have finite positive friction to be traversable. + A* Pathfinding ============== .. autosummary:: diff --git a/docs/source/reference/proximity.rst b/docs/source/reference/proximity.rst index 418238ab..e9b4530c 100644 --- a/docs/source/reference/proximity.rst +++ b/docs/source/reference/proximity.rst @@ -6,10 +6,10 @@ Proximity .. warning:: - ``proximity()`` returns distances in **pixel units** by default - (``distance_metric='EUCLIDEAN'``). Multiply by cell size to get - real-world units, or use ``'GREAT_CIRCLE'`` for lat/lon data - (returns kilometres). + ``proximity()`` with ``distance_metric='EUCLIDEAN'`` (the default) + returns distances in the **coordinate units** of the DataArray. + With default integer indices this equals pixel counts. Use + ``'GREAT_CIRCLE'`` for lat/lon data (returns metres). .. caution:: diff --git a/docs/source/reference/terrain_metrics.rst b/docs/source/reference/terrain_metrics.rst index 8d807d51..d0ebb29c 100644 --- a/docs/source/reference/terrain_metrics.rst +++ b/docs/source/reference/terrain_metrics.rst @@ -6,9 +6,9 @@ Terrain Metrics .. note:: - Terrain metrics use a 3x3 kernel and output **float32**. Edge cells - are NaN by default. These functions use the planar (Horn) algorithm - and assume the input is on a regular grid with uniform cell spacing. + Terrain metrics use a 3x3 neighbourhood and output **float64**. Edge + cells are NaN. These functions assume the input is on a regular grid + with uniform cell spacing. Roughness ========= diff --git a/docs/source/reference/zonal.rst b/docs/source/reference/zonal.rst index 9be3ad17..c151f5ec 100644 --- a/docs/source/reference/zonal.rst +++ b/docs/source/reference/zonal.rst @@ -4,6 +4,11 @@ Zonal ***** +.. note:: + + NaN values are excluded from all zonal aggregations. A zone where + every cell is NaN returns NaN (not zero) for sum, mean, etc. + Apply ===== .. autosummary:: diff --git a/docs/source/user_guide/caveats.rst b/docs/source/user_guide/caveats.rst index f361c805..3888a990 100644 --- a/docs/source/user_guide/caveats.rst +++ b/docs/source/user_guide/caveats.rst @@ -118,18 +118,19 @@ the output cell is NaN. The exact behaviour varies by function: ``boundary='reflect'`` if you need values at the edges. -Proximity defaults to pixel units -================================== +Proximity distances depend on coordinate units +================================================ .. warning:: - ``proximity()`` returns distances in **pixel units** when - ``distance_metric='EUCLIDEAN'`` (the default). It does not automatically - convert to metres or kilometres. + ``proximity()`` with ``distance_metric='EUCLIDEAN'`` (the default) + computes distances in whatever units the DataArray's coordinates use. + If coordinates are default integer indices (0, 1, 2, ...), the result + is in pixel counts. If coordinates are in metres (UTM), the result is + in metres. If coordinates are in degrees, the result is in degrees. -To get distances in real-world units with Euclidean mode, multiply the -result by your cell size. Or switch to ``distance_metric='GREAT_CIRCLE'`` -if your data is in degrees, which returns kilometres. +Switch to ``distance_metric='GREAT_CIRCLE'`` for lat/lon data to get +distances in **metres** (the radius used is 6 378 137 m). .. tip:: @@ -137,22 +138,21 @@ if your data is in degrees, which returns kilometres. from xrspatial.proximity import proximity - # Pixel-unit result + # EUCLIDEAN with default coords = pixel-count distances dist_px = proximity(raster) - # Convert to metres (assumes square cells) - cell_m = 30.0 # e.g. Landsat 30 m resolution - dist_m = dist_px * cell_m + # GREAT_CIRCLE with lat/lon coords = distances in metres + dist_m = proximity(raster, distance_metric='GREAT_CIRCLE') -Haversine uses the semi-major axis -==================================== +Great-circle distances assume sphere radius = semi-major axis +============================================================== .. note:: Great-circle distances in ``proximity()`` and ``surface_distance()`` - use the WGS84 **semi-major axis** (6 378 137 m) as the Earth radius, - not the mean radius (6 371 km). + model the Earth as a sphere with radius = WGS84 semi-major axis + (6 378 137 m), not the mean radius (~6 371 km). The difference is about 0.1 %. This is negligible for most use cases but worth knowing if you're comparing against a reference that uses the mean