This file provides guidance to Claude Code when working with the geopipe codebase.
geopipe is a Python package for managing large-scale geospatial data pipelines, designed for causal inference workflows that integrate satellite imagery with heterogeneous tabular data sources.
geopipe/
├── sources/ # Data source connectors (raster, tabular, remote)
├── fusion/ # Declarative fusion schemas
├── pipeline/ # Task orchestration with checkpointing
├── cluster/ # HPC integration (SLURM)
├── specs/ # Robustness specs, DSL, and curve analysis
├── quality/ # Data quality checks (preflight + post-load)
├── discovery/ # Dataset discovery across catalogs
└── cli.py # Command-line interface
- base.py - AbstractDataSource class with load(), validate(), get_schema()
- raster.py - RasterSource for GeoTIFF/COG with zonal statistics
- tabular.py - TabularSource for CSV/Parquet with spatial joins
- remote_base.py - RemoteSourceMixin base class for cloud sources
- earthengine.py - EarthEngineSource for Google Earth Engine collections
- planetary.py - PlanetaryComputerSource for Microsoft Planetary Computer
- stac.py - STACSource for generic STAC catalog access
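To illustrate the load(), validate(), get_schema() contract listed above, here is a minimal sketch of how such an abstract source could be shaped. The class names DataSourceSketch and InMemorySource, the method signatures, and the return types are all assumptions made for illustration; they are not taken from geopipe's base.py.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DataSourceSketch(ABC):
    """Hypothetical stand-in for an abstract data source (not geopipe's class)."""

    @abstractmethod
    def load(self) -> list[dict[str, Any]]:
        """Return the source's records."""

    @abstractmethod
    def get_schema(self) -> dict[str, type]:
        """Return a mapping of column name to expected Python type."""

    def validate(self) -> list[str]:
        """Check loaded records against the schema; return error messages."""
        schema = self.get_schema()
        errors = []
        for i, row in enumerate(self.load()):
            for column, expected in schema.items():
                if column not in row:
                    errors.append(f"row {i}: missing column {column!r}")
                elif not isinstance(row[column], expected):
                    errors.append(f"row {i}: {column!r} is not {expected.__name__}")
        return errors


class InMemorySource(DataSourceSketch):
    """Toy concrete source backed by a list of dicts (illustrative only)."""

    def __init__(self, rows, schema):
        self._rows = rows
        self._schema = schema

    def load(self):
        return list(self._rows)

    def get_schema(self):
        return dict(self._schema)


source = InMemorySource(
    rows=[{"id": 1, "ntl": 0.5}, {"id": "x"}],
    schema={"id": int, "ntl": float},
)
problems = source.validate()  # the second row has a bad id and no ntl value
```

Keeping validate() in the base class means each concrete source only has to supply load() and get_schema(); whether geopipe splits the work this way is not stated here.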
- schema.py - FusionSchema for declarative multi-source data fusion
- Supports YAML configuration files
- Handles resolution alignment, temporal filtering, spatial joins
- tasks.py - @task decorator with caching and checkpointing
- dag.py - Pipeline class for DAG-based execution
- Supports resume from failure
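The caching and resume behaviour described above can be sketched with a small decorator that stores each result on disk and reuses it on the next call. This is a generic illustration using only the standard library; the name checkpointed, the pickle-based storage, and the key scheme are assumptions and do not describe how geopipe's @task decorator is implemented.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def checkpointed(cache_dir):
    """Hypothetical decorator: cache a function's result on disk, keyed by its arguments."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key the checkpoint on the function name and its pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()[:16]
            path = cache_dir / f"{func.__name__}-{key}.pkl"
            if path.exists():
                # A previous run already finished this step: reuse its result.
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            # Write to a temporary file first, then rename, so a crash
            # cannot leave a half-written checkpoint behind.
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(pickle.dumps(result))
            tmp.replace(path)
            return result

        return wrapper

    return decorator


calls = []
workdir = tempfile.mkdtemp()


@checkpointed(workdir)
def square(x):
    calls.append(x)
    return x * x


first = square(4)
second = square(4)  # served from the checkpoint; the function body does not run again
```

Because a failed step never writes its checkpoint, re-running the same sequence of calls after a failure skips the completed steps and resumes at the first one without a stored result.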
- slurm.py - SLURMExecutor for distributed pipeline execution
- Auto-generates job scripts with dependencies
- Job monitoring and cancellation
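As a rough picture of what generating a job script with dependencies involves, the sketch below renders an sbatch script as text. The function render_job_script and its parameters are hypothetical and are not SLURMExecutor's API; the #SBATCH directives (--job-name, --time, --dependency=afterok:...) are standard SLURM options, and the job IDs are made-up placeholders.

```python
def render_job_script(name, command, depends_on=(), time_limit="01:00:00"):
    """Hypothetical helper: build the text of an sbatch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --time={time_limit}",
    ]
    if depends_on:
        # afterok: start only once every listed job has completed successfully.
        job_ids = ":".join(str(job_id) for job_id in depends_on)
        lines.append(f"#SBATCH --dependency=afterok:{job_ids}")
    lines += ["", "set -euo pipefail", command, ""]
    return "\n".join(lines)


# Placeholder job IDs; in practice they would come from earlier submissions.
script = render_job_script("fuse", "geopipe fuse schema.yaml", depends_on=[1001, 1002])
print(script)
```

Expressing DAG edges as afterok dependencies leaves the ordering to the SLURM scheduler, so nothing needs to stay running on the login node to sequence the jobs.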
- variants.py - Spec and SpecRegistry for robustness specifications
- dsl.py - RobustnessDSL for declarative YAML specs with template substitution
- curve.py - SpecificationCurve for analysis, plotting, and influence ranking
- LaTeX table generation for papers
- Cross-spec result comparison
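Template substitution over a grid of analysis choices, as mentioned above, can be sketched by taking the Cartesian product of the dimensions and filling a text template once per combination. The function expand_specs, the ${...} placeholder syntax (from Python's string.Template), and the field names buffer_km and reducer are assumptions for illustration; the actual RobustnessDSL syntax may differ.

```python
import itertools
from string import Template


def expand_specs(template, dimensions):
    """Hypothetical helper: one rendered spec per combination of dimension values."""
    names = sorted(dimensions)  # fixed order keeps the expansion deterministic
    specs = []
    for values in itertools.product(*(dimensions[name] for name in names)):
        params = dict(zip(names, values))
        rendered = Template(template).substitute(params)
        specs.append((params, rendered))
    return specs


template = "buffer_km: ${buffer}\nreducer: ${reducer}\n"
specs = expand_specs(template, {"buffer": [5, 10], "reducer": ["mean", "median"]})

for params, rendered in specs:
    print(params)
    print(rendered)
```

Two dimensions with two values each yield four specifications; the count grows multiplicatively with every added dimension, which is worth keeping in mind before running them all.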
- checks.py - Quality check classes: TemporalOverlapCheck, CRSAlignmentCheck, BoundsOverlapCheck, MissingValueCheck, etc.
- preflight.py - Pre-load validation: PathAccessibilityCheck, RequiredConfigCheck, TabularFormatCheck, RasterFormatCheck, AuthenticationCheck
- report.py - QualityReport with export to Markdown, LaTeX, JSON
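To give a sense of what a bounds-overlap style check computes, here is a minimal sketch that measures how much of one bounding box is covered by another. The Bounds tuple and overlap_fraction function are hypothetical and are not geopipe's BoundsOverlapCheck; the sketch also assumes both boxes are already expressed in the same CRS.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    """Axis-aligned bounding box (illustrative only)."""

    minx: float
    miny: float
    maxx: float
    maxy: float


def overlap_fraction(a: Bounds, b: Bounds) -> float:
    """Fraction of a's area covered by its intersection with b."""
    width = min(a.maxx, b.maxx) - max(a.minx, b.minx)
    height = min(a.maxy, b.maxy) - max(a.miny, b.miny)
    if width <= 0 or height <= 0:
        # The boxes are disjoint or touch only along an edge.
        return 0.0
    area_a = (a.maxx - a.minx) * (a.maxy - a.miny)
    return (width * height) / area_a


raster = Bounds(0, 0, 10, 10)
survey = Bounds(5, 5, 15, 15)
shared = overlap_fraction(raster, survey)  # 0.25: a 5 x 5 intersection over a 10 x 10 box
```

A check built on this could warn when the fraction falls below a chosen threshold, catching sources that barely intersect before an expensive load starts.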
- catalog.py - CatalogRegistry, DatasetInfo, discover() function
- results.py - DiscoveryResult, CategoryType enum
- Categories: nightlights, optical, sar, elevation, climate, land_cover, vegetation, population, infrastructure
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=geopipe
# Type checking
mypy geopipe
# Linting
ruff check geopipe
# Core commands
geopipe init --output schema.yaml # Initialize template schema
geopipe validate schema.yaml # Validate schema
geopipe fuse schema.yaml # Execute fusion
geopipe status # Check pipeline status
geopipe clean # Clean checkpoints
# Quality commands
geopipe preflight schema.yaml # Run pre-load validation checks
geopipe audit schema.yaml # Data quality audit with --fix, --output, --strict
# Source management
geopipe sources list schema.yaml # List sources in schema
# Specification commands
geopipe specs list schema.yaml # Preview specifications
geopipe specs expand schema.yaml # Write individual schema files
geopipe specs run schema.yaml # Execute all specifications
geopipe specs curve "results/*.csv" # Generate specification curve
# Discovery commands
geopipe discover search # Search datasets (--category, --provider)
geopipe discover list-datasets # List all known datasets
geopipe discover info <dataset_id> # Get dataset details
geopipe discover categories # List available categories
- Declarative over imperative - YAML schemas define what, not how
- Checkpoint everything - Support resume from any failure point
- HPC-native - First-class SLURM/PBS support
- Robustness-focused - Built-in specification management for sensitivity analysis
Core: geopandas, xarray, rioxarray, dask, pyarrow, pyyaml, click, rich, pydantic
Optional:
- prefect - For advanced orchestration
- submitit - Alternative SLURM submission
- planetary-computer, earthengine-api, pystac-client, stackstac, tenacity, requests - Remote data sources
R users can access geopipe via the reticulate package. See README.md "Using geopipe from R" section for setup and usage examples with conda environments.
Tests are in tests/ using pytest. Each module has a corresponding test file:
- test_sources.py - Data source tests
- test_fusion.py - Fusion schema tests
- test_pipeline.py - Pipeline and task tests
- test_specs.py - Specification management tests
- test_robustness_dsl.py - Robustness DSL tests
- test_preflight.py - Preflight validation tests
- test_quality.py - Quality audit tests
- test_discovery.py - Data discovery tests
- test_remote_sources.py - Remote source tests