The Archive of Mass ENvironmental Data (AMEND) is a project to assemble and analyze data related to environmental regulation, focused on water policy in Massachusetts.
The website for the project is openamend.org.
This Git repository contains code for data acquisition (`get_data/`), analysis (`analysis/`), and the Jekyll site (`docs/`).
Data is refreshed automatically every Monday at 6am UTC via two GitHub Actions workflows:
- Update Data: Fetches all active data sources, validates row counts and schemas, assembles the SQLite database, commits the updated CSVs, and regenerates the AI Analysis semantic context.
- Update Charts: Runs after a successful data update to regenerate the Chart.js visualizations. The PyStan-based CSO regression analysis (`NECIR_CSO_map.py`) is excluded from CI and must be run locally.
If a workflow fails, a GitHub Issue is opened with a link to the failed run.
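The row-count and schema checks in the Update Data workflow can be sketched as below. The expected columns and the minimum row count are illustrative assumptions, not the project's actual configuration:

```python
import csv

# Hypothetical schema and threshold for one data source; the real
# workflow configures these per source.
EXPECTED_COLUMNS = {"town", "date", "value"}
MIN_ROWS = 100

def validate_csv(path):
    """Return a list of problems found in a fetched CSV (empty list = OK)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        n_rows = sum(1 for _ in reader)  # data rows, header excluded
        if n_rows < MIN_ROWS:
            problems.append(f"only {n_rows} rows (expected >= {MIN_ROWS})")
    return problems
```

A non-empty return value would be what triggers the automatic GitHub Issue.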
To run a full update locally:
```bash
bash update_all.sh
```

This script does not update the ECOS budget records or the SSA wage table, which require manual data entry.
Large files (SQLite database, full drinking water CSV, permit PDFs) are stored on Google Cloud Storage.
The site is hosted via GitHub Pages from the `docs/` directory.
To run the site locally (use `--host localhost` so sidebar links resolve correctly in the browser):

```bash
conda env create -f amend_jekyll_env.yml
conda activate amend_jekyll
cd docs
bundle exec jekyll serve --host localhost --port 4000 --baseurl ""
```

For faster rebuilds while editing, add the `--incremental` flag to rebuild only the files that have changed:

```bash
bundle exec jekyll serve --host localhost --port 4000 --baseurl "" --incremental
```

For running data fetches and most chart scripts (no PyStan/geopandas):

```bash
pip install -r requirements-ci.txt
```

For all scripts, including the PyStan CSO regression analysis:

```bash
conda env create -f amend_python_env.yml
conda activate amend_python
```

The AI Analysis page lets users ask natural-language questions about the database. The LLM generates SQL, executes it client-side via sql.js, and renders results with Plotly.
Instead of bare `CREATE TABLE` statements, the LLM is given a rich schema description, `docs/assets/db_semantic_context.txt`. This file includes:
- Table descriptions and row counts
- 5 sample rows per table (showing actual value formats, e.g. ALL-CAPS town names)
- Per-column notes (typos, date formats, join keys)
- Cross-table join relationships
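A rough sketch of how a context file like this can be assembled from the SQLite database follows. The table name, description, and output format here are hypothetical, not the project's real ones:

```python
import sqlite3

# Placeholder description map; the real project keeps these in
# get_data/generate_semantic_context.py.
TABLE_DESCRIPTIONS = {"cso_events": "Combined sewer overflow events"}

def describe_table(conn, table, n_samples=5):
    """Build a plain-text block: description, row count, columns, sample rows."""
    # Table name comes from the database's own catalog in real use,
    # so string interpolation here is acceptable for a sketch.
    cur = conn.execute(f"SELECT * FROM {table} LIMIT {n_samples}")
    cols = [d[0] for d in cur.description]
    lines = [f"Table: {table}"]
    if table in TABLE_DESCRIPTIONS:
        lines.append(f"  Description: {TABLE_DESCRIPTIONS[table]}")
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    lines.append(f"  Rows: {count}")
    lines.append(f"  Columns: {', '.join(cols)}")
    for row in cur.fetchall():  # sample rows expose real value formats
        lines.append(f"  Sample: {row}")
    return "\n".join(lines)
```

Concatenating such blocks for every table, plus per-column notes and join relationships, yields a context file the LLM can use to write correct SQL.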
The semantic context must be regenerated whenever data sources change (new tables, renamed columns, schema changes). `assemble_db.py` regenerates it automatically on each weekly data update; to regenerate manually:
```bash
cd get_data
conda run -n amend_python python generate_semantic_context.py
```

When adding or changing a data source:
- Update `TABLE_DESCRIPTIONS` and `COLUMN_NOTES` in `get_data/generate_semantic_context.py`
- Run `generate_semantic_context.py` to regenerate `docs/assets/db_semantic_context.txt`
- Commit both files