Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere by nick-gorman · Pull Request #85 · UNSW-CEEM/NEMOSIS

nick-gorman · 2026-05-25T03:06:24Z

Summary

First of three sequential PRs implementing the downloader rework from #67. This one is the foundation — purely structural changes to the HTTP layer, no new dependencies, no new code paths.

Change	Why
Module-level `requests.Session` replaces ad-hoc `requests.get(...)`	Connection reuse across the many small downloads in a query (~12+ monthly zips for a typical fetch). Saves a TCP+TLS handshake per request after the first. Headers set once, not threaded through every call site.
Default `User-Agent` bumped from Chrome 80 → Chrome 130	Chrome 80 dates from early 2020. AEMO's `aemo.com.au` host (where the participant registration `.xlsx` lives) filters out stale UAs as suspected scrapers, and a 6-year-old UA string increasingly fails that filter.
`raise_for_status()` on every download response	Today a 404 propagates as a confusing `BadZipFile` when `zipfile.ZipFile` chokes on the 404 HTML body. The outer `except Exception:` in `run()` / `run_bid_tables()` / etc. catches both, so this is a clarity change, not a behaviour change — failed downloads still log a warning and continue; they just produce a sensible exception class in the trace.
3 URL templates flipped `http://www.nemweb.com.au` → `https://...`	nemweb has supported HTTPS for years; AEMO's CDN redirects http→https anyway, costing an extra round trip per request.

Departures from #67

PR #67's e5cac94 kept Chrome 80 as the session default and overrode per-request to Chrome 130 only for aemo.com.au URLs. This PR uses uniform Chrome 130 instead — same goal (a current UA for the picky endpoint), simpler code (no per-request headers={} dict, no urlparse of the URL). nemweb.com.au gets a fresher UA too, but they don't UA-filter so there's no risk.

PR #67 also bundled the streamed-downloads work, the TTL-cached HTML pre-check, and the download_xl → download_xml rename into the same commit. Those are deferred to follow-up PRs (and the rename was already corrected to download_xlsx in #84).

What's coming next in the queue

PR-B: streamed downloads + cleanup-on-failure (download_to_path helper, stream=True, iter_content)
PR-C: TTL-cached HTML pre-check + cachetools>=7 dep + _pre_check_file_is_missing for fast 404-skipping when scanning back into history

Each builds on this one's Session.

Live smoke-tested against real AEMO and nemweb

The whole point of this PR (Chrome 130 UA + HTTPS to nemweb + raise_for_status) only matters in the wild — offline tests can verify the code paths but not the actual server behaviour. Four targeted smoke tests run against real services on this branch:

Test	Result
Download AEMO registration `.xlsx` via `download_xl`	✅ Chrome 130 accepted, 779 KB in 0.46s
Download a 2024-07 MMS zip via `download_unzip_csv` over HTTPS	✅ 16 MB CSV extracted in 1.07s
Hit a known-404 nemweb path	✅ Cleanly surfaces as `HTTPError: 404 Client Error` (not the previous `BadZipFile`)
End-to-end `dynamic_data_compiler` for 30 min of recent DISPATCHPRICE	✅ 30 rows × 13 cols across all 5 regions

Pre-existing bug surfaced (not in scope here)

The smoke testing surfaced #74 — PUBLIC_ARCHIVE#… format URLs (2024-08+) are wrongly encoded in download_unzip_csv (single %23 instead of the double %2523 that real nemweb expects). Diagnosed and root-cause documented in #74 (comment). Not introduced by this PR — same encoding code on master pre-this-PR — so it doesn't gate #85. Tracked as a follow-up.

Test plan

uv run pytest tests/test_downloader.py — 4/4 new tests pass
uv run pytest tests/ — 382 pass, 1 skipped, 1 pre-existing unrelated warning
Live smoke tests against real AEMO + nemweb (table above)
CI green

What the new unit tests cover

test_session_user_agent_claims_current_chrome — locks in the Chrome 130 string. Bumping the UA later requires updating this test, which is deliberate friction.
test_download_unzip_csv_raises_on_404, test_download_csv_raises_on_404, test_download_xl_raises_on_404 — confirm the three main download functions raise requests.HTTPError on non-2xx responses, using the offline aemo_mock_server fixture to hit real-but-missing paths.

…where Three related changes to the downloader's HTTP layer: 1. Replace `requests.get(url, headers=USR_AGENT_HEADER)` everywhere with a module-level `requests.Session`. The session reuses TCP connections across the many small downloads NEMOSIS makes when scanning monthly archives (one query can fetch a dozen+ zips), so we save a TCP+TLS handshake per request after the first. Headers are set on the session once, so individual call sites no longer need to thread them through. 2. Bump the default User-Agent from Chrome 80 (early 2020) to Chrome 130 (late 2024 stable). AEMO's `aemo.com.au` host — where the participant registration .xlsx lives — filters out stale UAs as suspected scrapers, and a 6-year-old UA string increasingly fails that filter. nemweb.com.au is happy either way. PR #67's approach was to keep Chrome 80 as the session default and override per-request to Chrome 130 only for aemo URLs; uniform Chrome 130 is simpler and achieves the same goal (nemweb gets a fresher UA too, but they don't UA-filter so there's no risk). 3. Add `response.raise_for_status()` to every download call so HTTP errors surface as `requests.HTTPError` directly. Today a 404 (or 500, or 403) propagates as a confusing `BadZipFile` when `zipfile.ZipFile` chokes on the error-page HTML body. The outer `except Exception:` handler in `run()`/`run_bid_tables()`/etc. catches both, so this is a clarity change, not a behaviour change for the integration path. Also flips three URL templates from `http://www.nemweb.com.au` to `https://...` — nemweb has supported HTTPS for years, AEMO's CDN redirects http→https anyway, and the redirect adds an extra round trip per request. New `tests/test_downloader.py` covers the new contracts: - `test_session_user_agent_claims_current_chrome` — locks in the Chrome 130 string. Bumping the UA later requires updating this test, which is intentional friction. - Three `test_download_*_raises_on_404` tests — confirm `download_unzip_csv`, `download_csv`, and `download_xl` all raise HTTPError on non-2xx responses, using the offline `aemo_mock_server` fixture to hit a real-but-missing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nick-gorman mentioned this pull request May 25, 2026

Monthly MMS dynamic table fetches fail from 2024-08 onward (PUBLIC_ARCHIVE#...) while older PUBLIC_DVD_* months still work #74

Closed

nick-gorman merged commit 2238d90 into master May 25, 2026
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere#85

Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere#85
nick-gorman merged 1 commit into
masterfrom
downloader-session-https-status

nick-gorman commented May 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nick-gorman commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Departures from #67

What's coming next in the queue

Live smoke-tested against real AEMO and nemweb

Pre-existing bug surfaced (not in scope here)

Test plan

What the new unit tests cover

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nick-gorman commented May 25, 2026 •

edited

Loading