Skip to content

Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere#85

Merged
nick-gorman merged 1 commit into
masterfrom
downloader-session-https-status
May 25, 2026
Merged

Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere#85
nick-gorman merged 1 commit into
masterfrom
downloader-session-https-status

Conversation

@nick-gorman
Copy link
Copy Markdown
Member

@nick-gorman nick-gorman commented May 25, 2026

Summary

First of three sequential PRs implementing the downloader rework from #67. This one is the foundation — purely structural changes to the HTTP layer, no new dependencies, no new code paths.

Change Why
Module-level requests.Session replaces ad-hoc requests.get(...) Connection reuse across the many small downloads in a query (~12+ monthly zips for a typical fetch). Saves a TCP+TLS handshake per request after the first. Headers set once, not threaded through every call site.
Default User-Agent bumped from Chrome 80 → Chrome 130 Chrome 80 dates from early 2020. AEMO's aemo.com.au host (where the participant registration .xlsx lives) filters out stale UAs as suspected scrapers, and a 6-year-old UA string increasingly fails that filter.
raise_for_status() on every download response Today a 404 propagates as a confusing BadZipFile when zipfile.ZipFile chokes on the 404 HTML body. The outer except Exception: in run() / run_bid_tables() / etc. catches both, so this is a clarity change, not a behaviour change — failed downloads still log a warning and continue; they just produce a sensible exception class in the trace.
3 URL templates flipped http://www.nemweb.com.auhttps://... nemweb has supported HTTPS for years; AEMO's CDN redirects http→https anyway, costing an extra round trip per request.

Departures from #67

PR #67's e5cac94 kept Chrome 80 as the session default and overrode per-request to Chrome 130 only for aemo.com.au URLs. This PR uses uniform Chrome 130 instead — same goal (a current UA for the picky endpoint), simpler code (no per-request headers={} dict, no urlparse of the URL). nemweb.com.au gets a fresher UA too, but they don't UA-filter so there's no risk.

PR #67 also bundled the streamed-downloads work, the TTL-cached HTML pre-check, and the download_xl → download_xml rename into the same commit. Those are deferred to follow-up PRs (and the rename was already corrected to download_xlsx in #84).

What's coming next in the queue

  • PR-B: streamed downloads + cleanup-on-failure (download_to_path helper, stream=True, iter_content)
  • PR-C: TTL-cached HTML pre-check + cachetools>=7 dep + _pre_check_file_is_missing for fast 404-skipping when scanning back into history

Each builds on this one's Session.

Live smoke-tested against real AEMO and nemweb

The whole point of this PR (Chrome 130 UA + HTTPS to nemweb + raise_for_status) only matters in the wild — offline tests can verify the code paths but not the actual server behaviour. Four targeted smoke tests run against real services on this branch:

Test Result
Download AEMO registration .xlsx via download_xl ✅ Chrome 130 accepted, 779 KB in 0.46s
Download a 2024-07 MMS zip via download_unzip_csv over HTTPS ✅ 16 MB CSV extracted in 1.07s
Hit a known-404 nemweb path ✅ Cleanly surfaces as HTTPError: 404 Client Error (not the previous BadZipFile)
End-to-end dynamic_data_compiler for 30 min of recent DISPATCHPRICE ✅ 30 rows × 13 cols across all 5 regions

Pre-existing bug surfaced (not in scope here)

The smoke testing surfaced #74PUBLIC_ARCHIVE#… format URLs (2024-08+) are wrongly encoded in download_unzip_csv (single %23 instead of the double %2523 that real nemweb expects). Diagnosed and root-cause documented in #74 (comment). Not introduced by this PR — same encoding code on master pre-this-PR — so it doesn't gate #85. Tracked as a follow-up.

Test plan

  • uv run pytest tests/test_downloader.py — 4/4 new tests pass
  • uv run pytest tests/ — 382 pass, 1 skipped, 1 pre-existing unrelated warning
  • Live smoke tests against real AEMO + nemweb (table above)
  • CI green

What the new unit tests cover

  • test_session_user_agent_claims_current_chrome — locks in the Chrome 130 string. Bumping the UA later requires updating this test, which is deliberate friction.
  • test_download_unzip_csv_raises_on_404, test_download_csv_raises_on_404, test_download_xl_raises_on_404 — confirm the three main download functions raise requests.HTTPError on non-2xx responses, using the offline aemo_mock_server fixture to hit real-but-missing paths.

…where

Three related changes to the downloader's HTTP layer:

1. Replace `requests.get(url, headers=USR_AGENT_HEADER)` everywhere
   with a module-level `requests.Session`. The session reuses TCP
   connections across the many small downloads NEMOSIS makes when
   scanning monthly archives (one query can fetch a dozen+ zips), so
   we save a TCP+TLS handshake per request after the first. Headers
   are set on the session once, so individual call sites no longer
   need to thread them through.

2. Bump the default User-Agent from Chrome 80 (early 2020) to
   Chrome 130 (late 2024 stable). AEMO's `aemo.com.au` host —
   where the participant registration .xlsx lives — filters out
   stale UAs as suspected scrapers, and a 6-year-old UA string
   increasingly fails that filter. nemweb.com.au is happy either
   way. PR #67's approach was to keep Chrome 80 as the session
   default and override per-request to Chrome 130 only for aemo
   URLs; uniform Chrome 130 is simpler and achieves the same goal
   (nemweb gets a fresher UA too, but they don't UA-filter so
   there's no risk).

3. Add `response.raise_for_status()` to every download call so HTTP
   errors surface as `requests.HTTPError` directly. Today a 404 (or
   500, or 403) propagates as a confusing `BadZipFile` when
   `zipfile.ZipFile` chokes on the error-page HTML body. The outer
   `except Exception:` handler in `run()`/`run_bid_tables()`/etc.
   catches both, so this is a clarity change, not a behaviour
   change for the integration path.

Also flips three URL templates from `http://www.nemweb.com.au` to
`https://...` — nemweb has supported HTTPS for years, AEMO's CDN
redirects http→https anyway, and the redirect adds an extra round
trip per request.

New `tests/test_downloader.py` covers the new contracts:

- `test_session_user_agent_claims_current_chrome` — locks in the
  Chrome 130 string. Bumping the UA later requires updating this
  test, which is intentional friction.
- Three `test_download_*_raises_on_404` tests — confirm
  `download_unzip_csv`, `download_csv`, and `download_xl` all
  raise HTTPError on non-2xx responses, using the offline
  `aemo_mock_server` fixture to hit a real-but-missing path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant