Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere#85
Merged
Merged
Conversation
…where Three related changes to the downloader's HTTP layer: 1. Replace `requests.get(url, headers=USR_AGENT_HEADER)` everywhere with a module-level `requests.Session`. The session reuses TCP connections across the many small downloads NEMOSIS makes when scanning monthly archives (one query can fetch a dozen+ zips), so we save a TCP+TLS handshake per request after the first. Headers are set on the session once, so individual call sites no longer need to thread them through. 2. Bump the default User-Agent from Chrome 80 (early 2020) to Chrome 130 (late 2024 stable). AEMO's `aemo.com.au` host — where the participant registration .xlsx lives — filters out stale UAs as suspected scrapers, and a 6-year-old UA string increasingly fails that filter. nemweb.com.au is happy either way. PR #67's approach was to keep Chrome 80 as the session default and override per-request to Chrome 130 only for aemo URLs; uniform Chrome 130 is simpler and achieves the same goal (nemweb gets a fresher UA too, but they don't UA-filter so there's no risk). 3. Add `response.raise_for_status()` to every download call so HTTP errors surface as `requests.HTTPError` directly. Today a 404 (or 500, or 403) propagates as a confusing `BadZipFile` when `zipfile.ZipFile` chokes on the error-page HTML body. The outer `except Exception:` handler in `run()`/`run_bid_tables()`/etc. catches both, so this is a clarity change, not a behaviour change for the integration path. Also flips three URL templates from `http://www.nemweb.com.au` to `https://...` — nemweb has supported HTTPS for years, AEMO's CDN redirects http→https anyway, and the redirect adds an extra round trip per request. New `tests/test_downloader.py` covers the new contracts: - `test_session_user_agent_claims_current_chrome` — locks in the Chrome 130 string. Bumping the UA later requires updating this test, which is intentional friction. - Three `test_download_*_raises_on_404` tests — confirm `download_unzip_csv`, `download_csv`, and `download_xl` all raise HTTPError on non-2xx responses, using the offline `aemo_mock_server` fixture to hit a real-but-missing path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First of three sequential PRs implementing the downloader rework from #67. This one is the foundation — purely structural changes to the HTTP layer, no new dependencies, no new code paths.
requests.Sessionreplaces ad-hocrequests.get(...)User-Agentbumped from Chrome 80 → Chrome 130aemo.com.auhost (where the participant registration.xlsxlives) filters out stale UAs as suspected scrapers, and a 6-year-old UA string increasingly fails that filter.raise_for_status()on every download responseBadZipFilewhenzipfile.ZipFilechokes on the 404 HTML body. The outerexcept Exception:inrun()/run_bid_tables()/ etc. catches both, so this is a clarity change, not a behaviour change — failed downloads still log a warning and continue; they just produce a sensible exception class in the trace.http://www.nemweb.com.au→https://...Departures from #67
PR #67's
e5cac94kept Chrome 80 as the session default and overrode per-request to Chrome 130 only foraemo.com.auURLs. This PR uses uniform Chrome 130 instead — same goal (a current UA for the picky endpoint), simpler code (no per-requestheaders={}dict, nourlparseof the URL). nemweb.com.au gets a fresher UA too, but they don't UA-filter so there's no risk.PR #67 also bundled the streamed-downloads work, the TTL-cached HTML pre-check, and the
download_xl → download_xmlrename into the same commit. Those are deferred to follow-up PRs (and the rename was already corrected todownload_xlsxin #84).What's coming next in the queue
download_to_pathhelper,stream=True, iter_content)cachetools>=7dep +_pre_check_file_is_missingfor fast 404-skipping when scanning back into historyEach builds on this one's Session.
Live smoke-tested against real AEMO and nemweb
The whole point of this PR (Chrome 130 UA + HTTPS to nemweb + raise_for_status) only matters in the wild — offline tests can verify the code paths but not the actual server behaviour. Four targeted smoke tests run against real services on this branch:
.xlsxviadownload_xldownload_unzip_csvover HTTPSHTTPError: 404 Client Error(not the previousBadZipFile)dynamic_data_compilerfor 30 min of recent DISPATCHPRICEPre-existing bug surfaced (not in scope here)
The smoke testing surfaced #74 —
PUBLIC_ARCHIVE#…format URLs (2024-08+) are wrongly encoded indownload_unzip_csv(single%23instead of the double%2523that real nemweb expects). Diagnosed and root-cause documented in #74 (comment). Not introduced by this PR — same encoding code on master pre-this-PR — so it doesn't gate #85. Tracked as a follow-up.Test plan
uv run pytest tests/test_downloader.py— 4/4 new tests passuv run pytest tests/— 382 pass, 1 skipped, 1 pre-existing unrelated warningWhat the new unit tests cover
test_session_user_agent_claims_current_chrome— locks in the Chrome 130 string. Bumping the UA later requires updating this test, which is deliberate friction.test_download_unzip_csv_raises_on_404,test_download_csv_raises_on_404,test_download_xl_raises_on_404— confirm the three main download functions raiserequests.HTTPErroron non-2xx responses, using the offlineaemo_mock_serverfixture to hit real-but-missing paths.