Fix #74: double-encode # in archive URLs (real AEMO files use literal %23) #86
Merged
Post-2024-07 PUBLIC_ARCHIVE# monthly MMS files are stored on
nemweb.com.au with literal `%23` characters in their filenames (not
`#`). NEMOSIS's `download_unzip_csv` was sending URLs with single `%23`
encoding — nemweb URL-decodes `%23` to `#`, looks for a `#`-named file,
finds none, and returns HTTP 400. The result: dynamic_data_compiler
fails with NoDataToReturn for any post-2024-07 PUBLIC_ARCHIVE# table
(DISPATCHPRICE, DISPATCHLOAD, etc.) on a cold cache.
To match the real filename on disk the URL needs `%2523` so nemweb
decodes it once to `%23`. Verified directly against the live server:
single `%23` returns 400; `%2523` returns 200 with the real zip body.
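The single-pass decoding described above can be reproduced locally. This is a minimal sketch that assumes nemweb's decoding behaves like one pass of Python's `urllib.parse.unquote`; the filename is the one used in the live verification further down.

```python
from urllib.parse import unquote

# Filename as it is stored on disk, with literal "%23" characters.
on_disk_name = "PUBLIC_ARCHIVE%23DISPATCHPRICE%23FILE01%23202412010000.zip"

# What the old code requested, and what the fixed code requests.
single_encoded = "PUBLIC_ARCHIVE%23DISPATCHPRICE%23FILE01%23202412010000.zip"
double_encoded = "PUBLIC_ARCHIVE%2523DISPATCHPRICE%2523FILE01%2523202412010000.zip"

# One decoding pass, standing in for what the server does (assumption).
single_decoded = unquote(single_encoded)
double_decoded = unquote(double_encoded)

print(single_decoded == on_disk_name)  # False: decodes to a "#"-named file
print(double_decoded == on_disk_name)  # True: "%25" decodes to "%", leaving "%23"
```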
Three changes here:
1. `src/nemosis/downloader.py::download_unzip_csv` — change the
`url.replace("#", "%23")` step to `url.replace("#", "%2523")`. This
is the only place NEMOSIS percent-encodes URLs at fetch time;
pre-2024-08 PUBLIC_DVD_* filenames don't contain `#` so the replace
is a no-op for them.
2. `tests/fixtures/build.py` — apply the same fix in `http_get` (which
the build pipeline uses to fetch fixtures from real AEMO), and
write the resulting fixtures to disk under their `%23`-form name
in `mms_fixture_path` so they match nemweb's actual filename layout.
3. Rename the existing 48 PUBLIC_ARCHIVE# fixture zips in
`tests/fixtures/data/.../MMSDM/2024_*/` and `MMSDM/2025_*/` from
`…#…zip` to `…%23…zip` via `git mv`. Required because the offline
test suite stands up a `http.server` over those files and serves
them at the URL NEMOSIS requests — once NEMOSIS sends `%2523`, the
server decodes it to `%23` and needs to find a `%23`-named file on
disk.
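For illustration, the encoding step in change 1 amounts to something like the following. The function name `encode_hash_for_nemweb` and the example paths are hypothetical; only the `url.replace("#", "%2523")` call comes from this change.

```python
def encode_hash_for_nemweb(url: str) -> str:
    """Hypothetical helper: double-encode '#' so that one server-side
    decode yields the literal '%23' found in the on-disk filename."""
    return url.replace("#", "%2523")


# Illustrative paths only, not real nemweb URLs.
archive_url = "https://example.invalid/PUBLIC_ARCHIVE#DISPATCHPRICE#FILE01#202412010000.zip"
dvd_url = "https://example.invalid/PUBLIC_DVD_DISPATCHPRICE_202401010000.zip"

encoded_archive = encode_hash_for_nemweb(archive_url)
encoded_dvd = encode_hash_for_nemweb(dvd_url)

print(encoded_archive)
print(encoded_dvd == dvd_url)  # True: no '#' in the name, so the replace is a no-op
```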
Why the offline tests didn't catch this before: the fixture filenames
used literal `#`, which disagreed with how real nemweb stores the same
files. Python's `http.server` URL-decoded `%23` → `#` and happily
served them, so the encoding mismatch was masked. After this change
the fixture filenames mirror the real on-disk layout and the offline
suite would now flag a regression in the URL encoding.
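The masking effect can be shown with a small standalone script. This is a sketch, not the project's test harness: it serves a temporary directory with Python's `http.server` and probes it with single- and double-encoded requests. The filename `A#B.zip` and the helpers `probe` and `status_for` are made up for the example. Note that the local server answers 404 for a missing file, whereas nemweb returned 400.

```python
import http.server
import tempfile
import threading
import urllib.error
import urllib.request
from functools import partial
from pathlib import Path

# Opener that ignores any proxy settings so requests reach the local server.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # suppress per-request logging


def status_for(port: int, path: str) -> int:
    """Return the HTTP status for a GET of `path` on the local server."""
    try:
        with _opener.open(f"http://127.0.0.1:{port}/{path}") as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def probe(fixture_name: str) -> dict:
    """Serve one file named `fixture_name`; request it both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / fixture_name).write_bytes(b"zip-bytes")
        handler = partial(QuietHandler, directory=tmp)
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            port = server.server_address[1]
            return {
                "single": status_for(port, "A%23B.zip"),
                "double": status_for(port, "A%2523B.zip"),
            }
        finally:
            server.shutdown()
            server.server_close()


old_layout = probe("A#B.zip")    # fixture named with a literal '#'
new_layout = probe("A%23B.zip")  # fixture named like the real file

print(old_layout)  # single-encoded request succeeds, hiding the bug
print(new_layout)  # only the double-encoded request succeeds
```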
All 222 offline tests pass. Live verification:
$ curl -I '…/PUBLIC_ARCHIVE%23DISPATCHPRICE%23FILE01%23202412010000.zip'
HTTP/1.1 400 Bad Request
$ curl -I '…/PUBLIC_ARCHIVE%2523DISPATCHPRICE%2523FILE01%2523202412010000.zip'
HTTP/1.1 200 OK (2,140,354 bytes)
Fixes #74.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Post-2024-07 `PUBLIC_ARCHIVE#` monthly MMS files on nemweb have literal `%23` in their on-disk filenames (not `#`). NEMOSIS sent URLs with single `%23` encoding; nemweb decoded that back to `#`, didn't find the file, and returned HTTP 400, so `dynamic_data_compiler` failed with `NoDataToReturn` for any post-2024-07 DISPATCHPRICE / DISPATCHLOAD / etc. query on a cold cache.
- Encode `#` → `%2523` in `downloader.download_unzip_csv` so nemweb decodes it once to `%23` and finds the real file. Pre-Aug-2024 `PUBLIC_DVD_*` filenames don't contain `#`, so the replace is a no-op for the older path.
- The offline fixtures were named with literal `#` (disagreeing with how real nemweb stores the same files). Renamed all 48 post-2024-07 fixture zips to `%23`-form so the offline suite now mirrors the real-server layout and would flag a regression in the encoding step. Also applied the matching `%2523` fix in `tests/fixtures/build.py::http_get` so anyone rebuilding fixtures from real AEMO gets working downloads.

Root cause (one paragraph)
AEMO's nemweb URL-decodes one level. To match a real on-disk filename that contains literal `%23`, the HTTP URL needs `%2523`; a single `%23` decodes to `#`, finds nothing, and 400s. Verified directly against the live server: single `%23` returns 400, `%2523` returns 200 with the real zip body. Full root-cause investigation is in my issue comment (Option A there).
Test plan
- All 222 offline tests pass (`uv run pytest tests/end_to_end_table_tests/`).
- Live HTTP check: `%23` → 400, `%2523` → 200 with real zip body, against `nemweb.com.au/Data_Archive/.../MMSDM_2024_12/.../PUBLIC_ARCHIVE…DISPATCHPRICE…zip`.
- `dynamic_data_compiler(start_time="2024/08/15 00:00:00", end_time="2024/08/16 00:00:00", table_name="DISPATCHPRICE", …)` end-to-end against the live server on a reviewer's machine. The HTTP-level proof above plus the offline test pass cover the same pipeline.

Fixes #74.
🤖 Generated with Claude Code