Stream zip downloads, add keep_zip, flip raw-retention defaults to False by nick-gorman · Pull Request #87 · UNSW-CEEM/NEMOSIS

nick-gorman · 2026-05-25T04:34:15Z

Summary

Second of three sequential PRs implementing the downloader rework from #67. Builds on PR #85's requests.Session foundation. Lands streaming + cleanup-on-failure, adds a new keep_zip parameter for #56's slow-internet caching use case, and flips both raw-retention defaults (keep_csv and keep_zip) to False for a leaner default cache.

Three commits, each independently meaningful:

#	Commit	Author	What
1	Stream zip downloads to disk and cache them between runs	@mdavis-xyz	Cherry-pick of PR #67's `2a5812f`, with PR-C bits trimmed. Introduces `download_to_path` + `download_to_dir` helpers; refactors `download_unzip_csv`, the four `_download_and_unpack_*`, `download_csv`, `download_xl`, and `download_elements_file` to use them.
2	Tidy up the streaming download code	me	Bug fixes + bare-except fix that turned out to be the real cause of the KI-doesn't-propagate concern. Needs Matt's review — see "For @mdavis-xyz" below.
3	Add keep_zip parameter and flip both raw-retention defaults to False	me	Plumb `keep_zip` through the call stack with the `(path, downloaded)` signal that keeps cleanup safe for concurrent use. Flip defaults.

New default contract

keep_csv	keep_zip	what's left in the cache
F (default)	F (default)	feather/parquet only — lean cache
F	T	feather/parquet + zip — #56's slow-internet caching
T	F	feather/parquet + CSV — downstream CSV consumer
T	T	feather/parquet + CSV + zip — full raw retention

Behaviour change for keep_csv: PR #83 had landed keep_csv=True as the new default. This PR re-flips it to False. The rationale: both raw-retention flags should be independent opt-ins so the default state is the lean cache. PR #83's rationale ("keep the source CSV so downstream consumers can re-read without re-fetching") is still achievable via explicit keep_csv=True.

For @mdavis-xyz — checking in on a couple of judgement calls

Commit 2 drops the eight except KeyboardInterrupt: raise blocks you added in #67 (2a5812f for downloader.py, d097030 for gui.py). Wanted to flag this and check my reasoning rather than just silently delete:

In Py3, KeyboardInterrupt is a sibling of Exception under BaseException, not a subclass — so except Exception: already lets KI propagate.
I traced your stated intent ("don't swallow Ctrl+C") through the actual code path and the implementation as-written is structurally a no-op. bare-raise re-raises the same exception, which can still be caught by any outer except: (bare). I think you may have been operating on a Py2 mental model where BaseException didn't exist and except Exception could catch KI under some conditions.
I found the real cause of the concern you were trying to address: a bare except: in static_table (data_fetch_methods.py:301) that catches KI and converts it to NoDataToReturn, regardless of what the inner functions do. This bug pre-dates Skip/Speed up tests, resolve most open issues #67 — Commit 2 narrows that to except Exception: which makes KI propagate naturally without needing the inner KI handlers.
I also narrowed the other bare-except in custom_tables.py:performance_at_nodal_peak for consistency. Same fix, different shape (defensive x = 1 rather than re-raise).

If I've misread the design intent here, happy to put any/all of the KI handlers back — let me know.

Live smoke tests

Same set as PR #85 ran clean on this branch too:

Test	Result
AEMO registration `.xlsx` with Chrome 130	✅
nemweb HTTPS MMS zip (2024-07 PUBLIC_DVD)	✅
Real 404 → cleanly surfaces `HTTPError`	✅
End-to-end `dynamic_data_compiler` against real nemweb	✅

The pre-existing #74 (PUBLIC_ARCHIVE# encoding) is still untouched here — separate PR.

Test plan

uv run pytest tests/ — 389 pass, 1 skipped, 1 pre-existing unrelated warning
All 4 new downloader unit tests pass
All 4 new cache_compiler tests pass
Live smoke tests against real AEMO + nemweb
CI green

Cherry-picked from #67's 2a5812f. Two related behaviours land here: 1. Streamed downloads (closes the foundation for #55-style cleanup on partial writes — wired up in a follow-up commit). A new `download_to_path(url, path)` helper opens the response with `stream=True` and writes chunks to disk via `iter_content`, so large zips never have to fit in memory. `download_to_dir(url, dir)` layers on top, deriving the filename from the URL. 2. Zip retention (#56). All zip downloads now land on disk and stay there. Previously the zip was read into an `io.BytesIO` and discarded after extraction; with the new design the zip persists, so a subsequent call against the same cache can skip the network and just re-extract from local. `download_to_path` is idempotent on the existence of the destination — if the zip is there, the network call is short-circuited. `download_unzip_csv`, the four `_download_and_unpack_*` functions, `download_csv`, `download_xl`, and `download_elements_file` are refactored to use the new helpers. Trimmed from PR #67's original cherry-pick: - The TTL-cached HTML parent-page pre-check (`_pre_check_file_is_missing`, `download_html`, `download_html_as_soup`, the cachetools dep) — separate feature, will be PR-C. - The PR #67 `download_xl → download_xml` rename — was already corrected to `download_xlsx` in #84. Behaviour change on `keep_csv=False`: the CSV is still removed as documented, but the .zip stays. That's the whole point of #56 (Matt's slow-internet use case). Users wanting a truly empty cache have no opt-out yet — `keep_zip` parameter is added in the next commit on this branch. The pre-existing `test_keep_csv_false_leaves_cache_empty` test was written before #56 landed and asserted today's "no zip on disk" contract; renamed to `test_keep_csv_false_removes_csv` and the assertion narrowed to "no CSV", consistent with the new default. Bugs in PR #67's original code are preserved as-authored for clarity of authorship; fixes follow in the next commit: - `download_to_dir`'s `force_redo` parameter is accepted but never forwarded to `download_to_path`. - `download_to_path` has a "temporary debugging" log line for `.zip` URLs. - `download_to_path` contains a per-request `aemo.com.au` UA override rendered redundant by PR #85's uniform Chrome 130 session default. - Several `except KeyboardInterrupt: raise` blocks were added in the run dispatchers; KeyboardInterrupt isn't a subclass of Exception in Py3 so these are no-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mdavis-xyz

Five small clean-ups to the helpers introduced in the previous commit, plus two pre-existing bare-except bugs they happened to expose. Want to flag these for @mdavis-xyz to confirm — the KeyboardInterrupt discussion in particular involves a deliberate-looking design call of his that I'm reading as a no-op in Py3 (see PR body for analysis). In downloader.py: 1. Fix `download_to_dir`'s `force_redo` parameter — it was accepted in the signature but never forwarded to `download_to_path`, so the kwarg silently did nothing. Now plumbed through. 2. Drop the per-request `aemo.com.au` Chrome 120 User-Agent override in `download_to_path`. The session-level Chrome 130 UA from PR #85 already covers all hosts uniformly; the per-domain split is redundant. `urlparse` import dropped along with it. 3. Drop the `if response.status_code not in [200, 404]: logger.debug(...)` line. `raise_for_status()` on the next line already raises on every non-2xx; the debug log was suppressing log noise for 404s (legitimate "this archive doesn't exist yet" cases) but at the cost of a noisy `response.text[:1000]` body slice on every other non-2xx — including 200s, which is wrong. 4. Drop the "# temporary debugging" `logger.info("Zip downloaded to ...")` line. Was marked as temporary in the original code. 5. Drop the inner `except KeyboardInterrupt: raise` in `download_to_path`'s write loop. Same reasoning as the 8 others I removed: KeyboardInterrupt is a sibling of Exception (not a subclass) in Py3, so `except Exception:` below it already lets KI propagate. The handler is structurally a no-op. In data_fetch_methods.py and custom_tables.py: 6. Two pre-existing bare `except:` clauses changed to `except Exception:`. These were the real cause of the KI-doesn't- propagate concern Matt was trying to address with his handlers in #67: the bare-except in `static_table`'s download dispatch would swallow KI and convert it to `NoDataToReturn`, regardless of what the inner functions did. Narrowing it to `except Exception:` lets KI propagate naturally — which makes the inner KI handlers genuinely unnecessary rather than just structurally no-op. The bare except in `custom_tables.py:performance_at_nodal_peak` is a different shape (catches into `x = 1` defensive code, not re-raise) and arguably a separate bug, but the same fix applies and it's a one-line change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related changes that finalise the streaming work: 1. Plumb a new `keep_zip` parameter through the call stack as the opt-in for #56's slow-internet zip caching. False by default. 2. Flip `keep_csv` default from True (the value PR #83 had landed) to False, so both raw-retention flags are independent opt-ins and the default state is the lean cache (only the typed feather/parquet remains, which is the "real" cache anyway). Behaviour matrix: keep_csv keep_zip what's left in the cache -------- -------- ----------------------------------- F (deflt) F (deflt) feather/parquet only ← lean F T feather/parquet + zip ← #56's slow-internet caching T F feather/parquet + CSV ← downstream CSV consumer T T feather/parquet + CSV + zip ← full raw retention ## Plumbing `download_to_path` now returns True if a fresh network fetch happened, False if the destination already existed and was re-used. `download_to_dir` returns `(path, downloaded)`. This signal is what keeps `keep_zip=False` safe for concurrent same-cache use: cleanup only fires for zips THIS call actually wrote — pre-existing zips (from a previous call, or another concurrent process) are reported as `downloaded=False` and left untouched. Each leaf function (`download_unzip_csv` and the four `_download_and_unpack_*`) takes `keep_zip` and uses the `downloaded` flag from `download_to_dir` to decide whether its own zip should be cleaned up after extract. `keep_zip` is plumbed from `dynamic_data_compiler` / `cache_compiler` through `_dynamic_data_fetch_loop` → `_download_data` → the per-table dispatcher functions (`run`, `run_bid_tables`, …). ## Default flip The `keep_csv=True` default that PR #83 introduced is reverted here. That PR's rationale ("keep raw CSV so downstream consumers can re-read without re-fetching") still applies — but now the user can opt in explicitly via `keep_csv=True` instead of getting it implicitly. The default contract becomes: NEMOSIS only retains the artifact it was specifically asked for (the typed feather/parquet); raw AEMO files are transient unless the user explicitly asks for them. ## Test changes - `test_keep_csv_true_by_default_keeps_fetched_csv` renamed to `test_keep_csv_true_retains_csv` and switched to explicit `keep_csv=True` — covers the opt-in path. - New `test_keep_csv_default_is_false` covers the new default behaviour. - New `test_keep_zip_default_is_false` and `test_keep_zip_true_retains_zip` cover the two `keep_zip` paths. - New `test_cached_zip_extracts_without_network` proves the #56 benefit end-to-end: with `keep_zip=True`, a second call that's missing the feather can rebuild from the cached zip without any network — verified by monkeypatching `aemo_mms_url` to a dead address mid-test. - New unit tests in `tests/test_downloader.py` for the helper-level contracts: `download_to_path` returns True on fresh fetch / False on cache hit / True with `force_redo=True`, and `download_to_dir` forwards `force_redo` (regression test for the PR #67 bug where the kwarg was accepted but never forwarded). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #84 renamed download_xl to download_xlsx but this test was missed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mdavis-xyz and others added 4 commits May 25, 2026 13:57

Fix stale download_xl reference in 404 test

08d1b6d

PR #84 renamed download_xl to download_xlsx but this test was missed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nick-gorman merged commit fb5caf4 into master May 25, 2026
16 checks passed

This was referenced May 25, 2026

Skip/Speed up tests, resolve most open issues #67

Closed

Option to keep .zip not csv #56

Closed

Set keep_csv=True as default for cache_compiler #63

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream zip downloads, add keep_zip, flip raw-retention defaults to False#87

Stream zip downloads, add keep_zip, flip raw-retention defaults to False#87
nick-gorman merged 4 commits into
masterfrom
downloader-streaming-cleanup

nick-gorman commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nick-gorman commented May 25, 2026

Summary

New default contract

For @mdavis-xyz — checking in on a couple of judgement calls

Live smoke tests

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants