Stream zip downloads, add keep_zip, flip raw-retention defaults to False#87
Merged
Conversation
Cherry-picked from #67's 2a5812f. Two related behaviours land here: 1. Streamed downloads (closes the foundation for #55-style cleanup on partial writes — wired up in a follow-up commit). A new `download_to_path(url, path)` helper opens the response with `stream=True` and writes chunks to disk via `iter_content`, so large zips never have to fit in memory. `download_to_dir(url, dir)` layers on top, deriving the filename from the URL. 2. Zip retention (#56). All zip downloads now land on disk and stay there. Previously the zip was read into an `io.BytesIO` and discarded after extraction; with the new design the zip persists, so a subsequent call against the same cache can skip the network and just re-extract from local. `download_to_path` is idempotent on the existence of the destination — if the zip is there, the network call is short-circuited. `download_unzip_csv`, the four `_download_and_unpack_*` functions, `download_csv`, `download_xl`, and `download_elements_file` are refactored to use the new helpers. Trimmed from PR #67's original cherry-pick: - The TTL-cached HTML parent-page pre-check (`_pre_check_file_is_missing`, `download_html`, `download_html_as_soup`, the cachetools dep) — separate feature, will be PR-C. - The PR #67 `download_xl → download_xml` rename — was already corrected to `download_xlsx` in #84. Behaviour change on `keep_csv=False`: the CSV is still removed as documented, but the .zip stays. That's the whole point of #56 (Matt's slow-internet use case). Users wanting a truly empty cache have no opt-out yet — `keep_zip` parameter is added in the next commit on this branch. The pre-existing `test_keep_csv_false_leaves_cache_empty` test was written before #56 landed and asserted today's "no zip on disk" contract; renamed to `test_keep_csv_false_removes_csv` and the assertion narrowed to "no CSV", consistent with the new default. Bugs in PR #67's original code are preserved as-authored for clarity of authorship; fixes follow in the next commit: - `download_to_dir`'s `force_redo` parameter is accepted but never forwarded to `download_to_path`. - `download_to_path` has a "temporary debugging" log line for `.zip` URLs. - `download_to_path` contains a per-request `aemo.com.au` UA override rendered redundant by PR #85's uniform Chrome 130 session default. - Several `except KeyboardInterrupt: raise` blocks were added in the run dispatchers; KeyboardInterrupt isn't a subclass of Exception in Py3 so these are no-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five small clean-ups to the helpers introduced in the previous commit, plus two pre-existing bare-except bugs they happened to expose. Want to flag these for @mdavis-xyz to confirm — the KeyboardInterrupt discussion in particular involves a deliberate-looking design call of his that I'm reading as a no-op in Py3 (see PR body for analysis). In downloader.py: 1. Fix `download_to_dir`'s `force_redo` parameter — it was accepted in the signature but never forwarded to `download_to_path`, so the kwarg silently did nothing. Now plumbed through. 2. Drop the per-request `aemo.com.au` Chrome 120 User-Agent override in `download_to_path`. The session-level Chrome 130 UA from PR #85 already covers all hosts uniformly; the per-domain split is redundant. `urlparse` import dropped along with it. 3. Drop the `if response.status_code not in [200, 404]: logger.debug(...)` line. `raise_for_status()` on the next line already raises on every non-2xx; the debug log was suppressing log noise for 404s (legitimate "this archive doesn't exist yet" cases) but at the cost of a noisy `response.text[:1000]` body slice on every other non-2xx — including 200s, which is wrong. 4. Drop the "# temporary debugging" `logger.info("Zip downloaded to ...")` line. Was marked as temporary in the original code. 5. Drop the inner `except KeyboardInterrupt: raise` in `download_to_path`'s write loop. Same reasoning as the 8 others I removed: KeyboardInterrupt is a sibling of Exception (not a subclass) in Py3, so `except Exception:` below it already lets KI propagate. The handler is structurally a no-op. In data_fetch_methods.py and custom_tables.py: 6. Two pre-existing bare `except:` clauses changed to `except Exception:`. These were the real cause of the KI-doesn't- propagate concern Matt was trying to address with his handlers in #67: the bare-except in `static_table`'s download dispatch would swallow KI and convert it to `NoDataToReturn`, regardless of what the inner functions did. Narrowing it to `except Exception:` lets KI propagate naturally — which makes the inner KI handlers genuinely unnecessary rather than just structurally no-op. The bare except in `custom_tables.py:performance_at_nodal_peak` is a different shape (catches into `x = 1` defensive code, not re-raise) and arguably a separate bug, but the same fix applies and it's a one-line change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes that finalise the streaming work: 1. Plumb a new `keep_zip` parameter through the call stack as the opt-in for #56's slow-internet zip caching. False by default. 2. Flip `keep_csv` default from True (the value PR #83 had landed) to False, so both raw-retention flags are independent opt-ins and the default state is the lean cache (only the typed feather/parquet remains, which is the "real" cache anyway). Behaviour matrix: keep_csv keep_zip what's left in the cache -------- -------- ----------------------------------- F (deflt) F (deflt) feather/parquet only ← lean F T feather/parquet + zip ← #56's slow-internet caching T F feather/parquet + CSV ← downstream CSV consumer T T feather/parquet + CSV + zip ← full raw retention ## Plumbing `download_to_path` now returns True if a fresh network fetch happened, False if the destination already existed and was re-used. `download_to_dir` returns `(path, downloaded)`. This signal is what keeps `keep_zip=False` safe for concurrent same-cache use: cleanup only fires for zips THIS call actually wrote — pre-existing zips (from a previous call, or another concurrent process) are reported as `downloaded=False` and left untouched. Each leaf function (`download_unzip_csv` and the four `_download_and_unpack_*`) takes `keep_zip` and uses the `downloaded` flag from `download_to_dir` to decide whether its own zip should be cleaned up after extract. `keep_zip` is plumbed from `dynamic_data_compiler` / `cache_compiler` through `_dynamic_data_fetch_loop` → `_download_data` → the per-table dispatcher functions (`run`, `run_bid_tables`, …). ## Default flip The `keep_csv=True` default that PR #83 introduced is reverted here. That PR's rationale ("keep raw CSV so downstream consumers can re-read without re-fetching") still applies — but now the user can opt in explicitly via `keep_csv=True` instead of getting it implicitly. The default contract becomes: NEMOSIS only retains the artifact it was specifically asked for (the typed feather/parquet); raw AEMO files are transient unless the user explicitly asks for them. ## Test changes - `test_keep_csv_true_by_default_keeps_fetched_csv` renamed to `test_keep_csv_true_retains_csv` and switched to explicit `keep_csv=True` — covers the opt-in path. - New `test_keep_csv_default_is_false` covers the new default behaviour. - New `test_keep_zip_default_is_false` and `test_keep_zip_true_retains_zip` cover the two `keep_zip` paths. - New `test_cached_zip_extracts_without_network` proves the #56 benefit end-to-end: with `keep_zip=True`, a second call that's missing the feather can rebuild from the cached zip without any network — verified by monkeypatching `aemo_mms_url` to a dead address mid-test. - New unit tests in `tests/test_downloader.py` for the helper-level contracts: `download_to_path` returns True on fresh fetch / False on cache hit / True with `force_redo=True`, and `download_to_dir` forwards `force_redo` (regression test for the PR #67 bug where the kwarg was accepted but never forwarded). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #84 renamed download_xl to download_xlsx but this test was missed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Second of three sequential PRs implementing the downloader rework from #67. Builds on PR #85's
requests.Sessionfoundation. Lands streaming + cleanup-on-failure, adds a newkeep_zipparameter for #56's slow-internet caching use case, and flips both raw-retention defaults (keep_csvandkeep_zip) toFalsefor a leaner default cache.Three commits, each independently meaningful:
2a5812f, with PR-C bits trimmed. Introducesdownload_to_path+download_to_dirhelpers; refactorsdownload_unzip_csv, the four_download_and_unpack_*,download_csv,download_xl, anddownload_elements_fileto use them.keep_zipthrough the call stack with the(path, downloaded)signal that keeps cleanup safe for concurrent use. Flip defaults.New default contract
Behaviour change for
keep_csv: PR #83 had landedkeep_csv=Trueas the new default. This PR re-flips it toFalse. The rationale: both raw-retention flags should be independent opt-ins so the default state is the lean cache. PR #83's rationale ("keep the source CSV so downstream consumers can re-read without re-fetching") is still achievable via explicitkeep_csv=True.For @mdavis-xyz — checking in on a couple of judgement calls
Commit 2 drops the eight
except KeyboardInterrupt: raiseblocks you added in #67 (2a5812ffor downloader.py,d097030for gui.py). Wanted to flag this and check my reasoning rather than just silently delete:KeyboardInterruptis a sibling ofExceptionunderBaseException, not a subclass — soexcept Exception:already lets KI propagate.bare-raisere-raises the same exception, which can still be caught by any outerexcept:(bare). I think you may have been operating on a Py2 mental model whereBaseExceptiondidn't exist andexcept Exceptioncould catch KI under some conditions.except:instatic_table(data_fetch_methods.py:301) that catches KI and converts it toNoDataToReturn, regardless of what the inner functions do. This bug pre-dates Skip/Speed up tests, resolve most open issues #67 — Commit 2 narrows that toexcept Exception:which makes KI propagate naturally without needing the inner KI handlers.custom_tables.py:performance_at_nodal_peakfor consistency. Same fix, different shape (defensivex = 1rather than re-raise).If I've misread the design intent here, happy to put any/all of the KI handlers back — let me know.
Live smoke tests
Same set as PR #85 ran clean on this branch too:
.xlsxwith Chrome 130HTTPErrordynamic_data_compileragainst real nemwebThe pre-existing #74 (PUBLIC_ARCHIVE# encoding) is still untouched here — separate PR.
Test plan
uv run pytest tests/— 389 pass, 1 skipped, 1 pre-existing unrelated warning