Skip to content

Stream zip downloads, add keep_zip, flip raw-retention defaults to False#87

Merged
nick-gorman merged 4 commits into
masterfrom
downloader-streaming-cleanup
May 25, 2026
Merged

Stream zip downloads, add keep_zip, flip raw-retention defaults to False#87
nick-gorman merged 4 commits into
masterfrom
downloader-streaming-cleanup

Conversation

@nick-gorman
Copy link
Copy Markdown
Member

Summary

Second of three sequential PRs implementing the downloader rework from #67. Builds on PR #85's requests.Session foundation. Lands streaming + cleanup-on-failure, adds a new keep_zip parameter for #56's slow-internet caching use case, and flips both raw-retention defaults (keep_csv and keep_zip) to False for a leaner default cache.

Three commits, each independently meaningful:

# Commit Author What
1 Stream zip downloads to disk and cache them between runs @mdavis-xyz Cherry-pick of PR #67's 2a5812f, with PR-C bits trimmed. Introduces download_to_path + download_to_dir helpers; refactors download_unzip_csv, the four _download_and_unpack_*, download_csv, download_xl, and download_elements_file to use them.
2 Tidy up the streaming download code me Bug fixes + bare-except fix that turned out to be the real cause of the KI-doesn't-propagate concern. Needs Matt's review — see "For @mdavis-xyz" below.
3 Add keep_zip parameter and flip both raw-retention defaults to False me Plumb keep_zip through the call stack with the (path, downloaded) signal that keeps cleanup safe for concurrent use. Flip defaults.

New default contract

keep_csv keep_zip what's left in the cache
F (default) F (default) feather/parquet only — lean cache
F T feather/parquet + zip — #56's slow-internet caching
T F feather/parquet + CSV — downstream CSV consumer
T T feather/parquet + CSV + zip — full raw retention

Behaviour change for keep_csv: PR #83 had landed keep_csv=True as the new default. This PR re-flips it to False. The rationale: both raw-retention flags should be independent opt-ins so the default state is the lean cache. PR #83's rationale ("keep the source CSV so downstream consumers can re-read without re-fetching") is still achievable via explicit keep_csv=True.

For @mdavis-xyz — checking in on a couple of judgement calls

Commit 2 drops the eight except KeyboardInterrupt: raise blocks you added in #67 (2a5812f for downloader.py, d097030 for gui.py). Wanted to flag this and check my reasoning rather than just silently delete:

  • In Py3, KeyboardInterrupt is a sibling of Exception under BaseException, not a subclass — so except Exception: already lets KI propagate.
  • I traced your stated intent ("don't swallow Ctrl+C") through the actual code path and the implementation as-written is structurally a no-op. bare-raise re-raises the same exception, which can still be caught by any outer except: (bare). I think you may have been operating on a Py2 mental model where BaseException didn't exist and except Exception could catch KI under some conditions.
  • I found the real cause of the concern you were trying to address: a bare except: in static_table (data_fetch_methods.py:301) that catches KI and converts it to NoDataToReturn, regardless of what the inner functions do. This bug pre-dates Skip/Speed up tests, resolve most open issues #67 — Commit 2 narrows that to except Exception: which makes KI propagate naturally without needing the inner KI handlers.
  • I also narrowed the other bare-except in custom_tables.py:performance_at_nodal_peak for consistency. Same fix, different shape (defensive x = 1 rather than re-raise).

If I've misread the design intent here, happy to put any/all of the KI handlers back — let me know.

Live smoke tests

Same set as PR #85 ran clean on this branch too:

Test Result
AEMO registration .xlsx with Chrome 130
nemweb HTTPS MMS zip (2024-07 PUBLIC_DVD)
Real 404 → cleanly surfaces HTTPError
End-to-end dynamic_data_compiler against real nemweb

The pre-existing #74 (PUBLIC_ARCHIVE# encoding) is still untouched here — separate PR.

Test plan

  • uv run pytest tests/ — 389 pass, 1 skipped, 1 pre-existing unrelated warning
  • All 4 new downloader unit tests pass
  • All 4 new cache_compiler tests pass
  • Live smoke tests against real AEMO + nemweb
  • CI green

mdavis-xyz and others added 4 commits May 25, 2026 13:57
Cherry-picked from #67's 2a5812f. Two related behaviours land here:

1. Streamed downloads (closes the foundation for #55-style cleanup
   on partial writes — wired up in a follow-up commit). A new
   `download_to_path(url, path)` helper opens the response with
   `stream=True` and writes chunks to disk via `iter_content`, so
   large zips never have to fit in memory. `download_to_dir(url, dir)`
   layers on top, deriving the filename from the URL.

2. Zip retention (#56). All zip downloads now land on disk and stay
   there. Previously the zip was read into an `io.BytesIO` and
   discarded after extraction; with the new design the zip persists,
   so a subsequent call against the same cache can skip the network
   and just re-extract from local. `download_to_path` is idempotent
   on the existence of the destination — if the zip is there, the
   network call is short-circuited.

`download_unzip_csv`, the four `_download_and_unpack_*` functions,
`download_csv`, `download_xl`, and `download_elements_file` are
refactored to use the new helpers.

Trimmed from PR #67's original cherry-pick:

- The TTL-cached HTML parent-page pre-check (`_pre_check_file_is_missing`,
  `download_html`, `download_html_as_soup`, the cachetools dep) —
  separate feature, will be PR-C.
- The PR #67 `download_xl → download_xml` rename — was already
  corrected to `download_xlsx` in #84.

Behaviour change on `keep_csv=False`: the CSV is still removed as
documented, but the .zip stays. That's the whole point of #56 (Matt's
slow-internet use case). Users wanting a truly empty cache have no
opt-out yet — `keep_zip` parameter is added in the next commit on
this branch. The pre-existing
`test_keep_csv_false_leaves_cache_empty` test was written before #56
landed and asserted today's "no zip on disk" contract; renamed to
`test_keep_csv_false_removes_csv` and the assertion narrowed to
"no CSV", consistent with the new default.

Bugs in PR #67's original code are preserved as-authored for clarity
of authorship; fixes follow in the next commit:
- `download_to_dir`'s `force_redo` parameter is accepted but never
  forwarded to `download_to_path`.
- `download_to_path` has a "temporary debugging" log line for `.zip`
  URLs.
- `download_to_path` contains a per-request `aemo.com.au` UA override
  rendered redundant by PR #85's uniform Chrome 130 session default.
- Several `except KeyboardInterrupt: raise` blocks were added in the
  run dispatchers; KeyboardInterrupt isn't a subclass of Exception
  in Py3 so these are no-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five small clean-ups to the helpers introduced in the previous commit,
plus two pre-existing bare-except bugs they happened to expose. Want
to flag these for @mdavis-xyz to confirm — the KeyboardInterrupt
discussion in particular involves a deliberate-looking design call
of his that I'm reading as a no-op in Py3 (see PR body for analysis).

In downloader.py:

1. Fix `download_to_dir`'s `force_redo` parameter — it was accepted
   in the signature but never forwarded to `download_to_path`, so the
   kwarg silently did nothing. Now plumbed through.

2. Drop the per-request `aemo.com.au` Chrome 120 User-Agent override
   in `download_to_path`. The session-level Chrome 130 UA from PR #85
   already covers all hosts uniformly; the per-domain split is
   redundant. `urlparse` import dropped along with it.

3. Drop the `if response.status_code not in [200, 404]: logger.debug(...)`
   line. `raise_for_status()` on the next line already raises on every
   non-2xx; the debug log was suppressing log noise for 404s
   (legitimate "this archive doesn't exist yet" cases) but at the
   cost of a noisy `response.text[:1000]` body slice on every other
   non-2xx — including 200s, which is wrong.

4. Drop the "# temporary debugging" `logger.info("Zip downloaded to ...")`
   line. Was marked as temporary in the original code.

5. Drop the inner `except KeyboardInterrupt: raise` in
   `download_to_path`'s write loop. Same reasoning as the 8 others
   I removed: KeyboardInterrupt is a sibling of Exception (not a
   subclass) in Py3, so `except Exception:` below it already lets
   KI propagate. The handler is structurally a no-op.

In data_fetch_methods.py and custom_tables.py:

6. Two pre-existing bare `except:` clauses changed to
   `except Exception:`. These were the real cause of the KI-doesn't-
   propagate concern Matt was trying to address with his handlers in
   #67: the bare-except in `static_table`'s download dispatch would
   swallow KI and convert it to `NoDataToReturn`, regardless of what
   the inner functions did. Narrowing it to `except Exception:` lets
   KI propagate naturally — which makes the inner KI handlers
   genuinely unnecessary rather than just structurally no-op.

   The bare except in `custom_tables.py:performance_at_nodal_peak`
   is a different shape (catches into `x = 1` defensive code, not
   re-raise) and arguably a separate bug, but the same fix applies
   and it's a one-line change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes that finalise the streaming work:

1. Plumb a new `keep_zip` parameter through the call stack as the
   opt-in for #56's slow-internet zip caching. False by default.

2. Flip `keep_csv` default from True (the value PR #83 had landed)
   to False, so both raw-retention flags are independent opt-ins
   and the default state is the lean cache (only the typed
   feather/parquet remains, which is the "real" cache anyway).

Behaviour matrix:

  keep_csv  keep_zip  what's left in the cache
  --------  --------  -----------------------------------
  F (deflt) F (deflt) feather/parquet only          ← lean
  F         T         feather/parquet + zip          ← #56's slow-internet caching
  T         F         feather/parquet + CSV          ← downstream CSV consumer
  T         T         feather/parquet + CSV + zip    ← full raw retention

## Plumbing

`download_to_path` now returns True if a fresh network fetch
happened, False if the destination already existed and was re-used.
`download_to_dir` returns `(path, downloaded)`. This signal is what
keeps `keep_zip=False` safe for concurrent same-cache use: cleanup
only fires for zips THIS call actually wrote — pre-existing zips
(from a previous call, or another concurrent process) are reported as
`downloaded=False` and left untouched.

Each leaf function (`download_unzip_csv` and the four
`_download_and_unpack_*`) takes `keep_zip` and uses the `downloaded`
flag from `download_to_dir` to decide whether its own zip should be
cleaned up after extract. `keep_zip` is plumbed from
`dynamic_data_compiler` / `cache_compiler` through
`_dynamic_data_fetch_loop` → `_download_data` → the per-table
dispatcher functions (`run`, `run_bid_tables`, …).

## Default flip

The `keep_csv=True` default that PR #83 introduced is reverted here.
That PR's rationale ("keep raw CSV so downstream consumers can re-read
without re-fetching") still applies — but now the user can opt in
explicitly via `keep_csv=True` instead of getting it implicitly. The
default contract becomes: NEMOSIS only retains the artifact it was
specifically asked for (the typed feather/parquet); raw AEMO files
are transient unless the user explicitly asks for them.

## Test changes

- `test_keep_csv_true_by_default_keeps_fetched_csv` renamed to
  `test_keep_csv_true_retains_csv` and switched to explicit
  `keep_csv=True` — covers the opt-in path.
- New `test_keep_csv_default_is_false` covers the new default
  behaviour.
- New `test_keep_zip_default_is_false` and `test_keep_zip_true_retains_zip`
  cover the two `keep_zip` paths.
- New `test_cached_zip_extracts_without_network` proves the #56
  benefit end-to-end: with `keep_zip=True`, a second call that's
  missing the feather can rebuild from the cached zip without any
  network — verified by monkeypatching `aemo_mms_url` to a dead
  address mid-test.
- New unit tests in `tests/test_downloader.py` for the helper-level
  contracts: `download_to_path` returns True on fresh fetch / False
  on cache hit / True with `force_redo=True`, and `download_to_dir`
  forwards `force_redo` (regression test for the PR #67 bug where
  the kwarg was accepted but never forwarded).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #84 renamed download_xl to download_xlsx but this test was missed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nick-gorman nick-gorman merged commit fb5caf4 into master May 25, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants