Cache parent directory HTML for fast missing-file pre-check #88

Merged
nick-gorman merged 1 commit into master from downloader-html-precheck
May 25, 2026
Conversation

@nick-gorman
Member

Summary

Caches each archive directory's HTML index for ~1 hour and uses it to answer "does this file exist?" locally instead of firing a request that is guaranteed to 404. The speedup applies whenever a single dynamic_data_compiler call scans multiple missing months in the same parent directory (e.g. when start_time is well before nem_data_model_start_time).

Smoke-tested against real nemweb:

  • 1st missing-file check: ~570ms (parent HTML fetch)
  • 2nd+ in same parent: ~10ms (cached)

Effectively the cost of N missing-file checks drops from N×(404 round-trip) to one HTML fetch + N local lookups.
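To illustrate the caching idea in isolation, here is a minimal sketch of a TTL cache keyed by URL. It uses only the standard library as a stand-in for the cachetools cache this PR adds; the class name, the injectable fetch and clock parameters, and the example URL are all hypothetical, and the sketch omits the 1024-entry size bound.

```python
import time
from typing import Callable, Dict, Tuple


class TTLHtmlCache:
    """Hypothetical stdlib stand-in for a TTL cache of directory HTML."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # url -> (time the entry was stored, cached HTML)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no network call
        html = self._fetch(url)  # missing or expired: fetch and store
        self._entries[url] = (now, html)
        return html

    def clear(self) -> None:
        """Drop every entry, e.g. between tests."""
        self._entries.clear()


# Usage with a fake fetcher and a fake clock, so nothing touches the network.
calls = []
fake_now = [0.0]


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>listing</html>"


cache = TTLHtmlCache(fake_fetch, ttl_seconds=3600.0, clock=lambda: fake_now[0])
parent = "https://example.invalid/archive/2009/"
cache.get(parent)
cache.get(parent)
fetches_within_ttl = len(calls)

fake_now[0] = 3601.0  # step past the one-hour TTL
cache.get(parent)
fetches_after_expiry = len(calls)
```

Injecting the clock keeps the expiry behaviour testable without sleeping; the real helpers may be structured quite differently.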

What's new

  • download_html(url) / download_html_as_soup(url) — TTL-cached helpers (1hr, 1024 entries). Also adopted by _get_matching_link and download_elements_file for the same caching benefit on the live HTML pages NEMOSIS already fetches.
  • _pre_check_file_is_missing(url) — returns True/False/None against the parent HTML listing. None means "fall through to a real request" (covers non-nemweb URLs, non-archive extensions, parent HTML unreachable, etc).
  • download_to_path calls the pre-check; a True answer raises the same requests.HTTPError(404) shape as a real 404, so all existing missing-month warning paths keep working unchanged.
  • cachetools>=5.5.0 added to dependencies.
  • Autouse fixture in tests/conftest.py clears _html_cache between tests for hermeticity.
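The tri-state contract of the pre-check can be sketched as follows. This is an illustrative, standard-library-only approximation, not the code in this PR: the function and parameter names, the host substring check, and the filename-matching rule are assumptions; only the True/False/None meanings and the .zip / .csv restriction come from the description above.

```python
from html.parser import HTMLParser
from posixpath import basename
from typing import Callable, Optional, Set
from urllib.parse import urlsplit

ARCHIVE_EXTENSIONS = (".zip", ".csv")  # per the PR description
NEMWEB_HOST_FRAGMENT = "nemweb.com.au"  # assumption: how nemweb URLs are recognised


class _HrefCollector(HTMLParser):
    """Collects the lower-cased file name of every <a href=...> in a listing."""

    def __init__(self) -> None:
        super().__init__()
        self.names: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for key, value in attrs:
            if key == "href" and value:
                self.names.add(basename(value.rstrip("/")).lower())


def pre_check_file_is_missing(
    url: str, fetch_html: Callable[[str], str]
) -> Optional[bool]:
    """True: not listed. False: listed. None: cannot tell, do a real request."""
    parts = urlsplit(url)
    if NEMWEB_HOST_FRAGMENT not in parts.netloc.lower():
        return None  # non-nemweb URL, e.g. a local mock server
    if not parts.path.lower().endswith(ARCHIVE_EXTENSIONS):
        return None  # only archive-style files participate
    parent_url, _, filename = url.rpartition("/")
    try:
        html = fetch_html(parent_url + "/")
    except Exception:
        return None  # parent listing unreachable: fall through
    collector = _HrefCollector()
    collector.feed(html)
    return filename.lower() not in collector.names
```

Returning None rather than False on any uncertainty means a flaky listing fetch can only cost an extra request; it can never hide a file that actually exists.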

Tests

8 new tests in tests/test_downloader.py:

  • test_download_html_caches_across_calls — verifies the TTL cache by counting session.get calls
  • test_pre_check_returns_none_for_non_nemweb_urls — protects every existing test that targets the local mock server
  • test_pre_check_returns_none_for_non_archive_extensions — only .zip / .csv participate
  • test_pre_check_returns_false_when_file_is_listed — present in parent HTML → False
  • test_pre_check_returns_true_when_file_is_not_listed — absent → True
  • test_pre_check_returns_none_when_parent_html_unreachable — transient infra failure falls through, doesn't suppress real downloads
  • test_download_to_path_raises_http_404_when_pre_check_says_missing — end-to-end, with a session.get counter to prove the wire fetch is skipped
  • test_download_to_path_proceeds_when_pre_check_says_present — False answer must not short-circuit a real download
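The two end-to-end tests rely on counting session.get calls to prove the wire fetch is skipped. A rough sketch of that pattern, using unittest.mock and hypothetical stand-ins (download_to_path_sketch and StandInHTTPError are not the real NEMOSIS or requests names, and the URLs are placeholders):

```python
from types import SimpleNamespace
from typing import Callable, Optional
from unittest.mock import Mock


class StandInHTTPError(Exception):
    """Stand-in for requests.HTTPError: carries a .response with a status_code."""

    def __init__(self, message: str, response) -> None:
        super().__init__(message)
        self.response = response


def download_to_path_sketch(
    url: str, session, pre_check: Callable[[str], Optional[bool]]
) -> bytes:
    """Hypothetical simplification: short-circuit only on an explicit True."""
    if pre_check(url) is True:
        raise StandInHTTPError(
            f"404 Client Error: Not Found for url: {url}",
            SimpleNamespace(status_code=404),
        )
    # False or None: go to the network as before.
    return session.get(url).content


session = Mock()
session.get.return_value = SimpleNamespace(content=b"zip-bytes")

# Pre-check says "missing": a 404-shaped error is raised and no request is made.
try:
    download_to_path_sketch("https://example.invalid/missing.zip", session, lambda u: True)
    raised_status = None
except StandInHTTPError as err:
    raised_status = err.response.status_code
calls_after_missing = session.get.call_count

# Pre-check says "present": the real download proceeds.
payload = download_to_path_sketch("https://example.invalid/present.zip", session, lambda u: False)
calls_after_present = session.get.call_count
```

Checking `is True` explicitly, rather than truthiness, keeps the None branch on the ordinary request path.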

Departures from PR #67

Adapted from Matt's 2a5812f. Differences:

Part of the PR #67 breakdown effort (#8c of ~10).

Test plan

  • uv run pytest tests/test_downloader.py — 16/16 pass
  • uv run pytest — full suite 397 passed / 1 skipped (no regressions)
  • Smoke test against real nemweb confirms speed-up
  • CI matrix (3.10–3.14 × ubuntu/windows/macos)

🤖 Generated with Claude Code

Scanning historical NEMOSIS data hits many monthly archive paths that
don't yet exist. Each guaranteed-404 is a 200-500ms round-trip to
nemweb. Nemweb serves a browsable HTML index for each archive directory,
so fetching the parent listing once lets us answer many missing-file
questions locally.

Adds:
- download_html / download_html_as_soup, TTL-cached for 1 hour
- _pre_check_file_is_missing, called early in download_to_path; a True
  answer surfaces as the same HTTPError(404) shape callers already
  handle for real 404s, so missing-month warnings keep working unchanged
- cachetools dependency
- autouse fixture clearing the cache between tests to keep them
  hermetic
- 8 new tests in tests/test_downloader.py covering caching behaviour,
  the False/True/None pre-check branches, and end-to-end integration
  with download_to_path

Smoke-tested against real nemweb: first missing-file check ~570ms
(parent HTML fetch), subsequent missing-file checks in the same parent
directory ~10ms (cached lookup). Speedup is realised whenever a single
NEMOSIS call walks multiple missing months in the same archive folder.
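As a back-of-envelope illustration of those figures, the sketch below compares N cached checks against N uncached 404 round-trips. The function names and the choice of N = 12 are hypothetical; the timings are the rough smoke-test and round-trip numbers quoted in this message, not new measurements.

```python
def cached_total_ms(n: int, first_check_ms: float = 570.0, cached_check_ms: float = 10.0) -> float:
    """One parent HTML fetch, then (n - 1) local lookups."""
    if n <= 0:
        return 0.0
    return first_check_ms + (n - 1) * cached_check_ms


def uncached_total_ms(n: int, round_trip_ms: float) -> float:
    """n separate guaranteed-404 round-trips."""
    return n * round_trip_ms


n = 12  # e.g. a year of missing months in one archive folder
with_cache = cached_total_ms(n)        # 570 + 11 * 10 = 680 ms
low = uncached_total_ms(n, 200.0)      # 12 * 200 = 2400 ms
high = uncached_total_ms(n, 500.0)     # 12 * 500 = 6000 ms
```

For a single missing file the cached path is no faster (one ~570ms HTML fetch versus one 200-500ms 404), which is why the benefit depends on several misses sharing a parent directory.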

The pre-check is targeted at nemweb's directory-listed archive trees
(/Data_Archive/ and /Reports/). Other endpoints (notably the hashed
PUBLIC_ARCHIVE# files under aemo_mms_url) don't have a browsable parent,
so the pre-check returns None and the caller falls through to a real
request.

Adapted from PR #67's 2a5812f.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nick-gorman merged commit 09559bc into master on May 25, 2026
16 checks passed