Cache parent directory HTML for fast missing-file pre-check #88
Merged
Scanning historical NEMOSIS data hits many monthly archive paths that don't yet exist. Each guaranteed-404 is a 200-500ms round-trip to nemweb. Nemweb serves a browsable HTML index for each archive directory, so fetching the parent listing once lets us answer many missing-file questions locally.

Adds:
- download_html / download_html_as_soup, TTL-cached for 1 hour
- _pre_check_file_is_missing, called early in download_to_path; a True answer surfaces as the same HTTPError(404) shape callers already handle for real 404s, so missing-month warnings keep working unchanged
- cachetools dependency
- autouse fixture clearing the cache between tests to keep them hermetic
- 8 new tests in tests/test_downloader.py covering caching behaviour, the False/True/None pre-check branches, and end-to-end integration with download_to_path

Smoke-tested against real nemweb: first missing-file check ~570ms (parent HTML fetch), subsequent missing-file checks in the same parent directory ~10ms (cached lookup). Speedup is realised whenever a single NEMOSIS call walks multiple missing months in the same archive folder.

The pre-check is targeted at nemweb's directory-listed archive trees (/Data_Archive/ and /Reports/). Other endpoints (notably the hashed PUBLIC_ARCHIVE# files under aemo_mms_url) don't have a browsable parent, so the pre-check returns None and the caller falls through to a real request.

Adapted from PR #67's 2a5812f.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Caches each archive directory's HTML index for ~1 hour and uses it to answer "does this file exist?" locally instead of firing a guaranteed-404 request. The speedup applies whenever a single `dynamic_data_compiler` call scans multiple missing months in the same parent directory (e.g. `start_time` well before `nem_data_model_start_time`).

Smoke-tested against real nemweb: the first missing-file check took ~570ms (parent HTML fetch), and subsequent missing-file checks in the same parent directory took ~10ms (cached lookup).
Effectively the cost of N missing-file checks drops from N×(404 round-trip) to one HTML fetch + N local lookups.
What's new
- `download_html(url)` / `download_html_as_soup(url)`: TTL-cached helpers (1 hour, 1024 entries). Also adopted by `_get_matching_link` and `download_elements_file` for the same caching benefit on the live HTML pages NEMOSIS already fetches.
- `_pre_check_file_is_missing(url)`: returns `True`/`False`/`None` against the parent HTML listing. `None` means "fall through to a real request" (covers non-nemweb URLs, non-archive extensions, parent HTML unreachable, etc.).
- `download_to_path` calls the pre-check; a `True` answer raises the same `requests.HTTPError(404)` shape as a real 404, so all existing missing-month warning paths keep working unchanged.
- `cachetools>=5.5.0` added to dependencies.
- `tests/conftest.py` clears `_html_cache` between tests for hermeticity.

Tests
8 new tests in `tests/test_downloader.py`:
- `test_download_html_caches_across_calls`: verifies the TTL cache by counting `session.get` calls
- `test_pre_check_returns_none_for_non_nemweb_urls`: protects every existing test that targets the local mock server
- `test_pre_check_returns_none_for_non_archive_extensions`: only `.zip`/`.csv` participate
- `test_pre_check_returns_false_when_file_is_listed`: present in parent HTML → False
- `test_pre_check_returns_true_when_file_is_not_listed`: absent → True
- `test_pre_check_returns_none_when_parent_html_unreachable`: transient infra failure falls through, doesn't suppress real downloads
- `test_download_to_path_raises_http_404_when_pre_check_says_missing`: end-to-end, with a `session.get` counter to prove the wire fetch is skipped
- `test_download_to_path_proceeds_when_pre_check_says_present`: False answer must not short-circuit a real download

Departures from PR #67
Adapted from Matt's `2a5812f`. Differences:
- No `ValueError` from the pre-check. This PR raises `requests.HTTPError` with a synthetic 404 response, matching the contract that PR #85 (Adopt requests.Session, bump UA to Chrome 130, raise_for_status everywhere) established: every download function surfaces a 404 as `HTTPError`. All existing 404-warning callers keep working unchanged.
- No `assert file_url.startswith("https://")`, which PR #67 (Skip/Speed up tests, resolve most open issues) had in the pre-check. The offline test fixtures exercise the downloader against `http://` URLs, so an unconditional HTTPS assertion would break the test suite. The pre-check now simply returns `None` for any URL outside the configured nemweb prefixes, which has the same effect for real callers.
- `_html_cache` is exposed so tests can clear it between cases. PR #67's `@cached(cache=TTLCache(...))` decorator-inline pattern would have leaked state across tests.
- `_NEMWEB_HTML_PRECHECK_PREFIXES` is a module-level constant (vs. PR #67's inline literal), which lets tests monkeypatch the prefix list to point at the local mock server.

Part of the PR #67 breakdown effort (#8c of ~10).
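To illustrate the synthetic-404 departure listed above, here is a minimal sketch of how an `HTTPError` carrying a 404 response could be built with `requests`, assuming callers inspect `err.response.status_code`. The helper names `synthetic_404` and `warn_if_missing` are hypothetical and are not taken from the PR.

```python
import requests


def synthetic_404(url: str) -> requests.HTTPError:
    """Build an HTTPError resembling what raise_for_status() raises on a 404."""
    response = requests.Response()
    response.status_code = 404
    response.reason = "Not Found"
    response.url = url
    return requests.HTTPError(
        f"404 Client Error: Not Found for url: {url}", response=response
    )


def warn_if_missing(url: str, pre_check_says_missing: bool) -> str:
    """Hypothetical caller: one except branch covers pre-check and real 404s."""
    try:
        if pre_check_says_missing:
            raise synthetic_404(url)  # no wire request is made in this branch
        return "downloaded"  # placeholder for the real download
    except requests.HTTPError as err:
        if err.response is not None and err.response.status_code == 404:
            return "missing"
        raise
```

Because the exception type and attached response match a real 404, a caller written this way needs no extra branch for the pre-check, which is the property the PR relies on to keep missing-month warnings unchanged.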
Test plan
- `uv run pytest tests/test_downloader.py`: 16/16 pass
- `uv run pytest`: full suite 397 passed / 1 skipped (no regressions)

🤖 Generated with Claude Code