Cache hardening: auto-create dir, keep_csv default, partial-file cleanup #83
Merged
Cherry-picked from #67 (commit 976b872).
cache_compiler's contract is "build me a typed cache from these arguments", so making the user create the destination directory first is needless friction. If the path doesn't exist, create it. If it exists but as a file rather than a directory, that's clearly a typo, so raise UserInputError naming the path.
Conflict resolution: the cherry-pick context assumed PR #67's full state including the None-check guard; master already has the None check (from PR #80) and an older error-message wording on the following line. Took PR #67's version of the isfile/makedirs block; the None check above it is unchanged from master.
Scope is unchanged from PR #67: dynamic_data_compiler and static_table still require the directory to exist (they're read-side operations; auto-creation only makes sense for cache_compiler).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
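The directory handling described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: ensure_cache_dir is a hypothetical helper name, and the UserInputError class below is a local stand-in for the project's own exception.

```python
import os
import tempfile


class UserInputError(Exception):
    """Local stand-in for the project's UserInputError (assumed to be an Exception subclass)."""


def ensure_cache_dir(path):
    """Create `path` if it is missing; reject it if it already exists as a regular file."""
    if os.path.isfile(path):
        # Name the offending path so a typo is easy to spot.
        raise UserInputError(f"Cache path {path!r} exists but is a file, not a directory.")
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # A nested path that does not exist yet gets created.
    created = ensure_cache_dir(os.path.join(tmp, "cache", "nested"))
    created_is_dir = os.path.isdir(created)
```

Checking isfile before makedirs keeps the error specific to the "path is a file" case; any other makedirs failure (permissions, for example) still surfaces as its original OSError.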
Two related changes:
1. Source fix: the cherry-picked cleanup block in _write_to_format used
bare 'os.path.exists' and 'os.remove', but data_fetch_methods.py
imports the stdlib module as 'os as _os'. So the cleanup path raised
NameError("name 'os' is not defined") whenever a feather/parquet
write actually failed — silently regressing the issue #55 fix to
worse-than-before, because the user-visible exception became
NameError instead of the real write error, and the partial file was
never cleaned up. Replaced with _os.path.isfile / _os.unlink to
match the existing pre-write cleanup convention on lines 781-782 of
the same function.
2. Regression tests in tests/end_to_end_table_tests/test_cache_compiler.py
covering all three behaviours added in this PR plus the bug above:
- test_creates_cache_directory_when_missing
- test_raises_when_cache_path_is_a_file
- test_keep_csv_true_by_default
- test_write_to_format_cleans_up_partial_file_on_failure
(parametrized over feather and parquet — both formats fail in the
same way under the bug, both pass after the fix)
The write-cleanup tests were authored before applying the source fix
and confirmed to fail on the buggy version with the exact NameError
from line 792, then to pass after the fix — so they are real
regression coverage, not coincidental.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-lock test
CI surfaced two issues with test_keep_csv_true_by_default that weren't visible when running locally on Windows:
1. AEMO zips contain CSV files with an uppercase '.CSV' extension. The test globbed for '*DISPATCHPRICE*.csv', which is case-insensitive on Windows (passed locally) but case-sensitive on Linux/macOS (failed on CI). NEMOSIS itself handles this internally with [cC][sS][vV] patterns; the test now uses [Cc][Ss][Vv] to match.
2. The test relied on the tmp_path being empty so the CSV-fetch path would run by default. Made the intent explicit with rebuild=True so the test deterministically forces the CSV-fetch branch regardless of starting cache state.
While restructuring, also added:
- test_keep_csv_false_removes_fetched_csv: mirror of the above with the override, confirming the opt-out path still removes CSVs.
- test_existing_feather_means_no_csv_is_fetched: locks in the user-clarified semantic that keep_csv=True is about "don't delete a CSV we fetched", NOT about "create a CSV when feather already exists". Pre-populates empty feather files at the expected names so the "already compiled" short-circuit fires, then asserts no CSV was created. In caching_mode the existence check skips the actual read, so empty feather files are fine for the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
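The extension-case problem from item 1 can be shown in isolation. This is a small sketch using pathlib; find_csvs and the file names are hypothetical and only serve to demonstrate the [Cc][Ss][Vv] character-class pattern.

```python
import tempfile
from pathlib import Path


def find_csvs(directory, stem_pattern):
    """Return names of files matching stem_pattern with a .csv extension in any letter case."""
    # Character classes make the extension match explicit instead of relying
    # on the platform's case sensitivity.
    pattern = f"{stem_pattern}.[Cc][Ss][Vv]"
    return sorted(p.name for p in Path(directory).glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; only the upper-case extension matters here.
    (Path(tmp) / "PUBLIC_DISPATCHPRICE_EXAMPLE.CSV").touch()
    (Path(tmp) / "notes.txt").touch()
    matches = find_csvs(tmp, "*DISPATCHPRICE*")
```

With the pattern written this way the same glob finds the upper-case file on Windows, Linux, and macOS alike.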
This was referenced May 25, 2026
Summary
Bundles three independent cache-related fixes cherry-picked from #67, plus a follow-up that fixes a latent NameError in the cleanup path and adds regression coverage.
Closes #55 (partial-file cleanup).
cache_compiler now auto-creates the destination directory if it's missing, and raises UserInputError if the path exists as a regular file. Scope is narrow: dynamic_data_compiler and static_table (read-side) still require the directory to exist. Required conflict resolution: master's None-check (from #80) collided with PR #67's adjacent error-message rewording; took PR #67's logic for the directory-vs-file check, left master's None check unchanged.
The feather/parquet write in _write_to_format is wrapped in try/except so a mid-write failure (disk full, interrupted writer, etc.) doesn't leave an unreadable partial file on disk that breaks subsequent reads. Re-raises the original exception unchanged.
cache_compiler(keep_csv=…) default
cache_compiler's keep_csv default flips from False to True. Callers that didn't specify it explicitly will now retain the raw AEMO CSVs in raw_data_location alongside the typed feather/parquet. Rationale: keeping the source CSV is useful for downstream consumers and for re-reading without re-fetching from AEMO. Users who specifically don't want CSVs (e.g. disk pressure) can still pass keep_csv=False explicitly. dynamic_data_compiler's default already was True on master, so only cache_compiler is affected.
Bug found and fixed in commit 4
Commit 3's cleanup block, as cherry-picked from #67, used bare os.path.exists and os.remove. But data_fetch_methods.py imports the stdlib module as 'os as _os' (line 2). So whenever a feather/parquet write actually failed, the user-visible exception became NameError("name 'os' is not defined") instead of the real write error (the original error survives only as __context__), and the partial file was never cleaned up.
That silently regressed the issue #55 fix to a state worse than before. Commit 4 swaps to _os.path.isfile / _os.unlink to match the convention on lines 781-782 of the same function.
Regression tests (commit 4)
All four new tests live in tests/end_to_end_table_tests/test_cache_compiler.py:
- test_creates_cache_directory_when_missing
- test_raises_when_cache_path_is_a_file
- test_keep_csv_true_by_default
- test_write_to_format_cleans_up_partial_file_on_failure[feather] + [parquet]: parametrized. Authored before applying the source fix and confirmed to fail with the exact NameError at line 792 on the buggy version, then to pass after the fix. Real regression coverage, not coincidental.
Test plan
- uv run pytest tests/end_to_end_table_tests/test_cache_compiler.py: 8/8 pass (3 pre-existing + 5 new)
- uv run pytest tests/: 376 pass, 1 skipped, 1 pre-existing unrelated warning
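As a footnote to the bug described under "Bug found and fixed in commit 4", the masking behaviour can be reproduced in a few lines. This is an isolated sketch, not code from the repository: missing_os is a deliberately unbound name standing in for the bare os reference, and capture_masked_error is a hypothetical helper.

```python
def capture_masked_error():
    """Return the exception a caller would see when cleanup itself fails."""
    try:
        try:
            raise OSError("simulated write failure")
        except Exception:
            # `missing_os` is intentionally undefined; it stands in for the
            # bare `os` name that was unbound in the buggy cleanup block.
            missing_os.remove("partial.feather")
            raise  # never reached: the line above raises NameError first
    except NameError as err:
        return err


masked = capture_masked_error()
# The caller sees NameError; the real write error is only reachable through
# the implicit exception chain.
original = masked.__context__
```

Python attaches the exception being handled to any new exception raised inside the handler, which is why the write error is still recoverable from __context__ even though it is no longer the exception the caller receives.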