Skip to content

Cache hardening: auto-create dir, keep_csv default, partial-file cleanup#83

Merged
nick-gorman merged 5 commits into
masterfrom
cache-hardening
May 25, 2026
Merged

Cache hardening: auto-create dir, keep_csv default, partial-file cleanup#83
nick-gorman merged 5 commits into
masterfrom
cache-hardening

Conversation

@nick-gorman
Copy link
Copy Markdown
Member

Summary

Bundles three independent cache-related fixes cherry-picked from #67, plus a follow-up that fixes a latent NameError in the cleanup path and adds regression coverage.

Closes #55 (partial-file cleanup).

# Commit Author What
1 Create raw_data_cache if it does not yet exist @mdavis-xyz cache_compiler now auto-creates the destination directory if it's missing, and raises UserInputError if the path exists as a regular file. Scope is narrow — dynamic_data_compiler and static_table (read-side) still require the directory to exist. Required conflict resolution: master's None-check (from #80) collided with PR67's adjacent error-message rewording; took PR67's logic for the directory-vs-file check, left master's None check unchanged.
2 Set keep_csv=True by default for cache_compiler @mdavis-xyz Behavior change — see below.
3 Tidy up incomplete feather/parquet (#55) @mdavis-xyz Wrap the feather/parquet write in try/except so a mid-write failure (disk full, interrupted writer, etc) doesn't leave an unreadable partial file on disk that breaks subsequent reads. Re-raises the original exception unchanged.
4 Fix os/_os shadowing in cleanup branch and add regression tests me See below.

⚠️ Behavior change — cache_compiler(keep_csv=…) default

cache_compiler's keep_csv default flips from False to True. Callers that didn't specify it explicitly will now retain the raw AEMO CSVs in raw_data_location alongside the typed feather/parquet. Rationale: keeping the source CSV is useful for downstream consumers and for re-reading without re-fetching from AEMO. Users who specifically don't want CSVs (e.g. disk pressure) can still pass keep_csv=False explicitly.

dynamic_data_compiler's default already was True on master — only cache_compiler is affected.

Bug found and fixed in commit 4

Commit 3's cleanup block, as cherry-picked from #67, used bare os.path.exists and os.remove:

except Exception:
    # tidy up incomplete file
    if os.path.exists(full_filename):   # ← NameError
        os.remove(full_filename)         # ← NameError
    raise

But data_fetch_methods.py imports the stdlib module as os as _os (line 2). So whenever a feather/parquet write actually failed:

  1. The user-visible exception became NameError("name 'os' is not defined") instead of the real write error (the original error survives only as __context__).
  2. The partial file was never cleaned up.

That silently regressed issue #55 to worse-than-before. Commit 4 swaps to _os.path.isfile / _os.unlink to match the convention on lines 781-782 of the same function.

Regression tests (commit 4)

All four new tests live in tests/end_to_end_table_tests/test_cache_compiler.py:

  • test_creates_cache_directory_when_missing
  • test_raises_when_cache_path_is_a_file
  • test_keep_csv_true_by_default
  • test_write_to_format_cleans_up_partial_file_on_failure[feather] + [parquet] — parametrized. Authored before applying the source fix and confirmed to fail with the exact NameError at line 792 on the buggy version, then to pass after the fix. Real regression coverage, not coincidental.

Test plan

  • uv run pytest tests/end_to_end_table_tests/test_cache_compiler.py — 8/8 pass (3 pre-existing + 5 new)
  • uv run pytest tests/ — 376 pass, 1 skipped, 1 pre-existing unrelated warning
  • CI green

mdavis-xyz and others added 5 commits May 25, 2026 12:08
Cherry-picked from #67 (commit 976b872). cache_compiler's contract is
"build me a typed cache from these arguments" — making the user create
the destination directory first is needless friction. If the path
doesn't exist, create it. If it exists but as a file rather than a
directory, that's clearly a typo, so raise UserInputError naming the
path.

Conflict resolution: the cherry-pick context assumed PR #67's full
state including the None-check guard; master already has the None
check (from PR #80) and an older error-message wording on the
following line. Took PR #67's version of the isfile/makedirs block;
the None check above it is unchanged from master.

Scope is unchanged from PR #67 — dynamic_data_compiler and
static_table still require the directory to exist (they're read-side
operations; auto-creation only makes sense for cache_compiler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes:

1. Source fix: the cherry-picked cleanup block in _write_to_format used
   bare 'os.path.exists' and 'os.remove', but data_fetch_methods.py
   imports the stdlib module as 'os as _os'. So the cleanup path raised
   NameError("name 'os' is not defined") whenever a feather/parquet
   write actually failed — silently regressing the issue #55 fix to
   worse-than-before, because the user-visible exception became
   NameError instead of the real write error, and the partial file was
   never cleaned up. Replaced with _os.path.isfile / _os.unlink to
   match the existing pre-write cleanup convention on lines 781-782 of
   the same function.

2. Regression tests in tests/end_to_end_table_tests/test_cache_compiler.py
   covering all three behaviours added in this PR plus the bug above:
   - test_creates_cache_directory_when_missing
   - test_raises_when_cache_path_is_a_file
   - test_keep_csv_true_by_default
   - test_write_to_format_cleans_up_partial_file_on_failure
     (parametrized over feather and parquet — both formats fail in the
     same way under the bug, both pass after the fix)

The write-cleanup tests were authored before applying the source fix
and confirmed to fail on the buggy version with the exact NameError
from line 792, then to pass after the fix — so they are real
regression coverage, not coincidental.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-lock test

CI surfaced two issues with test_keep_csv_true_by_default that weren't
visible on Windows local:

1. AEMO zips contain CSV files with an uppercase '.CSV' extension. The
   test globbed for '*DISPATCHPRICE*.csv', which is case-insensitive on
   Windows (passed locally) but case-sensitive on Linux/macOS (failed
   on CI). NEMOSIS itself handles this internally with [cC][sS][vV]
   patterns — the test now uses [Cc][Ss][Vv] to match.

2. The test relied on the tmp_path being empty so the CSV-fetch path
   would run by default. Made the intent explicit with rebuild=True
   so the test deterministically forces the CSV-fetch branch
   regardless of starting cache state.

While restructuring, also added:

- test_keep_csv_false_removes_fetched_csv: mirror of the above with the
  override, confirming the opt-out path still removes CSVs.
- test_existing_feather_means_no_csv_is_fetched: locks in the user-
  clarified semantic — keep_csv=True is about "don't delete a CSV we
  fetched", NOT about "create a CSV when feather already exists".
  Pre-populates empty feather files at the expected names so the
  "already compiled" short-circuit fires, then asserts no CSV was
  created. In caching_mode the existence check skips the actual read,
  so empty feather files are fine for the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tidy up incomplete files

2 participants