fix(sof-export): clean up partial results on cancel, plus a TTL reaper (#144) #180

Open
mauripunzueta wants to merge 4 commits into main from fix/144-sof-export-cleanup-cancelled
@mauripunzueta

Contributor

Summary

Closes #144 and addresses the adjacent storage-hygiene gaps it surfaced in the SQL-on-FHIR async export controller ($viewdefinition-export / $sqlquery-export).

Previously, cancelling a job transitioned it to Cancelled, but its already-written output shards were never deleted and stayed downloadable. Finished jobs were never reclaimed at all: the in-process jobs map and the sink output grew without bound for the process lifetime, despite the completion manifest advertising a 24h Expires.

What changed

1. Cancellation cleanup (#144 as scoped)

  • New delete_job(&job_id) on the ExportSink trait, implemented for all three sinks:
    • FilesystemSink: remove_dir_all on the job dir (a missing dir is not an error)
    • InMemorySink: drops entries under the {job_id}/ key prefix
    • S3Sink: paginated list_objects_v2 + delete_object under the job key prefix (idempotent)
  • Called from the cancellation path, and again from the background task if it finishes a job that was cancelled mid-run (covers the write-after-cancel race).
  • read_shard is gated on job status, so a cancelled job's files return 404 even while deletion is still draining.
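
The per-sink contract above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the PR's code: the real `ExportSink` trait may well be async, and the method signatures, the `shards` field, and the `delete_job_dir` helper are all assumptions made for the example. Only `delete_job`, the `{job_id}/` key prefix, and the "missing dir is not an error" rule come from the description.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;
use std::sync::Mutex;

/// Simplified stand-in for the export sink trait (signatures are assumed).
pub trait ExportSink {
    fn write_shard(&self, job_id: &str, filename: &str, bytes: Vec<u8>) -> io::Result<()>;
    fn read_shard(&self, job_id: &str, filename: &str) -> Option<Vec<u8>>;
    /// Remove every shard written for `job_id`. Calling it twice must be safe.
    fn delete_job(&self, job_id: &str) -> io::Result<()>;
}

#[derive(Default)]
pub struct InMemorySink {
    // Keys have the form "{job_id}/{filename}" (hypothetical layout).
    shards: Mutex<HashMap<String, Vec<u8>>>,
}

impl ExportSink for InMemorySink {
    fn write_shard(&self, job_id: &str, filename: &str, bytes: Vec<u8>) -> io::Result<()> {
        let mut shards = self.shards.lock().unwrap();
        shards.insert(format!("{job_id}/{filename}"), bytes);
        Ok(())
    }

    fn read_shard(&self, job_id: &str, filename: &str) -> Option<Vec<u8>> {
        let shards = self.shards.lock().unwrap();
        shards.get(&format!("{job_id}/{filename}")).cloned()
    }

    fn delete_job(&self, job_id: &str) -> io::Result<()> {
        // The trailing slash matters: a bare "job-1" prefix would also
        // match keys that belong to "job-10".
        let prefix = format!("{job_id}/");
        let mut shards = self.shards.lock().unwrap();
        shards.retain(|key, _| !key.starts_with(&prefix));
        Ok(())
    }
}

/// Filesystem variant: remove the job directory, treating "already gone"
/// as success so that repeated deletes are idempotent.
pub fn delete_job_dir(root: &Path, job_id: &str) -> io::Result<()> {
    match fs::remove_dir_all(root.join(job_id)) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```

Idempotency is what lets the cancellation path and the background task both call `delete_job` for the same job without coordinating.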

2. TTL reaper for finished jobs

  • A background reaper deletes a terminal job's output and drops its bookkeeping once it ages past the TTL, bounding both storage and the in-memory map. Aligned with the manifest's advertised 24h Expires.
  • JobStatus::Cancelled now carries cancelled_at, and a terminal_at() helper lets all three terminal states age uniformly.
  • New config: HFS_EXPORT_OUTPUT_TTL (default 86400) and HFS_EXPORT_CLEANUP_INTERVAL (default 300). The interval is clamped to ≥ 1s (tokio::time::interval panics on zero).
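
A minimal sketch of the reaper logic described above, using only the standard library. The variant field names other than `cancelled_at`, the `reap_expired` signature, the `cleanup_interval` helper, and the choice of `>=` for "ages past the TTL" are assumptions for illustration; the actual implementation runs on a tokio interval and also deletes sink output for each reaped job.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

#[derive(Debug, Clone, PartialEq)]
pub enum JobStatus {
    Running,
    Completed { completed_at: SystemTime },
    Failed { failed_at: SystemTime },
    Cancelled { cancelled_at: SystemTime },
}

impl JobStatus {
    /// When the job reached a terminal state, or `None` while it is running.
    pub fn terminal_at(&self) -> Option<SystemTime> {
        match self {
            JobStatus::Running => None,
            JobStatus::Completed { completed_at } => Some(*completed_at),
            JobStatus::Failed { failed_at } => Some(*failed_at),
            JobStatus::Cancelled { cancelled_at } => Some(*cancelled_at),
        }
    }
}

/// Drop bookkeeping for terminal jobs whose age is at least `ttl` and return
/// their ids, so the caller can delete the matching sink output.
pub fn reap_expired(
    jobs: &mut HashMap<String, JobStatus>,
    now: SystemTime,
    ttl: Duration,
) -> Vec<String> {
    let expired: Vec<String> = jobs
        .iter()
        .filter(|(_, status)| {
            status
                .terminal_at()
                // A terminal timestamp in the future yields Err and is skipped.
                .and_then(|at| now.duration_since(at).ok())
                .is_some_and(|age| age >= ttl)
        })
        .map(|(id, _)| id.clone())
        .collect();
    for id in &expired {
        jobs.remove(id);
    }
    expired
}

/// Parse a cleanup interval in seconds (default 300) and clamp it to at
/// least one second, since a zero-length tokio interval panics.
pub fn cleanup_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(300);
    Duration::from_secs(secs.max(1))
}
```

Passing `now` in as a parameter, rather than reading the clock inside `reap_expired`, keeps the age threshold testable without sleeping.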

3. Failed-job partial cleanup

  • A failed job's result URL returns 500 with no manifest, so its shards are unreachable but were never deleted. They're now removed on failure, and read_shard gates Failed alongside Cancelled.
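
The read gate can be pictured as a status check that runs before any bytes are served. The sketch below is illustrative only: `JobState`, `ShardError`, and `read_shard_gated` are hypothetical names, and treating an unknown job as not found is an assumption rather than something the PR states.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Running,
    Completed,
    Failed,
    Cancelled,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ShardError {
    /// Maps to an HTTP 404 in the handler.
    NotFound,
}

/// Decide whether a shard may be served. `state` is the job's current status
/// (None if the job is unknown) and `stored` is whatever the sink holds.
pub fn read_shard_gated(
    state: Option<JobState>,
    stored: Option<&[u8]>,
) -> Result<Vec<u8>, ShardError> {
    match state {
        // Failed and Cancelled jobs never serve output, even if deletion
        // has not finished draining and the bytes are still in the sink.
        None | Some(JobState::Failed) | Some(JobState::Cancelled) => Err(ShardError::NotFound),
        Some(JobState::Running) | Some(JobState::Completed) => {
            stored.map(|bytes| bytes.to_vec()).ok_or(ShardError::NotFound)
        }
    }
}
```

Checking status first means the response does not depend on how far the asynchronous delete has progressed.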

Deliberately not changed

DELETE on a completed job remains a no-op that preserves the Completed state (the status URL keeps redirecting to the result manifest), per test_export_cancel_after_completion_preserves_completed_state. Completed output is reclaimed by the TTL reaper above, not by DELETE.
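
As a rough illustration of that rule, the DELETE handler's state transition might look like the sketch below. The names `JobState`, `DeleteOutcome`, and `handle_delete` are hypothetical, and treating DELETE on an already Failed or Cancelled job as a no-op is an assumption; only the Completed behaviour is stated in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Running,
    Completed,
    Failed,
    Cancelled,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DeleteOutcome {
    /// The job was running and is now Cancelled; the caller should delete
    /// its partial output.
    Cancelled,
    /// The job was already terminal; state and output are left untouched.
    NoOp,
}

pub fn handle_delete(state: &mut JobState) -> DeleteOutcome {
    match *state {
        JobState::Running => {
            *state = JobState::Cancelled;
            DeleteOutcome::Cancelled
        }
        // Completed stays Completed so the status URL keeps redirecting to
        // the manifest; its output is left for the TTL reaper.
        JobState::Completed | JobState::Failed | JobState::Cancelled => DeleteOutcome::NoOp,
    }
}
```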

Testing

  • New unit tests: cancel_deletes_partial_output_and_download_404s, reap_expired_reclaims_terminal_jobs_only (age threshold, running-immunity, output deletion + bookkeeping removal).
  • Full helios-rest lib suite (282) and the sof_export integration suite (44) pass, including the completion-preservation test.
  • cargo fmt clean; cargo clippy --features s3 --all-targets -- -D warnings clean.

🤖 Generated with Claude Code

…lled

Per the SQL-on-FHIR operations-common spec (HL7/sql-on-fhir#365), a server
SHOULD clean up partial results when an export is cancelled via DELETE on the
status URL. Previously, cancelling a $viewdefinition-export / $sqlquery-export
job transitioned it to Cancelled but already-written output shards were never
deleted and remained downloadable via GET /export/{job_id}/{filename}.

- Add `delete_job(&self, job_id)` to the `ExportSink` trait and implement it
  for all three sinks: FilesystemSink (remove_dir_all), InMemorySink (drop
  matching keys), S3Sink (paginated list + delete under the job key prefix).
- Call it from the cancellation path, and again from the background task when
  it finishes a job that was cancelled mid-run (covers the write-after-cancel
  race).
- Gate `read_shard` on Cancelled status so a cancelled job's files 404 even
  while deletion is still draining.

Closes #144
…ials

Follow-up to the cancellation cleanup. Two adjacent storage-hygiene gaps in the
SoF async export controller, found while reviewing #144:

1. No retention at all. The completion manifest advertises a 24h `Expires`, but
   nothing ever reclaimed finished jobs — the in-memory `jobs` map and the sink
   output grew unbounded for the process lifetime. Add a background reaper
   (HFS_EXPORT_OUTPUT_TTL, default 24h; HFS_EXPORT_CLEANUP_INTERVAL, default
   300s) that deletes a terminal job's output and drops its bookkeeping once it
   ages past the TTL. `JobStatus::Cancelled` now carries `cancelled_at` and a
   `terminal_at()` helper lets all three terminal states age uniformly.

2. Failed jobs left orphaned partial shards. A failed job's result URL returns
   500 with no manifest, so its shards are unreachable but were never deleted.
   The background task now deletes them on failure, and `read_shard` gates
   `Failed` (alongside `Cancelled`) so a racing poll never serves stale output.

DELETE-on-completed is intentionally left as a no-op that preserves the
Completed state (per test_export_cancel_after_completion_preserves_completed_state);
completed output is reclaimed by the TTL reaper above, not by DELETE.

Tests: reap_expired_reclaims_terminal_jobs_only. Full helios-rest lib suite (282)
and the sof_export integration suite (44) pass; clippy clean with --features s3.

Document the output lifecycle at the module level and flesh out the
per-sink delete_job contracts (idempotency, key-prefix semantics, and why
the S3 path bridges to the blocking pool). Comments only — no behavior change.
@claude

claude Bot commented Jun 23, 2026


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

…2026-0185

Two ambient CI breakages on origin/main, unrelated to this PR's changes:

- clippy 1.91 newly flags collapsible_else_if in sof/emit.rs; collapse the
  nested else { if .. } into else if.
- cargo audit fails on RUSTSEC-2026-0185 (quinn-proto remote memory
  exhaustion), a transitive reqwest QUIC dep. We never accept inbound QUIC,
  so the reassembly path is unreachable; ignore it with justification.
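
For readers unfamiliar with the lint, the shape of a collapsible_else_if fix is shown below on a made-up function. `shard_suffix` is hypothetical and is not the code from sof/emit.rs.

```rust
/// Hypothetical helper used only to show the lint's before/after shape.
pub fn shard_suffix(index: usize, total: usize) -> String {
    // Flagged form (an `if` as the only statement inside an `else` block):
    //     if total == 1 { .. } else { if index == 0 { .. } else { .. } }
    // Collapsed form, with identical behavior:
    if total == 1 {
        String::new()
    } else if index == 0 {
        "-first".to_string()
    } else {
        format!("-{index}")
    }
}
```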
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 91.50943% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/rest/src/export/sink.rs | 41.66% | 7 Missing ⚠️ |
| crates/rest/src/export/in_memory.rs | 96.12% | 6 Missing ⚠️ |
| crates/persistence/src/sof/emit.rs | 87.50% | 3 Missing ⚠️ |
| crates/rest/src/export/controller.rs | 71.42% | 2 Missing ⚠️ |

Development

Successfully merging this pull request may close these issues.

SoF export: clean up partial results when a job is cancelled