fix(sync-service): make AsyncDeleter boot and self-heal on a full disk #4599

Draft
erik-the-implementer wants to merge 5 commits into main from alco/async-deleter-fix

@erik-the-implementer

Contributor

Summary

Electric.AsyncDeleter could not start when its trash directory's filesystem was full: init/1 ignored the result of File.mkdir_p/1, and the File.ls!/1 call that immediately followed raised a misleading %File.Error{reason: :enoent}, masking the real :enospc. Since AsyncDeleter is a stack-supervisor child, this propagated as :failed_to_start_child and crash-looped the whole stack. The result was a self-reinforcing deadlock, because AsyncDeleter is the very process that reclaims trash. Fixes #4595.

This PR makes the deleter boot resiliently, surface the real error, and own end-to-end recovery so it can start reclaiming space precisely when the disk is full.
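
The boot-time failure mode can be illustrated with a minimal sketch. The module and function names here (`TrashDirProbe.probe/1`) are hypothetical and are not the PR's actual code; the sketch only shows how matching on the non-raising `File.mkdir_p/1` and `File.ls/1` keeps the real reason visible instead of raising from `File.ls!/1`.

```elixir
defmodule TrashDirProbe do
  # Hypothetical helper, not taken from the PR.
  # Returns {:ok, entries} when the trash dir is usable, or
  # {:error, stage, reason} naming the call that failed, without raising.
  def probe(trash_dir) do
    with {:mkdir, :ok} <- {:mkdir, File.mkdir_p(trash_dir)},
         {:ls, {:ok, entries}} <- {:ls, File.ls(trash_dir)} do
      {:ok, entries}
    else
      # The original posix reason (e.g. :enospc) is passed through untouched.
      {stage, {:error, reason}} -> {:error, stage, reason}
    end
  end
end
```

A caller such as a GenServer's `init/1` could log the `{:error, stage, reason}` tuple and still return `{:ok, state}`, which is the "boots anyway" behavior the summary describes.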

What changed

  • Resilient init/1: match on File.mkdir_p's result and File.ls/1 (non-raising) instead of File.ls!. On failure it logs the real reason via Logger.error (e.g. :enospc, :enotdir) and boots anyway — never crashes the stack.
  • delete/1 disambiguation: :prim_file.rename returns {:error, :enoent} both when the source is already gone and when the trash dir is missing. The old code blindly read both as "already gone → :ok", silently reporting success while reclaiming nothing. It now probes File.exists?/1 to tell them apart.
  • Self-healing capture: the GenServer is the single owner of trash-dir creation. Live paths that can't be moved into the trash yet are tracked in a new pending_sources inventory and retried (mkdir + rename) on a self-heal timer that is armed only while degraded and goes silent once healthy. A handed-off source's bytes are captured into the trash and reaped as soon as space frees up.
  • Operability: each heal tick logs the queued-deletion backlog while the trash dir remains un-writable.
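
The delete/1 disambiguation above can be sketched roughly as follows. This is an illustration under assumptions: `RenameOutcome.move_to_trash/2` is a hypothetical name, and it uses `File.rename/2` as a stand-in for the `:prim_file.rename` call the PR mentions. It shows only the idea of probing `File.exists?/1` after an `:enoent` to distinguish "source already gone" from "trash dir missing".

```elixir
defmodule RenameOutcome do
  # Hypothetical classifier, not taken from the PR.
  # Tries to move `source` into `trash_dir` and reports what happened.
  def move_to_trash(source, trash_dir) do
    dest = Path.join(trash_dir, Path.basename(source))

    case File.rename(source, dest) do
      :ok ->
        :moved

      {:error, :enoent} ->
        # :enoent is ambiguous: either the source or the destination
        # directory is missing. Probing the source tells them apart.
        if File.exists?(source),
          do: {:pending, :trash_dir_missing},
          else: :already_gone

      {:error, reason} ->
        # Any other failure leaves the source in place for a later retry.
        {:pending, reason}
    end
  end
end
```

In this sketch a `{:pending, reason}` result is what a caller would record in an inventory like the PR's pending_sources, while `:already_gone` and `:moved` need no follow-up.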

Behavior note

On the degraded (full-disk) path, delete/1 now returns :ok rather than {:error, …}. This is strictly safer: the old error tuple would have hit shape_cleaner.ex's :ok = Storage.cleanup!(...) hard-match and crashed the cleanup task. The shape is dropped from the index while its bytes are reclaimed asynchronously by the heal loop — consistent with the existing async-delete philosophy, with a slightly wider index/disk consistency window during a full disk.
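
To make the degraded-path behavior concrete, here is a small, self-contained sketch of a heal loop. Everything in it is an assumption for illustration: the module name `HealLoopSketch`, the 50 ms interval, and the state shape are hypothetical, and the real AsyncDeleter is likely structured differently. The sketch shows delete always replying :ok, sources that cannot be captured being queued, and a timer that is armed only while the queue is non-empty.

```elixir
defmodule HealLoopSketch do
  use GenServer
  require Logger

  # Hypothetical retry interval, chosen only to keep the example fast.
  @heal_interval_ms 50

  def start_link(trash_dir), do: GenServer.start_link(__MODULE__, trash_dir)
  def delete(server, source), do: GenServer.call(server, {:delete, source})
  def pending(server), do: GenServer.call(server, :pending)

  @impl true
  def init(trash_dir) do
    {:ok, %{trash_dir: trash_dir, pending: MapSet.new(), timer: nil}}
  end

  @impl true
  def handle_call({:delete, source}, _from, state) do
    # Always reply :ok; a source that cannot be captured yet is queued.
    {:reply, :ok, state |> try_capture(source) |> arm_timer()}
  end

  def handle_call(:pending, _from, state) do
    {:reply, MapSet.to_list(state.pending), state}
  end

  @impl true
  def handle_info(:heal, state) do
    state = Enum.reduce(state.pending, %{state | timer: nil}, &try_capture(&2, &1))

    if MapSet.size(state.pending) > 0 do
      Logger.warning("trash dir still unusable, #{MapSet.size(state.pending)} deletion(s) queued")
    end

    {:noreply, arm_timer(state)}
  end

  # Retries mkdir + rename; on success the captured bytes are removed.
  defp try_capture(state, source) do
    dest = Path.join(state.trash_dir, Path.basename(source))

    with :ok <- File.mkdir_p(state.trash_dir),
         :ok <- File.rename(source, dest) do
      _ = File.rm_rf(dest)
      %{state | pending: MapSet.delete(state.pending, source)}
    else
      {:error, _reason} ->
        if File.exists?(source),
          do: %{state | pending: MapSet.put(state.pending, source)},
          else: %{state | pending: MapSet.delete(state.pending, source)}
    end
  end

  # The timer is armed only while something is pending and none is running.
  defp arm_timer(%{timer: nil} = state) do
    if MapSet.size(state.pending) > 0 do
      %{state | timer: Process.send_after(self(), :heal, @heal_interval_ms)}
    else
      state
    end
  end

  defp arm_timer(state), do: state
end
```

Because the caller only ever sees :ok, a hard match such as `:ok = ...` at the call site cannot fail on this path; the cost is the wider index/disk consistency window noted above.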

Testing

New resilient boot test group (trash dir obstructed by a regular file at .electric_trash → mkdir_p fails with :enotdir, uid-independent):

  • boots without crashing when the trash dir cannot be created;
  • delete hands off a live source (returns :ok, source preserved) when the trash dir is missing;
  • deleting a missing source still returns :ok;
  • handed-off source is captured into the trash and reaped once the obstruction clears.
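
The obstruction technique used by these tests can be sketched as a small helper. `ObstructedTrash.with_blocker/2` is a hypothetical name, not the PR's test code. The point of blocking with a regular file rather than with permission bits is presumably that permission checks are bypassed when tests run as root, whereas a file sitting at the directory's path makes `File.mkdir_p/1` fail for any user.

```elixir
defmodule ObstructedTrash do
  # Hypothetical test helper, not taken from the PR.
  # Places a regular file where the trash directory should be, runs `fun`
  # with that path, then removes the file so the path can become a directory.
  def with_blocker(base, fun) do
    trash_dir = Path.join(base, ".electric_trash")
    File.mkdir_p!(base)
    File.write!(trash_dir, "")

    try do
      fun.(trash_dir)
    after
      File.rm(trash_dir)
    end
  end
end
```

Inside the callback, directory creation returns an error tuple; once the helper returns, the same path can be created normally, which mirrors the "obstruction clears" step in the last test.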

mix test test/electric/async_deleter_test.exs → 9 tests, 0 failures (5 original + 4 new). mix compile --warnings-as-errors clean.

@codecov

codecov Bot commented Jun 16, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 4433 | 1 | 4432 | 55 |
View the top 2 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
29) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:507
     ** (EXIT from #PID<0.11509.0>) killed
test/runtime-dsl.test.ts > F: coordination orchestration > F12: dispatcher preserves counters and child rows across repeated failing dispatches
Stack Traces | 30s run time
Error: Expected 1 additional wake error(s), but they did not occur
 ❯ drainRuntimeWakes test/runtime-dsl.ts:591:15
 ❯ waitForRuntimeSettled test/runtime-dsl.ts:621:5
 ❯ Object.waitForSettled test/runtime-dsl.ts:941:7
 ❯ test/runtime-dsl.test.ts:2923:3

@netlify

netlify Bot commented Jun 16, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit cbc1d26
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a317141911d310008367c31
😎 Deploy Preview https://deploy-preview-4599--electric-next.netlify.app

Development

Successfully merging this pull request may close these issues.

AsyncDeleter deadlocks on full disk: crashes on boot with misleading :enoent, never reclaims trash
