Skip to content

fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold#4588

Draft
erik-the-implementer wants to merge 10 commits into
mainfrom
alco/ets-inspector-mailbox-flooding
Draft

fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold#4588
erik-the-implementer wants to merge 10 commits into
mainfrom
alco/ets-inspector-mailbox-flooding

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #4370.

Electric.Postgres.Inspector.EtsInspector is the single GenServer that answers Postgres relation/oid/column/feature lookups for every cold-cache shape validate_request. When a never-before-seen root table gets a burst of requests — or when the PG pool is exhausted or the DB is unavailable — every concurrent miss used to run its own serialized GenServer.call. Because failed lookups weren't cached, N concurrent requests for the same failing key meant N serial DB attempts, each able to run for Postgrex's 15s default. The mailbox backs up and the inspector spends minutes doing redundant work.

This PR bounds that blast radius with three changes (the GenServer.call(:infinity) client API is unchanged):

  1. Explicit DB transaction timeout. A single lookup is now capped at 5s instead of inheriting Postgrex's 15s default, so a degraded pool can't tie up the inspector long after the triggering request has given up.

  2. In-flight coalescing. The DB lookup no longer runs inside handle_call. The GenServer spawns one supervised worker per unique in-flight key (relation, oid, or feature set), parks every concurrent waiter, and replies to all of them when that one worker finishes. load_relation_info and load_column_info share the oid key, so they coalesce onto the same call. A slow lookup for one key no longer blocks lookups for others, and a worker crash comes back as a {:DOWN, ...} message that replies {:error, :connection_not_available} to the waiters without taking down the inspector.

  3. Short-TTL negative cache. :table_not_found and {:error, _} results are cached in the same ETS table with a short TTL (default 1s). The client process reads this cache and skips the GenServer entirely during that window, so a sustained burst against a failing key lets the mailbox drain instead of refilling. Negative entries aren't persisted, can't collide with positive rows, and clean/2 drops them.

This follows on from the cheap admission control work (#4359) under the thundering-herd umbrella (#4266): sizing the :initial admission bucket only helps if the inspector's own behaviour under a burst against a degraded DB is bounded, which this provides. It mirrors #4537 (publication-manager cast suppression) — bounding how many messages get issued during shape-arrival bursts.

Note on scope

This fully addresses goal (a) (mailbox overload) and partially addresses goal (b) (orphaned waiters): the inspector no longer repeats DB work for a waiter that has already timed out upstream — an orphaned reply is now a single cheap, discarded GenServer.reply — but it still isn't actively told the waiter is gone. The issue's mitigation (4) (a request-process poll loop with a local deadline) is intentionally out of scope, since it needs a request deadline that isn't currently threaded onto the serve-shape path.

🤖 Generated with Claude Code

@alco alco self-assigned this Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
4509 2 4507 61
View the top 3 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
23) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:507
     ** (EXIT from #PID<0.13259.0>) killed
Elixir.Electric.StackSupervisor.TelemetryTest::test prometheus_metrics/0 per-status HTTP request counts are scrapable
Stack Traces | 0.191s run time
3) test prometheus_metrics/0 per-status HTTP request counts are scrapable (Electric.StackSupervisor.TelemetryTest)
     .../electric/stack_supervisor/telemetry_test.exs:67
     Assertion with =~ failed
     code:  assert scrape =~ ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     left:  "# HELP electric_admission_control_acquire_current \n# TYPE electric_admission_control_acquire_current gauge\nelectric_admission_control_acquire_current{kind=\"initial\"} 1\n# HELP electric_plug_serve_shape_requests_count \n# TYPE electric_plug_serve_shape_requests_count counter\nelectric_plug_serve_shape_requests_count{status=\"409\"} 1\nelectric_plug_serve_shape_requests_count{status=\"400\"} 1\nelectric_plug_serve_shape_requests_count{status=\"304\"} 1\nelectric_plug_serve_shape_requests_count{status=\"200\"} 3\nelectric_plug_serve_shape_requests_count{status=\"500\"} 1\n"
     right: ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     stacktrace:
       .../electric/stack_supervisor/telemetry_test.exs:90: (test)
Elixir.Electric.ShapeCacheTest::test await_snapshot_start/4 should wait for consumer to come up
Stack Traces | 5.77s run time
25) test await_snapshot_start/4 should wait for consumer to come up (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1080
     ** (exit) exited in: Task.await(%Task{mfa: {:erlang, :apply, 2}, owner: #PID<0.19031.0>, pid: #PID<0.19116.0>, ref: #Reference<0.0.2435971.3297180567.2644574211.9602>}, 5300)
         ** (EXIT) time out
     code: assert :started = Task.await(wait_task, start_consumer_delay + 5_000)
     stacktrace:
       (elixir 1.19.5) lib/task.ex:892: Task.await_receive/3
       test/electric/shape_cache_test.exs:1118: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@alco alco changed the title fix(sync-service): bound EtsInspector mailbox & DB amplification under bursts (#4370) fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold Jun 16, 2026
@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR bounds EtsInspector's mailbox and DB-attempt amplification under cold-start bursts and DB degradation via in-flight coalescing onto one supervised worker per key, a short-TTL negative cache, and an explicit 5s DB transaction timeout. Iteration 3 lands the two observability/architecture suggestions from the previous review (telemetry span + tree-supervised Task.Supervisor) plus doc/changeset polish. The OTP mechanics remain correct and the new code is clean. Looks ready to merge.

Previous Review Status

Both items deferred at iteration 2 are now addressed, and the earlier design items remain resolved:

  • Suggestion [Merged on #3] Write inserts, updates and deletes to Vaxine #2 (observability) — addressed. Commit efbc17b wraps each DB lookup in an inspector.fetch_db span (OpenTelemetry.with_span/4 with stack_id), tagged with inspector.key_type (relation / oid / supported_features) and inspector.result (ok / table_not_found / error). The span is correctly emitted as a root span from the detached worker — there's no parent request context to attach to, and one coalesced lookup can't belong to a single request anyway; the comment states this explicitly. Sampler.include_span?/1 defaults to included?(_) -> true, so the span isn't silently dropped. The fetch_for_keydo_fetch_for_key split is clean and fetch_outcome/1 is exhaustive over the {:ok, :table_not_found} | {:ok, _} | {:error, _} result shapes.
  • Suggestion Postgres replication source TCP server #3 (Task.Supervisor in tree) — addressed. Commit fc05b98 moves the worker Task.Supervisor out of init/1 and declares it as a sibling child in Electric.StackSupervisor, addressed by the registered name EtsInspector.task_supervisor_name/1. This is strictly more robust than the old self-linked pid: the inspector now holds the supervisor's registered name in state.task_sup, so under the supervisor's :one_for_one strategy a Task.Supervisor crash restarts only the supervisor and the inspector re-resolves the name on its next lookup — without losing its ETS cache. The child is ordered before EtsInspector (starts first), in-flight workers that die on a supervisor crash still surface as {:DOWN, …} and reply {:error, :connection_not_available} to waiters, and the test setup (with_inspector/1) correctly starts the sibling supervisor first with a clear comment about why it outlives inspector restarts.
  • Timeout rationale / changeset prose. Commits a8048e0 and c0186f4 improve the @fetch_db_timeout comment (now framed as metadata-pool protection — the 4-connection shared pool can be pinned by a handful of degraded cold-cache keys — explicitly distinct from the caller's 20–60s long-poll budget) and the changeset bullet list. Both are accurate.
  • ✅ Iteration-1 items (unbounded negative-cache growth → periodic sweep; O(n) pop_in_flight_by_refin_flight_refs reverse index; clean/2 clearing stale negatives) remain resolved.

What's Working Well

  • Supervision change is a genuine robustness upgrade, not just cosmetics. Decoupling the worker pool's lifecycle from the inspector means a worker-pool fault no longer forces the inspector to drop its warm ETS cache, while name-based addressing keeps the reference valid across restarts.
  • Telemetry placement is correct. Recording outside the request trace and as a standalone span is the right call for a coalesced, detached lookup; attributes are added inside the span before the result returns.
  • No regressions in the OTP core. async_nolink demonitor-with-:flush, single-route-through pop_in_flight_by_ref on every exit path, and the race-free re-check before parking are all unchanged and still correct.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

  1. Negative-caching :connection_not_available still briefly masks recovery (carried over, still optional). A transient connection error is cached for the full TTL, so a key keeps short-circuiting even if the pool recovers within the window. The default TTL is short (1s) and this is exactly what EtsInspector: prevent mailbox overload and avoid serving orphaned replies #4370 asked for, so it's fine; a shorter TTL for {:error, _} than for :table_not_found would help only if it ever bites in practice.
  2. handle_info({:EXIT, _, reason}, state) is now narrower in purpose. With the worker Task.Supervisor no longer linked to the inspector, this clause effectively only handles the parent-supervisor shutdown signal (workers are async_nolink). It's still correct and worth keeping, but a one-line comment noting it now exists for the supervisor link (not the task pool) would prevent a future reader from assuming it still catches worker-pool crashes.

Issue Conformance

Unchanged from iteration 2: mitigations (1)–(3) from #4370 are implemented; mitigation (4) (request-process poll-loop with a local deadline) remains intentionally out of scope because a request deadline isn't threaded onto the serve-shape path. Goal (a) (mailbox overload) is fully addressed; goal (b) (orphaned waiters) is partially addressed and the PR is honest about it. The telemetry addition also closes the observability gap flagged in the earlier review. Changeset present with the correct @core/sync-service package.


Review iteration: 3 | 2026-06-19

alco and others added 5 commits June 16, 2026 10:58
…he tree (#4370)

The inspector's DB-lookup workers were run under a `Task.Supervisor` started
ad-hoc from `EtsInspector.init/1`. Declare it as a sibling child of the
inspector in `Electric.StackSupervisor` instead, addressed by a registered
name (`EtsInspector.task_supervisor_name/1`), so the process hierarchy is
visible in the supervision tree like every other `Task.Supervisor` in the
service.

As a tree sibling the supervisor now outlives inspector restarts; the test
helper starts it idempotently to mirror that.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
…ut (#4370)

The previous comment justified the explicit transaction timeout by the
request's "HTTP budget", which is wrong: a shape request tolerates its full
long-poll timeout (20-60s), far longer than 5s. The real reason is
metadata-pool protection — the inspector shares a pool of at most 4
connections with the connection manager, and a coalesced lookup pins one for
its whole duration, so the timeout bounds connection-hold time on a degraded
Postgres. No behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
…pector DB lookup (#4370)

The inspector's catalog lookups had no telemetry: they run in a detached
worker outside the request's trace, so neither their latency nor their outcome
was observable in prod, and the explicit DB timeout could only be reasoned
about rather than validated against real data.

Wrap each lookup in a standalone `inspector.fetch_db` span tagged with the key
type (relation / oid / supported_features) and the outcome
(ok / table_not_found / error). It's a root span by design: one coalesced
lookup serves many waiters, so it can't belong to any single request's trace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
@netlify

netlify Bot commented Jun 19, 2026

Copy link
Copy Markdown

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit c0186f4
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a3514e784e43f00082b2126
😎 Deploy Preview https://deploy-preview-4588--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EtsInspector: prevent mailbox overload and avoid serving orphaned replies

2 participants