fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold by erik-the-implementer · Pull Request #4588 · electric-sql/electric

erik-the-implementer · 2026-06-16T06:37:48Z

Summary

Electric.Postgres.Inspector.EtsInspector is the single GenServer that answers Postgres relation/oid/column/feature lookups for every cold-cache shape validate_request. When a never-before-seen root table gets a burst of requests — or when the PG pool is exhausted or the DB is unavailable — every concurrent miss used to run its own serialized GenServer.call. Because failed lookups weren't cached, N concurrent requests for the same failing key meant N serial DB attempts, each able to run for Postgrex's 15s default. The mailbox backs up and the inspector spends minutes doing redundant work.

This PR bounds that blast radius with three changes (the GenServer.call(:infinity) client API is unchanged):

Explicit DB transaction timeout. A single lookup is now capped at 5s instead of inheriting Postgrex's 15s default, so a degraded pool can't tie up the inspector long after the triggering request has given up.
In-flight coalescing. The DB lookup no longer runs inside handle_call. The GenServer spawns one supervised worker per unique in-flight key (relation, oid, or feature set), parks every concurrent waiter, and replies to all of them when that one worker finishes. load_relation_info and load_column_info share the oid key, so they coalesce onto the same call. A slow lookup for one key no longer blocks lookups for others, and a worker crash comes back as a {:DOWN, ...} message that replies {:error, :connection_not_available} to the waiters without taking down the inspector.
Short-TTL negative cache. :table_not_found and {:error, _} results are cached in the same ETS table with a short TTL (default 1s). The client process reads this cache and skips the GenServer entirely during that window, so a sustained burst against a failing key lets the mailbox drain instead of refilling. Negative entries aren't persisted, can't collide with positive rows, and clean/2 drops them.

This follows on from the cheap admission control work (#4359) under the thundering-herd umbrella (#4266): sizing the :initial admission bucket only helps if the inspector's own behaviour under a burst against a degraded DB is bounded, which this provides. It mirrors #4537 (publication-manager cast suppression) — bounding how many messages get issued during shape-arrival bursts.

Note on scope

This fully addresses goal (a) (mailbox overload) and partially addresses goal (b) (orphaned waiters): the inspector no longer repeats DB work for a waiter that has already timed out upstream — an orphaned reply is now a single cheap, discarded GenServer.reply — but it still isn't actively told the waiter is gone. The issue's mitigation (4) (a request-process poll loop with a local deadline) is intentionally out of scope, since it needs a request deadline that isn't currently threaded onto the serve-shape path.

🤖 Generated with Claude Code

…eout (#4370)

…DB call (#4370)

…TL (#4370)

…gation (#4370)

codecov · 2026-06-16T06:38:52Z

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
4509	2	4507	61

View the top 3 failed test(s) by shortest run time

Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough

Stack Traces | 0s run time

23) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:507
     ** (EXIT from #PID<0.13259.0>) killed

Elixir.Electric.StackSupervisor.TelemetryTest::test prometheus_metrics/0 per-status HTTP request counts are scrapable

Stack Traces | 0.191s run time

3) test prometheus_metrics/0 per-status HTTP request counts are scrapable (Electric.StackSupervisor.TelemetryTest)
     .../electric/stack_supervisor/telemetry_test.exs:67
     Assertion with =~ failed
     code:  assert scrape =~ ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     left:  "# HELP electric_admission_control_acquire_current \n# TYPE electric_admission_control_acquire_current gauge\nelectric_admission_control_acquire_current{kind=\"initial\"} 1\n# HELP electric_plug_serve_shape_requests_count \n# TYPE electric_plug_serve_shape_requests_count counter\nelectric_plug_serve_shape_requests_count{status=\"409\"} 1\nelectric_plug_serve_shape_requests_count{status=\"400\"} 1\nelectric_plug_serve_shape_requests_count{status=\"304\"} 1\nelectric_plug_serve_shape_requests_count{status=\"200\"} 3\nelectric_plug_serve_shape_requests_count{status=\"500\"} 1\n"
     right: ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     stacktrace:
       .../electric/stack_supervisor/telemetry_test.exs:90: (test)

Elixir.Electric.ShapeCacheTest::test await_snapshot_start/4 should wait for consumer to come up

Stack Traces | 5.77s run time

25) test await_snapshot_start/4 should wait for consumer to come up (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:1080
     ** (exit) exited in: Task.await(%Task{mfa: {:erlang, :apply, 2}, owner: #PID<0.19031.0>, pid: #PID<0.19116.0>, ref: #Reference<0.0.2435971.3297180567.2644574211.9602>}, 5300)
         ** (EXIT) time out
     code: assert :started = Task.await(wait_task, start_consumer_delay + 5_000)
     stacktrace:
       (elixir 1.19.5) lib/task.ex:892: Task.await_receive/3
       test/electric/shape_cache_test.exs:1118: (test)

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

claude · 2026-06-16T06:45:47Z

Claude Code Review

Summary

This PR bounds EtsInspector's mailbox and DB-attempt amplification under cold-start bursts and DB degradation via in-flight coalescing onto one supervised worker per key, a short-TTL negative cache, and an explicit 5s DB transaction timeout. Iteration 3 lands the two observability/architecture suggestions from the previous review (telemetry span + tree-supervised Task.Supervisor) plus doc/changeset polish. The OTP mechanics remain correct and the new code is clean. Looks ready to merge.

Previous Review Status

Both items deferred at iteration 2 are now addressed, and the earlier design items remain resolved:

✅ Suggestion [Merged on #3] Write inserts, updates and deletes to Vaxine #2 (observability) — addressed. Commit efbc17b wraps each DB lookup in an inspector.fetch_db span (OpenTelemetry.with_span/4 with stack_id), tagged with inspector.key_type (relation / oid / supported_features) and inspector.result (ok / table_not_found / error). The span is correctly emitted as a root span from the detached worker — there's no parent request context to attach to, and one coalesced lookup can't belong to a single request anyway; the comment states this explicitly. Sampler.include_span?/1 defaults to included?(_) -> true, so the span isn't silently dropped. The fetch_for_key → do_fetch_for_key split is clean and fetch_outcome/1 is exhaustive over the {:ok, :table_not_found} | {:ok, _} | {:error, _} result shapes.
✅ Suggestion Postgres replication source TCP server #3 (Task.Supervisor in tree) — addressed. Commit fc05b98 moves the worker Task.Supervisor out of init/1 and declares it as a sibling child in Electric.StackSupervisor, addressed by the registered name EtsInspector.task_supervisor_name/1. This is strictly more robust than the old self-linked pid: the inspector now holds the supervisor's registered name in state.task_sup, so under the supervisor's :one_for_one strategy a Task.Supervisor crash restarts only the supervisor and the inspector re-resolves the name on its next lookup — without losing its ETS cache. The child is ordered before EtsInspector (starts first), in-flight workers that die on a supervisor crash still surface as {:DOWN, …} and reply {:error, :connection_not_available} to waiters, and the test setup (with_inspector/1) correctly starts the sibling supervisor first with a clear comment about why it outlives inspector restarts.
✅ Timeout rationale / changeset prose. Commits a8048e0 and c0186f4 improve the @fetch_db_timeout comment (now framed as metadata-pool protection — the 4-connection shared pool can be pinned by a handful of degraded cold-cache keys — explicitly distinct from the caller's 20–60s long-poll budget) and the changeset bullet list. Both are accurate.
✅ Iteration-1 items (unbounded negative-cache growth → periodic sweep; O(n) pop_in_flight_by_ref → in_flight_refs reverse index; clean/2 clearing stale negatives) remain resolved.

What's Working Well

Supervision change is a genuine robustness upgrade, not just cosmetics. Decoupling the worker pool's lifecycle from the inspector means a worker-pool fault no longer forces the inspector to drop its warm ETS cache, while name-based addressing keeps the reference valid across restarts.
Telemetry placement is correct. Recording outside the request trace and as a standalone span is the right call for a coalesced, detached lookup; attributes are added inside the span before the result returns.
No regressions in the OTP core. async_nolink demonitor-with-:flush, single-route-through pop_in_flight_by_ref on every exit path, and the race-free re-check before parking are all unchanged and still correct.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Negative-caching :connection_not_available still briefly masks recovery (carried over, still optional). A transient connection error is cached for the full TTL, so a key keeps short-circuiting even if the pool recovers within the window. The default TTL is short (1s) and this is exactly what EtsInspector: prevent mailbox overload and avoid serving orphaned replies #4370 asked for, so it's fine; a shorter TTL for {:error, _} than for :table_not_found would help only if it ever bites in practice.
handle_info({:EXIT, _, reason}, state) is now narrower in purpose. With the worker Task.Supervisor no longer linked to the inspector, this clause effectively only handles the parent-supervisor shutdown signal (workers are async_nolink). It's still correct and worth keeping, but a one-line comment noting it now exists for the supervisor link (not the task pool) would prevent a future reader from assuming it still catches worker-pool crashes.

Issue Conformance

Unchanged from iteration 2: mitigations (1)–(3) from #4370 are implemented; mitigation (4) (request-process poll-loop with a local deadline) remains intentionally out of scope because a request deadline isn't threaded onto the serve-shape path. Goal (a) (mailbox overload) is fully addressed; goal (b) (orphaned waiters) is partially addressed and the PR is honest about it. The telemetry addition also closes the observability gap flagged in the earlier review. Changeset present with the correct @core/sync-service package.

Review iteration: 3 | 2026-06-19

…and index in-flight by ref (#4370)

…he tree (#4370) The inspector's DB-lookup workers were run under a `Task.Supervisor` started ad-hoc from `EtsInspector.init/1`. Declare it as a sibling child of the inspector in `Electric.StackSupervisor` instead, addressed by a registered name (`EtsInspector.task_supervisor_name/1`), so the process hierarchy is visible in the supervision tree like every other `Task.Supervisor` in the service. As a tree sibling the supervisor now outlives inspector restarts; the test helper starts it idempotently to mirror that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT

…ut (#4370) The previous comment justified the explicit transaction timeout by the request's "HTTP budget", which is wrong: a shape request tolerates its full long-poll timeout (20-60s), far longer than 5s. The real reason is metadata-pool protection — the inspector shares a pool of at most 4 connections with the connection manager, and a coalesced lookup pins one for its whole duration, so the timeout bounds connection-hold time on a degraded Postgres. No behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT

…pector DB lookup (#4370) The inspector's catalog lookups had no telemetry: they run in a detached worker outside the request's trace, so neither their latency nor their outcome was observable in prod, and the explicit DB timeout could only be reasoned about rather than validated against real data. Wrap each lookup in a standalone `inspector.fetch_db` span tagged with the key type (relation / oid / supported_features) and the outcome (ok / table_not_found / error). It's a root span by design: one coalesced lookup serves many waiters, so it can't belong to any single request's trace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT

netlify · 2026-06-19T10:09:28Z

✅ Deploy Preview for electric-next ready!

Name	Link
🔨 Latest commit	`c0186f4`
🔍 Latest deploy log	https://app.netlify.com/projects/electric-next/deploys/6a3514e784e43f00082b2126
😎 Deploy Preview	https://deploy-preview-4588--electric-next.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

alco added 5 commits June 16, 2026 01:25

fix(sync-service): bound EtsInspector DB lookups with an explicit tim…

ec0bfdb

…eout (#4370)

fix(sync-service): coalesce concurrent EtsInspector lookups into one …

d098f23

…DB call (#4370)

fix(sync-service): cache EtsInspector negative results with a short T…

390fdd9

…TL (#4370)

chore(sync-service): changeset for EtsInspector mailbox-overload miti…

2afcc50

…gation (#4370)

fix(sync-service): clear EtsInspector negative cache on clean (#4370)

703d32f

erik-the-implementer added the claude label Jun 16, 2026

alco self-assigned this Jun 16, 2026

alco changed the title ~~fix(sync-service): bound EtsInspector mailbox & DB amplification under bursts (#4370)~~ fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold Jun 16, 2026

alco and others added 5 commits June 16, 2026 10:58

fix(sync-service): evict expired EtsInspector negative-cache entries …

1e8ed71

…and index in-flight by ref (#4370)

docs(sync-service): simplify the EtsInspector changeset prose (#4370)

c0186f4

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold#4588

fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold#4588
erik-the-implementer wants to merge 10 commits into
mainfrom
alco/ets-inspector-mailbox-flooding

erik-the-implementer commented Jun 16, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erik-the-implementer commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Note on scope

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 2 Tests Failed:

Uh oh!

claude Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Summary

Previous Review Status

What's Working Well

Issues Found

Critical (Must Fix)

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Uh oh!

netlify Bot commented Jun 19, 2026

✅ Deploy Preview for electric-next ready!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erik-the-implementer commented Jun 16, 2026 •

edited

Loading

codecov Bot commented Jun 16, 2026 •

edited

Loading

claude Bot commented Jun 16, 2026 •

edited

Loading