fix(sync-service): Prevent EtsInspector's mailbox from getting flooded when the cache is cold #4588
erik-the-implementer wants to merge 10 commits into
❌ 2 Tests Failed
Claude Code Review
Summary
This PR bounds
Previous Review Status
Both items deferred at iteration 2 are now addressed, and the earlier design items remain resolved:
What's Working Well
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
None.
Suggestions (Nice to Have)
Issue Conformance
Unchanged from iteration 2: mitigations (1)–(3) from #4370 are implemented; mitigation (4) (request-process poll-loop with a local deadline) remains intentionally out of scope because a request deadline isn't threaded onto the serve-shape path. Goal (a) (mailbox overload) is fully addressed; goal (b) (orphaned waiters) is partially addressed and the PR is honest about it. The telemetry addition also closes the observability gap flagged in the earlier review. Changeset present with the correct
Review iteration: 3 | 2026-06-19
…and index in-flight by ref (#4370)
…he tree (#4370)
The inspector's DB-lookup workers were run under a `Task.Supervisor` started ad-hoc from `EtsInspector.init/1`. Declare it as a sibling child of the inspector in `Electric.StackSupervisor` instead, addressed by a registered name (`EtsInspector.task_supervisor_name/1`), so the process hierarchy is visible in the supervision tree like every other `Task.Supervisor` in the service. As a tree sibling the supervisor now outlives inspector restarts; the test helper starts it idempotently to mirror that.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
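The sibling arrangement this commit describes can be sketched as follows. This is a minimal illustration, not the code in the PR: `SketchStackSupervisor`, `SketchInspector`, and the atom-based naming scheme are hypothetical stand-ins for `Electric.StackSupervisor`, `EtsInspector`, and `EtsInspector.task_supervisor_name/1`, whose real implementations are not shown on this page.

```elixir
defmodule SketchInspector do
  use GenServer

  # Hypothetical naming helper standing in for EtsInspector.task_supervisor_name/1.
  # The real service may derive the name differently (for example via a registry).
  def task_supervisor_name(stack_id), do: :"sketch_inspector_tasks_#{stack_id}"

  def start_link(stack_id), do: GenServer.start_link(__MODULE__, stack_id)

  @impl true
  # Note: init/1 no longer starts a Task.Supervisor itself.
  def init(stack_id), do: {:ok, stack_id}
end

defmodule SketchStackSupervisor do
  use Supervisor

  def start_link(stack_id), do: Supervisor.start_link(__MODULE__, stack_id)

  @impl true
  def init(stack_id) do
    children = [
      # The task supervisor is a sibling of the inspector, reachable by name.
      {Task.Supervisor, name: SketchInspector.task_supervisor_name(stack_id)},
      {SketchInspector, stack_id}
    ]

    # With :one_for_one, an inspector restart leaves the task supervisor running.
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Under these assumptions, killing the inspector process leaves the registered task supervisor pid unchanged, which is the "outlives inspector restarts" property the commit message mentions.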
…ut (#4370)
The previous comment justified the explicit transaction timeout by the request's "HTTP budget", which is wrong: a shape request tolerates its full long-poll timeout (20-60s), far longer than 5s. The real reason is metadata-pool protection: the inspector shares a pool of at most 4 connections with the connection manager, and a coalesced lookup pins one for its whole duration, so the timeout bounds connection-hold time on a degraded Postgres. No behaviour change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
…pector DB lookup (#4370)
The inspector's catalog lookups had no telemetry: they run in a detached worker outside the request's trace, so neither their latency nor their outcome was observable in prod, and the explicit DB timeout could only be reasoned about rather than validated against real data. Wrap each lookup in a standalone `inspector.fetch_db` span tagged with the key type (relation / oid / supported_features) and the outcome (ok / table_not_found / error). It's a root span by design: one coalesced lookup serves many waiters, so it can't belong to any single request's trace.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NjVzRFXnrvziD4Gp55r1MT
✅ Deploy Preview for electric-next ready!
Summary
Fixes #4370.
`Electric.Postgres.Inspector.EtsInspector` is the single GenServer that answers Postgres relation/oid/column/feature lookups for every cold-cache shape `validate_request`. When a never-before-seen root table gets a burst of requests (or when the PG pool is exhausted or the DB is unavailable), every concurrent miss used to run its own serialized `GenServer.call`. Because failed lookups weren't cached, N concurrent requests for the same failing key meant N serial DB attempts, each able to run for Postgrex's 15s default. The mailbox backs up and the inspector spends minutes doing redundant work.
This PR bounds that blast radius with three changes (the `GenServer.call(:infinity)` client API is unchanged):
1. Explicit DB transaction timeout. A single lookup is now capped at 5s instead of inheriting Postgrex's 15s default, so a degraded pool can't tie up the inspector long after the triggering request has given up.
2. In-flight coalescing. The DB lookup no longer runs inside `handle_call`. The GenServer spawns one supervised worker per unique in-flight key (relation, oid, or feature set), parks every concurrent waiter, and replies to all of them when that one worker finishes. `load_relation_info` and `load_column_info` share the oid key, so they coalesce onto the same call. A slow lookup for one key no longer blocks lookups for others, and a worker crash comes back as a `{:DOWN, ...}` message that replies `{:error, :connection_not_available}` to the waiters without taking down the inspector.
3. Short-TTL negative cache. `:table_not_found` and `{:error, _}` results are cached in the same ETS table with a short TTL (default 1s). The client process reads this cache and skips the GenServer entirely during that window, so a sustained burst against a failing key lets the mailbox drain instead of refilling. Negative entries aren't persisted, can't collide with positive rows, and `clean/2` drops them.
This follows on from the cheap admission control work (#4359) under the thundering-herd umbrella (#4266): sizing the `:initial` admission bucket only helps if the inspector's own behaviour under a burst against a degraded DB is bounded, which this provides. It mirrors #4537 (publication-manager cast suppression), bounding how many messages get issued during shape-arrival bursts.
Note on scope
This fully addresses goal (a) (mailbox overload) and partially addresses goal (b) (orphaned waiters): the inspector no longer repeats DB work for a waiter that has already timed out upstream (an orphaned reply is now a single cheap, discarded `GenServer.reply`), but it still isn't actively told the waiter is gone. The issue's mitigation (4) (a request-process poll loop with a local deadline) is intentionally out of scope, since it needs a request deadline that isn't currently threaded onto the serve-shape path.
🤖 Generated with Claude Code
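For readers who want to see the shape of in-flight coalescing combined with a short-TTL negative cache, here is a minimal sketch under stated assumptions. It is not the PR's implementation: `CoalescingSketch`, the `:fetch` option, the `{:negative, key}` ETS key layout, and the `by_key`/`by_ref` state fields are all hypothetical, and the task supervisor is started inside `init/1` only to keep the example self-contained (the PR runs it as a tree sibling). Positive results are not cached here.

```elixir
defmodule CoalescingSketch do
  use GenServer

  # Mirrors the 1s default mentioned in the description; the value is illustrative.
  @negative_ttl_ms 1_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Client side: a fresh negative entry short-circuits the GenServer entirely.
  def lookup(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(__MODULE__, {:negative, key}) do
      [{_, result, expires_at}] when expires_at > now -> result
      _ -> GenServer.call(__MODULE__, {:lookup, key}, :infinity)
    end
  end

  @impl true
  def init(opts) do
    :ets.new(__MODULE__, [:named_table, :public, :set])
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, %{fetch: Keyword.fetch!(opts, :fetch), sup: sup, by_key: %{}, by_ref: %{}}}
  end

  @impl true
  def handle_call({:lookup, key}, from, state) do
    case state.by_key do
      %{^key => ref} ->
        # A worker is already in flight for this key: park the caller.
        by_ref = Map.update!(state.by_ref, ref, &%{&1 | waiters: [from | &1.waiters]})
        {:noreply, %{state | by_ref: by_ref}}

      _ ->
        # First miss for this key: start exactly one worker and park the caller.
        fetch = state.fetch
        task = Task.Supervisor.async_nolink(state.sup, fn -> fetch.(key) end)
        by_ref = Map.put(state.by_ref, task.ref, %{key: key, waiters: [from]})
        {:noreply, %{state | by_key: Map.put(state.by_key, key, task.ref), by_ref: by_ref}}
    end
  end

  @impl true
  # Worker finished: the task result arrives as {ref, result}.
  def handle_info({ref, result}, state) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    {:noreply, finish(ref, result, state)}
  end

  # Worker crashed: waiters get an error and the GenServer keeps running.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    {:noreply, finish(ref, {:error, :connection_not_available}, state)}
  end

  defp finish(ref, result, state) do
    case Map.pop(state.by_ref, ref) do
      {nil, _} ->
        state

      {%{key: key, waiters: waiters}, by_ref} ->
        cache_negative(key, result)
        # A reply to a caller that already gave up is simply discarded.
        Enum.each(waiters, &GenServer.reply(&1, result))
        %{state | by_ref: by_ref, by_key: Map.delete(state.by_key, key)}
    end
  end

  defp cache_negative(_key, {:ok, _}), do: :ok

  defp cache_negative(key, result) do
    expires_at = System.monotonic_time(:millisecond) + @negative_ttl_ms
    :ets.insert(__MODULE__, {{:negative, key}, result, expires_at})
  end
end
```

Indexing the in-flight entries by both key and task reference lets the `{ref, result}` and `{:DOWN, ...}` handlers find their waiters directly, while `handle_call` can detect an existing worker by key.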