feat(sync-service): Make concurrent shape requests wait for ShapeCache to be ready instead of flooding it with messages #4585
❌ 1 test failed.
Claude Code Review
Summary
Iteration 2. The new commit
What's Working Well
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
Newly-added File: Before this PR,
    def handle_info({:sweep_failed_create_lock, lock_key}, state) do
      ...
    end
Once any I traced the current message sources and the practical trigger today looks low: the create path's outbound calls are
Suggested fix: restore the previous tolerance with an explicit catch-all:
    def handle_info({:sweep_failed_create_lock, lock_key}, state) do
      ...
      {:noreply, state}
    end

    def handle_info(msg, state) do
      Logger.warning("#{inspect(__MODULE__)} received unexpected message: #{inspect(msg)}")
      {:noreply, state}
    end
Suggestions (Nice to Have)
Issue Conformance
The fix is a direct response to PR review feedback and the implementation matches the updated PR description and changeset. Linked-issues context remains empty (the PR references #4372 under umbrella #4266 in its body). Reminder still stands: the PR is stacked on #4376 (
Previous Review Status
Review iteration: 2 | 2026-06-16 |
…ers fail fast
Address PR review: a GenServer.call fallback on the failure path would
re-create the thundering herd we are eliminating. Instead the leader writes
{:failed, reason} into the shared lock table; polling waiters read it and
return the real error immediately, with zero GenServer.calls. The failed
entry is swept after a grace window so a subsequent request retries creation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
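The failure path described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the module name `FailedLockSketch`, the `:creating` marker, the `check/2` helper, and the 50 ms default grace window are all assumptions. Only the `{:failed, reason}` entry and the `{:sweep_failed_create_lock, key}` message shape come from the text above.

```elixir
defmodule FailedLockSketch do
  # Hypothetical sketch of "leader records the failure, waiters read it from ETS".
  use GenServer

  @default_grace_ms 50

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  def table(pid), do: GenServer.call(pid, :table)

  # Leader path: the creation function runs inside the owning process.
  def create(pid, key, fun), do: GenServer.call(pid, {:create, key, fun})

  # Waiter path: a plain ETS read, no GenServer.call involved.
  def check(table, key) do
    case :ets.lookup(table, key) do
      [{^key, {:failed, reason}}] -> {:error, reason}
      [{^key, :creating}] -> :pending
      [] -> :none
    end
  end

  @impl true
  def init(opts) do
    # Public per the PR description; in this sketch only the owner writes.
    table = :ets.new(:create_locks, [:set, :public, read_concurrency: true])
    {:ok, %{table: table, grace_ms: Keyword.get(opts, :grace_ms, @default_grace_ms)}}
  end

  @impl true
  def handle_call(:table, _from, state), do: {:reply, state.table, state}

  def handle_call({:create, key, fun}, _from, state) do
    :ets.insert(state.table, {key, :creating})

    reply =
      case fun.() do
        {:ok, _handle} = ok ->
          :ets.delete(state.table, key)
          ok

        {:error, reason} = error ->
          # Overwrite the lock with the failure so waiters can return it at once,
          # then schedule a sweep so a later request can retry creation.
          :ets.insert(state.table, {key, {:failed, reason}})
          Process.send_after(self(), {:sweep_failed_create_lock, key}, state.grace_ms)
          error
      end

    {:reply, reply, state}
  end

  @impl true
  def handle_info({:sweep_failed_create_lock, key}, state) do
    # Delete only if the entry is still a failure; a retry may have replaced it.
    case :ets.lookup(state.table, key) do
      [{^key, {:failed, _reason}}] -> :ets.delete(state.table, key)
      _other -> true
    end

    {:noreply, state}
  end

  # Catch-all so an unexpected message does not crash the process.
  def handle_info(_msg, state), do: {:noreply, state}
end
```

In this sketch a waiter that sees `{:error, reason}` from `check/2` can return immediately, and once the grace window elapses `check/2` returns `:none` again, so the next request is free to retry. A creation function that raises would still crash the process here; handling that case is left out of the sketch.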
What
Implements #4372 (sibling of #4370/#4371 under the thundering-herd umbrella #4266, "Direction 2: pre-GenServer ETS dedup of in-flight creations").
When N concurrent offset=-1 requests hit the same uncached shape, today each one misses the fast path and enqueues a GenServer.call into the single per-stack ShapeCache mailbox, occupying a slot for the full duration of one slow creation, even though the creation work is already deduplicated.
This adds a public, GenServer-owned, caller-readable ETS lock table keyed by Shape.comparable/1. The handler sets the lock when it begins creating a shape and clears it (via try/after) when done. Callers, after the existing fast-path miss, consult the lock:
- Lock held: poll via Electric.PollWait.until/3 (per-call backoff 5 → 10 → 20 → 40 → 80 → 100ms, tuned for sub-second creation latency).
- No lock: take the GenServer.call path (becomes the creator, or short-circuits via the critical fetch).
Callers never write the lock, so they can never strand a stale claim; the table is owned by the ShapeCache process, so it is recreated empty on restart.
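The caller-side decision can be sketched as below. This is an assumption-laden illustration rather than the PR's implementation: `CreateLockSketch`, `with_lock/3`, `await/4`, and the `lookup_fun` callback are hypothetical names, and the sketch does not use `Electric.PollWait.until/3`, whose signature is not shown here. Only the backoff schedule (5, 10, 20, 40, 80, then 100 ms) and the try/after clearing come from the description.

```elixir
defmodule CreateLockSketch do
  # Hypothetical sketch of "set lock in handler, poll in callers".
  @backoffs_ms [5, 10, 20, 40, 80]
  @max_backoff_ms 100

  def new_table do
    # Public per the description; callers in this sketch only ever read it.
    :ets.new(:create_locks, [:set, :public, read_concurrency: true])
  end

  # Handler side: hold the lock while `create_fun` runs, always clear it after.
  def with_lock(table, key, create_fun) do
    :ets.insert(table, {key, :creating})

    try do
      create_fun.()
    after
      :ets.delete(table, key)
    end
  end

  # Caller side, after the fast-path miss. `lookup_fun` stands in for the
  # read-only handle lookup and must return {:ok, handle} or :error.
  def await(table, key, lookup_fun, timeout_ms) do
    if :ets.member(table, key) do
      deadline = System.monotonic_time(:millisecond) + timeout_ms
      poll(table, key, lookup_fun, deadline, @backoffs_ms)
    else
      # No creation in flight: the caller takes the GenServer.call path.
      :fall_through
    end
  end

  defp poll(table, key, lookup_fun, deadline, backoffs) do
    case lookup_fun.() do
      {:ok, handle} ->
        {:ok, handle}

      :error ->
        cond do
          System.monotonic_time(:millisecond) >= deadline ->
            {:error, :timeout}

          not :ets.member(table, key) ->
            # Lock cleared without a visible handle: fall back to the call path
            # (an assumption of this sketch).
            :fall_through

          true ->
            {delay, rest} = next_delay(backoffs)
            Process.sleep(delay)
            poll(table, key, lookup_fun, deadline, rest)
        end
    end
  end

  # Walk the schedule once, then stay at the cap.
  defp next_delay([delay | rest]), do: {delay, rest}
  defp next_delay([]), do: {@max_backoff_ms, []}
end
```

Because waiters only call `:ets.member/2` and the lookup function, N concurrent waiters add no messages to the owning process's mailbox while the creation is in flight.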
Design notes / deliberate deviations from the issue sketch
- Shape.comparable/1, not Shape.hash/1: hash is 32-bit phash2 and can collide (there's an explicit collision test), which would strand a waiter for 30s. comparable is the canonical identity SQLite/add_shape dedup on.
- after: simpler/DRY; callers for existing+activated shapes return on the fast path and never reach the handler.
- fetch_handle_by_shape/2 (read connection), never the _critical (write) variant, so N polling waiters cannot re-create the write-connection contention (bottleneck 3 in ShapeCache bottlenecks under thundering herd #4266).
Trade-off
If a creation fails, polling waiters no longer get the specific error; they get {:error, :timeout} after the existing 30s deadline. The same trade-off is already accepted for StatusMonitor/EtsInspector under congestion.
Tests
- "concurrent callers for a fresh shape coalesce: leader creates, followers poll": blocks the leader upstream of add_shape, asserts the ShapeCache mailbox stays at 0 (followers poll, don't enqueue) and all followers return the leader's handle, with snapshot work running exactly once.
- "shape-create lock table is created empty when the ShapeCache starts" (crash-recovery property).
- shape_cache_test.exs: 40 tests, 0 failures.
Stacking
Electric.PollWait). Base branch = alco/poll-wait-instead-of-genserver-call. Rebase onto main once #4376 merges.
🤖 Generated with Claude Code