Skip to content

feat(sync-service): Make concurrent shape requests wait for ShapeCache to be ready instead of flooding it with messages#4585

Draft
erik-the-implementer wants to merge 6 commits into
alco/poll-wait-instead-of-genserver-callfrom
alco/shape-cache-mailbox-flooding
Draft

feat(sync-service): Make concurrent shape requests wait for ShapeCache to be ready instead of flooding it with messages#4585
erik-the-implementer wants to merge 6 commits into
alco/poll-wait-instead-of-genserver-callfrom
alco/shape-cache-mailbox-flooding

Conversation

@erik-the-implementer

Copy link
Copy Markdown
Contributor

What

Implements #4372 (sibling of #4370/#4371 under the thundering-herd umbrella #4266, "Direction 2: pre-GenServer ETS dedup of in-flight creations").

When N concurrent offset=-1 requests hit the same uncached shape, today each one misses the fast path and enqueues a GenServer.call into the single per-stack ShapeCache mailbox, occupying a slot for the full duration of one slow creation — even though the creation work is already deduplicated.

This adds a public, GenServer-owned, caller-readable ETS lock table keyed by Shape.comparable/1. The handler sets the lock when it begins creating a shape and clears it (via try/after) when done. Callers, after the existing fast-path miss, consult the lock:

  • set → bypass the GenServer entirely and poll a cheap ETS/read-connection predicate via Electric.PollWait.until/3 (per-call backoff 5 → 10 → 20 → 40 → 80 → 100 ms, tuned for sub-second creation latency).
  • unset → today's GenServer.call path (becomes the creator, or short-circuits via the critical fetch).

Callers never write the lock, so they can never strand a stale claim; the table is owned by the ShapeCache process, so it is recreated empty on restart.

Design notes / deliberate deviations from the issue sketch

  • Lock key is Shape.comparable/1, not Shape.hash/1hash is 32-bit phash2 and can collide (there's an explicit collision test), which would strand a waiter for 30s. comparable is the canonical identity SQLite/add_shape dedup on.
  • Lock set unconditionally at handler entry, cleared in after — simpler/DRY; callers for existing+activated shapes return on the fast path and never reach the handler.
  • Poll predicate uses the non-critical fetch_handle_by_shape/2 (read connection) — never the _critical (write) variant, so N polling waiters cannot re-create the write-connection contention (bottleneck 3 in ShapeCache bottlenecks under thundering herd #4266).

Trade-off

If a creation fails, polling waiters no longer get the specific error — they {:error, :timeout} after the existing 30s deadline. Same trade-off already accepted for StatusMonitor/EtsInspector under congestion.

Tests

  • New "concurrent callers for a fresh shape coalesce: leader creates, followers poll" — blocks the leader upstream of add_shape, asserts the ShapeCache mailbox stays at 0 (followers poll, don't enqueue) and all followers return the leader's handle, with snapshot work running exactly once.
  • New "shape-create lock table is created empty when the ShapeCache starts" (crash-recovery property).
  • Pre-existing concurrency tests unchanged and green. Full shape_cache_test.exs: 40 tests, 0 failures.

Stacking

⚠️ Stacked on #4376 (Electric.PollWait). Base branch = alco/poll-wait-instead-of-genserver-call. Rebase onto main once #4376 merges.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
4415 1 4414 55
View the top 3 failed test(s) by shortest run time
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough
Stack Traces | 0s run time
34) test get_or_create_shape_handle/2 against real db crashes when initial snapshot query fails to return data quickly enough (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:635
     ** (EXIT from #PID<0.15215.0>) killed
Elixir.Electric.Shapes.FilterTest::test optimisations where clause in the form `array_field @> const_array` is optimised
Stack Traces | 0.0482s run time
9) test optimisations where clause in the form `array_field @> const_array` is optimised (Electric.Shapes.FilterTest)
     .../electric/shapes/filter_test.exs:844
     Assertion with < failed
     code:  assert add_reductions < max_reductions
     left:  8633
     right: 6500
     stacktrace:
       (elixir 1.19.5) lib/enum.ex:961: Enum."-each/2-lists^foreach/1-0-"/2
       .../electric/shapes/filter_test.exs:872: (test)
Elixir.Electric.ShapeCacheTest::test get_or_create_shape_handle/2 shape initialization concurrent callers for a fresh shape coalesce: leader creates, followers poll
Stack Traces | 0.262s run time
32) test get_or_create_shape_handle/2 shape initialization concurrent callers for a fresh shape coalesce: leader creates, followers poll (Electric.ShapeCacheTest)
     test/electric/shape_cache_test.exs:293
     Assertion failed, no matching message after 0ms
     The process mailbox is empty.
     code: assert_received {:called, :create_snapshot_fn}
     stacktrace:
       test/electric/shape_cache_test.exs:360: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Iteration 2. The new commit 5964bdf3 ("publish create failures to the lock table so waiters fail fast") directly resolves the one Important issue from the previous review: a failing creation no longer hangs every concurrent follower for 30s. The leader now writes {:failed, reason} into the lock table, pollers read it and return the real error immediately, and the entry is swept after a grace window so a later request retries. This is a clean, well-tested fix that avoids the GenServer.call stampede a fallback-to-call approach would have re-introduced. One robustness concern remains around the newly-introduced handle_info clause.

What's Working Well

  • Failure fast-path is the right design. Publishing {:failed, reason} into the shared lock table — rather than falling back to a GenServer.call to disambiguate — keeps the thundering-herd property intact on the error path too. Followers get the real error with zero mailbox pressure. This is a better resolution than either of the two options I sketched in iteration 1.
  • Sweep safety is defensive and correct. handle_info({:sweep_failed_create_lock, …}) only deletes when the entry is still {:failed, _}; a stale sweep that races a fresh :in_progress claim is a no-op. The "sweep clears a published failure but leaves an in-progress claim" test pins exactly this property, and the use of a synchronous has_shape_handle? call to flush the mailbox before asserting is a clean way to avoid timing flakiness.
  • Grace window is well-reasoned. @failed_create_grace_ms = 1_000 sits comfortably above @poll_max_interval (100ms), so every already-waiting poller is guaranteed to observe the failure before it is swept, and the comment explains exactly that.
  • Previous suggestions absorbed. The lock value is now a meaningful :in_progress / {:failed, reason} tag instead of the unused {lock_key, self()} pid — addressing the iteration-1 "lock value is unused" note. The changeset trade-off paragraph was also updated to describe the new fail-fast behavior rather than the old 30s-timeout caveat. Nice attention to detail.
  • poll_shape_ready ordering is correct. Checking the {:failed, _} branch before the shape-existence with is right: while a failure is published no successful shape can exist for the same comparable key (leaders are serialized through the lock), so there is no risk of returning a stale error over a live shape.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Newly-added handle_info/2 removes the GenServer default catch-all — unexpected messages now crash the per-stack ShapeCache.

File: packages/sync-service/lib/electric/shape_cache.ex:399-414

Before this PR, ShapeCache defined no handle_info/2 clause, so it inherited use GenServer's default implementation, which logs and ignores any unexpected message. This commit introduces a single specific clause:

def handle_info({:sweep_failed_create_lock, lock_key}, state) do
  ...
end

Once any handle_info/2 is defined, the default fallback is gone. Any message that is not {:sweep_failed_create_lock, _} now raises FunctionClauseError and crashes the process — which is a high-blast-radius, per-stack singleton. This is the exact pattern called out in the repo's known-pitfalls list ("Missing catch-all handle_info/2 clause causes crashes on unexpected messages").

I traced the current message sources and the practical trigger today looks low: the create path's outbound calls are DynamicSupervisor.start_child (:infinity timeout, no late reply) and Consumer.initialize_shape (send, no reply), and the process does not Process.monitor or trap exits. So I'm not claiming a live crash exists on main. But the change silently removed the safety net the process used to have, and any future finite-timeout GenServer.call from within a handler (whose late {ref, reply} lands in the mailbox), a monitor, or a stray system message would now take the stack's ShapeCache down.

Suggested fix — restore the previous tolerance with an explicit catch-all:

def handle_info({:sweep_failed_create_lock, lock_key}, state) do
  ...
  {:noreply, state}
end

def handle_info(msg, state) do
  Logger.warning("#{inspect(__MODULE__)} received unexpected message: #{inspect(msg)}")
  {:noreply, state}
end

Suggestions (Nice to Have)

  • Brief negative-caching window. During the ~1s grace window, a fresh request that arrives after the failure is published also reads {:failed, reason} and returns the cached error instead of attempting its own creation. For a transient failure (e.g. :connection_not_available) that has since cleared, callers in that window get the stale error for up to a second. This is arguably desirable (it dampens retry stampedes against a struggling DB) and is bounded, but it's a behavior nuance worth a one-line comment at the shape_create_in_progress? call site so the next reader knows late arrivals intentionally share the published failure.
  • Observability (carried over from iteration 1). Still no telemetry to confirm the coalescing — and now the fail-fast path — is working under real load. A lightweight :telemetry.execute([:electric, :shape_cache, :create_waiter], …) distinguishing leader/follower and success/failure/timeout outcomes would make the thundering-herd fix measurable in prod.

Issue Conformance

The fix is a direct response to PR review feedback and the implementation matches the updated PR description and changeset. Linked-issues context remains empty (the PR references #4372 under umbrella #4266 in its body). Reminder still stands: the PR is stacked on #4376 (Electric.PollWait) with base alco/poll-wait-instead-of-genserver-call — rebase onto main once #4376 merges.

Previous Review Status

  • ResolvedError-path latency (concurrent waiters on a failing creation waited the full 30s). Now fail fast via the published {:failed, reason} entry, without re-introducing a GenServer.call stampede.
  • ResolvedLock value unused ({lock_key, self()}). Value is now a meaningful :in_progress / {:failed, reason} tag.
  • New — Missing catch-all handle_info/2 (Important, above), introduced by the same commit that fixed the error-path issue.
  • The telemetry/observability suggestion and the wall-clock test-timing note from iteration 1 still stand as suggestions.

Review iteration: 2 | 2026-06-16

@alco alco self-assigned this Jun 16, 2026
@alco alco changed the title feat(sync-service): coalesce concurrent ShapeCache create waiters via lock + PollWait feat(sync-service): Make concurrent shape requests wait for ShapeCache to be ready instead of flooding it with messages Jun 16, 2026
…ers fail fast

Address PR review: a GenServer.call fallback on the failure path would
re-create the thundering herd we are eliminating. Instead the leader writes
{:failed, reason} into the shared lock table; polling waiters read it and
return the real error immediately, with zero GenServer.calls. The failed
entry is swept after a grace window so a subsequent request retries creation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants