Skip to content

Use compact runtime shape refs internally#4624

Open
robacourt wants to merge 10 commits into
mainfrom
rob/internal-shape-handles
Open

Use compact runtime shape refs internally#4624
robacourt wants to merge 10 commits into
mainfrom
rob/internal-shape-handles

Conversation

@robacourt

@robacourt robacourt commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Introduces an in-memory numeric shape_id (the "runtime shape ref") that replaces the binary shape_handle for all internal routing, filtering, and identity. Shapes are held millions of times across ETS tables, maps, and sets; each shape_handle is a ~30-byte binary, so swapping the internal key to a small integer is a substantial memory win — the largest amplifier being the subquery index, where the key was duplicated 4–6× per node per shape.

shape_handle is preserved at every external/durable boundary: the HTTP API, on-disk + storage-ETS keys, the ShapeStatus authority's public face, the client change-notification Registry, and the subquery tag-hash. The shape_id is in-memory only — never persisted, freshly minted each boot.

How to review

The change is deliberately structured so most of it is trivial to review. It breaks into four parts:

1. Mechanical changes (the bulk — ~17 of 23 lib files are 90–100% mechanical)

Pure shape_handleshape_id renames: function params, @specs, ETS tuple keys, message tuples, local vars, and is_shape_handleis_integer guard swaps. No behaviour change. These modules are key swaps only, no algorithm change:
consumer_registry.ex, event_router.ex, filter.ex, filter/indexes/subquery_index.ex, materializer.ex, where_clause.ex, request_batcher.ex, partitions.ex, dependency_layers.ex (algorithm unchanged — only the caller now supplies resolved ids), ref_resolver.ex, state.ex, effects.ex, setup_effects.ex, event_handler/subqueries/{steady,buffering}.ex, dynamic_consumer_supervisor.ex (partition hash now over id — behaviour-equivalent).

2. Core logic — shape_status.ex (the one place to read closely)

This is the feature. ShapeStatus becomes the authority that mints ids and holds the bidirectional map:

  • shape_id_table ETS (id → handle reverse map) + mint_id/1 (atomic :ets.update_counter on a :seq row).
  • id_for_handle/2, handle_for_id/2, fetch_shape_by_id/2, and shape_handle_for_log/2 (returns "unknown, id: N" on miss so log/telemetry callers can resolve unconditionally).
  • The shape_meta_table row grows to a 6-tuple (id appended); ids are minted both in add_shape/2 (new shapes) and populate_shape_meta_table/2 (restore/boot), and the reverse-map row is cleaned up in remove_shape/2.

The id is never written to SQLite — the shape_db/connection/query persistence layer is untouched.

3. Planned boundary (handle ↔ id resolution, as designed)

Resolution happens once at each edge, never on the steady-state hot path:

  • shape_log_collector.ex dependency_ids/2 and event_handler_builder.ex dep_id_for_handle!/2 — resolve a shape's persisted dependency handles → ids at consumer setup.
  • consumer.ex consumer_pid_for_handle/2 — the single handle→id resolver for cold-path/handle-boundary callers; routing-identity functions (name, stop, consumer_pid) are id-only.
  • shape_handle_for_log/2 used at id-only log/telemetry sites.

Deliberately kept on shape_handle: the API/plug boundary, all Storage.* and on-disk paths, the client Elixir Registry pub-sub (register_for_changes/Registry.dispatch), PublicationManager/RelationTracker (one-entry-per-shape, low memory value), and the subquery tag-hash (subquery_tags.ex/move_broadcast.ex are unchanged — their md5("#{stack_id}#{shape_handle}…") is streamed/persisted in :tags headers and must byte-match the Postgres-side hash in querying.ex).

4. Implementation-time additions (the only non-mechanical extras — review these)

Almost all are the unavoidable consequence of introducing a lookup: once a path resolves handle→id, it must handle the lookup missing (a shape concurrently deleted):

  • Deletion-race branches around new lookups: shape_cache.ex restore path (:error -> halt), shape_cleaner.ex (resolve id before ShapeStatus.remove_shape deletes the mapping; nil-guard helpers stop_consumer/3, remove_from_shape_log_collector/2; suspend-path :error -> :ok), consumer.ex all_materializers_alive?.
  • shape_cache.ex start_shape/4 + start_materializers/3 restructured to a with chain: a materializer-id miss now aborts and purges before starting the consumer, instead of starting both regardless (symmetric graceful handling).
  • snapshotter.ex looks the consumer up via ConsumerRegistry.whereis(id) directly so a snapshot-failure report still reaches it after cleanup has removed the handle→id mapping.
  • shape_status.ex put_handle_id/3 — a Mix.env() == :test-gated ETS seeder for routing tests (compile-gated; cannot reach prod).

Guarantees verified

  • shape_id never reaches the wire (no shape_id in api/, plug/, log_items.ex, or :tags/electric-handle headers).
  • id is in-memory only and never persisted (re-minted each boot).
  • Subquery tag-hash stays handle-keyed and byte-compatible with the Postgres side.
  • The two registries stay distinct (custom ConsumerRegistry = id; client Elixir Registry = handle).
  • No internal function accepts both a handle and an id.

Testing

Full suite green: 391 doctests, 9 properties, 1833 tests, 0 failures. Subquery move-in/move-out integration and electric-handle header round-trip covered by existing suites.

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
4505 1 4504 61
View the top 1 failed test(s) by shortest run time
Elixir.Electric.StackSupervisor.TelemetryTest::test prometheus_metrics/0 per-status HTTP request counts are scrapable
Stack Traces | 0.201s run time
3) test prometheus_metrics/0 per-status HTTP request counts are scrapable (Electric.StackSupervisor.TelemetryTest)
     .../electric/stack_supervisor/telemetry_test.exs:67
     Assertion with =~ failed
     code:  assert scrape =~ ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     left:  "# HELP electric_plug_serve_shape_requests_count \n# TYPE electric_plug_serve_shape_requests_count counter\nelectric_plug_serve_shape_requests_count{status=\"200\"} 3\nelectric_plug_serve_shape_requests_count{status=\"409\"} 1\n"
     right: ~r/electric_plug_serve_shape_requests_count\{status="200"\} 2/
     stacktrace:
       .../electric/stack_supervisor/telemetry_test.exs:90: (test)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Comment thread .changeset/small-shape-refs.md Outdated
@claude

claude Bot commented Jun 18, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR introduces in-memory numeric shape_ids (a monotonic pos_integer minted by ShapeStatus) for all internal routing/consumer/materializer/dependency/filter paths, while keeping the public string shape_handle at the HTTP and storage boundaries. Since iteration 5 the only change is 115488cae "Remove plan", which drops the committed planning doc (resolving the prettier-check failure), and the changeset wording was updated per @robacourt's suggestion. On re-examination, the shape_id-reverse-map issue flagged last round is more serious than the previous review concluded: re-minting ids on every refresh/1 doesn't just leak — it diverges the handle->id map from the ids that live consumers/materializers are registered under across a replication reconnect.

What's Working Well

  • Iteration 5 -> 6 cleanups landed:
    • The internal planning doc docs/superpowers/plans/2026-06-18-internal-shape-id.md was removed (115488cae). docs/** is not in .prettierignore, so this was the file failing Check formatting of TS packages, examples and website — finding [Merged on #3] Write inserts, updates and deletes to Vaxine #2 from last round is resolved.
    • The changeset (.changeset/small-shape-refs.md) now reads "Reduce memory use by using compact runtime shape refs internally.", matching @robacourt's inline suggestion.
  • All the structural wins from the iteration-4 -> 5 rework still stand: handle/id resolved before removal in shape_cleaner.ex:165-204, per-flush hot path uses state.shape_id directly, :active_shapes returns MapSet.t(shape_id()), the subquery_index table-name parsing is gone, Consumer.State.new is arity-distinguished, and @alco's three request_batcher comments are resolved.
  • Atomic mint_id via :ets.update_counter(.., :seq, {2,1}, {:seq,0}), the is_stack_id guards, and the :ets.lookup_element/4 + rescue ArgumentError defensive lookups are clean.

Issues Found

Critical (Must Fix)

1. refresh/1 re-mints every shape's shape_id, diverging handle->id from live consumers across a reconnect (escalated from last round's "leak")

File: packages/sync-service/lib/electric/shape_cache/shape_status.ex:99-117 (refresh/1) and :527-548 (populate_shape_meta_table/2)

populate_shape_meta_table/2 unconditionally mint_ids a fresh id for every shape on each call and overwrites the meta row (handle -> new_id) plus inserts {new_id, handle} into the reverse shape_id_table (:537-541). refresh/1 calls it on every replication-lock acquisition (connection/manager.ex:780, on :replication_client_lock_acquired), which re-runs each time Connection.Manager re-enters :connection_setup — i.e. on every connection failure/restart.

The previous review treated this as "purely a leak (routing stays correct)". That conclusion is incomplete. The shape subsystem (ShapeLogCollector + consumers + materializers) is explicitly "resilient to connection failures" (core_supervisor.ex:5-6; started together in start_shapes_supervisor, core_supervisor.ex:69,84), and stop_shapes_supervisor is only called on the timeline-change / purge_all_shapes? branch (connection/manager.ex:467). So on an ordinary reconnect the consumers stay alive holding the ids they were started with, while refresh/1 installs new ids in the meta table. Consumers/materializers cache their id at init (consumer.ex:165-178, state.ex:93-106) and register under it — they never re-resolve.

After such a reconnect, any code that resolves an existing shape's handle -> id fresh gets the new id, which no live process is registered under:

  • Consumer.consumer_pid_for_handle/2 (consumer.ex:134-136) -> id_for_handle -> new id -> ConsumerRegistry.whereis(new_id) -> nil for a shape whose consumer is alive under the old id (breaks handle-based lookups, e.g. delete-by-handle).
  • all_materializers_alive?/1 (consumer.ex:1254-1263): a dependent consumer resolves its dependency's handle -> id via id_for_handle, then GenServer.whereis(Materializer.name(stack_id, new_id)) -> nil. The dependency materializer is alive under the old id, so the dependent shape concludes the dependency "has gone away" and tears itself down (stop_and_clean, :1250). Subquery/dependent shapes are dropped after a benign reconnect.

(Steady-state change routing collector->consumer happens to keep working because both sides hold the same old ids internally — the break is specifically at every fresh handle -> id boundary after a re-mint.)

The unbounded leak the previous review described is also still real and stacks on top: the old {old_id, handle} reverse rows are never swept (the generation :ets.select_delete at :111 only touches the meta table; shape_id_table is cleared only in create_shape_meta_table/1 at :523), so each reconnect orphans ~(number of shapes) reverse entries for the life of the ShapeStatusOwner.

Suggested fix: don't re-mint ids for shapes that already have one. On refresh/1, reuse the existing handle -> id from the meta row when the handle is already present (only mint for genuinely new handles), and prune reverse rows for handles that disappeared. That keeps live consumers/materializers addressable across reconnects and bounds the reverse table. A clear-and-rebuild of shape_id_table would fix the leak but would not fix the divergence (and would transiently break handle_for_id for live shapes), so the id-stability property is the important one here.

Note: I could not exercise the reconnect path in this environment; the analysis is from the supervision topology and id-caching above. Worth confirming with a test that calls refresh/1 twice with a live consumer/materializer (see suggestion #3).

Suggestions (Nice to Have)

2. remove_shape_immediate/3 is exhaustive only by current return contracts (unchanged from last round)

File: packages/sync-service/lib/electric/shape_cache/shape_cleaner.ex:155-160

remove_shapes_immediate/3 matches the result of remove_shape_immediate/3 against exactly :ok and {:error, :data_removed}. The :ok branch returns the value of the with chain (stop_consumer -> Storage.cleanup! -> remove_from_shape_log_collector), all of which return :ok/raise today — so it's fine. But if any grows an {:error, other} return, the fall-through value would hit a CaseClauseError here rather than dropping gracefully. A _ -> [] arm (or an explicit else on the with) would harden it. Low priority.

3. Add reconcile/concurrency coverage for the identity layer (unchanged, and now directly testable)

The new describe "shape_id" tests are a good addition. A test that calls refresh/1 twice and asserts that (a) a still-present shape keeps a stable id and remains addressable via handle_for_id/id_for_handle, and (b) a shape removed from SQLite between the two refreshes does not linger in shape_id_table, would pin both halves of finding #1. A concurrent-mint / lost-race test is also worth adding.

Issue Conformance

Still no linked issue. The PR description explains the rationale (cut the memory cost of holding millions of binary handles in ETS/maps/sets) and the implementation matches the described scope, including the deliberate scope cuts (PublicationManager/RelationTracker stay on handle; subquery_tags/move_broadcast stay on handle so the on-disk/Postgres-side tag hashes still byte-match). Consider linking a tracking issue if one exists.

Previous Review Status

Iteration 5 -> 6 (only new commit: 115488cae "Remove plan"):


Review iteration: 6 | 2026-06-19

@robacourt robacourt force-pushed the rob/internal-shape-handles branch from c0b57f3 to c82a144 Compare June 18, 2026 11:47

@alco alco left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only reviewed a part of it.

So far, the addition of optionality between shape_ref and shape_handle in multiple modules/functions looks like added complexity we would like to avoid. What's the heuristic I should use when reading the code to know whether a given function can end up taking a shape ref or a shape handle at runtime?


def add_shape(stack_id, shape_handle, shape, :create)
when is_stack_id(stack_id) and is_shape_handle(shape_handle) do
GenServer.call(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just call the local add_shape() function to avoid duplicating GenServer.call() and the timeout magic number.

| to_add: Map.put(state.to_add, shape_handle, shape),
to_remove: MapSet.delete(state.to_remove, shape_handle),
to_schedule_waiters: Map.put(state.to_schedule_waiters, shape_handle, from)
| to_add: Map.put(state.to_add, shape_ref, {shape_handle, shape}),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so these maps and sets may now have to shapes of values? I feel like this change is missing a better in-code documentation clarifying the tradeoffs. For an uninformed reader it looks odd to have similar handle_call clauses where one has the additional shape_handle param.

end
end

defp shape_handle_for_ref(stack_id, shape_id) when is_shape_id(shape_id) do

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the double naming of shape_ref and shape_id referring to the same thing?

robacourt and others added 9 commits June 18, 2026 16:46
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert the internal routing pipeline from the binary shape_handle to the
numeric shape_id in one atomic change. ConsumerRegistry, EventRouter/Filter,
ShapeLogCollector (+ RequestBatcher/FlushTracker), Partitions and
DependencyLayers are now keyed by id, along with the consumer's registration
name, flush notifications and the removal path.

Client-facing Registry pub-sub, Storage, PublicationManager, the subquery
index seeding/mark_ready and the cleaner's deferred phase remain keyed by
handle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Close the Chunk 3 transitional mixed-key state: the subquery index's
registration side was keyed by shape_id while seeding/membership was still
keyed by shape_handle, leaving every subquery shape permanently in routing
fallback. Flip the remaining index callers (seed_membership, mark_ready,
add_value, remove_value) and the index's own parameters to shape_id so the
exact-membership reader (membership_or_fallback?/member?) matches the
seeded entries again.

Also key the RefResolver and the materializer (process name, link-values
ETS cache, and the {:materializer_changes, dep_id, ...} routing message) by
shape_id, resolving dependency handle -> id at the consumer boundary.

move_broadcast.ex and subquery_tags.ex stay on shape_handle: their tag hash
md5('stack_id' || 'shape_handle' || value) must byte-match the Postgres side
and is streamed/persisted in tag headers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r_log

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robacourt robacourt force-pushed the rob/internal-shape-handles branch from 449886b to 6655b70 Compare June 19, 2026 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants