Skip to content

Emit per-individual-table ETS memory/size telemetry#4636

Draft
erik-the-implementer wants to merge 5 commits into
mainfrom
alco/ets-per-table-metrics
Draft

Emit per-individual-table ETS memory/size telemetry#4636
erik-the-implementer wants to merge 5 commits into
mainfrom
alco/ets-per-table-metrics

Conversation

@erik-the-implementer

Copy link
Copy Markdown
Contributor

What

Adds per-individual-table ETS memory/size telemetry so we can drill into which specific tables dominate ETS memory in Honeycomb, beyond the existing per-table_type aggregate.

Details

  • New [:ets, :table] telemetry event emitting ets.table.memory and ets.table.size, tagged by table_name + table_type, for the top N tables by memory.
  • N is the new top_ets_individual_count option (default 10).
  • Reuses the already-present EtsTables.top_tables/1 (no new gathering code); the existing top_by_type / ets.memory.total path is untouched.
  • The per-table emitter does its own :ets.all() walk (kept separate from the type rollup so the two counts stay independently configurable — a cheap second scan, not worth disturbing the working type path to merge).

Two follow-up cleanup commits remove redundant to_string/1 calls on telemetry tag values for the ETS and process emitters. Every reporter that carries these tagged metrics (OTel, Prometheus, StatsD) stringifies tag values itself, and the CallHomeReporter doesn't carry them at all — so the call-site to_string/1 was a no-op. Tag values are now passed through raw (atom table name, string type).

Testing

  • Added an emission test asserting [:ets, :table] fires with memory/size measurements and table_name/table_type metadata.
  • Full electric-telemetry suite: 124 tests, 0 failures; clean mix compile --warnings-as-errors.

Changeset included (@core/electric-telemetry, patch).

🤖 Generated with Claude Code

alco and others added 3 commits June 18, 2026 19:23
Wire the already-present EtsTables.top_tables/1 to a new [:ets, :table]
telemetry event, emitting ets.table.memory and ets.table.size tagged by
table_name + table_type for the top N tables by memory. N is the new
top_ets_individual_count option (default 10). Complements the existing
per-table_type ets.memory.total aggregate, which is left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every metrics reporter already stringifies tag values itself — OTel
(OtlpUtils.to_kv_value has an is_atom clause), Prometheus (exporter
escape/1 starts with to_string/1), and StatsD (both standard and datadog
formatters to_string the value). So the to_string/1 at the telemetry
call site was a no-op. Pass the raw atom/binary tag values through for
ets.memory.total and ets.table.{memory,size}.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same as the ETS tag cleanup: every reporter that carries these tagged
metrics (OTel, Prometheus, StatsD) stringifies tag values itself, and the
CallHomeReporter doesn't carry the process.* metrics at all. So the
to_string/1 on process_type in process_memory/1 and process_bin_memory/1
was a no-op. Pass map.type through raw.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.66%. Comparing base (ee0da19) to head (a733372).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ry/lib/electric/telemetry/application_telemetry.ex 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4636      +/-   ##
==========================================
+ Coverage   59.46%   59.66%   +0.20%     
==========================================
  Files         385      400      +15     
  Lines       43039    43637     +598     
  Branches    12383    12383              
==========================================
+ Hits        25591    26037     +446     
- Misses      17371    17524     +153     
+ Partials       77       76       -1     
Flag Coverage Δ
electric-telemetry 73.24% <66.66%> (?)
elixir 73.24% <66.66%> (?)
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.50% <ø> (+0.03%) ⬆️
packages/agents-server 75.47% <ø> (ø)
packages/agents-server-ui 7.51% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (+0.11%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 59.47% <ø> (+0.01%) ⬆️
unit-tests 59.66% <66.66%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR adds a per-individual-table ETS telemetry event ([:ets, :table]ets.table.memory/ets.table.size, tagged by table_name + table_type) for the top N tables, plus to_string/1 cleanups on telemetry tag values. The per-table gauges are routed away from the TTL-less Prometheus registry while OTel and StatsD still carry them. Small, well-scoped, has a changeset, and well-tested. Since iteration 2, the only change is a comment-wording cleanup (a733372fe) that resolves the last open suggestion.

What's Working Well

  • Prometheus cardinality is correctly handled. prometheus_excluded?/1 filters ets.table.* out of the Prometheus metric list while OTel and StatsD still carry them. The pattern %{name: [:ets, :table | _]} matches both [:ets, :table, :memory] and [:ets, :table, :size], and leaves [:ets, :memory, :total] (the bounded per-table_type aggregate) on the Prometheus path.
  • The exclusion comment is now precise. The reworded exporter_child_specs/1 comment cleanly separates the three reporter behaviors — OTel clears its series between exports, StatsD pushes each value as emitted (nothing accumulates in-process), and only TelemetryMetricsPrometheus.Core retains series with no TTL. The contrast with additional_prometheus_metrics remains, so the next reader won't reintroduce the issue.
  • Strong routing tests. The ets.table.* metric routing describe block asserts the metrics reach OTel + StatsD, are absent from Prometheus, and that ets.memory.total still reaches Prometheus — exactly the invariants the fix relies on.
  • The emission test inserts real data and asserts measurement/metadata shapes and types, confirming the named table appears in the emitted set.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None remaining.

Suggestions (Nice to Have)

  1. Double :ets.all() scan per poll (carried over). ets_table_memory/1 and ets_memory/1 each do a full :ets.all() walk with 3× :ets.info per table; EtsTables.top_memory_stats/2 could produce both in one scan. The author's rationale (independent counts, don't disturb the working type path) is reasonable — flagging only as something to revisit if poller latency shows up at 10k–150k shapes.

  2. Option naming (carried over). top_ets_individual_count next to top_ets_table_count doesn't make the individual-vs-type distinction obvious; both default to 10 and neither is surfaced as an env var in telemetry_opts/1. Consistent with the existing option, not worth churning.

Issue Conformance

Still no linked issue (per repo convention PRs should reference one — minor; the author notes this is a small telemetry enhancement rather than a tracked bug). Implementation matches the PR description; commits are accurately scoped. No scope creep.

Previous Review Status

This PR looks ready to merge.


Review iteration: 3 | 2026-06-19

@alco alco marked this pull request as draft June 19, 2026 10:16
ets.table.memory/size are tagged by the raw table name (per-shape/stack
ids), so the top-N set rotates over time. OTel clears series between
exports and StatsD is per-scrape, so a rotating high-cardinality set is
fine there (and is the intended trade-off for Honeycomb). But
TelemetryMetricsPrometheus.Core has no series TTL, so those series would
accumulate in the local registry indefinitely and report stale frozen
last_values for tables that drop out of the top N.

Exclude ets.table.* from the Prometheus reporter's metric list (the
bounded ets.memory.total by table_type still goes to Prometheus). Add a
routing test asserting ets.table.* reach OTel + StatsD but not Prometheus.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erik-the-implementer

Copy link
Copy Markdown
Contributor Author

Thanks — addressed the cardinality issue and verified the rest.

Important #1 (Prometheus cardinality) — fixed in e653bf570. ets.table.* are now excluded from the Prometheus reporter's metric list and flow to OTel + StatsD only; the bounded ets.memory.total (by table_type) stays on Prometheus. Added a routing test asserting exactly that. I confirmed the asymmetry is correct: the OTel exporter clears its series between exports (metric_store.ex :ets.match_delete on the exported generation), and StatsD is per-scrape, so a rotating top-N set is harmless there — only TelemetryMetricsPrometheus.Core keeps series with no TTL, which is the leak/stale-gauge you flagged.

Suggestion #1 (double :ets.all() scan) — kept as-is, deliberately. top_memory_stats/2 does exist and would do one scan with independent counts; I'm keeping the two emitters separate to avoid rewriting the working ets_memory/1 path. Noted for revisiting if poller latency shows up on very-high-shape nodes.

Suggestion #2 (to_string at reporter boundary) — verified by source. Prometheus exporter.escape/1 begins with to_string/1, OTel OtlpUtils.to_kv_value/1 has an is_atom clause, and both StatsD formatters stringify. With ets.table.* now off Prometheus, the only atom tag reaching Prometheus is process_type, which escape/1 handles.

Suggestion #3 (option naming) — left as-is to avoid churning the existing top_ets_table_count; both are default-only opts.

No linked issue — this is a small telemetry enhancement rather than a tracked bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants