Skip to content

SymDB: TracePoint :class hot-load hook for incremental class extraction#5697

Draft
p-datadog wants to merge 5 commits into
masterfrom
symdb-hot-load-hook
Draft

SymDB: TracePoint :class hot-load hook for incremental class extraction#5697
p-datadog wants to merge 5 commits into
masterfrom
symdb-hot-load-hook

Conversation

@p-datadog
Copy link
Copy Markdown
Member

@p-datadog p-datadog commented May 6, 2026

JIRA: DEBUG-5367

What does this PR do?

Adds a TracePoint :class hook to the symdb Component so that classes loaded after initial extraction (request-time autoloads, late initializers, plugin systems, STI subclasses on first query, dynamically-created classes) reach the symbol DB. Closes a cross-tracer parity gap with Java (ClassFileTransformer), Python (BaseModuleWatchdog#after_import), and .NET (AppDomain.AssemblyLoad) — all of which combine an initial scan with a hot-load hook for continuous coverage.

Motivation:

Ruby's symdb today scans ObjectSpace.each_object(Module) once per process. Anything not loaded by :after_initialize is invisible to the upload — under eager_load=false, that's most of the app; under eager_load=true, that's still anything that loads later (request-time autoloads, plugin code, dynamically-created classes, STI subclasses on first query). Java/Python/.NET don't have this gap because they hook the runtime's class-load mechanism. This PR closes it.

Design background:

  • Cross-tracer comparison and Ruby-native equivalents evaluated in the symdb project research notes.
  • TracePoint :class chosen over Zeitwerk on_load (Rails-only), Module#const_added (Ruby 3.1+ only), and polling (high latency).
  • Per-class incremental extraction via existing Extractor#extract(mod) chosen over full re-scan (matches Java/Python; bandwidth scales with what changed, not total app size).

Behavioral changes (intentional):

  1. Multiple uploads per process. Initial upload + one upload per debounce window of new class loads. Backend dedupes overlapping FILE scopes.
  2. No more class-level once-per-process flag. Component.uploaded_this_process? and the surrounding machinery removed. The two-uploads bug fix that introduced the flag is preserved by the per-instance scheduler debounce — multiple start_upload calls on one Component still collapse into one extraction within the debounce window. Cross-Component dedup is intentionally removed: each Component (e.g. after a Datadog.configure reconfigure) does its own initial extraction, which the new Component needs in order to have a hot-load baseline.
  3. eager_load=false no longer suppresses auto-deferred upload. Under eager_load=false, the initial upload may be small but the hot-load hook covers the rest as the app exercises code.
  4. wait_for_idle semantics unchanged from the script's perspective but the implementation waits on @last_upload_time advance instead of the removed flag.

Change log entry

None.

Additional Notes:

  • Hot-load coverage is on by default; no env-var opt-out for MVP. If telemetry surfaces excessive uploads in some app shape, an opt-out can be added.
  • Open question: backend dedup of overlapping FILE scopes from one process. The agnostic provider's merge-on-query path should handle this, but verifying against staging is on the verification list.
  • The eager_load gate removal (add Rubocop for style guidelines #3) interacts with the two-uploads investigation. The actual fix that resolved that bug was the per-instance scheduler + the gate; with hot-load present, the gate is no longer needed because under-extracted initial uploads self-correct. A dedicated regression test ('debounce regression') exercises the per-instance scheduler half of the original fix to confirm it still holds.

How to test the change?

```
bundle exec rspec spec/datadog/symbol_database/
bundle exec steep check sig/datadog/symbol_database lib/datadog/symbol_database
bundle exec rake standard
```

316/316 specs pass; steep clean; standardrb clean. New regression test (`hot-load coverage (TracePoint :class)`) defines a class via `eval` after initial upload and asserts `Extractor#extract` is called for it.

End-to-end: gobo `bin/extract_symbols` produces an initial upload then incremental uploads as the test app exercises code. Backend dedup of overlapping FILE scopes is a verification step before un-drafting.

@p-datadog p-datadog added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label May 6, 2026
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 6, 2026

Benchmarks

Benchmark execution time: 2026-05-19 20:09:04

Comparing candidate commit 9c7dcf0 in PR branch symdb-hot-load-hook with baseline commit 87c5373 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@p-datadog p-datadog force-pushed the symdb-hot-load-hook branch from efbfd03 to 63abb16 Compare May 11, 2026 14:48
@p-datadog p-datadog changed the base branch from symbol-database-upload to symbol-database-upload-rebased May 11, 2026 16:27
Base automatically changed from symbol-database-upload-rebased to master May 18, 2026 15:59
p-ddsign and others added 3 commits May 18, 2026 12:13
…op ignore

CI's Steep run flagged 10 errors after the previous "fill in concrete types"
commit. Two type-fill changes were too tight, and one steep:ignore removal
was unverified locally because libdatadog version mismatch blocked Steep
here at the time. CI surfaced all three.

1. Component @logger / initialize logger / environment_supported? logger:
   revert to `untyped`. The tightened `SymbolDatabase::Logger` type didn't
   match Uploader's `Datadog::Core::Logger` parameter type, surfacing a
   long-standing inconsistency between the SymDB Logger wrapper and the
   downstream component sig declarations. Reverting to `untyped` matches
   the established cross-component logger-handoff convention (see DI's
   sigs for the same pattern); fixing the underlying inconsistency in
   Uploader / Extractor / ScopeBatcher sigs is out of scope for this PR.

2. Remote process_change `change` and receiver block's `changes`: revert
   to `untyped`. The union type
   `Repository::Change::Deleted | ::Inserted | ::Updated` looked right but
   Steep cannot narrow the union by the `case change.type` dispatch the
   code uses — `.content` exists only on Inserted/Updated, `.previous`
   only on Deleted/Updated, so every access through the union flagged.
   The runtime dispatch via `.type` symbol matches what the code does
   correctly; reverting to `untyped` matches the pattern. (DI's
   remote.rbs takes the same approach.)

3. scheduler_loop in component.rb:306: reinstate a per-line
   `# steep:ignore NoMethod` on the `@scheduled_at - Time.get_time`
   line, with an inline comment explaining that Steep does not narrow
   nullable instance variables across an `if @scheduled_at.nil?`
   check. Per-line rather than the previously removed block-level
   ignore — only one line needs suppression.

Verification:
- bundle exec steep check → No type error detected. (locally, with
  libdatadog 33.0.0.1.0 now installed)

Co-Authored-By: Claude <noreply@anthropic.com>
Closes the cross-tracer parity gap: Java (ClassFileTransformer), Python
(BaseModuleWatchdog#after_import), and .NET (AppDomain.AssemblyLoad)
all combine an initial scan with a hot-load hook for continuous coverage
of dynamically loaded code. Ruby's symdb only scanned once, so classes
loaded after :after_initialize (request-time autoloads, late
initializers, plugin systems, STI subclasses on first query, dynamically
created classes) never reached the symbol DB.

Code changes (lib/datadog/symbol_database/component.rb):
- Install per-instance TracePoint :class hook lazily on first start_upload.
  Filters singleton classes (matches extract_all). Pushes loaded modules
  onto a per-instance buffer; signals the scheduler.
- extract_and_upload now distinguishes initial extraction (extract_all,
  full ObjectSpace walk) from incremental (drain buffer, call
  Extractor#extract per unique module). Buffer cleared before initial
  extraction so the post-init drain doesn't reprocess what extract_all
  already covered.
- Disable the TracePoint in shutdown! before the scheduler stops.
- Remove the class-level @uploaded_this_process flag and its three
  short-circuit sites (start_upload + 2 in scheduler_loop). The actual
  two-uploads fix is the per-instance scheduler debounce; the flag was
  belt-and-braces. Removing it allows hot-load extractions after the
  initial upload.
- Remove Component.mark_uploaded, .uploaded_this_process?,
  .reset_uploaded_this_process_for_tests!, @upload_done_mutex,
  @upload_done_cv — all dead with the flag gone.
- Rewrite wait_for_idle to wait on @last_upload_time advance via a
  per-instance condition variable. Preserves the gobo bin/extract_symbols
  semantic (script triggers upload, blocks for it to complete) without
  depending on the one-shot flag.
- Remove the Rails.application.config.eager_load gate in
  schedule_deferred_upload. Under hot-load, an under-extracted initial
  upload self-corrects as the app exercises code.

Spec changes:
- spec/datadog/symbol_database/component_spec.rb: drop the
  reset_uploaded_this_process_for_tests! before-block and the
  short-circuit test; rewrite the two-uploads regression as 'debounce
  collapses bursts of start_upload calls' (the actual property the
  scheduler debounce holds) plus 'each Component built across
  reconfigurations extracts independently' (the new explicit invariant);
  add a hot-load regression test that defines a class via eval after
  initial upload and asserts Extractor#extract is called for it.
- spec/datadog/symbol_database/remote_config_integration_spec.rb: drop
  reset; rewrite 'class-level dedup prevents re-upload' to assert the new
  invariant (a second start_upload with no new class loads produces no
  extra captured upload because the hot-load drain is empty).

Type signatures (sig/datadog/symbol_database/component.rbs): add buffer
ivars, hot-load hook ivars, last_upload_time_cv; remove class-level flag
machinery; add private extract_hot_load_buffer / install_hot_load_hook
/ enqueue_hot_load.

Verification: 316/316 specs pass (full symbol_database/ suite). Steep
clean. StandardRB clean. Two-uploads regression preserved in renamed form
(per-instance debounce continues to collapse bursts within one Component;
cross-Component dedup is intentionally removed).
In a forked child process, only the forking thread survives. Mutexes
and condition variables are copied without ownership tracking (orphan
locks), the scheduler thread reference is dead, and the parent's
TracePoint hook is bound to the dead scheduler. Without an after_fork
reset, the first attempt to acquire a symdb mutex in the child can
deadlock on an orphan lock, and a force_upload-mode child never picks
up where the parent left off.

`Component#after_fork!` reinitializes:

- Hot-load: buffer cleared, `@initial_extraction_done = false`,
  `@hot_load_tracepoint = nil`. The next `start_upload` installs a
  fresh TracePoint bound to the child's component and triggers its own
  initial extraction.
- Scheduler: thread/`@scheduled_at`/`@scheduler_signaled` cleared;
  `@scheduler_mutex` and `@scheduler_cv` reinitialized.
- Upload: `@upload_in_progress = false`; `@mutex`,
  `@upload_in_progress_cv`, `@last_upload_time_cv`,
  `@hot_load_buffer_mutex` reinitialized.
- Force-upload mode: re-registers the deferred-upload callback in the
  child. In Rails, `:after_initialize` already fired in the parent,
  so the on_load block runs immediately and the child schedules its
  own upload.

Cross-process upload deduplication is deliberately out of scope:
each forked Component performs its own initial extraction. Workers
in `preload_app! + eager_load=true` deployments hold identical code
to the parent — backend content dedup is tracked separately in
`projects/symdb/backlog/impl.md §Fork/child dedup` and
`projects/symdb/design/fork-architecture.md`.

Wired from `Core::Configuration::Components#after_fork` alongside
telemetry/remote/crashtracker/ProcessDiscovery.

Unit specs cover: mutex/CV reinit, scheduler thread/state reset,
hot-load buffer/flag/tracepoint reset, `@upload_in_progress`
reset, force-upload re-trigger, and that start_upload still works
after after_fork!.

Local rspec couldn't run (libdatadog 33.0.0.1.0 not installed in
this dev env) — CI is the source of truth.
@p-datadog p-datadog force-pushed the symdb-hot-load-hook branch from 63abb16 to ade017a Compare May 18, 2026 16:18
@dd-octo-sts dd-octo-sts Bot added core Involves Datadog core libraries debugger Live Debugger (+Dynamic Instrumentation, +Symbol Database) labels May 18, 2026
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 18, 2026

Typing analysis

Note: Ignored files are excluded from the next sections.

steep:ignore comments

This PR introduces 1 steep:ignore comment, and clears 1 steep:ignore comment.

steep:ignore comments (+1-1)Introduced:
lib/datadog/symbol_database/component.rb:472
Cleared:
lib/datadog/symbol_database/component.rb:170

Untyped methods

This PR introduces 4 partially typed methods. It decreases the percentage of typed methods from 63.0% to 62.89% (-0.11%).

Partially typed methods (+4-0)Introduced:
sig/datadog/symbol_database/component.rbs:42
└── def initialize: (Datadog::Core::Configuration::Settings settings, Datadog::Core::Configuration::AgentSettings agent_settings, untyped logger, ?telemetry: Datadog::Core::Telemetry::Component?) -> void
sig/datadog/symbol_database/component.rbs:65
└── def self.environment_supported?: (untyped logger) -> bool
sig/datadog/symbol_database/remote.rbs:12
└── def self.receiver: (?::Array[::String] products) { (Datadog::Core::Remote::Configuration::Repository repository, ::Array[untyped] changes) -> void } -> ::Array[Datadog::Core::Remote::Dispatcher::Receiver]
sig/datadog/symbol_database/remote.rbs:16
└── def self.process_change: (Component component, untyped change, Datadog::Core::Telemetry::Component? telemetry) -> void

Untyped other declarations

This PR introduces 1 untyped other declaration.

Untyped other declarations (+1-0)Introduced:
sig/datadog/symbol_database/component.rbs:18
└── @logger: untyped

If you believe a method or an attribute is rightfully untyped or partially typed, you can add # untyped:accept on the line before the definition to remove it from the stats.

p-ddsign and others added 2 commits May 19, 2026 08:53
CI's Steep check on PR #5697 flagged 4 problems:

1. component.rb:477 - NoMethod on @hot_load_tracepoint.enable. Steep
   does not narrow instance variable types after assignment within the
   same method, so the `(::TracePoint | nil)` RBS type persists past the
   `@hot_load_tracepoint = TracePoint.new(...)` assignment. Added
   `steep:ignore NoMethod` per the project Steep policy: false positives
   from narrowing limitations are silenced, never worked around by
   refactoring working code.

2. component.rb:472 - redundant `# steep:ignore:start` block around the
   TracePoint body. The block contents (`mod.singleton_class?`, `send`)
   do not produce Steep errors. Removed the ignore.

3. remote.rb:106, 113 - redundant `# steep:ignore NoMethod` on
   `change.content` accesses. The union type's variants must now all
   expose `.content`, so Steep no longer flags the access. Removed both
   ignores.

Verified locally: `bundle exec rake typecheck` clean.

Co-Authored-By: Claude <noreply@anthropic.com>
The previous Steep fix in 51f692a removed two `# steep:ignore
NoMethod` directives in remote.rb because Steep flagged them as
redundant. But the redundancy was a symptom — the underlying cause was
in `sig/datadog/symbol_database/remote.rbs`, where commit e002f60
weakened `change` and `changes` parameters to `untyped` to bypass union
narrowing issues. With `change: untyped`, `change.content` type-checks
trivially, which made the genuine NoMethod-suppressing ignores look
redundant.

Restore the proper types on remote.rbs (matching master and DI):

- `receiver` block param: `Array[untyped]` →
  `Array[Datadog::Core::Remote::Configuration::Repository::change]`
- `process_change` change param: `untyped` →
  `Datadog::Core::Remote::Configuration::Repository::change`

The `case change.type` body in process_change already uses
`# @type var change: ...Inserted/Updated/Deleted` to narrow within each
when branch. The `else` and `rescue` branches operate on the un-narrowed
union, so reinstate the `# steep:ignore NoMethod` directives there —
they suppress real errors, not redundant ones.

Verification: `bundle exec rake typecheck` clean.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos core Involves Datadog core libraries debugger Live Debugger (+Dynamic Instrumentation, +Symbol Database)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants