SymDB: TracePoint :class hot-load hook for incremental class extraction#5697
SymDB: TracePoint :class hot-load hook for incremental class extraction#5697p-datadog wants to merge 5 commits into
Conversation
BenchmarksBenchmark execution time: 2026-05-19 20:09:04 Comparing candidate commit 9c7dcf0 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 46 metrics, 0 unstable metrics.
|
efbfd03 to
63abb16
Compare
…op ignore CI's Steep run flagged 10 errors after the previous "fill in concrete types" commit. Two type-fill changes were too tight, and one steep:ignore removal was unverified locally because libdatadog version mismatch blocked Steep here at the time. CI surfaced all three. 1. Component @logger / initialize logger / environment_supported? logger: revert to `untyped`. The tightened `SymbolDatabase::Logger` type didn't match Uploader's `Datadog::Core::Logger` parameter type, surfacing a long-standing inconsistency between the SymDB Logger wrapper and the downstream component sig declarations. Reverting to `untyped` matches the established cross-component logger-handoff convention (see DI's sigs for the same pattern); fixing the underlying inconsistency in Uploader / Extractor / ScopeBatcher sigs is out of scope for this PR. 2. Remote process_change `change` and receiver block's `changes`: revert to `untyped`. The union type `Repository::Change::Deleted | ::Inserted | ::Updated` looked right but Steep cannot narrow the union by the `case change.type` dispatch the code uses — `.content` exists only on Inserted/Updated, `.previous` only on Deleted/Updated, so every access through the union flagged. The runtime dispatch via `.type` symbol matches what the code does correctly; reverting to `untyped` matches the pattern. (DI's remote.rbs takes the same approach.) 3. scheduler_loop in component.rb:306: reinstate a per-line `# steep:ignore NoMethod` on the `@scheduled_at - Time.get_time` line, with an inline comment explaining that Steep does not narrow nullable instance variables across an `if @scheduled_at.nil?` check. Per-line rather than the previously removed block-level ignore — only one line needs suppression. Verification: - bundle exec steep check → No type error detected. (locally, with libdatadog 33.0.0.1.0 now installed) Co-Authored-By: Claude <noreply@anthropic.com>
Closes the cross-tracer parity gap: Java (ClassFileTransformer), Python (BaseModuleWatchdog#after_import), and .NET (AppDomain.AssemblyLoad) all combine an initial scan with a hot-load hook for continuous coverage of dynamically loaded code. Ruby's symdb only scanned once, so classes loaded after :after_initialize (request-time autoloads, late initializers, plugin systems, STI subclasses on first query, dynamically created classes) never reached the symbol DB. Code changes (lib/datadog/symbol_database/component.rb): - Install per-instance TracePoint :class hook lazily on first start_upload. Filters singleton classes (matches extract_all). Pushes loaded modules onto a per-instance buffer; signals the scheduler. - extract_and_upload now distinguishes initial extraction (extract_all, full ObjectSpace walk) from incremental (drain buffer, call Extractor#extract per unique module). Buffer cleared before initial extraction so the post-init drain doesn't reprocess what extract_all already covered. - Disable the TracePoint in shutdown! before the scheduler stops. - Remove the class-level @uploaded_this_process flag and its three short-circuit sites (start_upload + 2 in scheduler_loop). The actual two-uploads fix is the per-instance scheduler debounce; the flag was belt-and-braces. Removing it allows hot-load extractions after the initial upload. - Remove Component.mark_uploaded, .uploaded_this_process?, .reset_uploaded_this_process_for_tests!, @upload_done_mutex, @upload_done_cv — all dead with the flag gone. - Rewrite wait_for_idle to wait on @last_upload_time advance via a per-instance condition variable. Preserves the gobo bin/extract_symbols semantic (script triggers upload, blocks for it to complete) without depending on the one-shot flag. - Remove the Rails.application.config.eager_load gate in schedule_deferred_upload. Under hot-load, an under-extracted initial upload self-corrects as the app exercises code. Spec changes: - spec/datadog/symbol_database/component_spec.rb: drop the reset_uploaded_this_process_for_tests! before-block and the short-circuit test; rewrite the two-uploads regression as 'debounce collapses bursts of start_upload calls' (the actual property the scheduler debounce holds) plus 'each Component built across reconfigurations extracts independently' (the new explicit invariant); add a hot-load regression test that defines a class via eval after initial upload and asserts Extractor#extract is called for it. - spec/datadog/symbol_database/remote_config_integration_spec.rb: drop reset; rewrite 'class-level dedup prevents re-upload' to assert the new invariant (a second start_upload with no new class loads produces no extra captured upload because the hot-load drain is empty). Type signatures (sig/datadog/symbol_database/component.rbs): add buffer ivars, hot-load hook ivars, last_upload_time_cv; remove class-level flag machinery; add private extract_hot_load_buffer / install_hot_load_hook / enqueue_hot_load. Verification: 316/316 specs pass (full symbol_database/ suite). Steep clean. StandardRB clean. Two-uploads regression preserved in renamed form (per-instance debounce continues to collapse bursts within one Component; cross-Component dedup is intentionally removed).
In a forked child process, only the forking thread survives. Mutexes and condition variables are copied without ownership tracking (orphan locks), the scheduler thread reference is dead, and the parent's TracePoint hook is bound to the dead scheduler. Without an after_fork reset, the first attempt to acquire a symdb mutex in the child can deadlock on an orphan lock, and a force_upload-mode child never picks up where the parent left off. `Component#after_fork!` reinitializes: - Hot-load: buffer cleared, `@initial_extraction_done = false`, `@hot_load_tracepoint = nil`. The next `start_upload` installs a fresh TracePoint bound to the child's component and triggers its own initial extraction. - Scheduler: thread/`@scheduled_at`/`@scheduler_signaled` cleared; `@scheduler_mutex` and `@scheduler_cv` reinitialized. - Upload: `@upload_in_progress = false`; `@mutex`, `@upload_in_progress_cv`, `@last_upload_time_cv`, `@hot_load_buffer_mutex` reinitialized. - Force-upload mode: re-registers the deferred-upload callback in the child. In Rails, `:after_initialize` already fired in the parent, so the on_load block runs immediately and the child schedules its own upload. Cross-process upload deduplication is deliberately out of scope: each forked Component performs its own initial extraction. Workers in `preload_app! + eager_load=true` deployments hold identical code to the parent — backend content dedup is tracked separately in `projects/symdb/backlog/impl.md §Fork/child dedup` and `projects/symdb/design/fork-architecture.md`. Wired from `Core::Configuration::Components#after_fork` alongside telemetry/remote/crashtracker/ProcessDiscovery. Unit specs cover: mutex/CV reinit, scheduler thread/state reset, hot-load buffer/flag/tracepoint reset, `@upload_in_progress` reset, force-upload re-trigger, and that start_upload still works after after_fork!. Local rspec couldn't run (libdatadog 33.0.0.1.0 not installed in this dev env) — CI is the source of truth.
63abb16 to
ade017a
Compare
Typing analysisNote: Ignored files are excluded from the next sections.
|
CI's Steep check on PR #5697 flagged 4 problems: 1. component.rb:477 - NoMethod on @hot_load_tracepoint.enable. Steep does not narrow instance variable types after assignment within the same method, so the `(::TracePoint | nil)` RBS type persists past the `@hot_load_tracepoint = TracePoint.new(...)` assignment. Added `steep:ignore NoMethod` per the project Steep policy: false positives from narrowing limitations are silenced, never worked around by refactoring working code. 2. component.rb:472 - redundant `# steep:ignore:start` block around the TracePoint body. The block contents (`mod.singleton_class?`, `send`) do not produce Steep errors. Removed the ignore. 3. remote.rb:106, 113 - redundant `# steep:ignore NoMethod` on `change.content` accesses. The union type's variants must now all expose `.content`, so Steep no longer flags the access. Removed both ignores. Verified locally: `bundle exec rake typecheck` clean. Co-Authored-By: Claude <noreply@anthropic.com>
The previous Steep fix in 51f692a removed two `# steep:ignore NoMethod` directives in remote.rb because Steep flagged them as redundant. But the redundancy was a symptom — the underlying cause was in `sig/datadog/symbol_database/remote.rbs`, where commit e002f60 weakened `change` and `changes` parameters to `untyped` to bypass union narrowing issues. With `change: untyped`, `change.content` type-checks trivially, which made the genuine NoMethod-suppressing ignores look redundant. Restore the proper types on remote.rbs (matching master and DI): - `receiver` block param: `Array[untyped]` → `Array[Datadog::Core::Remote::Configuration::Repository::change]` - `process_change` change param: `untyped` → `Datadog::Core::Remote::Configuration::Repository::change` The `case change.type` body in process_change already uses `# @type var change: ...Inserted/Updated/Deleted` to narrow within each when branch. The `else` and `rescue` branches operate on the un-narrowed union, so reinstate the `# steep:ignore NoMethod` directives there — they suppress real errors, not redundant ones. Verification: `bundle exec rake typecheck` clean. Co-Authored-By: Claude <noreply@anthropic.com>
JIRA: DEBUG-5367
What does this PR do?
Adds a
TracePoint :classhook to the symdbComponentso that classes loaded after initial extraction (request-time autoloads, late initializers, plugin systems, STI subclasses on first query, dynamically-created classes) reach the symbol DB. Closes a cross-tracer parity gap with Java (ClassFileTransformer), Python (BaseModuleWatchdog#after_import), and .NET (AppDomain.AssemblyLoad) — all of which combine an initial scan with a hot-load hook for continuous coverage.Motivation:
Ruby's symdb today scans
ObjectSpace.each_object(Module)once per process. Anything not loaded by:after_initializeis invisible to the upload — undereager_load=false, that's most of the app; undereager_load=true, that's still anything that loads later (request-time autoloads, plugin code, dynamically-created classes, STI subclasses on first query). Java/Python/.NET don't have this gap because they hook the runtime's class-load mechanism. This PR closes it.Design background:
:classchosen over Zeitwerkon_load(Rails-only),Module#const_added(Ruby 3.1+ only), and polling (high latency).Extractor#extract(mod)chosen over full re-scan (matches Java/Python; bandwidth scales with what changed, not total app size).Behavioral changes (intentional):
Component.uploaded_this_process?and the surrounding machinery removed. The two-uploads bug fix that introduced the flag is preserved by the per-instance scheduler debounce — multiplestart_uploadcalls on one Component still collapse into one extraction within the debounce window. Cross-Component dedup is intentionally removed: each Component (e.g. after aDatadog.configurereconfigure) does its own initial extraction, which the new Component needs in order to have a hot-load baseline.eager_load=falseno longer suppresses auto-deferred upload. Undereager_load=false, the initial upload may be small but the hot-load hook covers the rest as the app exercises code.wait_for_idlesemantics unchanged from the script's perspective but the implementation waits on@last_upload_timeadvance instead of the removed flag.Change log entry
None.
Additional Notes:
eager_loadgate removal (add Rubocop for style guidelines #3) interacts with the two-uploads investigation. The actual fix that resolved that bug was the per-instance scheduler + the gate; with hot-load present, the gate is no longer needed because under-extracted initial uploads self-correct. A dedicated regression test ('debounce regression') exercises the per-instance scheduler half of the original fix to confirm it still holds.How to test the change?
```
bundle exec rspec spec/datadog/symbol_database/
bundle exec steep check sig/datadog/symbol_database lib/datadog/symbol_database
bundle exec rake standard
```
316/316 specs pass; steep clean; standardrb clean. New regression test (`hot-load coverage (TracePoint :class)`) defines a class via `eval` after initial upload and asserts `Extractor#extract` is called for it.
End-to-end: gobo `bin/extract_symbols` produces an initial upload then incremental uploads as the test app exercises code. Backend dedup of overlapping FILE scopes is a verification step before un-drafting.