feat(correctness): add agent telemetry correctness test #1637
Binary Size Analysis (Agent Data Plane)
Target: a7a8935 (baseline) vs d38bf07 (comparison) diff

| Module | File Size | Symbols |
|---|---|---|
| anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.8940699831907001114 | +129 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.7700950232534575509 | -129 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.8940699831907001114 | +114 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.7700950232534575509 | -114 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.8940699831907001114 | +108 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.7700950232534575509 | -108 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.8940699831907001114 | +96 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.7700950232534575509 | -96 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.8940699831907001114 | +94 B | 1 |
| anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.7700950232534575509 | -94 B | 1 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +129 [NEW] +40 anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.8940699831907001114
[NEW] +114 [NEW] +25 anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.8940699831907001114
[NEW] +108 [NEW] +19 anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.8940699831907001114
[NEW] +96 [NEW] +7 anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.8940699831907001114
[NEW] +94 [NEW] +5 anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.8940699831907001114
[DEL] -94 [DEL] -5 anon.4f8fd67d74ae1f1600187cfeb0121be9.2.llvm.7700950232534575509
[DEL] -96 [DEL] -7 anon.4f8fd67d74ae1f1600187cfeb0121be9.0.llvm.7700950232534575509
[DEL] -108 [DEL] -19 anon.4f8fd67d74ae1f1600187cfeb0121be9.3.llvm.7700950232534575509
[DEL] -114 [DEL] -25 anon.4f8fd67d74ae1f1600187cfeb0121be9.4.llvm.7700950232534575509
[DEL] -129 [DEL] -40 anon.4f8fd67d74ae1f1600187cfeb0121be9.1.llvm.7700950232534575509
[ = ] 0 [ = ] 0 TOTAL
Regression Detector (Agent Data Plane)
Run ID:
Optimization Goals: ✅ No significant changes detected
Fine details of change detection per experiment (35)
Experiments configured
Bounds Checks: ✅ Passed (5)
Explanation
A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression ( |
Adds a new correctness test for the Datadog Agent's agenttelemetry component, which collects internal Prometheus metrics and periodically ships them to an intake endpoint. This test surfaces gaps where ADP intercepts traffic (DogStatsD, distributions) without updating the corresponding Go forwarder telemetry.

Changes:
- datadog-intake: new /api/v2/apmtelemetry handler (stores raw JSON payloads) + /agent-telemetry/dump endpoint; HTTPS proxy on port 2050 using a self-signed rcgen cert so the agent's hardcoded-https sender can reach the local intake (requires skip_ssl_validation: true)
- panoramic: AgentTelemetry analysis mode comparing (metric, tags) -> value contexts between baseline and comparison; configurable flush_wait_secs per test to accommodate the agenttelemetry schedule
- test/correctness/agent-telemetry: test case using a custom agenttelemetry profile (forwarder metrics only, iterations: 1, start_after: 67) routed to the local intake; count-only millstone traffic to eliminate sketches_v2 as a confounding variable
- Disable trace agent (apm_config.enabled: false) in the test's datadog.yaml to eliminate ~60 internal DogStatsD contexts (datadog.trace_agent.*, datadog.dogstatsd.client.*) from every flush bucket, reducing per-bucket context count from 160 to 101 and making the series point structure far simpler to reason about.
- Add ±1 gauge tolerance to the AgentTelemetryAnalyzer. The sole source of residual between-run variance is datadog.agent.running, which is appended unconditionally to every 15-second aggregator flush in pkg/aggregator/aggregator.go:appendDefaultSeries. Whether its last firing lands just before or just after the start_after: 67 snapshot boundary produces a ±1 point swing. There is no agent config to disable it; it is hardcoded. Within a single run both agents start at the same second and always capture the same flush count, so the intra-run comparison is always exact. A WARN is emitted if values differ within tolerance so any unexpected intra-run delta is visible.
- Add per-metric-name labels to small-context (≤3) flush buckets in the series breakdown log to aid debugging non-DSD metric sources.
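
The tolerance rule described above can be sketched roughly as follows. This is a Python illustration only, not panoramic's actual implementation: `compare_contexts`, `Report`, and the `(name, sorted tags)` key shape are hypothetical names chosen for the example, and the sketch assumes the ±1 tolerance applies to every compared value.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

log = logging.getLogger("agent_telemetry_sketch")

# A context is (metric name, sorted tags). This key shape is an assumption.
Context = Tuple[str, Tuple[str, ...]]


@dataclass
class Report:
    only_in_baseline: List[Context] = field(default_factory=list)
    only_in_comparison: List[Context] = field(default_factory=list)
    value_mismatches: List[Context] = field(default_factory=list)
    within_tolerance: List[Context] = field(default_factory=list)

    def passed(self) -> bool:
        # Differences inside the tolerance warn but do not fail the run.
        return not (
            self.only_in_baseline
            or self.only_in_comparison
            or self.value_mismatches
        )


def compare_contexts(
    baseline: Dict[Context, float],
    comparison: Dict[Context, float],
    tolerance: float = 1.0,
) -> Report:
    report = Report()
    # Context mismatches are reported separately from value mismatches.
    report.only_in_baseline = sorted(baseline.keys() - comparison.keys())
    report.only_in_comparison = sorted(comparison.keys() - baseline.keys())
    for ctx in sorted(baseline.keys() & comparison.keys()):
        delta = abs(baseline[ctx] - comparison[ctx])
        if delta == 0:
            continue
        if delta <= tolerance:
            # Keep in-tolerance deltas visible, as the commit describes.
            log.warning("%s differs by %s (within tolerance)", ctx, delta)
            report.within_tolerance.append(ctx)
        else:
            report.value_mismatches.append(ctx)
    return report
```

Keeping in-tolerance deltas in their own list means a run can pass while still surfacing every context that drifted.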
e440d21 to 4eb2f96
Add a `focus_metrics` field to the correctness test config that, when non-empty, replaces the standard internal-telemetry filter with an allowlist. Only metrics whose names appear in the list are kept for comparison; everything else (including other `datadog.*` metrics and all user DSD traffic) is discarded. This enables correctness tests that validate specific agent-emitted metrics such as `datadog.agent.point.sent` and `datadog.agent.point.dropped`, which would otherwise be stripped by the default filter. Also fix a test reference to `MetricsEndpoint::Series` that was renamed to `MetricsEndpoint::SeriesV1` in #1646.
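
A minimal sketch of the allowlist behavior described above, in Python for illustration. The function names are hypothetical, and `is_internal_telemetry` is only a stand-in: the source does not spell out what the standard internal-telemetry filter matches, so the `datadog.` prefix check is an assumption.

```python
from typing import Callable, Iterable, List, Sequence


def is_internal_telemetry(name: str) -> bool:
    # Assumed stand-in for the standard internal-telemetry filter.
    return name.startswith("datadog.")


def select_metrics(
    names: Iterable[str],
    focus_metrics: Sequence[str] = (),
    is_internal: Callable[[str], bool] = is_internal_telemetry,
) -> List[str]:
    if focus_metrics:
        # Non-empty allowlist replaces the default filter entirely.
        allow = set(focus_metrics)
        return [n for n in names if n in allow]
    # Default behavior: strip internal telemetry, keep everything else.
    return [n for n in names if not is_internal(n)]
```

With `focus_metrics` set to the two point metrics, user DSD metrics and every other `datadog.*` name are dropped; with it empty, the default filter applies unchanged.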
Agent images >= v112974386 already include a built-in `data-plane` s6 service that starts the ADP binary. Copying our own s6-services entry alongside it caused a double-start crash. Only cont-init.d is copied now; the built-in service manages ADP lifecycle.
- Lower start_after from 67s to 37s so the COAT snapshot lands after both sides' first user DSD flush but before their second, giving each side exactly one flush cycle to compare. Both Go and ADP flush every ~15s but are offset by ~4s; t=37s sits safely in the (30s, 44s) window between first and second flushes on each side.
- Lower flush_wait_secs from 90s to 60s to match the shorter window.
- Remove transactions.success from the profile: baseline sends user+internal metrics in one Go payload per flush cycle while comparison splits them across separate ADP and Go payloads, so transaction counts are structurally different by design.
- Remove transactions.errors from the profile: ADP eagerly registers per-error-type counters that have no equivalent in the Go-only baseline, so these contexts will never line up.
- Add dogstatsd_context_expiry_seconds: 120 to keep DSD contexts alive across the test window for consistent flush behavior.
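
The reasoning behind the snapshot timing can be sketched as an interval intersection: the snapshot must fall after the later of the two first flushes and before the earlier of the two second flushes. The Python below is illustrative only. The function names are hypothetical, and the flush times in the usage note are made-up values (15s interval, 4s offset), not measurements from the test.

```python
from typing import Dict, Optional, Tuple


def snapshot_window(
    flushes: Dict[str, Tuple[float, float]],
) -> Optional[Tuple[float, float]]:
    """Return the open interval in which every side has flushed exactly once.

    `flushes` maps a side name to (first_flush, second_flush), in seconds
    since start.
    """
    opens = max(first for first, _ in flushes.values())
    closes = min(second for _, second in flushes.values())
    return (opens, closes) if opens < closes else None


def lands_in_window(
    start_after: float, window: Optional[Tuple[float, float]]
) -> bool:
    return window is not None and window[0] < start_after < window[1]
```

For example, with hypothetical flushes `{"go": (30.0, 45.0), "adp": (26.0, 41.0)}` the window is `(30.0, 41.0)`, so a `start_after` of 37 lands inside it while 67 does not.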
Add a new correctness test that validates the customer-facing `datadog.agent.point.sent` and `datadog.agent.point.dropped` metrics emitted by the Go agent's telemetry check. The test uses the new `focus_metrics` allowlist to discard all user DSD and other internal metrics, comparing only these two names between baseline (Go-only DSD) and comparison (ADP DSD). Both sides run the telemetry check via a mounted `conf.d/telemetry.d/conf.yaml` (not yet bundled in the agent image) and have `data_plane.telemetry_enabled: true` so ADP registers its TelemetryProvider with the Go agent. This allows the telemetry check's `collectMergeMetrics` to pull ADP's `point__sent` gauge from the non-default registry via RAR and sum it with the Go forwarder's contribution, producing a merged `datadog.agent.point.sent` that should match the baseline.
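
The merge this test exercises can be pictured as a per-domain sum of the two contributions. The Python below is a sketch under assumptions, not the telemetry check's actual `collectMergeMetrics` code: `merge_point_metrics`, the sample tuple shape, and the normalized metric names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (metric name, domain, value). This sample shape is an assumption.
Sample = Tuple[str, str, float]

# The two metrics the test focuses on, assumed to be already normalized.
MERGED_METRICS = {"point.sent", "point.dropped"}


def merge_point_metrics(
    core: Iterable[Sample], adp: Iterable[Sample]
) -> Dict[Tuple[str, str], float]:
    merged: Dict[Tuple[str, str], float] = defaultdict(float)
    for source in (core, adp):
        for name, domain, value in source:
            if name in MERGED_METRICS:
                # Same name and domain from either source sum together.
                merged[(name, domain)] += value
    return dict(merged)
```

If the Go forwarder reports 18 points sent for a domain and ADP reports 900 for the same domain, the merged value is 918, which is what a Go-only baseline would be expected to report for equivalent traffic.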
4eb2f96 to d38bf07
## Human Summary

We are working with a handful of design partners to start using the Agent Data Plane (ADP) to handle DogStatsD metrics in customer orgs. This will shift custom metric payloads from the core agent to ADP, which will then forward them along to the Datadog backend.

This will affect the `datadog.agent.point.sent` and `datadog.agent.point.dropped` metrics which are currently sent both to the customer org and to Datadog via Cross-Org Agent Telemetry (COAT). These can be sourced from ADP via the Remote Agent Registry (RAR) and ADP's TelemetryProvider.

However, the sources of these two metrics will not be _entirely_ shifted to ADP. Core Agent will still submit _some_ points on its own from checks and other internal metrics such as `datadog.agent.running`. This presents a new functional requirement where we need to be able to _merge_ both the Core Agent and ADP versions of these metrics before forwarding them along.

This PR does three things:

1. Within the customer-facing `telemetry` check flow, "regular" metrics are now gathered. This includes RAR-sourced metrics from remote agents. Selected regular metrics (currently just these two) are merged into the existing "default" metric set before being forwarded.
2. Within the internal COAT flow (`agenttelemetry`), we perform a similar merge. A big difference here is that `agenttelemetry` is already sourcing "regular" metrics, but previously there were no cases where a single metric came from both the core agent _and_ RAR. Now that's supported and metrics are merged.
3. The COAT versions of these two metrics are updated to include their `domain` and `remote_agent` tags. This will allow us to differentiate Core Agent traffic from ADP traffic for Datadog internal telemetry.

See [DADP-71](https://datadoghq.atlassian.net/jira/software/c/projects/DADP/boards/25544?search=71&selectedIssue=DADP-71) for further rationale.
### Rationale on Required Labels and QA Actions

This is my first Agent PR so it's entirely possible I misunderstood some of the intention behind these, but here's my first pass:

- Applied `changelog/no-changelog`. This change is intended to maintain compatibility with existing behavior when enabling the Agent Data Plane, so user-facing functionality is unaffected. Datadog-internal COAT metrics have some tagging changes, which do not seem changelog-worthy.
- Applied `qa/done`. See this PR on Saluki; behavior was verified using differential tests comparing the Agent with and without Agent Data Plane enabled: DataDog/saluki#1637
- Applied `need-change/agenttelemetry-governance` and commented on existing governance card [ASUP-31](https://datadoghq.atlassian.net/jira/software/c/projects/ASUP/boards/8889/?selectedIssue=ASUP-31) pinging owner @carlosroman

## Agentic Summary

### What does this PR do?

Adds Core Agent support for ADP point telemetry from Remote Agent Registry in both Agent telemetry paths:

- COAT preserves `domain` and `remote_agent` tags for `point.sent` / `point.dropped` by updating the default `logs-and-metrics` profile.
- COAT coalesces compatible metric families gathered from the regular and default telemetry registries before profile aggregation. This prevents duplicate metric families with the same name, such as ADP `point.sent` and Core Agent `point.sent`, from overwriting each other in the Agent telemetry payload map.
- The customer-facing telemetry core check merges allowlisted regular-registry metrics into the existing default telemetry output. Initially this covers `point.sent` and `point.dropped`, grouped by `domain`.
- The customer-facing metric names and tag shape remain compatible:
  - `datadog.agent.point.sent{domain:...}`
  - `datadog.agent.point.dropped{domain:...}`
- If gathering regular/RAR telemetry fails in the customer telemetry path, the check logs a warning and continues with Core Agent values only.
### Motivation

DADP-71: when Agent Data Plane forwards metrics, Core Agent forwarder telemetry alone undercounts `point.sent` and `point.dropped`. ADP exposes equivalent point counts through RAR; Core Agent needs to include them in customer-facing Agent telemetry while preserving customer metric compatibility, and in COAT while retaining `remote_agent` attribution.

### Describe how you validated your changes

- `dda inv install-tools`
- `dda inv test --targets=./pkg/collector/corechecks/telemetry`
- `dda inv test --targets=./comp/core/agenttelemetry/impl`
- `dda inv test --targets=./comp/core/agenttelemetry/impl --test-run-name TestRun`
- Verified `TestCoalescesDefaultAndNoDefaultMetricFamiliesBeforeAggregation` fails without the COAT metric-family coalescing change and passes with it.

### Additional Notes

The customer path intentionally keeps RAR/regular-registry metrics allowlisted. It does not expose arbitrary remote-agent telemetry to customer orgs.

[DADP-71]: https://datadoghq.atlassian.net/browse/DADP-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: jesse.szwedko <jesse.szwedko@datadoghq.com>
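
The overwrite problem that the COAT coalescing step addresses can be illustrated with a small Python sketch. It is not the Agent's Go implementation: the dictionary-based family shape and `coalesce_families` are hypothetical, and raising on a type mismatch is an assumption, since the description above does not say how incompatible families are handled.

```python
from typing import Dict, Iterable, List

# Hypothetical family shape: {"name": str, "type": str, "samples": list}
Family = Dict[str, object]


def coalesce_families(*sources: Iterable[Family]) -> Dict[str, Family]:
    by_name: Dict[str, Family] = {}
    for families in sources:
        for fam in families:
            name = str(fam["name"])
            existing = by_name.get(name)
            if existing is None:
                # Copy the sample list so callers' inputs are not mutated.
                by_name[name] = {
                    "name": name,
                    "type": fam["type"],
                    "samples": list(fam["samples"]),
                }
            elif existing["type"] == fam["type"]:
                # Same name and type: keep both sets of samples instead of
                # letting the later family replace the earlier one.
                existing["samples"].extend(fam["samples"])
            else:
                # Assumed behavior for incompatible families.
                raise ValueError(f"incompatible metric families for {name}")
    return by_name
```

A plain `payload[name] = family` assignment would keep only whichever `point.sent` family was written last; coalescing keeps the Core Agent sample and the ADP sample side by side, each with its own `remote_agent` tag.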
Human Summary / Notes
This PR adds two correctness tests attempting to validate both COAT and customer-facing metrics being emitted from Agent Telemetry; see https://datadoghq.atlassian.net/jira/software/c/projects/DADP/boards/25544?search=71&selectedIssue=DADP-71

I was able to use these to convince myself that the changes in #1638 and DataDog/datadog-agent#50750 successfully get the ADP versions of `datadog.agent.point.sent` and `datadog.agent.point.dropped` working in both COAT and non-COAT paths. However, this PR is not currently mergeable because it relies on an unreleased Agent version with changes from DataDog/datadog-agent#50750. Additionally, the COAT test is very finicky with timing and I haven't figured out how to get it to pass consistently. The main issue there is getting DDA and ADP flush intervals to align when they're both based on 15-second timers from process start and the processes don't start simultaneously.

I hope to return to this and get it merged once we have a released Agent version with the above fixes, but I'm backburnering it for now to focus on getting the functional changes out for the next release code freeze later today.
Agentic Summary
Adds a correctness test for the Datadog Agent's `agenttelemetry` component. The `agenttelemetry` component collects internal Prometheus metrics from the agent and periodically ships them to `/api/v2/apmtelemetry` at `instrumentation-telemetry-intake.<site>`. This endpoint is completely separate from the regular metrics pipeline (`dd_url` has no effect on it), so no existing correctness infrastructure captured it.

The test surfaces behavioral gaps where ADP intercepts agent traffic (DogStatsD, distributions) without updating the corresponding Go forwarder telemetry. The initial failure mode is `point.sent` being ~80× higher on the baseline than on the comparison, because ADP handles DogStatsD forwarding through its own HTTP client and never increments the Go forwarder's `PointCountTelemetry`.

Key changes
datadog-intake
- `agent_telemetry` module: `POST /api/v2/apmtelemetry` stores raw JSON payloads; `GET /agent-telemetry/dump` returns them for analysis.
- Self-signed certificate via `rcgen`; a TCP-level HTTPS proxy on port 2050 decrypts and forwards to the existing HTTP intake on port 2049. Required because the agent's `agenttelemetry` sender hardcodes `https://` regardless of the configured URL scheme.

panoramic
- `AnalysisMode::AgentTelemetry`: compares `(metric_name, tags) → value` contexts between baseline and comparison. Handles both `agent-metrics` and `message-batch` envelope types. Reports context mismatches and value mismatches separately.
- ±1 gauge tolerance: the sole source of residual between-run variance is `datadog.agent.running`, which is appended unconditionally to every 15-second aggregator flush in `pkg/aggregator/aggregator.go:appendDefaultSeries`. Whether its last firing falls just before or just after the `start_after: 67` snapshot boundary produces a ±1 point swing. There is no agent config to disable it; it is hardcoded. Within a single run both agents start at the same second and always hit the same flush count, so the intra-run comparison is always exact.
- Per-test `flush_wait_secs` config field (default: 32s). The agent-telemetry test uses 90s to give the `agenttelemetry` component time to fire after all traffic is flushed.
- `CollectedData` extended with agent telemetry payload collection.

test/correctness/agent-telemetry
- `agenttelemetry` profile scoped to forwarder metrics, routed to `https://datadog-intake:2050` with `skip_ssl_validation: true`.
- `iterations: 1`, `start_after: 67`: chosen to land between 15-second aggregator flush boundaries.
- `apm_config.enabled: false`: removes ~60 internal trace-agent DogStatsD contexts (`datadog.trace_agent.*`, `datadog.dogstatsd.client.*`) from every flush bucket, reducing per-bucket context count from 160→101.
- Count-only millstone traffic to eliminate `sketches_v2` as a confounding variable.

Agent telemetry metrics observed
All contexts present on both baseline and comparison unless noted. Profiling is scoped to forwarder metrics only.
- `point.sent`: […] `PointCountTelemetry` for those points.
- `point.dropped`
- `transactions.input_count`: […] `sketches_v2` (ADP intercepts distributions before they reach Go forwarder).
- `transactions.dropped`: […] `process.datadoghq.com`.
- `transactions.success`: `domain`, `endpoint`
- `transactions.http_errors`: `code`, `endpoint`

Series flush structure (baseline, with trace agent disabled)
Each flush bucket contains 101 contexts: 100 from millstone count metrics (with zero-value continuation after first bucket) + 1 from `ntp.offset` or similar. The `datadog.agent.running` metric fires at every 15-second flush tick with its exact wall-clock timestamp (not DSD-bucket-aligned), producing 3–4 single-context off-boundary timestamps per run.

Test plan

Expected: test fails with `point.sent` value mismatch (baseline ~918, comparison ~11) until ADP is updated to route DogStatsD point accounting through the Go forwarder's `PointCountTelemetry`.