[CXP-2637] Baseline process_config.run_in_core_agent.enabled on Linux#47902
Go Package Import Differences
Baseline: eb09c31
Files inventory check summary
File checks results against ancestor 49cc033d:
Results for datadog-agent_7.78.0~devel.git.681.0c9dd46.pipeline.103220928-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from static quality gates
Successful checks
Info
9 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector Results
Metrics dashboard
Baseline: bbb6db2
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | docker_containers_cpu | % cpu utilization | +6.91 | [+3.78, +10.03] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | docker_containers_cpu | % cpu utilization | +6.91 | [+3.78, +10.03] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.54 | [+0.49, +0.59] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.43 | [+0.34, +0.53] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.30 | [+0.16, +0.44] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.07 | [+0.04, +0.11] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.03 | [-0.47, +0.52] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.01 | [-0.39, +0.40] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.22, +0.23] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.21, +0.18] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.10, +0.06] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.02 | [-0.10, +0.05] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.47, +0.38] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.05 | [-0.23, +0.13] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.08 | [-0.14, -0.03] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.14 | [-0.20, -0.08] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.18 | [-0.33, -0.02] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.25 | [-0.31, -0.20] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.28 | [-0.51, -0.06] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.30 | [-0.46, -0.13] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.05 | [-1.29, -0.82] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -1.62 | [-1.76, -1.49] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -2.65 | [-4.20, -1.11] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 575 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 270.76MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 691 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.42MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 489.14MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 204.17MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 355.04 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 398.29MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
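The three criteria above can be sketched as a small predicate. This is only an illustration, not the detector's actual code: the `Experiment` type, its field names, and `IsRegression` are hypothetical, and the `Erratic` values in `main` are assumptions, since the report does not say which experiments are marked erratic.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the per-experiment numbers shown in the report tables.
// The type and its field names are hypothetical, chosen for this sketch.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90% confidence interval
	CIHigh       float64 // upper bound of the 90% confidence interval
	Erratic      bool    // assumed flag: the experiment's config marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold stated in the report.
const effectSizeTolerance = 5.0

// IsRegression returns true only when all three criteria hold:
// the effect is large enough, the CI excludes zero, and the
// experiment is not marked erratic.
func IsRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// Numbers are taken from the table; Erratic=false is an assumption.
	exps := []Experiment{
		{"docker_containers_cpu", 6.91, 3.78, 10.03, false},
		{"quality_gate_logs", -2.65, -4.20, -1.11, false},
		{"file_to_blackhole_0ms_latency", 0.03, -0.47, 0.52, false},
	}
	for _, e := range exps {
		fmt.Printf("%s: regression=%v\n", e.Name, IsRegression(e))
	}
}
```

Under these assumptions, quality_gate_logs is not flagged even though its interval excludes zero, because its effect size is below the 5.00% tolerance.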
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
davidor left a comment
👍 for files owned by container-platform (just 1 file)
run_in_core_agent:
  enabled: false
I'm guessing there was a reason it was set to false, so just removing it would make the test fail
If you can enable a config which makes the process-agent run on Linux that should fix it
I added the NPM config to turn on the process-agent and that did fix it.
Force-pushed from fefe0f2 to 0f39319
On Linux, process checks (process, container, RT container, process discovery) now always run in the core agent. The `process_config.run_in_core_agent.enabled` config key is removed entirely — the behavior is no longer optional. A build-tag-based platform helper `ProcessChecksRunInCoreAgent()` in `pkg/process/util/` returns `true` on Linux and `false` on other platforms, preserving non-Linux behavior unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the helper from pkg/process/util (which transitively pulls in go4.org/netipx) to a new zero-dependency leaf package pkg/process/util/coreagent. This avoids adding unnecessary imports to binaries like cluster-agent-cloudfoundry and otel-agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
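A minimal sketch of the helper's contract, for illustration only. The commits describe a build-tag-based helper, which would normally be split across platform-specific files; to stay in a single runnable file, this sketch swaps in a `runtime.GOOS` comparison instead. The `processChecksRunInCoreAgentFor` function is a hypothetical addition that makes the rule easy to exercise on any platform.

```go
package main

import (
	"fmt"
	"runtime"
)

// processChecksRunInCoreAgentFor is a hypothetical, testable form of the rule:
// process checks run in the core agent on Linux and nowhere else.
func processChecksRunInCoreAgentFor(goos string) bool {
	return goos == "linux"
}

// ProcessChecksRunInCoreAgent mirrors the contract described in the commit
// message (true on Linux, false on other platforms). The PR uses build tags;
// this sketch uses runtime.GOOS as a stand-in.
func ProcessChecksRunInCoreAgent() bool {
	return processChecksRunInCoreAgentFor(runtime.GOOS)
}

func main() {
	fmt.Printf("GOOS=%s runInCoreAgent=%v\n", runtime.GOOS, ProcessChecksRunInCoreAgent())
}
```

Because the function has no imports beyond the standard library, it fits the stated goal of a leaf package that does not pull extra dependencies into other binaries.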
On Linux, the process-agent no longer runs standalone (process checks are in the core agent). Update E2E and unit tests that expected the process-agent to be running:
- Remove process-agent from auth_artifact agentProcesses list on Linux
- Remove WithProcessAgentOnPort from Linux IPC and config-refresh tests
- Make assertAgentsUseKey platform-aware (includeProcessAgent param)
- Make container/RT container check tests platform-aware using coreagent.ProcessChecksRunInCoreAgent()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback: instead of removing process-agent from tests, enable NPM (network_config.enabled + system_probe_config.enabled) in the IPC, auth_artifact, and config-refresh test fixtures so that the process-agent starts on Linux for connections check. This preserves IPC auth coverage for process-agent. Reverts the assertAgentsUseKey signature change — no longer needed since process-agent is now running in all test environments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The "Process Agent Enabled" and "Process Agent Enabled, Alternate Setting" test cases tested the profiler timeout when process checks run in a standalone process-agent. This scenario no longer exists on Linux, and the "Process Agent Checks in Core Agent" test already covers the current behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update assertRunningChecks to combine enabled checks from both ProcessAgentStatus and ProcessComponentStatus, since process checks now run in the core agent on Linux while connections runs in the standalone process-agent.
- Fix K8s test assertions to check ProcessComponentStatus for process/discovery checks.
- Move NPM config from datadog.yaml to system-probe.yaml via WithSystemProbeConfig in IPC, auth_artifact, and config-refresh Linux tests so the process-agent starts for IPC testing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
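The idea of combining enabled checks from two status sources can be sketched as follows. This is not the test helper from the PR: `combineEnabledChecks` and its parameters are hypothetical, and the check names in `main` are illustrative values based on the checks named in the commit messages.

```go
package main

import (
	"fmt"
	"sort"
)

// combineEnabledChecks merges the check names reported by the standalone
// process-agent and by the core agent's process component, dropping
// duplicates and returning a sorted list for stable comparison.
// The function name and signature are hypothetical.
func combineEnabledChecks(processAgentChecks, coreAgentChecks []string) []string {
	seen := make(map[string]struct{})
	combined := []string{}
	for _, list := range [][]string{processAgentChecks, coreAgentChecks} {
		for _, name := range list {
			if _, ok := seen[name]; ok {
				continue
			}
			seen[name] = struct{}{}
			combined = append(combined, name)
		}
	}
	sort.Strings(combined)
	return combined
}

func main() {
	// Illustrative split: connections in the process-agent,
	// process and container checks in the core agent.
	fromProcessAgent := []string{"connections"}
	fromCoreAgent := []string{"process", "container"}
	fmt.Println(combineEnabledChecks(fromProcessAgent, fromCoreAgent))
}
```

Sorting the merged list lets a test compare it against an expected list without caring which agent reported each check.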
- Remove shouldStayAlive() from process-agent: the process-agent no longer needs to stay alive idle in K8s to prevent crash loops. When disabled, it exits immediately.
- Remove duplicate E2E tests that existed only to test the now-removed core agent toggle:
  - docker: TestProcessChecksInCoreAgentWithNPM
  - linux: TestProcessChecksInCoreAgent, TestProcessChecksInCoreAgentWithNPM
  - k8s: entire K8sCoreAgentSuite (TestProcessCheckInCoreAgent)
- Move K8s NPM test to K8sSuite.TestProcessCheckWithNPM (unique test).
- Fix assertRunningChecks expected values for remaining tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Helm chart deploys a process-agent container regardless. Without shouldStayAlive, the process-agent exits immediately when it has no checks to run, causing CrashLoopBackOff in K8s. Simplified from the original: no longer checks the removed config key, just checks env.IsKubernetes(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Helm chart's doNotCheckTag logic prevents automatic detection of agent version >=7.60, causing it to deploy the process-agent container even when process checks run in the core agent. shouldStayAlive keeps this container from crash-looping. Changed from Warn to Info since this is expected behavior, not a misconfiguration. Added TODO to remove once the chart is updated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
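The stay-alive behavior described in the two commits above can be sketched like this. It is an illustration under assumptions: the `isKubernetes` parameter stands in for `env.IsKubernetes()`, and `nextAction` is a hypothetical wrapper showing where the decision would sit when the process-agent has no checks to run.

```go
package main

import "fmt"

// shouldStayAlive reports whether an idle process-agent should keep running.
// Per the commit message, the simplified version only checks whether the
// agent runs in Kubernetes; isKubernetes is injected here as a stand-in
// for env.IsKubernetes().
func shouldStayAlive(isKubernetes func() bool) bool {
	return isKubernetes()
}

// nextAction is a hypothetical helper describing what the process-agent
// does at startup: run its checks, idle to avoid CrashLoopBackOff, or exit.
func nextAction(enabledChecks []string, isKubernetes func() bool) string {
	if len(enabledChecks) > 0 {
		return "run"
	}
	if shouldStayAlive(isKubernetes) {
		return "idle"
	}
	return "exit"
}

func main() {
	inK8s := func() bool { return true }
	onHost := func() bool { return false }

	fmt.Println(nextAction([]string{"connections"}, onHost)) // checks enabled
	fmt.Println(nextAction(nil, inK8s))                      // no checks, in Kubernetes
	fmt.Println(nextAction(nil, onHost))                     // no checks, outside Kubernetes
}
```

Injecting the Kubernetes detection as a function keeps the sketch independent of any environment lookup and makes each branch easy to exercise.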
On Linux, process checks run in the core agent. Update manual check tests to use `datadog-agent processchecks` / `agent processchecks` instead of `process-agent check`:
- linux_test.go: 3 tests updated, removed duplicate TestManualProcessCheckCoreAgent
- docker_test.go: 3 tests updated
- k8s_test.go: execProcessAgentCheck helper updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 0f39319 to 0c9dd46
// in the core agent. The process-agent stays alive idle to prevent the container from crash-looping.
// TODO: remove this once the Helm chart no longer deploys the process-agent container when
// runInCoreAgent is enabled (the chart's doNotCheckTag logic prevents automatic detection).
func shouldStayAlive() bool {
I wanted to remove this, but we cannot quite yet. The Helm chart auto-detects agent >=7.60 to skip deploying the process-agent container. But the E2E tests set doNotCheckTag: true in the chart, which disables the version check, so the chart falls through to deploying the process-agent container even though it's not needed. This causes issues for some tests that check liveness of containers. We need some changes on the chart side to remove this.
run_in_core_agent:
  enabled: false
Merged commit 15b3097 into main
Summary
- Remove the `process_config.run_in_core_agent.enabled` config key entirely. On Linux, process checks (process, container, RT container, process discovery) now always run in the core agent, with no config toggle.
- Add a build-tag-based platform helper `ProcessChecksRunInCoreAgent()` in `pkg/process/util/` that returns `true` on Linux and `false` on other platforms, preserving non-Linux behavior unchanged.
Test plan
- `pkg/process/checks`, `comp/process/agent`, `comp/core/profiler/impl`, `pkg/config/setup`, `pkg/process/metadata/workloadmeta/collector`
🤖 Generated with Claude Code