[Backport 7.79.x] [autoscaling] Re-sync local-owner DPAs on annotation changes#51243
### What does this PR do?
Fixes a bug in the workload autoscaling controller where annotation-only edits on a
locally-owned `DatadogPodAutoscaler` were silently ignored until the next `.spec`
change or cluster-agent restart.
The Local-owner branch in `syncPodAutoscaler` previously gated `UpdateFromPodAutoscaler`
on `.metadata.generation` changes. Annotation edits don't bump `.metadata.generation`,
so adding/changing `autoscaling.datadoghq.com/preview` (or any other annotation the
controller reads) did nothing — `IsBurstable()` and the parsed `previewOptions` stayed
stale, and the burstable CPU-limit removal never took effect.
The gate is replaced by `NeedsResyncFromPodAutoscaler`, which compares:
- `.metadata.generation`
- the profile label (`autoscaling.datadoghq.com/profile`)
- every annotation consumed by `UpdateFromPodAutoscaler` (`preview`, `profile-template-hash`, `custom-recommender`)
The list of relevant annotation keys (`resyncMetadataKeysFromPodAutoscaler`) lives next
to `UpdateFromPodAutoscaler` with an explicit comment to keep the two in sync.
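The comparison described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: `objectMeta`, `needsResync`, and `resyncAnnotationKeys` are hypothetical stand-ins, and the full keys for `profile-template-hash` and `custom-recommender` are assumed to share the `autoscaling.datadoghq.com/` prefix.

```go
package main

import "fmt"

// objectMeta is a hypothetical, trimmed-down stand-in for the object
// metadata the controller inspects; it is not the real Kubernetes type.
type objectMeta struct {
	Generation  int64
	Labels      map[string]string
	Annotations map[string]string
}

const profileLabelKey = "autoscaling.datadoghq.com/profile"

// resyncAnnotationKeys mirrors the annotations named in the description.
// Assumption: all three keys share the autoscaling.datadoghq.com/ prefix.
var resyncAnnotationKeys = []string{
	"autoscaling.datadoghq.com/preview",
	"autoscaling.datadoghq.com/profile-template-hash",
	"autoscaling.datadoghq.com/custom-recommender",
}

// needsResync reports whether the incoming metadata differs from the cached
// copy in any field the controller reads. A nil cache always needs a resync.
func needsResync(cached *objectMeta, incoming objectMeta) bool {
	if cached == nil {
		return true
	}
	if cached.Generation != incoming.Generation {
		return true
	}
	if cached.Labels[profileLabelKey] != incoming.Labels[profileLabelKey] {
		return true
	}
	for _, key := range resyncAnnotationKeys {
		oldVal, oldOK := cached.Annotations[key]
		newVal, newOK := incoming.Annotations[key]
		// Added, removed, or changed annotations all count as a difference.
		if oldOK != newOK || oldVal != newVal {
			return true
		}
	}
	return false
}

func main() {
	cached := &objectMeta{Generation: 1}
	withPreview := objectMeta{Generation: 1, Annotations: map[string]string{
		"autoscaling.datadoghq.com/preview": `{"burstable":true}`,
	}}
	unrelated := objectMeta{Generation: 1, Annotations: map[string]string{
		"example.com/note": "x",
	}}
	fmt.Println(needsResync(cached, withPreview)) // true: relevant annotation added
	fmt.Println(needsResync(cached, unrelated))   // false: irrelevant annotation
}
```

Keeping the key list in one slice next to the parsing code means a newly consumed annotation only has to be added in one place for the gate to pick it up.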
### Motivation
Customer report: a `DatadogPodAutoscaler` with `owner: Local` had the burstable preview
annotation added post-creation, but the workload's CPU limits were never stripped. The root
cause was traced to the generation-only gate in the Local-owner branch of `controller.go`:
the parsing in `UpdateFromPodAutoscaler` was never re-invoked after the annotation appeared.
### Describe how you validated your changes
**Unit tests** (`pkg/clusteragent/autoscaling/workload/model/`):
- New `TestNeedsResyncFromPodAutoscaler` with 10 sub-tests covering: no cached upstream, identical object (no resync), generation bump, preview annotation added/changed/removed, profile label change, profile-template-hash change, custom-recommender change, and an irrelevant annotation that must NOT trigger a resync.
**Full suite** (`./pkg/clusteragent/autoscaling/...`): all 14 packages green.
**End-to-end on a kind cluster** with a cluster-agent built from this branch:
- Created a Local DPA with no preview annotation, observed initial state.
- Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview='{"burstable":true}'`; the controller logged a re-sync at generation 1, with no spec change.
- Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview-` to remove it — the controller logged a re-sync again at generation 1.
- Confirmed `previewOptions.Burstable` flipped on both transitions.
Before this change, the same scenario produced no re-sync log and `IsBurstable()` stayed false.
### Additional Notes
- Workaround for customers on an affected agent: `kubectl -n datadog rollout restart deploy/datadog-cluster-agent`, or any `.spec` edit to force a generation bump.
- The `agent autoscaler-list` CLI prints `Burstable: false` for all DPAs regardless of state because `previewOptions` is unexported and stripped during IPC serialization; that is a separate issue tracked outside this PR.
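The CLI behaviour in the last note is consistent with how Go serializers treat unexported fields. The sketch below is a hypothetical illustration, not the agent's code: it assumes a JSON round-trip (the actual IPC encoding is not stated in this PR), and `autoscalerInternal` and `roundTrip` are invented names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// previewOptions and autoscalerInternal are hypothetical stand-ins for the
// controller's internal model.
type previewOptions struct {
	Burstable bool
}

type autoscalerInternal struct {
	Name string
	// Unexported, so encoding/json skips it on both Marshal and Unmarshal.
	previewOptions previewOptions
}

// IsBurstable reads the unexported field, as a status printer might.
func (a autoscalerInternal) IsBurstable() bool {
	return a.previewOptions.Burstable
}

// roundTrip simulates sending the value across a serialization boundary.
// Assumption: JSON is used here purely for illustration.
func roundTrip(in autoscalerInternal) (autoscalerInternal, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return autoscalerInternal{}, err
	}
	var out autoscalerInternal
	if err := json.Unmarshal(raw, &out); err != nil {
		return autoscalerInternal{}, err
	}
	return out, nil
}

func main() {
	in := autoscalerInternal{Name: "dpa", previewOptions: previewOptions{Burstable: true}}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	// The unexported field is lost, so the receiving side sees the zero value.
	fmt.Println("before:", in.IsBurstable(), "after:", out.IsBurstable())
	// prints "before: true after: false"
}
```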
🤖 Assisted by Claude:claude-opus-4-7
Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>
(cherry picked from commit 8dde25f)
___
Co-authored-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com>
Files inventory check summary
File checks results against ancestor 2f261afb:
Results for datadog-agent_7.79.0.git.7.3790ad0.pipeline.114807175-1_amd64.deb: No change detected
Regression Detector Results
Baseline: 2f261af
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.94 | [-0.14, +6.01] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +2.94 | [-0.14, +6.01] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +1.41 | [-0.22, +3.04] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.39 | [+0.30, +0.48] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.38 | [+0.32, +0.45] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.38 | [+0.14, +0.61] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | +0.30 | [+0.23, +0.37] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.24 | [+0.20, +0.27] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.22 | [+0.17, +0.27] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.19 | [+0.01, +0.37] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.10 | [-0.08, +0.29] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.05 | [-0.10, +0.20] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.02 | [-0.17, +0.21] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.01 | [-0.10, +0.13] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.08, +0.10] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.00 | [-0.08, +0.08] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.19, +0.19] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.46, +0.43] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.03 | [-0.50, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.06 | [-0.45, +0.33] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.06 | [-0.22, +0.09] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.22 | [-0.37, -0.07] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.38 | [-0.43, -0.32] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.40 | [-0.62, -0.18] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 721 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 279.77MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 724 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.24GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.22GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.98MiB ≤ 181MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 4 = 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 495.29MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 208.24MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 341.32 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 418.72MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Backport 8dde25f from #51173.