[autoscaling] Re-sync local-owner DPAs on annotation changes#51173
Conversation
|
ee50d0e to
3a35e1d
Compare
The Local-owner sync gate in syncPodAutoscaler only invoked UpdateFromPodAutoscaler when .metadata.generation changed. Annotation edits do not bump generation, so adding or changing autoscaling.datadoghq.com/preview on an existing DPA was silently ignored until the next .spec edit or cluster-agent restart — leaving IsBurstable() and the parsed previewOptions stale and preventing the burstable CPU-limit removal from taking effect. Cache a murmur3 fingerprint of the labels/annotations consumed by UpdateFromPodAutoscaler on PodAutoscalerInternal (metadataHash, exposed via MetadataHash()). Expose ComputePodAutoscalerMetadataHash for the controller to compute the same fingerprint on the incoming K8s object. The Local-owner branch now triggers a re-sync when either the generation or the metadata hash differs, so the gate no longer depends on holding a reference to the cached K8s object. The watched label/annotation key lists (preview, profile-template-hash, custom-recommender) live next to UpdateFromPodAutoscaler so future additions stay in sync with the hash. Assisted-by: Claude:claude-opus-4-7
3a35e1d to
a59562b
Compare
Files inventory check summaryFile checks results against ancestor b7c4f46b: Results for datadog-agent_7.81.0~devel.git.149.a9300a7.pipeline.114774855-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
31 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 8e51ad9 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.98 | [-3.93, +1.97] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +1.90 | [+1.72, +2.09] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.78 | [+0.71, +0.85] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.09 | [-0.11, +0.29] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.07 | [-0.16, +0.31] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.03 | [-0.17, +0.22] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.02 | [-0.37, +0.42] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.02 | [-0.07, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.18, +0.22] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.01 | [-0.09, +0.10] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.00 | [-0.45, +0.45] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.00 | [-0.12, +0.12] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.09, +0.08] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.01 | [-0.48, +0.47] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.01 | [-0.18, +0.15] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.04 | [-0.22, +0.14] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.06 | [-0.11, -0.02] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.13 | [-0.18, -0.08] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.16 | [-0.20, -0.11] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.24 | [-0.29, -0.18] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.47 | [-0.63, -0.31] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.98 | [-3.93, +1.97] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.62 | [-1.86, -1.38] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -2.00 | [-3.01, -1.00] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 715 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 245.93MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 741 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 142.33MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 740.44KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 423.57MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.11MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.87MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.35MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 346.64 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 383.05MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.94GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
HemeryJu
left a comment
There was a problem hiding this comment.
Looks good to me, approving now to unblock but I agree with Vincent comment!
c037467 to
a9300a7
Compare
…n changes (#51244) Backport 8dde25f from #51173. ___ ### What does this PR do? Fixes a bug in the workload autoscaling controller where annotation-only edits on a locally-owned `DatadogPodAutoscaler` were silently ignored until the next `.spec` change or cluster-agent restart. The Local-owner branch in `syncPodAutoscaler` previously gated `UpdateFromPodAutoscaler` on `.metadata.generation` changes. Annotation edits don't bump `.metadata.generation`, so adding/changing `autoscaling.datadoghq.com/preview` (or any other annotation the controller reads) did nothing — `IsBurstable()` and the parsed `previewOptions` stayed stale, and the burstable CPU-limit removal never took effect. The gate is replaced by `NeedsResyncFromPodAutoscaler`, which compares: - `.metadata.generation` - the profile label (`autoscaling.datadoghq.com/profile`) - every annotation consumed by `UpdateFromPodAutoscaler` (`preview`, `profile-template-hash`, `custom-recommender`) The list of relevant annotation keys (`resyncMetadataKeysFromPodAutoscaler`) lives next to `UpdateFromPodAutoscaler` with an explicit comment to keep the two in sync. ### Motivation Customer report: a `DatadogPodAutoscaler` with `owner: Local` had the burstable preview annotation added post-creation, but the workload's CPU limits were never stripped. Root cause traced to the generation-only gate at `controller.go` Local-owner branch — the parsing in `UpdateFromPodAutoscaler` was never re-invoked after the annotation appeared. ### Describe how you validated your changes **Unit tests** (`pkg/clusteragent/autoscaling/workload/model/`): - New `TestNeedsResyncFromPodAutoscaler` with 10 sub-tests covering: no cached upstream, identical object (no resync), generation bump, preview annotation added/changed/removed, profile label change, profile-template-hash change, custom-recommender change, and an irrelevant annotation that must NOT trigger a resync. **Full suite** (`./pkg/clusteragent/autoscaling/...`): all 14 packages green. **End-to-end on a kind cluster** with a cluster-agent built from this branch: - Created a Local DPA with no preview annotation, observed initial state. - Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview='{\"burstable\":true}'` — the controller logged a re-sync at generation 1, no spec change. - Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview-` to remove it — the controller logged a re-sync again at generation 1. - Confirmed `previewOptions.Burstable` flipped on both transitions. Before this change, the same scenario produced no re-sync log and `IsBurstable()` stayed false. ### Additional Notes - Workaround for customers on an affected agent: `kubectl -n datadog rollout restart deploy/datadog-cluster-agent`, or any `.spec` edit to force a generation bump. - The `agent autoscaler-list` CLI prints `Burstable: false` for all DPAs regardless of state because `previewOptions` is unexported and stripped during IPC serialization; that is a separate issue tracked outside this PR. 🤖 Assisted by Claude:claude-opus-4-7 Co-authored-by: florent.clarret <florent.clarret@datadoghq.com>
…n changes (#51243) Backport 8dde25f from #51173. ___ ### What does this PR do? Fixes a bug in the workload autoscaling controller where annotation-only edits on a locally-owned `DatadogPodAutoscaler` were silently ignored until the next `.spec` change or cluster-agent restart. The Local-owner branch in `syncPodAutoscaler` previously gated `UpdateFromPodAutoscaler` on `.metadata.generation` changes. Annotation edits don't bump `.metadata.generation`, so adding/changing `autoscaling.datadoghq.com/preview` (or any other annotation the controller reads) did nothing — `IsBurstable()` and the parsed `previewOptions` stayed stale, and the burstable CPU-limit removal never took effect. The gate is replaced by `NeedsResyncFromPodAutoscaler`, which compares: - `.metadata.generation` - the profile label (`autoscaling.datadoghq.com/profile`) - every annotation consumed by `UpdateFromPodAutoscaler` (`preview`, `profile-template-hash`, `custom-recommender`) The list of relevant annotation keys (`resyncMetadataKeysFromPodAutoscaler`) lives next to `UpdateFromPodAutoscaler` with an explicit comment to keep the two in sync. ### Motivation Customer report: a `DatadogPodAutoscaler` with `owner: Local` had the burstable preview annotation added post-creation, but the workload's CPU limits were never stripped. Root cause traced to the generation-only gate at `controller.go` Local-owner branch — the parsing in `UpdateFromPodAutoscaler` was never re-invoked after the annotation appeared. ### Describe how you validated your changes **Unit tests** (`pkg/clusteragent/autoscaling/workload/model/`): - New `TestNeedsResyncFromPodAutoscaler` with 10 sub-tests covering: no cached upstream, identical object (no resync), generation bump, preview annotation added/changed/removed, profile label change, profile-template-hash change, custom-recommender change, and an irrelevant annotation that must NOT trigger a resync. **Full suite** (`./pkg/clusteragent/autoscaling/...`): all 14 packages green. **End-to-end on a kind cluster** with a cluster-agent built from this branch: - Created a Local DPA with no preview annotation, observed initial state. - Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview='{\"burstable\":true}'` — the controller logged a re-sync at generation 1, no spec change. - Ran `kubectl annotate dpa ... autoscaling.datadoghq.com/preview-` to remove it — the controller logged a re-sync again at generation 1. - Confirmed `previewOptions.Burstable` flipped on both transitions. Before this change, the same scenario produced no re-sync log and `IsBurstable()` stayed false. ### Additional Notes - Workaround for customers on an affected agent: `kubectl -n datadog rollout restart deploy/datadog-cluster-agent`, or any `.spec` edit to force a generation bump. - The `agent autoscaler-list` CLI prints `Burstable: false` for all DPAs regardless of state because `previewOptions` is unexported and stripped during IPC serialization; that is a separate issue tracked outside this PR. 🤖 Assisted by Claude:claude-opus-4-7 Co-authored-by: florent.clarret <florent.clarret@datadoghq.com>
What does this PR do?
Fixes a bug in the workload autoscaling controller where annotation-only edits on a
locally-owned
DatadogPodAutoscalerwere silently ignored until the next.specchange or cluster-agent restart.
The Local-owner branch in
syncPodAutoscalerpreviously gatedUpdateFromPodAutoscaleron
.metadata.generationchanges. Annotation edits don't bump.metadata.generation,so adding/changing
autoscaling.datadoghq.com/preview(or any other annotation thecontroller reads) did nothing —
IsBurstable()and the parsedpreviewOptionsstayedstale, and the burstable CPU-limit removal never took effect.
The gate is replaced by
NeedsResyncFromPodAutoscaler, which compares:.metadata.generationautoscaling.datadoghq.com/profile)UpdateFromPodAutoscaler(preview,profile-template-hash,custom-recommender)The list of relevant annotation keys (
resyncMetadataKeysFromPodAutoscaler) lives nextto
UpdateFromPodAutoscalerwith an explicit comment to keep the two in sync.Motivation
Customer report: a
DatadogPodAutoscalerwithowner: Localhad the burstable previewannotation added post-creation, but the workload's CPU limits were never stripped. Root
cause traced to the generation-only gate at
controller.goLocal-owner branch — theparsing in
UpdateFromPodAutoscalerwas never re-invoked after the annotation appeared.Describe how you validated your changes
Unit tests (
pkg/clusteragent/autoscaling/workload/model/):TestNeedsResyncFromPodAutoscalerwith 10 sub-tests covering: no cached upstream, identical object (no resync), generation bump, preview annotation added/changed/removed, profile label change, profile-template-hash change, custom-recommender change, and an irrelevant annotation that must NOT trigger a resync.Full suite (
./pkg/clusteragent/autoscaling/...): all 14 packages green.End-to-end on a kind cluster with a cluster-agent built from this branch:
kubectl annotate dpa ... autoscaling.datadoghq.com/preview='{\"burstable\":true}'— the controller logged a re-sync at generation 1, no spec change.kubectl annotate dpa ... autoscaling.datadoghq.com/preview-to remove it — the controller logged a re-sync again at generation 1.previewOptions.Burstableflipped on both transitions.Before this change, the same scenario produced no re-sync log and
IsBurstable()stayed false.Additional Notes
kubectl -n datadog rollout restart deploy/datadog-cluster-agent, or any.specedit to force a generation bump.agent autoscaler-listCLI printsBurstable: falsefor all DPAs regardless of state becausepreviewOptionsis unexported and stripped during IPC serialization; that is a separate issue tracked outside this PR.🤖 Assisted by Claude:claude-opus-4-7