Alertmanager CPU leak on single gossip cluster member (similar to #4868)

### What did you do?

- Upgraded our Alertmanager deployment on Kubernetes (Amazon EKS) to use the latest prometheus-community Helm chart `1.35.1` (corresponding to the latest upstream Alertmanager version `v0.32.1`)
- Configured Alertmanager in HA mode with multiple replicas using gossip clustering (`--cluster.peer`)
- Sent alerts from multiple producers to all Alertmanager replicas
- Allowed the system to run under steady/normal alert load over time

### What did you expect to see?

- Even distribution of CPU and workload across all Alertmanager replicas
- CPU usage remaining stable over time for each replica, proportional to alert volume
- No individual node becoming a bottleneck or degrading independently
- No restarts or loss of responsiveness under normal operating conditions

### What did you see instead? Under which circumstances?

- A single replica in the gossip cluster consumes steadily more CPU over time, and the usage never drops back down
- The high CPU usage isn't related to a spike in incoming alerts or to posting to downstream receivers
- The increase appears unbounded (monotonic growth) until:
  - The instance reaches ~100% CPU utilization at the node level
  - The instance becomes unresponsive
  - Kubernetes sends `SIGTERM` to the pod after liveness probe failures
  - The pod restarts, freeing up CPU, before the growth starts again

Other replicas in the same cluster on identical nodes, receiving identical alerts:
- Continue operating normally
- Show significantly lower and stable CPU usage
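
For anyone trying to check whether they are hitting the same pattern, here is a minimal sketch of how the "one replica grows, the others stay flat" symptom could be detected from exported CPU samples. It is not part of Alertmanager: the function names, the `min_excess` threshold, and the replica names and sample values in the usage note are all hypothetical and purely illustrative, not measurements from our clusters. It fits a least-squares slope to each replica's series and flags replicas whose slope is well above the median slope of the cluster.

```python
from statistics import mean, median


def cpu_slope(samples):
    """Least-squares slope of CPU usage per sample interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def find_growing_replicas(series, min_excess=0.01):
    """Return replicas whose CPU slope exceeds the cluster median slope
    by more than min_excess (a hypothetical threshold, tune as needed)."""
    slopes = {name: cpu_slope(values) for name, values in series.items()}
    baseline = median(slopes.values())
    return sorted(
        name for name, slope in slopes.items() if slope - baseline > min_excess
    )
```

As a usage example, passing `{"am-0": [0.10, 0.15, 0.22, 0.30, 0.41, 0.55], "am-1": [0.10, 0.11, 0.10, 0.12, 0.11, 0.10], "am-2": [0.10, 0.10, 0.11, 0.10, 0.10, 0.11]}` (made-up values in CPU cores) flags only `am-0`. Comparing against the cluster median rather than a fixed slope avoids flagging every replica when alert load rises for the whole cluster.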

This behavior appears very similar to the previously reported issue:
- [Alertmanager issue #4868](https://github.com/prometheus/alertmanager/issues/4868)

However, this is occurring on the latest version, so I'm unsure whether this is the same symptom with a different cause or a case the fix did not account for.

Here's the CPU usage on that particular node:
<img width="1739" height="504" alt="Image" src="https://github.com/user-attachments/assets/1ac3d0ad-07c2-431d-964e-ea4fdae6384a" />

Memory usage follows a similar growth pattern, but CPU is what gets exhausted first in our case.

Lastly, we run many Alertmanager gossip clusters in our environment, all on this same chart and upstream version, with similar Alertmanager configurations that differ only in their gossip peers. This issue does **not** appear uniformly across all clusters, so a specific condition may be triggering it.

Please let me know if there's any other information you'd like me to pull. I'll be continuing to investigate and posting anything I find. Thank you!
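
One piece of data that would likely help is a CPU profile from the affected replica while its usage is elevated. Below is a minimal sketch of how one could be pulled, assuming the replica serves Go's standard `net/http/pprof` handlers under `/debug/pprof` on its web port. The base URL, port `9093`, the output filename, and both function names are assumptions for illustration, not values from our deployment.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def cpu_profile_url(base_url, seconds=30):
    """Build the URL for a CPU profile of the given duration.

    Assumes the standard Go pprof path; adjust if the server is
    configured differently.
    """
    root = base_url.rstrip("/") + "/"
    return urljoin(root, "debug/pprof/profile") + "?" + urlencode({"seconds": seconds})


def save_cpu_profile(base_url, path, seconds=30):
    """Download a CPU profile and write it to path.

    The request blocks for roughly `seconds` while the profile is
    collected, so the timeout is padded beyond that duration.
    """
    with urlopen(cpu_profile_url(base_url, seconds), timeout=seconds + 15) as resp:
        data = resp.read()
    with open(path, "wb") as fh:
        fh.write(data)
    return len(data)
```

For example, after port-forwarding to the affected pod, `save_cpu_profile("http://localhost:9093", "am-cpu.pprof")` would write a file that can be inspected with `go tool pprof`. Taking the same profile from a healthy replica gives a baseline to diff against.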

### System information

Linux 6.12.68-92.122.amzn2023.x86_64

### Alertmanager version

```text
alertmanager, version 0.32.1 (branch: HEAD, revision: 8768aa6f65f1a888b5aa5fbf877cf20ad45d1f61)
  build user:       root@6f5b35dc1248
  build date:       20260429-17:34:54
  go version:       go1.26.2
  platform:         linux/amd64
  tags:             netgo
```
