What did you do?
- Upgraded our Alertmanager deployment on Kubernetes (Amazon EKS) to the latest prometheus-community Helm chart, 1.35.1 (corresponding to the latest upstream Alertmanager version, v0.32.1)
- Configured Alertmanager in HA mode with multiple replicas using gossip clustering (--cluster.peer)
- Sent alerts from multiple producers to all Alertmanager replicas
- Allowed the system to run under steady/normal alert load over time
What did you expect to see?
- Even distribution of CPU and workload across all Alertmanager replicas
- CPU usage remaining stable over time for each replica, proportional to alert volume
- No individual node becoming a bottleneck or degrading independently
- No restarts or loss of responsiveness under normal operating conditions
What did you see instead? Under which circumstances?
- A single replica in the gossip cluster gradually consumes more CPU over time, and the usage never appears to be released
- The high CPU usage isn't related to a spike in incoming alerts or posting to downstream receivers
- The increase appears unbounded (monotonic growth) until:
  - The instance reaches ~100% CPU utilization at the node level
  - The instance becomes unresponsive
  - Kubernetes sends SIGTERM to the pod after liveness probe failures
  - The pod restarts, freeing up CPU before the growth starts again
Other replicas in the same cluster on identical nodes, receiving identical alerts:
- Continue operating normally
- Show significantly lower and stable CPU usage
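To make the comparison between replicas concrete, here is a rough sketch of how the affected replica could be flagged from exported CPU samples. It is not taken from our tooling: the function names, the replica names, the sample data, and the 0.01 cores/hour threshold are all hypothetical and chosen only for illustration. It assumes each replica's CPU usage is available as (timestamp in seconds, CPU cores) pairs and fits a simple least-squares trend to each series.

```python
from statistics import mean


def cpu_slope(samples):
    """Least-squares slope (cores per second) over (timestamp_seconds, cpu_cores) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_bar, v_bar = mean(ts), mean(vs)
    denom = sum((t - t_bar) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("timestamps must not all be equal")
    return sum((t - t_bar) * (v - v_bar) for t, v in zip(ts, vs)) / denom


def growing_replicas(series, min_slope_per_hour=0.01):
    """Return the sorted replica names whose CPU trend exceeds the threshold (cores/hour).

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    flagged = []
    for name, samples in series.items():
        if cpu_slope(samples) * 3600 > min_slope_per_hour:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up illustrative data, NOT measurements from the affected cluster.
    hour = 3600
    series = {
        "alertmanager-0": [(i * hour, 0.10 + 0.05 * i) for i in range(6)],  # steady growth
        "alertmanager-1": [(i * hour, 0.10) for i in range(6)],             # flat
        "alertmanager-2": [(i * hour, 0.12 if i % 2 else 0.10) for i in range(6)],  # noisy but flat
    }
    print(growing_replicas(series))
```

A trend fit is used rather than a single-point threshold because the symptom here is slow, sustained growth: a replica can look healthy at any given moment long before it reaches the point where liveness probes fail.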
This behavior appears very similar to the previously reported issue:
However, this is occurring on the latest version, so I'm unsure whether it is the same symptom with a different cause or a case the fix did not account for.
Here's the CPU usage on that particular node:

Memory usage follows a similar growth, but CPU is ultimately what is exhausted first in our case.
Lastly, we run many Alertmanager gossip clusters in our environment, all on this same chart and upstream version, with similar Alertmanager configurations and differing only in their gossip peers. The issue does not appear uniformly across all clusters, so a specific condition may be triggering this behavior.
Please let me know if there's any other information you'd like me to pull. I'll be continuing to investigate and posting anything I find. Thank you!
System information
Linux 6.12.68-92.122.amzn2023.x86_64
Alertmanager version
alertmanager, version 0.32.1 (branch: HEAD, revision: 8768aa6f65f1a888b5aa5fbf877cf20ad45d1f61)
build user: root@6f5b35dc1248
build date: 20260429-17:34:54
go version: go1.26.2
platform: linux/amd64
tags: netgo