Alertmanager CPU leak on single gossip cluster member (similar to #4868) #5249

@tonytwostep

What did you do?

  • Upgraded our Alertmanager deployment on Kubernetes (Amazon EKS) to use the latest prometheus-community Helm chart 1.35.1 (corresponding to the latest upstream Alertmanager version v0.32.1)
  • Configured Alertmanager in HA mode with multiple replicas using gossip clustering (--cluster.peer)
  • Sent alerts from multiple producers to all Alertmanager replicas
  • Allowed the system to run under steady/normal alert load over time

What did you expect to see?

  • Even distribution of CPU and workload across all Alertmanager replicas
  • CPU usage remaining stable over time for each replica, proportional to alert volume
  • No individual node becoming a bottleneck or degrading independently
  • No restarts or loss of responsiveness under normal operating conditions

What did you see instead? Under which circumstances?

  • A single replica in the gossip cluster gradually consumes more CPU over time, and the usage never appears to drop back down

  • The high CPU usage isn't related to a spike in incoming alerts or posting to downstream receivers

  • The increase appears unbounded (monotonic growth) until:

    • The instance reaches ~100% CPU utilization at the node level
    • The instance becomes unresponsive
    • Kubernetes sends SIGTERM to the pod after liveness probe failures
    • The pod restarts, which frees the CPU, and then the growth begins again

Other replicas in the same cluster on identical nodes, receiving identical alerts:

  • Continue operating normally
  • Show significantly lower and stable CPU usage
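
To make the comparison between replicas concrete, one option is to fit a straight line to each replica's CPU samples and flag any pod whose slope stays positive. The following is a minimal sketch, not something taken from our tooling: the pod names, the threshold, and the idea that samples come from a query such as `rate(container_cpu_usage_seconds_total[5m])` are all assumptions for illustration.

```python
from statistics import mean


def cpu_growth_slope(samples):
    """Return the least-squares slope of CPU usage, in cores per second.

    samples is a list of (unix_timestamp_seconds, cpu_cores) tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean, v_mean = mean(ts), mean(vs)
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("all samples share the same timestamp")
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, vs))
    return num / denom


def flag_growing_replicas(series_by_pod, min_cores_per_hour=0.01):
    """Return the sorted pod names whose CPU trend exceeds the threshold.

    min_cores_per_hour is an arbitrary, hypothetical threshold.
    """
    flagged = []
    for pod, samples in series_by_pod.items():
        if cpu_growth_slope(samples) * 3600 >= min_cores_per_hour:
            flagged.append(pod)
    return sorted(flagged)


# Hypothetical data: one replica climbing, one flat.
example = {
    "alertmanager-0": [(0, 0.10), (3600, 0.15), (7200, 0.20)],
    "alertmanager-1": [(0, 0.10), (3600, 0.10), (7200, 0.10)],
}
growing = flag_growing_replicas(example)
```

A linear fit is a deliberately crude test. It is only meant to distinguish "climbs steadily between restarts" from "flat", so the samples for each pod should cover a single window between restarts.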

This behavior appears very similar to the previously reported issue #4868.

However, this is occurring on the latest version, so I'm unsure whether it is the same symptom with a different cause or a case the fix did not account for.

Here's the CPU usage on that particular node:
[Image: CPU usage over time on the affected node]

Memory usage follows a similar growth, but CPU is ultimately what is exhausted first in our case.

Lastly, we run many Alertmanager gossip clusters in our environment. All of them use this same chart and upstream version with similar Alertmanager configurations, differing only in their gossip peers. The issue does not appear uniformly across all clusters, so a specific condition may be triggering this behavior.

Please let me know if there's any other information you'd like me to pull. I'll be continuing to investigate and posting anything I find. Thank you!
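
A CPU profile from the affected replica would likely help narrow this down. The following is a minimal sketch of how one could be captured, assuming the standard Go `/debug/pprof/` handlers are reachable on the Alertmanager web port (9093 by default), for example through a port-forward. The base URL and output file name are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def pprof_url(base_url, kind="profile", seconds=30):
    """Build a URL for a Go net/http/pprof endpoint.

    kind="profile" is the CPU profile and accepts a duration in seconds;
    other kinds (for example "heap" or "goroutine") take no duration here.
    """
    path = f"{base_url.rstrip('/')}/debug/pprof/{kind}"
    if kind == "profile":
        return f"{path}?{urlencode({'seconds': seconds})}"
    return path


def fetch_profile(base_url, out_path, kind="profile", seconds=30):
    """Download a profile to out_path; blocks for about `seconds` for CPU profiles."""
    url = pprof_url(base_url, kind, seconds)
    # Allow extra time beyond the profiling window for the response itself.
    with urlopen(url, timeout=seconds + 10) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


# Placeholder address, e.g. after port-forwarding the affected pod locally.
cpu_url = pprof_url("http://localhost:9093/", "profile", 30)
heap_url = pprof_url("http://localhost:9093", "heap")
```

The resulting file can be inspected with `go tool pprof`. Capturing the same profile from a healthy replica gives a baseline to compare against.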

System information

Linux 6.12.68-92.122.amzn2023.x86_64

Alertmanager version

alertmanager, version 0.32.1 (branch: HEAD, revision: 8768aa6f65f1a888b5aa5fbf877cf20ad45d1f61)
  build user:       root@6f5b35dc1248
  build date:       20260429-17:34:54
  go version:       go1.26.2
  platform:         linux/amd64
  tags:             netgo
