| Pillar | Selected | Evaluated Alternatives | Decision Driver |
|---|---|---|---|
| Metrics | Prometheus (kube-prometheus-stack) | Datadog, Amazon CloudWatch | Open-source, no per-host licensing, native Kubernetes service discovery, PromQL ecosystem |
| Logs | Fluent Bit | Fluentd, CloudWatch Logs Agent | 10x lower memory footprint than Fluentd (~5MB vs ~50MB per node), native EKS addon support, sufficient plugin coverage |
| Traces | AWS X-Ray | Jaeger, Zipkin, Tempo | Native integration with AWS SDK (SQS, DynamoDB, Lambda), no additional infrastructure to operate, built-in service map |
| Profiling | Pyroscope | Parca, Datadog Continuous Profiler | Pull-based model matches Prometheus conventions, supports Go/Java/Python runtimes used by ExamplePay services, Grafana-native integration |
Prometheus over Datadog: Datadog's per-host pricing model becomes prohibitive at scale with Karpenter's dynamic node provisioning. A single Karpenter-managed cluster can fluctuate between 20 and 200 nodes during batch processing windows, which would result in unpredictable monthly bills. Prometheus with Thanos for long-term storage provides equivalent functionality at a fraction of the cost.
Fluent Bit over Fluentd: Fluent Bit is written in C and designed for constrained environments. With 3 Karpenter node pools (payments, batch, GPU) creating and destroying nodes frequently, the DaemonSet footprint matters. Fluent Bit's memory usage remains under 10MB per node versus Fluentd's 50-100MB. The plugin ecosystem covers all our destinations (S3, CloudWatch, OpenSearch).
AWS X-Ray over Jaeger: ExamplePay's payment processing pipeline relies heavily on AWS-managed services (SQS, Secrets Manager, Aurora). X-Ray provides automatic instrumentation for AWS SDK calls without additional code changes. Jaeger would require manual span creation for every AWS service interaction. X-Ray's service map also provides a real-time dependency graph that integrates with CloudWatch ServiceLens.
Pyroscope over Parca: Both are open-source continuous profiling tools. Pyroscope was selected for its mature Grafana data source plugin, which allows correlating profiles with Prometheus metrics and Loki logs in a single dashboard. Pyroscope also supports both push and pull modes, giving per-service flexibility.
+---------------------+   +---------------------+   +---------------------+
|  Application Pods   |   |  Application Pods   |   |  Application Pods   |
|  (metrics: /metrics)|   |  (stdout/stderr)    |   |  (X-Ray SDK)        |
+----------+----------+   +----------+----------+   +----------+----------+
           |                         |                         |
           v                         v                         v
+----------+----------+   +----------+----------+   +----------+----------+
|     Prometheus      |   |     Fluent Bit      |   |    X-Ray Daemon     |
|   (ServiceMonitor   |   |     (DaemonSet)     |   |     (DaemonSet)     |
|    + PodMonitor)    |   |                     |   |                     |
+----------+----------+   +----+----------+-----+   +----------+----------+
           |                   |          |                    |
           v                   v          v                    v
+----------+----------+   +---------+-----------+   +----------+----------+
|   Thanos Sidecar    |   |   S3    | OpenSearch|   |      AWS X-Ray      |
|   --> Thanos Store  |   |  (raw)  | (indexed) |   |      (managed)      |
|   --> S3 (compact)  |   +---------+-----------+   +---------------------+
+----------+----------+
           |
           v
+---------------------+
|       Grafana       |
|  - Prometheus DS    |
|  - Loki DS          |   <-- Fluent Bit can also forward to Loki
|  - X-Ray DS         |
|  - Pyroscope DS     |
+---------------------+
- Metrics: Application pods expose `/metrics` endpoints. Prometheus scrapes them via ServiceMonitor/PodMonitor CRDs. The Thanos sidecar ships blocks to S3 for long-term storage, and Thanos Query provides a unified query layer across regions (see the ServiceMonitor sketch after this list).
- Logs: Fluent Bit runs as a DaemonSet, tailing container stdout/stderr from `/var/log/containers/`. Logs are enriched with Kubernetes metadata (namespace, pod, labels) and forwarded to both S3 (raw archive) and OpenSearch (indexed for search); a configuration sketch follows the list.
- Traces: Application services instrumented with the X-Ray SDK send trace segments to the X-Ray daemon running as a DaemonSet. The daemon batches and forwards segments to the X-Ray service endpoint.
- Profiles: The Pyroscope server scrapes profiling endpoints on annotated pods. Profiles are stored locally, with S3 used for long-term retention.
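To make the metrics path concrete, a minimal ServiceMonitor for a hypothetical `payments-api` service could look like the sketch below. All names, labels, and the port are illustrative assumptions, not actual ExamplePay manifests.

```yaml
# Hypothetical ServiceMonitor: tells the kube-prometheus-stack Prometheus
# instance which Service endpoints to scrape, on which port and path.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
  namespace: payments
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payments-api              # label on the Service, not the Pods
  namespaceSelector:
    matchNames:
      - payments
  endpoints:
    - port: http-metrics             # named port on the Service
      path: /metrics
      interval: 30s
```

The dual-destination log flow could be expressed in Fluent Bit's YAML configuration roughly as follows, assuming a Fluent Bit version recent enough to accept YAML configuration. The bucket name, region, and OpenSearch host are placeholders; in practice the same pipeline would typically be rendered through the Fluent Bit Helm chart values.

```yaml
# Sketch of a tail -> kubernetes-filter -> S3 + OpenSearch pipeline.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      multiline.parser: cri          # EKS nodes emit containerd/CRI-formatted logs
      tag: kube.*
  filters:
    - name: kubernetes               # enriches records with namespace/pod/labels
      match: kube.*
  outputs:
    - name: s3                       # raw archive
      match: kube.*
      bucket: examplepay-logs-raw    # placeholder bucket
      region: us-east-1
    - name: opensearch               # indexed for search
      match: kube.*
      host: logs.internal.example.com
      port: 443
      tls: on
```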
| Data Type | Hot Storage | Warm Storage | Cold/Archive | Total Retention |
|---|---|---|---|---|
| Metrics (Prometheus) | 15 days (local SSD) | 90 days (Thanos S3) | 1 year (S3 Glacier) | 1 year |
| Logs (OpenSearch) | 7 days (hot nodes) | 30 days (warm nodes) | 1 year (S3 via ISM) | 1 year |
| Logs (S3 raw) | -- | 90 days (S3 Standard) | 7 years (S3 Glacier) | 7 years |
| Traces (X-Ray) | 30 days (X-Ray service) | -- | -- | 30 days |
| Profiles (Pyroscope) | 7 days (local) | 30 days (S3) | -- | 30 days |
The 7-year retention for raw logs in S3 Glacier satisfies PCI-DSS audit trail requirements for payment processing systems.
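The metric tiers above map onto retention flags on the Thanos compactor, while the Glacier tier is handled by an S3 lifecycle rule rather than by Thanos itself (Thanos cannot read blocks back out of Glacier). A minimal sketch of the compactor container args follows; the image tag, file paths, and exact durations are illustrative, not the production manifest.

```yaml
# Hypothetical excerpt from the Thanos Compactor pod spec: raw blocks are
# kept 90 days, downsampled blocks longer, matching the table above.
containers:
  - name: thanos-compact
    image: quay.io/thanos/thanos:v0.34.0                    # illustrative tag
    args:
      - compact
      - --objstore.config-file=/etc/thanos/objstore.yaml    # S3 bucket config
      - --retention.resolution-raw=90d
      - --retention.resolution-5m=180d
      - --retention.resolution-1h=365d
      - --wait                                              # run continuously
```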
| Component | Monthly Estimate (200-node peak) | Notes |
|---|---|---|
| Prometheus (2 replicas) | $0 (OSS) + ~$400 compute | 2x m6i.xlarge for Prometheus pods |
| Thanos (store, compact, query) | $0 (OSS) + ~$200 compute | Shared node pool |
| S3 storage (metrics) | ~$50/month | ~500GB compressed, S3 Standard |
| Total | ~$650/month | |
| Datadog equivalent | ~$6,000-15,000/month | $23/host/month infrastructure + $15/host/month APM, variable node count |
| Component | Monthly Estimate | Notes |
|---|---|---|
| Fluent Bit | $0 (OSS) | Negligible DaemonSet overhead |
| OpenSearch (3 hot + 2 warm) | ~$1,200/month | m6g.xlarge.search instances |
| S3 archive | ~$100/month | Raw logs, lifecycle to Glacier |
| Total | ~$1,300/month | |
| CloudWatch Logs equivalent | ~$3,000-5,000/month | $0.50/GB ingestion at ~200GB/day |
- Prometheus recording rules pre-aggregate high-cardinality metrics, reducing Thanos storage by ~60% (a sketch of such a rule follows this list)
- Fluent Bit sampling for debug-level logs in non-production namespaces
- S3 Intelligent-Tiering for metric blocks older than 30 days
- OpenSearch ISM policies automatically migrate indices from hot to warm to S3
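As an illustration of the recording-rule optimization in the first bullet, a rule that collapses per-pod request series into a per-service aggregate could be declared through the Prometheus Operator's PrometheusRule CRD. The metric, label, and namespace names here are hypothetical.

```yaml
# Hypothetical PrometheusRule: pre-aggregates a high-cardinality per-pod
# counter into a per-service rate so dashboards and long-range Thanos
# queries can use the cheap aggregated series.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Prometheus ruleSelector
spec:
  groups:
    - name: payments.aggregation
      interval: 1m
      rules:
        - record: service:http_requests:rate5m
          expr: sum by (service, status_code) (rate(http_requests_total[5m]))
```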