70 changes: 67 additions & 3 deletions deployment-examples/metrics/README.md
Expand Up @@ -4,14 +4,24 @@ This directory contains configurations and examples for collecting, processing,

## Overview

NativeLink exposes comprehensive metrics about cache operations and remote execution through OpenTelemetry. These metrics provide insights into:
NativeLink exposes remote execution metrics through OpenTelemetry. Cache
operation metrics are available when the store is explicitly wrapped with the
opt-in `cache_metrics` store wrapper. These metrics provide insights into:

- **Cache Performance**: Hit rates, operation latencies, eviction rates
- **Cache Performance**: Hit rates and operation latencies when `cache_metrics` is enabled
- **Execution Pipeline**: Queue times, stage durations, success rates
- **System Health**: Worker utilization, throughput, error rates

## Quick Start

NativeLink doesn't expose a Prometheus scrape endpoint directly. It emits OTLP
metrics. To view those metrics in Prometheus, use one of these paths:

1. NativeLink sends OTLP to an OpenTelemetry Collector, and Prometheus scrapes
the Collector's Prometheus exporter endpoint.
2. NativeLink sends OTLP/HTTP metrics directly to Prometheus with Prometheus'
OTLP receiver enabled.

### Using Docker Compose (Recommended for Development)

1. Start the metrics stack:
Expand All @@ -33,11 +43,37 @@ export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev,nativelink.instance_
nativelink /path/to/config.json
```

To emit `nativelink_cache_*` metrics, wrap the CAS and/or AC store you want to
measure:

```json5
{
  "name": "CAS_MAIN_STORE",
  "cache_metrics": {
    "cache_type": "cas",
    "backend": {
      "filesystem": {
        "content_path": "~/.cache/nativelink/content_path-cas",
        "temp_path": "~/.cache/nativelink/tmp_path-cas"
      }
    }
  }
}
```

If `cache_metrics` is absent, NativeLink constructs the same store graph as it
would without cache metrics. The disabled path doesn't add a wrapper, timer,
attribute allocation, or OpenTelemetry recording call to cache operations.

4. Access the metrics:
- Prometheus UI: http://localhost:9090
- Grafana: http://localhost:3000 (if included)
- OTEL Collector metrics: http://localhost:8888/metrics

In this flow, NativeLink sends OTLP to the Collector on `:4317`. The Collector
serves Prometheus-format metrics on its Prometheus exporter endpoint, and
Prometheus scrapes that endpoint.

### Using Kubernetes

1. Deploy the OTEL Collector:
Expand Down Expand Up @@ -65,6 +101,9 @@ env:

### Cache Metrics

Cache metrics are opt-in. The following series are emitted only for stores
wrapped with `cache_metrics`; configuring OTEL alone doesn't enable them.

| Metric | Type | Description | Labels |
|--------|------|-------------|--------|
| `nativelink_cache_operations_total` | Counter | Total cache operations | `cache_type`, `cache_operation_name`, `cache_operation_result` |
Expand Down Expand Up @@ -160,13 +199,38 @@ See `otel-collector-config.yaml` for a complete example.

Prometheus offers native OTLP support and excellent query capabilities.

**Direct OTLP Ingestion:**
**Direct OTLP Ingestion from NativeLink:**
```bash
prometheus --web.enable-otlp-receiver \
--storage.tsdb.out-of-order-time-window=30m
```

Then point NativeLink at Prometheus' OTLP metrics endpoint:

```bash
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:9090/api/v1/otlp/v1/metrics
```

**Via Collector Scraping:**

Configure the Collector with a Prometheus exporter:

```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:9090"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Then configure Prometheus to scrape the Collector:

```yaml
scrape_configs:
  - job_name: 'otel-collector'
Expand Down
22 changes: 20 additions & 2 deletions deployment-examples/metrics/cache-metrics-wrapper-store.md
Expand Up @@ -4,6 +4,11 @@

Expose consistent, low-cardinality cache metrics (CAS/AC/store backends) without needing to implement bespoke instrumentation inside every individual store implementation.

The implementation is intentionally opt-in through the `cache_metrics` store
wrapper. If a store isn't wrapped, NativeLink constructs the same store graph
as before and doesn't add timers, attribute allocation, or OpenTelemetry
recording calls to that store's hot path.

This document focuses on a **wrapper store** (middleware) approach that can be applied to any `StoreDriver`, and compares it with **instrumenting inside each store**.

## Problem Statement
Expand All @@ -20,7 +25,7 @@ These should be queryable and composable with low cognitive overhead and consist

### A) Wrapper Store (middleware)

Wrap an existing `Arc<dyn StoreDriver>` with a new `StoreDriver` that:
Wrap an existing `Arc<dyn StoreDriver>` with the `cache_metrics` `StoreDriver` that:
1. Starts a timer
2. Calls the inner store method
3. Classifies the outcome (hit/miss/error/etc)
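
The steps listed so far can be sketched in a few lines of synchronous Rust. This is only an illustration under stated assumptions: `KvStore`, `MemoryStore`, `Recorder`, and `MetricsStore` are hypothetical names, the real `StoreDriver` trait is async and has a larger surface, and a real wrapper would record to OpenTelemetry instruments rather than to an in-process map.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `StoreDriver` (assumption: one sync method).
pub trait KvStore {
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String>;
}

/// Toy in-memory backend, used only for illustration.
pub struct MemoryStore {
    pub entries: HashMap<String, Vec<u8>>,
}

impl KvStore for MemoryStore {
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String> {
        Ok(self.entries.get(key).cloned())
    }
}

/// Counters keyed by (cache_type, cache_operation_name, cache_operation_result).
#[derive(Default)]
pub struct Recorder {
    pub counts: HashMap<(&'static str, &'static str, &'static str), u64>,
    pub total_latency: Duration,
}

/// Wrapper that implements the same trait as the store it wraps.
pub struct MetricsStore<S: KvStore> {
    inner: S,
    cache_type: &'static str,
    recorder: Mutex<Recorder>,
}

impl<S: KvStore> MetricsStore<S> {
    pub fn new(inner: S, cache_type: &'static str) -> Self {
        Self { inner, cache_type, recorder: Mutex::new(Recorder::default()) }
    }

    /// Reads back one counter; returns 0 if nothing was recorded.
    pub fn count(&self, op: &'static str, result: &'static str) -> u64 {
        let rec = self.recorder.lock().unwrap();
        rec.counts.get(&(self.cache_type, op, result)).copied().unwrap_or(0)
    }
}

impl<S: KvStore> KvStore for MetricsStore<S> {
    fn get(&self, key: &str) -> Result<Option<Vec<u8>>, String> {
        let start = Instant::now(); // 1. start a timer
        let result = self.inner.get(key); // 2. call the inner store method
        let outcome = match &result {
            // 3. classify the outcome
            Ok(Some(_)) => "hit",
            Ok(None) => "miss",
            Err(_) => "error",
        };
        let mut rec = self.recorder.lock().unwrap();
        *rec.counts.entry((self.cache_type, "read", outcome)).or_insert(0) += 1;
        rec.total_latency += start.elapsed();
        result // the caller sees exactly what the inner store returned
    }
}
```

Because the wrapper implements the same trait as the store it wraps, callers need no changes, and a store that is never wrapped never touches the recorder.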
Expand Down Expand Up @@ -142,4 +147,17 @@ To reach that:
- Ensure `deployment-examples/metrics/prometheus-recording-rules.yml` references `_total` counter names.
- Keep existing dashboards querying recording rules (for example, `nativelink:cache_hit_rate`) instead of raw high-cardinality series.

If wrapper metrics are **optional/config-gated**, docs may need a small note describing how to enable them; otherwise docs can remain unchanged.
Wrapper metrics are config-gated. Wrap only the logical store layer you want to
measure, for example:

```json5
"cache_metrics": {
  "cache_type": "cas",
  "backend": {
    "filesystem": {
      "content_path": "~/.cache/nativelink/content_path-cas",
      "temp_path": "~/.cache/nativelink/tmp_path-cas"
    }
  }
}
```
50 changes: 31 additions & 19 deletions nativelink-config/examples/stores-config.json5
Expand Up @@ -4,14 +4,26 @@
stores: [
{
name: "0",
"cache_metrics": {
"cache_type": "cas",
"backend": {
"filesystem": {
"content_path": "~/.cache/nativelink/content_path-cas",
"temp_path": "~/.cache/nativelink/tmp_path-cas"
}
}
}
},
{
name: "1",
"memory": {
"eviction_policy": {
"max_bytes": "10mb",
}
}
},
{
name: "1",
name: "2",
"experimental_cloud_object_store": {
"provider": "aws",
"region": "eu-north-1",
Expand All @@ -26,7 +38,7 @@
}
},
{
name: "2",
name: "3",
"experimental_cloud_object_store": {
"provider": "gcs",
"bucket": "test-bucket",
Expand All @@ -40,7 +52,7 @@
}
},
{
name: "3",
name: "4",
"experimental_cloud_object_store": {
"provider": "azure",
"account_name": "cloudshell1393657559",
Expand All @@ -55,7 +67,7 @@
}
},
{
name: "4",
name: "5",
"experimental_cloud_object_store": {
"provider": "ontap",
"endpoint": "https://ontap-s3-endpoint:443",
Expand All @@ -72,7 +84,7 @@
}
},
{
name: "5",
name: "6",
"ontap_s3_existence_cache": {
"index_path": "/path/to/cache/index.json",
"sync_interval_seconds": 300,
Expand All @@ -85,7 +97,7 @@
}
},
{
name: "6",
name: "7",
"verify": {
"backend": {
"memory": {
Expand All @@ -99,7 +111,7 @@
}
},
{
name: "7",
name: "8",
"completeness_checking": {
"backend": {
"filesystem": {
Expand All @@ -118,7 +130,7 @@
}
},
{
name: "8",
name: "9",
"compression": {
"compression_algorithm": {
"lz4": {}
Expand All @@ -135,7 +147,7 @@
}
},
{
name: "9",
name: "10",
"dedup": {
"index_store": {
"memory": {
Expand Down Expand Up @@ -174,7 +186,7 @@
}
},
{
name: "10",
name: "11",
"existence_cache": {
"backend": {
"memory": {
Expand All @@ -190,7 +202,7 @@
}
},
{
name: "11",
name: "12",
"fast_slow": {
"fast": {
"filesystem": {
Expand All @@ -213,7 +225,7 @@
}
},
{
name: "12",
name: "13",
"shard": {
"stores": [
{
Expand All @@ -229,7 +241,7 @@
}
},
{
name: "13",
name: "14",
"filesystem": {
"content_path": "/tmp/nativelink/data-worker-test/content_path-cas",
"temp_path": "/tmp/nativelink/data-worker-test/tmp_path-cas",
Expand All @@ -239,13 +251,13 @@
}
},
{
name: "14",
name: "15",
"ref_store": {
"name": "FS_CONTENT_STORE"
}
},
{
name: "15",
name: "16",
"size_partitioning": {
"size": "128mib",
"lower_store": {
Expand All @@ -262,7 +274,7 @@
}
},
{
name: "16",
name: "17",
"grpc": {
"instance_name": "main",
"endpoints": [
Expand All @@ -283,7 +295,7 @@
}
},
{
name: "17",
name: "18",
"redis_store": {
"addresses": [
"redis://127.0.0.1:6379/",
Expand All @@ -292,11 +304,11 @@
}
},
{
name: "18",
name: "19",
"noop": {}
},
{
name: "19",
name: "20",
"experimental_mongo": {
"connection_string": "mongodb://localhost:27017",
"database": "nativelink",
Expand Down
34 changes: 34 additions & 0 deletions nativelink-config/src/stores.rs
Expand Up @@ -50,6 +50,28 @@ pub enum ConfigDigestHashFunction {
#[serde(rename_all = "snake_case")]
#[cfg_attr(feature = "dev-schema", derive(JsonSchema))]
pub enum StoreSpec {
    /// Cache metrics store wraps another store and emits low-cardinality
    /// OpenTelemetry cache operation metrics for the wrapped store.
    ///
    /// This wrapper is opt-in. Stores that are not explicitly wrapped by
    /// `cache_metrics` are constructed exactly as they are without this
    /// wrapper and do not pay its hot-path timing or recording cost.
    ///
    /// **Example JSON Config:**
    /// ```json
    /// "cache_metrics": {
    ///   "cache_type": "cas",
    ///   "backend": {
    ///     "filesystem": {
    ///       "content_path": "~/.cache/nativelink/content_path-cas",
    ///       "temp_path": "~/.cache/nativelink/tmp_path-cas"
    ///     }
    ///   }
    /// }
    /// ```
    ///
    CacheMetrics(Box<CacheMetricsSpec>),

/// Memory store will store all data in a hashmap in memory.
///
/// **Example JSON Config:**
Expand Down Expand Up @@ -594,6 +616,18 @@ pub struct ShardSpec {
pub stores: Vec<ShardConfig>,
}

#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(deny_unknown_fields)]
#[cfg_attr(feature = "dev-schema", derive(JsonSchema))]
pub struct CacheMetricsSpec {
    /// Low-cardinality cache type label for metrics, for example `cas` or `ac`.
    #[serde(deserialize_with = "convert_string_with_shellexpand")]
    pub cache_type: String,

    /// Store to wrap with cache operation metrics.
    pub backend: StoreSpec,
}
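
For reviewers, a minimal std-only sketch of how such a spec nests and why the disabled path adds no layer. `Spec`, `MetricsSpec`, and `layers` are hypothetical names that mirror only two variants; they are not the real `StoreSpec` or NativeLink's store factory. The `Box` is what allows `Spec` to contain a `MetricsSpec` that in turn contains a `Spec`, since a recursive type needs indirection to have a finite size.

```rust
/// Simplified, hypothetical mirror of `StoreSpec` with two variants.
#[derive(Debug, Clone)]
pub enum Spec {
    Filesystem { content_path: String },
    /// Boxed so that the recursive type has a known size.
    CacheMetrics(Box<MetricsSpec>),
}

/// Simplified, hypothetical mirror of `CacheMetricsSpec`.
#[derive(Debug, Clone)]
pub struct MetricsSpec {
    pub cache_type: String,
    pub backend: Spec,
}

/// Lists the store layers a factory would build, outermost first.
/// A spec without `CacheMetrics` yields no metrics layer at all.
pub fn layers(spec: &Spec) -> Vec<String> {
    match spec {
        Spec::Filesystem { .. } => vec!["filesystem".to_string()],
        Spec::CacheMetrics(m) => {
            let mut out = vec![format!("cache_metrics({})", m.cache_type)];
            out.extend(layers(&m.backend));
            out
        }
    }
}
```

Under these assumptions, a plain filesystem spec produces a single `filesystem` layer, and wrapping it produces a `cache_metrics(cas)` layer in front of the same backend.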

#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(deny_unknown_fields)]
#[cfg_attr(feature = "dev-schema", derive(JsonSchema))]
Expand Down