environmentd: add Prometheus metrics for MCP endpoints DEX-2 #36535

Draft
jubrad wants to merge 1 commit into MaterializeInc:main from jubrad:jbradfield/mcp-metrics

jubrad commented May 13, 2026

Summary

Adds a McpMetrics struct that tracks MCP traffic at the JSON-RPC protocol layer, complementing the existing HTTP-level PrometheusLayer metrics (which already expose request counts/durations keyed on /api/mcp/agent and /api/mcp/developer paths).

Five new time series, all labelled by endpoint (agent or developer):

| Metric | Labels | What it answers |
| --- | --- | --- |
| `mcp_requests_total` | `endpoint`, `method`, `status` | Request rate by JSON-RPC method and outcome |
| `mcp_tool_calls_total` | `endpoint`, `tool`, `status` | Which tools are used and how often they fail |
| `mcp_tool_duration_seconds` | `endpoint`, `tool` | Tool execution latency histogram |
| `mcp_errors_total` | `endpoint`, `error_type` | Error breakdown (ValidationError, ExecutionError, ResponseSizeExceeded, …) |
| `mcp_timeouts_total` | `endpoint` | Requests that hit the 60 s timeout |
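
To make the label scheme concrete, here is a minimal std-only sketch of how the counters above might be keyed. It is not the code in this PR: `CounterVec` is a stand-in for a labelled Prometheus counter family, `McpMetricsSketch` and its methods are hypothetical, the `"error"` status value and the `"query"` tool name used in the usage note are assumptions, and the `mcp_tool_duration_seconds` histogram is left out for brevity.

```rust
use std::collections::HashMap;

/// Std-only stand-in for a labelled counter family (assumption: the real
/// struct presumably wraps Prometheus counter vectors instead).
#[derive(Default)]
struct CounterVec {
    values: HashMap<Vec<String>, u64>,
}

impl CounterVec {
    /// Increment the series identified by this exact tuple of label values.
    fn inc(&mut self, labels: &[&str]) {
        let key: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
        *self.values.entry(key).or_insert(0) += 1;
    }

    /// Read a series; a label tuple that was never incremented reads as 0.
    fn get(&self, labels: &[&str]) -> u64 {
        let key: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
        self.values.get(&key).copied().unwrap_or(0)
    }
}

/// Hypothetical layout; only the metric and label names come from the table.
#[derive(Default)]
struct McpMetricsSketch {
    requests_total: CounterVec,   // labels: endpoint, method, status
    tool_calls_total: CounterVec, // labels: endpoint, tool, status
    errors_total: CounterVec,     // labels: endpoint, error_type
    timeouts_total: CounterVec,   // labels: endpoint
}

impl McpMetricsSketch {
    /// Record one failed `tools/call` request against three counters.
    fn record_tool_failure(&mut self, endpoint: &str, tool: &str, error_type: &str) {
        self.requests_total.inc(&[endpoint, "tools/call", "error"]);
        self.tool_calls_total.inc(&[endpoint, tool, "error"]);
        self.errors_total.inc(&[endpoint, error_type]);
    }

    /// Record one request that hit the timeout.
    fn record_timeout(&mut self, endpoint: &str) {
        self.timeouts_total.inc(&[endpoint]);
    }
}
```

Calling `record_tool_failure("agent", "query", "ExecutionError")` bumps one series in each of the three counters, and the `developer` endpoint's series stay at zero because every series is keyed on the full label tuple.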

Timeout counting — the timeout arm now increments requests_total and errors_total immediately by capturing method_name before request is moved into the spawned task. Previously, those counters would only be updated when the background task eventually finished, which could lag by up to 60 s during timeout storms.
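
The capture-before-move pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the handler in this PR: it uses `std::thread` and `std::sync::mpsc` where the real code presumably uses an async runtime, and `Request`, `Counters`, `execute`, and `handle` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical request; `work` simulates how long execution takes.
struct Request {
    method: String,
    work: Duration,
}

/// Plain fields standing in for the labelled Prometheus counters.
#[derive(Default)]
struct Counters {
    /// Method labels of requests recorded with an error status.
    failed_methods: Vec<String>,
    errors_total: u64,
    timeouts_total: u64,
}

/// Stand-in for executing a request; takes ownership of it.
fn execute(request: Request) -> String {
    thread::sleep(request.work);
    format!("{} ok", request.method)
}

/// Returns the response, or `None` if the timeout fired first.
fn handle(request: Request, timeout: Duration, counters: &mut Counters) -> Option<String> {
    // Copy the label out *before* `request` is moved into the worker.
    let method_name = request.method.clone();
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        // `request` is owned by the worker from here on.
        let response = execute(request);
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(response);
    });

    match rx.recv_timeout(timeout) {
        Ok(response) => Some(response),
        Err(_) => {
            // Timeout arm: count right away instead of waiting for the
            // worker. `request` is no longer accessible here, which is
            // why the label was captured up front.
            counters.failed_methods.push(method_name);
            counters.errors_total += 1;
            counters.timeouts_total += 1;
            None
        }
    }
}
```

The design point is that the timeout arm holds everything it needs to label the counters, so the metrics reflect a timeout as soon as it is observed rather than when the abandoned task finishes.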

ResponseSizeExceeded error variant — replaces the inline QueryExecutionFailed string in format_rows_response with a dedicated variant so size-limit hits appear as error_type="ResponseSizeExceeded" in mcp_errors_total rather than "ExecutionError".
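
A rough sketch of why a dedicated variant helps, assuming an error enum with per-variant type strings and JSON-RPC codes. The enum name, the variant fields, the message text, and the code assigned to the validation variant are all hypothetical; only the `ResponseSizeExceeded` and `QueryExecutionFailed` variant names, the `"ResponseSizeExceeded"` and `"ExecutionError"` type strings, and the `INTERNAL_ERROR` mapping for size-limit hits come from this description. The `Display` impl is written by hand to keep the sketch std-only, whereas the PR mentions an `#[error]` attribute.

```rust
use std::fmt;

/// JSON-RPC 2.0 reserved code for "Internal error".
const INTERNAL_ERROR: i32 = -32603;
/// JSON-RPC 2.0 reserved code for "Invalid params".
const INVALID_PARAMS: i32 = -32602;

#[derive(Debug)]
enum McpErrorSketch {
    /// Hypothetical variant behind the "ValidationError" type string.
    Validation(String),
    QueryExecutionFailed(String),
    /// Dedicated variant; the `size` and `limit` fields are assumptions.
    ResponseSizeExceeded { size: usize, limit: usize },
}

impl McpErrorSketch {
    /// Value used for the `error_type` label.
    fn error_type(&self) -> &'static str {
        match self {
            McpErrorSketch::Validation(_) => "ValidationError",
            McpErrorSketch::QueryExecutionFailed(_) => "ExecutionError",
            McpErrorSketch::ResponseSizeExceeded { .. } => "ResponseSizeExceeded",
        }
    }

    /// JSON-RPC error code returned to the client.
    fn error_code(&self) -> i32 {
        match self {
            // Assumption: validation failures map to "Invalid params".
            McpErrorSketch::Validation(_) => INVALID_PARAMS,
            McpErrorSketch::QueryExecutionFailed(_) => INTERNAL_ERROR,
            McpErrorSketch::ResponseSizeExceeded { .. } => INTERNAL_ERROR,
        }
    }
}

impl fmt::Display for McpErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            McpErrorSketch::Validation(msg) => write!(f, "validation failed: {msg}"),
            McpErrorSketch::QueryExecutionFailed(msg) => write!(f, "query failed: {msg}"),
            McpErrorSketch::ResponseSizeExceeded { size, limit } => {
                write!(f, "response of {size} bytes exceeds limit of {limit} bytes")
            }
        }
    }
}
```

Both variants share the same JSON-RPC code, so clients see no protocol change; the distinction only shows up in the `error_type` label.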

Tests

  • Extended test_mcp_error_codes to assert that ResponseSizeExceeded maps to the INTERNAL_ERROR error code and the "ResponseSizeExceeded" error type string.
  • Existing test_format_rows_response_errors_when_over_limit continues to pass unchanged (the error message text is preserved by the new variant's #[error] attribute).

Adds a `McpMetrics` struct tracking five time series at the JSON-RPC
protocol layer, complementing the existing HTTP-level `PrometheusLayer`
metrics:

- `mcp_requests_total{endpoint, method, status}` — per JSON-RPC method
  (initialize / tools/list / tools/call) and outcome
- `mcp_tool_calls_total{endpoint, tool, status}` — per tool and outcome
- `mcp_tool_duration_seconds{endpoint, tool}` — tool execution latency
  histogram
- `mcp_errors_total{endpoint, error_type}` — error breakdown by type
  (ValidationError, ExecutionError, ResponseSizeExceeded, etc.)
- `mcp_timeouts_total{endpoint}` — requests that hit the 60 s timeout

The timeout arm now increments `requests_total` and `errors_total`
immediately (not delayed until the background task finishes) by
capturing the method name before the spawn.

Also replaces the inline `QueryExecutionFailed` string in
`format_rows_response` with a dedicated `ResponseSizeExceeded` error
variant so size-limit hits are distinguishable in `errors_total`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jubrad jubrad changed the title environmentd: add Prometheus metrics for MCP endpoints environmentd: add Prometheus metrics for MCP endpoints DEX-2 May 13, 2026