
starknet_transaction_prover: /health returns 503 when service is saturated#14171

Open
avi-starkware wants to merge 1 commit into avi/prover-v3/panic-counter from avi/prover-v3/saturation-health



@avi-starkware
Collaborator

Adds SaturationMonitor (shared by ProvingRpcServerImpl and
HealthLayer) that tracks whether the concurrency semaphore has been
continuously rejecting proving requests. Once that has held for the
configured window (health_max_saturated_ms, default 10s), /health
returns 503 with an opaque body so load balancers can drain the pod
before in-flight requests start failing.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
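
For readers without the diff at hand, the idea of the monitor can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the field layout, the `Mutex`, the `*_at` variants that take an explicit `Instant` (added here so the behavior can be checked without sleeping), and the `>=` comparison against the window are all assumptions.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical sketch: remembers when the current unbroken run of
/// rejections started, and reports saturation once that run has lasted
/// at least `max_saturated`.
#[derive(Debug)]
pub struct SaturationMonitor {
    /// `None` while the most recent acquire attempt succeeded
    /// (or before anything was rejected).
    saturated_since: Mutex<Option<Instant>>,
    /// Corresponds to `health_max_saturated_ms` in the description.
    max_saturated: Duration,
}

impl SaturationMonitor {
    pub fn new(max_saturated: Duration) -> Self {
        Self { saturated_since: Mutex::new(None), max_saturated }
    }

    /// Records a rejection at `now`. Only the first rejection of a run
    /// sets the start time; later ones leave it untouched.
    pub fn mark_rejected_at(&self, now: Instant) {
        // `unwrap` panics if the mutex is poisoned; acceptable for a sketch.
        let mut since = self.saturated_since.lock().unwrap();
        if since.is_none() {
            *since = Some(now);
        }
    }

    /// Convenience wrapper using the current time.
    pub fn mark_rejected(&self) {
        self.mark_rejected_at(Instant::now());
    }

    /// A successful acquire ends the run of rejections.
    pub fn mark_accepted(&self) {
        *self.saturated_since.lock().unwrap() = None;
    }

    /// True once rejections have been continuous for the whole window.
    pub fn is_saturated_at(&self, now: Instant) -> bool {
        let since = *self.saturated_since.lock().unwrap();
        match since {
            Some(start) => now.saturating_duration_since(start) >= self.max_saturated,
            None => false,
        }
    }

    /// Convenience wrapper using the current time.
    pub fn is_saturated(&self) -> bool {
        self.is_saturated_at(Instant::now())
    }
}
```

In this sketch a single accepted request resets the window, so a pod that is merely busy (some requests rejected, some accepted) never reports saturation; only an unbroken run of rejections does.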

@cursor

cursor Bot commented May 24, 2026

PR Summary

Medium Risk
Changes load-balancer health signaling and pod-draining behavior. Mis-tuned thresholds could drain healthy pods or leave overloaded ones in rotation, though the default window and the recovery on the next accepted request limit the blast radius.

Overview
Adds saturation-aware /health so load balancers can drain pods when the prover stays at its concurrency limit.

A shared SaturationMonitor records when prove_transaction hits the semaphore (mark_rejected) and clears on a successful acquire (mark_accepted). HealthLayer uses that state: after health_max_saturated_ms (default 10s) of continuous rejections, GET /health returns 503 with an opaque {"status":"unhealthy","reason":"saturated"} body; a successful accept restores 200. Wiring runs through main, ProvingRpcServerImpl, and HTTP/HTTPS start_server / start_tls_server, with HEALTH_MAX_SATURATED_MS / config file support. Unit tests cover the monitor and health behavior.
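
The mapping from saturation state to the `/health` response, and the handling of the `HEALTH_MAX_SATURATED_MS` setting, might look roughly like the sketch below. It is illustrative only: the function names `max_saturated_from` and `health_response`, the fallback to the default when the value does not parse, and the `{"status":"ok"}` body for the healthy case are assumptions. Only the 503 body, the setting name, and the 10s default come from the summary above.

```rust
use std::time::Duration;

/// Default window stated in the PR description (10s).
const DEFAULT_HEALTH_MAX_SATURATED_MS: u64 = 10_000;

/// Hypothetical helper: turns the raw `HEALTH_MAX_SATURATED_MS` value
/// (if any) into a window. Falling back to the default on a parse error
/// is an assumption of this sketch.
pub fn max_saturated_from(raw: Option<&str>) -> Duration {
    let ms = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_HEALTH_MAX_SATURATED_MS);
    Duration::from_millis(ms)
}

/// Hypothetical helper: maps "how long have rejections been continuous"
/// to an HTTP status and body. `None` means the last acquire succeeded.
pub fn health_response(
    saturated_for: Option<Duration>,
    max_saturated: Duration,
) -> (u16, &'static str) {
    match saturated_for {
        // Continuous rejections for the whole window: ask the LB to drain.
        Some(elapsed) if elapsed >= max_saturated => {
            (503, r#"{"status":"unhealthy","reason":"saturated"}"#)
        }
        // Not saturated, or not yet for long enough. Body is assumed.
        _ => (200, r#"{"status":"ok"}"#),
    }
}
```

A caller could obtain the raw value with `std::env::var("HEALTH_MAX_SATURATED_MS").ok()` and pass it as `raw.as_deref()`. Keeping both helpers free of I/O makes the threshold logic easy to unit-test.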

Reviewed by Cursor Bugbot for commit e084131.

Collaborator Author

avi-starkware commented May 24, 2026

@reviewable-StarkWare

This change is Reviewable

@avi-starkware force-pushed the avi/prover-v3/panic-counter branch from cbd1def to e503ebd on May 24, 2026 16:48
@avi-starkware force-pushed the avi/prover-v3/saturation-health branch from 318c9c2 to 53381dd on May 24, 2026 16:48
@avi-starkware force-pushed the avi/prover-v3/panic-counter branch from e503ebd to db503b7 on May 26, 2026 08:43
@avi-starkware force-pushed the avi/prover-v3/saturation-health branch 2 times, most recently from d477f5e to ef3cf0b on May 26, 2026 12:16
@avi-starkware force-pushed the avi/prover-v3/panic-counter branch from 1da27e9 to ac98d86 on May 26, 2026 12:17
@avi-starkware force-pushed the avi/prover-v3/saturation-health branch from ef3cf0b to eb8da8d on May 26, 2026 12:17