feat(curtailment): Stop, staggered restore, max-duration enforcement by rongxin-liu · Pull Request #232 · block/proto-fleet

rongxin-liu · 2026-05-14T10:42:10Z

Background

Curtailment ships in stages. The contract foundation, persistence + preview, and the operator-facing Start + dispatch + reconciler all landed via #192. What's still missing is the path back: an operator-facing way to end an event, and the system's ability to bring miners back online safely. This PR closes that gap and removes the production gate on StartCurtailment.

Summary

StopCurtailment

The operator-facing end-of-event RPC. Validates min_curtailed_duration_sec has elapsed (admin caller can supply force=true to bypass the hysteresis gate), transitions the event to restoring, and in the same transaction flips every non-terminal target to desired_state='active', resets state to pending, and clears the per-phase metadata (retry_count, last_dispatched_at, last_batch_uuid, confirmed_at, last_error). Terminal targets are untouched. The reset gives the restorer a DB-backed, unambiguous work queue that survives a fleetd restart.

min_curtailed_duration_sec only gates active events — pending events haven't curtailed anything yet, so the elapsed-since-started_at check doesn't apply. Idempotent re-Stop: a second call against an already-restoring event echoes the persisted row without touching targets. Terminal events return FailedPrecondition; cross-org event UUIDs return NotFound. These semantics are documented on the StopCurtailment proto RPC declaration so SDK consumers see the idempotent-restoring vs. non-retryable-terminal split without reading the handler.

StopCurtailmentRequest carries event_uuid (tag 1) and an admin-gated force boolean (tag 3). Tag 2 is reserved — both the number and the name restore_batch_size_override — so an old client's varint payload on the deleted slot can never silently alias into force=true. The handler enforces the admin role on force=true before reaching the service, so non-admin callers see PermissionDenied and never trip the bypass.

Restorer

A DB-backed step inside the reconciler tick rather than a separate goroutine, so a restart resumes restore from persisted state without rehydrating in-memory workers. Each tick scans restoring events and:

Confirms any dispatched restore targets via telemetry (baseline_power_w × 0.5 threshold, falling back to a positive-hash signal when baseline is null). A target whose candidate row has vanished (device unpaired or deleted) burns retry budget via recordDispatchFailure instead of pinning the event indefinitely.
Transitions the event terminal — COMPLETED or COMPLETED_WITH_FAILURES based on RestoreFailed count — once every target is terminal.
Claims the next batch when both gates pass:
- No prior batch in flight: zero restore-phase targets in Dispatched or Drifted.
- Inter-batch interval elapsed: newest restore-phase last_dispatched_at is null or older than restore_batch_interval_sec.

The claim selects up to effective_batch_size pending targets ordered by device_identifier, dispatches a single batched Uncurtail call covering all of them (sharing one batch_uuid), and per-device commits state transitions based on the kept/skipped split returned by the command service. Bulk dispatch errors and per-device skips route through the same recordDispatchFailure retry path the Curtail dispatch uses, and the per-device audit string comes from a single skippedDeviceReason helper shared with single-device dispatch (Reason > FilterName > generic preflight fallback). The reconciler self-bypass on CurtailmentActiveFilter covers Uncurtail traffic too, so restore batches aren't blocked by the very event they're restoring.

Each tick body is bounded by a 2 × TickInterval deadline. Per-event contexts derive from the same tick budget, so the loop-entry gate and per-event deadline together cap total tick wall-time. A stuck DB during a large restore (ResetCurtailmentTargetsForRestore over thousands of targets, slow ListCandidates on a large fleet) returns an error to the tick instead of hanging the goroutine; the heartbeat upsert uses its own background-derived 5s deadline so liveness reporting stays honest even when the tick body errors out. Telemetry-timeout aging in the restore phase logs a dedicated slog line before recordDispatchFailure so triage can distinguish timed-out-and-wrote from timed-out-DB-error.

Adaptive batch sizing — computed at Start

effective_batch_size = max(restore_batch_size, ceil(0.01 × selected_target_count)) clamped to [10, 100], stamped on the event row at Start time from the selected-target count and read back by Stop, the reconciler's max-duration arm, and the restorer's batch claim. The column is non-null from event creation. StartCurtailmentResponse.event.effective_batch_size echoes the computed value, so SDK consumers see the resolved batch size in the same response as the new event UUID rather than needing a follow-up GetActiveCurtailment poll.

A 5,000-miner event therefore restores in batches of 50; a 10K+ event caps at 100. Operators tune the formula's floor via the restore_batch_size field on StartCurtailmentRequest.

One non-terminal event per org

Migration 000051 adds a partial unique index uq_curtailment_event_one_non_terminal_per_org ON curtailment_event (org_id) WHERE state IN ('pending','active','restoring'). InsertEventWithTargets maps the unique-violation to a typed interfaces.ErrCurtailmentNonTerminalEventExists sentinel, and Service.Start catches it: a concurrent Start that races the selector's per-device exclusion check now fails at insert time with AlreadyExists, surfaced with the existing event's UUID + state. The AlreadyExists DebugMessage embeds that identity as the stable substring (event_uuid=<uuid>, state="<state>") — documented on the StartCurtailment RPC declaration so SDK consumers can recover the existing event without parsing an undocumented format.

This closes the cross-event scope-overlap gap at the DB level. The rule is org-wide rather than device-set-aware (two non-overlapping device sets in the same org are still mutually exclusive), which is sufficient for v1's manual-trigger use case.

Migration 000051 also bounds the operational controls at the DB layer with CHECK constraints on max_duration_seconds (NULL OR 0 < x <= 604800) and restore_batch_interval_sec (0 <= x <= 3600). The migration is safe without CONCURRENTLY / NOT VALID because curtailment_event has zero rows at deploy (the production StartCurtailment gate is removed by the same PR); the file documents this assumption inline for any future re-application against a populated table.

max_duration_seconds enforcement

Reconciler-side enforcement transitions an active event to restoring when the cap elapses since started_at. This closes the prior gap where a target sitting in Dispatched with no positive telemetry stays there indefinitely under drift detection alone — the cap now drags the event into restore regardless. allow_unbounded events and events with nil started_at short-circuit out.

When max_duration_seconds is unset and !allow_unbounded, Service.Start fills the value from orgConfig.MaxDurationDefaultSec and re-checks the absolute upper bound (maxFiniteDurationSeconds = 604800) on the normalized value. A misconfigured org default surfaces as InvalidArgument (org's max_duration_default_sec must be <= 604800) instead of tunnelling past validateStartRequest and tripping the DB CHECK constraint as Internal.

StartCurtailment gate flipped

With Stop, the restorer, and reconciler-side max-duration enforcement all in place, an active event now has both an operator-facing exit (Stop) and a wall-clock safety net. The Handler.startEnabled field, in-handler gate check, and the coupling test that pinned the gate are removed. main.go constructs the handler with the lean NewHandler(curtailmentSvc).

Restore semantics worth flagging

Restore does not re-check device_status. The operator who set force_include_maintenance=true owns restore timing. Operators expecting a maintenance window to outlast a curtailment event must size max_duration_seconds, not rely on schedule-coupled timing.
Per-device FIFO awareness. Stop does not cancel any already-enqueued Curtail; the Uncurtail queues behind earlier pending or processing work. The restorer accounts for this via the in-flight gate rather than assuming Uncurtail supersedes.
BeginCurtailmentRestoration concurrency model. The load-bearing guarantee is the WHERE state IN ('pending','active') row-lock guard combined with the per-org partial unique index added in migration 000051. The store's ErrNoRows re-read path distinguishes "already restoring" (idempotent) from "already terminal" (FailedPrecondition).

Wiring

cmd/fleetd/main.go constructs the handler with the single-argument NewHandler and the wiring comment documents the current surface: Preview, Start, and Stop are wired; the reconciler drives non-terminal events terminal (including max-duration-based forced restore); UpdateCurtailmentEvent, GetActiveCurtailment, and ListCurtailmentEvents return Unimplemented pending the read APIs ticket.

What is intentionally not in this PR

UpdateCurtailmentEvent for operator-safe mutation of reason and restore controls — lands with the read/audit work.
GetActiveCurtailment and ListCurtailmentEvents — same.
Metrics pipeline wiring and the heartbeat staleness alert — blocked on the unresolved metrics-pipeline shape decision.
Webhook ingestion and signal adapters.

Test plan

go build ./... and go vet ./... clean. golangci-lint run ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./cmd/fleetd/... reports 0 issues. go test on the curtailment unit surface is green (DB-dependent integration tests fail on local connection-refused as usual).

Coverage by area:

Service.Stop — happy path, idempotent already-restoring, terminal rejection, min_curtailed_duration_sec gate with force=true bypass, EMERGENCY priority does not bypass the gate at Stop time (regression), pending-event short-circuit, store-error propagation.
Service.Start — TestService_Start_StampsEffectiveBatchSize pins the adaptive formula at Start time (floor 10, restore_batch_size floor, 5K → 50, 10K ceiling at 100). TestService_Start_NonTerminalOverlapReturnsAlreadyExists pins the per-org partial unique index path; TestService_Start_NonTerminalOverlapWithGetActiveErrorReturnsBareAlreadyExists pins the identity-lookup-fallback sub-branch. TestService_Start_RejectsOrgDefaultAboveAbsoluteCeiling pins the post-normalization upper-bound guard on the org-default path. TestService_Start_RejectsAdminRestoreBatchIntervalAboveAbsoluteCeiling pins the absolute ceiling for admin callers (above the non-admin cap but still bounded).
ComputeEffectiveBatchSize — small-fleet floor, 5K-fleet → 50, 10K+ ceiling at 100, restore_batch_size floors the formula, negative restore_batch_size handling.
StopCurtailment handler — happy path through the full handler → service → store flow, missing session Unauthenticated, malformed UUID InvalidArgument, force admin role gate (PermissionDenied for non-Admin), admin force=true flowing through and bypassing the min-duration gate, ListTargetsByEvent error propagation after a committed transition.
StartCurtailment handler — happy path asserts EffectiveBatchSize on the wire response (proves toStartResponse populates the computed value rather than always-zero).
Reconciler enforceMaxDuration — elapsed transitions to restoring (no drift dispatch), not-elapsed no-ops, allow_unbounded skips cap, nil started_at no-ops, BeginRestoreTransition error path skips drift this tick.
Reconciler observeRestoring — claim dispatches one Uncurtail per batch with shared batch_uuid, in-flight gate blocks claim, interval gate blocks claim, telemetry-confirmed dispatched promotes to Resolved (and triggers terminal transition same tick), all-resolved → COMPLETED, mixed-with-failures → COMPLETED_WITH_FAILURES. TestReconciler_Restoring_MissingCandidateDuringConfirmConsumesRetryBudget pins the nil-candidate guard in the restore phase, mirroring the curtail-phase analog.
Reconciler tick budget — TestReconciler_RunTickStopsWhenTickBudgetExpires pins the per-event context bound; later events cancel when the shared tick budget expires.
isRestored — power above/at/below baseline × 0.5 thresholds, baseline-nil positive-hash fallback, baseline-nil zero-hash not-restored, no-telemetry not-restored, baseline-zero hash fallback.
skippedDeviceReason — priority-order contract (Reason wins, then FilterName, then generic preflight fallback) pinned across all three branches.

Closes #231
Refs #192

github-actions · 2026-05-14T10:46:06Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (032712636d79888dce558d32ebe7ecc2391b0617...9fded5863cf56eb8513c5443b9a285a25c5d3613, exact PR three-dot diff)

Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Pending curtailment dispatches can stall forever and bypass `max_duration_seconds`

Category: Reliability
Location: server/internal/domain/curtailment/reconciler/reconciler.go:519
Description: After a Curtail command is successfully enqueued, confirmOneDispatched simply returns when telemetry does not prove the miner is curtailed. There is no dispatch-confirmation timeout or retry burn for this path. Since max_duration_seconds is only enforced for active events, an event with a dispatched target that never confirms remains pending indefinitely.
Impact: A single stale or broken telemetry stream can leave an org with a permanent non-terminal curtailment event. The new per-org non-terminal unique index then blocks future curtailment events, and already-curtailed miners may remain curtailed beyond the configured max duration.
Recommendation: Add a pending-phase dispatch confirmation timeout, similar to restore telemetry timeout, that either retries or marks the target failed. Also consider enforcing max_duration_seconds from event creation/first dispatch so pending events cannot outlive the cap.

[MEDIUM] Non-admin Stop can bypass minimum curtailed duration during pending rollout

Category: Authorization | Reliability
Location: server/internal/domain/curtailment/service.go:840
Description: checkMinCurtailedDurationGate skips the gate unless the event state is active. But pending events can already have dispatched or confirmed targets while waiting for the remaining targets to confirm. A normal non-admin caller can call Stop during that pending window and restore already-curtailed devices without waiting for min_curtailed_duration_sec.
Impact: The admin-only force bypass is weakened because operators can avoid the min-duration hysteresis by stopping before the event flips to active. This can cause rapid curtail/restore flapping and defeats the operational safety control.
Recommendation: Do not exempt all pending events. Either reject non-force Stop once any target has been dispatched/confirmed, or enforce the gate transactionally using target timestamps/first confirmation. Re-check the state inside BeginRestoreTransition so pending-to-active races cannot bypass the gate.

Notes

No SQL injection, command injection, pool/wallet hijacking, or protobuf wire-compatibility issues stood out in the scoped diff. The main risk is the new curtailment lifecycle becoming stuck or bypassing its own duration controls.

_{Generated by Codex Security Review |

Triggered by: @rongxin-liu |

Review workflow run}

Adds the operator-facing StopCurtailment write path: - New sqlc queries BeginCurtailmentRestoration (event-side flip to 'restoring' + effective_batch_size stamp, state-guarded so already-restoring or terminal events fall through to a no-op row count) and ResetCurtailmentTargetsForRestore (non-terminal targets flip to desired_state='active', state='pending', cleared phase-local cursors). Composed inside the SQL store as a single tx via WithTransaction. - Store.BeginRestoreTransition method on the curtailment interface + SQL implementation. Idempotent against re-Stop (returns the current row unchanged); FailedPrecondition on a terminal event; NotFound for cross-org event UUIDs. A concurrent state-change between the pre-read and the UPDATE is handled by a re-read so the caller observes the latest persisted state. - Service.Stop with adaptive batch sizing — max(restore_batch_size, ceil(0.01 × non_terminal_target_count)) clamped to [10, 100] — plus the min_curtailed_duration_sec gate that applies only to active events (pending events haven't curtailed anything yet) with EMERGENCY priority bypass. The per-request restore_batch_size_override (admin-gated upstream) skips the formula entirely and is bounded at the service layer to [1, 200], the BE-4 fleet-class ceiling that the proto bound will be tightened to match. - Handler.StopCurtailment wired: replaces the Unimplemented stub with the service call, preserving the existing requireAdminFromContext gate on the override field. Session-derived OrgID, no actor change. - New translate helpers (toStopRequest, toStopResponse, toEventProto, plus enum mappers for event_state/mode/strategy/level/priority) that build a CurtailmentEvent wire message from the persisted row. Scope and mode_params are intentionally left empty for now — reconstructing those from JSONB lands with the read APIs. - Shared models.DesiredStateCurtailed / models.DesiredStateActive constants so the SQL literal values have a single Go-side source of truth, and fake-store stubs (preview/start service tests + handler start test + reconciler test) implementing the new interface method.

Replaces the BE-3 'restoring is a no-op' stub with an end-to-end restorer arm and a time-based escape hatch on active events: - observeRestoring runs three phases per tick: (1) confirm any dispatched restore targets via telemetry (baseline-relative power_w > baseline × 0.5, with a positive-hash fallback when baseline is null); (2) transition the event terminal once every target is Resolved/Released/RestoreFailed (chooses COMPLETED vs COMPLETED_WITH_FAILURES based on failure count, stamps ended_at); (3) claim the next batch when both gates pass. - Restore-batch gating mirrors the BE-4 plan: no prior batch in flight (no restore-phase targets in Dispatched or Drifted) AND inter-batch interval elapsed (newest restore-phase last_dispatched_at older than restore_batch_interval_sec). Both gates fail closed; the in-flight gate is the primary inrush protection when interval is zero. - Batched Uncurtail dispatch: one command call covers up to effective_batch_size devices, sharing a single batch_uuid. Per-device outcomes (kept vs filtered-skip) drive individual state updates; bulk dispatch errors route every claim target through retry budget the same way BE-3 Curtail does. Missing candidate row burns budget rather than stalling so vanished devices reach RestoreFailed instead of hanging the event. - enforceMaxDuration runs before drift detection on active events: when (now - started_at) >= max_duration_seconds, the reconciler calls BeginRestoreTransition with the adaptive-formula batch size (no override available without an operator request). This closes the BE-3 stuck-Dispatched gap at the event boundary — a target with no positive telemetry is dragged into restore by the cap, not left sitting in Dispatched forever. AllowUnbounded and pending events short-circuit. - isRestored predicate is the symmetric counterpart of isCurtailed for the restore-confirm path; it requires positive evidence so a flaky sensor cannot trigger a premature Resolved transition.

…he override ceiling With Stop, the restorer, and reconciler-side max_duration_seconds enforcement all in place, an Active event now has both an operator-facing exit (StopCurtailment) and a wall-clock safety net (max-duration forced restore). The 'gated until BE-4 ships' rationale no longer applies, so the Handler.startEnabled field, the in-handler gate check, and the TestHandler_StartCurtailment_DisabledFlagReturnsUnimplemented coupling test are all removed. main.go constructs the handler with the lean single-argument NewHandler(curtailmentSvc) and the wiring comment now documents what is and isn't reachable in the current surface. Separately, the proto bound on restore_batch_size_override drops from the ungrounded [1, 10000] sanity ceiling to [1, 200] — 2x the adaptive formula's upper band (100). 200 short-circuits inrush staggering for a single-rack restore but stays narrow enough that an admin typo cannot silently dump 10K miners at once. Service.Stop already enforces the same 200 ceiling at the service layer, so the two checks agree.

…ounds Stop service tests (service_stop_test.go) pin the happy path, the idempotent already-restoring branch, terminal-event rejection, the min_curtailed_duration gate and its EMERGENCY bypass, override precedence over the adaptive formula, and override bound rejection at both ends of the [1, 200] window. ComputeEffectiveBatchSize is exercised across the realistic fleet sizes (small → floor, 5000 → 50, 10K+ → ceiling) so a regression in the inrush envelope shows up immediately. Stop handler tests (handler_stop_test.go) cover the happy path through the full handler→service→store flow, missing-session Unauthenticated, malformed-UUID InvalidArgument, the admin role gate on restore_batch_size_override, and the admin override flowing through to the persistence boundary so an admin-driven number doesn't get silently dropped on the floor. Reconciler restorer arm tests (restore_test.go) cover the four enforceMaxDuration branches (elapsed/not-elapsed/allow_unbounded/nil started_at), the observeRestoring three-phase loop (claim+dispatch, in-flight gate, interval gate, telemetry confirmation), the completion paths (all-resolved→COMPLETED, mixed→COMPLETED_WITH_FAILURES), and the isRestored predicate including the baseline-null hash fallback that protects the operator from a spurious "restore complete" call when the power-side telemetry is missing. A small fix lands alongside: the gosec int→int32 saturation around the non-terminal-count → ComputeEffectiveBatchSize call site, exhaustive switch statements in the target-state count loops, and a cap variable renamed to maxDur to stop shadowing the builtin.

A target Dispatched in the restore phase that never returns positive telemetry evidence would stay there indefinitely. confirmOneRestore only consumed retry budget when the candidate row was nil (vanished device); a present device with nil or missing power and hash samples never advanced. The in-flight gate on the next batch claim then blocked all subsequent batches, and enforceMaxDuration does not close this gap (it only transitions active to restoring). Add RestoreDispatchTimeoutSec (default 300s = 10x the tick interval). confirmOneRestore now consumes retry budget when a Dispatched target has been there longer than the timeout, draining to RestoreFailed at exhaustion so the event can still progress. Also picked up during the review pass: - Move countNonTerminalTargets to models.CountNonTerminalTargets so service.Stop and reconciler.enforceMaxDuration share one helper. - Document BeginCurtailmentRestoration's atomicity model inline (UPDATE WHERE state-guard + partial-unique-index; ErrNoRows re-read terminal-state risk). - Document that service.Stop's pre-read terminal/restoring check is a fast-path; BeginRestoreTransition is the authoritative atomic guard. - Document the strict > vs <= asymmetry between isRestored and isCurtailed, and the no-progress band at exactly baseline*factor. - Forward-compat comment on maybeClaimRestoreBatch's unreachable Drifted arm; comment on the interval gate loop documenting that terminal targets retain LastDispatchedAt as the spacing reference. - Tighten translate.toStopRequest overflow-check comment. - Inline single-use desiredStateCurtailed alias. - Strip internal ticket prefixes from source comments. - Trim multi-paragraph rationale on adaptive batch-sizing constants and ComputeEffectiveBatchSize godoc. - Fix stale cross-reference comment on StartCurtailmentRequest.restore_batch_size; regenerate bindings.

Fills coverage gaps surfaced during the review pass and pins the new restore-dispatch age-out behavior. - Restore age-out exhaustion drains the target to RestoreFailed and takes the event to COMPLETED_WITH_FAILURES; happy path within the timeout stays in Dispatched. - enforceMaxDuration with a BeginRestoreTransition error leaves the event unchanged and still runs drift dispatch in the same tick. - dispatchRestoreBatch under an Uncurtail error and under an empty BatchIdentifier keeps the batch Pending with LastError set and RetryCount incremented. - dispatchRestoreBatch per-device filter-skip in the restore phase mirrors the curtail-phase coverage: kept target -> Dispatched, skipped target -> Pending with retry consumed. - Service.Stop returns NotFound for an unknown event UUID, propagates ListTargetsByEvent store errors, and the MinDurationGate test now pins GRPCCode = FailedPrecondition. - Restorer-shared batch_uuid is asserted by value (both batched targets reference the same LastBatchUUID), not just by comment. - isRestored table picks up the power-present, baseline-nil, hash-nil case so the line-1035 hash-nil guard has direct coverage.

Post-review simplification pass surfaced four small wins: - service_test fakeStore: remove three fields that were declared but never read or assigned in any test (beginRestoreResult, getEventByUUIDLastOrg, listTargetsLastOrg). The dead capture-state was a misleading escape hatch. - service.Stop: drop the nine-line int->int32 MaxInt32 saturation in favour of the same //nolint:gosec one-liner the reconciler's enforceMaxDuration already uses. The count is bounded by the per-event target rows; the saturation block was dead defensive code. - maybeCompleteRestoring: add a default arm to the per-target switch so an unknown state from a future migration falls through as non-terminal rather than silently completing the event prematurely. - observeActive: drop the block comment narrating the enforceMaxDuration call site; the function name and its godoc carry the rationale.

…Stop failure Two related Codex-flagged issues on the BE-4 surface plus one defensive test, all small and self-contained. P1: enforceMaxDuration's BeginRestoreTransition error path returned false, falling through to drift detection. Past the max_duration cap, that meant new Curtail commands could be issued to miners the event was supposed to be releasing — extending curtailment past the contracted ceiling on every transient DB failure. Now returns true on error so observeActive skips drift; next tick retries the transition. The corresponding test (was pinning the wrong behavior) is updated to assert curtailCalls=0 on the failure tick. P2: BeginRestoreTransition's ErrNoRows re-read returned the latest event unconditionally, including terminal states. A concurrent AdminTerminateEvent between the pre-read and the UPDATE caused Stop to echo a 200 with a terminal event masquerading as restoring. The re-read now routes by latest.State: terminal → FailedPrecondition, restoring → idempotent success, anything else → InternalError. P3: New table-style test for maybeCompleteRestoring's default arm. The arm is dead code today (all current TargetState values are explicitly covered), but the test pins the defensive "unknown state keeps the event non-terminal" contract against future schema additions.

…ment The prior query comment claimed concurrency safety relied on "the partial-unique-index on (org_id, state) for active events." No such unique index exists; the actual index `idx_curtailment_event_active` is a plain partial btree. Concurrency safety rests on the `WHERE state IN ('pending','active')` row-lock guard alone. Replace the comment with the actual model so a future maintainer does not trust a fictitious uniqueness guarantee and remove the state guard.

Adds the server-resolved restore batch size as a field on the wire CurtailmentEvent and populates it from the persisted event row in toEventProto. Restore-time clients (Stop response, future read APIs) can now display the value the server actually drained at — important when the adaptive [10, 100] envelope or an operator `restore_batch_size_override` resolves to something other than the caller's `restore_batch_size` preference. Also tightens the StopCurtailment RPC docstring to make the terminal-race contract explicit: idempotent on an already-restoring event; FailedPrecondition when the event is terminal or a concurrent transition raced past restoring; treat as non-retryable.

Derive a `2 * TickInterval` per-tick deadline in runTick and pass it to ListNonTerminalEvents and processEvent. A stuck DB during a large restore tick (e.g. ResetCurtailmentTargetsForRestore over thousands of targets) now returns an error to the tick body instead of hanging the goroutine indefinitely. The heartbeat keeps using its own background-derived 5s timeout so liveness reporting is preserved even when the tick errors. Extend dispatchRestoreBatch's doc to call out the audit-lineage consequence of the existing restart-safety design: a per-target UpdateTargetState failure after Uncurtail returns success leaves the target Pending; next tick redispatches a second Uncurtail. Uncurtail is device-idempotent via the per-device FIFO, but audit records may show duplicate batches.

Copilot

Pull request overview

Adds the missing “exit path” for curtailment events by implementing operator Stop, a DB-backed staggered restore loop in the reconciler, adaptive restore batch sizing, and reconciler-side max-duration enforcement; this also removes the production gate that previously prevented StartCurtailment from being enabled.

Changes:

Implement StopCurtailment end-to-end (proto docs, handler/service/store, SQL queries) and reset non-terminal targets into a restore work queue.
Extend the reconciler to (a) force restore on max_duration_seconds and (b) drive restoring events to terminal via staggered batched Uncurtail.
Add adaptive restore batch sizing and surface the server-resolved effective_batch_size on CurtailmentEvent; update generated artifacts and wiring.

Reviewed changes

Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
server/sqlc/queries/curtailment.sql	Adds restoration transition + target reset queries.
server/internal/handlers/curtailment/translate.go	Adds Stop request/response translation and event → proto mapping (incl. timestamps and effective batch size).
server/internal/handlers/curtailment/handler.go	Removes Start gate; implements StopCurtailment handler with auth + admin override gate.
server/internal/handlers/curtailment/handler_test.go	Updates handler constructor usage after gate removal.
server/internal/handlers/curtailment/handler_stop_test.go	New Stop handler tests (auth, UUID validation, admin override gating).
server/internal/handlers/curtailment/handler_start_test.go	Removes Start gate test; updates handler constructor and stubs.
server/internal/domain/stores/sqlstores/curtailment.go	Implements `BeginRestoreTransition` in SQL store with transactional semantics.
server/internal/domain/stores/interfaces/curtailment.go	Extends CurtailmentStore interface with `BeginRestoreTransition`.
server/internal/domain/curtailment/service.go	Implements `Service.Stop`, min-duration gating, and `ComputeEffectiveBatchSize`.
server/internal/domain/curtailment/service_test.go	Extends fake store to support Stop-path testing.
server/internal/domain/curtailment/service_stop_test.go	New unit tests for Stop + batch sizing helper.
server/internal/domain/curtailment/reconciler/restore_test.go	New reconciler tests for restoring workflow + max-duration enforcement.
server/internal/domain/curtailment/reconciler/reconciler.go	Adds per-tick deadline, max-duration enforcement, restoring observer, and batched Uncurtail dispatch.
server/internal/domain/curtailment/reconciler/reconciler_test.go	Updates fakes to support restore/max-duration and Uncurtail result overrides.
server/internal/domain/curtailment/models/models.go	Adds desired_state constants and `CountNonTerminalTargets`.
server/generated/sqlc/db.go	Regenerates sqlc prepared statements for new queries.
server/generated/sqlc/curtailment.sql.go	Regenerates sqlc bindings for new restoration queries.
server/generated/grpc/curtailment/v1/curtailmentv1connect/curtailment.connect.go	Updates generated connect client/server docs for Stop semantics.
server/cmd/fleetd/main.go	Removes Start gate wiring; wires handler with new constructor and updates comment.
proto/curtailment/v1/curtailment.proto	Documents Stop semantics; adds `effective_batch_size`; tightens Stop override validation to `lte:200`.
client/src/protoFleet/api/generated/curtailment/v1/curtailment_pb.ts	Regenerates TS proto bindings to reflect proto updates.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46df1b316a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Make restore reconciliation recoverable under slow ticks, clear stale target errors correctly, and expose the active event recovery path needed after Start responses are lost. Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9379e3662

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Ensure GetActiveCurtailment returns active or restoring events before newer pending events so clients recover the event currently affecting miners.

ankitgoswami

left some comments but not 100% sure about the requirements so don't know if they are blocking.

… max_duration check Two small simplifications from review: - CountNonTerminalTargets: the explicit non-terminal case arm did the same thing as default. Collapse to terminal-skip + default-count. Comment notes unknown future states count as non-terminal — the conservative direction. - Service.Start: the upper-bound check on max_duration_seconds duplicates the check already in validateStartRequest. Drop the duplicate. The admin-gate against the org default stays — it is intrinsically post-normalization (req.MaxDurationSeconds may have been resolved from orgConfig.MaxDurationDefaultSec just above).

Adds a partial unique index on curtailment_event(org_id) WHERE state IN ('pending','active','restoring') so a concurrent Start that races the selector's per-device exclusion check fails at insert time instead of silently producing two overlapping events. The unique-violation surfaces as a typed sentinel ErrCurtailmentNonTerminalEventExists from the sqlstore boundary. Service.Start catches the sentinel, reads the existing event via GetActiveEvent, and returns AlreadyExists with the existing event_uuid + state so the caller can act on the conflict (e.g. join the existing event, surface it in the UI). Migration 000051 appends the index alongside the CHECK constraints already in the file. Index name and constraint name pinned in code: nonTerminalEventPerOrgUniqueIndex matches the migration's identifier so only this exact violation is mapped to the sentinel. Adds NewAlreadyExistsError(f) and IsAlreadyExistsError helpers to fleeterror to match the existing per-code patterns.

… the stop override The Stop-time restore_batch_size_override forced a per-Stop admin gate plus a race window (concurrent Stops with conflicting overrides silently dropped all but the first). It was also doubly redundant: the per-event restore_batch_size the operator sets at Start already lets admins tune the adaptive formula. This commit removes the override entirely and moves effective_batch_size computation to Start. The column is stamped from the selected target count when the event row is inserted, so it is non-null from creation. Stop just flips state; the reconciler's max-duration arm and the restorer read the already-stamped value. Proto change: optional uint32 restore_batch_size_override removed from StopCurtailmentRequest (pre-release per BE-1 policy; deleted, not reserved). Tag 2 frees up for the admin-gated `force` field in the next commit. Surface changes: - proto/curtailment.proto: field removed; bindings regenerated. - service: StopRequest loses RestoreBatchSizeOverride; Service.Stop drops ListTargetsByEvent + ComputeEffectiveBatchSize; validateStopRequest loses the override bound check. ComputeEffectiveBatchSize loses its override param. Service.Start (via buildInsertParams) stamps effective_batch_size on the InsertEventParams from the selected target count. - sqlc InsertCurtailmentEvent: effective_batch_size joins the INSERT column list. BeginCurtailmentRestoration drops the SET clause for it. - sqlstores: InsertEventWithTargets forwards EffectiveBatchSize; the BeginRestoreTransition signature loses the int32 parameter. - interfaces: BeginRestoreTransition signature updated. - reconciler.enforceMaxDuration: drops the ComputeEffectiveBatchSize recomputation; the event row already carries the value. - handler.StopCurtailment: drops the override admin-gate block. - translate.toStopRequest: drops the override translation + uint32-overflow safety check. Tests: - service_start_test.go: TestService_Start_StampsEffectiveBatchSize pins the formula at Start (floor 10, restore_batch_size floor, 5K → 50, 10K ceiling at 100). Uses tolerated-undershoot so the FIXED_KW selector takes every candidate. - service_stop_test.go: deleted override-precedence + override-bounds + list-targets-error tests (Stop no longer reads targets or accepts an override). TestComputeEffectiveBatchSize loses its override case + column. - handler_stop_test.go: deleted OverrideRequiresAdmin + PassesAdminOverride. - handler_test.go: removed stopWithBatchOverride from the role-gate matrix (4 cases). - service_test.go / reconciler/reconciler_test.go: BeginRestoreTransition signatures updated; beginRestoreLastBatch capture removed.

…ity bypass Decouples Start-time priority from Stop-time hysteresis bypass. Priority is set at event creation and reflects the curtailment urgency at that moment; reusing it to short-circuit the min_curtailed_duration_sec gate on Stop conflated two different operator intents. The bypass now lives where the decision actually happens. StopCurtailmentRequest gains `bool force = 2` (admin-gated handler-side via requireAdminFromContext, same pattern as the existing override fields on Start/Preview). checkMinCurtailedDurationGate takes a force parameter and bypasses on force=true; the Priority==Emergency shortcut is gone. Error message updated to direct admins to force=true rather than EMERGENCY-priority retry. EMERGENCY priority still skips post_event_cooldown_sec in the selector at Start time — that bypass is at the correct layer and stays. Only the Stop-time bypass moves. Surface changes: - proto/curtailment.proto: bool force = 2 on StopCurtailmentRequest. - handler.StopCurtailment: gates force=true via requireAdminFromContext. - translate.toStopRequest: carries Force from proto to service. - service.StopRequest: adds Force bool. - service.checkMinCurtailedDurationGate: signature change. Tests: - service_stop_test.go: TestService_Stop_ForceBypassesMinDuration replaces the old emergency-priority bypass test. New TestService_Stop_EmergencyPriorityNoLongerBypasses pins the regression: an EMERGENCY-priority event with unelapsed min_curtailed_duration_sec is now blocked unless force=true. - handler_stop_test.go: TestHandler_StopCurtailment_ForceRequiresAdmin (non-admin + Force=true → PermissionDenied) and TestHandler_StopCurtailment_AdminForcePassesThrough. - handler_test.go: Stop force role-gate cases added to the matrix (viewer/admin × session/API-key).

Two tail-end fixups missed from 1e5cf13 (the commit that dropped the int32 parameter from BeginRestoreTransition): - reconciler/restore_test.go: removed the stale beginRestoreLastBatch assertion in TestReconciler_EnforceMaxDuration_ElapsedTransitionsToRestoring; effective_batch_size is now stamped at Start and not touched by the transition. - handlers/curtailment/handler_start_test.go: updated the startStubStore BeginRestoreTransition panic-stub signature to match the new interface.

- Proto + connect/TS regen: rewrite the effective_batch_size docstring to match the actual stamp-at-Start behavior (the column is non-null from event creation, no operator override exists). Same revision touches the models.InsertEventParams godoc. - StopCurtailmentRequest: add `reserved "restore_batch_size_override"` so the deleted name can never be re-introduced by mistake. The field number itself is now `bool force`; reserving the old name documents the history without resurrecting the slot. - StartCurtailment RPC: document the AlreadyExists DebugMessage format as a stable substring so SDK consumers can recover the existing event's identity without parsing an undocumented contract. - Migration 000051: note that CREATE INDEX / ADD CONSTRAINT run without CONCURRENTLY / NOT VALID is safe only because the table is empty at deploy; future re-applications against populated tables should use the online forms.

toStartResponse built CurtailmentEvent from request + plan and never set EffectiveBatchSize, so clients reading the value from Start saw 0 while the persisted row carried the computed batch size. The mismatch forced callers into a follow-up GetActiveCurtailment poll. Stamp the adaptive batch size on the plan in Service.Start (single source of truth for buildInsertParams and the response builder), thread it through toStartResponse, and update the stale toEventProto inline comment that still described the field as nil-until-restoring.

runTick derived each per-event timeout from tickCtx, which already had its own 2×TickInterval deadline. A slow ListNonTerminalEvents or a slow first event would silently shrink the budget available to later events. Derive the per-event context from the parent ctx so each restoring event gets the full 2×TickInterval window; tickCtx.Err() still gates the loop, so total tick duration stays bounded. Also tag the restore telemetry-timeout aging path with a dedicated slog line so triage can distinguish a timed-out-and-wrote target from one where the follow-up UpdateTargetState DB write itself failed.

The ORDER BY CASE preferred active/restoring over pending when both existed, but the partial unique index added in 000051 makes "both exist" impossible: at most one row per org matches the WHERE state IN clause. The CASE expression also defeated idx_curtailment_event_active, forcing a sort over the filtered rows. Drop the ORDER BY entirely; LIMIT 1 now lets the planner satisfy the lookup directly via the partial unique index.

…g, and Stop error paths - Restorer: pin the restore-phase analog of the curtail-phase nil-candidate guard. A Dispatched+Active target whose candidate row has vanished burns retry budget via recordDispatchFailure instead of pinning the event indefinitely. - Service.Start: pin the absolute restore_batch_interval_sec ceiling (3600s) for admin callers. The non-admin cap test already exists; this pins what happens when an admin caller exceeds the absolute upper bound, preventing a future refactor that drops the validateStartRequest check from silently passing through to the DB CHECK constraint. - Service.Start: pin the AlreadyExists fallback when the identity lookup itself fails (insertEvent returns ErrCurtailmentNonTerminalEventExists and the subsequent GetActiveEvent errors out). Caller still gets AlreadyExists; the identity substring is omitted as documented. - StopCurtailment handler: pin the ListTargetsByEvent error propagation path. The transition has already committed when the targets read fails, so the handler returns the error without leaking a partial response.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d721bfef7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

79750ab derived each per-event context from the long-lived parent ctx (workCtx) so that a slow first event would not eat into later events' budgets. That objective is real, but it also reintroduced exactly the failure mode the Codex P1 thread (and the explicit ack at 7c6359f) called out: one tick can run for N × 2×TickInterval when multiple events are present, violating the documented bounded-tick contract. TestReconciler_RunTickStopsWhenTickBudgetExpires fails on HEAD with the parent-derived eventCtx because the second event no longer cancels when tickCtx expires. Restore the tickCtx-derived form. The aging-log addition from 79750ab stays in place.

…re and tighten plan comment The Start handler's translate.go wire path that populates EffectiveBatchSize had no end-to-end assertion; a regression that zeroed plan.EffectiveBatchSize before the translate call would not be caught. Add the assertion to the happy-path handler test (two candidates, no caller preference clamp to the floor of 10). Plan.EffectiveBatchSize's godoc claimed Stop and the reconciler read the value from Plan. Stop never touches Plan, and the reconciler reads from models.Event after persistence. Rewrite the comment to say "echoed in the Start response; Stop and the reconciler read the persisted event row" so future readers don't chase a nonexistent data-flow. Add a clarifying comment to the missing-candidate restore test noting that the disp.uncurtailCalls==0 assertion is enforced by the interval gate, not by the missing-candidate path it primarily pins.

… persist When the caller leaves max_duration_seconds unset and !allow_unbounded, Service.Start fills it from orgConfig.MaxDurationDefaultSec. The upper bound (maxFiniteDurationSeconds = 604800) is checked by validateStartRequest, which ran before the fill and saw nil. A misconfigured org default above the cap would tunnel into the DB and trip ck_curtailment_event_max_duration_bounds as an Internal error instead of a clean InvalidArgument. Mirror the existing <= 0 org-default guard with a > maxFiniteDurationSeconds guard at the same site. Pin with TestService_Start_RejectsOrgDefaultAboveAbsoluteCeiling.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35cc67c417

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Extract candidatesByDeviceID: confirmDispatched, observeActive, and confirmDispatchedRestores each built the same map inline. One helper replaces three identical 4-line loops. - Consolidate skip-reason rendering. dispatchOneCurtail used skipReasonForDevice (linear scan with priority logic), while dispatchRestoreBatch inlined the same priority logic into a map builder. Extract skippedDeviceReason as the single source of the reason-priority rule (Reason > FilterName > generic). skipReasonForDevice becomes a thin wrapper for the single-device path; dispatchRestoreBatch calls it from the map-builder loop. - Trim multi-paragraph rationale comments on Stop, dispatchOneCurtail, observeRestoring, and dispatchRestoreBatch. The WHY (restart-safety gap, gate semantics, audit-double-batch caveat) survives; the WHAT that the code already shows is gone.

CountNonTerminalTargets was exported from the models package with no production caller (only its own test referenced it). Drafted earlier when batch sizing was target-count-based; the current design stamps effective_batch_size from len(plan.Selected) at Start time, so the count helper is unused. Delete the function and its test; the models package still ships the canonical state-set predicates next to it. toStopResponse was a one-line wrapper with a single caller. Inline into the StopCurtailment handler so the response shape is visible at the call site, matching how toStartResponse / toEventProtoWithTargets are already used elsewhere.

Trim multi-paragraph rationale on Service.Start's org-default upper-bound guard, Service.runSelector's pipeline overview, Service.Stop's invariant, and toStartRequest's max_duration_seconds=0 sentinel. The WHY (mismatch surfaces as InvalidArgument vs DB CHECK trip, mutual-exclusion check intent) survives; restated parameter descriptions and pipeline phase names are gone.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4549263ffa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Reserving only the old name left the tag number free, so an older client that still encodes restore_batch_size_override on tag 2 would have its varint payload decoded as force on the new shape — silently flipping force=true on any non-zero value and bypassing the min_curtailed_duration_sec gate. Reserve both number and name on StopCurtailmentRequest, and put force on tag 3. An old client now decodes its tag-2 payload into a reserved slot (ignored), so the Stop hits the min-duration gate and fails loudly with FailedPrecondition rather than aliasing into a silent bypass.

skippedDeviceReason is the shared rendering helper that both the single-device and batch restore dispatch paths route audit strings through. Only the Reason!="" branch was exercised before; the FilterName-only and both-empty fallback branches had no test coverage. Add a table-driven unit test pinning all three cases.

github-actions Bot added server shared labels May 14, 2026

github-actions Bot assigned rongxin-liu May 14, 2026

rongxin-liu mentioned this pull request May 19, 2026

feat(curtailment): StartCurtailment + dispatch + reconciler #192

Merged

Base automatically changed from feat/issue-191-curtailment-start-dispatch-reconciler to main May 19, 2026 15:34

rongxin-liu added 2 commits May 19, 2026 11:36

rongxin-liu force-pushed the feat/issue-231-curtailment-stop-restore branch from 3ed88b5 to 77e78da Compare May 19, 2026 15:48

github-actions Bot added javascript Pull requests that update javascript code client labels May 19, 2026

rongxin-liu added 4 commits May 19, 2026 16:09

rongxin-liu force-pushed the feat/issue-231-curtailment-stop-restore branch from 3db4d3a to ab215a0 Compare May 19, 2026 20:15

rongxin-liu added 5 commits May 19, 2026 16:28

rongxin-liu marked this pull request as ready for review May 20, 2026 14:36

Copilot AI review requested due to automatic review settings May 20, 2026 14:36

rongxin-liu requested a review from a team as a code owner May 20, 2026 14:36

Copilot started reviewing on behalf of rongxin-liu May 20, 2026 14:37 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/models/models.go Outdated

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go

fix(curtailment): address restore review feedback

c9379e3

Make restore reconciliation recoverable under slow ticks, clear stale target errors correctly, and expose the active event recovery path needed after Start responses are lost. Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Comment thread server/sqlc/queries/curtailment.sql Outdated

fix(curtailment): prefer in-effect active event

c61eae3

Ensure GetActiveCurtailment returns active or restoring events before newer pending events so clients recover the event currently affecting miners.

ankitgoswami reviewed May 20, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/service.go Outdated

ankitgoswami reviewed May 20, 2026

View reviewed changes

Comment thread server/sqlc/queries/curtailment.sql

ankitgoswami reviewed May 20, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/service.go Outdated

ankitgoswami reviewed May 20, 2026

View reviewed changes

Comment thread server/internal/handlers/curtailment/translate.go

ankitgoswami approved these changes May 20, 2026

View reviewed changes

rongxin-liu added 10 commits May 20, 2026 20:31

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/service.go Outdated

rongxin-liu added 3 commits May 21, 2026 01:47

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread proto/curtailment/v1/curtailment.proto Outdated

rongxin-liu added 3 commits May 21, 2026 10:12

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread proto/curtailment/v1/curtailment.proto

rongxin-liu added 2 commits May 21, 2026 10:17

rongxin-liu merged commit 2bf18b5 into main May 21, 2026
69 checks passed

rongxin-liu deleted the feat/issue-231-curtailment-stop-restore branch May 21, 2026 15:31

This was referenced May 21, 2026

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E #289

Open

feat(curtailment): operator read/update/admin APIs + audit + metrics #299

Open

Conversation

rongxin-liu commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Summary

StopCurtailment

Restorer

Adaptive batch sizing — computed at Start

One non-terminal event per org

max_duration_seconds enforcement

StartCurtailment gate flipped

Restore semantics worth flagging

Wiring

What is intentionally not in this PR

Test plan

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔐 Codex Security Review

Review Summary

Findings

[HIGH] Pending curtailment dispatches can stall forever and bypass max_duration_seconds

[MEDIUM] Non-admin Stop can bypass minimum curtailed duration during pending rollout

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ankitgoswami left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rongxin-liu commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

[HIGH] Pending curtailment dispatches can stall forever and bypass `max_duration_seconds`