feat(curtailment): Stop, staggered restore, max-duration enforcement#232
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Pending curtailment dispatches can stall forever and bypass
|
Adds the operator-facing StopCurtailment write path: - New sqlc queries BeginCurtailmentRestoration (event-side flip to 'restoring' + effective_batch_size stamp, state-guarded so already-restoring or terminal events fall through to a no-op row count) and ResetCurtailmentTargetsForRestore (non-terminal targets flip to desired_state='active', state='pending', cleared phase-local cursors). Composed inside the SQL store as a single tx via WithTransaction. - Store.BeginRestoreTransition method on the curtailment interface + SQL implementation. Idempotent against re-Stop (returns the current row unchanged); FailedPrecondition on a terminal event; NotFound for cross-org event UUIDs. A concurrent state-change between the pre-read and the UPDATE is handled by a re-read so the caller observes the latest persisted state. - Service.Stop with adaptive batch sizing — max(restore_batch_size, ceil(0.01 × non_terminal_target_count)) clamped to [10, 100] — plus the min_curtailed_duration_sec gate that applies only to active events (pending events haven't curtailed anything yet) with EMERGENCY priority bypass. The per-request restore_batch_size_override (admin-gated upstream) skips the formula entirely and is bounded at the service layer to [1, 200], the BE-4 fleet-class ceiling that the proto bound will be tightened to match. - Handler.StopCurtailment wired: replaces the Unimplemented stub with the service call, preserving the existing requireAdminFromContext gate on the override field. Session-derived OrgID, no actor change. - New translate helpers (toStopRequest, toStopResponse, toEventProto, plus enum mappers for event_state/mode/strategy/level/priority) that build a CurtailmentEvent wire message from the persisted row. Scope and mode_params are intentionally left empty for now — reconstructing those from JSONB lands with the read APIs. - Shared models.DesiredStateCurtailed / models.DesiredStateActive constants so the SQL literal values have a single Go-side source of truth, and fake-store stubs (preview/start service tests + handler start test + reconciler test) implementing the new interface method.
Replaces the BE-3 'restoring is a no-op' stub with an end-to-end restorer arm and a time-based escape hatch on active events: - observeRestoring runs three phases per tick: (1) confirm any dispatched restore targets via telemetry (baseline-relative power_w > baseline × 0.5, with a positive-hash fallback when baseline is null); (2) transition the event terminal once every target is Resolved/Released/RestoreFailed (chooses COMPLETED vs COMPLETED_WITH_FAILURES based on failure count, stamps ended_at); (3) claim the next batch when both gates pass. - Restore-batch gating mirrors the BE-4 plan: no prior batch in flight (no restore-phase targets in Dispatched or Drifted) AND inter-batch interval elapsed (newest restore-phase last_dispatched_at older than restore_batch_interval_sec). Both gates fail closed; the in-flight gate is the primary inrush protection when interval is zero. - Batched Uncurtail dispatch: one command call covers up to effective_batch_size devices, sharing a single batch_uuid. Per-device outcomes (kept vs filtered-skip) drive individual state updates; bulk dispatch errors route every claim target through retry budget the same way BE-3 Curtail does. Missing candidate row burns budget rather than stalling so vanished devices reach RestoreFailed instead of hanging the event. - enforceMaxDuration runs before drift detection on active events: when (now - started_at) >= max_duration_seconds, the reconciler calls BeginRestoreTransition with the adaptive-formula batch size (no override available without an operator request). This closes the BE-3 stuck-Dispatched gap at the event boundary — a target with no positive telemetry is dragged into restore by the cap, not left sitting in Dispatched forever. AllowUnbounded and pending events short-circuit. - isRestored predicate is the symmetric counterpart of isCurtailed for the restore-confirm path; it requires positive evidence so a flaky sensor cannot trigger a premature Resolved transition.
3ed88b5 to
77e78da
Compare
…he override ceiling With Stop, the restorer, and reconciler-side max_duration_seconds enforcement all in place, an Active event now has both an operator-facing exit (StopCurtailment) and a wall-clock safety net (max-duration forced restore). The 'gated until BE-4 ships' rationale no longer applies, so the Handler.startEnabled field, the in-handler gate check, and the TestHandler_StartCurtailment_DisabledFlagReturnsUnimplemented coupling test are all removed. main.go constructs the handler with the lean single-argument NewHandler(curtailmentSvc) and the wiring comment now documents what is and isn't reachable in the current surface. Separately, the proto bound on restore_batch_size_override drops from the ungrounded [1, 10000] sanity ceiling to [1, 200] — 2x the adaptive formula's upper band (100). 200 short-circuits inrush staggering for a single-rack restore but stays narrow enough that an admin typo cannot silently dump 10K miners at once. Service.Stop already enforces the same 200 ceiling at the service layer, so the two checks agree.
…ounds Stop service tests (service_stop_test.go) pin the happy path, the idempotent already-restoring branch, terminal-event rejection, the min_curtailed_duration gate and its EMERGENCY bypass, override precedence over the adaptive formula, and override bound rejection at both ends of the [1, 200] window. ComputeEffectiveBatchSize is exercised across the realistic fleet sizes (small → floor, 5000 → 50, 10K+ → ceiling) so a regression in the inrush envelope shows up immediately. Stop handler tests (handler_stop_test.go) cover the happy path through the full handler→service→store flow, missing-session Unauthenticated, malformed-UUID InvalidArgument, the admin role gate on restore_batch_size_override, and the admin override flowing through to the persistence boundary so an admin-driven number doesn't get silently dropped on the floor. Reconciler restorer arm tests (restore_test.go) cover the four enforceMaxDuration branches (elapsed/not-elapsed/allow_unbounded/nil started_at), the observeRestoring three-phase loop (claim+dispatch, in-flight gate, interval gate, telemetry confirmation), the completion paths (all-resolved→COMPLETED, mixed→COMPLETED_WITH_FAILURES), and the isRestored predicate including the baseline-null hash fallback that protects the operator from a spurious "restore complete" call when the power-side telemetry is missing. A small fix lands alongside: the gosec int→int32 saturation around the non-terminal-count → ComputeEffectiveBatchSize call site, exhaustive switch statements in the target-state count loops, and a cap variable renamed to maxDur to stop shadowing the builtin.
A target Dispatched in the restore phase that never returns positive telemetry evidence would stay there indefinitely. confirmOneRestore only consumed retry budget when the candidate row was nil (vanished device); a present device with nil or missing power and hash samples never advanced. The in-flight gate on the next batch claim then blocked all subsequent batches, and enforceMaxDuration does not close this gap (it only transitions active to restoring). Add RestoreDispatchTimeoutSec (default 300s = 10x the tick interval). confirmOneRestore now consumes retry budget when a Dispatched target has been there longer than the timeout, draining to RestoreFailed at exhaustion so the event can still progress. Also picked up during the review pass: - Move countNonTerminalTargets to models.CountNonTerminalTargets so service.Stop and reconciler.enforceMaxDuration share one helper. - Document BeginCurtailmentRestoration's atomicity model inline (UPDATE WHERE state-guard + partial-unique-index; ErrNoRows re-read terminal-state risk). - Document that service.Stop's pre-read terminal/restoring check is a fast-path; BeginRestoreTransition is the authoritative atomic guard. - Document the strict > vs <= asymmetry between isRestored and isCurtailed, and the no-progress band at exactly baseline*factor. - Forward-compat comment on maybeClaimRestoreBatch's unreachable Drifted arm; comment on the interval gate loop documenting that terminal targets retain LastDispatchedAt as the spacing reference. - Tighten translate.toStopRequest overflow-check comment. - Inline single-use desiredStateCurtailed alias. - Strip internal ticket prefixes from source comments. - Trim multi-paragraph rationale on adaptive batch-sizing constants and ComputeEffectiveBatchSize godoc. - Fix stale cross-reference comment on StartCurtailmentRequest.restore_batch_size; regenerate bindings.
Fills coverage gaps surfaced during the review pass and pins the new restore-dispatch age-out behavior. - Restore age-out exhaustion drains the target to RestoreFailed and takes the event to COMPLETED_WITH_FAILURES; happy path within the timeout stays in Dispatched. - enforceMaxDuration with a BeginRestoreTransition error leaves the event unchanged and still runs drift dispatch in the same tick. - dispatchRestoreBatch under an Uncurtail error and under an empty BatchIdentifier keeps the batch Pending with LastError set and RetryCount incremented. - dispatchRestoreBatch per-device filter-skip in the restore phase mirrors the curtail-phase coverage: kept target -> Dispatched, skipped target -> Pending with retry consumed. - Service.Stop returns NotFound for an unknown event UUID, propagates ListTargetsByEvent store errors, and the MinDurationGate test now pins GRPCCode = FailedPrecondition. - Restorer-shared batch_uuid is asserted by value (both batched targets reference the same LastBatchUUID), not just by comment. - isRestored table picks up the power-present, baseline-nil, hash-nil case so the line-1035 hash-nil guard has direct coverage.
3db4d3a to
ab215a0
Compare
Post-review simplification pass surfaced four small wins: - service_test fakeStore: remove three fields that were declared but never read or assigned in any test (beginRestoreResult, getEventByUUIDLastOrg, listTargetsLastOrg). The dead capture-state was a misleading escape hatch. - service.Stop: drop the nine-line int->int32 MaxInt32 saturation in favour of the same //nolint:gosec one-liner the reconciler's enforceMaxDuration already uses. The count is bounded by the per-event target rows; the saturation block was dead defensive code. - maybeCompleteRestoring: add a default arm to the per-target switch so an unknown state from a future migration falls through as non-terminal rather than silently completing the event prematurely. - observeActive: drop the block comment narrating the enforceMaxDuration call site; the function name and its godoc carry the rationale.
…Stop failure Two related Codex-flagged issues on the BE-4 surface plus one defensive test, all small and self-contained. P1: enforceMaxDuration's BeginRestoreTransition error path returned false, falling through to drift detection. Past the max_duration cap, that meant new Curtail commands could be issued to miners the event was supposed to be releasing — extending curtailment past the contracted ceiling on every transient DB failure. Now returns true on error so observeActive skips drift; next tick retries the transition. The corresponding test (was pinning the wrong behavior) is updated to assert curtailCalls=0 on the failure tick. P2: BeginRestoreTransition's ErrNoRows re-read returned the latest event unconditionally, including terminal states. A concurrent AdminTerminateEvent between the pre-read and the UPDATE caused Stop to echo a 200 with a terminal event masquerading as restoring. The re-read now routes by latest.State: terminal → FailedPrecondition, restoring → idempotent success, anything else → InternalError. P3: New table-style test for maybeCompleteRestoring's default arm. The arm is dead code today (all current TargetState values are explicitly covered), but the test pins the defensive "unknown state keeps the event non-terminal" contract against future schema additions.
…ment
The prior query comment claimed concurrency safety relied on "the
partial-unique-index on (org_id, state) for active events." No such
unique index exists; the actual index `idx_curtailment_event_active`
is a plain partial btree. Concurrency safety rests on the
`WHERE state IN ('pending','active')` row-lock guard alone.
Replace the comment with the actual model so a future maintainer
does not trust a fictitious uniqueness guarantee and remove the
state guard.
Adds the server-resolved restore batch size as a field on the wire CurtailmentEvent and populates it from the persisted event row in toEventProto. Restore-time clients (Stop response, future read APIs) can now display the value the server actually drained at — important when the adaptive [10, 100] envelope or an operator `restore_batch_size_override` resolves to something other than the caller's `restore_batch_size` preference. Also tightens the StopCurtailment RPC docstring to make the terminal-race contract explicit: idempotent on an already-restoring event; FailedPrecondition when the event is terminal or a concurrent transition raced past restoring; treat as non-retryable.
Derive a `2 * TickInterval` per-tick deadline in runTick and pass it to ListNonTerminalEvents and processEvent. A stuck DB during a large restore tick (e.g. ResetCurtailmentTargetsForRestore over thousands of targets) now returns an error to the tick body instead of hanging the goroutine indefinitely. The heartbeat keeps using its own background-derived 5s timeout so liveness reporting is preserved even when the tick errors. Extend dispatchRestoreBatch's doc to call out the audit-lineage consequence of the existing restart-safety design: a per-target UpdateTargetState failure after Uncurtail returns success leaves the target Pending; next tick redispatches a second Uncurtail. Uncurtail is device-idempotent via the per-device FIFO, but audit records may show duplicate batches.
There was a problem hiding this comment.
Pull request overview
Adds the missing “exit path” for curtailment events by implementing operator Stop, a DB-backed staggered restore loop in the reconciler, adaptive restore batch sizing, and reconciler-side max-duration enforcement; this also removes the production gate that previously prevented StartCurtailment from being enabled.
Changes:
- Implement
StopCurtailmentend-to-end (proto docs, handler/service/store, SQL queries) and reset non-terminal targets into a restore work queue. - Extend the reconciler to (a) force restore on
max_duration_secondsand (b) driverestoringevents to terminal via staggered batchedUncurtail. - Add adaptive restore batch sizing and surface the server-resolved
effective_batch_sizeonCurtailmentEvent; update generated artifacts and wiring.
Reviewed changes
Copilot reviewed 17 out of 22 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/curtailment.sql | Adds restoration transition + target reset queries. |
| server/internal/handlers/curtailment/translate.go | Adds Stop request/response translation and event → proto mapping (incl. timestamps and effective batch size). |
| server/internal/handlers/curtailment/handler.go | Removes Start gate; implements StopCurtailment handler with auth + admin override gate. |
| server/internal/handlers/curtailment/handler_test.go | Updates handler constructor usage after gate removal. |
| server/internal/handlers/curtailment/handler_stop_test.go | New Stop handler tests (auth, UUID validation, admin override gating). |
| server/internal/handlers/curtailment/handler_start_test.go | Removes Start gate test; updates handler constructor and stubs. |
| server/internal/domain/stores/sqlstores/curtailment.go | Implements BeginRestoreTransition in SQL store with transactional semantics. |
| server/internal/domain/stores/interfaces/curtailment.go | Extends CurtailmentStore interface with BeginRestoreTransition. |
| server/internal/domain/curtailment/service.go | Implements Service.Stop, min-duration gating, and ComputeEffectiveBatchSize. |
| server/internal/domain/curtailment/service_test.go | Extends fake store to support Stop-path testing. |
| server/internal/domain/curtailment/service_stop_test.go | New unit tests for Stop + batch sizing helper. |
| server/internal/domain/curtailment/reconciler/restore_test.go | New reconciler tests for restoring workflow + max-duration enforcement. |
| server/internal/domain/curtailment/reconciler/reconciler.go | Adds per-tick deadline, max-duration enforcement, restoring observer, and batched Uncurtail dispatch. |
| server/internal/domain/curtailment/reconciler/reconciler_test.go | Updates fakes to support restore/max-duration and Uncurtail result overrides. |
| server/internal/domain/curtailment/models/models.go | Adds desired_state constants and CountNonTerminalTargets. |
| server/generated/sqlc/db.go | Regenerates sqlc prepared statements for new queries. |
| server/generated/sqlc/curtailment.sql.go | Regenerates sqlc bindings for new restoration queries. |
| server/generated/grpc/curtailment/v1/curtailmentv1connect/curtailment.connect.go | Updates generated connect client/server docs for Stop semantics. |
| server/cmd/fleetd/main.go | Removes Start gate wiring; wires handler with new constructor and updates comment. |
| proto/curtailment/v1/curtailment.proto | Documents Stop semantics; adds effective_batch_size; tightens Stop override validation to lte:200. |
| client/src/protoFleet/api/generated/curtailment/v1/curtailment_pb.ts | Regenerates TS proto bindings to reflect proto updates. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46df1b316a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Make restore reconciliation recoverable under slow ticks, clear stale target errors correctly, and expose the active event recovery path needed after Start responses are lost. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9379e3662
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Ensure GetActiveCurtailment returns active or restoring events before newer pending events so clients recover the event currently affecting miners.
ankitgoswami
left a comment
There was a problem hiding this comment.
left some comments but not 100% sure about the requirements so don't know if they are blocking.
… max_duration check Two small simplifications from review: - CountNonTerminalTargets: the explicit non-terminal case arm did the same thing as default. Collapse to terminal-skip + default-count. Comment notes unknown future states count as non-terminal — the conservative direction. - Service.Start: the upper-bound check on max_duration_seconds duplicates the check already in validateStartRequest. Drop the duplicate. The admin-gate against the org default stays — it is intrinsically post-normalization (req.MaxDurationSeconds may have been resolved from orgConfig.MaxDurationDefaultSec just above).
Adds a partial unique index on curtailment_event(org_id) WHERE state IN
('pending','active','restoring') so a concurrent Start that races the
selector's per-device exclusion check fails at insert time instead of
silently producing two overlapping events.
The unique-violation surfaces as a typed sentinel
ErrCurtailmentNonTerminalEventExists from the sqlstore boundary. Service.Start
catches the sentinel, reads the existing event via GetActiveEvent, and
returns AlreadyExists with the existing event_uuid + state so the caller
can act on the conflict (e.g. join the existing event, surface it in the
UI).
Migration 000051 appends the index alongside the CHECK constraints
already in the file. Index name and constraint name pinned in code:
nonTerminalEventPerOrgUniqueIndex matches the migration's identifier so
only this exact violation is mapped to the sentinel.
Adds NewAlreadyExistsError(f) and IsAlreadyExistsError helpers to
fleeterror to match the existing per-code patterns.
… the stop override The Stop-time restore_batch_size_override forced a per-Stop admin gate plus a race window (concurrent Stops with conflicting overrides silently dropped all but the first). It was also doubly redundant: the per-event restore_batch_size the operator sets at Start already lets admins tune the adaptive formula. This commit removes the override entirely and moves effective_batch_size computation to Start. The column is stamped from the selected target count when the event row is inserted, so it is non-null from creation. Stop just flips state; the reconciler's max-duration arm and the restorer read the already-stamped value. Proto change: optional uint32 restore_batch_size_override removed from StopCurtailmentRequest (pre-release per BE-1 policy; deleted, not reserved). Tag 2 frees up for the admin-gated `force` field in the next commit. Surface changes: - proto/curtailment.proto: field removed; bindings regenerated. - service: StopRequest loses RestoreBatchSizeOverride; Service.Stop drops ListTargetsByEvent + ComputeEffectiveBatchSize; validateStopRequest loses the override bound check. ComputeEffectiveBatchSize loses its override param. Service.Start (via buildInsertParams) stamps effective_batch_size on the InsertEventParams from the selected target count. - sqlc InsertCurtailmentEvent: effective_batch_size joins the INSERT column list. BeginCurtailmentRestoration drops the SET clause for it. - sqlstores: InsertEventWithTargets forwards EffectiveBatchSize; the BeginRestoreTransition signature loses the int32 parameter. - interfaces: BeginRestoreTransition signature updated. - reconciler.enforceMaxDuration: drops the ComputeEffectiveBatchSize recomputation; the event row already carries the value. - handler.StopCurtailment: drops the override admin-gate block. - translate.toStopRequest: drops the override translation + uint32-overflow safety check. Tests: - service_start_test.go: TestService_Start_StampsEffectiveBatchSize pins the formula at Start (floor 10, restore_batch_size floor, 5K → 50, 10K ceiling at 100). Uses tolerated-undershoot so the FIXED_KW selector takes every candidate. - service_stop_test.go: deleted override-precedence + override-bounds + list-targets-error tests (Stop no longer reads targets or accepts an override). TestComputeEffectiveBatchSize loses its override case + column. - handler_stop_test.go: deleted OverrideRequiresAdmin + PassesAdminOverride. - handler_test.go: removed stopWithBatchOverride from the role-gate matrix (4 cases). - service_test.go / reconciler/reconciler_test.go: BeginRestoreTransition signatures updated; beginRestoreLastBatch capture removed.
…ity bypass Decouples Start-time priority from Stop-time hysteresis bypass. Priority is set at event creation and reflects the curtailment urgency at that moment; reusing it to short-circuit the min_curtailed_duration_sec gate on Stop conflated two different operator intents. The bypass now lives where the decision actually happens. StopCurtailmentRequest gains `bool force = 2` (admin-gated handler-side via requireAdminFromContext, same pattern as the existing override fields on Start/Preview). checkMinCurtailedDurationGate takes a force parameter and bypasses on force=true; the Priority==Emergency shortcut is gone. Error message updated to direct admins to force=true rather than EMERGENCY-priority retry. EMERGENCY priority still skips post_event_cooldown_sec in the selector at Start time — that bypass is at the correct layer and stays. Only the Stop-time bypass moves. Surface changes: - proto/curtailment.proto: bool force = 2 on StopCurtailmentRequest. - handler.StopCurtailment: gates force=true via requireAdminFromContext. - translate.toStopRequest: carries Force from proto to service. - service.StopRequest: adds Force bool. - service.checkMinCurtailedDurationGate: signature change. Tests: - service_stop_test.go: TestService_Stop_ForceBypassesMinDuration replaces the old emergency-priority bypass test. New TestService_Stop_EmergencyPriorityNoLongerBypasses pins the regression: an EMERGENCY-priority event with unelapsed min_curtailed_duration_sec is now blocked unless force=true. - handler_stop_test.go: TestHandler_StopCurtailment_ForceRequiresAdmin (non-admin + Force=true → PermissionDenied) and TestHandler_StopCurtailment_AdminForcePassesThrough. - handler_test.go: Stop force role-gate cases added to the matrix (viewer/admin × session/API-key).
Two tail-end fixups missed from 1e5cf13 (the commit that dropped the int32 parameter from BeginRestoreTransition): - reconciler/restore_test.go: removed the stale beginRestoreLastBatch assertion in TestReconciler_EnforceMaxDuration_ElapsedTransitionsToRestoring; effective_batch_size is now stamped at Start and not touched by the transition. - handlers/curtailment/handler_start_test.go: updated the startStubStore BeginRestoreTransition panic-stub signature to match the new interface.
- Proto + connect/TS regen: rewrite the effective_batch_size docstring to match the actual stamp-at-Start behavior (the column is non-null from event creation, no operator override exists). Same revision touches the models.InsertEventParams godoc. - StopCurtailmentRequest: add `reserved "restore_batch_size_override"` so the deleted name can never be re-introduced by mistake. The field number itself is now `bool force`; reserving the old name documents the history without resurrecting the slot. - StartCurtailment RPC: document the AlreadyExists DebugMessage format as a stable substring so SDK consumers can recover the existing event's identity without parsing an undocumented contract. - Migration 000051: note that CREATE INDEX / ADD CONSTRAINT run without CONCURRENTLY / NOT VALID is safe only because the table is empty at deploy; future re-applications against populated tables should use the online forms.
toStartResponse built CurtailmentEvent from request + plan and never set EffectiveBatchSize, so clients reading the value from Start saw 0 while the persisted row carried the computed batch size. The mismatch forced callers into a follow-up GetActiveCurtailment poll. Stamp the adaptive batch size on the plan in Service.Start (single source of truth for buildInsertParams and the response builder), thread it through toStartResponse, and update the stale toEventProto inline comment that still described the field as nil-until-restoring.
runTick derived each per-event timeout from tickCtx, which already had its own 2×TickInterval deadline. A slow ListNonTerminalEvents or a slow first event would silently shrink the budget available to later events. Derive the per-event context from the parent ctx so each restoring event gets the full 2×TickInterval window; tickCtx.Err() still gates the loop, so total tick duration stays bounded. Also tag the restore telemetry-timeout aging path with a dedicated slog line so triage can distinguish a timed-out-and-wrote target from one where the follow-up UpdateTargetState DB write itself failed.
The ORDER BY CASE preferred active/restoring over pending when both existed, but the partial unique index added in 000051 makes "both exist" impossible: at most one row per org matches the WHERE state IN clause. The CASE expression also defeated idx_curtailment_event_active, forcing a sort over the filtered rows. Drop the ORDER BY entirely; LIMIT 1 now lets the planner satisfy the lookup directly via the partial unique index.
…g, and Stop error paths - Restorer: pin the restore-phase analog of the curtail-phase nil-candidate guard. A Dispatched+Active target whose candidate row has vanished burns retry budget via recordDispatchFailure instead of pinning the event indefinitely. - Service.Start: pin the absolute restore_batch_interval_sec ceiling (3600s) for admin callers. The non-admin cap test already exists; this pins what happens when an admin caller exceeds the absolute upper bound, preventing a future refactor that drops the validateStartRequest check from silently passing through to the DB CHECK constraint. - Service.Start: pin the AlreadyExists fallback when the identity lookup itself fails (insertEvent returns ErrCurtailmentNonTerminalEventExists and the subsequent GetActiveEvent errors out). Caller still gets AlreadyExists; the identity substring is omitted as documented. - StopCurtailment handler: pin the ListTargetsByEvent error propagation path. The transition has already committed when the targets read fails, so the handler returns the error without leaking a partial response.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d721bfef7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
79750ab derived each per-event context from the long-lived parent ctx (workCtx) so that a slow first event would not eat into later events' budgets. That objective is real, but it also reintroduced exactly the failure mode the Codex P1 thread (and the explicit ack at 7c6359f) called out: one tick can run for N × 2×TickInterval when multiple events are present, violating the documented bounded-tick contract. TestReconciler_RunTickStopsWhenTickBudgetExpires fails on HEAD with the parent-derived eventCtx because the second event no longer cancels when tickCtx expires. Restore the tickCtx-derived form. The aging-log addition from 79750ab stays in place.
…re and tighten plan comment The Start handler's translate.go wire path that populates EffectiveBatchSize had no end-to-end assertion; a regression that zeroed plan.EffectiveBatchSize before the translate call would not be caught. Add the assertion to the happy-path handler test (two candidates, no caller preference clamp to the floor of 10). Plan.EffectiveBatchSize's godoc claimed Stop and the reconciler read the value from Plan. Stop never touches Plan, and the reconciler reads from models.Event after persistence. Rewrite the comment to say "echoed in the Start response; Stop and the reconciler read the persisted event row" so future readers don't chase a nonexistent data-flow. Add a clarifying comment to the missing-candidate restore test noting that the disp.uncurtailCalls==0 assertion is enforced by the interval gate, not by the missing-candidate path it primarily pins.
… persist When the caller leaves max_duration_seconds unset and !allow_unbounded, Service.Start fills it from orgConfig.MaxDurationDefaultSec. The upper bound (maxFiniteDurationSeconds = 604800) is checked by validateStartRequest, which ran before the fill and saw nil. A misconfigured org default above the cap would tunnel into the DB and trip ck_curtailment_event_max_duration_bounds as an Internal error instead of a clean InvalidArgument. Mirror the existing <= 0 org-default guard with a > maxFiniteDurationSeconds guard at the same site. Pin with TestService_Start_RejectsOrgDefaultAboveAbsoluteCeiling.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35cc67c417
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Extract candidatesByDeviceID: confirmDispatched, observeActive, and confirmDispatchedRestores each built the same map inline. One helper replaces three identical 4-line loops. - Consolidate skip-reason rendering. dispatchOneCurtail used skipReasonForDevice (linear scan with priority logic), while dispatchRestoreBatch inlined the same priority logic into a map builder. Extract skippedDeviceReason as the single source of the reason-priority rule (Reason > FilterName > generic). skipReasonForDevice becomes a thin wrapper for the single-device path; dispatchRestoreBatch calls it from the map-builder loop. - Trim multi-paragraph rationale comments on Stop, dispatchOneCurtail, observeRestoring, and dispatchRestoreBatch. The WHY (restart-safety gap, gate semantics, audit-double-batch caveat) survives; the WHAT that the code already shows is gone.
CountNonTerminalTargets was exported from the models package with no production caller (only its own test referenced it). Drafted earlier when batch sizing was target-count-based; the current design stamps effective_batch_size from len(plan.Selected) at Start time, so the count helper is unused. Delete the function and its test; the models package still ships the canonical state-set predicates next to it. toStopResponse was a one-line wrapper with a single caller. Inline into the StopCurtailment handler so the response shape is visible at the call site, matching how toStartResponse / toEventProtoWithTargets are already used elsewhere.
Trim multi-paragraph rationale on Service.Start's org-default upper-bound guard, Service.runSelector's pipeline overview, Service.Stop's invariant, and toStartRequest's max_duration_seconds=0 sentinel. The WHY (mismatch surfaces as InvalidArgument vs DB CHECK trip, mutual-exclusion check intent) survives; restated parameter descriptions and pipeline phase names are gone.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4549263ffa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Reserving only the old name left the tag number free, so an older client that still encodes restore_batch_size_override on tag 2 would have its varint payload decoded as force on the new shape — silently flipping force=true on any non-zero value and bypassing the min_curtailed_duration_sec gate. Reserve both number and name on StopCurtailmentRequest, and put force on tag 3. An old client now decodes its tag-2 payload into a reserved slot (ignored), so the Stop hits the min-duration gate and fails loudly with FailedPrecondition rather than aliasing into a silent bypass.
skippedDeviceReason is the shared rendering helper that both the single-device and batch restore dispatch paths route audit strings through. Only the Reason!="" branch was exercised before; the FilterName-only and both-empty fallback branches had no test coverage. Add a table-driven unit test pinning all three cases.
Background
Curtailment ships in stages. The contract foundation, persistence + preview, and the operator-facing Start + dispatch + reconciler all landed via #192. What's still missing is the path back: an operator-facing way to end an event, and the system's ability to bring miners back online safely. This PR closes that gap and removes the production gate on
StartCurtailment.Summary
StopCurtailment
The operator-facing end-of-event RPC. Validates
min_curtailed_duration_sechas elapsed (admin caller can supplyforce=trueto bypass the hysteresis gate), transitions the event torestoring, and in the same transaction flips every non-terminal target todesired_state='active', resetsstatetopending, and clears the per-phase metadata (retry_count,last_dispatched_at,last_batch_uuid,confirmed_at,last_error). Terminal targets are untouched. The reset gives the restorer a DB-backed, unambiguous work queue that survives afleetdrestart.min_curtailed_duration_seconly gates active events — pending events haven't curtailed anything yet, so the elapsed-since-started_atcheck doesn't apply. Idempotent re-Stop: a second call against an already-restoring event echoes the persisted row without touching targets. Terminal events returnFailedPrecondition; cross-org event UUIDs returnNotFound. These semantics are documented on theStopCurtailmentproto RPC declaration so SDK consumers see the idempotent-restoring vs. non-retryable-terminal split without reading the handler.StopCurtailmentRequestcarriesevent_uuid(tag 1) and an admin-gatedforceboolean (tag 3). Tag 2 is reserved — both the number and the namerestore_batch_size_override— so an old client's varint payload on the deleted slot can never silently alias intoforce=true. The handler enforces the admin role onforce=truebefore reaching the service, so non-admin callers seePermissionDeniedand never trip the bypass.Restorer
A DB-backed step inside the reconciler tick rather than a separate goroutine, so a restart resumes restore from persisted state without rehydrating in-memory workers. Each tick scans
restoringevents and:baseline_power_w × 0.5threshold, falling back to a positive-hash signal when baseline is null). A target whose candidate row has vanished (device unpaired or deleted) burns retry budget viarecordDispatchFailureinstead of pinning the event indefinitely.COMPLETEDorCOMPLETED_WITH_FAILURESbased on RestoreFailed count — once every target is terminal.DispatchedorDrifted.last_dispatched_atis null or older thanrestore_batch_interval_sec.The claim selects up to
effective_batch_sizepending targets ordered bydevice_identifier, dispatches a single batchedUncurtailcall covering all of them (sharing onebatch_uuid), and per-device commits state transitions based on the kept/skipped split returned by the command service. Bulk dispatch errors and per-device skips route through the samerecordDispatchFailureretry path the Curtail dispatch uses, and the per-device audit string comes from a singleskippedDeviceReasonhelper shared with single-device dispatch (Reason > FilterName > generic preflight fallback). The reconciler self-bypass onCurtailmentActiveFiltercoversUncurtailtraffic too, so restore batches aren't blocked by the very event they're restoring.Each tick body is bounded by a
2 × TickIntervaldeadline. Per-event contexts derive from the same tick budget, so the loop-entry gate and per-event deadline together cap total tick wall-time. A stuck DB during a large restore (ResetCurtailmentTargetsForRestoreover thousands of targets, slowListCandidateson a large fleet) returns an error to the tick instead of hanging the goroutine; the heartbeat upsert uses its own background-derived 5s deadline so liveness reporting stays honest even when the tick body errors out. Telemetry-timeout aging in the restore phase logs a dedicated slog line beforerecordDispatchFailureso triage can distinguish timed-out-and-wrote from timed-out-DB-error.Adaptive batch sizing — computed at Start
effective_batch_size = max(restore_batch_size, ceil(0.01 × selected_target_count))clamped to[10, 100], stamped on the event row at Start time from the selected-target count and read back by Stop, the reconciler's max-duration arm, and the restorer's batch claim. The column is non-null from event creation.StartCurtailmentResponse.event.effective_batch_sizeechoes the computed value, so SDK consumers see the resolved batch size in the same response as the new event UUID rather than needing a follow-upGetActiveCurtailmentpoll.A 5,000-miner event therefore restores in batches of 50; a 10K+ event caps at 100. Operators tune the formula's floor via the
restore_batch_sizefield onStartCurtailmentRequest.One non-terminal event per org
Migration 000051 adds a partial unique index
uq_curtailment_event_one_non_terminal_per_org ON curtailment_event (org_id) WHERE state IN ('pending','active','restoring').InsertEventWithTargetsmaps the unique-violation to a typedinterfaces.ErrCurtailmentNonTerminalEventExistssentinel, andService.Startcatches it: a concurrent Start that races the selector's per-device exclusion check now fails at insert time withAlreadyExists, surfaced with the existing event's UUID + state. TheAlreadyExistsDebugMessage embeds that identity as the stable substring(event_uuid=<uuid>, state="<state>")— documented on theStartCurtailmentRPC declaration so SDK consumers can recover the existing event without parsing an undocumented format.This closes the cross-event scope-overlap gap at the DB level. The rule is org-wide rather than device-set-aware (two non-overlapping device sets in the same org are still mutually exclusive), which is sufficient for v1's manual-trigger use case.
Migration 000051 also bounds the operational controls at the DB layer with CHECK constraints on
max_duration_seconds(NULL OR 0 < x <= 604800) andrestore_batch_interval_sec(0 <= x <= 3600). The migration is safe withoutCONCURRENTLY/NOT VALIDbecausecurtailment_eventhas zero rows at deploy (the productionStartCurtailmentgate is removed by the same PR); the file documents this assumption inline for any future re-application against a populated table.max_duration_seconds enforcement
Reconciler-side enforcement transitions an
activeevent torestoringwhen the cap elapses sincestarted_at. This closes the prior gap where a target sitting inDispatchedwith no positive telemetry stays there indefinitely under drift detection alone — the cap now drags the event into restore regardless.allow_unboundedevents and events with nilstarted_atshort-circuit out.When
max_duration_secondsis unset and!allow_unbounded,Service.Startfills the value fromorgConfig.MaxDurationDefaultSecand re-checks the absolute upper bound (maxFiniteDurationSeconds = 604800) on the normalized value. A misconfigured org default surfaces asInvalidArgument(org's max_duration_default_sec must be <= 604800) instead of tunnelling pastvalidateStartRequestand tripping the DB CHECK constraint asInternal.StartCurtailment gate flipped
With Stop, the restorer, and reconciler-side max-duration enforcement all in place, an active event now has both an operator-facing exit (Stop) and a wall-clock safety net. The
Handler.startEnabledfield, in-handler gate check, and the coupling test that pinned the gate are removed.main.goconstructs the handler with the leanNewHandler(curtailmentSvc).Restore semantics worth flagging
device_status. The operator who setforce_include_maintenance=trueowns restore timing. Operators expecting a maintenance window to outlast a curtailment event must sizemax_duration_seconds, not rely on schedule-coupled timing.Curtail; theUncurtailqueues behind earlier pending or processing work. The restorer accounts for this via the in-flight gate rather than assumingUncurtailsupersedes.BeginCurtailmentRestorationconcurrency model. The load-bearing guarantee is theWHERE state IN ('pending','active')row-lock guard combined with the per-org partial unique index added in migration 000051. The store'sErrNoRowsre-read path distinguishes "already restoring" (idempotent) from "already terminal" (FailedPrecondition).Wiring
cmd/fleetd/main.goconstructs the handler with the single-argumentNewHandlerand the wiring comment documents the current surface: Preview, Start, and Stop are wired; the reconciler drives non-terminal events terminal (including max-duration-based forced restore);UpdateCurtailmentEvent,GetActiveCurtailment, andListCurtailmentEventsreturnUnimplementedpending the read APIs ticket.What is intentionally not in this PR
UpdateCurtailmentEventfor operator-safe mutation ofreasonand restore controls — lands with the read/audit work.GetActiveCurtailmentandListCurtailmentEvents— same.Test plan
go build ./...andgo vet ./...clean.golangci-lint run ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./cmd/fleetd/...reports 0 issues.go teston the curtailment unit surface is green (DB-dependent integration tests fail on local connection-refused as usual).Coverage by area:
Service.Stop— happy path, idempotent already-restoring, terminal rejection,min_curtailed_duration_secgate withforce=truebypass, EMERGENCY priority does not bypass the gate at Stop time (regression), pending-event short-circuit, store-error propagation.Service.Start—TestService_Start_StampsEffectiveBatchSizepins the adaptive formula at Start time (floor 10,restore_batch_sizefloor, 5K → 50, 10K ceiling at 100).TestService_Start_NonTerminalOverlapReturnsAlreadyExistspins the per-org partial unique index path;TestService_Start_NonTerminalOverlapWithGetActiveErrorReturnsBareAlreadyExistspins the identity-lookup-fallback sub-branch.TestService_Start_RejectsOrgDefaultAboveAbsoluteCeilingpins the post-normalization upper-bound guard on the org-default path.TestService_Start_RejectsAdminRestoreBatchIntervalAboveAbsoluteCeilingpins the absolute ceiling for admin callers (above the non-admin cap but still bounded).ComputeEffectiveBatchSize— small-fleet floor, 5K-fleet → 50, 10K+ ceiling at 100,restore_batch_sizefloors the formula, negativerestore_batch_sizehandling.StopCurtailmenthandler — happy path through the full handler → service → store flow, missing sessionUnauthenticated, malformed UUIDInvalidArgument,forceadmin role gate (PermissionDeniedfor non-Admin), adminforce=trueflowing through and bypassing the min-duration gate,ListTargetsByEventerror propagation after a committed transition.StartCurtailmenthandler — happy path assertsEffectiveBatchSizeon the wire response (provestoStartResponsepopulates the computed value rather than always-zero).enforceMaxDuration— elapsed transitions to restoring (no drift dispatch), not-elapsed no-ops,allow_unboundedskips cap, nilstarted_atno-ops,BeginRestoreTransitionerror path skips drift this tick.observeRestoring— claim dispatches oneUncurtailper batch with sharedbatch_uuid, in-flight gate blocks claim, interval gate blocks claim, telemetry-confirmed dispatched promotes toResolved(and triggers terminal transition same tick), all-resolved →COMPLETED, mixed-with-failures →COMPLETED_WITH_FAILURES.TestReconciler_Restoring_MissingCandidateDuringConfirmConsumesRetryBudgetpins the nil-candidate guard in the restore phase, mirroring the curtail-phase analog.TestReconciler_RunTickStopsWhenTickBudgetExpirespins the per-event context bound; later events cancel when the shared tick budget expires.isRestored— power above/at/belowbaseline × 0.5thresholds, baseline-nil positive-hash fallback, baseline-nil zero-hash not-restored, no-telemetry not-restored, baseline-zero hash fallback.skippedDeviceReason— priority-order contract (Reasonwins, thenFilterName, then generic preflight fallback) pinned across all three branches.Closes #231
Refs #192