feat(curtailment): operator read/update/admin APIs + audit + metrics #299
rongxin-liu wants to merge 45 commits
Curtailment needs operational metrics (tick duration, tick failures, selector candidate exclusions, maintenance overrides), but the codebase has only OTel tracing today (no Meter, no /metrics, no Prometheus exporter). The pipeline-shape decision is platform-team scope and already in flight via the notifications + Grafana migration; curtailment shouldn't make that decision unilaterally and shouldn't block on it.

Define a Metrics interface in the curtailment domain with a no-op default. Service and Reconciler accept it through a functional option so the dozens of existing test call sites (NewService(store), New(cfg, store, cmd)) keep working unchanged. main.go wires NoOpMetrics through both constructors so production has a single named site to swap when the platform observability path lands: interface-stable, one-file change, no curtailment-package churn.

Recorder call sites land in follow-up commits.

Refs #289
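A minimal sketch of the interface-plus-functional-option shape described above. The four recorder names come from this PR; their parameter lists, the `WithMetrics` option name, the `Option` type, and the empty `Store` interface are assumptions made for illustration, not the branch's actual signatures.

```go
package main

import (
	"fmt"
	"time"
)

// Metrics lists the four recorders named in the PR.
// The parameter lists are assumed, not copied from the branch.
type Metrics interface {
	ObserveTickDuration(d time.Duration)
	IncTickFailure()
	IncCandidateExcluded(reason string)
	IncMaintenanceOverride()
}

// NoOpMetrics discards every observation.
type NoOpMetrics struct{}

func (NoOpMetrics) ObserveTickDuration(time.Duration) {}
func (NoOpMetrics) IncTickFailure()                   {}
func (NoOpMetrics) IncCandidateExcluded(string)       {}
func (NoOpMetrics) IncMaintenanceOverride()           {}

// Store is a hypothetical stand-in for the real store dependency.
type Store interface{}

type Service struct {
	store   Store
	metrics Metrics
}

// Option is a functional option applied by NewService.
type Option func(*Service)

// WithMetrics (assumed name) swaps the recorder; nil keeps the default.
func WithMetrics(m Metrics) Option {
	return func(s *Service) {
		if m != nil {
			s.metrics = m
		}
	}
}

// NewService installs NoOpMetrics first, so callers that pass no
// options compile and behave exactly as before.
func NewService(store Store, opts ...Option) *Service {
	s := &Service{store: store, metrics: NoOpMetrics{}}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	svc := NewService(nil) // existing call shape, no option needed
	svc.metrics.IncTickFailure()
	fmt.Printf("%T\n", svc.metrics)
}
```

Because the default is installed before the options run, swapping in a real exporter later would only touch the one call site that passes the option.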
Three of the four Metrics recorders now have call sites:
- ObserveTickDuration fires from safeTick around runTick, capturing
wall-clock per tick on every path (happy, panic-recovered,
list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
processEvent on per-event panic. The list-events early-return path
is intentionally NOT counted because the heartbeat still advances
there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
selector returns, once per skipped device labeled by reason. Start-
only emission keeps debounced Preview calls from flooding the counter
against a static fleet snapshot.
IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.
Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.
Refs #289
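The `safeTick` behaviour described above (duration observed on every path, failure counted only on panic) can be sketched as follows. The real `safeTick` signature is not shown in this PR, so the function shape, the `errListEvents` name, and the counting fake are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errListEvents stands in for the list-events early return (hypothetical name).
var errListEvents = errors.New("list events failed")

// recordingMetrics is a goroutine-safe fake that only counts calls.
type recordingMetrics struct {
	mu        sync.Mutex
	durations int
	failures  int
}

func (m *recordingMetrics) ObserveTickDuration(time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.durations++
}

func (m *recordingMetrics) IncTickFailure() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures++
}

// safeTick times runTick on every path. Only a panic counts as a tick
// failure; an ordinary error return (the list-events case) does not.
func safeTick(m *recordingMetrics, runTick func() error) (err error) {
	start := time.Now()
	defer func() {
		if r := recover(); r != nil {
			m.IncTickFailure()
			err = fmt.Errorf("tick panicked: %v", r)
		}
		m.ObserveTickDuration(time.Since(start))
	}()
	return runTick()
}

func main() {
	m := &recordingMetrics{}
	_ = safeTick(m, func() error { return nil })
	_ = safeTick(m, func() error { return errListEvents })
	_ = safeTick(m, func() error { panic("boom") })
	fmt.Println("durations:", m.durations, "failures:", m.failures)
}
```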
Operator-facing event history was previously Unimplemented and the settings-page history table (PR #280) was reading fixtures. This wires the RPC through every layer with a trimmed decision-snapshot policy that keeps response sizes bounded on large fleets.

- sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with an optional state filter. Caller passes limit+1 so the over-fetch detects whether another page remains.
- Store: opaque cursor (base64-encoded JSON) so the token shape is free to grow later (sort fields, secondary keys) without breaking older clients. PageSize <=0 maps to a 50-row default; an internal upper cap of 200 mirrors the proto validator as defense in depth.
- Service: ListEvents validates org and rejects negative page_size, then forwards to the store. Service-layer guard is needed because cross-tenant exposure is one query away.
- Handler: replaces the Unimplemented stub. Session-based org-id resolution, proto enum → service-layer state-filter mapping.
- Translate: list-view event proto omits per-target rows (heavy on 10K-miner events × N pages) and trims the per-device `skipped` array to a `skipped_aggregate` reason-count map. Top-K selected and the summary fields stay intact so dashboards can render exclusion trend lines.

Test fakes in three packages gain ListEvents stubs; the curtailment-package fakeStore gains a working pagination impl mirroring the SQL semantics so service-level tests can assert cursor round-trips.

Refs #289
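A rough sketch of the opaque cursor and the limit+1 over-fetch described above. The cursor here carries only an `id` field, and every helper name (`encodeCursor`, `decodeCursor`, `normalizePageSize`, `page`) is hypothetical. Clamping an oversized page size to 200, rather than rejecting it, is also an assumption.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

const (
	defaultPageSize = 50  // applied when the caller passes <= 0
	maxPageSize     = 200 // internal upper cap
)

var errInvalidCursor = errors.New("invalid page token")

// cursor is the JSON payload hidden inside the opaque token.
type cursor struct {
	ID int64 `json:"id"`
}

func encodeCursor(c cursor) string {
	b, _ := json.Marshal(c) // a struct of plain integers cannot fail to marshal
	return base64.URLEncoding.EncodeToString(b)
}

// decodeCursor returns nil for an empty token, meaning "first page".
func decodeCursor(token string) (*cursor, error) {
	if token == "" {
		return nil, nil
	}
	raw, err := base64.URLEncoding.DecodeString(token)
	if err != nil {
		return nil, errInvalidCursor
	}
	var c cursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, errInvalidCursor
	}
	return &c, nil
}

func normalizePageSize(n int) int {
	switch {
	case n <= 0:
		return defaultPageSize
	case n > maxPageSize:
		return maxPageSize
	default:
		return n
	}
}

// page trims an over-fetched result (up to limit+1 ids, sorted id DESC).
// An extra row proves another page exists; the token then points at the
// last id actually returned. limit must be positive.
func page(ids []int64, limit int) (rows []int64, next string) {
	if len(ids) > limit {
		rows = ids[:limit]
		return rows, encodeCursor(cursor{ID: rows[limit-1]})
	}
	return ids, ""
}

func main() {
	rows, next := page([]int64{9, 8, 7}, 2) // fetched with limit+1 = 3
	c, err := decodeCursor(next)
	fmt.Println(rows, c.ID, err)
}
```

Keeping the token as encoded JSON means extra fields can be added to `cursor` later while older tokens still decode.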
Operator-safe partial update of a non-terminal event. Replaces the Unimplemented stub on the handler.

State policy: pending and active accept the patch; restoring and terminal states reject with FailedPrecondition. Operators who need to intervene mid-restore go through AdminTerminateEvent; that is the recovery surface, not Update.

The conservative policy keeps the recompute-vs-freeze question (Open #13) out of v1: Update of restore_batch_size persists the new value but does NOT recompute effective_batch_size. The reconciler's restore-claim reads the Start-time stamped value through to the next event.

Validation mirrors Start: restore_batch_interval_sec is gated by the non-admin cap (admin sets the session-derived bypass), max_duration must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured values surface as InvalidArgument or Forbidden, never as a DB CHECK violation.

sqlc UPDATE uses COALESCE on nil params so a partial patch preserves unset columns. The WHERE clause re-asserts state IN ('pending', 'active') as defense in depth: a race where the row advanced between the service's pre-read and the UPDATE surfaces as ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct message from the pre-read rejection.

Refs #289
Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.
The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.
Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.
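The decision rules above (allowed targets, non-empty reason, idempotent re-issue, conflict on a different terminal state) can be sketched as a pure function. `ErrCurtailmentAdminTerminateStateConflict` is named in this PR; the state constants, the other error variables, and the `adminTerminate` shape are illustrative assumptions. The sketch models only the two terminal states the admin path can produce.

```go
package main

import (
	"errors"
	"fmt"
)

type EventState string

const (
	StatePending   EventState = "pending"
	StateActive    EventState = "active"
	StateRestoring EventState = "restoring"
	StateCancelled EventState = "cancelled"
	StateFailed    EventState = "failed"
)

var (
	errInvalidTarget = errors.New("target_state must be CANCELLED or FAILED")
	errEmptyReason   = errors.New("reason is required")
	// Sentinel named in the PR; the message text is an assumption.
	ErrCurtailmentAdminTerminateStateConflict = errors.New("event already in a different terminal state")
)

func isAdminTarget(s EventState) bool {
	return s == StateCancelled || s == StateFailed
}

// adminTerminate returns the state the event row should hold afterwards.
func adminTerminate(current, target EventState, reason string) (EventState, error) {
	if !isAdminTarget(target) {
		return current, errInvalidTarget
	}
	if reason == "" {
		return current, errEmptyReason
	}
	switch {
	case current == target:
		return current, nil // idempotent re-issue: no-op
	case isAdminTarget(current):
		return current, ErrCurtailmentAdminTerminateStateConflict
	default:
		return target, nil // non-terminal event is forced terminal
	}
}

func main() {
	st, err := adminTerminate(StateRestoring, StateCancelled, "restore loop stuck")
	fmt.Println(st, err)
	st, err = adminTerminate(StateCancelled, StateFailed, "retry with other state")
	fmt.Println(st, err)
}
```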
Adds a pre-insert lookup so a re-issued Start with the same idempotency_key or (external_source, external_reference) pair returns the original event instead of re-running the selector and tripping the partial unique indexes (which would surface as a less helpful AlreadyExists from the per-org non-terminal constraint). idempotency_key takes precedence over external reference: the operator-supplied retry handle wins over upstream re-delivery. Lookup errors propagate unchanged so a transient db failure is visible rather than silently falling through to a double-insert attempt.
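One way to picture the lookup order is the sketch below, using an in-memory stand-in for the two store lookups. The `lookups` type, its fields, and `findReplay` are hypothetical. Falling through to the external-reference lookup when the idempotency key finds nothing is an assumption, since the commit message only states which handle takes precedence.

```go
package main

import (
	"errors"
	"fmt"
)

type Event struct{ UUID string }

type StartRequest struct {
	IdempotencyKey    string
	ExternalSource    string
	ExternalReference string
}

// errTransient simulates a transient database failure (hypothetical).
var errTransient = errors.New("db unavailable")

// lookups is a hypothetical in-memory stand-in for the store.
type lookups struct {
	byKey map[string]*Event
	byRef map[[2]string]*Event
	err   error
}

// findReplay returns the original event for a re-issued Start, or nil
// when the request is new. Lookup errors propagate unchanged so the
// caller never falls through to an insert on a failed read.
func (l lookups) findReplay(req StartRequest) (*Event, error) {
	if l.err != nil {
		return nil, l.err
	}
	// idempotency_key wins over the external reference pair.
	if req.IdempotencyKey != "" {
		if ev, ok := l.byKey[req.IdempotencyKey]; ok {
			return ev, nil
		}
	}
	if req.ExternalSource != "" && req.ExternalReference != "" {
		if ev, ok := l.byRef[[2]string{req.ExternalSource, req.ExternalReference}]; ok {
			return ev, nil
		}
	}
	return nil, nil
}

func main() {
	a, b := &Event{UUID: "a"}, &Event{UUID: "b"}
	l := lookups{
		byKey: map[string]*Event{"k": a},
		byRef: map[[2]string]*Event{[2]string{"hook", "42"}: b},
	}
	ev, err := l.findReplay(StartRequest{IdempotencyKey: "k", ExternalSource: "hook", ExternalReference: "42"})
	fmt.Println(ev.UUID, err)
}
```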
Adds an AuditLogger interface on the curtailment Service with a no-op default so tests that don't care can ignore the wiring. main.go injects *activity.Service via WithAuditLogger. Two override-specific event types ride alongside the base curtailment_started row so a feed of unbounded or force-include starts is a simple event-type filter rather than a metadata scan. IncMaintenanceOverride fires in parallel when force_include_maintenance is set, so the platform metrics dashboard tracks the override rate without joining against activity_log. Audit emission is intentionally absent on the idempotent-replay and insufficient-load paths: the original Start already recorded the trail, and a path that never persisted shouldn't claim it did.
Documents the curtailment_reconciler_heartbeat-based liveness signal: warn at 2 minutes of staleness with active events present, page at 5 minutes regardless. The SQL form is canonical; the vmalert rule is parked behind a placeholder bridge metric so the wiring is one config change away once a postgres-exporter publishes the staleness gauge. Runbook walks four failure modes (panic loop, slow-query contention, events not picked up, restore-loop forever) with operator response steps that lean on AdminTerminateEvent for the cases where infra mitigation isn't enough.
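The two thresholds above can be read as a small classification function. This is a sketch only: the runbook's canonical form is SQL, and the names here (`classify`, `Severity`, `warnAfter`, `pageAfter`) as well as the inclusive comparison at the boundaries are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

type Severity int

const (
	SeverityOK Severity = iota
	SeverityWarn
	SeverityPage
)

const (
	warnAfter = 2 * time.Minute // warn only while active events exist
	pageAfter = 5 * time.Minute // page regardless of active events
)

// classify maps heartbeat staleness to an alert severity.
func classify(staleness time.Duration, activeEvents int) Severity {
	switch {
	case staleness >= pageAfter:
		return SeverityPage
	case staleness >= warnAfter && activeEvents > 0:
		return SeverityWarn
	default:
		return SeverityOK
	}
}

func main() {
	fmt.Println(
		classify(3*time.Minute, 1), // stale with active events
		classify(3*time.Minute, 0), // stale but idle
		classify(6*time.Minute, 0), // stale past the page threshold
	)
}
```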
Walks the operator-facing service flow end-to-end against the in-memory fake: Preview (no persistence side-effects) → Start (persistence + audit + metrics) → Stop (RESTORING transition) → AdminTerminate (forced terminal). The reconciler's tick-by-tick state machine is covered piecewise in reconciler_test.go and restore_test.go; this test pins the boundary between the public service API and the persistence layer. Companion tests cover the webhook idempotency-replay path (duplicate Start short-circuits, no double-audit) and the read-path query (ListEvents returns terminal rows filtered by state). A docker-driven HTTP-level e2e for the same lifecycle is a follow-up — the existing server/e2e dir requires postgres + proto-sim and lands when the curtailment plugin path is ready.
Four real lint findings from this branch, fixed without suppressions:

- Service.AdminTerminate: replace the two-case switch + default with an if-comparison so exhaustive doesn't demand the unhandled cases be enumerated. The default branch was load-bearing; the if form keeps the same behavior with less surface.
- service_list_test.go / handler_list_test.go: hoist the opaque cursor literal into a file-scope const. gosec G101 looks at string literals assigned to fields whose name matches credential keywords (PageToken matches "token"); an identifier reference clears the heuristic cleanly.
- service_start_idempotency_test.go: move the subtest store + svc creation inside the t.Run closure so each subtest can call t.Parallel() without sharing mutable counters across cases.
A multi-reviewer code review surfaced four merge-blocking P1s and a batch of P2/P3 hygiene items on this branch. This commit lands the focused, defensible subset that doesn't require contract or security-policy decisions; the remaining items are recorded for the next session.

Idempotency / race-recovery:
- Recognize uq_curtailment_event_idempotency and uq_curtailment_event_external_ref unique violations as race-loss via typed sentinels; Service.Start re-issues the corresponding replay lookup so the race-loser falls into the same response path as a deliberate retry rather than surfacing Internal with the constraint name leaked in the error string.
- AdminTerminateEvent: on zero-row UPDATE caused by a concurrent terminate-to-same-state, re-read inside the transaction and echo the row idempotently (mirrors BeginRestoreTransition's pattern).

Audit / observability:
- Emit a curtailment_admin_terminated activity row on AdminTerminate so the privileged force-terminate path captures actor + reason in the activity feed (parallels emitStartAuditTrail).
- emitStartAuditTrail now maps req.SourceActorType to activitymodels.ActorType so scheduler-triggered starts persist actor_type='scheduler' on activity_log instead of defaulting to 'user'.

Update path hardening:
- Reject explicit empty-string Reason as InvalidArgument and add a 256-char length cap mirroring Start. The proto-translate comment is updated to describe the actual silent-no-op behavior.

Proto contract docs:
- Field-level docstrings on ListCurtailmentEvents describing the omitted target_rollup/targets and the trimmed decision_snapshot shape (skipped_aggregate vs raw skipped).
- max_len=1024 on ListCurtailmentEventsRequest.page_token so the cursor decode path is bounded.
- Annotate the two eventStateFromProto call sites distinguishing the no-filter sentinel role from the target_state mapping role.

Cleanup:
- Drop duplicate finitePtr generic in handler_start_test.go (the existing ptr generic in handler_test.go covers the use case).
- Inline single-call-site valueOrZero generic in service.go.
Pull request overview
This PR completes the operator-facing curtailment management surface (list/update/admin terminate) and adds observability scaffolding (audit events, metrics interfaces, heartbeat runbook/alert template) to support v1 curtailment operations end-to-end in the server.
Changes:
- Add operator RPCs for listing historical curtailment events (cursor pagination) and updating operator-safe fields, plus an admin RPC to force-terminate an event and sweep targets.
- Add webhook-style Start idempotency lookups (idempotency key + external source/reference) with race-loser handling, plus audit + metrics interfaces wired through the service and reconciler.
- Add reconciler heartbeat runbook + placeholder vmalert rules for stalled reconciler/tick failures.
Reviewed changes
Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/curtailment.sql | Adds SQLC queries for idempotency lookups, operator-field update, admin terminate + target sweep, and org event history listing. |
| server/internal/handlers/curtailment/translate.go | Adds request/response translators for AdminTerminate/Update/List; trims decision snapshot for list view; adds proto↔model event-state mapping helper. |
| server/internal/handlers/curtailment/handler.go | Implements UpdateCurtailmentEvent, ListCurtailmentEvents, and AdminTerminateEvent handlers with session/admin gating. |
| server/internal/handlers/curtailment/handler_update_test.go | Handler tests for UpdateCurtailmentEvent auth, validation, and admin gating behavior. |
| server/internal/handlers/curtailment/handler_stop_test.go | Updates Stop handler test stub to satisfy expanded store interface. |
| server/internal/handlers/curtailment/handler_start_test.go | Updates Start handler test stub for new store methods and adjusts optional pointer helper usage. |
| server/internal/handlers/curtailment/handler_list_test.go | Handler tests for ListCurtailmentEvents pagination/filtering and decision-snapshot trimming behavior. |
| server/internal/handlers/curtailment/handler_admin_terminate_test.go | Handler tests for AdminTerminateEvent admin gating, UUID validation, and state-conflict mapping. |
| server/internal/domain/stores/sqlstores/curtailment.go | Implements SQL store methods for idempotency lookups, ListEvents pagination, operator-field update, and AdminTerminateEvent transaction. |
| server/internal/domain/stores/sqlstores/curtailment_cursor.go | Adds base64+JSON cursor encode/decode helpers for ListEvents pagination. |
| server/internal/domain/stores/interfaces/curtailment.go | Extends CurtailmentStore interface with list/update/admin-terminate/idempotency methods and new typed error sentinels. |
| server/internal/domain/curtailment/service.go | Adds metrics/audit plumbing, Start replay lookups + race handling, ListEvents/Update/AdminTerminate service methods, and audit emission helpers. |
| server/internal/domain/curtailment/service_update_test.go | Unit tests for Update service method validation/state-guard/race-loss behavior. |
| server/internal/domain/curtailment/service_test.go | Expands fake store to support new store methods; adds a metrics recorder test helper. |
| server/internal/domain/curtailment/service_start_test.go | Adds Start metrics test for candidate-exclusion counters. |
| server/internal/domain/curtailment/service_start_idempotency_test.go | Adds Start idempotency replay + precedence + error-path tests. |
| server/internal/domain/curtailment/service_start_audit_test.go | Adds Start audit emission tests (base row + override-specific rows + replay suppression). |
| server/internal/domain/curtailment/service_list_test.go | Adds ListEvents service tests for forwarding/validation and store error propagation. |
| server/internal/domain/curtailment/service_lifecycle_test.go | Adds service-layer end-to-end lifecycle test across Preview→Start→Stop→AdminTerminate and replay/list behavior. |
| server/internal/domain/curtailment/service_admin_terminate_test.go | Adds AdminTerminate service tests for validation and conflict/error mapping. |
| server/internal/domain/curtailment/reconciler/reconciler.go | Adds metrics injection and records tick duration/failure counters on panic paths. |
| server/internal/domain/curtailment/reconciler/reconciler_test.go | Adds reconciler tests asserting tick duration/failure metric emission. |
| server/internal/domain/curtailment/metrics.go | Introduces curtailment.Metrics interface + NoOpMetrics implementation. |
| server/internal/domain/curtailment/audit.go | Introduces curtailment.AuditLogger interface + NoOpAuditLogger and curtailment audit event-type constants. |
| server/generated/sqlc/db.go | Regenerated SQLC prepared-statement wiring for new curtailment queries. |
| server/generated/sqlc/curtailment.sql.go | Regenerated SQLC query implementations/types for new curtailment queries. |
| server/docs/curtailment-reconciler-runbook.md | Adds heartbeat staleness runbook, SQL alert query, and failure-mode triage guidance. |
| server/cmd/fleetd/main.go | Wires NoOpMetrics + audit logger into curtailment Service and passes metrics into reconciler. |
| proto/curtailment/v1/curtailment.proto | Documents list-response trimming and adds page_token max length validation. |
| deployment-files/server/monitoring/vmalert/rules.d/proto-fleet-curtailment.yml | Adds placeholder vmalert rules for stalled reconciler and tick failure rate using bridge metrics. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62a996a1f5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
🔐 Codex Security Review
Review Summary
Overall Risk: HIGH

Findings
[HIGH] Admin termination can abandon queued restore work
[HIGH] Admin termination race can still let curtail commands fire after terminal sweep
[MEDIUM] Updating restore batch size is accepted but does not affect restore dispatch

Notes
No SQL injection, command injection, frontend XSS, protobuf wire-format break, pool/wallet hijack, or hardcoded payout-address issue was evident in the reviewed diff. I did not run tests because this was a read-only review pass.

Generated by Codex Security Review
💡 Codex Review
Reviewed commit: f3aad80aa6
Codex security review + Copilot inline reviewers surfaced five actionable findings on this branch. All five validated; landing the fixes here.

Admin gate on Update.max_duration_seconds (HIGH). Mirrors Start's post-normalization admin check inside Service.Update. Without this, a non-admin who Started at the org default could Update the same event above the default, bypassing the privilege boundary Start enforces. Fetches org config lazily, only when max_duration is in the patch and the caller lacks admin controls.

AdminTerminate.reason length cap (MEDIUM). Service-level backstop rejects oversized reasons (>256 chars) so a bulky operator string can't amplify into every swept target's last_error column. The proto field gets the matching max_len=256 rule; proto regen is deferred to a clean tooling pass (the service backstop already catches the case today).

List query trims decision_snapshot at the SQL boundary (MEDIUM). ListCurtailmentEventsForOrg now projects explicit columns with (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row. Field layout matches CurtailmentEvent exactly so the existing convertEventRow path applies via a single struct conversion.

Cursor rejects non-positive IDs (MEDIUM). decodeCurtailmentEventCursor now returns InvalidArgument when the decoded id is <= 0. The store never emits a non-positive id; a user-supplied token that decodes to one would silently rewind to the first page (id=0) or return zero rows (id<0).

Audit metadata key naming (MEDIUM). Renamed `force_include` to `force_include_maintenance` on the curtailment_started audit row metadata so the key matches the domain/proto field name. Downstream analytics no longer have to map between abbreviated and full names.

Test coverage added for each fix: non-admin max_duration rejection, admin pass-through, oversized reason rejection, cursor non-positive id rejection (zero / negative / missing).
All three findings addressed in 35594cd:
💡 Codex Review
Reviewed commit: 51e44672d8
toEventProtoListItem previously zeroed ExternalSource / ExternalReference / IdempotencyKey inline after toEventProto had set them. Move the scrub into a named helper so the intent (these fields are list-view-omitted) is explicit at the call site and a future fourth scrubbed field doesn't need a comment to explain the pattern. No behavior change.
A page_token issued before the cross-list-binding org_id field landed used to reject with InvalidArgument once the binding-check went live. A long-lived pagination loop crossing the deployment boundary should restart from the first page transparently instead of surfacing an opaque cursor error to the caller. Decode now treats OrgID==0 as the "first page" sentinel and returns a nil cursor (same as an empty token). State_filter mismatch and other non-zero violations still reject. Test renamed to reflect the new contract.
validateUpdateRequest now rejects requests where every patchable optional field (reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds) is nil. The SQL UPDATE still ran in that case and bumped updated_at via COALESCE, producing a misleading freshness signal for clients that track the column and adding write load with no semantic change. Affected service + handler tests that previously passed empty UpdateRequests to reach state-guard / not-found branches now carry a minimal valid patch (Reason). New TestService_Update_RejectsEmptyPatch pins the new contract.
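A minimal sketch of the empty-patch guard, paired with a Go analogue of the COALESCE behaviour to show why an all-nil patch would otherwise be a write with no semantic change. The field types, the `eventRow` struct, `applyPatch`, and the `errEmptyPatch` variable are assumptions; only `validateUpdateRequest` and the four field names come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// UpdateRequest carries the four patchable fields; nil means "leave as is".
type UpdateRequest struct {
	Reason                  *string
	RestoreBatchSize        *int32
	RestoreBatchIntervalSec *int32
	MaxDurationSeconds      *int64
}

// eventRow is a hypothetical stand-in for the persisted columns.
type eventRow struct {
	Reason                  string
	RestoreBatchSize        int32
	RestoreBatchIntervalSec int32
	MaxDurationSeconds      int64
}

var errEmptyPatch = errors.New("update must set at least one field")

// validateUpdateRequest rejects a patch in which every optional field is nil.
func validateUpdateRequest(req UpdateRequest) error {
	if req.Reason == nil && req.RestoreBatchSize == nil &&
		req.RestoreBatchIntervalSec == nil && req.MaxDurationSeconds == nil {
		return errEmptyPatch
	}
	return nil
}

// applyPatch mimics COALESCE(param, column): nil keeps the stored value.
func applyPatch(row eventRow, req UpdateRequest) eventRow {
	if req.Reason != nil {
		row.Reason = *req.Reason
	}
	if req.RestoreBatchSize != nil {
		row.RestoreBatchSize = *req.RestoreBatchSize
	}
	if req.RestoreBatchIntervalSec != nil {
		row.RestoreBatchIntervalSec = *req.RestoreBatchIntervalSec
	}
	if req.MaxDurationSeconds != nil {
		row.MaxDurationSeconds = *req.MaxDurationSeconds
	}
	return row
}

func main() {
	fmt.Println(validateUpdateRequest(UpdateRequest{}))
	size := int32(25)
	row := applyPatch(eventRow{Reason: "grid event", RestoreBatchSize: 10}, UpdateRequest{RestoreBatchSize: &size})
	fmt.Printf("%+v\n", row)
}
```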
💡 Codex Review
Reviewed commit: 68af78a0bd
…lment-read-apis-audit-metrics

# Conflicts:
#	server/generated/sqlc/db.go
💡 Codex Review
Reviewed commit: 34c1ec4f19
A concurrent AdminTerminate landing between target N and target N+1 of the dispatch loop would let remaining Curtail commands fire against an already-terminated event because the previous design checked liveness once per event-tick. Move the eventStillDispatchable call to the start of dispatchOneCurtail and drop the redundant pre-loop hoists in dispatchPending and observeActive. The cost is one GetEventByUUID per target — N reads per tick for an N-target event — which is acceptable; AdminTerminate is rare and the perf optimization was incorrect in claim (the event row CAN change inside a tick) and incorrect in consequence. dispatchRestoreBatch is unaffected: its single bulk Uncurtail is one command per batch, so the existing per-event check already guards it. Adds TestReconciler_SkipsRemainingCurtailDispatchesWhenEventTerminatesMidLoop asserting only the first target dispatches when the event flips between target 1 and target 2.
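The per-target liveness check can be illustrated with a toy dispatcher. The `dispatcher` type and its `eventLive` callback are hypothetical stand-ins for `eventStillDispatchable` and its `GetEventByUUID` read. Stopping the loop at the first failed check, rather than continuing to skip each remaining target, is an assumption.

```go
package main

import "fmt"

// dispatcher is a hypothetical stand-in for the reconciler's dispatch path.
type dispatcher struct {
	eventLive func() bool // one event-row read per call in the real code
	sent      []string    // targets that actually received a Curtail
}

// dispatchOneCurtail re-checks the event before every command, so a
// terminate that lands mid-loop stops further dispatches.
func (d *dispatcher) dispatchOneCurtail(target string) bool {
	if !d.eventLive() {
		return false
	}
	d.sent = append(d.sent, target)
	return true
}

func (d *dispatcher) dispatchPending(targets []string) {
	for _, t := range targets {
		if !d.dispatchOneCurtail(t) {
			return // event is no longer dispatchable; skip the rest
		}
	}
}

func main() {
	d := &dispatcher{}
	// Simulate an AdminTerminate landing right after the first dispatch.
	d.eventLive = func() bool { return len(d.sent) == 0 }
	d.dispatchPending([]string{"m1", "m2", "m3"})
	fmt.Println(d.sent)
}
```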
The previous gate rejected admin-terminate only when event state was ACTIVE. A PENDING event whose reconciler tick had already dispatched some Curtail commands had targets in DISPATCHED/CONFIRMED/DRIFTED, and admin-terminate would proceed — sweeping those targets to RESTORE_FAILED without issuing the compensating Uncurtail commands, leaving the already-curtailed miners stuck. Replace the state==active check with a SQL existence check on non-restored target states. The new check subsumes ACTIVE (which always has CONFIRMED targets via maybeMarkActive) and additionally catches PENDING events with dispatched targets. The ErrCurtailmentAdminTerminateActiveEvent sentinel keeps its name for backward compatibility with handler/test references, but its doc and operator-facing error message now describe the broader condition. Proto RPC comment and the reconciler runbook's mitigation guidance updated to match.
💡 Codex Review
Reviewed commit: d0b4c00c06
RBAC PR 1 landed migration 000052_create_permission_tables on main and the post-merge commit had two files claiming version 52 — the duplicate trips golang-migrate's version uniqueness check at server boot. The list-index migration only ever lived on this feature branch, so the migration-immutability rule (which gates edits to migrations on main) does not apply. Renaming to the next free slot (000055, after the RBAC track's 000052/000053/000054) is the right resolution. Content unchanged; only the file numbers move.
…NTLY on list index

Two BE-5 merge gates landed together.

force_include_maintenance is safety-critical: it commands curtailment on miners in active physical maintenance, and forcibly power-cycling a miner a technician is servicing is a personnel hazard. The BE-1.x design intended Admin-only; the gate was never wired and any API-key caller could trip it. Wire requireAdminFromContext mirroring the allow_unbounded pattern; add table coverage in TestHandler_OverrideFieldsRoleGate.

Migration 000055's index now builds with CONCURRENTLY. The earlier "annotation needed" framing turned out to be wrong: golang-migrate v4's postgres driver runs ExecContext directly without wrapping the migration body in a transaction, so CONCURRENTLY works without any no-transaction annotation. Cheap fix; a hard merge gate at high-row-count deploys.
…l / Uncurtail

Closes the AdminTerminate residual race named in Known Limitation #10. Before this change, a tick that read PENDING targets, called cmd.Curtail, then lost a foot race to a concurrent AdminTerminate left miners curtailed with no compensating Uncurtail: the sweep flipped the just-dispatched-to targets to RESTORE_FAILED while the command landed against a dead event. Blast radius scaled with event size: thousands of stranded curtailments on a 5K-miner mid-dispatch event.

The fix is a two-phase write. dispatchOneCurtail and dispatchRestoreBatch now stamp DISPATCHING on each target before issuing the command, and transition to DISPATCHED only after the command returns. The in-flight gate's SQL EXISTS predicate (CurtailmentEventHasInFlightTargets) counts dispatching rows alongside dispatched/confirmed/drifted. A concurrent terminate that races a mid-dispatch tick observes the dispatching row and rejects as Stop-first, so the command cannot fire against a swept event.

last_dispatched_at intentionally lands on the DISPATCHED write, not the DISPATCHING pre-write; it records successful enqueue, used by the restore-batch interval gate. Filter-skipped / empty-batch failures roll back via recordDispatchFailure without leaving a misleading timestamp.

New proto value CURTAILMENT_TARGET_STATE_DISPATCHING = 8. New Go constant TargetStateDispatching ("dispatching"). Regression coverage in TestReconciler_DispatchingPreWrite_CommitsBeforeCommand via a curtailHook on the fake dispatcher that inspects store state at the moment cmd.Curtail is called.
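The two-phase write can be sketched against a map-backed store, with a hook that inspects state at the moment the command is called, in the spirit of the regression test named above. `TargetStateDispatching` is the constant named in the commit; the other constants, the `store` type, the helper names, and rolling a failed command back to pending are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

type TargetState string

const (
	TargetStatePending     TargetState = "pending"     // assumed name
	TargetStateDispatching TargetState = "dispatching" // named in the commit
	TargetStateDispatched  TargetState = "dispatched"  // assumed name
)

var errCommand = errors.New("curtail command failed")

// store is a hypothetical in-memory stand-in for the target table.
type store map[string]TargetState

// hasInFlightTargets mirrors the idea of the in-flight gate: a row in
// dispatching counts as in flight, just like dispatched.
func hasInFlightTargets(s store) bool {
	for _, st := range s {
		if st == TargetStateDispatching || st == TargetStateDispatched {
			return true
		}
	}
	return false
}

// dispatchOneCurtail stamps DISPATCHING before the command and moves to
// DISPATCHED only after the command returns.
func dispatchOneCurtail(s store, id string, curtail func(id string) error) error {
	s[id] = TargetStateDispatching // phase 1: visible to a concurrent terminate
	if err := curtail(id); err != nil {
		s[id] = TargetStatePending // assumed rollback state
		return err
	}
	s[id] = TargetStateDispatched // phase 2: command was enqueued
	return nil
}

func main() {
	s := store{"m1": TargetStatePending}
	err := dispatchOneCurtail(s, "m1", func(id string) error {
		fmt.Println("state during command:", s[id], "in flight:", hasInFlightTargets(s))
		return nil
	})
	fmt.Println("state after command:", s["m1"], err)
}
```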
UpdateCurtailmentTargetState was :exec and silently dropped its row count. When the SQL EXISTS guard fired (parent event terminated mid-tick), the write produced zero rows but the reconciler had no signal — it advanced the in-memory mirror as if the write succeeded, leaving the on-disk state and the reconciler's per-tick view of targets out of sync. Switch the query to :execrows. The store wrapper now returns ErrCurtailmentEventStateRaceLoss on zero rows matched, matching the existing UpdateEventState contract from earlier BE-5 work. A new writeTargetState helper wraps every reconciler write site (nine in total across dispatch/confirm/drift/observe/restore paths), routing the sentinel through the same observability bucket as logEventStateUpdateError (IncEventStateRaceLoss + slog.Warn). Callers gate the mirror update on a clean return so the mirror stays consistent with the persisted state. Regression coverage: TestReconciler_TargetStateRaceLoss_LogsAndMetersWithoutMirrorAdvance injects the sentinel via the fake store, runs a tick, and asserts the in-memory target state stays PENDING (not advanced) while the metric ticks.
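A sketch of the zero-rows-to-sentinel mapping and the mirror gating described above. `ErrCurtailmentEventStateRaceLoss` and `writeTargetState` are names from the commit message; the fake store, the plain counter standing in for `IncEventStateRaceLoss`, and the boolean return value are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel named in the commit; the message text is an assumption.
var ErrCurtailmentEventStateRaceLoss = errors.New("curtailment: event state race loss")

// fakeStore is a hypothetical stand-in for the SQL store.
type fakeStore struct {
	parentTerminated bool
	persisted        map[string]string
}

// execUpdate mimics an :execrows query: it reports rows affected, and a
// guard on the parent event makes the write match zero rows.
func (s *fakeStore) execUpdate(id, state string) int64 {
	if s.parentTerminated {
		return 0
	}
	s.persisted[id] = state
	return 1
}

// UpdateTargetState turns a zero-row write into the race-loss sentinel.
func (s *fakeStore) UpdateTargetState(id, state string) error {
	if s.execUpdate(id, state) == 0 {
		return ErrCurtailmentEventStateRaceLoss
	}
	return nil
}

type reconciler struct {
	store      *fakeStore
	mirror     map[string]string // per-tick in-memory view of targets
	raceLosses int               // stands in for the race-loss metric
}

// writeTargetState advances the mirror only when the write persisted.
func (r *reconciler) writeTargetState(id, state string) bool {
	if err := r.store.UpdateTargetState(id, state); err != nil {
		if errors.Is(err, ErrCurtailmentEventStateRaceLoss) {
			r.raceLosses++
		}
		return false
	}
	r.mirror[id] = state
	return true
}

func main() {
	r := &reconciler{
		store:  &fakeStore{parentTerminated: true, persisted: map[string]string{"m1": "pending"}},
		mirror: map[string]string{"m1": "pending"},
	}
	ok := r.writeTargetState("m1", "dispatched")
	fmt.Println(ok, r.mirror["m1"], r.raceLosses)
}
```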
…tart

Two e2e tests behind the `e2e` build tag, scoped to the curtailment RPC surface:

TestCurtailmentLifecycle exercises Preview → Start → reconciler advances to ACTIVE → Stop → restore drains to terminal → list surfaces the terminal event. Validates the operator-facing path end-to-end against the real fleet-api inside docker-compose, using the proto-sim miner as the device under control. Per-target rows are asserted absent on the list response, pinning the SQL-trimming contract at the wire level.

TestCurtailmentReconcilerKillAndResume validates the restart-safety contract: start an event, wait for the heartbeat to record one tick, docker restart fleet-api-1, then assert (a) fleet-api returns to healthy, (b) the heartbeat last_tick_at advances past the pre-restart value (proves the reconciler resumed), and (c) the event drains to terminal after Stop. The heartbeat read goes directly via psql against the singleton row so the test sees what the staleness alert predicate would see.

The e2e suite as a whole is currently blocked on pre-existing proto-drift breakage in plugin_integration_test.go (unrelated pairing/auth/telemetry field renames). The curtailment e2e file ships with the correct curtailment proto bindings so it compiles and runs as soon as the broader suite drift is fixed; standard `go test ./...` (no `-tags e2e`) is unaffected since the file carries the build tag.
Adding TargetStateDispatching to the model triggered exhaustive lint on four switch statements that case on TargetState. The semantics:

- targetStateProto: maps directly to CURTAILMENT_TARGET_STATE_DISPATCHING.
- populateEventTargets rollup: DISPATCHING counts into the Dispatched bucket. The operator-facing rollup treats "command in flight from the reconciler's view" as one signal; the wire-level distinction stays.
- observeActive: DISPATCHING is the brief mid-cmd.Curtail window for the in-flight tick. Don't re-enter from a sibling tick; let the original tick complete its own DISPATCHED transition.
- maybeMarkActive: DISPATCHING is in-flight, same as DISPATCHED, so hold Pending for the next tick.

maybeCompleteRestoring keeps its default arm with an explicit //nolint:exhaustive directive. The default is load-bearing for defense-in-depth: a future schema-added target state must stay non-terminal until it ships its handling. Pinned by TestReconciler_Restoring_UnknownTargetStateKeepsEventNonTerminal.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e64ceb307a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The directory is locally-private and managed via a folder-local self-ignoring .gitignore (file content: `*`) that the operator creates on their own checkout. Keeping the entry out of the root .gitignore avoids exposing the internal-plan directory name in a public artifact and keeps the root file focused on universally-ignored paths.
A target stranded in DISPATCHING by an interrupted tick (process crash, panic, or context cancellation between the pre-command stamp and the post-command DISPATCHED write) would otherwise stay stranded — the dispatch loop only picked up PENDING. Ticks are serial, so any DISPATCHING seen at the top of dispatchPending or maybeClaimRestoreBatch is by definition orphaned; redispatch is the recovery path since Curtail/Uncurtail are device-idempotent.
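A minimal sketch of the recovery rule above, assuming targets are held in memory and states are plain strings. `Target` and `claimForDispatch` are illustrative names, not identifiers from the branch.

```go
package main

import "fmt"

// Target is a hypothetical, simplified view of a curtailment target row.
type Target struct {
	ID    string
	State string
}

// claimForDispatch returns the targets a tick should (re)dispatch.
// Because ticks are serial, a DISPATCHING row seen at the top of a tick
// cannot belong to a live dispatch, so it is treated as an orphan and
// claimed alongside PENDING rows.
func claimForDispatch(targets []Target) []Target {
	var claimed []Target
	for _, t := range targets {
		switch t.State {
		case "PENDING", "DISPATCHING":
			claimed = append(claimed, t)
		}
	}
	return claimed
}

func main() {
	got := claimForDispatch([]Target{
		{ID: "m1", State: "PENDING"},
		{ID: "m2", State: "DISPATCHING"}, // orphan from an interrupted tick
		{ID: "m3", State: "CONFIRMED"},
	})
	for _, t := range got {
		fmt.Println(t.ID, t.State)
	}
}
```

Redispatching an orphan is only safe because the commands are idempotent on the device; a non-idempotent command would need a different recovery path.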
AdminTerminateEvent is intentionally idempotent: a re-issue against an event already in the requested terminal state echoes the row without re-running the transition or sweep. Emitting a duplicate curtailment_admin_terminated activity row for those no-op calls would mislead audit consumers tracking operator action history. Plumb a transitioned bool through the store interface (false on both idempotent-echo paths — currentState==targetState on first read, and latestState==targetState on the race-loss re-read) and gate emitAdminTerminateAuditTrail on it.
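The gating can be illustrated with an in-memory store. This is a sketch under assumptions: `memStore`, `Service`, and the string states are hypothetical, and the real store's race-loss re-read is collapsed into a single state comparison.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, simplified event row.
type Event struct {
	ID    string
	State string
}

var errFailedPrecondition = errors.New("event already in a different terminal state")

type memStore struct{ events map[string]*Event }

func isTerminal(state string) bool {
	return state == "CANCELLED" || state == "FAILED" || state == "COMPLETED"
}

// AdminTerminate reports transitioned=false when the event is already in
// the requested terminal state (the idempotent echo).
func (s *memStore) AdminTerminate(id, target string) (Event, bool, error) {
	ev, ok := s.events[id]
	if !ok {
		return Event{}, false, errors.New("event not found")
	}
	if ev.State == target {
		return *ev, false, nil
	}
	if isTerminal(ev.State) {
		return Event{}, false, errFailedPrecondition
	}
	ev.State = target
	return *ev, true, nil
}

// Service records audit rows in a slice for illustration.
type Service struct {
	store *memStore
	audit []string
}

// AdminTerminateEvent emits an audit row only for a real transition.
func (s *Service) AdminTerminateEvent(id, target string) (Event, error) {
	ev, transitioned, err := s.store.AdminTerminate(id, target)
	if err != nil {
		return Event{}, err
	}
	if transitioned {
		s.audit = append(s.audit, "curtailment_admin_terminated")
	}
	return ev, nil
}

func main() {
	svc := &Service{store: &memStore{events: map[string]*Event{
		"e1": {ID: "e1", State: "PENDING"},
	}}}
	svc.AdminTerminateEvent("e1", "CANCELLED")
	svc.AdminTerminateEvent("e1", "CANCELLED") // idempotent echo
	fmt.Println(len(svc.audit))
}
```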
- cursor decode now rejects non-positive org_id as InvalidArgument rather than silently restarting pagination: the feature has never shipped, no legacy tokens exist, and the silent path hid tampered cursors from audit detection
- removed unreachable s.audit nil guards (NewService always installs NoOpAuditLogger; WithAuditLogger refuses nil)
- removed redundant restoreBatchIntervalUpperBoundSec re-check in Start(): validateStartRequest already enforces the bound for non-zero values, and the default (30s) cannot exceed the cap
- updated stale handler.go package doc and main.go RPC-wire comment; both referenced an Unimplemented surface that this branch fully wires
- stripped v1 roadmap marker from ListEvents godoc
- rewrote migration 000055 comment to correctly describe golang-migrate v4's driver behavior and document operator recovery for partial CONCURRENTLY build failure
- added compile-time Metrics assertions next to the duplicated recordingMetrics fakes in service_test.go and reconciler_test.go so a future interface change can't silently drift one copy
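A rough sketch of the stricter cursor decode from the first bullet, assuming a base64-wrapped JSON token. The `cursor` struct, its JSON field names, the unpadded URL-safe alphabet, and the single `errInvalidCursor` sentinel are assumptions for illustration; the branch's actual codec may differ.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// cursor is a hypothetical token payload: tenant, filter, and row position.
type cursor struct {
	OrgID       int64  `json:"org_id"`
	StateFilter string `json:"state_filter"`
	ID          int64  `json:"id"`
}

// errInvalidCursor stands in for the InvalidArgument the handler would return.
var errInvalidCursor = errors.New("invalid page token")

func encodeCursor(c cursor) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// decodeCursor rejects malformed, tampered, or mismatched tokens instead of
// silently restarting pagination from the first page.
func decodeCursor(token string, orgID int64, stateFilter string) (cursor, error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return cursor{}, fmt.Errorf("%w: %v", errInvalidCursor, err)
	}
	var c cursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return cursor{}, fmt.Errorf("%w: %v", errInvalidCursor, err)
	}
	if c.OrgID <= 0 || c.ID <= 0 {
		return cursor{}, fmt.Errorf("%w: non-positive org_id or id", errInvalidCursor)
	}
	if c.OrgID != orgID || c.StateFilter != stateFilter {
		return cursor{}, fmt.Errorf("%w: token bound to a different request", errInvalidCursor)
	}
	return c, nil
}

func main() {
	tok, _ := encodeCursor(cursor{OrgID: 7, StateFilter: "ACTIVE", ID: 42})
	c, err := decodeCursor(tok, 7, "ACTIVE")
	fmt.Println(c.ID, err)
	_, err = decodeCursor(tok, 8, "ACTIVE")
	fmt.Println(err != nil)
}
```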
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aca4137527
    return nil, err
    }
    return connect.NewResponse(&pb.UpdateCurtailmentEventResponse{
    Event: toEventProto(event),
Return fully populated events from update/terminate RPCs
UpdateCurtailmentEvent and AdminTerminateEvent build responses with toEventProto, which only maps scalar metadata and leaves structured fields like scope, mode params, and decision snapshot unset. Because these RPCs return CurtailmentEvent, clients that replace cached event objects with the response can lose previously populated event details (e.g., scope/mode context suddenly disappearing in UI/state). Use the same persisted-field population path used by read endpoints so write responses don’t downgrade event shape.
- UpdateCurtailmentEvent now collapses same-value patches to no-op before any gate or DB write. Echoing the persisted value (typical for UI re-submissions of a pre-populated form) previously bumped updated_at and could trip the admin gate on what is semantically an unchanged request.
- The max_duration_seconds admin gate now compares against the effective patch (persisted-vs-requested), not the raw request. A non-admin echoing an admin-elevated value gets the no-op path instead of Forbidden.
- New audit row (curtailment_updated) records operator field mutations. Metadata lists only the fields the patch actually changed so a feed reader sees intent without diffing snapshots.
- Reason / idempotency_key / external_source / external_reference length checks now count runes (utf8.RuneCountInString) to match the proto validator's rune-based max_len. A 256-character multi-byte reason that survived the proto pass no longer trips the byte-based service backstop with a confusing error.
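The effective-patch collapse and the rune-based length check can be sketched together. The `Patch` and `Persisted` types, `effectivePatch`, and `validateReason` are hypothetical names and cover only two of the patchable fields; `utf8.RuneCountInString` is the standard-library call the commit names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const maxReasonRunes = 256

// validateReason counts runes, not bytes, so it agrees with a rune-based
// max_len on the proto side.
func validateReason(reason string) error {
	if n := utf8.RuneCountInString(reason); n > maxReasonRunes {
		return fmt.Errorf("reason has %d characters, max %d", n, maxReasonRunes)
	}
	return nil
}

// multiByteReason builds a reason of n three-byte runes for demonstration.
func multiByteReason(n int) string { return strings.Repeat("世", n) }

// Patch uses nil to mean "field not set in the request".
type Patch struct {
	Reason             *string
	MaxDurationSeconds *int64
}

// Persisted is the currently stored value of the patchable fields.
type Persisted struct {
	Reason             string
	MaxDurationSeconds int64
}

// effectivePatch drops fields that echo the persisted value and returns the
// names of the fields that really change (what an audit row would list).
func effectivePatch(cur Persisted, p Patch) (Patch, []string) {
	var out Patch
	var changed []string
	if p.Reason != nil && *p.Reason != cur.Reason {
		out.Reason = p.Reason
		changed = append(changed, "reason")
	}
	if p.MaxDurationSeconds != nil && *p.MaxDurationSeconds != cur.MaxDurationSeconds {
		out.MaxDurationSeconds = p.MaxDurationSeconds
		changed = append(changed, "max_duration_seconds")
	}
	return out, changed
}

func main() {
	fmt.Println(len(multiByteReason(256)), validateReason(multiByteReason(256)))
	cur := Persisted{Reason: "grid event", MaxDurationSeconds: 7200}
	echo := int64(7200)
	_, changed := effectivePatch(cur, Patch{MaxDurationSeconds: &echo})
	fmt.Println(len(changed))
}
```

Running the admin gate on the output of `effectivePatch` rather than the raw request is what lets a non-admin echo of an elevated value pass as a no-op.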
InsertEventWithTargets's fall-through path for an unrecognized unique-constraint name previously wrapped the raw pgconn.PgError, exposing the internal constraint name in the wire response. No current constraint reaches the fall-through, but a future partial unique index added without updating the switch would silently exfiltrate its name on every concurrent racing Start. Log the constraint server-side for operators and return a sanitized AlreadyExists to the caller.
…overy + resilient restore batch
- dispatchPending now reads event liveness once per tick instead of once per target. The DISPATCHING pre-command write's EXISTS guard is the load-bearing race-closure for a concurrent AdminTerminate; the per-target liveness re-read was defense-in-depth that scaled O(N) per tick. At 100+ pending targets the redundant reads pushed tick latency toward the per-event deadline before any Curtail fired. The updated mid-loop-terminate test now pins the EXISTS guard's race-closure via the UpdateTargetState race-loss sentinel.
- observeActive now treats DISPATCHING targets the same as DRIFTED: redispatch via Curtail (device-idempotent) under MaxRetries. A prior interrupted active-drift dispatch left the target stuck indefinitely under the previous "let the in-flight tick complete" arm; ticks are serial, so any DISPATCHING seen at observeActive entry is by definition an orphan from a crashed prior tick.
- dispatchRestoreBatch's DISPATCHING pre-write loop now drops just the failing target on a non-race-loss error instead of aborting the entire batch. The remaining devices proceed to Uncurtail in the same tick, and the dropped row is re-claimed as an orphan next tick. Race-loss continues to abort the batch (event is no longer dispatchable).
- The orphaned DISPATCHING restore-target recovery test now asserts the device was re-stamped DISPATCHING before Uncurtail fired, pinning the AdminTerminate in-flight-gate contract on this path.
- Idempotency race-loser: pins the InsertEventWithTargets -> ErrCurtailmentIdempotencyKeyRaceLoss / ErrCurtailmentExternalReferenceRaceLoss -> retry-lookup -> replay-winner flow on both webhook channels, plus the rollback-induced fall-through to AlreadyExists when the retry lookup also misses. The fake store now models a separate post-insert lookup map so concurrent-first-time-Start scenarios can be constructed without coupling to insertEventCalls in the test bodies.
- Actor-type mapping: pins ActorUser for both SourceActorUser and SourceActorAPIKey (the activity_log doesn't yet model an api_key actor; a future split must not silently keep this coercion) and ActorScheduler for SourceActorScheduler.
- Cursor binding: a parameterized round-trip across the (OrgID, StateFilter) shapes ListEvents actually issues. Documents the contract the SQL store's mismatch guard relies on so a serialization regression on either field would trip loudly.
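The race-loser flow from the first bullet, sketched for a single idempotency channel. `Store`, `startIdempotent`, and the sentinel names are assumptions; the function-valued fields exist only so the example can script a lookup that misses first and hits on retry.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, simplified event row.
type Event struct{ ID string }

var (
	errRaceLoss      = errors.New("idempotency key race loss")
	errAlreadyExists = errors.New("curtailment event already exists")
)

// Store is an illustrative seam over the persistence layer.
type Store struct {
	Lookup func(key string) (Event, bool)
	Insert func(key string, ev Event) error
}

// startIdempotent returns (event, replayed, error). A concurrent first-time
// Start that loses the insert race replays the winner's row; if the retry
// lookup also misses (the winner rolled back), it reports AlreadyExists.
func startIdempotent(s Store, key string, ev Event) (Event, bool, error) {
	if prior, ok := s.Lookup(key); ok {
		return prior, true, nil
	}
	err := s.Insert(key, ev)
	if err == nil {
		return ev, false, nil
	}
	if !errors.Is(err, errRaceLoss) {
		return Event{}, false, err
	}
	if winner, ok := s.Lookup(key); ok {
		return winner, true, nil
	}
	return Event{}, false, errAlreadyExists
}

func main() {
	lookups := 0
	s := Store{
		Lookup: func(string) (Event, bool) {
			lookups++
			return Event{ID: "winner"}, lookups > 1 // miss first, hit on retry
		},
		Insert: func(string, Event) error { return errRaceLoss },
	}
	got, replayed, err := startIdempotent(s, "k1", Event{ID: "loser"})
	fmt.Println(got.ID, replayed, err)
}
```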
…ygiene
- restore_batch_interval_sec's non-admin cap now runs against the effective patch in Service.Update, mirroring the max_duration_seconds fix landed earlier in this PR. A non-admin echoing an admin-elevated value as part of an unrelated patch (UI form re-submission) collapses to no-op and no longer trips Forbidden; the asymmetric gate placement between the two fields is fixed.
- observeActive comment refreshed to reflect the post-hoist safety model. The load-bearing race-closure is the DISPATCHING pre-write's EXISTS guard inside dispatchOneCurtail, not a per-target liveness read that no longer exists.
…re-write, utf8 boundary
- Update audit emission: a real field change produces one curtailment_updated row whose `fields` metadata lists only the actually-changed field names (no-op echoes excluded). A patch where every field matches the persisted value collapses to no-op with zero store calls and zero audit rows.
- Update gate symmetry: non-admin echo of admin-elevated max_duration passes; non-admin echo of admin-elevated restore_batch_interval also passes (the asymmetric pre-fix would have rejected this).
- Update reason length boundary: 256 multi-byte runes (768 bytes) pass rune-count validation; 257 reject. Pins the byte-vs-rune fix.
- observeActive DISPATCHING orphan recovery: a target left in DISPATCHING on an ACTIVE event is redispatched on the next tick; budget-exhausted orphans are not redispatched.
- dispatchRestoreBatch partial pre-write failure: a non-race-loss pre-write error drops just that target from this tick's batch; Uncurtail fires for the surviving devices; the failed target stays in its prior state for next-tick reclaim.
- Removed the now-unused getEventByUUIDHook + getEventByUUIDCalls fields from the reconciler fakeStore (the mid-loop-terminate test was rewritten to use updateTargetStateHook in the prior commit).
…note
- observeActive: trim the over-explanatory race-closure block to one line; move the comment to sit above the actual eventStillDispatchable call rather than the ListCandidates fetch above it.
- validateUpdateRequest: drop the explanatory comment under the restore_batch_interval_sec block. The symmetric max_duration_seconds block carries no such note; an annotation on one field and silence on the other is more confusing than consistent silence. Both gates live in Service.Update via effectiveUpdatePatch, so readers tracing the cap-check follow the existing max_duration_seconds pattern.
- Active-phase orphan budget-exhausted test now asserts the final state stays DISPATCHING and RetryCount stays at the cap. Mirror the rigor of the symmetric Drifted-arm exhaustion test so a silent state flip on the no-redispatch path would not pass.
- New TestReconciler_ObserveActive_DispatchingOrphanRaceLossDoesNotIssueCommand pins the EXISTS-guard race-closure on the observeActive redispatch path: when the DISPATCHING pre-write returns the race-loss sentinel, cmd.Curtail must not fire and the mirror must not advance.
- Restore partial-pre-write test now asserts m1.RetryCount==0 so the "skip without budget burn" invariant is pinned (the dispatch attempt never reached cmd.Uncurtail, so no retry slot should be consumed).
- New TestReconciler_Restoring_AllPreWriteFailuresSkipUncurtail pins the degenerate dispatchSet-empty path: every pre-write fails, no Uncurtail fires, no retry burns, targets stay Pending for next-tick reclaim.
- Reworded the active-orphan test docstring to describe the invariant rather than the review process that surfaced the gap.
Summary
Closes the operator-facing surface and observability scaffolding for v1 curtailment. Builds on the lifecycle and dispatch work already on main (preview + start + dispatch + reconciler in #192, stop + staggered restore + max-duration enforcement in #232).
Operator read / update / admin
- `ListCurtailmentEvents` provides cursor-paginated history. The decision snapshot is trimmed at the SQL boundary so the response stays bounded on large fleet events: the SQL projection strips the per-device `skipped` array and computes the `skipped_aggregate` reason→count map inline, and per-target rows are intentionally omitted (consumers paginate over events here and fetch per-event detail separately). The cursor token carries org_id and state_filter alongside the row id so a cursor cannot cross tenants or state filters; legacy tokens without org_id transparently restart from the first page so a pagination loop crossing the deployment boundary does not surface a confusing error. Migration 000055 adds the supporting `(org_id, id DESC)` index using `CREATE INDEX CONCURRENTLY` (with the migrate-tool `no-transaction` annotation) so the build does not lock the table on high-row-count deploys.
- `UpdateCurtailmentEvent` accepts operator-safe fields only: `reason`, `restore_batch_size`, `restore_batch_interval_sec`, `max_duration_seconds`. The service rejects empty patches (a request with no patchable field set would still bump `updated_at` via COALESCE, producing a misleading freshness signal). The same admin gate as Start applies to `max_duration_seconds`: non-admin callers cannot raise the cap above the org default. Reason carries the same length bound as Start (256 chars). A race between the pre-read and the UPDATE surfaces as a typed FailedPrecondition rather than silently no-op'ing.
- `AdminTerminateEvent` body. Forces a non-terminal event to `CANCELLED` or `FAILED` and sweeps every non-terminal target to `RESTORE_FAILED` in the same transaction. The validator restricts `target_state` to those two; `COMPLETED` is rejected because the RPC fires when restore did not actually run. The Stop-first gate now triggers on any in-flight target (`DISPATCHING`, `DISPATCHED`, `CONFIRMED`, or `DRIFTED`), not only on `ACTIVE` events, so a pending event whose reconciler tick already issued curtail commands cannot be sliced out from under those commands without compensating Uncurtails. Idempotent re-issue against the same target state echoes the row without re-running the transition or sweep, and suppresses audit emission so audit consumers tracking operator action history do not see a phantom action; a different terminal state surfaces FailedPrecondition with a distinct message. The reason field carries a 256-char cap so a bulky operator string cannot amplify across thousands of target rows in the sweep.

Webhook ingestion idempotency
Pre-insert lookup at the persistence boundary on `(org_id, idempotency_key)` first, then `(org_id, external_source, external_reference)`. A redelivery returns the original event without re-running selection (including its persisted target list and state) so retry callers do not see a synthesized PENDING response for a terminated event. The race-loser path (two concurrent first-time Starts past the lookup) falls into the same replay branch as a deliberate retry rather than surfacing Internal with the Postgres constraint name leaked in the error string.

Audit trail
Every successful Start emits a `curtailment_started` activity row. When `allow_unbounded` or `force_include_maintenance` is set, a typed row (`curtailment_unbounded_start` / `curtailment_force_include_maintenance`) emits alongside the base: two rows rather than one with a flag, so a feed of override-class starts is a simple event-type filter rather than a metadata scan. The audit metadata key matches the proto field name (`force_include_maintenance`, not the abbreviated `force_include`). `IncMaintenanceOverride` fires in parallel so the override rate surfaces on the platform metrics dashboard without joining against `activity_log`.

`AdminTerminateEvent` emits its own activity row capturing actor + reason, but suppresses emission on idempotent replays; a duplicate `curtailment_admin_terminated` row for a no-op echo would mislead consumers. The audit `ActorType` reflects `source_actor_type` (scheduler / user / api_key) rather than defaulting to user.

Reconciler state-guard
Every reconciler dispatch (Curtail on pending targets, Uncurtail on restoring batches) re-reads the event immediately before the command issues so a tick that read its event list before a concurrent `AdminTerminateEvent` does not dispatch commands against a now-terminated event. The check is hoisted to the per-event level so a 100-target event pays one DB read per tick, not 100. Targets are stamped `DISPATCHING` before `cmd.Curtail` / `cmd.Uncurtail` so the row is visible to a concurrent terminate's in-flight gate during the command window; a tick interrupted between the pre-write and the post-command transition leaves a `DISPATCHING` orphan, and the next tick redispatches it via the normal pending-target loop (Curtail / Uncurtail are device-idempotent).

`UpdateCurtailmentEventState` and `UpdateCurtailmentTargetState` are both `:execrows` and surface zero-rows-affected as typed sentinels (`ErrCurtailmentEventStateRaceLoss` / `ErrCurtailmentUpdateTargetStateRaceLoss`); the reconciler logs the signal and increments a dedicated counter rather than silently treating the race-loss as a successful transition.

Metrics interface
A `reconciler.Metrics` interface inside the curtailment domain with tick-duration, tick-failure, candidate-exclusion (labeled by reason), maintenance-override, and event-state-race-loss recorders. The default is a no-op; the concrete implementation wires at `cmd/fleetd/main.go` once the platform observability path lands. Interface shape is stable enough that the swap is a one-file change with no curtailment-package churn.

Heartbeat staleness runbook
The 5-minute staleness signal is canonically a SQL check against the `curtailment_reconciler_heartbeat` row, not an application metric. The runbook documents the SQL form and walks four failure modes (panic loop, slow-query contention, events not picked up, restore loop). Operator response steps lean on `AdminTerminateEvent` for the cases where infrastructure mitigation isn't enough; the runbook calls out the Stop-first requirement so an operator does not hit FailedPrecondition trying to terminate an active event directly.

Proto contract evolution
- `AdminTerminateEventRequest.idempotency_key` (field 4) is removed for v1; tag and field name are reserved so the slot cannot accidentally be reused. AdminTerminate idempotency is state-based: a re-issue against the same target state echoes the row.
- `AdminTerminateEventRequest.reason` now carries `min_len = 1, max_len = 256`; the service mirrors the cap as defense in depth.
- `ListCurtailmentEventsRequest.page_token` carries `max_len = 1024` so the base64+JSON decode path is bounded.
- `CurtailmentTargetState` adds `DISPATCHING` (enum value 8) to model the transient stamp between the pre-command write and the post-command state; the value is part of the read-back contract for any consumer paging over targets.
- `AdminTerminateEvent` and `ListCurtailmentEvents` RPC doc comments enumerate the FailedPrecondition variants and the trimmed response-shape contract respectively, matching the convention `StopCurtailment` already established.

Follow-up
A few items are intentionally outside this PR's scope, captured for the next iteration:
- `AdminTerminate` lacks a `force` flag for cases where Stop also fails (DB outage, target adapter unreachable). Adding the escape hatch is a contract decision that pairs with a runbook entry describing the abandonment semantics.
- `allow_unbounded` and the operator override fields are intentionally API-key-reachable until that surface exists.

Test plan
- Service-layer unit tests cover `ListCurtailmentEvents`, `UpdateCurtailmentEvent`, and `AdminTerminateEvent` end to end: happy path, state-machine guards, admin gating, empty-patch rejection, race-loss handling, the broadened in-flight-targets requirement, and the audit-suppression / audit-emission split across idempotent-replay and real-transition arms.
- Idempotency-replay tests cover both Start channels (key, external-source/reference) including precedence ordering, partial-fields handling, lookup error propagation, persisted-payload return on replay, and the unique-violation race-loser path through the constraint-name sentinel routing.
- Audit-emission tests pin the base row + override-specific rows under expected conditions, that the source actor type maps correctly, and that AdminTerminate suppresses emission on idempotent echoes. The lifecycle test pins Preview → Start → Stop → AdminTerminate persistence + emission.
- Reconciler tests cover the per-event state-guard skip path on Curtail and restore dispatch, the `DISPATCHING` pre-command stamp visible during the command window, orphan recovery on the next tick for interrupted curtail and restore dispatches, and the typed race-loss signals on event-state and target-state updates.
- Handler-level tests for each new RPC cover session resolution, role gates, malformed UUID rejection, proto/service translation, and the FailedPrecondition error-code mapping for both AdminTerminate variants. Cursor codec tests cover round-trip, malformed-base64 rejection, missing-org-id legacy restart, state-filter mismatch rejection, and non-positive id rejection.
- A docker-driven HTTP-level E2E in `server/e2e/` exercises the lifecycle path against a real Postgres + reconciler tick loop, including the reconciler-restart recovery path that re-picks up a `DISPATCHING` orphan after a process restart.
- `go build ./...` clean; curtailment domain + handler + cursor test suites green; lint clean on the changed scope (pre-existing repo-wide lint debt unrelated to this branch).

Closes #289