Skip to content

feat(curtailment): StartCurtailment + dispatch + reconciler#192

Merged
rongxin-liu merged 34 commits into
mainfrom
feat/issue-191-curtailment-start-dispatch-reconciler
May 19, 2026
Merged

feat(curtailment): StartCurtailment + dispatch + reconciler#192
rongxin-liu merged 34 commits into
mainfrom
feat/issue-191-curtailment-start-dispatch-reconciler

Conversation

@rongxin-liu
Copy link
Copy Markdown
Contributor

@rongxin-liu rongxin-liu commented May 7, 2026

Background

Curtailment is the proto-fleet feature that reduces a fleet's mining power on demand. The contract foundation (#118) shipped the proto surface with handler stubs returning Unimplemented. The admin layer (#173) added AdminTerminateEvent, three admin-gated override fields, session-only registration of the recovery RPC, and requireAdminFromContext. Persistence and the operator-facing PreviewCurtailmentPlan (#188) followed.

This PR ships the operator-facing StartCurtailment write path, the per-device dispatch primitives (Curtail / Uncurtail), the active-event preflight filter, the schedule processor's curtailment-skip emission, and the background reconciler that drives non-terminal events forward. The StartCurtailment operator entrypoint is gated off (startEnabled=false in cmd/fleetd/main.go) until the Stop + restorer + max_duration_seconds enforcement work lands; the dispatch primitives, filter, and reconciler ship live so the wiring soaks in production. With the gate flipped, an operator's Start request lands in DB, the reconciler dispatches Curtail per target under the operator's user identity, telemetry confirms each target, and the event progresses pending → active with drift detection + bounded redispatch.

Summary

Operator surface

StartCurtailment validates the request, runs the selector pipeline shared with PreviewCurtailmentPlan, and persists event + per-target rows in one transaction. baseline_power_w is captured from the latest telemetry sample per target. Insufficient curtailable load returns InvalidArgument with the same structured detail as Preview. Source actor (user / api_key) and the operator's user.id are derived from session.Info so audit attribution stays correct without pulling session into the service.

StartCurtailment is gated behind a startEnabled bool field on the handler; cmd/fleetd/main.go passes false. With the gate off, the RPC returns Unimplemented regardless of caller. This protects production from an event reaching active without an exit path until Stop + restorer ship in a follow-up.

Operator-id plumbing and FK fix

Migration 000050 adds curtailment_event.created_by_user_id (NOT NULL FK to user(id)). The operator's session.Info.UserID is captured at handler entry, threaded through StartRequest.CreatedByUserID, persisted on the event row, and read back by the reconciler when synthesizing dispatch context. Without this column the reconciler would dispatch with UserID=0 and every command_batch_log insert would fail the FK to user, burning targets to RestoreFailed on the first reconciler tick. NOT NULL with no backfill is safe because PreviewCurtailmentPlan writes no rows to curtailment_event — the table is empty in any environment that has only run earlier migrations.

Dispatch primitives

  • proto/minercommand/v1/command.proto: extends CommandType with COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL. Generated Go + TS regenerated.
  • commandtype.Curtail / commandtype.Uncurtail Go enum values with String / FromString round-trip and MarshalText / UnmarshalText parity. activityEventType arms emit curtail / uncurtail for audit-log rows.
  • session.ActorCurtailment and the matching activitymodels.ActorCurtailment constant, plus an actorTypeFromSession arm so reconciler-dispatched commands attribute to the curtailment actor in audit rows.
  • dto.CurtailPayload JSON DTO carrying the curtailment level on queue messages (int32 mirroring sdk.CurtailLevel).
  • Capability mapping: COMMAND_TYPE_CURTAIL requires CapabilityCurtailFull only (dispatch sends FULL); COMMAND_TYPE_UNCURTAIL uses the OR-set since restore is level-independent. hasAnyCapability gains the corresponding switch arms.
  • executeCommandOnDevice dispatch arms invoke minerInfo.Curtail / minerInfo.Uncurtail. Existing PluginMiner already implements both via the DeviceCurtailment optional SDK interface.
  • Service.Curtail / Service.Uncurtail public methods on the command service mirror the Reboot / StartMining shape so the reconciler dispatches through the standard preflight + queue pipeline rather than poking the queue directly.

Defensive guards on the dispatch boundary: malformed CurtailPayload and out-of-range Level surface as FailedPrecondition (not Internal) so they fail permanently on the first attempt rather than burning MaxFailureRetries against a deterministic input bug.

Preflight + schedule integration

CurtailmentActiveFilter blocks external commands targeting devices that are part of a non-terminal curtailment event. Reconciler self-traffic bypasses the filter via Actor == session.ActorCurtailment AND CommandType ∈ {Curtail, Uncurtail}, mirroring how ScheduleConflictFilter bypasses ActorScheduler. The Actor field is internal-only — handlers never set it from request data — so the bypass cannot be forged from outside. The common case (no active events) short-circuits without building a set.

The schedule processor distinguishes curtailment-active skips from priority-conflict skips and emits schedule_skipped_due_to_curtailment instead of schedule_executed device_count=0, so operators can see the actual cause when an active curtailment event preempts a scheduled command.

Reconciler

Single-instance background goroutine with a serial 30s tick, end-of-tick heartbeat upsert, and per-event panic isolation. Lifecycle wired in cmd/fleetd/main.go alongside the schedule processor with the same drain-before-cancel + watchdog shape.

Per tick, for each non-terminal event:

  1. pending events: dispatch Curtail per pending target under the operator's synthesized session; transition the target to dispatched. Confirm any already-dispatched targets via the latest telemetry sample. When every target is confirmed or terminally failed, transition the event pending → active. All-terminal-failed events skip past active to completed_with_failures.
  2. active events: drift detection on confirmed targets. A device whose telemetry no longer satisfies the curtailed predicate is marked drifted and re-dispatched up to MaxRetries; budget exhaustion routes the target to RestoreFailed.
  3. restoring events are owned by the future restorer; the reconciler does not write here.
  4. Heartbeat upserts at end of tick with current tick UUID, duration, and active-event count. Detached from workCtx so shutdown-watchdog cancellation cannot drop the final liveness signal.

The confirmation predicate (isCurtailed) takes a requirePositiveEvidence flag: confirm path requires positive evidence (missing telemetry → not curtailed); drift path preserves curtailed=true on missing telemetry so a flaky sensor cannot trigger a redispatch storm.

Idempotency

The idempotency_key field is plumbed through the request, the service request shape, and the curtailment_event column. The lookup query at the persistence boundary is intentionally not implemented — the field is plumbed end-to-end so the lookup can land later without a contract change. Until then, duplicate-key Start calls surface as Internal from the partial-unique-index violation.

Hardening landed during review

Tightenings on top of the original BE-3 surface, all driven by review threads on this PR:

  • Strict uint32 → int32 conversion at the translate boundary. max_duration_seconds, restore_batch_size, restore_batch_interval_sec, and min_curtailed_duration_sec reject overflow with InvalidArgument naming the offending field rather than silently saturating.
  • allow_unbounded=true + non-zero max_duration_seconds rejected explicitly. The translate layer previously dropped the cap silently when allow_unbounded=true; it now parses unconditionally so the service-level mutual-exclusion check fires from the proto surface, preventing a "bounded with safety cap" admin intent from silently becoming an unbounded event.
  • Whitespace-only reason rejected before persistence, matching the DB CHECK (length(trim(reason)) > 0) constraint.
  • max_duration_seconds=0 resolves to the org default (with allow_unbounded=false) at the service layer, matching the proto contract's "use org default" sentinel.
  • Confirm path requires positive telemetry evidence. Missing or non-finite power samples on a dispatched target do NOT promote it to confirmed; only positive curtailment evidence does. The drift path takes the opposite policy (missing samples preserve curtailed=true) so a flaky sensor cannot trigger a redispatch storm.
  • Empty dispatch batch counts as a failed attempt against the retry budget. A Start that resolves to zero live devices at dispatch time does not silently leave the target stuck.
  • Missing candidate row on a dispatched target consumes retry budget rather than stalling. After exhaustion the target lands in RestoreFailed and the event can still progress to active (or completed_with_failures if all targets failed).

Wiring

cmd/fleetd/main.go registers CurtailmentActiveFilter on commandSvc (after ScheduleConflictFilter) and starts the reconciler with the same graceful-shutdown pattern as the schedule processor. The handler is constructed with startEnabled: false.

What is intentionally not in this PR

  • StopCurtailment and the restorer. The reconciler's restoring arm is a no-op.
  • Reconciler-side max_duration_seconds enforcement (event-level termination on the cap elapsing). Pairs with the restorer.
  • UpdateCurtailmentEvent, GetActiveCurtailment, ListCurtailmentEvents.
  • idempotency_key lookup at the persistence boundary.
  • Connect-RPC plumbing for Curtail / Uncurtail in MinerCommandService — these are internal dispatches driven by curtailment events, not operator-facing commands. The proto enum extension is enough for CheckCommandCapabilities.
  • Reclassifying the other 5 dispatch arms' unmarshal errors from Internal to FailedPrecondition — separate cleanup PR.

Known gap

A target that lands in Dispatched and never receives positive curtailment evidence via telemetry stays there indefinitely (no time-based redispatch in this PR). Operators rely on AdminTerminateEvent as the escape hatch. The follow-up that implements max_duration_seconds enforcement adds the event-level time-based termination path covering this case; until then StartCurtailment is gated off in production.

Test plan

go build ./... and golangci-lint run ./... pass cleanly. go test ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./internal/domain/schedule/... ./internal/domain/command/... ./internal/domain/commandtype/... -count=1 is green for the unit surface (DB-dependent integration tests fail on connection-refused as usual; not regressions).

Coverage by area:

  • Service.Start — validation rejection per Start-specific field, including created_by_user_id <= 0, whitespace-only reason, and allow_unbounded + finite max_duration_seconds mutual exclusion; selector forwarding (whole-org, device-list); insufficient-load returns InvalidArgument with detail; empty-plan returns InvalidArgument; persistence with baseline_power_w and created_by_user_id captured; source-actor derivation per auth method; max_duration_seconds=0 resolves to org default before persistence.
  • StartCurtailment handler — happy-path with stub service; startEnabled=false returns Unimplemented even with valid creds; admin-gate preserved for restore_batch_size_override / candidate_min_power_w_override; API-key vs. session source-actor attribution; CreatedByUserID flows from session.Info.UserID into the persisted event; allow_unbounded=true paired with non-zero max_duration_seconds returns InvalidArgument without reaching persistence; strict uint32 → int32 overflow rejection on each of the four uint32 fields.
  • Service.Curtail / Uncurtail — queue receives the right command type with the right payload.
  • CurtailmentActiveFilter — bypass for ActorCurtailment + Curtail/Uncurtail; no-active-events fast path; partial-skip with multi-device kept/skipped split; empty-input passthrough.
  • Reconciler — pending → dispatched transition; telemetry-confirmed → event active; drift detection happy path; retry exhaustion routes to RestoreFailed; per-event error isolation; heartbeat advances on every tick; isCurtailed predicate covers nil / finite / non-finite power × baseline × hash combinations across both confirm and drift modes; empty dispatch batch consumes retry budget; missing candidate row on dispatched target consumes retry budget.
  • Schedule processor — schedule fires with all devices in an active curtailment event; processor emits schedule_skipped_due_to_curtailment and not schedule_executed.
  • commandtype.Type round-trip — pins StringFromString for all 13 values; rejection arm covers unknown labels.
  • session.Actor constants — distinct lowercase labels; no collision with ActorScheduler.
  • Capability mapping — asymmetric CURTAIL (FULL-only) and UNCURTAIL (OR-set) shapes pinned; hasAnyCapability covered for each curtail capability plus an OR-semantics case.
  • executeCommandOnDevice dispatch — Curtail dispatches with payload-derived level; surfaces unmarshal failure as FailedPrecondition; rejects out-of-range levels (level=0 AND level=3) as FailedPrecondition with Curtail never invoked; Uncurtail dispatches with the empty request.

Closes #191
Refs #118
Refs #173
Refs #188

Adds the foundation primitives a curtailment Start RPC and reconciler
will dispatch through:

- proto CommandType: COMMAND_TYPE_CURTAIL, COMMAND_TYPE_UNCURTAIL
- commandtype.Curtail, commandtype.Uncurtail with String/FromString
  round-trip
- session.ActorCurtailment for self-originated traffic so future
  command-preflight filters can bypass curtailment-active gating
- dto.CurtailPayload carrying the curtailment level
- capability mapping for the new command types (OR-relationship across
  CapabilityCurtailFull and CapabilityCurtailEfficiency, matching the
  DeviceCurtailment optional interface)
- executeCommandOnDevice dispatch arms invoking minerInfo.Curtail and
  minerInfo.Uncurtail with the SDK request shapes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rongxin-liu rongxin-liu requested a review from a team as a code owner May 7, 2026 14:52
Copilot AI review requested due to automatic review settings May 7, 2026 14:52
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 7, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (b027e73691c7f4ffc928e66d259efad9a561a246...1737b8858e9309500c5c046ac2ca68b0c59acf43, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Async curtail command failures can leave events stuck forever

  • Category: Reliability
  • Location: server/internal/domain/curtailment/reconciler/reconciler.go:331
  • Description: r.cmd.Curtail only enqueues an async command batch, but the reconciler immediately records the target as dispatched. Later command execution failures are not reconciled against last_batch_uuid; confirmDispatched only waits for telemetry and does not time out or consume retry budget when telemetry never confirms curtailment.
  • Impact: A failed curtail command can leave targets in dispatched and the event in pending indefinitely, while the active-curtailment filter continues blocking normal commands for those devices.
  • Recommendation: Track batch terminal status for last_batch_uuid and convert failed/expired dispatched targets through recordDispatchFailure. Add a confirmation timeout so stale telemetry or failed execution cannot park an event forever.

[HIGH] Schedule reverts are marked complete even when curtailment skips them

  • Category: Reliability
  • Location: server/internal/domain/schedule/processor.go:712
  • Description: The end-of-window revert path logs CurtailmentActiveFilter skips, but then still calls revertToActive and logRevert. If curtailment blocks some or all SetPowerTarget(...DEFAULT...) reverts, the schedule is moved out of RUNNING and will not retry after curtailment ends.
  • Impact: Miners can be stranded in a scheduled power-target mode after the window expires, reducing hashrate/revenue and making the activity log claim a revert happened when it did not.
  • Recommendation: If any curtailment-active skips occur during end-of-window revert, keep the schedule running or persist the skipped device IDs for retry after curtailment release. Only mark the window reverted once all required revert commands were dispatched or explicitly no-op.

[MEDIUM] Reconciler has no cross-process lease or target claiming

  • Category: Concurrency
  • Location: server/cmd/fleetd/main.go:421
  • Description: Every fleetd process starts a reconciler, and the reconciler lists all non-terminal events without a DB lease, advisory lock, or per-target claim. Multiple server replicas or overlapping deploys can process the same pending targets.
  • Impact: Duplicate curtail batches can be enqueued for the same miner, producing command storms, misleading audit rows, and racy target state updates.
  • Recommendation: Add a DB-backed singleton lease/advisory lock for the reconciler, and claim work atomically with UPDATE ... WHERE state = 'pending' ... RETURNING or SELECT FOR UPDATE SKIP LOCKED.

[MEDIUM] Pending dispatch fanout creates one command batch per target without a cap

  • Category: Reliability
  • Location: server/internal/domain/curtailment/reconciler/reconciler.go:269
  • Description: dispatchPending loops over every pending target and calls dispatchOneCurtail, which creates a separate command batch and status polling routine per miner.
  • Impact: A whole-org curtailment on a large fleet can enqueue thousands of batches and spawn thousands of polling goroutines in one tick, stressing the DB, queue, and command executor.
  • Recommendation: Dispatch pending targets in bounded batches, reuse one command batch for multiple devices where possible, and add per-tick/event limits with backpressure.

[MEDIUM] Full-curtail confirmation ignores positive hash rate when power is below threshold

  • Category: Reliability
  • Location: server/internal/domain/curtailment/reconciler/reconciler.go:649
  • Description: For targets with a baseline, isCurtailed returns true based only on latest_power_w <= baseline * factor. A miner still reporting positive hash rate can be confirmed as fully curtailed if its power drops below the threshold.
  • Impact: The system may overstate curtailed kW and mark an event active even though selected miners are still hashing, which can miss demand-response targets.
  • Recommendation: For CurtailLevelFull, require hash rate to be zero/non-positive when it is available, or treat positive hash as drift regardless of the power threshold.

Notes

I did not find pool URL, wallet, payout address, shell-out, Nmap, mDNS, Rust, or infrastructure changes in the reviewed diff. StartCurtailment is currently gated off in fleetd, but the reconciler is started, so the reconciler findings still matter for any existing/non-terminal DB rows and before the start gate is lifted.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds foundational curtailment command plumbing across the proto surface, server command domain, and execution path so upcoming StartCurtailment + reconciler work can dispatch Curtail/Uncurtail through the existing queue + capability-check framework.

Changes:

  • Extend minercommand.v1.CommandType with COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL (and regenerate Go/TS outputs).
  • Add Go domain enums + DTO payload (commandtype.Curtail/Uncurtail, dto.CurtailPayload) and hook command execution dispatch to minerInfo.Curtail/Uncurtail.
  • Add session attribution (session.ActorCurtailment) and update capability mapping/tests for the new command types.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
server/internal/domain/session/models.go Adds ActorCurtailment session actor for self-originated curtailment traffic attribution.
server/internal/domain/session/models_test.go Pins actor string labels and ensures new actor doesn’t collide with existing values.
server/internal/domain/miner/dto/command_dto.go Introduces CurtailPayload for queue message payloads (curtailment level).
server/internal/domain/commandtype/enum.go Adds Curtail/Uncurtail command types with String/FromString support.
server/internal/domain/commandtype/enum_test.go Adds round-trip and label-stability tests for command type string conversions.
server/internal/domain/command/service.go Extends activity event mapping for curtail / uncurtail.
server/internal/domain/command/execution_service.go Adds execution dispatch arms that call SDK curtailment methods.
server/internal/domain/command/execution_service_test.go Adds unit tests verifying curtail/un-curtail dispatch and payload unmarshalling behavior.
server/internal/domain/command/capability_mapping.go Maps new proto command types to curtailment capability set (OR semantics).
server/internal/domain/command/capability_checker_test.go Ensures new command types are included in capability mapping + RequiresCapabilityCheck.
server/generated/grpc/minercommand/v1/command.pb.go Regenerated Go protobuf output reflecting new command types.
proto/minercommand/v1/command.proto Adds COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL enum values.
client/src/protoFleet/api/generated/minercommand/v1/command_pb.ts Regenerated TS protobuf output reflecting new command types.

Comment thread server/internal/domain/command/capability_mapping.go Outdated
rongxin-liu and others added 2 commits May 7, 2026 17:36
…ties

The capability_mapping entries for COMMAND_TYPE_CURTAIL and
COMMAND_TYPE_UNCURTAIL were added without a matching arm in
hasAnyCapability, so the switch fell through and CheckCommandCapabilities
silently reported none_supported=true for every device — even when the
device's CommandCapabilities advertised CurtailFullSupported or
CurtailEfficiencySupported.

Add the two missing case arms wired to the existing proto fields, plus
unit coverage for both constants (true/false per capability) and an
OR-semantics test confirming the map entry now actually gates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reject out-of-range CurtailLevel at the dispatch boundary as
  FailedPrecondition (which is in the permanent-failure arm of
  markQueueMessageStatus) so a malformed payload doesn't burn through
  MaxFailureRetries before being rejected.
- Add activitymodels.ActorCurtailment and route session.ActorCurtailment
  through actorTypeFromSession so reconciler-dispatched commands attribute
  to the curtailment actor instead of falling through to ActorUser.
- Trim the CurtailPayload.Level doc comment of its hardcoded numeric
  values (1=Efficiency, 2=Full) which had no compile-time anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 1 comment.

Comment thread server/internal/domain/command/execution_service.go Outdated
rongxin-liu and others added 5 commits May 7, 2026 17:46
Loop the bounds-guard sub-test over level=0 (below Efficiency) and
level=3 (above Full) so a mutation on either comparison operator is
caught. Each iteration creates a fresh gomock controller and asserts the
miner's Curtail is never called.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

A malformed curtail payload is a deterministic, non-recoverable input
bug — retrying it MaxFailureRetries times on the per-device FIFO queue
just blocks the queue head. Reclassify the unmarshal-error arm to
FailedPrecondition (which short-circuits to UpdateMessagePermanentlyFailed
on the first attempt) so it matches the invalid-level branch already in
this dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CheckCommandCapabilities for COMMAND_TYPE_CURTAIL was OR'ing FULL and
Efficiency, so a miner that advertised only Efficiency was reported as
supporting CURTAIL even though curtailment dispatches FULL. Operators or
automations relying on the capability surface for fleet selection would
see false positives and the batch would fail at execution.

Tighten the mapping to CapabilityCurtailFull only; UNCURTAIL keeps the
OR-set since restore is level-independent. Tests split accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ubtest

Simplification pass: tighten three new comments where the code already
communicated what the comment said, and split the bounds-guard test loop
into named t.Run subtests so failures attribute per level value.

- capability_mapping.go: drop "currently" qualifier and 3 redundant lines
- commandtype/enum.go: trim Curtail godoc to one-liner matching peers
- command_dto.go: drop the type-restating second sentence
- execution_service_test.go: wrap level=0 / level=3 in named subtests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the StartCurtailment write path. The handler validates the
request, runs the existing selector pipeline shared with Preview, and
persists event + per-target rows in a single transaction with
baseline_power_w captured from the latest telemetry sample per device.
On insufficient curtailable load the handler returns InvalidArgument
with the same structured detail Preview emits today; on empty plan after
a successful selector run it returns InvalidArgument rather than
persisting an empty event.

Service-level changes:
- Refactor Preview into a shared runSelector pipeline so Start reuses
  org-config + scope + candidate + classify + buildPlan without
  duplication; Plan gains an EventUUID populated only on Start success.
- Add Service.Start with a StartRequest superset of PreviewRequest
  (restore batch, durations, idempotency/external attribution, reason).
  validateStartRequest layers the new bounds checks on top of the
  existing Preview validator.
- Idempotency lookup is left as a TODO at the persistence boundary;
  the field is plumbed end-to-end so the lookup query can land later
  without a contract change.

Persistence:
- CurtailmentStore.InsertEventWithTargets is the transactional helper
  that wraps both inserts so an Insert that succeeds with zero targets
  cannot leak into the lifecycle tables.
- The sqlstore implementation runs both queries inside db.WithTransaction.

Handler:
- StartCurtailment derives source_actor_type from session.Info
  (user / api_key) so the audit trail attributes correctly without
  pulling session into the service. The admin override gate already
  fires when restore_batch_size_override or candidate_min_power_w_override
  are set.
- Response echoes the persisted event with target rollup; pending state
  is the persisted shape since dispatch is the reconciler's job.

Out of scope for this commit (separate follow-ups still complete BE-3):
- Initial Curtail batch dispatch and the reconciler that picks up
  pending events.
- CurtailmentActiveFilter registration on commandSvc.
- Schedule-processor curtailment-skip emission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 129c208fc0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go Outdated
Comment thread server/internal/domain/curtailment/service.go Outdated
rongxin-liu and others added 2 commits May 7, 2026 22:04
…event filter

Curtail / Uncurtail public methods on the command service mirror the
existing Reboot / StartMining shape so the curtailment reconciler can
dispatch through the standard preflight + queue pipeline rather than
poking the queue directly.

CurtailmentActiveFilter gates external commands against devices that
are part of an active curtailment event. Curtailment-origin traffic
(Actor=ActorCurtailment) bypasses the filter so the reconciler can
issue Curtail/Uncurtail without self-blocking, mirroring how
ScheduleConflictFilter bypasses ActorScheduler for scheduler-origin
traffic. The filter short-circuits when no active events exist so the
common-case command preflight stays cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…leetd wiring

Reconciler picks up pending curtailment events and drives them through
their lifecycle: dispatch initial Curtail per pending target with
Actor=ActorCurtailment so the active-event filter self-bypasses, watch
telemetry to confirm targets, transition the event pending->active when
all targets confirm, then run drift detection on active events and
re-dispatch up to MaxRetries before declaring drift-exhausted.

Each tick is single-instance and serial. End-of-tick heartbeat upsert
backs the operational liveness alert. Per-event panics are caught so
one bad event doesn't kill the tick.

Schedule processor now distinguishes curtailment-active filter skips and
emits schedule_skipped_due_to_curtailment instead of the generic
schedule_executed device_count=0 path, so operators can see when an
active curtailment event preempted a scheduled command.

fleetd registers CurtailmentActiveFilter on commandSvc alongside the
existing ScheduleConflictFilter, and starts the reconciler alongside the
schedule processor with the same graceful-shutdown shape.

New sqlc queries: ListNonTerminalCurtailmentEvents,
UpdateCurtailmentEventState, UpdateCurtailmentTargetState,
UpsertCurtailmentReconcilerHeartbeat. Store interface gains
ListNonTerminalEvents / UpdateEventState / UpdateTargetState /
UpsertHeartbeat to wrap them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rongxin-liu rongxin-liu changed the title feat(curtailment): add Curtail/Uncurtail command type and dispatch path feat(curtailment): StartCurtailment + dispatch + reconciler May 7, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48bbed492e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
rongxin-liu and others added 3 commits May 7, 2026 23:04
Several reconciler issues that block real operation under failure modes:

- A filter-skipped command (cmd.Curtail returns nil error with non-empty
  result.Skipped) was silently treated as a successful dispatch. The
  device never received the curtail command but the target moved to
  Dispatched, so confirmDispatched would never see telemetry move and
  the target stayed dispatched indefinitely. Now examines Skipped after
  the nil-error check; matching device entries fall through to the
  recordDispatchFailure path with the skip reason as LastError.
- A single permanently-failing dispatch held the entire event in pending
  forever — maybeMarkActive required all targets in Confirmed with no
  terminal-failure budget. dispatchOneCurtail errors now bump RetryCount,
  persist LastError, and at MaxRetries transition the target to
  RestoreFailed; maybeMarkActive admits Confirmed-or-terminal as the
  promotion condition. When every target is RestoreFailed the event
  transitions to completed_with_failures rather than sitting non-terminal.
- After Confirmed→Drifted→Dispatched (drift redispatch), observeActive's
  Dispatched arm was an empty fall-through. Targets stayed dispatched
  even when telemetry showed curtailment had resumed. The arm now calls
  the same telemetry-confirmation path used during initial dispatch so
  drift recovery closes back to Confirmed in the same flow.
- ListTargetsByEvent fetched three times per pending event per tick
  (dispatchPending + confirmDispatched + maybeMarkActive). dispatchPending
  now fetches once and threads the slice through subsequent phases;
  per-target updates mutate in-place. The structural change also resolves
  the silent-error-swallowing in confirmDispatched and maybeMarkActive
  since neither phase fetches anymore.
- RetryCount was never reset when a re-dispatched target re-confirmed,
  so the budget was consumed across unrelated drift episodes; an
  off-by-one between checkDrift and observeActive's guards meant
  checkDrift's effective budget was MaxRetries+1. Both now reset on
  (re)confirm and use >=MaxRetries as the consistent boundary.
- runTick lacked a top-level defer recover(); a panic in
  ListNonTerminalEvents tore down the goroutine. safeTick now wraps
  each tick with panic recovery and the next tick still runs.
- Heartbeat upsert at end of runTick used workCtx; the shutdown
  watchdog's workCancel could drop the final heartbeat write. The
  upsert now uses a Background-derived ctx with a 5s timeout so a
  staleness alert won't fire spuriously after a clean restart.
- Reconciler.Start was not idempotent — double-Start spawned parallel
  tick loops. A running flag guards both Start and Stop.
- Successful dispatchOneCurtail now clears LastError on the target row
  rather than leaving the previous failure string in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h_size

Add proto buf.validate length bounds to the four text fields on
StartCurtailmentRequest that previously accepted arbitrary input:

- idempotency_key, reason, external_source, external_reference all gain
  string.max_len = 256 so a malicious or buggy caller cannot persist
  multi-megabyte values through these unbounded TEXT columns. The
  defense-in-depth length check in validateStartRequest catches non-
  Connect callers (internal CLIs, tests, future non-Connect surfaces).

- restore_batch_size gains uint32.lte = 10000 to match the existing
  bound on StopCurtailmentRequest.restore_batch_size_override; the
  proto contract was inconsistent across the two sibling fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip helpers

InsertEvent and InsertTarget on CurtailmentStore became unused once
InsertEventWithTargets landed as the only transactional write surface.
Removing the methods drops the SQL store implementations and the
panic-stub-on-each-fake-store boilerplate.

countConflictSkips and countCurtailmentActiveSkips in schedule/processor
were structural duplicates differing only in the constant compared.
Extracted into countSkipsByFilter; the per-filter wrappers are one-liners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f37d0f9648

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9939d07cca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go Outdated
rongxin-liu and others added 2 commits May 8, 2026 09:36
…ndary

The four uint32 fields on StartCurtailmentRequest (max_duration_seconds,
restore_batch_size, restore_batch_interval_sec, min_curtailed_duration_sec)
went through uint32ToInt32Saturating, which silently clamped values above
math.MaxInt32 to MaxInt32. A caller sending 3_000_000_000 seconds got a
different persisted duration with no validation error, breaking
request/response accuracy for valid protobuf inputs.

Replace the saturating helper with uint32ToInt32Strict that returns
InvalidArgument naming the offending field. Connect-validated inputs are
unaffected (the proto bounds cap reachable values well below MaxInt32);
non-Connect callers and any future field without a proto bound now see
a clear error instead of silent saturation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…otency_key

A retry with a previously-used (org_id, idempotency_key) pair tripped the
partial unique index uq_curtailment_event_idempotency at insert time,
surfacing as Internal — defeating the purpose of exposing an idempotency
key on this API.

Add GetCurtailmentEventByIdempotencyKey sqlc query (org-scoped) and a
matching CurtailmentStore.GetEventByIdempotencyKey method. Service.Start
performs the lookup before the selector + insert pipeline; on a hit it
reconstructs a minimal Plan from the persisted event + targets and
returns early so the retry produces the same response shape as the
original Start. NotFound on the lookup falls through to the normal path.

The reconstructed Plan carries the persisted event_uuid, the persisted
max_duration_seconds (so EffectiveMaxDurationSeconds is populated), and
SelectedDevice entries built from curtailment_target rows. Skipped
candidates and estimated kW values are not re-derived from
decision_snapshot_jsonb to avoid coupling the retry path to the snapshot
schema; clients can re-fetch full detail via the read APIs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79980009c4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go
Comment thread server/internal/domain/curtailment/reconciler/reconciler.go
rongxin-liu and others added 3 commits May 8, 2026 09:57
confirmOneDispatched and checkDrift previously skipped silently when the
candidate row was missing — typically after a device unpaired or got deleted
mid-event. The target stayed in dispatched/confirmed forever, never consumed
its retry budget, and the event could not progress to a terminal state.

Both paths now route through recordDispatchFailure so the retry budget moves
forward and the target lands in RestoreFailed at exhaustion. The confirm path
keeps the target dispatched while retrying; the drift path keeps it drifted.

Add two regression tests covering each branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Service.Start short-circuits on a matching idempotency_key, the response
was built from the *retry* request's fields (priority, reason, scope, mode
params, batch sizes, include_maintenance, etc.) bolted onto the persisted
event_uuid + targets. A caller reusing a key with drifted metadata would see
the new attributes echoed alongside the original event, which is internally
inconsistent.

Thread the persisted Event + Target rows through Plan and have the response
translator describe the persisted row directly: state, priority, reason,
scope (reconstructed from scope_jsonb), mode params (from mode_params_jsonb),
batch sizes, observed/baseline power, retry counts, and a real per-state
target rollup. Fresh-Start path is unchanged.

Add a handler-level regression test that retries with deliberately drifted
metadata and asserts the response describes the persisted row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…achable arm

planFromExistingEvent populated plan.Selected on every idempotent retry, but
the response now reads plan.PersistedTargets exclusively on that path; the
Selected loop was a per-retry O(N) allocation with no consumer.

Also drop the unreachable "active" arm of desiredStateProto and its const —
v1 only writes "curtailed". Future stop/restore work can add the arm when it
adds the writer. Inline the remaining literal so the cross-package mirror
constant goes away.

Trim two godocs to one line each per the project's terse-comment preference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cda1a3c8f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go Outdated
Comment thread server/internal/domain/curtailment/service.go
rongxin-liu and others added 2 commits May 8, 2026 12:42
…only

startResponseFromPersisted called effectiveMaxDurationSeconds with a nil
request, leaning on the proto getter's nil-safety. The intent on the
persisted path is "render plan.EffectiveMaxDurationSeconds, mapping nil to
0 for allow_unbounded events" — a fresh-path helper that falls back to
req.GetMaxDurationSeconds() obscures that and would break loudly if the
generated getter ever changed.

Add a persistedMaxDurationSeconds helper that takes only the *int32 plan
field. effectiveMaxDurationSeconds delegates to it for the plan-set branch.

Add a regression test for the allow_unbounded retry path: response surfaces
0 (proto default), not the retry request's drifted MaxDurationSeconds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two concurrent Starts with the same (org_id, idempotency_key) can both miss
the pre-read short-circuit; one wins the unique-index insert, the other
trips uq_curtailment_event_idempotency. The loser's request was returning
Internal — defeating the retry-safety guarantee the key is supposed to give
operators.

Add interfaces.ErrCurtailmentIdempotencyKeyConflict as the typed signal.
sqlstore.InsertEventWithTargets matches pgErr Code/ConstraintName against
the partial unique index name and surfaces the sentinel; constraint match
keeps the sweep narrow so a future unique constraint on curtailment_event
isn't silently swallowed. Service.Start, on receiving the sentinel, re-reads
via GetEventByIdempotencyKey and short-circuits to planFromExistingEvent so
the loser sees the winner's persisted shape — same retry contract as the
pre-read hit path.

Add a service-level regression test that drives the race deterministically
through a fakeStore.idempotencyRaceWinner hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe8af068b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
Comment thread server/internal/domain/stores/sqlstores/curtailment.go Outdated
Reduces PR scope creep, fixes the production-blocking FK violation,
gates the operator entrypoint until BE-4 ships Stop, and trims the
reconciler down to what BE-3 must own.

FK fix. Migration 000043 adds curtailment_event.created_by_user_id
(NOT NULL FK -> user.id). The operator's session.Info.UserID is
captured at StartCurtailment, threaded through StartRequest into the
persisted event row, and read back by the reconciler when synthesizing
dispatch context. Without this column the reconciler would dispatch
with UserID=0 and every command_batch_log insert would fail the FK,
burning targets through MaxRetries to RestoreFailed.

Drop scope creep. Removes the idempotency-key lookup + retry-shaped
response that grew during review (~250 LOC across translate.go,
service.go, selector.go, the store interface, and sqlc queries). PR
description had always flagged the lookup query as a TODO; restored
to original scope. Duplicate idempotency_key Start calls now surface
as Internal until BE-5 lands the lookup at the persistence boundary.

Feature gate. StartCurtailment handler is gated behind startEnabled
(default false in cmd/fleetd/main.go) since BE-3 ships dispatch +
reconciler primitives but no Stop / restorer / max_duration_seconds
enforcement (those are BE-4). An event reaching active has no exit
path short of AdminTerminateEvent, so the operator entrypoint stays
dormant in production until BE-4 flips the flag.

Reconciler complexity. isPositivelyCurtailed and isCurtailedByPower
folded into one isCurtailed(..., requirePositiveEvidence bool);
processEvent's explicit terminal-state arm replaced with a default;
dispatchPending's empty-targets defensive transition simplified to
log + return.

Post-review polish. CreatedByUserID assertions added to handler and
service happy-path tests; new RejectsMissingCreatedByUserID
negative-case test; redundant default arms in resolvePriority and
strategyReasonLabel removed; persistedMaxDurationSeconds and the two
count-skips wrappers inlined; longer comments on the MaxRetries guard
in observeActive's drift arm and on the Start-mid-Stop concurrency
edge in Stop() that fleetd's single-lifecycle invariant makes
unreachable in practice.
Pulled long rationale paragraphs out of source comments where the depth
belongs in PR/commit messages. Stripped roadmap version markers (v1, BE-X)
from godocs and field annotations per project convention. Kept the load-
bearing whys: invariants the next reader needs to make a correct edit
(e.g. detached heartbeat ctx, in-memory mirror semantics, panic-recovery
asymmetry, FK-driven UserID flow, NOT-NULL-without-backfill safety claim).

Touches reconciler, service, translate, handler, sqlstores, sqlc queries,
migration 000043, schedule processor, session models, fleetd wiring, and
the StartCurtailmentRequest proto. Generated proto/sqlc outputs follow.

No behavior changes; tests + lint clean.
…lment-start-dispatch-reconciler

# Conflicts:
#	server/generated/sqlc/db.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b684fc99c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go Outdated
…conciler

Resolve conflicts:
- server/cmd/fleetd/main.go: combine curtailment Start gating
  (NewHandler(curtailmentSvc, false)) with the agent->fleetnode rename
  and the new sites/buildings/fleetnode handlers from main.
- server/internal/domain/stores/sqlstores/curtailment.go: keep the new
  reconciler-facing methods (InsertEventWithTargets, ListNonTerminal,
  UpdateEventState/UpdateTargetState, UpsertHeartbeat) and re-add the
  "time" import that the auto-merge dropped when main moved its local
  ptrTo* helpers to helpers.go (helpers now live in helpers.go).
- server/migrations: renumber my 000048_add_curtailment_event_created_by
  to 000050 to clear the duplicate-version collision with main's new
  000048_rename_agent_to_fleetnode and 000049_add_site_id_to_device_set_rack
  (see docs/solutions/database-issues/duplicate-golang-migrate-version-after-merge-2026-05-12.md).
@rongxin-liu rongxin-liu marked this pull request as draft May 13, 2026 13:44
…ion_seconds

The handler's translate layer parsed max_duration_seconds only inside
the !AllowUnbounded branch, so an admin request with both fields set
had the cap silently dropped and the event persisted as unbounded —
the service-level mutual-exclusion check was dead code on the real
call path. Move the parse out of the branch so the check fires; pin
the InvalidArgument path with a new handler test.
Comment thread server/internal/handlers/curtailment/handler.go
Copy link
Copy Markdown
Collaborator

@mcharles-square mcharles-square left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth looking at the claude review feedback. is there a valid concern around concurrent starts?

@rongxin-liu rongxin-liu merged commit c5e7343 into main May 19, 2026
126 of 128 checks passed
@rongxin-liu rongxin-liu deleted the feat/issue-191-curtailment-start-dispatch-reconciler branch May 19, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add curtailment Start RPC, command dispatch, and reconciler loop

3 participants