
feat(curtailment): operator read/update/admin APIs + audit + metrics #299

Open
rongxin-liu wants to merge 45 commits into main from feat/issue-289-curtailment-read-apis-audit-metrics

@rongxin-liu rongxin-liu commented May 21, 2026

Summary

Completes the operator-facing surface and observability scaffolding for v1 curtailment. Builds on the lifecycle and dispatch work already on main (preview + start + dispatch + reconciler in #192, stop + staggered restore + max-duration enforcement in #232).

Operator read / update / admin

  • ListCurtailmentEvents — cursor-paginated history. The decision snapshot is trimmed at the SQL boundary so the response stays bounded on large fleet events: the SQL projection strips the per-device skipped array and computes the skipped_aggregate reason→count map inline, and per-target rows are intentionally omitted (consumers paginate over events here and fetch per-event detail separately). The cursor token carries org_id and state_filter alongside the row id so a cursor cannot cross tenants or state filters; legacy tokens without org_id transparently restart from the first page so a pagination loop crossing the deployment boundary does not surface a confusing error. Migration 000055 adds the supporting (org_id, id DESC) index using CREATE INDEX CONCURRENTLY (with the migrate-tool no-transaction annotation) so the build does not lock the table on high-row-count deploys.
  • UpdateCurtailmentEvent: operator-safe fields only (reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds). The service rejects empty patches, because a request with no patchable field set would still bump updated_at via COALESCE and produce a misleading freshness signal. The same admin gate as Start applies to max_duration_seconds: non-admin callers cannot raise the cap above the org default. Reason carries the same length bound as Start (256 chars). A race between the pre-read and the UPDATE surfaces as a typed FailedPrecondition rather than a silent no-op.
  • AdminTerminateEvent: forces a non-terminal event to CANCELLED or FAILED and sweeps every non-terminal target to RESTORE_FAILED in the same transaction. The validator restricts target_state to those two; COMPLETED is rejected because the RPC fires when restore did not actually run. The Stop-first gate now triggers on any in-flight target (DISPATCHING, DISPATCHED, CONFIRMED, or DRIFTED), not only on ACTIVE events, so a pending event whose reconciler tick already issued curtail commands cannot be sliced out from under those commands without compensating Uncurtails. An idempotent re-issue against the same target state echoes the row without re-running the transition or sweep, and suppresses audit emission so audit consumers tracking operator action history do not see a phantom action; a re-issue with a different terminal state surfaces FailedPrecondition with a distinct message. The reason field carries a 256-char cap so a bulky operator string cannot amplify across thousands of target rows in the sweep.
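
The cursor binding described for ListCurtailmentEvents can be sketched roughly as follows. This is a minimal illustration, not the code in curtailment_cursor.go: the `cursor` struct, its JSON field names, the `int64` org id, and the single `errBadCursor` sentinel are all assumptions made for the example. It shows a base64+JSON token that is rejected when the org or state filter does not match, and a legacy token (no org_id) that restarts from the first page.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// cursor is a hypothetical token payload; the real field names may differ.
type cursor struct {
	OrgID       int64  `json:"org_id,omitempty"`
	StateFilter string `json:"state_filter,omitempty"`
	LastID      int64  `json:"last_id"`
}

// errBadCursor is an illustrative sentinel for every rejected token.
var errBadCursor = errors.New("invalid page token")

// encodeCursor serializes the cursor as JSON and wraps it in URL-safe base64.
func encodeCursor(c cursor) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// decodeCursor returns the row id to resume after; 0 means "first page".
func decodeCursor(token string, orgID int64, stateFilter string) (int64, error) {
	if token == "" {
		return 0, nil
	}
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return 0, errBadCursor
	}
	var c cursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, errBadCursor
	}
	if c.OrgID == 0 {
		// Legacy token minted before org_id was embedded: restart quietly.
		return 0, nil
	}
	if c.OrgID != orgID || c.StateFilter != stateFilter || c.LastID <= 0 {
		return 0, errBadCursor
	}
	return c.LastID, nil
}

func main() {
	tok, _ := encodeCursor(cursor{OrgID: 7, StateFilter: "active", LastID: 42})
	id, err := decodeCursor(tok, 7, "active")
	fmt.Println(id, err)
	_, err = decodeCursor(tok, 8, "active") // same token, different tenant
	fmt.Println(err)
}
```

Binding the tenant and filter into the token means the store never has to trust the caller to replay a cursor under the same query it was minted for.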

Webhook ingestion idempotency

Pre-insert lookup at the persistence boundary on (org_id, idempotency_key) first, then (org_id, external_source, external_reference). A redelivery returns the original event without re-running selection — including its persisted target list and state — so retry callers do not see a synthesized PENDING response for a terminated event. The race-loser path (two concurrent first-time Starts past the lookup) falls into the same replay branch as a deliberate retry rather than surfacing Internal with the Postgres constraint name leaked in the error string.
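
The lookup precedence above might look something like this sketch. The `memStore`, `startRequest`, and `event` types and the string index helpers are hypothetical stand-ins for the real persistence layer, and the rule that both external fields must be set before the second lookup runs is an assumption about partial-field handling. Lookup-error propagation is left out because the in-memory maps cannot fail.

```go
package main

import "fmt"

// event and startRequest are hypothetical shapes for illustration only.
type event struct {
	ID    int64
	State string
}

type startRequest struct {
	OrgID             int64
	IdempotencyKey    string
	ExternalSource    string
	ExternalReference string
}

// memStore is an in-memory stand-in for the two replay lookups.
type memStore struct {
	byKey map[string]event // org_id + idempotency_key
	byRef map[string]event // org_id + external_source + external_reference
}

func keyIndex(orgID int64, key string) string {
	return fmt.Sprintf("%d|%s", orgID, key)
}

func refIndex(orgID int64, source, ref string) string {
	return fmt.Sprintf("%d|%s|%s", orgID, source, ref)
}

// findReplay returns the originally persisted event for a redelivered Start.
// The idempotency key is consulted first; the external pair is the fallback.
func (s *memStore) findReplay(req startRequest) (event, bool) {
	if req.IdempotencyKey != "" {
		if ev, ok := s.byKey[keyIndex(req.OrgID, req.IdempotencyKey)]; ok {
			return ev, true
		}
	}
	// Assumption: a partial external pair is treated as "no external reference".
	if req.ExternalSource != "" && req.ExternalReference != "" {
		if ev, ok := s.byRef[refIndex(req.OrgID, req.ExternalSource, req.ExternalReference)]; ok {
			return ev, true
		}
	}
	return event{}, false
}

func main() {
	s := &memStore{
		byKey: map[string]event{keyIndex(1, "retry-1"): {ID: 10, State: "cancelled"}},
		byRef: map[string]event{refIndex(1, "grid-op", "evt-9"): {ID: 11, State: "active"}},
	}
	ev, ok := s.findReplay(startRequest{
		OrgID: 1, IdempotencyKey: "retry-1",
		ExternalSource: "grid-op", ExternalReference: "evt-9",
	})
	// The key wins, and the persisted (terminated) state is returned as-is.
	fmt.Println(ev.ID, ev.State, ok)
}
```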

Audit trail

Every successful Start emits a curtailment_started activity row. When allow_unbounded or force_include_maintenance is set, a typed row (curtailment_unbounded_start / curtailment_force_include_maintenance) emits alongside the base — two rows rather than one with a flag, so a feed of override-class starts is a simple event-type filter rather than a metadata scan. The audit metadata key matches the proto field name (force_include_maintenance, not the abbreviated force_include). IncMaintenanceOverride fires in parallel so the override rate surfaces on the platform metrics dashboard without joining against activity_log. AdminTerminateEvent emits its own activity row capturing actor + reason, but suppresses emission on idempotent replays — a duplicate curtailment_admin_terminated row for a no-op echo would mislead consumers. The audit ActorType reflects source_actor_type (scheduler / user / api_key) rather than defaulting to user.
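
As a rough illustration of the "two rows rather than one with a flag" choice, the sketch below derives the activity event types for a Start. The three event-type strings come from the description above; the `startFlags` struct and the `auditEventTypes` helper are hypothetical and do not mirror emitStartAuditTrail.

```go
package main

import "fmt"

const (
	evStarted      = "curtailment_started"
	evUnbounded    = "curtailment_unbounded_start"
	evForceInclude = "curtailment_force_include_maintenance"
)

// startFlags is a hypothetical summary of the request fields that matter here.
type startFlags struct {
	AllowUnbounded          bool
	ForceIncludeMaintenance bool
	Replay                  bool // true when Start resolved to an idempotent replay
}

// auditEventTypes lists the activity rows a Start would emit, in order.
// Replays emit nothing because the original Start already recorded the trail.
func auditEventTypes(f startFlags) []string {
	if f.Replay {
		return nil
	}
	types := []string{evStarted}
	if f.AllowUnbounded {
		types = append(types, evUnbounded)
	}
	if f.ForceIncludeMaintenance {
		types = append(types, evForceInclude)
	}
	return types
}

func main() {
	fmt.Println(auditEventTypes(startFlags{AllowUnbounded: true}))
	fmt.Println(len(auditEventTypes(startFlags{Replay: true})))
}
```

With this shape a consumer that wants only override-class starts filters on the event type and never has to inspect row metadata.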

Reconciler state-guard

Every reconciler dispatch (Curtail on pending targets, Uncurtail on restoring batches) re-reads the event immediately before the command issues so a tick that read its event list before a concurrent AdminTerminateEvent does not dispatch commands against a now-terminated event. The check is hoisted to the per-event level so a 100-target event pays one DB read per tick, not 100. Targets are stamped DISPATCHING before cmd.Curtail / cmd.Uncurtail so the row is visible to a concurrent terminate's in-flight gate during the command window; a tick interrupted between the pre-write and the post-command transition leaves a DISPATCHING orphan, and the next tick redispatches it via the normal pending-target loop (Curtail / Uncurtail are device-idempotent). UpdateCurtailmentEventState and UpdateCurtailmentTargetState are both :execrows and surface zero-rows-affected as typed sentinels (ErrCurtailmentEventStateRaceLoss / ErrCurtailmentUpdateTargetStateRaceLoss); the reconciler logs the signal and increments a dedicated counter rather than silently treating the race-loss as a successful transition.
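
The dispatch ordering described above can be sketched as below. Everything here is an assumption made for illustration: the in-memory `memStore`, the lowercase state strings, the `dispatchCurtail` helper, and the choice of `dispatched` as the post-command state. It shows one event-level liveness check per tick, the DISPATCHING stamp written before the command, orphan redispatch, and zero-rows-affected surfacing as a counted race-loss rather than a silent success.

```go
package main

import (
	"errors"
	"fmt"
)

// errTargetStateRaceLoss stands in for the typed zero-rows-affected sentinel.
var errTargetStateRaceLoss = errors.New("target state update affected zero rows")

// memStore is a hypothetical in-memory substitute for the SQL store.
type memStore struct {
	eventState  string
	targetState map[string]string
}

// updateTargetState mimics a guarded UPDATE: it only applies when the row is
// currently in one of the expected "from" states.
func (s *memStore) updateTargetState(id, to string, from ...string) error {
	for _, f := range from {
		if s.targetState[id] == f {
			s.targetState[id] = to
			return nil
		}
	}
	return errTargetStateRaceLoss
}

func isTerminal(eventState string) bool {
	switch eventState {
	case "completed", "cancelled", "failed":
		return true
	}
	return false
}

// dispatchCurtail re-checks the event once, then stamps each target
// DISPATCHING before issuing the (device-idempotent) command.
func dispatchCurtail(s *memStore, targets []string, curtail func(id string) error) (dispatched, raceLosses int) {
	if isTerminal(s.eventState) {
		return 0, 0 // event was terminated after the tick listed it
	}
	for _, id := range targets {
		// Orphans left in "dispatching" by an interrupted tick are retried too.
		if err := s.updateTargetState(id, "dispatching", "pending", "dispatching"); err != nil {
			raceLosses++
			continue
		}
		if err := curtail(id); err != nil {
			continue // row stays "dispatching"; a later tick redispatches it
		}
		if err := s.updateTargetState(id, "dispatched", "dispatching"); err != nil {
			raceLosses++
			continue
		}
		dispatched++
	}
	return dispatched, raceLosses
}

func main() {
	s := &memStore{eventState: "active", targetState: map[string]string{"m1": "pending", "m2": "dispatching"}}
	d, r := dispatchCurtail(s, []string{"m1", "m2"}, func(string) error { return nil })
	fmt.Println(d, r, s.targetState["m1"], s.targetState["m2"])
}
```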

Metrics interface

A reconciler.Metrics interface inside the curtailment domain with tick-duration, tick-failure, candidate-exclusion (labeled by reason), maintenance-override, and event-state-race-loss recorders. The default is a no-op; the concrete implementation wires at cmd/fleetd/main.go once the platform observability path lands. Interface shape is stable enough that the swap is a one-file change with no curtailment-package churn.
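
A minimal sketch of what such an interface and its no-op default could look like is shown below. ObserveTickDuration, IncTickFailure, IncCandidateExcluded, and IncMaintenanceOverride are recorder names used in this PR; their signatures, the `IncEventStateRaceLoss` name, the `WithMetrics` option, the store-less `NewService`, and the `countingMetrics` fake are assumptions for the example. The fake is single-threaded, unlike the goroutine-safe one in the test files.

```go
package main

import (
	"fmt"
	"time"
)

// Metrics is an illustrative recorder interface; signatures are assumed.
type Metrics interface {
	ObserveTickDuration(d time.Duration)
	IncTickFailure()
	IncCandidateExcluded(reason string)
	IncMaintenanceOverride()
	IncEventStateRaceLoss() // hypothetical name for the race-loss recorder
}

// NoOpMetrics discards every observation.
type NoOpMetrics struct{}

func (NoOpMetrics) ObserveTickDuration(time.Duration) {}
func (NoOpMetrics) IncTickFailure()                   {}
func (NoOpMetrics) IncCandidateExcluded(string)       {}
func (NoOpMetrics) IncMaintenanceOverride()           {}
func (NoOpMetrics) IncEventStateRaceLoss()            {}

// Service is trimmed to the one field this sketch needs.
type Service struct{ metrics Metrics }

// Option is a functional option, so existing constructor calls keep compiling.
type Option func(*Service)

// WithMetrics swaps the recorder; a nil argument keeps the no-op default.
func WithMetrics(m Metrics) Option {
	return func(s *Service) {
		if m != nil {
			s.metrics = m
		}
	}
}

func NewService(opts ...Option) *Service {
	s := &Service{metrics: NoOpMetrics{}}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// recordSkipped increments the exclusion counter once per skipped device.
func (s *Service) recordSkipped(reasons []string) {
	for _, r := range reasons {
		s.metrics.IncCandidateExcluded(r)
	}
}

// countingMetrics is a test fake that overrides a single recorder.
type countingMetrics struct {
	NoOpMetrics
	excluded map[string]int
}

func (c *countingMetrics) IncCandidateExcluded(reason string) { c.excluded[reason]++ }

func main() {
	start := time.Now()
	m := &countingMetrics{excluded: map[string]int{}}
	svc := NewService(WithMetrics(m))
	svc.recordSkipped([]string{"maintenance", "offline", "maintenance"})
	svc.metrics.ObserveTickDuration(time.Since(start))
	fmt.Println(m.excluded["maintenance"], m.excluded["offline"])
}
```

Because callers only ever see the interface, replacing the no-op with a real exporter touches the wiring site and nothing inside the domain package.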

Heartbeat staleness runbook

The 5-minute staleness signal is canonically a SQL check against the curtailment_reconciler_heartbeat row, not an application metric — the runbook documents the SQL form and walks four failure modes (panic loop, slow-query contention, events not picked up, restore loop). Operator response steps lean on AdminTerminateEvent for the cases where infrastructure mitigation isn't enough; the runbook calls out the Stop-first requirement so an operator does not hit FailedPrecondition trying to terminate an active event directly.

Proto contract evolution

  • AdminTerminateEventRequest.idempotency_key (field 4) is removed for v1; tag and field name are reserved so the slot cannot accidentally be reused. AdminTerminate idempotency is state-based — a re-issue against the same target state echoes the row.
  • AdminTerminateEventRequest.reason now carries min_len = 1, max_len = 256; the service mirrors the cap as defense in depth.
  • ListCurtailmentEventsRequest.page_token carries max_len = 1024 so the base64+JSON decode path is bounded.
  • CurtailmentTargetState adds DISPATCHING (enum value 8) to model the transient stamp between the pre-command write and the post-command state — the value is part of the read-back contract for any consumer paging over targets.
  • AdminTerminateEvent and ListCurtailmentEvents RPC doc comments enumerate the FailedPrecondition variants and the trimmed response-shape contract respectively, matching the convention StopCurtailment already established.

Follow-up

A few items are intentionally outside this PR's scope, captured for the next iteration:

  • AdminTerminate lacks a force flag for cases where Stop also fails (DB outage, target adapter unreachable). Adding the escape hatch is a contract decision that pairs with a runbook entry describing the abandonment semantics.
  • The per-event liveness check covers dispatch paths; the confirmation and drift-detection paths bail on the typed target-state race-loss sentinel but the event-state promotion path still relies on the SQL-level EXISTS guard alone. A unified write-path bail on the typed sentinel is a follow-up.
  • A scoped API key (least-privilege replacement for the current admin-API-key blanket allow) lands separately; allow_unbounded and the operator override fields are intentionally API-key-reachable until that surface exists.

Test plan

Service-layer unit tests cover ListCurtailmentEvents, UpdateCurtailmentEvent, and AdminTerminateEvent end to end — happy path, state-machine guards, admin gating, empty-patch rejection, race-loss handling, the broadened in-flight-targets requirement, and the audit-suppression / audit-emission split across idempotent-replay and real-transition arms.

Idempotency-replay tests cover both Start channels (key, external-source/reference) including precedence ordering, partial-fields handling, lookup error propagation, persisted-payload return on replay, and the unique-violation race-loser path through the constraint-name sentinel routing.

Audit-emission tests pin the base row + override-specific rows under expected conditions, that the source actor type maps correctly, and that AdminTerminate suppresses emission on idempotent echoes. The lifecycle test pins Preview → Start → Stop → AdminTerminate persistence + emission.

Reconciler tests cover the per-event state-guard skip path on Curtail and restore dispatch, the DISPATCHING pre-command stamp visible during the command window, orphan recovery on the next tick for interrupted curtail and restore dispatches, and the typed race-loss signals on event-state and target-state updates.

Handler-level tests for each new RPC cover session resolution, role gates, malformed UUID rejection, proto/service translation, and the FailedPrecondition error-code mapping for both AdminTerminate variants. Cursor codec tests cover round-trip, malformed-base64 rejection, missing-org-id legacy restart, state-filter mismatch rejection, and non-positive id rejection.

A docker-driven HTTP-level E2E in server/e2e/ exercises the lifecycle path against a real Postgres + reconciler tick loop, including the reconciler-restart recovery path that re-picks up a DISPATCHING orphan after a process restart.

go build ./... clean; curtailment domain + handler + cursor test suites green; lint clean on the changed scope (pre-existing repo-wide lint debt unrelated to this branch).

Closes #289

Curtailment needs operational metrics — tick duration, tick failures,
selector candidate exclusions, maintenance overrides — but the codebase
has only OTel tracing today (no Meter, no /metrics, no Prometheus
exporter). The pipeline-shape decision is platform-team scope and
already in flight via the notifications + Grafana migration; curtailment
shouldn't make that decision unilaterally and shouldn't block on it.

Define a Metrics interface in the curtailment domain with a no-op
default. Service and Reconciler accept it through a functional option
so the dozens of existing test call sites (NewService(store), New(cfg,
store, cmd)) keep working unchanged. main.go wires NoOpMetrics through
both constructors so production has a single named site to swap when
the platform observability path lands — interface-stable, one-file
change, no curtailment-package churn.

Recorder call sites land in follow-up commits.

Refs #289

Three of the four Metrics recorders now have call sites:

- ObserveTickDuration fires from safeTick around runTick, capturing
  wall-clock per tick on every path (happy, panic-recovered,
  list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
  processEvent on per-event panic. The list-events early-return path
  is intentionally NOT counted because the heartbeat still advances
  there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
  selector returns, once per skipped device labeled by reason. Start-
  only emission keeps debounced Preview calls from flooding the counter
  against a static fleet snapshot.

IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.

Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.

Refs #289

Operator-facing event history was previously Unimplemented and the
settings-page history table (PR #280) was reading fixtures. This wires
the RPC through every layer with a trimmed decision-snapshot policy
that keeps response sizes bounded on large fleets.

- sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with
  an optional state filter. Caller passes limit+1 so the over-fetch
  detects whether another page remains.
- Store: opaque cursor (base64-encoded JSON) so the token shape is
  free to grow later (sort fields, secondary keys) without breaking
  older clients. PageSize <=0 maps to a 50-row default; an internal
  upper cap of 200 mirrors the proto validator as defense in depth.
- Service: ListEvents validates org and rejects negative page_size,
  then forwards to the store. Service-layer guard is needed because
  cross-tenant exposure is one query away.
- Handler: replaces the Unimplemented stub. Session-based org-id
  resolution, proto enum → service-layer state-filter mapping.
- Translate: list-view event proto omits per-target rows (heavy on
  10K-miner events × N pages) and trims the per-device `skipped`
  array to `skipped_aggregate` reason-count map. Top-K selected and
  the summary fields stay intact so dashboards can render exclusion
  trend lines.

Test fakes in three packages gain ListEvents stubs; the
curtailment-package fakeStore gains a working pagination impl mirroring
the SQL semantics so service-level tests can assert cursor round-trips.
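
The limit+1 over-fetch and the page-size defaults can be sketched as follows. The slice-backed `listPage` helper is a hypothetical substitute for the sqlc query, and clamping oversized requests to 200 (rather than rejecting them) is an assumption about how the internal cap behaves.

```go
package main

import "fmt"

const (
	defaultPageSize = 50
	maxPageSize     = 200
)

// effectivePageSize applies the default and the internal cap.
func effectivePageSize(requested int) int {
	if requested <= 0 {
		return defaultPageSize
	}
	if requested > maxPageSize {
		return maxPageSize // assumption: clamp rather than reject
	}
	return requested
}

// listPage emulates "WHERE id < afterID ORDER BY id DESC LIMIT limit+1".
// idsDesc must already be sorted descending; afterID 0 means first page.
// nextAfter is 0 when no further page remains.
func listPage(idsDesc []int64, afterID int64, pageSize int) (page []int64, nextAfter int64) {
	limit := effectivePageSize(pageSize)
	var fetched []int64
	for _, id := range idsDesc {
		if afterID > 0 && id >= afterID {
			continue
		}
		fetched = append(fetched, id)
		if len(fetched) == limit+1 {
			break
		}
	}
	if len(fetched) > limit {
		// The extra row proves another page exists; it is not returned.
		page = fetched[:limit]
		return page, page[len(page)-1]
	}
	return fetched, 0
}

func main() {
	ids := []int64{5, 4, 3, 2, 1}
	after := int64(0)
	for {
		page, next := listPage(ids, after, 2)
		fmt.Println(page, next)
		if next == 0 {
			break
		}
		after = next
	}
}
```

Fetching one row past the limit avoids a separate COUNT query just to decide whether to hand back a next-page token.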

Refs #289

Operator-safe partial update of a non-terminal event. Replaces the
Unimplemented stub on the handler.

State policy: pending and active accept the patch; restoring and
terminal states reject with FailedPrecondition. Operators who need to
intervene mid-restore go through AdminTerminateEvent — that's the
recovery surface, not Update. The conservative policy keeps the
recompute-vs-freeze question (Open #13) out of v1: Update of
restore_batch_size persists the new value but does NOT recompute
effective_batch_size. The reconciler's restore-claim keeps reading the
value stamped at Start time until the next event.

Validation mirrors Start: restore_batch_interval_sec is gated by the
non-admin cap (admin sets the session-derived bypass), max_duration
must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured
values surface as InvalidArgument or Forbidden — never as a DB CHECK
violation.

sqlc UPDATE uses COALESCE on nil params so a partial patch preserves
unset columns. The WHERE clause re-asserts state IN ('pending',
'active') as defense in depth: a race where the row advanced between
the service's pre-read and the UPDATE surfaces as
ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct
message from the pre-read rejection.
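
The patch semantics above might be sketched like this. The `patch` and `eventRow` structs, the error values, and the `applyPatch` helper are hypothetical; nil pointer fields play the role of the NULL parameters that COALESCE skips. restore_batch_interval_sec and the admin gate are omitted to keep the example short, and the reason length is counted in bytes as a simplification.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errEmptyPatch    = errors.New("invalid argument: no patchable field set")
	errInvalidField  = errors.New("invalid argument: field out of range")
	errStateRejected = errors.New("failed precondition: event state does not accept updates")
)

const (
	maxReasonLen       = 256
	maxDurationCeiling = 7 * 24 * 60 * 60 // 7 days, in seconds
)

// eventRow and patch are hypothetical shapes; nil in patch means "unchanged".
type eventRow struct {
	State              string
	Reason             string
	RestoreBatchSize   int32
	MaxDurationSeconds int64
}

type patch struct {
	Reason             *string
	RestoreBatchSize   *int32
	MaxDurationSeconds *int64
}

func validatePatch(p patch) error {
	if p.Reason == nil && p.RestoreBatchSize == nil && p.MaxDurationSeconds == nil {
		return errEmptyPatch // would otherwise bump updated_at for nothing
	}
	if p.Reason != nil && (*p.Reason == "" || len(*p.Reason) > maxReasonLen) {
		return errInvalidField
	}
	if p.RestoreBatchSize != nil && *p.RestoreBatchSize < 0 {
		return errInvalidField
	}
	if p.MaxDurationSeconds != nil && (*p.MaxDurationSeconds <= 0 || *p.MaxDurationSeconds > maxDurationCeiling) {
		return errInvalidField
	}
	return nil
}

// applyPatch validates, re-asserts the state guard, then overwrites only the
// fields that were set, mirroring COALESCE(param, column).
func applyPatch(row eventRow, p patch) (eventRow, error) {
	if err := validatePatch(p); err != nil {
		return row, err
	}
	if row.State != "pending" && row.State != "active" {
		return row, errStateRejected
	}
	if p.Reason != nil {
		row.Reason = *p.Reason
	}
	if p.RestoreBatchSize != nil {
		row.RestoreBatchSize = *p.RestoreBatchSize
	}
	if p.MaxDurationSeconds != nil {
		row.MaxDurationSeconds = *p.MaxDurationSeconds
	}
	return row, nil
}

func main() {
	row := eventRow{State: "active", Reason: "grid request", RestoreBatchSize: 10, MaxDurationSeconds: 3600}
	size := int32(25)
	updated, err := applyPatch(row, patch{RestoreBatchSize: &size})
	fmt.Println(updated.RestoreBatchSize, updated.Reason, err)
}
```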

Refs #289

Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.

The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.
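
The transition, echo, and conflict arms can be summarized in a small decision function. This is only a sketch under assumptions: `decideAdminTerminate`, the `outcome` type, the lowercase state strings, and the three error values are invented for the example, and the Stop-first gate is reduced to a count of in-flight targets. The real path makes these checks inside the store transaction.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errInvalidTargetState = errors.New("invalid argument: target_state must be cancelled or failed")
	errStopFirst          = errors.New("failed precondition: in-flight targets, issue Stop first")
	errStateConflict      = errors.New("failed precondition: event already terminal in a different state")
)

// outcome names what the caller should do next.
type outcome string

const (
	outcomeNone       outcome = ""
	outcomeTransition outcome = "transition" // run the terminal update + target sweep
	outcomeEcho       outcome = "echo"       // return the row, skip sweep and audit
)

func isTerminal(state string) bool {
	switch state {
	case "completed", "cancelled", "failed":
		return true
	}
	return false
}

// decideAdminTerminate classifies a request against the current event state.
func decideAdminTerminate(current, target string, inFlightTargets int) (outcome, error) {
	if target != "cancelled" && target != "failed" {
		return outcomeNone, errInvalidTargetState
	}
	if isTerminal(current) {
		if current == target {
			return outcomeEcho, nil // idempotent re-issue
		}
		return outcomeNone, errStateConflict
	}
	if inFlightTargets > 0 {
		return outcomeNone, errStopFirst
	}
	return outcomeTransition, nil
}

func main() {
	fmt.Println(decideAdminTerminate("pending", "cancelled", 0))
	fmt.Println(decideAdminTerminate("cancelled", "cancelled", 0))
	fmt.Println(decideAdminTerminate("cancelled", "failed", 0))
}
```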

Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.

Adds a pre-insert lookup so a re-issued Start with the same
idempotency_key or (external_source, external_reference) pair returns
the original event instead of re-running the selector and tripping
the partial unique indexes (which would surface as a less helpful
AlreadyExists from the per-org non-terminal constraint).

idempotency_key takes precedence over external reference: the
operator-supplied retry handle wins over upstream re-delivery.
Lookup errors propagate unchanged so a transient db failure is
visible rather than silently falling through to a double-insert
attempt.

Adds an AuditLogger interface on the curtailment Service with a no-op
default so tests that don't care can ignore the wiring. main.go injects
*activity.Service via WithAuditLogger. Two override-specific event types
ride alongside the base curtailment_started row so a feed of unbounded
or force-include starts is a simple event-type filter rather than a
metadata scan.

IncMaintenanceOverride fires in parallel when force_include_maintenance
is set, so the platform metrics dashboard tracks the override rate
without joining against activity_log.

Audit emission is intentionally absent on the idempotent-replay and
insufficient-load paths: the original Start already recorded the trail,
and a path that never persisted shouldn't claim it did.

Documents the curtailment_reconciler_heartbeat-based liveness signal:
warn at 2 minutes of staleness with active events present, page at 5
minutes regardless. The SQL form is canonical; the vmalert rule is
parked behind a placeholder bridge metric so the wiring is one config
change away once a postgres-exporter publishes the staleness gauge.
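
The two thresholds above amount to a small classification rule, sketched here in Go rather than in the runbook's canonical SQL. The `classifyHeartbeat` function, the `severity` type, and the inclusive (>=) comparisons are assumptions; only the 2-minute and 5-minute thresholds and the active-events condition come from the runbook description.

```go
package main

import (
	"fmt"
	"time"
)

type severity string

const (
	sevOK   severity = "ok"
	sevWarn severity = "warn"
	sevPage severity = "page"
)

const (
	warnAfter = 2 * time.Minute
	pageAfter = 5 * time.Minute
)

// classifyHeartbeat maps heartbeat staleness to an alert level.
// Paging ignores activeEvents; warning requires at least one active event.
func classifyHeartbeat(stale time.Duration, activeEvents bool) severity {
	switch {
	case stale >= pageAfter:
		return sevPage
	case stale >= warnAfter && activeEvents:
		return sevWarn
	default:
		return sevOK
	}
}

func main() {
	lastBeat := time.Now().Add(-3 * time.Minute)
	fmt.Println(classifyHeartbeat(time.Since(lastBeat), true))
	fmt.Println(classifyHeartbeat(time.Since(lastBeat), false))
}
```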

Runbook walks four failure modes (panic loop, slow-query contention,
events not picked up, restore-loop forever) with operator response
steps that lean on AdminTerminateEvent for the cases where infra
mitigation isn't enough.

Walks the operator-facing service flow end-to-end against the
in-memory fake: Preview (no persistence side-effects) → Start
(persistence + audit + metrics) → Stop (RESTORING transition) →
AdminTerminate (forced terminal). The reconciler's tick-by-tick
state machine is covered piecewise in reconciler_test.go and
restore_test.go; this test pins the boundary between the public
service API and the persistence layer.

Companion tests cover the webhook idempotency-replay path
(duplicate Start short-circuits, no double-audit) and the
read-path query (ListEvents returns terminal rows filtered by
state).

A docker-driven HTTP-level e2e for the same lifecycle is a
follow-up — the existing server/e2e dir requires postgres +
proto-sim and lands when the curtailment plugin path is ready.

Four real lint findings from this branch, fixed without suppressions:

- Service.AdminTerminate: replace the two-case switch + default with
  an if-comparison so exhaustive doesn't demand the unhandled cases
  be enumerated. The default branch was load-bearing — the if form
  keeps the same behavior with less surface.

- service_list_test.go / handler_list_test.go: hoist the opaque
  cursor literal into a file-scope const. gosec G101 looks at string
  literals assigned to fields whose name matches credential keywords
  (PageToken matches "token"); an identifier reference clears the
  heuristic cleanly.

- service_start_idempotency_test.go: move the subtest store + svc
  creation inside the t.Run closure so each subtest can call
  t.Parallel() without sharing mutable counters across cases.

A multi-reviewer code review surfaced four merge-blocking P1s and a
batch of P2/P3 hygiene items on this branch. This commit lands the
focused, defensible subset that doesn't require contract or
security-policy decisions; the remaining items are recorded for the
next session.

Idempotency / race-recovery:
- Recognize uq_curtailment_event_idempotency and
  uq_curtailment_event_external_ref unique violations as race-loss
  via typed sentinels; Service.Start re-issues the corresponding
  replay lookup so the race-loser falls into the same response path
  as a deliberate retry rather than surfacing Internal with the
  constraint name leaked in the error string.
- AdminTerminateEvent: on zero-row UPDATE caused by a concurrent
  terminate-to-same-state, re-read inside the transaction and echo
  the row idempotently (mirrors BeginRestoreTransition's pattern).
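
One way to picture the constraint-name routing is the sketch below. The two constraint names are the ones listed above; the `uniqueViolation` type is a hypothetical stand-in for whatever driver error exposes the violated constraint, and the sentinel names are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errIdempotencyRaceLoss = errors.New("start lost the idempotency-key insert race")
	errExternalRefRaceLoss = errors.New("start lost the external-reference insert race")
)

// uniqueViolation is a hypothetical driver error carrying the constraint name.
type uniqueViolation struct{ Constraint string }

func (e *uniqueViolation) Error() string { return "unique violation: " + e.Constraint }

// classifyInsertError converts known unique violations into typed sentinels so
// the caller can re-run the matching replay lookup instead of returning an
// Internal error that leaks the constraint name. Other errors pass through.
func classifyInsertError(err error) error {
	var uv *uniqueViolation
	if !errors.As(err, &uv) {
		return err
	}
	switch uv.Constraint {
	case "uq_curtailment_event_idempotency":
		return errIdempotencyRaceLoss
	case "uq_curtailment_event_external_ref":
		return errExternalRefRaceLoss
	}
	return err
}

func main() {
	raw := fmt.Errorf("insert event: %w", &uniqueViolation{Constraint: "uq_curtailment_event_idempotency"})
	fmt.Println(classifyInsertError(raw) == errIdempotencyRaceLoss)
}
```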

Audit / observability:
- Emit a curtailment_admin_terminated activity row on AdminTerminate
  so the privileged force-terminate path captures actor + reason in
  the activity feed (parallels emitStartAuditTrail).
- emitStartAuditTrail now maps req.SourceActorType to
  activitymodels.ActorType so scheduler-triggered starts persist
  actor_type='scheduler' on activity_log instead of defaulting to
  'user'.

Update path hardening:
- Reject explicit empty-string Reason as InvalidArgument and add a
  256-char length cap mirroring Start. The proto-translate comment
  is updated to describe the actual silent-no-op behavior.

Proto contract docs:
- Field-level docstrings on ListCurtailmentEvents describing the
  omitted target_rollup/targets and the trimmed decision_snapshot
  shape (skipped_aggregate vs raw skipped).
- max_len=1024 on ListCurtailmentEventsRequest.page_token so the
  cursor decode path is bounded.
- Annotate the two eventStateFromProto call sites distinguishing
  the no-filter sentinel role from the target_state mapping role.

Cleanup:
- Drop duplicate finitePtr generic in handler_start_test.go (the
  existing ptr generic in handler_test.go covers the use case).
- Inline single-call-site valueOrZero generic in service.go.

@rongxin-liu rongxin-liu requested a review from a team as a code owner May 21, 2026 23:13
Copilot AI review requested due to automatic review settings May 21, 2026 23:13
@github-actions github-actions Bot added the documentation, automation, server, and shared labels May 21, 2026

Copilot AI left a comment

Pull request overview

This PR completes the operator-facing curtailment management surface (list/update/admin terminate) and adds observability scaffolding (audit events, metrics interfaces, heartbeat runbook/alert template) to support v1 curtailment operations end-to-end in the server.

Changes:

  • Add operator RPCs for listing historical curtailment events (cursor pagination) and updating operator-safe fields, plus an admin RPC to force-terminate an event and sweep targets.
  • Add webhook-style Start idempotency lookups (idempotency key + external source/reference) with race-loser handling, plus audit + metrics interfaces wired through the service and reconciler.
  • Add reconciler heartbeat runbook + placeholder vmalert rules for stalled reconciler/tick failures.

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| server/sqlc/queries/curtailment.sql | Adds SQLC queries for idempotency lookups, operator-field update, admin terminate + target sweep, and org event history listing. |
| server/internal/handlers/curtailment/translate.go | Adds request/response translators for AdminTerminate/Update/List; trims decision snapshot for list view; adds proto↔model event-state mapping helper. |
| server/internal/handlers/curtailment/handler.go | Implements UpdateCurtailmentEvent, ListCurtailmentEvents, and AdminTerminateEvent handlers with session/admin gating. |
| server/internal/handlers/curtailment/handler_update_test.go | Handler tests for UpdateCurtailmentEvent auth, validation, and admin gating behavior. |
| server/internal/handlers/curtailment/handler_stop_test.go | Updates Stop handler test stub to satisfy expanded store interface. |
| server/internal/handlers/curtailment/handler_start_test.go | Updates Start handler test stub for new store methods and adjusts optional pointer helper usage. |
| server/internal/handlers/curtailment/handler_list_test.go | Handler tests for ListCurtailmentEvents pagination/filtering and decision-snapshot trimming behavior. |
| server/internal/handlers/curtailment/handler_admin_terminate_test.go | Handler tests for AdminTerminateEvent admin gating, UUID validation, and state-conflict mapping. |
| server/internal/domain/stores/sqlstores/curtailment.go | Implements SQL store methods for idempotency lookups, ListEvents pagination, operator-field update, and AdminTerminateEvent transaction. |
| server/internal/domain/stores/sqlstores/curtailment_cursor.go | Adds base64+JSON cursor encode/decode helpers for ListEvents pagination. |
| server/internal/domain/stores/interfaces/curtailment.go | Extends CurtailmentStore interface with list/update/admin-terminate/idempotency methods and new typed error sentinels. |
| server/internal/domain/curtailment/service.go | Adds metrics/audit plumbing, Start replay lookups + race handling, ListEvents/Update/AdminTerminate service methods, and audit emission helpers. |
| server/internal/domain/curtailment/service_update_test.go | Unit tests for Update service method validation/state-guard/race-loss behavior. |
| server/internal/domain/curtailment/service_test.go | Expands fake store to support new store methods; adds a metrics recorder test helper. |
| server/internal/domain/curtailment/service_start_test.go | Adds Start metrics test for candidate-exclusion counters. |
| server/internal/domain/curtailment/service_start_idempotency_test.go | Adds Start idempotency replay + precedence + error-path tests. |
| server/internal/domain/curtailment/service_start_audit_test.go | Adds Start audit emission tests (base row + override-specific rows + replay suppression). |
| server/internal/domain/curtailment/service_list_test.go | Adds ListEvents service tests for forwarding/validation and store error propagation. |
| server/internal/domain/curtailment/service_lifecycle_test.go | Adds service-layer end-to-end lifecycle test across Preview→Start→Stop→AdminTerminate and replay/list behavior. |
| server/internal/domain/curtailment/service_admin_terminate_test.go | Adds AdminTerminate service tests for validation and conflict/error mapping. |
| server/internal/domain/curtailment/reconciler/reconciler.go | Adds metrics injection and records tick duration/failure counters on panic paths. |
| server/internal/domain/curtailment/reconciler/reconciler_test.go | Adds reconciler tests asserting tick duration/failure metric emission. |
| server/internal/domain/curtailment/metrics.go | Introduces curtailment.Metrics interface + NoOpMetrics implementation. |
| server/internal/domain/curtailment/audit.go | Introduces curtailment.AuditLogger interface + NoOpAuditLogger and curtailment audit event-type constants. |
| server/generated/sqlc/db.go | Regenerated SQLC prepared-statement wiring for new curtailment queries. |
| server/generated/sqlc/curtailment.sql.go | Regenerated SQLC query implementations/types for new curtailment queries. |
| server/docs/curtailment-reconciler-runbook.md | Adds heartbeat staleness runbook, SQL alert query, and failure-mode triage guidance. |
| server/cmd/fleetd/main.go | Wires NoOpMetrics + audit logger into curtailment Service and passes metrics into reconciler. |
| proto/curtailment/v1/curtailment.proto | Documents list-response trimming and adds page_token max length validation. |
| deployment-files/server/monitoring/vmalert/rules.d/proto-fleet-curtailment.yml | Adds placeholder vmalert rules for stalled reconciler and tick failure rate using bridge metrics. |

Comment thread server/internal/domain/stores/sqlstores/curtailment_cursor.go
Comment thread server/internal/domain/curtailment/service.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62a996a1f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/service.go
@github-actions

github-actions Bot commented May 21, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (cf6ff018f14bbcbaa8e67972132b3f01d4247300...0f166e93b6e898c36578cb7f785a20d84a107130, exact PR three-dot diff)
  • Model: gpt-5.5


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Admin termination can abandon queued restore work

  • Category: Reliability
  • Location: server/sqlc/queries/curtailment.sql:186
  • Description: AdminTerminateEvent only blocks targets in dispatching, dispatched, confirmed, or drifted. After StopCurtailment, already-curtailed miners are reset to desired_state='active' and state='pending' while waiting for Uncurtail. An admin can terminate the restoring event before the next restore batch dispatches; the sweep then marks those pending restore targets restore_failed and the reconciler stops processing them.
  • Impact: Miners that were curtailed can remain off/curtailed indefinitely with no compensating Uncurtail, causing hashrate and revenue loss.
  • Recommendation: Treat restoring targets with desired_state='active' and any non-terminal state, especially pending, as not safe to terminate. Either reject admin termination until restore work is terminal, or make the admin path explicitly issue/confirm restore before marking the event terminal. Add a regression test for StopCurtailment -> AdminTerminateEvent before the next reconciler tick.

[HIGH] Admin termination race can still let curtail commands fire after terminal sweep

  • Category: Concurrency
  • Location: server/internal/domain/stores/sqlstores/curtailment.go:402
  • Description: The in-flight check runs before the terminal update/sweep, but it does not lock the event or target rows. The reconciler can write DISPATCHING and call cmd.Curtail after this check but before or during the admin transaction’s terminal update. The target update’s EXISTS guard also does not serialize with the event update under read committed isolation.
  • Impact: A curtail command can land while admin termination is sweeping the event terminal, again leaving miners curtailed without a restore path.
  • Recommendation: Serialize admin termination and dispatch pre-writes with a shared lock. For example, lock the event row and relevant target rows FOR UPDATE before the in-flight check/sweep, and make reconciler dispatch pre-writes acquire the same event/target lock or block on the terminal transition before any command call.

[MEDIUM] Updating restore batch size is accepted but does not affect restore dispatch

  • Category: Correctness
  • Location: server/internal/domain/curtailment/service.go:311
  • Description: UpdateCurtailmentEvent accepts and persists restore_batch_size, but the reconciler restores using effective_batch_size, which remains stamped from Start. The API can therefore report the updated batch size while actual restore behavior still uses the old effective value.
  • Impact: Operators can believe they changed restore pacing when they did not, leading to incorrect recovery timing and operational decisions.
  • Recommendation: Either reject restore_batch_size updates until recomputation is implemented, or recompute effective_batch_size transactionally for pending/active events and return the actual effective value clearly to clients.

Notes

No SQL injection, command injection, frontend XSS, protobuf wire-format break, pool/wallet hijack, or hardcoded payout-address issue was evident in the reviewed diff. I did not run tests because this was a read-only review pass.


Generated by Codex Security Review | Triggered by: @rongxin-liu | Review workflow run

@github-actions github-actions Bot added the javascript and client labels May 21, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3aad80aa6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/handlers/curtailment/translate.go
Comment thread server/internal/domain/stores/sqlstores/curtailment.go
Codex security review + Copilot inline reviewers surfaced five
actionable findings on this branch. All five validated; landing the
fixes here.

Admin gate on Update.max_duration_seconds (HIGH).
  Mirrors Start's post-normalization admin check inside Service.Update.
  Without this, a non-admin who Started at the org default could Update
  the same event above the default, bypassing the privilege boundary
  Start enforces. Fetches org config lazily — only when max_duration is
  in the patch and the caller lacks admin controls.

AdminTerminate.reason length cap (MEDIUM).
  Service-level backstop rejects oversized reasons (>256 chars) so a
  bulky operator string can't amplify into every swept target's
  last_error column. The proto field gets the matching max_len=256
  rule; proto regen is deferred to a clean tooling pass (the service
  backstop already catches the case today).

List query trims decision_snapshot at the SQL boundary (MEDIUM).
  ListCurtailmentEventsForOrg now projects explicit columns with
  (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip
  list (multi-MB on 10K-miner events) doesn't ride the wire for every
  list row. Field layout matches CurtailmentEvent exactly so the
  existing convertEventRow path applies via a single struct
  conversion.

Cursor rejects non-positive IDs (MEDIUM).
  decodeCurtailmentEventCursor now returns InvalidArgument when the
  decoded id is <= 0. The store never emits a non-positive id; a
  user-supplied token that decodes to one would silently rewind to the
  first page (id=0) or return zero rows (id<0).
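
A minimal sketch of the non-positive-id check described above. The token encoding (base64url-wrapped JSON) and the `eventCursor` / `decodeCursor` names are assumptions made for illustration; the thread does not show the PR's actual cursor format.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// eventCursor is a hypothetical token payload. The real token also
// carries binding fields that are left out of this sketch.
type eventCursor struct {
	ID int64 `json:"id"`
}

// errInvalidCursor stands in for an InvalidArgument status.
var errInvalidCursor = errors.New("invalid page_token")

func encodeCursor(c eventCursor) string {
	raw, _ := json.Marshal(c) // a struct of plain integers cannot fail to marshal
	return base64.RawURLEncoding.EncodeToString(raw)
}

// decodeCursor returns (nil, nil) for an empty token (first page) and an
// error for a malformed token or one whose id is zero, negative, or missing.
func decodeCursor(token string) (*eventCursor, error) {
	if token == "" {
		return nil, nil
	}
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return nil, fmt.Errorf("%w: %v", errInvalidCursor, err)
	}
	var c eventCursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, fmt.Errorf("%w: %v", errInvalidCursor, err)
	}
	if c.ID <= 0 {
		return nil, fmt.Errorf("%w: id must be positive, got %d", errInvalidCursor, c.ID)
	}
	return &c, nil
}

func main() {
	for _, id := range []int64{42, 0, -7} {
		_, err := decodeCursor(encodeCursor(eventCursor{ID: id}))
		fmt.Printf("id=%d err=%v\n", id, err)
	}
}
```

A missing `id` key decodes to the zero value, so it takes the same rejection path as an explicit zero.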

Audit metadata key naming (MEDIUM).
  Renamed `force_include` to `force_include_maintenance` on the
  curtailment_started audit row metadata so the key matches the
  domain/proto field name. Downstream analytics no longer have to map
  between abbreviated and full names.

Test coverage added for each fix: non-admin max_duration rejection,
admin pass-through, oversized reason rejection, cursor non-positive
id rejection (zero / negative / missing).
@rongxin-liu
Contributor Author

All three findings addressed in 35594cd:

  • HIGH — Update.max_duration_seconds admin gate: Service.Update now fetches OrgConfig and applies the same post-normalization gate as Service.Start. Coverage in TestService_Update_{RejectsNonAdmin,AllowsAdmin}MaxDurationAboveOrgDefault.
  • MEDIUM — list query loads full snapshots: ListCurtailmentEventsForOrg projects (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row.
  • MEDIUM — AdminTerminate.reason unbounded: added max_len=256 on the proto field and a service-level length backstop. Coverage in TestService_AdminTerminate_RejectsOversizedReason.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 51e44672d8

Comment thread server/internal/domain/curtailment/service.go
Comment thread server/sqlc/queries/curtailment.sql Outdated
toEventProtoListItem previously zeroed ExternalSource / ExternalReference /
IdempotencyKey inline after toEventProto had set them. Move the scrub
into a named helper so the intent (these fields are list-view-omitted)
is explicit at the call site and a future fourth scrubbed field doesn't
need a comment to explain the pattern.

No behavior change.
A page_token issued before the cross-list-binding org_id field landed
was rejected with InvalidArgument once the binding check went live.
A long-lived pagination loop crossing the deployment boundary should
restart from the first page transparently instead of surfacing an
opaque cursor error to the caller.

Decode now treats OrgID==0 as the "first page" sentinel and returns a
nil cursor (same as an empty token). State_filter mismatch and other
non-zero violations still reject. Test renamed to reflect the new
contract.
validateUpdateRequest now rejects requests where every patchable
optional field (reason, restore_batch_size, restore_batch_interval_sec,
max_duration_seconds) is nil. The SQL UPDATE still ran in that case
and bumped updated_at via COALESCE, producing a misleading freshness
signal for clients that track the column and adding write load with
no semantic change.
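
The empty-patch rule above can be sketched as follows. The `updateRequest` struct and its pointer-typed optional fields are assumptions standing in for the PR's request type; only the four field names come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// updateRequest is a hypothetical stand-in: a nil pointer means
// "field not present in the patch".
type updateRequest struct {
	Reason                  *string
	RestoreBatchSize        *int32
	RestoreBatchIntervalSec *int32
	MaxDurationSeconds      *int32
}

var errEmptyPatch = errors.New("update request must set at least one patchable field")

// validateUpdateRequest rejects a patch in which every optional field is nil,
// so the store never runs an UPDATE that changes nothing but updated_at.
func validateUpdateRequest(r updateRequest) error {
	if r.Reason == nil && r.RestoreBatchSize == nil &&
		r.RestoreBatchIntervalSec == nil && r.MaxDurationSeconds == nil {
		return errEmptyPatch
	}
	return nil
}

func main() {
	reason := "grid event extended"
	fmt.Println(validateUpdateRequest(updateRequest{}))
	fmt.Println(validateUpdateRequest(updateRequest{Reason: &reason}))
}
```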

Affected service + handler tests that previously passed empty
UpdateRequests to reach state-guard / not-found branches now carry a
minimal valid patch (Reason). New TestService_Update_RejectsEmptyPatch
pins the new contract.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 68af78a0bd

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
…lment-read-apis-audit-metrics

# Conflicts:
#	server/generated/sqlc/db.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 34c1ec4f19

Comment thread server/internal/domain/stores/sqlstores/curtailment.go Outdated
A concurrent AdminTerminate landing between target N and target N+1 of
the dispatch loop would let remaining Curtail commands fire against an
already-terminated event because the previous design checked liveness
once per event-tick.

Move the eventStillDispatchable call to the start of dispatchOneCurtail
and drop the redundant pre-loop hoists in dispatchPending and
observeActive. The cost is one GetEventByUUID per target (N reads per
tick for an N-target event), which is acceptable: AdminTerminate is rare,
and the hoist rested on a false premise (the event row CAN change inside
a tick) that let commands fire against a terminated event.

dispatchRestoreBatch is unaffected: its single bulk Uncurtail is one
command per batch, so the existing per-event check already guards it.

Adds TestReconciler_SkipsRemainingCurtailDispatchesWhenEventTerminatesMidLoop
asserting only the first target dispatches when the event flips
between target 1 and target 2.
The previous gate rejected admin-terminate only when event state was
ACTIVE. A PENDING event whose reconciler tick had already dispatched
some Curtail commands had targets in DISPATCHED/CONFIRMED/DRIFTED, and
admin-terminate would proceed — sweeping those targets to RESTORE_FAILED
without issuing the compensating Uncurtail commands, leaving the
already-curtailed miners stuck.

Replace the state==active check with a SQL existence check on
non-restored target states. The new check subsumes ACTIVE (which always
has CONFIRMED targets via maybeMarkActive) and additionally catches
PENDING events with dispatched targets.

The ErrCurtailmentAdminTerminateActiveEvent sentinel keeps its name for
backward compatibility with handler/test references, but its doc and
operator-facing error message now describe the broader condition. Proto
RPC comment and the reconciler runbook's mitigation guidance updated to
match.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: d0b4c00c06

Comment thread server/internal/domain/curtailment/service.go Outdated
RBAC PR 1 landed migration 000052_create_permission_tables on main and
the post-merge commit had two files claiming version 52 — the duplicate
trips golang-migrate's version uniqueness check at server boot.

The list-index migration only ever lived on this feature branch, so
the migration-immutability rule (which gates edits to migrations on
main) does not apply. Renaming to the next free slot (000055, after
the RBAC track's 000052/000053/000054) is the right resolution.

Content unchanged; only the file numbers move.
…NTLY on list index

Two BE-5 merge gates landed together.

force_include_maintenance is safety-critical: it commands curtailment
on miners in active physical maintenance — forcibly power-cycling a
miner a technician is servicing is a personnel hazard. The BE-1.x
design intended Admin-only; the gate was never wired and any API-key
caller could trip it. Wire requireAdminFromContext mirroring the
allow_unbounded pattern; add table coverage in
TestHandler_OverrideFieldsRoleGate.

Migration 000055's index now builds with CONCURRENTLY. The earlier
"annotation needed" framing turned out to be wrong: golang-migrate
v4's postgres driver runs ExecContext directly without wrapping the
migration body in a transaction, so CONCURRENTLY works without any
no-transaction annotation. The fix is cheap, and a non-concurrent build
would be a hard merge gate for high-row-count deploys.
…l / Uncurtail

Closes the AdminTerminate residual race named in Known Limitation #10.
Before this change, a tick that read PENDING targets, called
cmd.Curtail, then lost a foot race to a concurrent AdminTerminate left
miners curtailed with no compensating Uncurtail — the sweep flipped
the just-dispatched-to targets to RESTORE_FAILED while the command
landed against a dead event. Blast radius scaled with event size:
thousands of stranded curtailments on a 5K-miner mid-dispatch event.

The fix is a two-phase write. dispatchOneCurtail and
dispatchRestoreBatch now stamp DISPATCHING on each target before
issuing the command, and transition to DISPATCHED only after the
command returns. The in-flight gate's SQL EXISTS predicate
(CurtailmentEventHasInFlightTargets) counts dispatching rows
alongside dispatched/confirmed/drifted. A concurrent terminate that
races a mid-dispatch tick observes the dispatching row and rejects as
Stop-first, so the command cannot fire against a swept event.
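
The two-phase write can be illustrated with an in-memory stand-in. Everything here (`memStore`, `dispatchOne`, the mutex in place of SQL guards) is a hypothetical simplification; the in-flight set is also reduced to two states, whereas the real predicate covers confirmed and drifted rows as well.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type targetState string

const (
	statePending     targetState = "pending"
	stateDispatching targetState = "dispatching"
	stateDispatched  targetState = "dispatched"
)

var (
	errEventTerminated = errors.New("event already terminal")
	errInFlightTargets = errors.New("in-flight targets: stop the event first")
)

// memStore is a toy replacement for the SQL store.
type memStore struct {
	mu       sync.Mutex
	terminal bool
	targets  map[string]targetState
}

// setTargetState mimics a guarded UPDATE: refused once the event is terminal.
func (s *memStore) setTargetState(id string, st targetState) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.terminal {
		return errEventTerminated
	}
	s.targets[id] = st
	return nil
}

// adminTerminate mimics the in-flight gate: it rejects while any target
// is dispatching or dispatched, otherwise marks the event terminal.
func (s *memStore) adminTerminate() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, st := range s.targets {
		if st == stateDispatching || st == stateDispatched {
			return errInFlightTargets
		}
	}
	s.terminal = true
	return nil
}

// dispatchOne stamps DISPATCHING before the command and DISPATCHED after it.
func dispatchOne(s *memStore, id string, send func(id string) error) error {
	if err := s.setTargetState(id, stateDispatching); err != nil {
		return err // lost the race: the command must not fire
	}
	if err := send(id); err != nil {
		return err // real code would roll the target back here
	}
	return s.setTargetState(id, stateDispatched)
}

func main() {
	s := &memStore{targets: map[string]targetState{"m1": statePending}}
	var raced error
	err := dispatchOne(s, "m1", func(string) error {
		raced = s.adminTerminate() // a terminate arriving mid-command
		return nil
	})
	fmt.Println("dispatch:", err, "| concurrent terminate:", raced)
}
```

Either ordering is safe in this model: a terminate that wins blocks the pre-write, and a terminate that loses sees the dispatching row and is rejected.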

last_dispatched_at intentionally lands on the DISPATCHED write, not
the DISPATCHING pre-write — it records successful enqueue, used by
the restore-batch interval gate. Filter-skipped / empty-batch failures
roll back via recordDispatchFailure without leaving a misleading
timestamp.

New proto value CURTAILMENT_TARGET_STATE_DISPATCHING = 8. New Go
constant TargetStateDispatching ("dispatching"). Regression coverage
in TestReconciler_DispatchingPreWrite_CommitsBeforeCommand via a
curtailHook on the fake dispatcher that inspects store state at the
moment cmd.Curtail is called.
UpdateCurtailmentTargetState was :exec and silently dropped its row count.
When the SQL EXISTS guard fired (parent event terminated mid-tick), the
write produced zero rows but the reconciler had no signal — it advanced
the in-memory mirror as if the write succeeded, leaving the on-disk state
and the reconciler's per-tick view of targets out of sync.

Switch the query to :execrows. The store wrapper now returns
ErrCurtailmentEventStateRaceLoss on zero rows matched, matching the
existing UpdateEventState contract from earlier BE-5 work. A new
writeTargetState helper wraps every reconciler write site (nine in total
across dispatch/confirm/drift/observe/restore paths), routing the
sentinel through the same observability bucket as logEventStateUpdateError
(IncEventStateRaceLoss + slog.Warn). Callers gate the mirror update on
a clean return so the mirror stays consistent with the persisted state.
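
A rough sketch of the zero-rows-to-sentinel pattern and the mirror gating described above. The `reconciler` struct, its `update` function field, and the integer counter are placeholders for the sqlc-generated query, the store wrapper, and the metrics interface, none of which are shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errStateRaceLoss stands in for ErrCurtailmentEventStateRaceLoss.
var errStateRaceLoss = errors.New("curtailment: state race loss")

// rowsToErr turns a rows-affected count into the sentinel: zero rows
// means the guard in the UPDATE filtered the write out.
func rowsToErr(rows int64) error {
	if rows == 0 {
		return errStateRaceLoss
	}
	return nil
}

type target struct {
	ID    string
	State string
}

type reconciler struct {
	// update is a stand-in for the :execrows query.
	update func(id, state string) (int64, error)
	// raceLoss is a stand-in for the race-loss metric counter.
	raceLoss int
}

// writeTargetState persists first and advances the in-memory mirror
// only when the write reports at least one row.
func (r *reconciler) writeTargetState(t *target, next string) error {
	rows, err := r.update(t.ID, next)
	if err == nil {
		err = rowsToErr(rows)
	}
	if err != nil {
		if errors.Is(err, errStateRaceLoss) {
			r.raceLoss++
		}
		return err
	}
	t.State = next
	return nil
}

func main() {
	r := &reconciler{update: func(id, state string) (int64, error) { return 0, nil }}
	tgt := &target{ID: "m1", State: "pending"}
	err := r.writeTargetState(tgt, "dispatching")
	fmt.Println(err, tgt.State, r.raceLoss)
}
```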

Regression coverage: TestReconciler_TargetStateRaceLoss_LogsAndMetersWithoutMirrorAdvance
injects the sentinel via the fake store, runs a tick, and asserts the
in-memory target state stays PENDING (not advanced) while the metric ticks.
…tart

Two e2e tests behind the `e2e` build tag, scoped to the curtailment
RPC surface:

TestCurtailmentLifecycle exercises Preview → Start → reconciler
advances to ACTIVE → Stop → restore drains to terminal → list
surfaces the terminal event. Validates the operator-facing path
end-to-end against the real fleet-api inside docker-compose, using
the proto-sim miner as the device under control. Per-target rows
are asserted absent on the list response, pinning the SQL-trimming
contract at the wire level.

TestCurtailmentReconcilerKillAndResume validates the restart-safety
contract: start an event, wait for the heartbeat to record one tick,
docker restart fleet-api-1, then assert (a) fleet-api returns to
healthy, (b) the heartbeat last_tick_at advances past the
pre-restart value (proves the reconciler resumed), and (c) the
event drains to terminal after Stop. The heartbeat read goes
directly via psql against the singleton row so the test sees what
the staleness alert predicate would see.

The e2e suite as a whole is currently blocked on pre-existing
proto-drift breakage in plugin_integration_test.go (unrelated
pairing/auth/telemetry field renames). The curtailment e2e file
ships with the correct curtailment proto bindings so it compiles
and runs as soon as the broader suite drift is fixed; standard
`go test ./...` (no `-tags e2e`) is unaffected since the file
carries the build tag.
Adding TargetStateDispatching to the model triggered exhaustive lint on
four switch statements that case on TargetState. The semantics:

- targetStateProto: maps directly to CURTAILMENT_TARGET_STATE_DISPATCHING.
- populateEventTargets rollup: DISPATCHING counts into the Dispatched
  bucket — the operator-facing rollup treats "command in flight from the
  reconciler's view" as one signal; the wire-level distinction stays.
- observeActive: DISPATCHING is the brief mid-cmd.Curtail window for the
  in-flight tick. Don't re-enter from a sibling tick — let the original
  tick complete its own DISPATCHED transition.
- maybeMarkActive: DISPATCHING is in-flight, same as DISPATCHED — hold
  Pending for the next tick.

maybeCompleteRestoring keeps its default arm with an explicit
//nolint:exhaustive directive. The default is load-bearing for
defense-in-depth: a future schema-added target state must stay
non-terminal until it ships its handling. Pinned by
TestReconciler_Restoring_UnknownTargetStateKeepsEventNonTerminal.
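
The load-bearing default arm might look roughly like this. The state string values, the presence of a `restored` state, and the choice of which states count as settled are assumptions for illustration; only the "unknown states stay non-terminal" rule comes from the commit message.

```go
package main

import "fmt"

type targetState string

// State names follow the thread; the string values are assumptions.
const (
	statePending       targetState = "pending"
	stateDispatching   targetState = "dispatching"
	stateDispatched    targetState = "dispatched"
	stateConfirmed     targetState = "confirmed"
	stateDrifted       targetState = "drifted"
	stateRestored      targetState = "restored"
	stateRestoreFailed targetState = "restore_failed"
)

// settled reports whether a target no longer blocks restore completion.
func settled(s targetState) bool {
	switch s {
	case stateRestored, stateRestoreFailed:
		return true
	case statePending, stateDispatching, stateDispatched, stateConfirmed, stateDrifted:
		return false
	default:
		// A state this code does not know about yet keeps the event
		// non-terminal until explicit handling ships.
		return false
	}
}

// restoreComplete is true only when every target is settled.
func restoreComplete(targets []targetState) bool {
	for _, s := range targets {
		if !settled(s) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(restoreComplete([]targetState{stateRestored, stateRestoreFailed}))
	fmt.Println(restoreComplete([]targetState{stateRestored, "quarantined"}))
}
```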

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: e64ceb307a

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated
Comment thread server/internal/domain/curtailment/service.go Outdated
The directory is locally-private and managed via a folder-local
self-ignoring .gitignore (file content: `*`) that the operator creates
on their own checkout. Keeping the entry out of the root .gitignore
avoids exposing the internal-plan directory name in a public artifact
and keeps the root file focused on universally-ignored paths.
A target stranded in DISPATCHING by an interrupted tick (process crash,
panic, or context cancellation between the pre-command stamp and the
post-command DISPATCHED write) would otherwise stay stranded — the
dispatch loop only picked up PENDING. Ticks are serial, so any
DISPATCHING seen at the top of dispatchPending or maybeClaimRestoreBatch
is by definition orphaned; redispatch is the recovery path since
Curtail/Uncurtail are device-idempotent.
AdminTerminateEvent is intentionally idempotent: a re-issue against an
event already in the requested terminal state echoes the row without
re-running the transition or sweep. Emitting a duplicate
curtailment_admin_terminated activity row for those no-op calls would
mislead audit consumers tracking operator action history.

Plumb a transitioned bool through the store interface (false on both
idempotent-echo paths — currentState==targetState on first read, and
latestState==targetState on the race-loss re-read) and gate
emitAdminTerminateAuditTrail on it.
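
A small sketch of gating the audit row on a `transitioned` flag. The `store` and `service` types and the string-typed states are simplified placeholders, not the PR's interfaces.

```go
package main

import "fmt"

// store is a toy stand-in holding a single event's state.
type store struct{ state string }

// adminTerminate reports transitioned=false when the event is already in
// the requested terminal state (the idempotent-echo path).
func (s *store) adminTerminate(target string) (state string, transitioned bool) {
	if s.state == target {
		return s.state, false
	}
	s.state = target
	return s.state, true
}

type service struct {
	store     *store
	auditRows []string // stand-in for the activity log
}

// AdminTerminate emits an audit row only when the call changed state.
func (svc *service) AdminTerminate(target string) string {
	state, transitioned := svc.store.adminTerminate(target)
	if transitioned {
		svc.auditRows = append(svc.auditRows, "curtailment_admin_terminated")
	}
	return state
}

func main() {
	svc := &service{store: &store{state: "restoring"}}
	svc.AdminTerminate("failed")
	svc.AdminTerminate("failed") // re-issue: echoes the row, no new audit entry
	fmt.Println(svc.store.state, len(svc.auditRows))
}
```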
- cursor decode now rejects non-positive org_id as InvalidArgument
  rather than silently restarting pagination — the feature has never
  shipped, no legacy tokens exist, and the silent path hid tampered
  cursors from audit detection
- removed unreachable s.audit nil guards (NewService always installs
  NoOpAuditLogger; WithAuditLogger refuses nil)
- removed redundant restoreBatchIntervalUpperBoundSec re-check in
  Start() — validateStartRequest already enforces the bound for
  non-zero values, and the default (30s) cannot exceed the cap
- updated stale handler.go package doc and main.go RPC-wire comment;
  both referenced an Unimplemented surface that this branch fully wires
- stripped v1 roadmap marker from ListEvents godoc
- rewrote migration 000055 comment to correctly describe golang-migrate
  v4's driver behavior and document operator recovery for partial
  CONCURRENTLY build failure
- added compile-time Metrics assertions next to the duplicated
  recordingMetrics fakes in service_test.go and reconciler_test.go
  so a future interface change can't silently drift one copy
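
The compile-time assertion mentioned in the last bullet is a standard Go idiom. In this sketch the single-method `Metrics` interface is a hypothetical reduction; the method name `IncEventStateRaceLoss` appears in the thread, but its real signature and the rest of the interface are not shown.

```go
package main

import "fmt"

// Metrics is a reduced, hypothetical version of the real interface.
type Metrics interface {
	IncEventStateRaceLoss()
}

// recordingMetrics is a test fake that counts calls.
type recordingMetrics struct{ raceLoss int }

func (m *recordingMetrics) IncEventStateRaceLoss() { m.raceLoss++ }

// Compile-time assertion: the build breaks here if recordingMetrics
// stops satisfying Metrics after an interface change.
var _ Metrics = (*recordingMetrics)(nil)

func main() {
	m := &recordingMetrics{}
	var iface Metrics = m
	iface.IncEventStateRaceLoss()
	fmt.Println(m.raceLoss)
}
```

The assertion costs nothing at runtime because the blank identifier discards the value.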

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: aca4137527

		return nil, err
	}
	return connect.NewResponse(&pb.UpdateCurtailmentEventResponse{
		Event: toEventProto(event),

[P2] Return fully populated events from update/terminate RPCs

UpdateCurtailmentEvent and AdminTerminateEvent build responses with toEventProto, which only maps scalar metadata and leaves structured fields like scope, mode params, and decision snapshot unset. Because these RPCs return CurtailmentEvent, clients that replace cached event objects with the response can lose previously populated event details (e.g., scope/mode context suddenly disappearing in UI/state). Use the same persisted-field population path used by read endpoints so write responses don’t downgrade event shape.

- UpdateCurtailmentEvent now collapses same-value patches to no-op
  before any gate or DB write. Echoing the persisted value (typical
  for UI re-submissions of a pre-populated form) previously bumped
  updated_at and could trip the admin gate on what is semantically
  an unchanged request.
- The max_duration_seconds admin gate now compares against the
  effective patch (persisted-vs-requested), not the raw request.
  A non-admin echoing an admin-elevated value gets the no-op path
  instead of Forbidden.
- New audit row (curtailment_updated) records operator field
  mutations. Metadata lists only the fields the patch actually
  changed so a feed reader sees intent without diffing snapshots.
- Reason / idempotency_key / external_source / external_reference
  length checks now count runes (utf8.RuneCountInString) to match
  the proto validator's rune-based max_len. A 256-character
  multi-byte reason that survived the proto pass no longer trips
  the byte-based service backstop with a confusing error.
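
The byte-versus-rune difference behind the last bullet can be seen in a few lines. `maxReasonRunes` and `reasonTooLong` are illustrative names; the 256 limit is the one stated in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const maxReasonRunes = 256

// reasonTooLong counts runes, not bytes, so multi-byte text is measured
// the same way a rune-based validator would measure it.
func reasonTooLong(s string) bool {
	return utf8.RuneCountInString(s) > maxReasonRunes
}

func main() {
	reason := strings.Repeat("電", 256) // each rune is 3 bytes in UTF-8
	fmt.Println(len(reason), utf8.RuneCountInString(reason), reasonTooLong(reason))
	// prints: 768 256 false
}
```

A check written as `len(reason) > 256` would reject this string even though it is exactly 256 characters long.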
InsertEventWithTargets's fall-through path for an unrecognized
unique-constraint name previously wrapped the raw pgconn.PgError,
exposing the internal constraint name in the wire response. No
current constraint reaches the fall-through, but a future partial
unique index added without updating the switch would silently
exfiltrate its name on every concurrent racing Start.

Log the constraint server-side for operators and return a
sanitized AlreadyExists to the caller.
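
A sketch of the sanitizing fall-through, under stated assumptions: `uniqueViolation` is a local stand-in for the driver's error type so the example stays self-contained, and both constraint names are invented.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// uniqueViolation is a local stand-in for a driver-level unique-constraint
// error; it is not the real pgconn type.
type uniqueViolation struct{ Constraint string }

func (e *uniqueViolation) Error() string {
	return "duplicate key violates unique constraint " + e.Constraint
}

var (
	errIdempotencyRaceLoss = errors.New("idempotency key race loss")
	errAlreadyExists       = errors.New("curtailment event already exists")
)

// classifyInsertError maps known constraints to typed sentinels. An
// unrecognized constraint is logged server-side and replaced by a generic
// error so its name never reaches the caller.
func classifyInsertError(err error) error {
	var uv *uniqueViolation
	if !errors.As(err, &uv) {
		return err
	}
	switch uv.Constraint {
	case "uq_events_idempotency_key": // hypothetical constraint name
		return errIdempotencyRaceLoss
	default:
		log.Printf("unrecognized unique constraint on insert: %s", uv.Constraint)
		return errAlreadyExists
	}
}

func main() {
	err := classifyInsertError(&uniqueViolation{Constraint: "uq_future_partial_idx"})
	fmt.Println(err)
}
```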
…overy + resilient restore batch

- dispatchPending now reads event liveness once per tick instead of
  once per target. The DISPATCHING pre-command write's EXISTS guard
  is the load-bearing race-closure for a concurrent AdminTerminate —
  the per-target liveness re-read was defense-in-depth that scaled
  O(N) per tick. At 100+ pending targets the redundant reads pushed
  tick latency toward the per-event deadline before any Curtail
  fired. The updated mid-loop-terminate test now pins the EXISTS
  guard's race-closure via the UpdateTargetState race-loss sentinel.
- observeActive now treats DISPATCHING targets the same as DRIFTED:
  redispatch via Curtail (device-idempotent) under MaxRetries. A
  prior interrupted active-drift dispatch left the target stuck
  indefinitely under the previous "let the in-flight tick complete"
  arm; ticks are serial, so any DISPATCHING seen at observeActive
  entry is by definition an orphan from a crashed prior tick.
- dispatchRestoreBatch's DISPATCHING pre-write loop now drops just
  the failing target on a non-race-loss error instead of aborting
  the entire batch. The remaining devices proceed to Uncurtail in
  the same tick, and the dropped row is re-claimed as an orphan
  next tick. Race-loss continues to abort the batch (event is no
  longer dispatchable).
- The orphaned DISPATCHING restore-target recovery test now asserts
  the device was re-stamped DISPATCHING before Uncurtail fired,
  pinning the AdminTerminate in-flight-gate contract on this path.
- Idempotency race-loser: pins the InsertEventWithTargets ->
  ErrCurtailmentIdempotencyKeyRaceLoss / ErrCurtailmentExternalReferenceRaceLoss
  -> retry-lookup -> replay-winner flow on both webhook channels,
  plus the rollback-induced fall-through to AlreadyExists when the
  retry lookup also misses. The fake store now models a separate
  post-insert lookup map so concurrent-first-time-Start scenarios
  can be constructed without coupling to insertEventCalls in the
  test bodies.
- Actor-type mapping: pins ActorUser for both SourceActorUser and
  SourceActorAPIKey (the activity_log doesn't yet model an
  api_key actor; a future split must not silently keep this
  coercion) and ActorScheduler for SourceActorScheduler.
- Cursor binding: a parameterized round-trip across the (OrgID,
  StateFilter) shapes ListEvents actually issues. Documents the
  contract the SQL store's mismatch guard relies on so a
  serialization regression on either field would trip loudly.
…ygiene

- restore_batch_interval_sec's non-admin cap now runs against the
  effective patch in Service.Update, mirroring the max_duration_seconds
  fix landed earlier in this PR. A non-admin echoing an admin-elevated
  value as part of an unrelated patch (UI form re-submission) collapses
  to no-op and no longer trips Forbidden. This removes the asymmetric
  gate placement between the two fields.
- observeActive comment refreshed to reflect the post-hoist safety
  model. The load-bearing race-closure is the DISPATCHING pre-write's
  EXISTS guard inside dispatchOneCurtail, not a per-target liveness
  read that no longer exists.
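
The effective-patch gate can be sketched as below, shown for max_duration_seconds; the same shape would apply to restore_batch_interval_sec. The `event`, `patch`, and `authorizeUpdate` names and the explicit `orgDefaultMax` parameter are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

type event struct {
	Reason             string
	MaxDurationSeconds int32
}

// patch uses nil to mean "field not present in the request".
type patch struct {
	Reason             *string
	MaxDurationSeconds *int32
}

var errForbidden = errors.New("admin required to raise max_duration_seconds above the org default")

// effectivePatch drops fields whose requested value equals the persisted one.
func effectivePatch(cur event, p patch) patch {
	var out patch
	if p.Reason != nil && *p.Reason != cur.Reason {
		out.Reason = p.Reason
	}
	if p.MaxDurationSeconds != nil && *p.MaxDurationSeconds != cur.MaxDurationSeconds {
		out.MaxDurationSeconds = p.MaxDurationSeconds
	}
	return out
}

// authorizeUpdate applies the admin gate to the effective patch, so an
// echoed value never trips it.
func authorizeUpdate(cur event, p patch, isAdmin bool, orgDefaultMax int32) (patch, error) {
	eff := effectivePatch(cur, p)
	if eff.MaxDurationSeconds != nil && !isAdmin && *eff.MaxDurationSeconds > orgDefaultMax {
		return patch{}, errForbidden
	}
	return eff, nil
}

func main() {
	cur := event{Reason: "peak load", MaxDurationSeconds: 7200} // admin-elevated earlier
	newReason, echoed := "peak load, extended", int32(7200)
	eff, err := authorizeUpdate(cur, patch{Reason: &newReason, MaxDurationSeconds: &echoed}, false, 3600)
	fmt.Println(err, eff.Reason != nil, eff.MaxDurationSeconds != nil)
}
```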
…re-write, utf8 boundary

- Update audit emission: a real field change produces one
  curtailment_updated row whose `fields` metadata lists only the
  actually-changed field names (no-op echoes excluded). A patch where
  every field matches the persisted value collapses to no-op with zero
  store calls and zero audit rows.
- Update gate symmetry: non-admin echo of admin-elevated max_duration
  passes; non-admin echo of admin-elevated restore_batch_interval also
  passes (the asymmetric gate rejected this before the fix).
- Update reason length boundary: 256 multi-byte runes (768 bytes) pass
  rune-count validation; 257 reject. Pins the byte-vs-rune fix.
- observeActive DISPATCHING orphan recovery: a target left in
  DISPATCHING on an ACTIVE event is redispatched on the next tick;
  budget-exhausted orphans are not redispatched.
- dispatchRestoreBatch partial pre-write failure: a non-race-loss pre-
  write error drops just that target from this tick's batch; Uncurtail
  fires for the surviving devices; the failed target stays in its prior
  state for next-tick reclaim.
- Removed the now-unused getEventByUUIDHook + getEventByUUIDCalls
  fields from the reconciler fakeStore (the mid-loop-terminate test
  was rewritten to use updateTargetStateHook in the prior commit).
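
The partial pre-write behavior covered by the restore-batch test above can be sketched as a claim loop. `claimBatch`, `errRaceLoss`, and `errTransient` are hypothetical names, and the `stamp` callback stands in for the DISPATCHING pre-write.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRaceLoss  = errors.New("event no longer dispatchable")
	errTransient = errors.New("transient store error")
)

// claimBatch stamps each candidate and returns the ids that may go into
// the bulk command. A race loss aborts the whole batch; any other error
// drops only the failing target, which stays eligible for a later tick.
func claimBatch(ids []string, stamp func(id string) error) ([]string, error) {
	claimed := make([]string, 0, len(ids))
	for _, id := range ids {
		err := stamp(id)
		switch {
		case err == nil:
			claimed = append(claimed, id)
		case errors.Is(err, errRaceLoss):
			return nil, err
		default:
			continue
		}
	}
	return claimed, nil
}

func main() {
	stamp := func(id string) error {
		if id == "m1" {
			return errTransient
		}
		return nil
	}
	claimed, err := claimBatch([]string{"m1", "m2", "m3"}, stamp)
	fmt.Println(claimed, err)
}
```

When the returned slice is empty the caller would skip the bulk command entirely, which corresponds to the degenerate all-pre-writes-fail case.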
…note

- observeActive: trim the over-explanatory race-closure block to one
  line; move the comment to sit above the actual eventStillDispatchable
  call rather than the ListCandidates fetch above it.
- validateUpdateRequest: drop the explanatory comment under the
  restore_batch_interval_sec block. The symmetric max_duration_seconds
  block carries no such note; an annotation on one field and silence on
  the other is more confusing than consistent silence. Both gates live
  in Service.Update via effectiveUpdatePatch — readers tracing the
  cap-check follow the existing max_duration_seconds pattern.
- Active-phase orphan budget-exhausted test now asserts the final state
  stays DISPATCHING and RetryCount stays at the cap. Mirror the rigor
  of the symmetric Drifted-arm exhaustion test so a silent state flip
  on the no-redispatch path would not pass.
- New TestReconciler_ObserveActive_DispatchingOrphanRaceLossDoesNotIssueCommand
  pins the EXISTS-guard race-closure on the observeActive redispatch
  path: when the DISPATCHING pre-write returns the race-loss sentinel,
  cmd.Curtail must not fire and the mirror must not advance.
- Restore partial-pre-write test now asserts m1.RetryCount==0 so the
  "skip without budget burn" invariant is pinned (the dispatch attempt
  never reached cmd.Uncurtail, so no retry slot should be consumed).
- New TestReconciler_Restoring_AllPreWriteFailuresSkipUncurtail pins
  the degenerate dispatchSet-empty path: every pre-write fails, no
  Uncurtail fires, no retry burns, targets stay Pending for next-tick
  reclaim.
- Reworded the active-orphan test docstring to describe the invariant
  rather than the review process that surfaced the gap.

Labels

client, documentation, javascript, server, shared

Development

Successfully merging this pull request may close these issues.

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E

2 participants