Skip to content

Harden failure handling & concurrency (audit items 1–3)#1084

Merged
schmidt-scaled merged 1 commit into
mainfrom
audit-hardening-main
Jun 8, 2026
Merged

Harden failure handling & concurrency (audit items 1–3)#1084
schmidt-scaled merged 1 commit into
mainfrom
audit-hardening-main

Conversation

@schmidt-scaled
Copy link
Copy Markdown
Contributor

Addresses the highest-impact findings from the failure-mode / race audit of the control plane. Cherry-picked from the same fix on `performance-optimization`.

Item 1 — systemic

  • `DBController.atomic_update()` — transactional read-modify-write (FDB serializable isolation + `fdb.transactional` auto-retry = true CAS), replacing the lost-update-prone `read(); mutate; write_to_db()` pattern.
  • Applied to `set_node_status` (FSM guards evaluated against the fresh row inside the tx; event emission / peer broadcast / task cancellation moved to after commit) and all 5 snapshot `ref_count` sites (3 increments, 2 decrements).
  • Task lease: `JobSchedule.owner` + `TASK_LEASE_TTL_SEC` + `claim_task()` (hostname-keyed — a same-host restart re-claims immediately; a second replica is locked out until the lease goes stale). Wired as a gate into the restart / migration / lvol_migration / port_allow / node_add runners.

Item 2 — RPC contract (safe subset)

  • Drop POST from the urllib3 read-retry `allowed_methods` so non-idempotent SPDK RPCs are no longer silently re-applied on a read timeout; connection-error retries are preserved.
  • Guard the 5 `for d in get_bdevs()` sites: a `None` (RPC-failure) return now raises a clear, catchable error instead of an opaque `TypeError`.
  • (The broader `_request`-raises-on-error contract flip is intentionally deferred — it breaks hundreds of `if not ret:` callers and needs its own pass.)

Item 3 — quick wins

  • Per-task `try/except` in `tasks_runner_restart`, `tasks_runner_node_add` (+ cap its unbounded backoff), `device_monitor`, `mgmt_node_monitor` so one task/node cannot kill the whole service.
  • Wire `is_migration_active_on_node` into `snapshot_controller.add` to enforce the one-migration-per-source-node freeze invariant (previously dead code).
  • `lvol_monitor` reconciles stale `STATUS_IN_CREATION` zombies (force-delete past `LVOL_IN_CREATION_STALE_SEC`) so crashed creates stop leaking pool capacity.

Tests

  • `tests/test_task_lease.py` (8) — lease/claim + staleness logic.
  • `tests/test_rpc_retry.py` (3) — POST excluded from retry, connect retries preserved.
  • Full unit + migration-unit suite green locally (117 passed / 44 FDB-e2e skipped).

⚠️ Before relying in production

The FDB-backed paths (`atomic_update`, the snapshot freeze-guard, the `IN_CREATION` reconciler) can only be exercised against a live FoundationDB — needs a lab integration run, not just unit CI.

🤖 Generated with Claude Code

Addresses the highest-impact findings from the failure/race audit.

Item 1 — systemic:
- DBController.atomic_update(): transactional read-modify-write (FDB
  serializable isolation + fdb.transactional auto-retry = true CAS),
  replacing the lost-update-prone read();mutate;write_to_db() pattern.
- Applied to set_node_status (guards evaluated against the fresh row
  inside the tx; events/peer-broadcast/task-cancel moved post-commit)
  and all 5 snapshot ref_count sites (3 inc, 2 dec).
- Task lease: JobSchedule.owner + TASK_LEASE_TTL_SEC + claim_task()
  (hostname-keyed; same-host restart re-claims immediately, a second
  replica is locked out until the lease goes stale). Wired as a gate
  into the restart/migration/lvol_migration/port_allow/node_add runners.

Item 2 — RPC contract (safe subset):
- Drop POST from urllib3 read-retry allowed_methods so non-idempotent
  SPDK RPCs are no longer silently re-applied on a read timeout;
  connect-error retries are preserved.
- Guard the 5 `for d in get_bdevs()` sites: a None (RPC-failure) return
  now raises a clear, catchable error instead of an opaque TypeError.
  (The broader _request-raises-on-error flip is intentionally deferred.)

Item 3 — quick wins:
- Per-task try/except in tasks_runner_restart, tasks_runner_node_add
  (+ cap its unbounded backoff), device_monitor, mgmt_node_monitor so
  one task/node cannot kill the whole service.
- Wire is_migration_active_on_node into snapshot_controller.add to
  enforce the one-migration-per-source-node freeze invariant (was
  dead code).
- lvol_monitor reconciles stale STATUS_IN_CREATION zombies (force-delete
  past LVOL_IN_CREATION_STALE_SEC) so crashed creates stop leaking
  pool capacity.

Tests: tests/test_task_lease.py (8, lease/claim logic),
tests/test_rpc_retry.py (3, POST excluded from retry). FDB-backed paths
(atomic_update, snapshot guard) still require a lab integration run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@schmidt-scaled schmidt-scaled force-pushed the audit-hardening-main branch from 28a2688 to 04529d5 Compare June 8, 2026 11:14
@schmidt-scaled schmidt-scaled merged commit ca9fb4e into main Jun 8, 2026
9 checks passed
@schmidt-scaled schmidt-scaled deleted the audit-hardening-main branch June 8, 2026 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant