Harden failure handling & concurrency (audit items 1–3) by schmidt-scaled · Pull Request #1084 · simplyblock/sbcli

schmidt-scaled · 2026-06-08T10:59:23Z

Addresses the highest-impact findings from the failure-mode / race audit of the control plane. Cherry-picked from the same fix on `performance-optimization`.

Item 1 — systemic

`DBController.atomic_update()` — transactional read-modify-write (FDB serializable isolation + `fdb.transactional` auto-retry = true CAS), replacing the lost-update-prone `read(); mutate; write_to_db()` pattern.
Applied to `set_node_status` (FSM guards evaluated against the fresh row inside the tx; event emission / peer broadcast / task cancellation moved to after commit) and all 5 snapshot `ref_count` sites (3 increments, 2 decrements).
Task lease: `JobSchedule.owner` + `TASK_LEASE_TTL_SEC` + `claim_task()` (hostname-keyed — a same-host restart re-claims immediately; a second replica is locked out until the lease goes stale). Wired as a gate into the restart / migration / lvol_migration / port_allow / node_add runners.

Item 2 — RPC contract (safe subset)

Drop POST from the urllib3 read-retry `allowed_methods` so non-idempotent SPDK RPCs are no longer silently re-applied on a read timeout; connection-error retries are preserved.
Guard the 5 `for d in get_bdevs()` sites: a `None` (RPC-failure) return now raises a clear, catchable error instead of an opaque `TypeError`.
(The broader `_request`-raises-on-error contract flip is intentionally deferred — it breaks hundreds of `if not ret:` callers and needs its own pass.)

Item 3 — quick wins

Per-task `try/except` in `tasks_runner_restart`, `tasks_runner_node_add` (+ cap its unbounded backoff), `device_monitor`, `mgmt_node_monitor` so one task/node cannot kill the whole service.
Wire `is_migration_active_on_node` into `snapshot_controller.add` to enforce the one-migration-per-source-node freeze invariant (previously dead code).
`lvol_monitor` reconciles stale `STATUS_IN_CREATION` zombies (force-delete past `LVOL_IN_CREATION_STALE_SEC`) so crashed creates stop leaking pool capacity.

Tests

`tests/test_task_lease.py` (8) — lease/claim + staleness logic.
`tests/test_rpc_retry.py` (3) — POST excluded from retry, connect retries preserved.
Full unit + migration-unit suite green locally (117 passed / 44 FDB-e2e skipped).

⚠️ Before relying in production

The FDB-backed paths (`atomic_update`, the snapshot freeze-guard, the `IN_CREATION` reconciler) can only be exercised against a live FoundationDB — needs a lab integration run, not just unit CI.

🤖 Generated with Claude Code

Addresses the highest-impact findings from the failure/race audit. Item 1 — systemic: - DBController.atomic_update(): transactional read-modify-write (FDB serializable isolation + fdb.transactional auto-retry = true CAS), replacing the lost-update-prone read();mutate;write_to_db() pattern. - Applied to set_node_status (guards evaluated against the fresh row inside the tx; events/peer-broadcast/task-cancel moved post-commit) and all 5 snapshot ref_count sites (3 inc, 2 dec). - Task lease: JobSchedule.owner + TASK_LEASE_TTL_SEC + claim_task() (hostname-keyed; same-host restart re-claims immediately, a second replica is locked out until the lease goes stale). Wired as a gate into the restart/migration/lvol_migration/port_allow/node_add runners. Item 2 — RPC contract (safe subset): - Drop POST from urllib3 read-retry allowed_methods so non-idempotent SPDK RPCs are no longer silently re-applied on a read timeout; connect-error retries are preserved. - Guard the 5 `for d in get_bdevs()` sites: a None (RPC-failure) return now raises a clear, catchable error instead of an opaque TypeError. (The broader _request-raises-on-error flip is intentionally deferred.) Item 3 — quick wins: - Per-task try/except in tasks_runner_restart, tasks_runner_node_add (+ cap its unbounded backoff), device_monitor, mgmt_node_monitor so one task/node cannot kill the whole service. - Wire is_migration_active_on_node into snapshot_controller.add to enforce the one-migration-per-source-node freeze invariant (was dead code). - lvol_monitor reconciles stale STATUS_IN_CREATION zombies (force-delete past LVOL_IN_CREATION_STALE_SEC) so crashed creates stop leaking pool capacity. Tests: tests/test_task_lease.py (8, lease/claim logic), tests/test_rpc_retry.py (3, POST excluded from retry). FDB-backed paths (atomic_update, snapshot guard) still require a lab integration run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

schmidt-scaled force-pushed the audit-hardening-main branch from 28a2688 to 04529d5 Compare June 8, 2026 11:14

schmidt-scaled merged commit ca9fb4e into main Jun 8, 2026
9 checks passed

schmidt-scaled deleted the audit-hardening-main branch June 8, 2026 11:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden failure handling & concurrency (audit items 1–3)#1084

Harden failure handling & concurrency (audit items 1–3)#1084
schmidt-scaled merged 1 commit into
mainfrom
audit-hardening-main

schmidt-scaled commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

schmidt-scaled commented Jun 8, 2026

Item 1 — systemic

Item 2 — RPC contract (safe subset)

Item 3 — quick wins

Tests

⚠️ Before relying in production

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant