Harden failure handling & concurrency (audit items 1–3) #1084
Merged
Addresses the highest-impact findings from the failure/race audit.

Item 1 — systemic:
- DBController.atomic_update(): transactional read-modify-write (FDB serializable isolation + fdb.transactional auto-retry = true CAS), replacing the lost-update-prone read();mutate;write_to_db() pattern.
- Applied to set_node_status (guards evaluated against the fresh row inside the tx; events/peer-broadcast/task-cancel moved post-commit) and all 5 snapshot ref_count sites (3 inc, 2 dec).
- Task lease: JobSchedule.owner + TASK_LEASE_TTL_SEC + claim_task() (hostname-keyed; same-host restart re-claims immediately, a second replica is locked out until the lease goes stale). Wired as a gate into the restart/migration/lvol_migration/port_allow/node_add runners.

Item 2 — RPC contract (safe subset):
- Drop POST from urllib3 read-retry allowed_methods so non-idempotent SPDK RPCs are no longer silently re-applied on a read timeout; connect-error retries are preserved.
- Guard the 5 `for d in get_bdevs()` sites: a None (RPC-failure) return now raises a clear, catchable error instead of an opaque TypeError. (The broader _request-raises-on-error flip is intentionally deferred.)

Item 3 — quick wins:
- Per-task try/except in tasks_runner_restart, tasks_runner_node_add (+ cap its unbounded backoff), device_monitor, mgmt_node_monitor so one task/node cannot kill the whole service.
- Wire is_migration_active_on_node into snapshot_controller.add to enforce the one-migration-per-source-node freeze invariant (was dead code).
- lvol_monitor reconciles stale STATUS_IN_CREATION zombies (force-delete past LVOL_IN_CREATION_STALE_SEC) so crashed creates stop leaking pool capacity.

Tests: tests/test_task_lease.py (8, lease/claim logic), tests/test_rpc_retry.py (3, POST excluded from retry). FDB-backed paths (atomic_update, snapshot guard) still require a lab integration run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
28a2688 to 04529d5
Addresses the highest-impact findings from the failure-mode / race audit of the control plane. Cherry-picked from the same fix on `performance-optimization`.
Item 1 — systemic
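For reviewers who want the shape of the change without a live FoundationDB, here is a minimal sketch of the read-modify-write and lease-claim logic described in the commit message. It is not the code in this PR: `InMemoryStore`, the `mutate` callback signature, the `lease_ts` field, and the TTL value are all assumptions made for illustration, and a plain lock stands in for FDB's serializable transaction with auto-retry.

```python
import threading
import time

TASK_LEASE_TTL_SEC = 60  # assumed value; the PR does not state the real TTL


class InMemoryStore:
    """Hypothetical stand-in for the FDB-backed store; one lock models one transaction."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, row):
        with self._lock:
            self._rows[key] = dict(row)

    def get(self, key):
        with self._lock:
            return dict(self._rows[key])

    def atomic_update(self, key, mutate):
        """Read the fresh row, apply mutate(row), and write it back as one step.

        mutate returns True to commit, or False to leave the stored row unchanged.
        """
        with self._lock:
            row = dict(self._rows[key])
            committed = bool(mutate(row))
            if committed:
                self._rows[key] = row
            return committed


def claim_task(store, task_id, hostname, now=None):
    """Try to take the task lease; the guard is evaluated against the fresh row."""
    now = time.time() if now is None else now

    def mutate(row):
        owner = row.get("owner")
        stale = now - row.get("lease_ts", 0) >= TASK_LEASE_TTL_SEC
        if owner not in (None, hostname) and not stale:
            return False  # another replica holds a live lease
        row["owner"] = hostname
        row["lease_ts"] = now
        return True

    return store.atomic_update(task_id, mutate)
```

Because the guard runs inside `atomic_update`, two replicas racing for the same task cannot both see an unowned row, which is the failure mode of a separate read followed by a write. The same helper covers the `ref_count` sites: the increment or decrement becomes the `mutate` callback.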
Item 2 — RPC contract (safe subset)
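The `get_bdevs()` guard from the commit message can be pictured roughly as below. This is a sketch under assumptions, not the PR's code: `RPCError`, `require_bdevs`, `bdev_names`, and the `rpc_client` argument are hypothetical names, and the only behaviour taken from the source is that a `None` return signals an RPC failure.

```python
class RPCError(RuntimeError):
    """Hypothetical error type; the PR does not name the exception it raises."""


def require_bdevs(bdevs, node_id):
    """Return the bdev list, or raise if the RPC signalled failure by returning None."""
    if bdevs is None:
        raise RPCError(f"get_bdevs() failed on node {node_id}: RPC returned None")
    return bdevs


def bdev_names(rpc_client, node_id):
    # Without the guard, iterating a None return raises
    # "TypeError: 'NoneType' object is not iterable", which hides the real cause.
    return [d["name"] for d in require_bdevs(rpc_client.get_bdevs(), node_id)]
```

Callers can then catch one specific exception and decide whether to retry, skip the node, or fail the task, rather than handling a `TypeError` that could equally come from an unrelated bug.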
Item 3 — quick wins
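The per-task isolation and the capped backoff from the commit message follow a common pattern, sketched below. The function names, the dict-shaped tasks, and both backoff constants are assumptions for illustration; the actual runners in this PR may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)

BACKOFF_BASE_SEC = 2   # assumed base delay
BACKOFF_MAX_SEC = 300  # assumed cap; the PR does not state the real limit


def backoff_delay(retry_count):
    """Exponential backoff that stops growing once it reaches the cap."""
    return min(BACKOFF_BASE_SEC * (2 ** min(retry_count, 32)), BACKOFF_MAX_SEC)


def run_once(tasks, handler):
    """Run handler on every task; a failure is logged and the loop continues.

    Returns the ids of the tasks whose handler raised.
    """
    failed = []
    for task in tasks:
        try:
            handler(task)
        except Exception:
            # Isolate the failure to this task so the service loop stays alive.
            logger.exception("task %s failed; continuing", task.get("id"))
            failed.append(task.get("id"))
    return failed
```

Catching `Exception` rather than using a bare `except` keeps `KeyboardInterrupt` and `SystemExit` working, so the service can still be stopped cleanly.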
Tests
The FDB-backed paths (`atomic_update`, the snapshot freeze-guard, the `IN_CREATION` reconciler) can only be exercised against a live FoundationDB — needs a lab integration run, not just unit CI.
🤖 Generated with Claude Code