Skip to content

Commit b85d360

Browse files
authored
Merge pull request #124 from SoundBlaster/feature/P2-T2-stale-socket-recovery
P2-T2: Self-healing stale socket and PID file recovery
2 parents 0afbcce + 7db825f commit b85d360

10 files changed

Lines changed: 400 additions & 13 deletions

File tree

SPECS/ARCHIVE/INDEX.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,12 @@
11
# mcpbridge-wrapper Tasks Archive
22

3-
**Last Updated:** 2026-03-01 (BUG-T8 + REVIEW archived)
3+
**Last Updated:** 2026-03-01 (P2-T2 archived)
44

55
## Archived Tasks
66

77
| Task ID | Folder | Archived | Verdict |
88
|---------|--------|----------|---------|
9+
| P2-T2 | [P2-T2_Self-healing_stale_socket_and_PID_file_recovery/](P2-T2_Self-healing_stale_socket_and_PID_file_recovery/) | 2026-03-01 | PASS |
910
| BUG-T8 | [BUG-T8_Fix_broker_proxy_bridge_exits_after_first_write_due_to_BaseProtocol_missing_drain_helper/](BUG-T8_Fix_broker_proxy_bridge_exits_after_first_write_due_to_BaseProtocol_missing_drain_helper/) | 2026-03-01 | PASS |
1011
| P1-T3 | [P1-T3_Improve_MCP_settings_examples_in_README_to_present_broker_setup_first/](P1-T3_Improve_MCP_settings_examples_in_README_to_present_broker_setup_first/) | 2026-03-01 | PASS |
1112
| P1-T2 | [P1-T2_Add_Xcode_26_4_known_issue_release_notes_link_to_README/](P1-T2_Add_Xcode_26_4_known_issue_release_notes_link_to_README/) | 2026-02-28 | PASS |
@@ -171,6 +172,7 @@
171172

172173
| File | Description |
173174
|------|-------------|
175+
| [REVIEW_P2-T2_stale_socket_recovery.md](_Historical/REVIEW_P2-T2_stale_socket_recovery.md) | Review report for P2-T2 |
174176
| [Workplan_0.4.0.md](_Historical/Workplan_0.4.0.md) | Archived workplan snapshot for release 0.4.0 |
175177
| [REVIEW_p1_t1_version_badge_script_tests.md](_Historical/REVIEW_p1_t1_version_badge_script_tests.md) | Review report for P1-T1 (version badge script/tests) |
176178
| [REVIEW_p1_t2_xcode_26_4_known_issue_link.md](_Historical/REVIEW_p1_t2_xcode_26_4_known_issue_link.md) | Review report for P1-T2 |
@@ -290,6 +292,7 @@
290292

291293
| Date | Task ID | Action |
292294
|------|---------|--------|
295+
| 2026-03-01 | P2-T2 | Archived Self-healing stale socket and PID file recovery (PASS) |
293296
| 2026-03-01 | WORKPLAN | Archived Workplan_0.4.0.md to _Historical and reset SPECS/Workplan.md |
294297
| 2026-02-28 | P15-T1 | Archived REVIEW_p15_t1_next_release_readiness report |
295298
| 2026-02-28 | P15-T1 | Archived Validate_project_readiness_for_the_next_release (PASS) |
Lines changed: 115 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,115 @@
1+
# P2-T2: Self-healing stale socket and PID file recovery
2+
3+
**Status:** In Progress
4+
**Priority:** P0
5+
**Branch:** feature/P2-T2-stale-socket-recovery
6+
**Created:** 2026-03-01
7+
8+
---
9+
10+
## Problem Statement
11+
12+
When the broker daemon crashes or is killed, it leaves `broker.sock` and `broker.pid` on disk. The proxy's `_spawn_broker_if_needed` checks `socket_path.exists()` and skips spawning if the socket file is present — even if no process is listening. This silently blocks all future broker mode sessions until the user manually deletes the files.
13+
14+
---
15+
16+
## Root Cause
17+
18+
In `proxy.py``_spawn_broker_if_needed`, line 131-133:
19+
20+
```python
21+
# Check if socket already exists (race condition: broker started without PID file yet)
22+
if socket_path.exists():
23+
logger.debug("Broker socket already present; skipping spawn.")
24+
return
25+
```
26+
27+
Existence check (`Path.exists()`) does not verify whether anything is actually listening on the socket. A stale socket file left after a crash passes the existence check and prevents spawn.
28+
29+
---
30+
31+
## Solution
32+
33+
### 1. `proxy.py` — Liveness check in `_spawn_broker_if_needed`
34+
35+
Replace the plain existence check with a socket connect attempt:
36+
37+
```python
38+
if socket_path.exists():
39+
import socket as _socket
40+
try:
41+
with _socket.socket(_socket.AF_UNIX, _socket.SOCK_STREAM) as s:
42+
s.settimeout(1.0)
43+
s.connect(str(socket_path))
44+
# Connection succeeded → broker is alive
45+
logger.debug("Broker socket present and accepting connections; skipping spawn.")
46+
return
47+
except ConnectionRefusedError:
48+
# Stale socket — broker is not listening
49+
logger.warning(
50+
"Stale socket found (broker not accepting connections); removing stale files."
51+
)
52+
socket_path.unlink(missing_ok=True)
53+
pid_file.unlink(missing_ok=True)
54+
# Fall through to spawn
55+
```
56+
57+
This ensures:
58+
- If broker is alive (socket accepts connections): skip spawn (no change in behaviour)
59+
- If broker is dead (socket refuses connections): remove stale files and proceed with spawn
60+
- `FileNotFoundError` and other OS errors during connect are suppressed — treated as "not alive" (also falls through to spawn path)
61+
62+
### 2. `daemon.py``atexit` cleanup on daemon exit
63+
64+
The daemon already removes files via `_cleanup_files()` in `stop()`, and `stop()` is called by the SIGTERM/SIGINT handlers in `run_forever()`. However, if the Python interpreter exits abnormally (e.g., unhandled exception, explicit `sys.exit()`), cleanup may be skipped.
65+
66+
Add an `atexit` registration in `start()`:
67+
68+
```python
69+
import atexit
70+
atexit.register(self._cleanup_files)
71+
```
72+
73+
This ensures `_cleanup_files()` runs even for abnormal exits (excluding SIGKILL, which cannot be intercepted).
74+
75+
---
76+
77+
## Files to Change
78+
79+
| File | Change |
80+
|------|--------|
81+
| `src/mcpbridge_wrapper/broker/proxy.py` | Replace existence-only socket check with connect-based liveness check |
82+
| `src/mcpbridge_wrapper/broker/daemon.py` | Register `atexit` cleanup in `start()` |
83+
84+
---
85+
86+
## Tests to Add
87+
88+
In `tests/unit/test_broker_proxy.py` — new class `TestBrokerProxyStaleSocket`:
89+
90+
1. **`test_stale_socket_triggers_spawn`** — socket file exists but connect raises `ConnectionRefusedError`; verify `Popen` is called and stale files are removed.
91+
2. **`test_live_socket_skips_spawn`** — socket file exists and connect succeeds; verify `Popen` is NOT called.
92+
3. **`test_stale_socket_with_stale_pid_file_triggers_spawn`** — both files exist, connect raises `ConnectionRefusedError`; verify both files are removed and spawn proceeds.
93+
94+
In `tests/unit/test_broker_daemon.py` — new class `TestBrokerDaemonAtExit`:
95+
96+
4. **`test_atexit_registered_after_start`** — after `daemon.start()`, `atexit` registry includes `_cleanup_files`.
97+
98+
---
99+
100+
## Acceptance Criteria
101+
102+
- [ ] After broker crash, next `--broker-spawn` session auto-recovers without manual file removal
103+
- [ ] Liveness check uses `connect()` not `exists()`
104+
- [ ] Daemon registers `atexit` cleanup on `start()`
105+
- [ ] All existing broker tests pass
106+
- [ ] New tests cover the stale-socket scenario and atexit registration
107+
- [ ] `ruff check src/` passes
108+
- [ ] `mypy src/` passes (if configured)
109+
- [ ] Coverage ≥ 90%
110+
111+
---
112+
113+
## Dependencies
114+
115+
None — can be implemented independently.
Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
# P2-T2 Validation Report
2+
3+
**Task:** Self-healing stale socket and PID file recovery
4+
**Date:** 2026-03-01
5+
**Branch:** feature/P2-T2-stale-socket-recovery
6+
**Verdict:** PASS
7+
8+
---
9+
10+
## Quality Gate Results
11+
12+
| Gate | Result | Details |
13+
|------|--------|---------|
14+
| `ruff check src/` | ✅ PASS | All checks passed |
15+
| `pytest` | ✅ PASS | 719 passed, 5 skipped |
16+
| `pytest --cov` | ✅ PASS | 91.78% coverage (≥ 90% required) |
17+
| `mypy src/` | N/A | Not configured |
18+
19+
---
20+
21+
## Changes Made
22+
23+
### `src/mcpbridge_wrapper/broker/proxy.py`
24+
25+
- Added `import socket` at module level (replaces inline import-inside-function approach)
26+
- Replaced plain `socket_path.exists()` guard in `_spawn_broker_if_needed` with a connect-based liveness check:
27+
- If socket file exists and `connect()` succeeds → broker is alive, skip spawn
28+
- If socket file exists and `connect()` raises `OSError` (covers `ConnectionRefusedError`, `OSError: socket operation on non-socket`, timeouts) → treat as stale, remove both socket and PID files, fall through to spawn
29+
30+
### `src/mcpbridge_wrapper/broker/daemon.py`
31+
32+
- Added `import atexit` at module level
33+
- Registered `self._cleanup_files` with `atexit` in `start()` after the transport is successfully started, ensuring socket/PID files are removed even on abnormal interpreter exits (unhandled exceptions, `sys.exit()`). SIGKILL cannot be intercepted and is not covered.
34+
35+
---
36+
37+
## Tests Added / Updated
38+
39+
### `tests/unit/test_broker_proxy.py`
40+
41+
- **Updated** `TestBrokerProxyAutoSpawn::test_spawn_noop_when_socket_exists` — now mocks `socket.socket.connect()` to succeed (simulating a live broker) instead of relying on the old existence-only check.
42+
43+
- **Added** `TestBrokerProxyStaleSocket` class with 3 tests:
44+
- `test_stale_socket_triggers_spawn` — socket file exists, `connect()` raises `ConnectionRefusedError``Popen` is called
45+
- `test_stale_socket_removes_files` — socket and PID files are removed before spawn attempt
46+
- `test_live_socket_skips_spawn` — socket file exists, `connect()` succeeds → `Popen` NOT called
47+
48+
### `tests/unit/test_broker_daemon.py`
49+
50+
- **Added** `TestBrokerDaemonAtExit` class with 1 test:
51+
- `test_atexit_registered_after_start` — verifies `atexit.register` is called with `_cleanup_files` during `start()`
52+
53+
---
54+
55+
## Acceptance Criteria Status
56+
57+
- [x] After broker crash, next `--broker-spawn` session auto-recovers without manual file removal
58+
- [x] Liveness check uses `connect()` not `exists()`
59+
- [x] Daemon registers `atexit` cleanup on `start()`
60+
- [x] All existing broker tests pass (719 passed, 5 skipped)
61+
- [x] New tests cover the stale-socket scenario and atexit registration
62+
- [x] `ruff check src/` passes
63+
- [x] Coverage ≥ 90%
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
## REVIEW REPORT — P2-T2 Stale Socket Recovery
2+
3+
**Scope:** origin/main..HEAD
4+
**Files:** 4 (proxy.py, daemon.py, test_broker_proxy.py, test_broker_daemon.py)
5+
**Date:** 2026-03-01
6+
7+
---
8+
9+
### Summary Verdict
10+
11+
- [ ] Approve
12+
- [x] Approve with comments
13+
- [ ] Request changes
14+
- [ ] Block
15+
16+
---
17+
18+
### Critical Issues
19+
20+
None.
21+
22+
---
23+
24+
### Secondary Issues
25+
26+
**[Low] Blocking socket connect inside async context**
27+
28+
In `proxy.py``_spawn_broker_if_needed`, the liveness check uses `socket.socket(AF_UNIX, ...)` with `s.settimeout(1.0)` — a synchronous blocking call inside an `async def` function. If the daemon is hung (not refusing but also not accepting), the event loop is blocked for up to 1 second.
29+
30+
In practice, Unix domain socket connects on the local machine are instantaneous; the timeout protects against edge cases. For broker mode, the 1-second block is acceptable given that:
31+
- `_spawn_broker_if_needed` is called once per proxy session start
32+
- The hang scenario (listening socket not accepting) is extremely rare with broker
33+
34+
**Fix (optional, lower priority):** Could be converted to an async probe via `asyncio.open_unix_connection` with `asyncio.wait_for`. Defer unless profiling shows impact.
35+
36+
---
37+
38+
**[Nit] atexit handler not deregistered after clean stop**
39+
40+
In `daemon.py`, `atexit.register(self._cleanup_files)` is called on successful start. After `stop()` completes and removes files, the atexit handler remains registered. On interpreter exit after a clean shutdown, `_cleanup_files` runs again on already-absent files.
41+
42+
This is safe because `_cleanup_files` uses `unlink(missing_ok=True)`. However, repeated registration across multiple `start()`/`stop()` cycles (not currently possible due to PID file locking, but possible in tests) would accumulate atexit registrations.
43+
44+
**Fix (optional):** Use `atexit.unregister(self._cleanup_files)` in `stop()`. Low priority — current behavior is safe.
45+
46+
---
47+
48+
### Architectural Notes
49+
50+
- The catch-all `except OSError` in `_spawn_broker_if_needed` handles `ConnectionRefusedError`, `FileNotFoundError` (TOCTOU: socket disappears between `exists()` and `connect()`), and `OSError: [Errno 38] Socket operation on non-socket` (connecting to a regular file). This is the correct breadth of coverage.
51+
- The atexit approach complements the existing SIGTERM/SIGINT handlers in `run_forever()`. The combination means clean files are guaranteed on: normal exit, `sys.exit()`, SIGTERM, SIGINT. Only SIGKILL (non-interceptable) leaves stale files — which the new proxy liveness check now handles automatically.
52+
- The existing `test_spawn_noop_when_socket_exists` test was correctly updated to mock `socket.connect()` success rather than relying on a plain `touch()`'d file (which would have produced an `OSError` with the new code).
53+
54+
---
55+
56+
### Tests
57+
58+
- 4 new tests added (3 for stale socket proxy scenarios, 1 for atexit daemon registration).
59+
- 1 existing test updated to reflect new liveness-check semantics.
60+
- All 719 tests pass; coverage 91.78% (≥ 90%).
61+
- **Gap (Low priority):** No explicit test for TOCTOU race (socket file disappears between `exists()` and `connect()`). Covered implicitly by `except OSError` but not explicitly tested.
62+
63+
---
64+
65+
### Next Steps
66+
67+
- **P2-T3** (P1, now unblocked by P2-T2 ✅): Fix double-spawn race condition with `fcntl.flock` spawn lock
68+
- **Optional** (Nit): Add `atexit.unregister` in `daemon.stop()` if test-cycle isolation becomes an issue
69+
- **Optional** (Low): Convert blocking socket probe to async `open_unix_connection` + `wait_for`

SPECS/INPROGRESS/next.md

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,17 @@
11
# No Active Task
22

3-
**Status:** Idle — BUG-T8 archived. Select the next task from `SPECS/Workplan.md`.
3+
**Status:** Idle — P2-T2 archived. Select the next task from `SPECS/Workplan.md`.
44

55
## Recently Archived
66

7+
- **P2-T2** — Self-healing stale socket and PID file recovery (2026-03-01, PASS)
78
- **BUG-T8** — Fix broker proxy bridge exits after first write due to BaseProtocol missing _drain_helper (2026-03-01, PASS)
89
- **P1-T3** — Improve MCP settings examples in README to present broker setup first (2026-03-01, PASS)
910
- **P1-T2** — Add Xcode 26.4 known issue release-notes link to README (2026-02-28, PASS)
1011
- **P1-T1** — Add the version badge in the README.md (2026-02-28, PASS)
12+
13+
## Suggested Next Tasks
14+
15+
- **P2-T1** (P1) — Replace --broker-spawn/--broker-connect with single --broker flag
16+
- **P2-T3** (P1) — Fix double-spawn race condition when MCP client toggles rapidly (depends on P2-T2 ✅)
17+
- **P2-T4** (P1) — Surface broker unavailability as JSON-RPC error instead of silent timeout

SPECS/Workplan.md

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,8 @@ Add new tasks using the canonical template in [TASK_TEMPLATE.md](TASK_TEMPLATE.m
7171
- [ ] All MCP settings examples in README and DocC use `--broker`
7272
- [ ] All existing tests pass
7373

74-
#### ⬜️ P2-T2: Self-healing stale socket and PID file recovery
74+
#### ✅ P2-T2: Self-healing stale socket and PID file recovery
75+
- **Status:** ✅ Completed (2026-03-01)
7576
- **Description:** When the broker daemon crashes or is killed, it leaves `broker.sock` and `broker.pid` on disk. The proxy's `_spawn_broker_if_needed` checks `socket_path.exists()` and skips spawning if the socket file is present — even if no process is listening. This silently blocks all future broker mode sessions until the user manually deletes the files. Fix by validating socket liveness via `connect()` before concluding a broker is running: if `connect()` fails with `ConnectionRefusedError`, treat both files as stale, remove them, and proceed with spawn. Also clean up socket file on daemon exit via `atexit`/signal handler.
7677
- **Priority:** P0
7778
- **Dependencies:** none
@@ -80,10 +81,10 @@ Add new tasks using the canonical template in [TASK_TEMPLATE.md](TASK_TEMPLATE.m
8081
- `src/mcpbridge_wrapper/broker/proxy.py` — liveness check in `_spawn_broker_if_needed`
8182
- `src/mcpbridge_wrapper/broker/daemon.py` — socket cleanup on exit
8283
- **Acceptance Criteria:**
83-
- [ ] After broker crash, next `--broker-spawn` (or `--broker`) session auto-recovers without manual file removal
84-
- [ ] Liveness check uses `connect()` not `exists()`
85-
- [ ] Daemon removes `broker.sock` on clean exit and on SIGTERM
86-
- [ ] All existing broker tests pass
84+
- [x] After broker crash, next `--broker-spawn` (or `--broker`) session auto-recovers without manual file removal
85+
- [x] Liveness check uses `connect()` not `exists()`
86+
- [x] Daemon removes `broker.sock` on clean exit and on SIGTERM
87+
- [x] All existing broker tests pass
8788

8889
#### ⬜️ P2-T3: Fix double-spawn race condition when MCP client toggles rapidly
8990
- **Description:** When an MCP client (e.g. Zed) toggles the connection off/on quickly, two proxy processes start simultaneously. Both check for a running broker, find none, and both spawn a daemon. Two competing daemons fight over the socket path: one wins, the other crashes. The losing proxy's client gets no broker and shows 0 tools. Fix with a filesystem lock (e.g. `fcntl.flock` on the PID file) so only one spawn attempt proceeds at a time; the second waiter detects the winner's daemon and connects.

src/mcpbridge_wrapper/broker/daemon.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@
1919
from __future__ import annotations
2020

2121
import asyncio
22+
import atexit
2223
import contextlib
2324
import json
2425
import logging
@@ -134,6 +135,10 @@ async def start(self) -> None:
134135
await self._rollback_startup()
135136
raise
136137

138+
# Ensure socket/PID files are removed even on abnormal interpreter exit
139+
# (e.g. unhandled exception, sys.exit). SIGKILL cannot be intercepted.
140+
atexit.register(self._cleanup_files)
141+
137142
self._state = BrokerState.READY
138143
logger.info(
139144
"Broker READY (upstream PID %s)",

src/mcpbridge_wrapper/broker/proxy.py

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@
1818
import contextlib
1919
import logging
2020
import os
21+
import socket
2122
import sys
2223

2324
from mcpbridge_wrapper.broker.types import BrokerConfig
@@ -127,10 +128,23 @@ async def _spawn_broker_if_needed(self) -> None:
127128
except (ValueError, ProcessLookupError, PermissionError):
128129
logger.debug("Stale PID file; will spawn broker.")
129130

130-
# Check if socket already exists (race condition: broker started without PID file yet)
131+
# Check if socket already exists and is actually alive.
132+
# A stale socket file left after a crash passes exists() but refuses connections.
131133
if socket_path.exists():
132-
logger.debug("Broker socket already present; skipping spawn.")
133-
return
134+
try:
135+
with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
136+
s.settimeout(1.0)
137+
s.connect(str(socket_path))
138+
# Connection succeeded — broker is alive
139+
logger.debug("Broker socket present and accepting connections; skipping spawn.")
140+
return
141+
except OSError:
142+
logger.warning(
143+
"Stale socket found (broker not accepting connections); removing stale files."
144+
)
145+
socket_path.unlink(missing_ok=True)
146+
pid_file.unlink(missing_ok=True)
147+
# Fall through to spawn
134148

135149
logger.info("Spawning broker daemon…")
136150
import subprocess

0 commit comments

Comments
 (0)