Working toward a "subinterpreter-forkserver" spawning backend #447
Draft
goodboy wants to merge 137 commits into
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.
Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.
Deats,
- four scenarios via `--scenario`:
- `control_subint_thread_fork` — the KNOWN-BROKEN case as a
harness sanity; if the child DOESN'T abort, our analysis
is wrong
- `main_thread_fork` — baseline sanity, must always succeed
- `worker_thread_fork` — architectural assertion: regular
`threading.Thread` attached to main interp calls
`os.fork()`; child should survive post-fork cleanup
- `full_architecture` — end-to-end: fork from a main-interp
worker thread, then in child create a subint driving a
worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
"child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; the
parent uses `os.waitpid()` + per-scenario status prints to
report the child's fate
Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).
Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.
Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
- `fork_from_worker_thread(child_target, thread_name)` —
spawn a main-interp `threading.Thread`, call `os.fork()`
from it, shuttle the child pid back to main via a pipe
- `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
create a fresh subint + drive `_interpreters.exec()` on
a dedicated worker thread running the `bootstrap` str
(typically imports `trio`, defines an async entry, calls
`trio.run()`)
- `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
pass/fail classification reusable from harness AND the
eventual real spawn path
- feature-gated py3.14+ via the public
`concurrent.interpreters` presence check; matches the gate
in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
(cross-refs `_subint_fork` stub + the two `conc-anal/`
docs) and status: EXPERIMENTAL, not yet registered in
`_spawn._methods`
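The first and third primitives can be pictured with a minimal stdlib-only sketch. It is an illustration under assumptions, not the module's actual code: the function names and keyword names mirror the bullets above, but the bodies, the default `thread_name`, and the exit-code convention (the target's return value becomes the child's exit code) are hypothetical.

```python
import os
import threading


def fork_from_worker_thread(child_target, thread_name="fork-worker"):
    """Fork from a plain worker thread; return the child pid to the caller."""
    rfd, wfd = os.pipe()

    def _worker():
        pid = os.fork()
        if pid == 0:
            # Child: drop the pid-pipe, run the target, never return
            # into the parent's call stack.
            exit_code = 1
            try:
                os.close(rfd)
                os.close(wfd)
                exit_code = int(child_target() or 0)
            finally:
                os._exit(exit_code)
        # Parent-side worker: shuttle the pid back through the pipe.
        os.write(wfd, str(pid).encode())
        os.close(wfd)

    worker = threading.Thread(target=_worker, name=thread_name)
    worker.start()
    worker.join()
    pid = int(os.read(rfd, 32))
    os.close(rfd)
    return pid


def wait_child(pid, expect_exit_ok=True):
    """Reap `pid`; True when the outcome matches the expectation."""
    _, status = os.waitpid(pid, 0)
    exited_ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return exited_ok == expect_exit_ok
```

A `control_*`-style scenario would call `wait_child(pid, expect_exit_ok=False)`, so an aborted child counts as the expected outcome.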
Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).
Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.
Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
for the GIL-hostage safety reason doc'd in
`ai/conc-anal/subint_sigint_starvation_issue.md`:
- `test_fork_from_worker_thread_via_trio` — parent-side
plumbing baseline. `trio.run()` off-loads forkserver
prims via `trio.to_thread.run_sync()` + asserts the
child reaps cleanly
- `test_fork_and_run_trio_in_child` — end-to-end: forked
child calls `run_subint_in_worker_thread()` with a
bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
`dump_on_hang()` for post-mortem if the outer
`pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
drive the primitives directly rather than going through
tractor's spawn-method registry (which the forkserver
isn't plugged into yet)
Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.
Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.
Deats,
- three reasons enumerated for the current constraints:
- class-A GIL-starvation — **fixed** by isolated mode:
subints don't share main's GIL so abandoned-thread
contention disappears
- destroy race / tstate-recycling from `subint_proc` —
**unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
cross-mode, so isolated doesn't obviously fix it;
needs empirical retest on py3.14 + isolated API
- fork-from-main-interp-tstate (the CPython-level
`_PyInterpreterState_DeleteExceptMain` gate) — the
narrow reason for using a dedicated thread; **probably
fixed** IF the destroy-race also resolves (bc trio's
cache threads never drove subints → clean main-interp
tstate)
- TL;DR table of which constraints unwind under each
resolution branch
- four-step audit plan for when `msgspec#563` lands:
- flip `_subint` to isolated mode
- empirical destroy-race retest
- audit `_subint_forkserver.py` — drop `non_trio`
qualifier / maybe inline primitives
- doc fallout — close the three `subint_*_issue.md`
siblings w/ post-mortem notes
Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.
Deats,
- new `subint_forkserver_proc()` spawn target in
`_subint_forkserver`:
- mirrors `trio_proc()`'s supervision model — real OS
subprocess so `Portal.cancel_actor()` + `soft_kill()`
on graceful teardown, `os.kill(SIGKILL)` on hard-reap
(no `_interpreters.destroy()` race to fuss over bc the
child lives in its own process)
- only real diff from `trio_proc` is the spawn mechanism:
fork from a main-interp worker thread via
`fork_from_worker_thread()` (off-loaded to trio's
thread pool) instead of `trio.lowlevel.open_process()`
- child-side `_child_target` closure runs
`tractor._child._actor_child_main()` with
`spawn_method='trio'` — the child is a regular trio
actor, "subint_forkserver" names how the parent
spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
shim around a raw OS pid: `.poll()` via
`waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
cache thread, `.kill()` via `SIGKILL`, `.returncode`
cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
are `None` (fork-w/o-exec inherits parent FDs; we don't
marshal them) which matches `soft_kill()`'s `is not None`
guards
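A rough, synchronous sketch of the shim's shape follows. The class name `ForkedProcSketch` is hypothetical, the async `.wait()` is omitted to keep the example stdlib-only, and the exit-code encoding via `os.waitstatus_to_exitcode()` is an assumption rather than something the commit states.

```python
import os
import signal


class ForkedProcSketch:
    """Process-like wrapper around a raw child pid (illustrative only)."""

    def __init__(self, pid):
        self.pid = pid
        self._returncode = None
        # fork-without-exec shares the parent's FDs; nothing is marshalled.
        self.stdin = self.stdout = self.stderr = None

    @property
    def returncode(self):
        return self._returncode

    def poll(self):
        """Non-blocking status check; caches the code once reaped."""
        if self._returncode is not None:
            return self._returncode
        try:
            pid, status = os.waitpid(self.pid, os.WNOHANG)
        except ChildProcessError:
            # Someone else already reaped the child.
            return self._returncode
        if pid == 0:
            return None  # still running
        self._returncode = os.waitstatus_to_exitcode(status)
        return self._returncode

    def kill(self):
        os.kill(self.pid, signal.SIGKILL)
```

With this encoding a hard-killed child reports a negative return code (the negated signal number), which is the same convention `subprocess.Popen.returncode` uses.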
Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.
Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode "this
process is the root actor": `_runtime_vars`'s `_is_root=True`,
pre-populated `_root_mailbox` + `_registry_addrs`, and the
parent's `_current_actor` singleton.
A fresh `exec`-based child starts with those globals at their
module-level defaults (all falsey/empty). The forkserver child
needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes the
`is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.
Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with a
cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.
Found-via: `test_subint_forkserver_spawn_basic` hitting the
`enable_modules` assert on child-side runtime boot.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.
Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
`_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
live `_runtime_vars` is now initialised from
`dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
kwarg. When True, returns a fresh copy of
`_RUNTIME_VARS_DEFAULTS` instead of the live dict —
still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
atomic replacement of the live dict's contents via
`.clear()` + `.update()`, so existing references to the
same dict object remain valid. Accepts either the
historical dict form or the `RuntimeVars` struct
Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
`set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
`_trio_main` unconditionally overwrites it downstream,
so no explicit reset needed (noted in the XXX comment)
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.
Deats,
- harness runs the failure-mode sequence out-of-process:
1. harness subprocess runs a fresh Python script
that calls `try_set_start_method('subint_forkserver')`
then opens a root actor + one `sleep_forever` subactor
2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
off harness `stdout` to confirm IPC handshake
completed
3. `SIGKILL` the parent, `proc.wait()` to reap the
zombie (otherwise `os.kill(pid, 0)` keeps reporting
it alive)
4. assert the child survived the parent-reap (i.e. was
actually orphaned, not reaped too) before moving on
5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
bytes-buffer to carry partial lines across calls,
`_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
orphan-reparenting semantics don't generalize to
other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
pids + `proc.kill()` + bounded `proc.wait()` to avoid
leaking orphans across the session
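The liveness probe and the bounded poll from step 5 can be approximated as follows. The names `process_alive()` and `wait_for_exit()` are hypothetical stand-ins for the test's private helpers, and treating `PermissionError` as "alive" is an assumption of this sketch.

```python
import os
import time


def process_alive(pid):
    """Signal 0 delivers nothing but still performs the existence check."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The pid exists but belongs to another user.
        return True
    return True


def wait_for_exit(pid, timeout=10.0, interval=0.1):
    """Poll every `interval` seconds until `pid` vanishes or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not process_alive(pid):
            return True
        time.sleep(interval)
    return not process_alive(pid)
```

Note that an unreaped zombie still passes the `kill(pid, 0)` check, which is why step 3 reaps the killed parent before probing.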
Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.
Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
`_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
Docstring block enumerates both:
- `'ipc'` (default, currently the only implemented mode):
no child-side SIGINT handler — `trio.run()` is on the
fork-inherited non-main thread where
`signal.set_wakeup_fd()` is main-thread-only, so
cancellation flows exclusively via the parent's
`Portal.cancel_actor()` IPC path. Known gap: orphan
children don't respond to SIGINT
(`test_orphaned_subactor_sigint_cleanup_DRAFT`)
- `'trio'` (scaffolded only): manual SIGINT → trio-cancel
bridge in the fork-child prelude so external Ctrl-C
reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
`proc_kwargs` (matches how `trio_proc` threads config to
`open_process`, keeps `start_actor(proc_kwargs=...)` as
the ergonomic entry point); validates membership + raises
`NotImplementedError` for `'trio'` at the backend-entry
guard
- `_child_target` grows a `match child_sigint:` arm that
slots in the future `'trio'` impl without restructuring
— today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
pointing at this config + the xfail'd orphan-SIGINT test
No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.
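A small sketch of the entry guard described above, assuming the option is popped from `proc_kwargs`. The helper name `resolve_child_sigint()` and the use of `ValueError` for unknown values are hypothetical; only the `Literal` type, the default, and the `NotImplementedError` for `'trio'` come from the commit message.

```python
from typing import Literal, get_args

ChildSigintMode = Literal['ipc', 'trio']
_DEFAULT_CHILD_SIGINT: ChildSigintMode = 'ipc'


def resolve_child_sigint(proc_kwargs: dict) -> ChildSigintMode:
    """Pull `child_sigint` out of `proc_kwargs` and validate it."""
    mode = proc_kwargs.pop('child_sigint', _DEFAULT_CHILD_SIGINT)
    if mode not in get_args(ChildSigintMode):
        raise ValueError(
            f'child_sigint={mode!r} not in {get_args(ChildSigintMode)}'
        )
    if mode == 'trio':
        # Scaffolded only: lifting this clause would enable the branch.
        raise NotImplementedError("child_sigint='trio' is not wired up yet")
    return mode
```

Keeping the guard at the backend entry means a misconfigured spawn fails in the parent, before any fork happens.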
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test: the
hang is **not** "trio can't install a handler on a non-main
thread" (the original hypothesis from the `child_sigint`
scaffold commit). On py3.14:
- `threading.current_thread() is threading.main_thread()` IS
True post-fork: CPython re-designates the fork-inheriting
thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread
But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events`: trio's wakeup-fd
mechanism (which turns SIGINT into an epoll-wake) isn't
firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` (the runtime's
intentional "KBI-as-OS-cancel" path) never fires.
Deats,
- new
`ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
(+385 LOC): full writeup with TL;DR, symptom reproducer, the
"intentional cancel path" the bug defeats, diagnostic
evidence (`faulthandler` output + `getsignal` probe),
ruled-out hypotheses (non-main-thread issue, wakeup-fd
inheritance, KBI-as-trio-check-exception), and fix
directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail `reason`
+ test docstring rewritten to match the refined
understanding: old wording blamed the non-main-thread path,
new wording points at the `epoll_wait` wedge + cross-refs
the new conc-anal doc
- `_subint_forkserver` module docstring's
`child_sigint='trio'` bullet updated: now notes trio's
handler is already correctly installed, so the flag may end
up a no-op / doc-only mode once the real root cause is fixed
Closing the gap aligns with existing design intent (make the
already-designed "KBI-as-OS-cancel" behavior actually fire),
not a new feature.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
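The first two probes can be reproduced outside tractor with a stdlib-only sketch along these lines. Both function names are hypothetical, and the script only reports what the running interpreter does; it makes no claim about trio's `KIManager`.

```python
import os
import signal
import threading


def probe_post_fork_main_thread():
    """Fork from a worker thread; report whether the child sees the
    forking thread as `threading.main_thread()`."""
    result = {}

    def _worker():
        pid = os.fork()
        if pid == 0:
            code = 2
            try:
                is_main = (
                    threading.current_thread() is threading.main_thread()
                )
                code = 0 if is_main else 1
            finally:
                os._exit(code)
        _, status = os.waitpid(pid, 0)
        result["is_main"] = os.waitstatus_to_exitcode(status) == 0

    worker = threading.Thread(target=_worker)
    worker.start()
    worker.join()
    return result["is_main"]


def sigint_handler_installed():
    """True when a Python-level SIGINT handler is registered."""
    handler = signal.getsignal(signal.SIGINT)
    return handler not in (signal.SIG_DFL, signal.SIG_IGN, None)
```

Running both inside the affected subactor is one way to separate "no handler installed" from "handler installed but never woken", which is the distinction this commit draws.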
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.
Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
listing the spawn backends whose subactor runtime is trio-native
(`'trio'`, `'subint_forkserver'`). Both the enable-site
(`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
key off the same tuple; keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
reports the full compatible-set + the rejected method instead of the
stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
`'subint_forkserver'` to the existing `('trio', 'subint')` tuple
— fork child-side runtime receives the same SpawnSpec IPC handshake as
the others.
- `subint_forkserver_proc` child-target now passes
`spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
`Actor.pformat()` / log lines reflect the actual parent-side spawn
mechanism rather than masquerading as plain `trio`.
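The shared-tuple gate could look roughly like this. The tuple contents come from the bullet above; the helper name `ensure_debug_compatible()` and the exact error wording are hypothetical.

```python
# Backends whose subactor runtime is trio-native (per the commit message).
_DEBUG_COMPATIBLE_BACKENDS: tuple[str, ...] = (
    'trio',
    'subint_forkserver',
)


def ensure_debug_compatible(spawn_method: str) -> None:
    """Reject `debug_mode` for backends outside the compatible set."""
    if spawn_method not in _DEBUG_COMPATIBLE_BACKENDS:
        raise RuntimeError(
            f'debug mode requires one of {_DEBUG_COMPATIBLE_BACKENDS}, '
            f'got spawn method {spawn_method!r}'
        )
```

Having the enable site and the cleanup site both consult the same constant is what keeps them in lockstep when a backend is added.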
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up to 72d1b90 (the previous commit, which added
`debug_mode` support for `subint_forkserver`): that commit
wired the runtime-side `subint_forkserver` SpawnSpec-recv gate
in `Actor._from_parent`, but the `subint_forkserver_proc`
child-target was still passing `spawn_method='trio'` to
`_trio_main`, so `Actor.pformat()` / log lines would report
the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to
`'subint_forkserver'`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Also, some slight touchups in `.spawn._subint`.
Fork-based backends (esp. `subint_forkserver`) can leak child
actor processes on cancelled / SIGINT'd test runs; the zombies
keep the tractor default registry (`127.0.0.1:1616` /
`/tmp/registry@1616.sock`) bound, so every subsequent session
can't bind and 50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document the
pre-flight + post-cancel check as a mandatory step 4.
Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a bound
TCP registry listener; the authoritative check since :1616
is unique to our runtime
- `pgrep -af` scoped to
`$(pwd)/py[0-9]*/bin/python.* _actor_child_main|subint-forkserv`
for leftover actor/forkserver procs; scoped deliberately so
we don't false-flag legit long-running tractor-embedding
apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep using the
same `$(pwd)/py*` pattern, UDS `rm -f`, re-verify) plus a
fallback for when a zombie holds :1616 but doesn't match the
pattern: `ss -tlnp` → kill by PID
- explicit false-positive warning calling out the `piker` case
(`~/repos/piker/py*/bin/python3 -m tractor._child ...`) so a
bare `pgrep` doesn't lead to nuking unrelated apps
Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID from a
prior session, without collateral damage to other
tractor-embedding projects on the same box.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix: running
`tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly leaks
**exactly 5** `subint-forkserv` comm-named child processes
that survive session exit, each holding a `LISTEN` on `:1616`
(the tractor default registry addr), and therefore poisons
every subsequent test session that defaults to that addr.
Deats,
- TL;DR + ruled-out checks confirming the procs are ours (not
piker / other tractor-embedding apps): `/proc/$pid/cmdline`
+ cwd both resolve to this repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped (plain
`os.kill(SIGKILL)` to the direct child), not tree-scoped;
grandchildren spawned during a multi-level cancel test get
reparented to init and inherit the registry listen socket
- proposed fix directions ranked:
(1) put each forkserver-spawned subactor in its own
process-group (`os.setpgrp()` in fork-child) + tree-kill via
`os.killpg(pgid, SIGKILL)` on teardown,
(2) `PR_SET_CHILD_SUBREAPER` on root,
(3) explicit `/proc/<pid>/task/*/children` walk.
Vote: (1), since it is POSIX-standard and aligns w/
`start_new_session=True` semantics in `subprocess.Popen` /
trio's `open_process`
- inline reproducer + cleanup recipe scoped to
`$(pwd)/py314/bin/python.*pytest.*spawn-backend=subint_forkserver`
so cleanup doesn't false-flag unrelated tractor procs
(consistent w/ `run-tests` skill's zombie-check guidance)
Stopgap hygiene fix (wiring `reg_addr` through the 5 leaky
tests in `test_cancellation.py`) is incoming as a follow-up;
that one stops the blast radius, but zombies still accumulate
per-run until the real tree-kill fix lands.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d012196 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the default
`:1616` registry, so any leaked `subint-forkserv` descendant
from a prior test holds the port and blows up every subsequent
run with `TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run picks
its own port; zombies can no longer poison other tests
(they'll only cross-contaminate whatever happens to share
their port, which is now nothing).
Deats,
- add `reg_addr: tuple` fixture param to:
- `test_cancel_infinite_streamer`
- `test_some_cancels_all`
- `test_nested_multierrors`
- `test_cancel_via_SIGINT`
- `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the two
`open_nursery()` calls that previously had no kwargs at all
(in `test_cancel_via_SIGINT` and
`test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')` to
`test_nested_multierrors` so a hung run doesn't wedge the
whole session
Still doesn't close the real leak: the `subint_forkserver`
backend's `_ForkedProc.kill()` is PID-scoped not tree-scoped,
so grandchildren survive teardown regardless of registry port.
This commit is just blast-radius containment until that fix
lands. See
`ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to SIGTERM+SIGKILL,
which hides bugs: tractor is structured concurrent, so
`_trio_main` catches SIGINT as an OS-cancel and cascades
`Portal.cancel_actor` over IPC to every descendant. A graceful
SIGINT therefore exercises the actual SC teardown path; if it
hangs, that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these but turned
out to be a teardown gap in `_ForkedProc.kill()` instead).
Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*`; no sleep yet,
just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using `pgrep` to
poll for exit. Loop breaks early on clean exit
- step 3: `pkill -9` only if graceful timed out, w/ a logged
escalation msg so it's obvious when SC teardown didn't
complete
- step 4: same SIGINT-first ladder for the rare
`:1616`-holding zombie that doesn't match the cmdline
pattern (find PID via `ss -tlnp`, then
`kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify unchanged
Goal: surface real teardown bugs through the test-cleanup
workflow instead of papering over them with `-9`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
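The recipe itself is shell (`pkill` / `pgrep`), but the SIGINT-first ladder can be sketched in Python for a single direct child. This is an assumption-laden translation: the function name `sigint_then_kill()` is hypothetical, and it polls with `os.waitpid(WNOHANG)` where the recipe polls a cmdline pattern with `pgrep`.

```python
import os
import signal
import time


def sigint_then_kill(pid, tries=10, interval=0.3):
    """Graceful-first teardown of a direct child process."""
    # Step 1: ask nicely, so the normal cancel path gets exercised.
    os.kill(pid, signal.SIGINT)
    # Step 2: bounded wait, breaking early on a clean exit.
    for _ in range(tries):
        reaped, _status = os.waitpid(pid, os.WNOHANG)
        if reaped:
            return "graceful"
        time.sleep(interval)
    # Step 3: escalate loudly so a failed graceful teardown is visible.
    print(f"pid {pid}: SIGINT teardown timed out, escalating to SIGKILL")
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return "escalated"
```

An `"escalated"` result is the interesting one: it marks a process whose graceful teardown did not complete in the allotted window and is worth investigating.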
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md` after
empirical investigation revealed the earlier "descendant-leak
+ missing tree-kill" diagnosis conflated two unrelated
symptoms:
1. **5-zombie leak holding `:1616`**: turned out to be a
self-inflicted cleanup bug. `pkill`-ing a bg pytest task
(SIGTERM/SIGKILL, no SIGINT) skipped the SC graceful cancel
cascade entirely. Codified the real fix (SIGINT-first ladder
w/ bounded wait before SIGKILL) in e5e2afb (`run-tests`
SKILL) and `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]` hangs
indefinitely**: the actual backend bug, and it's a deadlock
not a leak.
Deats,
- new diagnosis: all 5 procs are kernel-`S` in
`do_epoll_wait`; pytest-main's trio-cache workers are in
`os.waitpid` waiting for children that are themselves
waiting on IPC that never arrives, so the graceful
`Portal.cancel_actor` cascade never reaches its targets
- tree-structure evidence: asymmetric depth across two
identical `run_in_actor` calls. Child 1 (3 threads) spawns
both its grandchildren; child 2 (1 thread) never completes
its first nursery `run_in_actor`. Smells like a race on
fork-inherited state landing differently per spawn ordering
- new hypothesis: `os.fork()` from a subactor inherits the
ROOT parent's IPC listener FDs transitively. Grandchildren
end up with three overlapping FD sets (own + direct-parent +
root), so IPC routing becomes ambiguous. Predicts bug scales
with fork depth, which matches reality: single-level spawn
works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never reaches
hard-kill path), `:1616` contention (fixed by `reg_addr`
fixture wiring), GIL starvation (each subactor has its own
OS process+GIL), child-side KBI absorption (`_trio_main`
only catches KBI at `trio.run()` callsite, reached only on
trio-loop exit)
- four fix directions ranked:
(1) blanket post-fork `closerange()`,
(2) `FD_CLOEXEC` + audit,
(3) targeted FD cleanup via `actor.ipc_server` handle,
(4) `os.posix_spawn` w/ `file_actions`.
Vote: (3), which is surgical and doesn't break the "no exec"
design of `subint_forkserver`
- standalone repro added
(`spawn_and_error(breadth=2, depth=1)` under
`trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-level-spawn
tests under the backend via
`@pytest.mark.skipon_spawn_backend(...)` until fix lands
Killing the "tree-kill descendants" fix-direction section: it
addressed a bug that didn't exist.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements fix-direction (1)/blunt-close-all-FDs from b71705b
(`subint_forkserver` nested-cancel hang diag), targeting the
multi-level cancel-cascade deadlock in
`test_nested_multierrors[subint_forkserver]`.
The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach, but going
blunt is actually the right call: after `os.fork()`, the child
immediately enters `_actor_child_main()` which opens its OWN
IPC sockets / wakeup-fd / epoll-fd / etc., so none of the
parent's FDs are needed. Closing everything except stdio is
safe AND defends against future listener/IPC additions to the
parent inheriting silently into children.
Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int` helper.
Linux fast-path enumerates `/proc/self/fd`; POSIX fallback
uses `RLIMIT_NOFILE` range. Matches the stdlib
`subprocess._posixsubprocess.close_fds` strategy. Returns
close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s post-fork
child prelude: runs immediately after the pid-pipe
`os.close(rfd/wfd)`, before the user `child_target` callable
executes
- docstring cross-refs the diagnosis doc + spells out the
FD-inheritance-cascade mechanism and why the close-all
approach is safe for our spawn shape
Validation pending: re-run
`test_nested_multierrors[subint_forkserver]` to confirm the
deadlock is gone.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
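A minimal sketch of such a helper, under assumptions: the public name `close_inherited_fds()`, the 4096 ceiling used when the soft limit is unlimited, and the error handling are all hypothetical; only the `/proc/self/fd` fast path, the `RLIMIT_NOFILE` fallback, the `keep` default and the returned close-count come from the commit message.

```python
import os
import resource


def close_inherited_fds(keep=frozenset({0, 1, 2})):
    """Close every open FD not in `keep`; return how many were closed.

    Meant to run in a freshly forked child, before it opens its own FDs.
    """
    try:
        # Linux fast path: list exactly the FDs that are open.
        candidates = [int(name) for name in os.listdir('/proc/self/fd')]
    except OSError:
        # POSIX fallback: walk the whole soft-limit range.
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft == resource.RLIM_INFINITY:
            soft = 4096  # assumed ceiling for this sketch
        candidates = range(soft)

    closed = 0
    for fd in candidates:
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            # Not open (or already gone, like the listdir() handle).
            continue
        closed += 1
    return closed
```

Calling this in the parent process would close its own sockets and pipes, so it only makes sense on the child side of a fork.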
Two coordinated improvements to the `subint_forkserver`
backend:
1. Replace
`trio.to_thread.run_sync(os.waitpid, ..., abandon_on_cancel=False)`
in `_ForkedProc.wait()` with
`trio.lowlevel.wait_readable(pidfd)`. The prior version
blocked a trio cache thread on a sync syscall, so outer
cancel scopes couldn't unwedge it when something downstream
got stuck. Same pattern `trio.Process.wait()` and
`proc_waiter` (the mp backend) already use.
2. Drop the `@pytest.mark.xfail(strict=True)` from
`test_orphaned_subactor_sigint_cleanup_DRAFT`: the test now
PASSES after 0cd0b63 (fork-child FD scrub). Same root cause
as the nested-cancel hang: inherited IPC/trio FDs were
poisoning the child's event loop. Closing them lets SIGINT
propagation work as designed.
Deats,
- `_ForkedProc.__init__` opens a pidfd via
`os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`, then
non-blocking `waitpid(WNOHANG)` to collect the exit status
(correct since the pidfd signal IS the child-exit
notification)
- `ChildProcessError` swallow handles the rare race where
someone else reaps first
- pidfd closed after `wait()` completes (one-shot semantics) +
`__del__` belt-and-braces for unexpected-teardown paths
- test docstring's `@xfail` block replaced with a `# NOTE`
comment explaining the historical context + cross-ref to the
conc-anal doc; test remains in place as a regression guard
The two changes are interdependent: the cancellable `wait()`
matters for the same nested-cancel scenarios the FD scrub
fixes, since the original deadlock had trio cache workers
wedged in `os.waitpid` swallowing the outer cancel.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
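The pidfd pattern can be illustrated without trio by substituting `select.select()` for `trio.lowlevel.wait_readable()`. That swap, the function name `wait_via_pidfd()`, and the blocking-`waitpid` fallback for platforms without pidfd support are assumptions of this sketch, not part of the commit.

```python
import os
import select


def wait_via_pidfd(pid, timeout=None):
    """Wait for child `pid` via a pidfd; return its exit code, or None
    on timeout or when another waiter reaped it first."""
    try:
        pidfd = os.pidfd_open(pid)  # Linux 5.3+, Python 3.9+
    except (AttributeError, OSError):
        # No pidfd support here: fall back to a plain blocking wait.
        _, status = os.waitpid(pid, 0)
        return os.waitstatus_to_exitcode(status)
    try:
        # The pidfd turns readable once the child has exited.
        readable, _, _ = select.select([pidfd], [], [], timeout)
        if not readable:
            return None
        try:
            _, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            return None  # someone else reaped first
        return os.waitstatus_to_exitcode(status)
    finally:
        os.close(pidfd)  # one-shot: release the descriptor
```

The point of the pattern is that the blocking step is a readiness wait on a file descriptor, which an event loop can cancel, rather than a `waitpid()` call parked on a thread.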
Completes the nested-cancel deadlock fix started in 0cd0b63
(fork-child FD scrub) and fe540d0 (pidfd-cancellable wait). The
remaining piece: the parent-channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill it prematurely),
and relies on EOF arriving when the parent closes the socket to exit
naturally.
Under exec-spawn backends (`trio_proc`, mp) that EOF arrival is
reliable: parent's teardown closes the handler-task socket
deterministically. But fork-based backends like `subint_forkserver`
share enough process-image state that EOF delivery becomes racy: the
loop parks waiting for an EOF that only arrives after the parent
finishes its own teardown, but the parent is itself blocked on
`os.waitpid()` for THIS actor's exit. Mutual wait → deadlock.
Deats,
- `async_main` stashes the cancel-scope returned by
`root_tn.start(...)` for the parent-chan `process_messages` task onto
the actor as `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after `ipc_server.cancel()` +
`wait_for_shutdown()`) calls `self._parent_chan_cs.cancel()` to
explicitly break the shield: no more waiting for EOF delivery,
unwinding proceeds deterministically regardless of backend
- inline comments on both sites explain the mutual-wait deadlock + why
the explicit cancel is backend-agnostic rather than a
forkserver-specific workaround
With this + the prior two fixes, the `subint_forkserver` nested-cancel
cascade unwinds cleanly end-to-end.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:
1. Skip-mark the test via
`@pytest.mark.skipon_spawn_backend('subint_forkserver',
reason=...)` so it stops blocking the test
matrix while the remaining bug is being chased.
The reason string cross-refs the conc-anal doc
for full context.
2. Update the conc-anal doc
(`subint_forkserver_test_cancellation_leak_issue.md`) with the
empirical state after the three nested-cancel fix commits
(`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
shield break) landed, narrowing the remaining hang from "everything
broken" to "peer-channel loops don't exit on `service_tn` cancel".
Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
`_parent_chan_cs.cancel()` wiring from `57935804`
works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
channel handlers in `handle_stream_from_peer`
(inbound connections handled by
`stream_handler_tn`, which IS `service_tn` in the
current config).
- after `_parent_chan_cs.cancel()` fires, NEW
shielded loops appear on the session reg_addr
port — probably discovery-layer reconnection;
doesn't block teardown but indicates the cascade
has more moving parts than expected.
The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in `subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-channel-loop hang:
1. **"Attempted fix (DID NOT work) — hypothesis (3)"**: tried
sync-closing peer channels' raw socket fds from `_serve_ipc_eps`'s
finally block (iterate `server._peers`,
`_chan._transport.stream.socket.close()`). Theory was that sync close
would propagate as `EBADF` / `ClosedResourceError` into the stuck
`recv_some()` and unblock it. Result: identical hang. Either trio holds
an internal fd reference that survives external close, or the stuck
recv isn't even the root blocker. Either way: ruled out, experiment
reverted, skip-mark restored.
2. **"Aside: `-s` flag changes behavior for peer-intensive tests"**:
noticed `test_context_stream_semantics.py` under `subint_forkserver`
hangs with default `--capture=fd` but passes with `-s`
(`--capture=no`). Working hypothesis: subactors inherit pytest's
capture pipe (fds 1,2, which `_close_inherited_fds` deliberately
preserves); verbose subactor logging fills the buffer, writes block,
deadlock. Fix direction (if confirmed): redirect subactor
stdout/stderr to `/dev/null` or a file in `_actor_child_main`. Not a
blocker on the main investigation; deserves its own mini-tracker.
Both sections are diagnosis-only: no code changes in this commit.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens; easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.
Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
swallow now logs the fd that failed to close. The comment notes the
expected failure modes (already-closed-via-listdir-race,
otherwise-unclosable); both still fine to ignore semantically, but
worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
`os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
failure separately before raising the `worker thread didn't return`
RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
`_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
log the full call signature (interp_id + bootstrap) along with the
captured exception, before stashing into `err` for the outer caller.
Behavior unchanged; only adds observability.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy
commented
Apr 24, 2026
# Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands
Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
update this to the new subints compat issue we make @ msgspec.
Rework reap/diag tooling to identify tractor sub-actors via intrinsic
proc signals (cmdline/comm markers from `setproctitle`) instead of
env-var or cwd matching.
Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
`tractor._child` markers, falls back to `/proc/<pid>/comm` for
zombie-resilient detection (kernel preserves `comm` past exit until
reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`,
the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
`/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
`_is_tractor_subactor()` for intrinsic sub-actor ID instead of cwd
coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
zombie-state sub-actors
Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()`: reads `/proc/<pid>/cgroup` to
distinguish `system.slice` services vs `user.slice` desktop apps from
genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
`user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack; assumes
tractor importable from active venv
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Comment/docstring updates: `subint_forkserver` is a clean `NotImplementedError` stub — not an alias to variant-1 (`main_thread_forkserver`). Key reserved in-place (not aliased) so the subint-hosted-child impl can flip without API churn once jcrist/msgspec#1026 unblocks PEP 684 subints. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Add module-level `pytestmark` applying per-test
`reap_subactors_per_test`, `track_orphaned_uds_per_test`, and
`detect_runaway_subactors_per_test` fixtures; registrar tests stress
discovery roundtrips that historically left orphaned UDS sock-files.
Deats,
- drop unused `say_hello()` fn, keep only `say_hello_use_wait`; rename
param `func` -> `ria_fn`.
- use `@tractor_test(timeout=7)` instead of separate
`@pytest.mark.timeout(7, method='thread')` decorator.
- add `with_timeout()` helper, wire into
`test_subactors_unregister_on_cancel_remote_daemon`.
- uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use
configurable `timeout` var + `debug_mode` guard for `tractor.pause()`
on cancel.
- `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`.
- fix typo "oustanding" -> "outstanding".
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>`; useful for projects like `piker` that bind
sibling sub-dirs alongside tractor's default. Full paths still work
as-is.
Also,
- rename "unparseable" section to "non-tractor" with clearer desc
(filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for non-tractor socks so
callers can manually resolve listener-PID liveness
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Forking spawner + UDS transport has different timing vs `trio_proc`:
streaming example completes faster in some cases, slower in others
depending on fork overhead + sock setup.
Deats,
- add `expect_cancel` param to `cancel_after()`, raise
`ActorTooSlowError` when cancel scope fires unexpectedly instead of
silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
`ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
(cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.
Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
`is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
places).
- type hints, kwarg-style `partial()` calls.
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md` documenting
the spawn-time rc=2 race under rapid same-name spawning against a
forkserver + registrar; the `wait_for_peer_or_proc_death` helper now
surfaces the death instead of parking forever on the handshake wait.
Also,
- extract inline `xfail` into module-level `_DOGGY_BOOT_RACE_XFAIL`
marker.
- apply it to `n_dups=8` too (previously bare) bc larger N widens the
race window enough to fire occasionally.
- link to tracking issue #456.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `capture` dimension to CI matrix so fork-based
backends run `--capture=sys` (fork-child × `--capture=fd`
is a known deadlock). Non-fork backends keep `fd`.
Deats,
- two `include:` rows for `main_thread_forkserver` on
linux py3.13: tcp + uds, both `capture: 'sys'`
- job name updated to show `capture=` mode
- timeout bumped 16 -> 20 min to accommodate the
additional matrix cells
- `--capture=${{ matrix.capture }}` replaces hardcoded
`--capture=fd`
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Split the old `live`/`orphans` sock classification into three
ppid-aware buckets: `live-active` (PID alive, parent owns it),
`orphaned-alive` (PID alive but `ppid==1`, init-adopted, an `acli.reap`
candidate), and `orphaned-dead` (PID gone, sock stale).
Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
PID, handles the tricky `(comm)` field (can contain spaces/parens) by
splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both alive-orphan graceful
cancel + dead-sock cleanup in one shot; manual `rm` kept as fallback.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
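The "split from last `)`" trick for `/proc/<pid>/stat` could look roughly like this. `parse_ppid()` and `ppid_of()` are hypothetical names for this sketch; the real `_ppid()` helper may be shaped differently.

```python
import os


def parse_ppid(stat_line):
    """Extract the parent PID from one `/proc/<pid>/stat` line."""
    # Layout: `pid (comm) state ppid ...`. `comm` may itself contain
    # spaces or parens, so split on the LAST `)` and count from there.
    _head, _sep, tail = stat_line.rpartition(')')
    fields = tail.split()
    return int(fields[1])  # fields[0] is the state letter


def ppid_of(pid):
    try:
        with open(f'/proc/{pid}/stat') as f:
            return parse_ppid(f.read())
    except (OSError, ValueError, IndexError):
        return None  # proc vanished in-flight or line malformed
```

A naive `line.split()[3]` breaks as soon as a process name contains a space, which is why the split anchors on the closing paren instead.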
Outer `signal.alarm` cap that fires even when trio's `fail_after` is
blocked by a shielded-await deadlock (the bug-class-3 hang under MTF
backends). Only armed for fork-based spawners where the bug lives.
Deats,
- `_DIAG_CAP_S = fail_after_s + 5`: slightly larger than the
trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).
Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
TODO re `pytest.raises()`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
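A minimal sketch of such an outer cap, assuming it runs on the main thread of a Unix process. `alarm_cap()` and `AlarmCapTimeout` are hypothetical names for illustration; the 4s / +5s numbers are the ones quoted in the commit message.

```python
import signal
import time
from contextlib import contextmanager


class AlarmCapTimeout(Exception):
    """Hypothetical exc type: raised when the outer cap fires."""


@contextmanager
def alarm_cap(seconds):
    """Arm `signal.alarm()` around a block; main thread only."""
    def _on_alarm(signum, frame):
        raise AlarmCapTimeout(f'outer cap of {seconds}s fired')

    prev = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # disarm
        signal.signal(signal.SIGALRM, prev)


fail_after_s = 4  # in-band trio guard for fork backends
diag_cap_s = fail_after_s + 5  # outer cap loses when the guard works
```

Because the handler raises from signal context, it interrupts the main thread even if every trio task is parked behind a shield, which is the case the in-band `fail_after` cannot cover.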
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
`trio.fail_after(2)` + commented `capfd.disabled()` investigation
(pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on
fork-spawner hang prevention.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `skipon_spawn_backend('subint')`: expand reason with specific
analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
blame-attribute UDS sock-file orphans left by SIGKILL cancel
cascades.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
For forking spawner backends that is. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner` param,
wrap `main()` in `fa_main()` with per-backend `trio.fail_after` (4s
fork / 1s trio) to cap cancel-cascade teardown that compounds under
forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param for
`is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so KBI firing
before `open_context` body doesn't `UnboundLocalError`, add
`pytest.fail` guard for the spawn-time IPC race case, arm
`signal.alarm` AFK-safety cap (10s) under fork backends
Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
`detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
place).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable from
both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.
Deats,
- `_testing/trace.py`: new module (1171 lines) with proc-tree walker,
hung-state dumper, bindspace scanner, `dump_all()` snapshot archiver,
`AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM (trio
`fail_after` + auto-snapshot on `TooSlowError`),
`afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
`SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via
`from .trace import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
`_testing.trace`; drops ~627 lines of inline impl. Add
`acli.dump_all` alias for full snapshot-bundle CLI access.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
`tractor._child` procs NOT in the walk's `seen` set; surfaces ghost
subactor trees from prior test runs (cross-test launchpad
contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification into
`_classify_walk()` closure, walk stray roots as additional trees, emit
stray-root summary in header. Also: `tractor._child` procs reparented
to init are now always classified as orphans regardless of
cgroup-slice (leaked subactor ≠ desktop-launched app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
`--capture=sys` redirection so snapshot paths always land on the real
terminal
- `fail_after_w_trace()`: capture diag snapshot on
non-`TooSlowError` exceptions when the `fail_after` scope's cancel had
already fired (e.g. nursery wraps `Cancelled` into a
`BaseExceptionGroup` that escapes before `TooSlowError` can be
raised).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session-scoped list
populated by `_do_capture_snapshot()` on each successful dump; add TODO
for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that prints
all captured snapshot dirs at end-of-session so paths don't get buried
in scrollback
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`reap(include_descendants=True)` now expands each orphan-root pid into its full psutil subtree before delivering SIGINT, so a multi-level leaked actor-tree gets torn down in a single pass instead of requiring repeated calls (each pass kills the current `ppid==1` level, the level below becomes init-adopted, etc.). Falls back to the original flat `pids` list when `psutil` is unavailable. Emits a log line when expansion adds descendant pids. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
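The subtree expansion might be sketched as below. `expand_with_descendants()` is a hypothetical helper name, and the de-duplication is an assumption of this sketch; only the `psutil` subtree walk and the flat fallback come from the commit message.

```python
def expand_with_descendants(pids):
    """Return `pids` plus every live descendant, roots first, no dups."""
    try:
        import psutil
    except ImportError:
        # Flat fallback when `psutil` is unavailable.
        return list(dict.fromkeys(pids))

    expanded = []
    seen = set()
    for pid in pids:
        try:
            family = [pid] + [
                child.pid
                for child in psutil.Process(pid).children(recursive=True)
            ]
        except psutil.Error:
            family = [pid]  # gone or inaccessible; still try the root
        for member in family:
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded
```

Collecting the whole subtree before signalling matters because once a mid-tier parent dies its children are reparented to init and no longer show up under the original root.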
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs that appeared during the test — leaked subactors whose mid-tier parent died during cascade teardown, reparenting them to init. Pre-yield snapshot of existing orphans scopes reap to THIS test's leaks only, avoiding reap of unrelated tractor uses (piker, etc.) on the box. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Append "Snapshot evidence (2026-05-13)" section to `cancel_cascade_too_slow_under_main_thread_forkserver_issue.md` documenting `fail_after_w_trace` diag capture results for `test_nested_multierrors` under the MTF backend — reproduction cmd, ptree analysis, observed hang signature, and updated triage plan. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
`track_orphaned_uds_per_test`,
`detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
MTF `xfail(strict=False)` with detailed race-window comment
explaining the BEG shape mismatch, wrap body in
`fail_after_w_trace` with per-backend timeout budget, bump
`@tractor_test(timeout=10)`, drop old multiprocessing depth
special-casing
- `test_multierror_fast_nursery`: wrap in
`fail_after_w_trace(30.0)`, accept `TooSlowError` in
`pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
`spawn_backend` param for `is_forking_spawner`, widen
`fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
`test_cancel_infinite_streamer`, `test_some_cancels_all`: add
`set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
covered by module-level `pytestmark`)
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.
Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync CM,
captures on `SIGALRM`), only armed under fork backends. Extracts
`_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes while
diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
`afk_alarm_w_trace`).
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Only flag `tractor._child` procs as cross-test ghosts of THIS run if
`ppid==1` (init-adopted real leak) or `ppid` is in the walk's `seen`
set (descendant we missed via race). Previously, procs whose `ppid`
pointed to some OTHER live non-`pytest` process (when run as
`acli.ptree pytest`) were being falsely flagged as cross-test ghosts,
even though they belong to a different tractor app (`piker`, another
`pytest` shell, a long-running tractor daemon).
Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`, short-circuit on
`None` (proc died in-flight).
- expand module docstring to spell out the ownership filter rule + its
rationale.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Per-terminal optimized `watch`-like xonsh alias that runs an arbitrary
callable alias in a loop inside the alt-screen buffer with flicker-free
repaint. Supersedes the inline `acli.ptree` polling .xsh snippet
(removed from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).
Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
(`\033[?25l/h`) wrapped in try/finally so Ctrl-C always returns to a
pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line EL (`\033[K`
before each `\n`) + post-draw erase-down (`\033[J`) → stale tail chars
from a longer prior frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next frame does a full
clear (`\033[H\033[2J`) instead of the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler` so
`KeyboardInterrupt` lands cleanly; prior handler restored on exit.
- Output capture: redirect the alias's stdout to `StringIO` per frame
so we can post-process the EL fix. Aliases writing directly to
`sys.stdout.buffer` / `os.write(1)` bypass capture; EL-fix won't apply
but loop still works.
- Alias unwrap: xonsh stores callables as either a bare callable OR
`[fn, *preset_args]`. Both forms handled; subprocess-style aliases
rejected w/ a friendly err msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest of argv
forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.
Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch` entry.
- Top-level imports: `os`, `sys`, `signal`, `time`, `typing.Callable`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
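The per-frame repaint described above boils down to string post-processing on the captured output. A small sketch, where `render_frame()` and its `resized` flag are hypothetical names; only the escape sequences are taken from the commit message.

```python
HOME = '\033[H'            # cursor to top-left
EL = '\033[K'              # erase from cursor to end of line
ERASE_DOWN = '\033[J'      # erase from cursor to end of screen
FULL_CLEAR = '\033[H\033[2J'


def render_frame(text, resized=False):
    """Wrap one captured frame in the repaint escape sequences."""
    prefix = FULL_CLEAR if resized else HOME
    # Clear the tail of every line before its newline so leftovers
    # from a longer previous frame don't show through.
    body = ''.join(line + EL + '\n' for line in text.splitlines())
    return prefix + body + ERASE_DOWN
```

Writing the result with a single `sys.stdout.write()` + `flush()` per frame keeps the terminal from ever showing a half-drawn screen.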
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone tests so
on-timeout we get a fresh `ptree`/`wchan`/`py-spy` diag snapshot on
disk instead of opaque pytest timeout-kills. Same shape as bd07a95 for
`test_dynamic_pub_sub`.
Deats,
- `test_echoserver_detailed_mechanics`:
  * inner `trio.fail_after` → `fail_after_w_trace`. Adds
  `fail_after_w_trace: FailAfterWTraceFactory` fixture param.
  * mv per-backend `timeout` calc to top of test body (was interleaved
  w/ helper defs).
  * factor deep `open_nursery`/`open_context`/`open_stream` body into
  `_body()` so the wrapping `main()` stays a 2-liner; keeps the
  nested-CM block at its natural indent level instead of pushing it
  under yet another `async with`.
  * drop `with_timeout: bool` knob + `fa_main()` helper (knob was
  hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
  * outer `signal.alarm`/`try`/`finally` → single
  `afk_alarm_w_trace(10)` CM. Adds
  `afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture param.
  * drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
  * explanatory comment refreshed to mention `AFKAlarmTimeout` + the
  disk-snapshot side effect.
Other,
- Drop debug `return 1e3` short-circuit from `delay()` fixture; snuck
in as a scratch line, was clobbering the proper `debug_mode`-branched
return.
- Top-level import: `FailAfterWTraceFactory`, `AfkAlarmWTraceFactory`
from `tractor._testing.trace`.
(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `subint_forkserver` spawn backend (#379 follow-on)
(i can't believe how fast we vibed this 😂 )
Motivation
Stacked on top of PR #446 / `subint_fork_backend`: that
branch laid the `_subint_fork.py` stub and the post-fork-CPython
exploration showing that `os.fork()` from a non-main subinterpreter
aborts the child at the CPython layer. This branch operationalizes
the workaround sketched there (fork from a regular
`threading.Thread` attached to the main interpreter, one that has
never entered a subint) and wires it through tractor as a
first-class spawn backend.
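For orientation, the fork-from-a-worker-thread shape can be sketched with the stdlib alone. This is a simplified illustration, not the backend's `fork_from_worker_thread()`: the signature, the pid-pipe protocol details and the thread name are assumptions of this sketch.

```python
import os
import threading


def fork_from_worker_thread(child_target):
    """Fork from a plain `threading.Thread`; return the child's pid."""
    rfd, wfd = os.pipe()

    def _worker():
        try:
            pid = os.fork()
            if pid == 0:
                # Child: only this thread survives the fork.
                os.close(rfd)
                os.close(wfd)
                rc = 1
                try:
                    rc = int(child_target() or 0)
                finally:
                    os._exit(rc)  # never unwind into the parent's stack
            # Parent: report the pid back over the pipe.
            os.write(wfd, str(pid).encode())
        finally:
            os.close(wfd)

    worker = threading.Thread(target=_worker, name='forkserver-worker')
    worker.start()
    data = os.read(rfd, 32)
    os.close(rfd)
    worker.join()
    return int(data)
```

A caller would then reap the child itself, e.g. `os.waitpid(pid, 0)`. Recent CPython versions may emit a `DeprecationWarning` about forking a multi-threaded process here; the sketch does nothing to suppress it.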
The implementation looks tiny (one new module) but the supporting
work was the bulk of the patch series: multiple cancel-cascade hangs
surfaced under fork-based teardown that didn't exist for any
exec-based backend (`trio`, `mp_*`). The shared-memory image
inherited across `os.fork()` makes parent↔child socket-EOF delivery
racy, which exposed a latent `process_messages`-shielded-loop
deadlock; once cracked, that fix benefits every backend. A separate
`pytest --capture=fd` × fork-child interaction was traced to a
`test_nested_multierrors` cancel-cascade hang gated on the capture
mode; tracked at #449 and worked-around by defaulting the suite to
`--capture=sys`.
Late in the series the backend was split into two clearly-named
variants: variant 1, `main_thread_forkserver`, is the
working backend that ships today (forks from a regular main-interp
worker thread, child runs trio on its own main interp; NO
subinterpreter anywhere); variant 2, `subint_forkserver`,
is reserved as a placeholder for the future subint-isolated child
runtime, gated on jcrist/msgspec#1026 (PEP 684 isolated-mode
support). Today the `'subint_forkserver'` spawn-method key
dispatches to a `NotImplementedError` stub that points operators at
the variant-1 key. The "subint" prefix on both modules is
family-naming: they live alongside `_subint.py` / `_subint_fork.py`
from the broader #379 series.
The two `threading.Thread` primitives in `_main_thread_forkserver`
are deliberately heavy-handed (full ad-hoc threads, not
`trio.to_thread.run_sync`) to side-step legacy-config-subint GIL
starvation; once `msgspec` lands PEP 684 support and we can use
isolated subints, that constraint relaxes; auditable revisit
tracked at #450. Also bundled: a `tractor-reap` zombie-subactor
cleanup CLI + `_testing._reap` shared impl + session-scoped autouse
fixture, so a mid-teardown timeout no longer leaves orphan
subactors competing for ports across test sessions; a follow-up
commit extends `tractor-reap` with a `--shm` mode that sweeps
orphaned `/dev/shm/*` segments owned by the current uid that no
live process is mapping or holding open.
Src of research
The following provide info on why/how this impl makes sense,
- `fork()` can be hacked now?". The "Our own thoughts" section
sketches the worker-thread fork pattern that this backend
implements.
- `subint_fork_backend` block and the `_subint_fork.py` stub
returning `NotImplementedError`.
- `Py_mod_multiple_interpreters`. Until `msgspec` adopts the slot
we're stuck on legacy-config subints, which forces our heavier thread
design (see Audit `subint_forkserver` thread constraints once msgspec
PEP 684 lands #450).
- `msgspec` PEP 684 isolated-mode tracker; gates variant-2.
- `ai/conc-anal/subint_fork_from_main_thread_smoketest.py`
(`control_subint_thread_fork`, `main_thread_fork`,
`worker_thread_fork`, `full_architecture`): pre-tractor proof
that the workaround is sound.
Module-level design docs (read these top-of-file docstrings
for the per-backend architectural justifications, fork-semantics
analysis, and migration plans; much richer than the
high-level summary in this PR description):
- `tractor/spawn/_main_thread_forkserver.py` Design rationale (why a
forkserver + why in-process), What survives the fork? (POSIX
semantics), FYI: how this dodges the `trio.run()` × `fork()` hazards,
Implementation status, Still-open work, TODO gated on msgspec PEP 684.
- `tractor/spawn/_subint_forkserver.py` would buy us (3 wins: cheaper
forks, true parallelism, multi-actor-per-process), what lives here
today (`run_subint_in_worker_thread`), what will live here when
variant 2 ships.
- `tractor/spawn/_subint.py` `subint` backend (parent of this stack,
PR A subinterpreter-in-thread spawning backend #446). Why we use the
private `_interpreters` C module instead of
`concurrent.interpreters`'s public 'isolated' API; py3.14+ feature
gate rationale; msgspec PEP 684 migration path.
- `tractor/spawn/_subint_fork.py` Pointers to the CPython-source-line
analysis in `subint_fork_blocked_by_cpython_post_fork_issue.md`.
Summary of changes
By chronological commit,
(82332fbc) Lift the validated fork primitives into
tractor.spawn._subint_forkserver:fork_from_worker_thread()+run_subint_in_worker_thread()as the two re-usable buildingblocks.
- (25e400d5) Add trio-parent integration tests covering tier-1 (primitives driven from inside `trio.run()`) and tier-2 (full backend wired through `open_root_actor` + `open_nursery`).
- (cf2e71d8) Document the PEP 684 audit-plan under `ai/conc-anal/subint_forkserver_thread_constraints_on_pep684_issue.md`; the upstream-gated cleanup work is tracked at "Audit `subint_forkserver` thread constraints once msgspec PEP 684 lands" #450.
- (26914fde) Wire `'subint_forkserver'` as a first-class `SpawnMethodKey` and `_methods` registry entry; the `try_set_start_method` case re-uses the subint-family py3.14+ gate.
- (63ab7c98) (7804a9fe) Reset post-fork `_state` in the forkserver child via a new pure `get_runtime_vars(clear_values=True)` + sibling `set_runtime_vars()` API; without the reset the child inherits the parent's `_is_root=True` and trips `Actor._from_parent()` on the `SpawnSpec` handshake.
- (76605d56) (dcd5c1ff) (a72deef7) Add a DRAFT orphan-SIGINT test scaffold + `child_sigint` modes; refine the diagnosis: the hang is NOT a missing handler, trio's loop stays wedged in `epoll_wait` despite delivery. Full trace + fix dirs in `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`.
- (f5f37b69) (5e85f184) (8bcbe730) (e31eb8d7) Shorten timeouts in forkserver suites, drop dead f-string prefixes, enable `debug_mode` for the forkserver path (added to a new `_DEBUG_COMPATIBLE_BACKENDS` list in `_root`), and label the forkserver child in log attribution.
- (1e357dcf) Mv `test_subint_cancellation.py` into the new `tests/spawn/` subpkg alongside the forkserver test module.
- (d093c319) (70d58c4b) Teach the `/run-tests` skill a zombie-actor post-run check + a SIGINT-first graceful cleanup ladder. Per the SC-discipline rule: graceful cancel before SIGKILL.
- (e3f4f5a3) (1af21210) Add the test-cancellation leak doc (`subint_forkserver_test_cancellation_leak_issue.md`) and wire the `reg_addr` fixture through the leaky cancel tests so each run gets a unique registrar address.
- (35da8089) (9993db01) Refine the nested-cancel hang diagnosis; add a post-fork FD scrub in the fork-child prelude as the current workaround.
- (c20b05e1) Use `pidfd` for cancellable `_ForkedProc.wait()`: replaces a blocking `os.waitpid()` with a trio-cancellable poll.
- (8ac3dfeb) Break the parent-channel shield in `Actor.cancel()` teardown via a captured `_parent_chan_cs`. Without this, the shielded `process_messages` loop parks on EOF that only arrives AFTER the parent tears down; under fork backends the parent is itself blocked on this child's exit. That is a mutual-wait deadlock; the explicit cancel makes teardown deterministic regardless of backend.
- (506617c6) (ab86f761) (458a35cf) (7cd47ef7) Skip-mark + narrow the cancel hang; surface silent failures in the forkserver child; doc ruled-out fixes + the capture-pipe aside.
- (76d12060) Claude-perms tweak so `/commit-msg` outputs can be written.
- (4106ba73) (eceed29d) (4c133ab5) Pin the forkserver hang to `pytest --capture=fd` ("`subint_forkserver`: `test_nested_multierrors` cancel-cascade hang gated by pytest `--capture=fd`" #449); codify the capture-pipe-hang lesson in skills; default `pytest` to `--capture=sys` in `pyproject.toml` with the trade-off rationale inlined.
- (e312a68d) (4d055543) Bound the peer-clear wait in `async_main`'s `finally` (3s `move_on_after`) and narrow the forkserver hang to the `async_main` outer tn; this is load-bearing for backend-agnostic teardown determinism.
- (d6e70e9d) Import-or-skip `.devx.` tests requiring `greenback`; keeps the suite collectable without the optional dep.
- (b350aa09) Wire `reg_addr` through infected-`asyncio` tests for parallel-run isolation.
- (2ca0f41e) (44bdb169) Skip `test_loglevel_propagated_to_subactor` on the forkserver backend too; tighten the orphan-SIGINT xfail to `strict=True`.
- (eae478f3) (6d76b604) Add `tractor._testing._reap` (SC-polite SIGINT-first reap, descendant `_reap_orphaned_subactors` fixture); add the `tractor-reap` CLI (`scripts/tractor-reap`) wrapping the same impl.
- (c99d475d) (aa3e2309) Fix `mp.SharedMemory` under fork-without-exec: `tractor.ipc._mp_bs.disable_mantracker(force_disable=True)` is now the default (belt+suspenders: no-op `ManTracker` monkey-patch `track=False`); `_shm.open_shm_list` always wires the `unlink` lifetime callback (was 3.12-and-below only); document the incompat in `ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md`.
- (4f12d69b) Extend `tractor-reap` with `--shm` (and `--shm-only`) modes that sweep orphaned `/dev/shm/<key>` segments owned by the current uid with no live process mapping or holding them open. Match-criteria via `psutil.Process.memory_maps()` + `.open_files()`: kernel-canonical, no reliance on tractor-specific shm-key naming, so unrelated apps' segments are always preserved. Adds `psutil>=7.0.0` to the `testing` dep group.
- (65fcfbf2) (9b05f659) (66f1941f) Bump `test_stale_entry_is_deleted` timeout to 30s; wire `test_dynamic_pub_sub` to standard fixtures; wire `reg_addr` into `test_context_stream_semantics`.
- (54561959) Surface subint bootstrap excs in `_subint.subint_proc()` (`try/except BaseException` + `log.exception(...)` around `_interpreters.exec()`); also log `_interpreters.is_running(interp_id)` on hard-kill timeout to disambiguate "thread leaked, subint already done" from "thread alive bc subint is wedged". `?TODO` notes the anyio-borrow path for re-raising bootstrap excs in the parent task and migrating to `_interpreters.set___main___attrs()` for non-literal SpawnSpec args.
- (3ab99d55) (4b5176e2) Major module-docstring expansion for `_subint_forkserver`: design rationale (in-process forkserver vs. `mp.forkserver`'s sidecar; why a forkserver at all vs. forking from a trio task), POSIX/trio fork mechanics, what survives the fork boundary, and the future-subint payoffs (cheaper forks, true parallelism via per-interp GIL, multi-actor-per-process). Bump the gated msgspec link from `#563` → `#1026`.
- (99dade0f) Extract the truly-generic main-interp-worker-thread fork primitives (`fork_from_worker_thread`, `_close_inherited_fds`, `_ForkedProc`, `wait_child`, `_format_child_exit`) into a sibling `_main_thread_forkserver.py` module; the primitive layer is now honestly named (none of these helpers touch a subint). Re-exports preserved.
- (57dae0e4) Split the backend into variant 1 + variant 2 modules. Variant 1 (`main_thread_forkserver`) becomes the canonical working impl: new `SpawnMethodKey` literal, `_methods` dispatch entry, `Actor._from_parent()` match-arm, `main_thread_forkserver_proc()` spawn-coro stamping its own `SpawnSpec` / log lines. Variant 2 (`subint_forkserver`) shrinks to a placeholder describing the future subint-isolated child runtime gated on "Port to new `concurrent.interpreters` support in `cpython` 3.14+" jcrist/msgspec#1026; legacy `'subint_forkserver'` key still aliases to variant-1 here (flipped to `NotImplementedError` in the next commit).
- (5e83881f) Reduce `_subint_forkserver.py` to its variant-2 placeholder shape: add `subint_forkserver_proc` async stub raising `NotImplementedError` with a redirect msg pointing at `main_thread_forkserver` + "Port to new `concurrent.interpreters` support in `cpython` 3.14+" jcrist/msgspec#1026 + "Trying out sub-interpreters (subints), maybe `fork()` can be hacked now?" #379. Flip the `_methods` registry to dispatch the stub directly so `--spawn-backend=subint_forkserver` errors cleanly. Drop dead module-scope (`ChildSigintMode`, `_DEFAULT_CHILD_SIGINT`, unused imports).
- (9f0709ee) Rename `tests/spawn/test_subint_forkserver.py` → `test_main_thread_forkserver.py`; migrate test/smoketest imports to `tractor.spawn._main_thread_forkserver`; orphan-harness subprocess argv flipped to `'main_thread_forkserver'`. Drop the variant-2 module's backward-compat re-exports of fork primitives.
- (205382a3) Sweep `subint_forkserver` → `main_thread_forkserver` in remaining string-match refs: `_DEBUG_COMPATIBLE_BACKENDS`, `test_loglevel_propagated_to_subactor`'s capfd-skip, `test_sigint_closes_lifetime_stack`'s xfail, comment/docstring refs across `_runtime`, `_state`, `_testing.pytest`, `_subint`, `pyproject.toml`, `test_cancellation`, `test_registrar`. Drop the `test_shm.py` "broken on `main_thread_forkserver`" skip-mark; the `_mp_bs` + `_shm` fixes make those tests pass.
- (cbdf1eb6) Add `test_subint_forkserver_key_errors_cleanly` regression guard pinning the variant-2 reservation contract: the `'subint_forkserver'` key MUST raise `NotImplementedError` (not silently dispatch to variant-1), and the error msg must surface both the working-backend pointer (`main_thread_forkserver`) + the upstream blocker (`msgspec#1026`).
- (7c5dd4d0) Fix `_testing.addr.get_rando_addr` cross-process collisions: the `_rando_port: str = random.randint(...)` default-arg expression was evaluated ONCE at module-import, making it a per-process singleton. Two parallel pytest sessions had a 1/9000 birthday-pair chance of cascade-failing every `reg_addr`-using test. Switch to per-call `random.randint()` salted with `os.getpid()`; drop the bogus `: str` annotation.

Future follow up
- Resolve `test_nested_multierrors` cancel-cascade hang under `--capture=fd` ("`subint_forkserver`: `test_nested_multierrors` cancel-cascade hang gated by pytest `--capture=fd`" #449). The `--capture=sys` default is a workaround; the underlying pytest-capture-machinery ↔ fork-child stdio interaction is not yet root-caused. See `ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md`.
- Wire the variant-2 `subint_forkserver_proc()` impl once `msgspec` ships PEP 684 isolated-mode support ("Port to new `concurrent.interpreters` support in `cpython` 3.14+" jcrist/msgspec#1026 → tracked at "Audit `subint_forkserver` thread constraints once msgspec PEP 684 lands" #450). Today it's a `NotImplementedError` stub; the unblock lets the child runtime live in an isolated subint while the parent's `trio.run()` keeps running on main.
- Audit `main_thread_forkserver` thread-constraint cleanup once `msgspec` ships PEP 684 support ("Audit `subint_forkserver` thread constraints once msgspec PEP 684 lands" #450). Both primitives currently allocate dedicated `threading.Thread` instances rather than using `trio.to_thread.run_sync`; the `TODO — cleanup gated on msgspec PEP 684 support` block in the module docstring catalogs the three entangled root causes that block the cleanup today.
- Surface subint bootstrap exceptions to the parent task via a `nonlocal err` slot. `_subint.subint_proc()` currently logs them via `log.exception()` only; the `?TODO` near the `_interpreters.exec()` call points at anyio's `to_interpreter._interp_call(retval, is_exception)` pattern as the next step. Coordinates with the `trio.Cancelled` paths around `subint_exited.wait()`.
- Migrate SpawnSpec arg-passing to `_interpreters.set___main___attrs()`. `_subint.subint_proc()`'s `?TODO` at the `bootstrap` literal: same API anyio uses in `to_interpreter._Worker.call()`; needed once non-`repr()`-roundtrippable values (`SpawnSpec` struct, callables) get passed through.
- Implement `child_sigint='trio'` mode (or remove the flag). Scaffolded in `_main_thread_forkserver` but currently a no-op pending the orphan-SIGINT root-cause fix tracked in `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`. Once trio's `epoll_wait` wedge is fixed, the flag may end up a no-op / doc-only mode.
- Add cancellation / hard-kill stress coverage for the forkserver backend (counterpart to `tests/spawn/test_subint_cancellation.py` for the plain `subint` backend). Module docstring lists this under "Still-open work".
- Run the `?TODO` typo-check enhancement in `tractor._testing.pytest`: pipe `skipon_spawn_backend` args through the try-set-backend checker rather than just `assert in get_args(SpawnMethodKey)`.
- xplatform pass for `tractor._testing._reap`: the process-reap path is currently Linux-only via `/proc/<pid>/{status,cwd,cmdline}`; the `--shm` phase is Linux/FreeBSD only (macOS POSIX shm has no fs-visible path; Windows is a different story). Module docstring notes a `psutil`-based rewrite is viable since the dep is already test-time.
- Root-cause the Mode-A cancel-cascade hang under heavy fork-spawn contention ("main_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention" #451). Reproduces ~17% of runs at 3 parallel pytest streams × `cpu_count - 2` actors; ≈0% on an idle single-stream system. Parent-side dump shows trio's main thread parked in `trio._core._io_epoll.get_events()` line 245: the cancel cascade has reached the I/O wait but `epoll.poll` never returns. Workaround in this PR: per-test `trio.fail_after(12)` cap + `reap_subactors_per_test` opt-in fixture confine the failure to the originating test (no cascade contamination of sibling tests), so the suite stays green. Real fix needs a contention-amplifier reproducer + `stackscope` task-tree dumps from parent + each subactor at the t≈9 mid-cascade mark.
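For the "`nonlocal err` slot" follow-up listed above, the general shape can be sketched as below. This is only an illustration under stated assumptions: it uses a plain `threading.Thread` in place of a subint, and `run_in_thread_and_reraise` is a hypothetical name, not an existing `tractor` or anyio API.

```python
import threading
from typing import Any, Callable, Optional


def run_in_thread_and_reraise(fn: Callable[[], Any]) -> Any:
    """Run `fn` in a dedicated thread; re-raise its exception in the caller.

    Hypothetical helper: a plain thread stands in for the subint bootstrap.
    """
    err: Optional[BaseException] = None
    result: Any = None

    def _target() -> None:
        nonlocal err, result
        try:
            result = fn()
        except BaseException as exc:
            # Capture instead of only logging, so the parent can see it.
            err = exc

    t = threading.Thread(target=_target)
    t.start()
    t.join()
    if err is not None:
        raise err
    return result
```

The design point is that the failure becomes visible to the task that started the work, rather than living only in a log line.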
(this pr content was generated in some part by `claude-code`)