Skip to content

Working toward a "subinterpreter-forkserver" spawning backend#447

Draft
goodboy wants to merge 137 commits into
subint_fork_backendfrom
subint_forkserver_backend
Draft

Working toward a "subinterpreter-forkserver" spawning backend#447
goodboy wants to merge 137 commits into
subint_fork_backendfrom
subint_forkserver_backend

Conversation

@goodboy
Copy link
Copy Markdown
Owner

@goodboy goodboy commented Apr 22, 2026

Add subint_forkserver spawn backend (#379 follow-on)

(i can't believe how fast we vibed this 😂 )

Motivation

Stacked on top of PR #446 / subint_fork_backend: that
branch laid the _subint_fork.py stub and the post-fork-CPython
exploration showing that os.fork() from a non-main subinterpreter
aborts the child at the CPython layer. This branch operationalizes
the workaround sketched there — fork from a regular
threading.Thread attached to the main interpreter (one that has
never entered a subint) — and wires it through tractor as a
first-class spawn backend.

The implementation looks tiny (one new module) but the supporting
work was the bulk of the patch series: multiple cancel-cascade hangs
surfaced under fork-based teardown that didn't exist for any
exec-based backend (trio, mp_*). The shared-memory image
inherited across os.fork() makes parent↔child socket-EOF delivery
racy, which exposed a latent process_messages-shielded-loop
deadlock; once cracked, that fix benefits every backend. A separate
pytest --capture=fd × fork-child interaction was traced to a
test_nested_multierrors cancel-cascade hang gated on the capture
mode — tracked at #449 and worked-around by defaulting the suite to
--capture=sys.

Late in the series the backend was split into two clearly-named
variants: variant 1main_thread_forkserver — is the
working backend that ships today (forks from a regular main-interp
worker thread, child runs trio on its own main interp; NO
subinterpreter anywhere); variant 2subint_forkserver
is reserved as a placeholder for the future subint-isolated child
runtime, gated on jcrist/msgspec#1026 (PEP 684 isolated-mode
support). Today the 'subint_forkserver' spawn-method key
dispatches to a NotImplementedError stub that points operators at
the variant-1 key. The "subint" prefix on both modules is
family-naming — they live alongside _subint.py /
_subint_fork.py from the broader #379 series.

The two threading.Thread primitives in _main_thread_forkserver
are deliberately heavy-handed (full ad-hoc threads, not
trio.to_thread.run_sync) to side-step legacy-config-subint GIL
starvation; once msgspec lands PEP 684 support and we can use
isolated subints, that constraint relaxes — auditable revisit
tracked at #450. Also bundled: a tractor-reap zombie-subactor
cleanup CLI + _testing._reap shared impl + session-scoped autouse
fixture, so a mid-teardown timeout no longer leaves orphan
subactors competing for ports across test sessions; a follow-up
commit extends tractor-reap with a --shm mode that sweeps
orphaned /dev/shm/* segments owned by the current uid that no
live process is mapping or holding open.


Src of research

The following provide info on why/how this impl makes sense,

  • goodboy/tractor#379
    • motivating issue: "Trying out sub-interpreters (subints), maybe
      fork() can be hacked now?". The "Our own thoughts" section
      sketches the worker-thread fork pattern that this backend
      implements.
  • PR #446 — subint_fork_backend
    • the immediate base branch — establishes the post-fork-CPython
      block and the _subint_fork.py stub returning
      NotImplementedError.
  • https://peps.python.org/pep-0684/
  • jcrist/msgspec#1026
    • upstream msgspec PEP 684 isolated-mode tracker; gates
      variant-2.
  • ai/conc-anal/subint_fork_from_main_thread_smoketest.py
    • standalone CPython-level reproducer covering the four scenarios
      (control_subint_thread_fork, main_thread_fork,
      worker_thread_fork, full_architecture) — pre-tractor proof
      that the workaround is sound.

Module-level design docs (read these top-of-file docstrings
for the per-backend architectural justifications, fork-semantics
analysis, and migration plans — much richer than the
high-level summary in this PR description):

  • tractor/spawn/_main_thread_forkserver.py
    • variant-1 backend (working today). Sections: Background,
      Design rationale (why a forkserver + why in-process),
      What survives the fork? — POSIX semantics, FYI: how this
      dodges the trio.run() × fork() hazards, Implementation
      status, Still-open work, TODO gated on msgspec PEP 684.
  • tractor/spawn/_subint_forkserver.py
    • variant-2 placeholder. Sections: Future arch — what subints
      would buy us (3 wins: cheaper forks, true parallelism,
      multi-actor-per-process), what lives here today
      (run_subint_in_worker_thread), what will live here when
      variant 2 ships.
  • tractor/spawn/_subint.py
    • in-thread subint backend (parent of this stack, PR A subinterpreter-in-thread spawning backend #446).
      Why we use the private _interpreters C module instead of
      concurrent.interpreters's public 'isolated' API; py3.14+
      feature gate rationale; msgspec PEP 684 migration path.
  • tractor/spawn/_subint_fork.py
    • fork-from-non-main-subint RFC stub (blocked upstream).
      Pointers to the CPython-source-line analysis in
      subint_fork_blocked_by_cpython_post_fork_issue.md.

Summary of changes

By chronological commit,

  • (82332fbc) Lift the validated fork primitives into
    tractor.spawn._subint_forkserver: fork_from_worker_thread() +
    run_subint_in_worker_thread() as the two re-usable building
    blocks.

  • (25e400d5) Add trio-parent integration tests covering
    tier-1 (primitives driven from inside trio.run()) and tier-2
    (full backend wired through open_root_actor + open_nursery).

  • (cf2e71d8) Document the PEP 684 audit-plan under
    ai/conc-anal/subint_forkserver_thread_constraints_on_pep684_issue.md
    — the upstream-gated cleanup work tracked at Audit subint_forkserver thread constraints once msgspec PEP 684 lands #450.

  • (26914fde) Wire 'subint_forkserver' as a first-class
    SpawnMethodKey and _methods registry entry; the
    try_set_start_method case re-uses the subint-family py3.14+
    gate.

  • (63ab7c98) (7804a9fe) Reset post-fork
    _state in the forkserver child via a new pure
    get_runtime_vars(clear_values=True) + sibling
    set_runtime_vars() API; without the reset the child inherits
    the parent's _is_root=True and trips Actor._from_parent() on
    the SpawnSpec handshake.

  • (76605d56) (dcd5c1ff)
    (a72deef7) Add a DRAFT orphan-SIGINT test scaffold +
    child_sigint modes; refine the diagnosis — the hang is NOT a
    missing handler, trio's loop stays wedged in epoll_wait despite
    delivery. Full trace + fix dirs in
    ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md.

  • (f5f37b69) (5e85f184)
    (8bcbe730) (e31eb8d7) Shorten timeouts
    in forkserver suites, drop dead f-string prefixes, enable
    debug_mode for the forkserver path (added to a new
    _DEBUG_COMPATIBLE_BACKENDS list in _root), and label the
    forkserver child in log attribution.

  • (1e357dcf) Mv test_subint_cancellation.py into the
    new tests/spawn/ subpkg alongside the forkserver test module.

  • (d093c319) (70d58c4b) Teach the
    /run-tests skill a zombie-actor post-run check + a SIGINT-first
    graceful cleanup ladder. Per the SC-discipline rule: graceful
    cancel before SIGKILL.

  • (e3f4f5a3) (1af21210) Add the
    test-cancellation leak doc
    (subint_forkserver_test_cancellation_leak_issue.md) and wire
    reg_addr fixture through the leaky cancel tests so each run
    gets a unique registrar address.

  • (35da8089) (9993db01) Refine the
    nested-cancel hang diagnosis; add a post-fork FD scrub in the
    fork-child prelude as the current workaround.

  • (c20b05e1) Use pidfd for cancellable
    _ForkedProc.wait() — replaces a blocking os.waitpid() with a
    trio-cancellable poll.

  • (8ac3dfeb) Break the parent-channel shield in
    Actor.cancel() teardown via a captured _parent_chan_cs.
    Without this, the shielded process_messages loop parks on EOF
    that only arrives AFTER the parent tears down — under fork
    backends the parent is itself blocked on this child's exit.
    Mutual-wait deadlock; explicit cancel makes teardown deterministic
    regardless of backend.

  • (506617c6) (ab86f761)
    (458a35cf) (7cd47ef7) Skip-mark + narrow
    the cancel hang; surface silent failures in the forkserver child;
    doc ruled-out fixes + the capture-pipe aside.

  • (76d12060) Claude-perms tweak so /commit-msg outputs
    can be written.

  • (4106ba73) (eceed29d)
    (4c133ab5) Pin the forkserver hang to pytest --capture=fd (subint_forkserver: test_nested_multierrors cancel-cascade hang gated by pytest --capture=fd #449); codify the capture-pipe-hang lesson in
    skills; default pytest to --capture=sys in pyproject.toml
    with the trade-off rationale inlined.

  • (e312a68d) (4d055543) Bound the peer-clear
    wait in async_main's finally (3s move_on_after) and narrow
    the forkserver hang to the async_main outer tn — load-bearing
    for backend-agnostic teardown determinism.

  • (d6e70e9d) Import-or-skip .devx. tests requiring
    greenback — keeps the suite collectable without the optional
    dep.

  • (b350aa09) Wire reg_addr through infected-asyncio
    tests for parallel-run isolation.

  • (2ca0f41e) (44bdb169) Skip
    test_loglevel_propagated_to_subactor on the forkserver backend
    too; tighten the orphan-SIGINT xfail to strict=True.

  • (eae478f3) (6d76b604) Add
    tractor._testing._reap (SC-polite SIGINT-first reap, descendant

    • orphan modes) + the session-scoped autouse
      _reap_orphaned_subactors fixture; add the tractor-reap CLI
      (scripts/tractor-reap) wrapping the same impl.
  • (c99d475d) (aa3e2309) Fix
    mp.SharedMemory under fork-without-exec —
    tractor.ipc._mp_bs.disable_mantracker(force_disable=True) is
    now the default (belt+suspenders: no-op ManTracker monkey-patch

    • track=False); _shm.open_shm_list always wires the unlink
      lifetime callback (was 3.12-and-below only); document the
      incompat in
      ai/conc-anal/subint_forkserver_mp_shared_memory_issue.md.
  • (4f12d69b) Extend tractor-reap with --shm (and
    --shm-only) modes that sweep orphaned /dev/shm/<key> segments
    owned by the current uid with no live process mapping or holding
    them open. Match-criteria via psutil.Process.memory_maps() +
    .open_files() — kernel-canonical, no reliance on
    tractor-specific shm-key naming, so unrelated apps' segments are
    always preserved. Adds psutil>=7.0.0 to the testing dep
    group.

  • (65fcfbf2) (9b05f659)
    (66f1941f) Bump test_stale_entry_is_deleted timeout
    to 30s; wire test_dynamic_pub_sub to standard fixtures; wire
    reg_addr into test_context_stream_semantics.

  • (54561959) Surface subint bootstrap excs in
    _subint.subint_proc() (try/except BaseException +
    log.exception(...) around _interpreters.exec()); also log
    _interpreters.is_running(interp_id) on hard-kill timeout to
    disambiguate "thread leaked, subint already done" from "thread
    alive bc subint is wedged". ?TODO notes the anyio-borrow path
    for re-raising bootstrap excs in the parent task and migrating to
    _interpreters.set___main___attrs() for non-literal SpawnSpec
    args.

  • (3ab99d55) (4b5176e2) Major
    module-docstring expansion for _subint_forkserver: design
    rationale (in-process forkserver vs. mp.forkserver's sidecar;
    why a forkserver at all vs. forking from a trio task), POSIX/trio
    fork mechanics, what survives the fork boundary, and the
    future-subint payoffs (cheaper forks, true parallelism via
    per-interp GIL, multi-actor-per-process). Bump the gated msgspec
    link from #563#1026.

  • (99dade0f) Extract the truly-generic
    main-interp-worker-thread fork primitives
    (fork_from_worker_thread, _close_inherited_fds, _ForkedProc,
    wait_child, _format_child_exit) into a sibling
    _main_thread_forkserver.py module — the primitive layer is now
    honestly named (none of these helpers touch a subint). Re-exports
    preserved.

  • (57dae0e4) Split the backend into variant 1 + variant
    2 modules. Variant 1 (main_thread_forkserver) becomes the
    canonical working impl: new SpawnMethodKey literal, _methods
    dispatch entry, Actor._from_parent() match-arm,
    main_thread_forkserver_proc() spawn-coro stamping its own
    SpawnSpec / log lines. Variant 2 (subint_forkserver) shrinks
    to a placeholder describing the future subint-isolated child
    runtime gated on Port to new concurrent.interpreters support in cpython 3.14+ jcrist/msgspec#1026; legacy 'subint_forkserver'
    key still aliases to variant-1 here (flipped to
    NotImplementedError in the next commit).

  • (5e83881f) Reduce _subint_forkserver.py to its
    variant-2 placeholder shape: add subint_forkserver_proc async
    stub raising NotImplementedError with a redirect msg pointing
    at main_thread_forkserver + Port to new concurrent.interpreters support in cpython 3.14+ jcrist/msgspec#1026 + Trying out sub-interpreters (subints), maybe fork() can be hacked now?' #379. Flip
    the _methods registry to dispatch the stub directly so
    --spawn-backend=subint_forkserver errors cleanly. Drop dead
    module-scope (ChildSigintMode, _DEFAULT_CHILD_SIGINT, unused
    imports).

  • (9f0709ee) Rename
    tests/spawn/test_subint_forkserver.py
    test_main_thread_forkserver.py; migrate test/smoketest imports
    to tractor.spawn._main_thread_forkserver; orphan-harness
    subprocess argv flipped to 'main_thread_forkserver'. Drop the
    variant-2 module's backward-compat re-exports of fork primitives.

  • (205382a3) Sweep subint_forkserver
    main_thread_forkserver in remaining string-match refs:
    _DEBUG_COMPATIBLE_BACKENDS,
    test_loglevel_propagated_to_subactor's capfd-skip,
    test_sigint_closes_lifetime_stack's xfail, comment/docstring
    refs across _runtime, _state, _testing.pytest, _subint,
    pyproject.toml, test_cancellation, test_registrar. Drop the
    test_shm.py "broken on main_thread_forkserver" skip-mark —
    _mp_bs + _shm fixes make those tests pass.

  • (cbdf1eb6) Add
    test_subint_forkserver_key_errors_cleanly regression guard
    pinning the variant-2 reservation contract: the
    'subint_forkserver' key MUST raise NotImplementedError (not
    silently dispatch to variant-1), and the error msg must surface
    both the working-backend pointer (main_thread_forkserver) +
    the upstream blocker (msgspec#1026).

  • (7c5dd4d0) Fix _testing.addr.get_rando_addr
    cross-process collisions: the _rando_port: str = random.randint(...) default-arg expression was evaluated ONCE at
    module-import — making it a per-process singleton. Two parallel
    pytest sessions had a 1/9000 birthday-pair chance of cascade-
    failing every reg_addr-using test. Switch to per-call
    random.randint() salted with os.getpid(); drop the bogus
    : str annotation.


Future follow up

  • Resolve test_nested_multierrors cancel-cascade hang under
    --capture=fd (subint_forkserver: test_nested_multierrors cancel-cascade hang gated by pytest --capture=fd #449). The --capture=sys default is a
    workaround; the underlying pytest-capture-machinery ↔ fork-child
    stdio interaction is not yet root-caused. See
    ai/conc-anal/subint_forkserver_test_cancellation_leak_issue.md.

  • Wire the variant-2 subint_forkserver_proc() impl once
    msgspec ships PEP 684 isolated-mode support
    (Port to new concurrent.interpreters support in cpython 3.14+ jcrist/msgspec#1026 → tracked at Audit subint_forkserver thread constraints once msgspec PEP 684 lands #450). Today it's a
    NotImplementedError stub; the unblock lets the child runtime
    live in an isolated subint while the parent's trio.run() keeps
    running on main.

  • Audit main_thread_forkserver thread-constraint cleanup once
    msgspec ships PEP 684 support (Audit subint_forkserver thread constraints once msgspec PEP 684 lands #450). Both primitives currently
    allocate dedicated threading.Thread instances rather than using
    trio.to_thread.run_sync; the TODO — cleanup gated on msgspec PEP 684 support block in the module docstring catalogs the three
    entangled root causes that block the cleanup today.

  • Surface subint bootstrap exceptions to the parent task via a
    nonlocal err slot. _subint.subint_proc() currently logs them
    via log.exception() only — the ?TODO near the
    _interpreters.exec() call points at anyio's
    to_interpreter._interp_call (retval, is_exception) pattern as
    the next step. Coordinates with the trio.Cancelled paths around
    subint_exited.wait().

  • Migrate SpawnSpec arg-passing to
    _interpreters.set___main___attrs(). _subint.subint_proc()'s
    ?TODO at the bootstrap literal: same API anyio uses in
    to_interpreter._Worker.call(); needed once
    non-repr()-roundtrippable values (SpawnSpec struct,
    callables) get passed through.

  • Implement child_sigint='trio' mode (or remove the flag).
    Scaffolded in _main_thread_forkserver but currently a no-op
    pending the orphan-SIGINT root-cause fix tracked in
    ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md.
    Once trio's epoll_wait wedge is fixed, the flag may end up a
    no-op / doc-only mode.

  • Add cancellation / hard-kill stress coverage for the
    forkserver backend (counterpart to
    tests/spawn/test_subint_cancellation.py for the plain subint
    backend). Module docstring lists this under "Still-open work".

  • Run the ?TODO typo-check enhancement in
    tractor._testing.pytest — pipe skipon_spawn_backend args
    through the try-set-backend checker rather than just assert in get_args(SpawnMethodKey).

  • xplatform pass for tractor._testing._reap — process-reap
    path is currently Linux-only via
    /proc/<pid>/{status,cwd,cmdline}; the --shm phase is
    Linux/FreeBSD only (macOS POSIX shm has no fs-visible path;
    Windows is a different story). Module docstring notes a
    psutil-based rewrite is viable since the dep is already
    test-time.

  • Root-cause the Mode-A cancel-cascade hang under heavy
    fork-spawn contention (main_thread_forkserver: cancel-cascade occasionally hangs >9s under heavy fork-spawn contention #451). Reproduces ~17% of runs at 3
    parallel pytest streams × cpu_count - 2 actors; ≈0% on an
    idle single-stream system. Parent-side dump shows trio's main
    thread parked in trio._core._io_epoll.get_events() line 245
    — cancel cascade has reached the I/O wait but epoll.poll
    never returns. Workaround in this PR: per-test
    trio.fail_after(12) cap + reap_subactors_per_test opt-in
    fixture confine the failure to the originating test (no
    cascade contamination of sibling tests), so the suite stays
    green. Real fix needs a contention-amplifier reproducer +
    stackscope task-tree dumps from parent + each subactor at
    the t≈9 mid-cascade mark.


(this pr content was generated in some part by claude-code)

@goodboy goodboy force-pushed the subint_forkserver_backend branch from 418a7ca to 4425023 Compare April 23, 2026 15:44
@goodboy goodboy self-assigned this Apr 23, 2026
goodboy added 27 commits April 23, 2026 18:48
Standalone script to validate the "main-interp worker-thread
forkserver + subint-hosted trio" arch proposed as a workaround
to the CPython-level refusal doc'd in
`ai/conc-anal/subint_fork_blocked_by_cpython_post_fork_issue.md`.

Deliberately NOT a `tractor` test — zero `tractor` imports.
Uses `_interpreters` (private stdlib) + `os.fork()` directly so
pass/fail is a property of CPython alone, independent of our
runtime. Requires py3.14+.

Deats,
- four scenarios via `--scenario`:
  - `control_subint_thread_fork` — the KNOWN-BROKEN case as a
    harness sanity; if the child DOESN'T abort, our analysis
    is wrong
  - `main_thread_fork` — baseline sanity, must always succeed
  - `worker_thread_fork` — architectural assertion: regular
    `threading.Thread` attached to main interp calls
    `os.fork()`; child should survive post-fork cleanup
  - `full_architecture` — end-to-end: fork from a main-interp
    worker thread, then in child create a subint driving a
    worker thread running `trio.run()`
- exit code 0 on EXPECTED outcome (for `control_*` that means
  "child aborted", not "child succeeded")
- each scenario prints a self-contained pass/fail banner; use
  `os.waitpid()` of the parent + per-scenario status prints to
  observe the child's fate

Also, log NLNet provenance for this session's three-sub-phase
work (py3.13 gate tightening, `pytest-timeout` + marker
refactor, `subint_fork` prototype → CPython-block finding).

Prompt-IO: ai/prompt-io/claude/20260422T200723Z_797f57c_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The smoketest (prior commit) empirically validated the
"fork-from-main-interp-worker-thread" arch on py3.14. Promote
the validated primitives out of the `ai/conc-anal/` smoketest
into `tractor.spawn._subint_forkserver` so they can eventually
be wired into a real "subint forkserver" spawn backend.

Deats,
- new module `tractor/spawn/_subint_forkserver.py` (337 LOC):
  - `fork_from_worker_thread(child_target, thread_name)` —
    spawn a main-interp `threading.Thread`, call `os.fork()`
    from it, shuttle the child pid back to main via a pipe
  - `run_trio_in_subint(bootstrap, ...)` — post-fork helper:
    create a fresh subint + drive `_interpreters.exec()` on
    a dedicated worker thread running the `bootstrap` str
    (typically imports `trio`, defines an async entry, calls
    `trio.run()`)
  - `wait_child(pid, expect_exit_ok)` — `os.waitpid()` +
    pass/fail classification reusable from harness AND the
    eventual real spawn path
- feature-gated py3.14+ via the public
  `concurrent.interpreters` presence check; matches the gate
  in `tractor.spawn._subint`
- module docstring doc's the CPython-block context
  (cross-refs `_subint_fork` stub + the two `conc-anal/`
  docs) and status: EXPERIMENTAL, not yet registered in
  `_spawn._methods`

Also, refactor the smoketest
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py` to
import the primitives from the new module rather than inline
its own copies. Keeps the smoketest and the tractor-side
impl in sync as the forkserver design evolves; the smoketest
remains a zero-`tractor`-runtime CPython-level check
(imports ONLY the three primitives, no runtime bring-up).

Status: next step is to drive these from a parent-side
`trio.run()` and hook the returned child pid into the normal
actor-nursery/IPC flow — then register `subint_forkserver`
as a `SpawnMethodKey` in `_spawn.py`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New pytest module `tests/spawn/test_subint_forkserver.py`
drives the forkserver primitives from inside a real
`trio.run()` in the parent — the runtime shape tractor will
actually use when we wire up a `subint_forkserver` spawn
backend proper. Complements the standalone no-trio-in-parent
`ai/conc-anal/subint_fork_from_main_thread_smoketest.py`.

Deats,
- new test pkg `tests/spawn/` (+ empty `__init__.py`)
- two tests, both `@pytest.mark.timeout(30, method='thread')`
  for the GIL-hostage safety reason doc'd in
  `ai/conc-anal/subint_sigint_starvation_issue.md`:
  - `test_fork_from_worker_thread_via_trio` — parent-side
    plumbing baseline. `trio.run()` off-loads forkserver
    prims via `trio.to_thread.run_sync()` + asserts the
    child reaps cleanly
  - `test_fork_and_run_trio_in_child` — end-to-end: forked
    child calls `run_subint_in_worker_thread()` with a
    bootstrap str that does `trio.run()` in a fresh subint
- both tests wrap the inner `trio.run()` in a
  `dump_on_hang()` for post-mortem if the outer
  `pytest-timeout` fires
- intentionally NOT using `--spawn-backend` — the tests
  drive the primitives directly rather than going through
  tractor's spawn-method registry (which the forkserver
  isn't plugged into yet)

Also, rename `run_trio_in_subint()` →
`run_subint_in_worker_thread()` for naming consistency with
the sibling `fork_from_worker_thread()`. The action is really
"host a subint on a worker thread", not specifically "run
trio" — trio just happens to be the typical payload.
Propagate the rename to the smoketest.

Further, add a "TODO — cleanup gated on msgspec PEP 684
support" section to the `_subint_forkserver` module
docstring: flags the dedicated-`threading.Thread` design as
potentially-revisable once isolated-mode subints are viable
in tractor. Cross-refs `msgspec#563` + `tractor#379` and
points at an audit-plan conc-anal doc we'll add next.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker companion to the module-docstring TODO
added in `372a0f32`. Catalogs why `_subint_forkserver`'s
two "non-trio thread" constraints
(`fork_from_worker_thread()` +
`run_subint_in_worker_thread()` both allocating dedicated
`threading.Thread`s; test helper named
`run_fork_in_non_trio_thread`) exist today, and which of
them would dissolve once msgspec PEP 684 support ships
(`msgspec#563`) and tractor flips to isolated-mode subints.

Deats,
- three reasons enumerated for the current constraints:
  - class-A GIL-starvation — **fixed** by isolated mode:
    subints don't share main's GIL so abandoned-thread
    contention disappears
  - destroy race / tstate-recycling from `subint_proc` —
    **unclear**: `_PyXI_Enter` + `_PyXI_Exit` are
    cross-mode, so isolated doesn't obviously fix it;
    needs empirical retest on py3.14 + isolated API
  - fork-from-main-interp-tstate (the CPython-level
    `_PyInterpreterState_DeleteExceptMain` gate) — the
    narrow reason for using a dedicated thread; **probably
    fixed** IF the destroy-race also resolves (bc trio's
    cache threads never drove subints → clean main-interp
    tstate)
- TL;DR table of which constraints unwind under each
  resolution branch
- four-step audit plan for when `msgspec#563` lands:
  - flip `_subint` to isolated mode
  - empirical destroy-race retest
  - audit `_subint_forkserver.py` — drop `non_trio`
    qualifier / maybe inline primitives
  - doc fallout — close the three `subint_*_issue.md`
    siblings w/ post-mortem notes

Also, cross-refs the three sibling `conc-anal/` docs, PEPs
684 + 734, `msgspec#563`, and `tractor#379` (the overall
subint spawn-backend tracking issue).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Promote `_subint_forkserver` from primitives-only into a
registered spawn backend: `'subint_forkserver'` is now a
`SpawnMethodKey` literal, dispatched via `_methods` to
the new `subint_forkserver_proc()` target, feature-gated
under the existing `subint`-family py3.14+ case, and
selectable via `--spawn-backend=subint_forkserver`.

Deats,
- new `subint_forkserver_proc()` spawn target in
  `_subint_forkserver`:
  - mirrors `trio_proc()`'s supervision model — real OS
    subprocess so `Portal.cancel_actor()` + `soft_kill()`
    on graceful teardown, `os.kill(SIGKILL)` on hard-reap
    (no `_interpreters.destroy()` race to fuss over bc the
    child lives in its own process)
  - only real diff from `trio_proc` is the spawn mechanism:
    fork from a main-interp worker thread via
    `fork_from_worker_thread()` (off-loaded to trio's
    thread pool) instead of `trio.lowlevel.open_process()`
  - child-side `_child_target` closure runs
    `tractor._child._actor_child_main()` with
    `spawn_method='trio'` — the child is a regular trio
    actor, "subint_forkserver" names how the parent
    spawned, not what the child runs
- new `_ForkedProc` class — thin `trio.Process`-compatible
  shim around a raw OS pid: `.poll()` via
  `waitpid(WNOHANG)`, async `.wait()` off-loaded to a trio
  cache thread, `.kill()` via `SIGKILL`, `.returncode`
  cached for repeat calls. `.stdin`/`.stdout`/`.stderr`
  are `None` (fork-w/o-exec inherits parent FDs; we don't
  marshal them) which matches `soft_kill()`'s `is not None`
  guards

Also, new backend-tier test
`test_subint_forkserver_spawn_basic` drives the registered
backend end-to-end via `open_root_actor` + `open_nursery` +
`run_in_actor` w/ a trivial portal-RPC round-trip. Uses a
`forkserver_spawn_method` fixture to flip
`_spawn_method`/`_ctx` for the test's duration + restore on
teardown (so other session-level tests don't observe the
global flip). Test module docstring reworked to describe
the three tiers now covered: (1) primitive-level, (2)
parent-trio-driven primitives, (3) full registered backend.

Status: still-open work (tracked on `tractor#379`) doc'd
inline in the module docstring — no cancel/hard-kill stress
coverage yet, child-side subint-hosted root runtime still
future (gated on `msgspec#563`), thread-hygiene audit
pending the same unblock.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`os.fork()` inherits the parent's entire memory image,
including `tractor.runtime._state` globals that encode
"this process is the root actor" — `_runtime_vars`'s
`_is_root=True`, pre-populated `_root_mailbox` +
`_registry_addrs`, and the parent's `_current_actor`
singleton.

A fresh `exec`-based child starts with those globals at
their module-level defaults (all falsey/empty). The
forkserver child needs to match that shape BEFORE calling
`_actor_child_main()`, otherwise `Actor.__init__()` takes
the `is_root_process() == True` branch and pre-populates
`self.enable_modules`, which then trips
`assert not self.enable_modules` at the top of
`Actor._from_parent()` on the subsequent parent→child
`SpawnSpec` handshake.

Fix: at the start of `_child_target`, null
`_state._current_actor` and overwrite `_runtime_vars` with
a cold-root blank (`_is_root=False`, empty mailbox/addrs,
`_debug_mode=False`) before `_actor_child_main()` runs.

Found-via: `test_subint_forkserver_spawn_basic` hitting
the `enable_modules` assert on child-side runtime boot.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Post-fork `_runtime_vars` reset in `subint_forkserver_proc`
was previously done via direct mutation of
`_state._runtime_vars` from an external module + an inline
default dict duplicating the `_state.py`-internal defaults.
Split the access surface into a pure getter + explicit
setter so the reset call site becomes a one-liner
composition.

Deats `tractor/runtime/_state.py`,
- extract initial values into a module-level
  `_RUNTIME_VARS_DEFAULTS: dict[str, Any]` constant; the
  live `_runtime_vars` is now initialised from
  `dict(_RUNTIME_VARS_DEFAULTS)`
- `get_runtime_vars()` grows a `clear_values: bool = False`
  kwarg. When True, returns a fresh copy of
  `_RUNTIME_VARS_DEFAULTS` instead of the live dict —
  still a **pure read**, never mutates anything
- new `set_runtime_vars(rtvars: dict | RuntimeVars)` —
  atomic replacement of the live dict's contents via
  `.clear()` + `.update()`, so existing references to the
  same dict object remain valid. Accepts either the
  historical dict form or the `RuntimeVars` struct

Deats `tractor/spawn/_subint_forkserver.py`,
- collapse the prior ad-hoc `.update({...})` block into
  `set_runtime_vars(get_runtime_vars(clear_values=True))`
- drop the `_state._current_actor = None` line —
  `_trio_main` unconditionally overwrites it downstream,
  so no explicit reset needed (noted in the XXX comment)

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Tier-4 test `test_orphaned_subactor_sigint_cleanup_DRAFT`
documents an empirical SIGINT-delivery gap in the
`subint_forkserver` backend: when the parent dies via
`SIGKILL` (no IPC `Portal.cancel_actor()` possible) and
`SIGINT` is sent to the orphan child, the child DOES NOT
unwind — CPython's default `KeyboardInterrupt` is delivered
to `threading.main_thread()`, whose tstate is dead in the
post-fork child bc fork inherited the worker thread, not
main. Trio running on the fork-inherited worker thread
therefore never observes the signal. Marked
`xfail(strict=True)` so the mark flips to XPASS→fail once
the backend grows explicit SIGINT plumbing.

Deats,
- harness runs the failure-mode sequence out-of-process:
  1. harness subprocess runs a fresh Python script
     that calls `try_set_start_method('subint_forkserver')`
     then opens a root actor + one `sleep_forever` subactor
  2. parse `PARENT_READY=<pid>` + `CHILD_PID=<pid>` markers
     off harness `stdout` to confirm IPC handshake
     completed
  3. `SIGKILL` the parent, `proc.wait()` to reap the
     zombie (otherwise `os.kill(pid, 0)` keeps reporting
     it alive)
  4. assert the child survived the parent-reap (i.e. was
     actually orphaned, not reaped too) before moving on
  5. `SIGINT` the orphan child, poll `os.kill(child_pid, 0)`
     every 100ms for up to 10s
- supporting helpers: `_read_marker()` with per-proc
  bytes-buffer to carry partial lines across calls,
  `_process_alive()` liveness probe via `kill(pid, 0)`
- Linux-only via `platform.system() != 'Linux'` skip —
  orphan-reparenting semantics don't generalize to
  other platforms
- port offset (`reg_addr[1] + 17`) so the harness listener
  doesn't race concurrently-running backend tests
- best-effort `finally:` cleanup: `SIGKILL` any still-alive
  pids + `proc.kill()` + bounded `proc.wait()` to avoid
  leaking orphans across the session

Also, tier-4 header comment documents the cross-backend
generalization path: applicable to any multi-process
backend (`trio`, `mp_spawn`, `mp_forkserver`,
`subint_forkserver`), NOT to plain `subint` (in-process
subints have no orphan OS-child). Move path: lift
harness into `tests/_orphan_harness.py`, parametrize on
session `_spawn_method`, add
`skipif _spawn_method == 'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add configuration surface for future child-side SIGINT
plumbing in `subint_forkserver_proc` without wiring up the
actual trio-native SIGINT bridge — lifting one entry-guard
clause will flip the `'trio'` branch live once the
underlying fork-prelude plumbing is implemented.

Deats,
- new `ChildSigintMode = Literal['ipc', 'trio']` type +
  `_DEFAULT_CHILD_SIGINT = 'ipc'` module-level default.
  Docstring block enumerates both:
  - `'ipc'` (default, currently the only implemented mode):
    no child-side SIGINT handler — `trio.run()` is on the
    fork-inherited non-main thread where
    `signal.set_wakeup_fd()` is main-thread-only, so
    cancellation flows exclusively via the parent's
    `Portal.cancel_actor()` IPC path. Known gap: orphan
    children don't respond to SIGINT
    (`test_orphaned_subactor_sigint_cleanup_DRAFT`)
  - `'trio'` (scaffolded only): manual SIGINT → trio-cancel
    bridge in the fork-child prelude so external Ctrl-C
    reaches stuck grandchildren even w/ a dead parent
- `subint_forkserver_proc` pulls `child_sigint` out of
  `proc_kwargs` (matches how `trio_proc` threads config to
  `open_process`, keeps `start_actor(proc_kwargs=...)` as
  the ergonomic entry point); validates membership + raises
  `NotImplementedError` for `'trio'` at the backend-entry
  guard
- `_child_target` grows a `match child_sigint:` arm that
  slots in the future `'trio'` impl without restructuring
  — today only the `'ipc'` case is reachable
- module docstring "Still-open work" list grows a bullet
  pointing at this config + the xfail'd orphan-SIGINT test

No behavioral change on the default path — `'ipc'` is the
existing flow. Scaffolding only.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Empirical follow-up to the xfail'd orphan-SIGINT test:
the hang is **not** "trio can't install a handler on a
non-main thread" (the original hypothesis from the
`child_sigint` scaffold commit). On py3.14:

- `threading.current_thread() is threading.main_thread()`
  IS True post-fork — CPython re-designates the
  fork-inheriting thread as "main" correctly
- trio's `KIManager` SIGINT handler IS installed in the
  subactor (`signal.getsignal(SIGINT)` confirms)
- the kernel DOES deliver SIGINT to the thread

But `faulthandler` dumps show the subactor wedged in
`trio/_core/_io_epoll.py::get_events` — trio's
wakeup-fd mechanism (which turns SIGINT into an epoll-wake)
isn't firing. So the `except KeyboardInterrupt` at
`tractor/spawn/_entry.py::_trio_main:164` — the runtime's
intentional "KBI-as-OS-cancel" path — never fires.

Deats,
- new `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
  (+385 LOC): full writeup — TL;DR, symptom reproducer,
  the "intentional cancel path" the bug defeats,
  diagnostic evidence (`faulthandler` output +
  `getsignal` probe), ruled-out hypotheses
  (non-main-thread issue, wakeup-fd inheritance,
  KBI-as-trio-check-exception), and fix directions
- `test_orphaned_subactor_sigint_cleanup_DRAFT` xfail
  `reason` + test docstring rewritten to match the
  refined understanding — old wording blamed the
  non-main-thread path, new wording points at the
  `epoll_wait` wedge + cross-refs the new conc-anal doc
- `_subint_forkserver` module docstring's
  `child_sigint='trio'` bullet updated: now notes trio's
  handler is already correctly installed, so the flag may
  end up a no-op / doc-only mode once the real root cause
  is fixed

Closing the gap aligns with existing design intent (make
the already-designed "KBI-as-OS-cancel" behavior actually
fire), not a new feature.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key.
  off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up to 72d1b90 (was prev commit adding `debug_mode` for
`subint_forkserver`): that commit wired the runtime-side
`subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the
`subint_forkserver_proc` child-target was still passing
`spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines
would report the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to `'subint_forkserver'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Also, some slight touchups in `.spawn._subint`.
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so every subsequent session can't bind and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.

Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
  bound TCP registry listener — the authoritative
  check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
  _actor_child_main|subint-forkserv` for leftover
  actor/forkserver procs — scoped deliberately so we
  don't false-flag legit long-running tractor-
  embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
  using the same `$(pwd)/py*` pattern, UDS `rm -f`,
  re-verify) plus a fallback for when a zombie holds
  :1616 but doesn't match the pattern: `ss -tlnp` →
  kill by PID
- explicit false-positive warning calling out the
  `piker` case (`~/repos/piker/py*/bin/python3 -m
  tractor._child ...`) so a bare `pgrep` doesn't lead
  to nuking unrelated apps

Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d012196 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).

Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
  sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
  `pgrep` to poll for exit. Loop breaks early on
  clean exit
- step 3: `pkill -9` only if graceful timed out, w/
  a logged escalation msg so it's obvious when SC
  teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
  `:1616`-holding zombie that doesn't match the
  cmdline pattern (find PID via `ss -tlnp`, then
  `kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
  unchanged

Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to
   be a self-inflicted cleanup bug: `pkill`-ing a bg
   pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
   the SC graceful cancel cascade entirely. Codified
   the real fix — SIGINT-first ladder w/ bounded
   wait before SIGKILL — in e5e2afb (`run-tests`
   SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
   hangs indefinitely** — the actual backend bug,
   and it's a deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers
  are in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never
  reaches its targets
- tree-structure evidence: asymmetric depth across
  two identical `run_in_actor` calls — child 1
  (3 threads) spawns both its grandchildren; child 2
  (1 thread) never completes its first nursery
  `run_in_actor`. Smells like a race on fork-
  inherited state landing differently per spawn
  ordering
- new hypothesis: `os.fork()` from a subactor
  inherits the ROOT parent's IPC listener FDs
  transitively. Grandchildren end up with three
  overlapping FD sets (own + direct-parent + root),
  so IPC routing becomes ambiguous. Predicts bug
  scales with fork depth — matches reality: single-
  level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation
  (each subactor has its own OS process+GIL),
  child-side KBI absorption (`_trio_main` only
  catches KBI at `trio.run()` callsite, reached
  only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit,
  (3) targeted FD cleanup via `actor.ipc_server`
  handle, (4) `os.posix_spawn` w/ `file_actions`.
  Vote: (3) — surgical, doesn't break the "no exec"
  design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
  2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
  level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until
  fix lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements fix-direction (1)/blunt-close-all-FDs from
b71705b (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
  helper. Linux fast-path enumerates `/proc/self/fd`;
  POSIX fallback uses `RLIMIT_NOFILE` range. Matches
  the stdlib `subprocess._posixsubprocess.close_fds`
  strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
  post-fork child prelude — runs immediately after
  the pid-pipe `os.close(rfd/wfd)`, before the user
  `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
  out the FD-inheritance-cascade mechanism and why
  the close-all approach is safe for our spawn shape

Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two coordinated improvements to the `subint_forkserver` backend:

1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
   abandon_on_cancel=False)` in `_ForkedProc.wait()`
   with `trio.lowlevel.wait_readable(pidfd)`. The
   prior version blocked a trio cache thread on a
   sync syscall — outer cancel scopes couldn't
   unwedge it when something downstream got stuck.
   Same pattern `trio.Process.wait()` and
   `proc_waiter` (the mp backend) already use.

2. Drop the `@pytest.mark.xfail(strict=True)` from
   `test_orphaned_subactor_sigint_cleanup_DRAFT` —
   the test now PASSES after 0cd0b63 (fork-child
   FD scrub). Same root cause as the nested-cancel
   hang: inherited IPC/trio FDs were poisoning the
   child's event loop. Closing them lets SIGINT
   propagation work as designed.

Deats,
- `_ForkedProc.__init__` opens a pidfd via
  `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
  then non-blocking `waitpid(WNOHANG)` to collect
  the exit status (correct since the pidfd signal
  IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
  where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
  semantics) + `__del__` belt-and-braces for
  unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
  `# NOTE` comment explaining the historical
  context + cross-ref to the conc-anal doc; test
  remains in place as a regression guard

The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Completes the nested-cancel deadlock fix started in
0cd0b63 (fork-child FD scrub) and fe540d0 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test
   matrix while the remaining bug is being chased.
   The reason string cross-refs the conc-anal doc
   for full context.

2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`) with the
   empirical state after the three nested- cancel fix commits
   (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
   shield break) landed, narrowing the remaining hang from "everything
   broken" to "peer-channel loops don't exit on `service_tn` cancel".

Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804`
  works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
  channel handlers in `handle_stream_from_peer`
  (inbound connections handled by
  `stream_handler_tn`, which IS `service_tn` in the
  current config).
- after `_parent_chan_cs.cancel()` fires, NEW
  shielded loops appear on the session reg_addr
  port — probably discovery-layer reconnection;
  doesn't block teardown but indicates the cascade
  has more moving parts than expected.

The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:

1. **"Attempted fix (DID NOT work) — hypothesis
   (3)"**: tried sync-closing peer channels' raw
   socket fds from `_serve_ipc_eps`'s finally block
   (iterate `server._peers`, `_chan._transport.
   stream.socket.close()`). Theory was that sync
   close would propagate as `EBADF` /
   `ClosedResourceError` into the stuck
   `recv_some()` and unblock it. Result: identical
   hang. Either trio holds an internal fd
   reference that survives external close, or the
   stuck recv isn't even the root blocker. Either
   way: ruled out, experiment reverted, skip-mark
   restored.
2. **"Aside: `-s` flag changes behavior for peer-
   intensive tests"**: noticed
   `test_context_stream_semantics.py` under
   `subint_forkserver` hangs with default
   `--capture=fd` but passes with `-s`
   (`--capture=no`). Working hypothesis: subactors
   inherit pytest's capture pipe (fds 1,2 — which
   `_close_inherited_fds` deliberately preserves);
   verbose subactor logging fills the buffer,
   writes block, deadlock. Fix direction (if
   confirmed): redirect subactor stdout/stderr to
   `/dev/null` or a file in `_actor_child_main`.
   Not a blocker on the main investigation;
   deserves its own mini-tracker.

Both sections are diagnosis-only — no code changes
in this commit.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens — easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.

Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
  swallow now logs the fd that failed to close. The comment notes the
  expected failure modes (already-closed-via-listdir-race,
  otherwise-unclosable) — both still fine to ignore semantically, but
  worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
  `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
  failure separately before raising the `worker thread didn't return`
  RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
  `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
  log the full call signature (interp_id + bootstrap) along with the
  captured exception, before stashing into `err` for the outer caller.

Behavior unchanged — only adds observability.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
# Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands

Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this to the new subints compat issue we make @ msgspec.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goodboy added 11 commits May 8, 2026 00:51
Rework reap/diag tooling to identify tractor sub-actors via
intrinsic proc signals — cmdline/comm markers from `setproctitle` —
instead of env-var or cwd matching.

Deats,
- new `_is_tractor_subactor()` checks cmdline for `tractor[` /
  `tractor._child` markers, falls back to `/proc/<pid>/comm` for
  zombie-resilient detection (kernel preserves `comm` past exit
  until reap)
- `_read_comm()` reads kernel per-task name set by `setproctitle()`
  — the zombie-safe ID signal
- `_read_status_state()` reads single-letter proc state from
  `/proc/<pid>/status` (`Z` = zombie)
- `find_orphans()` drops `repo_root` requirement, uses
  `_is_tractor_subactor()` for intrinsic sub-actor ID instead of
  cwd coincidence-matching
- new `find_zombies()` with optional `parent_pid` filter for
  zombie-state sub-actors

Also,
- rename `pytree` -> `ptree` throughout xontrib
- add `_which_cgroup_slice()` — reads `/proc/<pid>/cgroup` to
  distinguish `system.slice` services vs `user.slice` desktop apps
  from genuinely leaked orphans
- `_ptree` classifies `ppid==1` procs into `system-slice`,
  `user-slice`, and `orphans` buckets with per-section output
- `_tractor_reap` drops `git rev-parse` / `sys.path` hack — assumes
  tractor importable from active venv

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Comment/docstring updates: `subint_forkserver` is a clean
`NotImplementedError` stub — not an alias to variant-1
(`main_thread_forkserver`). Key reserved in-place (not aliased) so
the subint-hosted-child impl can flip without API churn once
jcrist/msgspec#1026 unblocks PEP 684 subints.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add module-level `pytestmark` applying per-test
`reap_subactors_per_test`, `track_orphaned_uds_per_test`, and
`detect_runaway_subactors_per_test` fixtures — registrar tests stress
discovery roundtrips that historically left orphaned UDS sock-files.

Deats,
- drop unused `say_hello()` fn, keep only `say_hello_use_wait`;
  rename param `func` -> `ria_fn`.
- use `@tractor_test(timeout=7)` instead of separate
  `@pytest.mark.timeout(7, method='thread')` decorator.
- add `with_timeout()` helper, wire into
  `test_subactors_unregister_on_cancel_remote_daemon`.
- uncomment `_timeout_main()` in `test_stale_entry_is_deleted`, use
  configurable `timeout` var + `debug_mode` guard for `tractor.pause()`
  on cancel.
- `dump_on_hang(seconds=timeout*2)` instead of hardcoded `20`.
- fix typo "oustanding" -> "outstanding".

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`acli.bindspace_scan piker` now resolves `<name>` to
`$XDG_RUNTIME_DIR/<name>` — useful for projects like
`piker` that bind sibling sub-dirs alongside tractor's
default. Full paths still work as-is.

Also,
- rename "unparseable" section to "non-tractor" with
  clearer desc (filename lacks `@<pid>` suffix)
- print per-sock `ss -lpx 'src = <path>'` cmds for
  non-tractor socks so callers can manually resolve
  listener-PID liveness

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Forking spawner + UDS transport has different timing
vs `trio_proc` — streaming example completes faster
in some cases, slower in others depending on fork
overhead + sock setup.

Deats,
- add `expect_cancel` param to `cancel_after()`, raise
  `ActorTooSlowError` when cancel scope fires unexpectedly instead of
  silently returning `None`.
- `time_quad_ex` fixture: bump timeout +1 for forking+UDS, explicit
  `ActorTooSlowError` on `None` result instead of bare `assert results`.
- `test_not_fast_enough_quad`: `xfail` for forking+UDS being "too fast"
  (cancel doesn't fire bc streaming finishes before delay).
- add `is_forking_spawner`, `tpt_proto` fixture params throughout.

Also,
- `_testing/pytest.py`: widen `start_method` parametrize and
  `is_forking_spawner` fixture to `scope='session'`.
- `"""` -> `'''` docstring style throughout.
- hoist `_non_linux` to module scope (was redefined locally in two
  places).
- type hints, kwarg-style `partial()` calls.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/spawn_time_boot_death_dup_name_issue.md`
documenting the spawn-time rc=2 race under rapid
same-name spawning against a forkserver + registrar
— the `wait_for_peer_or_proc_death` helper now surfaces
the death instead of parking forever on the handshake
wait.

Also,
- extract inline `xfail` into module-level
  `_DOGGY_BOOT_RACE_XFAIL` marker.
- apply it to `n_dups=8` too (previously bare) bc
  larger N widens the race window enough to fire
  occasionally.
- link to tracking issue #456.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `capture` dimension to CI matrix so fork-based
backends run `--capture=sys` (fork-child × `--capture=fd`
is a known deadlock). Non-fork backends keep `fd`.

Deats,
- two `include:` rows for `main_thread_forkserver` on
  linux py3.13: tcp + uds, both `capture: 'sys'`
- job name updated to show `capture=` mode
- timeout bumped 16 -> 20 min to accommodate the
  additional matrix cells
- `--capture=${{ matrix.capture }}` replaces hardcoded
  `--capture=fd`

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Split the old `live`/`orphans` sock classification
into three ppid-aware buckets: `live-active` (PID
alive, parent owns it), `orphaned-alive` (PID alive
but `ppid==1`, init-adopted — `acli.reap` candidate),
and `orphaned-dead` (PID gone, sock stale).

Deats,
- new `_ppid()` helper reads `/proc/<pid>/stat` field [3] for parent
  PID, handles the tricky `(comm)` field (can contain spaces/parens) by
  splitting from last `)`.
- live-active rows now show `(ppid=<N>)` for ctx.
- orphaned-alive rows flagged `(adopted by init)`.
- cleanup suggestion: `acli.reap --uds` for both
  alive-orphan graceful cancel + dead-sock cleanup
  in one shot; manual `rm` kept as fallback.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Outer `signal.alarm` cap that fires even when trio's
`fail_after` is blocked by a shielded-await deadlock
(the bug-class-3 hang under MTF backends). Only armed
for fork-based spawners where the bug lives.

Deats,
- `_DIAG_CAP_S = fail_after_s + 5` — slightly larger than the
  trio-native guard so it always loses when the in-band path works.
- `test_log.cancel()` breadcrumbs at each cancel-scope boundary so the
  last-fired breadcrumb names the swallow point on hang.
- try/finally wrapping around each scope level for deterministic
  breadcrumb emission.
- add `is_forking_spawner`, `set_fork_aware_capture` fixture params.
- rework `fail_after_s`: 4s for fork, 12s for trio (was 30/12).

Also,
- `test_sigint_both_stream_types`: `assert 0` -> `pytest.fail()`, add
  TODO re `pytest.raises()`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `test_ext_types_over_ipc`: wrap `main()` in `fa_main()` with
  `trio.fail_after(2)` + commented `capfd.disabled()` investigation
  (pytest#14444).
- `test_basic_payload_spec`: add fixture param with note on fork-spawner
  hang prevention.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 18 commits May 13, 2026 12:03
(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `skipon_spawn_backend('subint')`: expand reason with specific
  analysis doc refs + GH issue #379 umbrella link.
- add `track_orphaned_uds_per_test` fixture via `usefixtures` to
  blame-attribute UDS sock-file orphans left by SIGKILL cancel
  cascades.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
For forking spawner backends that is.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `test_echoserver_detailed_mechanics`: add `is_forking_spawner`
  param, wrap `main()` in `fa_main()` with per-backend
  `trio.fail_after` (4s fork / 1s trio) to cap cancel-cascade
  teardown that compounds under forkserver.
- `test_sigint_closes_lifetime_stack`: swap `start_method` param
  for `is_forking_spawner`, pre-init `tmp_file`/`ctx` to `None` so
  KBI firing before `open_context` body doesn't `UnboundLocalError`,
  add `pytest.fail` guard for the spawn-time IPC race case, arm
  `signal.alarm` AFK-safety cap (10s) under fork backends

Also,
- `pytestmark`: add `track_orphaned_uds_per_test` +
  `detect_runaway_subactors_per_test` fixtures.
- `delay()`: hardcode `return 1e3` at top (debug override still in
  place).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Extract all pure-Python diagnostic helpers (`dump_proc_tree`,
`dump_hung_state`, `scan_bindspace`, `dump_all`, `resolve_pids`,
`ensure_sudo_cached`, etc.) from the xonsh xontrib into a new
`tractor/_testing/trace.py` module so the same logic is callable
from both the `acli.*` terminal aliases AND in-test capture-on-hang
fixtures.

Deats,
- `_testing/trace.py`: new module (1171 lines) — proc-tree walker,
  hung-state dumper, bindspace scanner, `dump_all()` snapshot
  archiver, `AFKAlarmTimeout` exc, `fail_after_w_trace()` async CM
  (trio `fail_after` + auto-snapshot on `TooSlowError`),
  `afk_alarm_w_trace()` sync CM (`signal.alarm` + snapshot on
  `SIGALRM`), plus pytest fixture wrappers for both.
- `_testing/pytest.py`: re-export the two fixtures via `from .trace
  import` so pytest plugin-discovery picks them up.
- `tractor_diag.xsh`: thin terminal wrappers that import from
  `_testing.trace` — drops ~627 lines of inline impl. Add
  `acli.dump_all` alias for full snapshot-bundle CLI access.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `_find_tractor_strays()`: scan `/proc/*/cmdline` for
  `tractor._child` procs NOT in the walk's `seen` set — surfaces
  ghost subactor trees from prior test runs (cross-test launchpad
  contamination).
- `dump_proc_tree(include_strays=True)`: refactor classification
  into `_classify_walk()` closure, walk stray roots as additional
  trees, emit stray-root summary in header. Also: `tractor._child`
  procs reparented to init are now always classified as orphans
  regardless of cgroup-slice (leaked subactor ≠ desktop-launched
  app).
- `_do_capture_snapshot()`: use `sys.__stderr__` to bypass pytest
  `--capture=sys` redirection so snapshot paths always land on the
  real terminal
- `fail_after_w_trace()`: capture diag snapshot on
  non-`TooSlowError` exceptions when the `fail_after` scope's
  cancel had already fired (e.g. nursery wraps `Cancelled` into a
  `BaseExceptionGroup` that escapes before `TooSlowError` can be
  raised).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
- `_testing/trace.py`: add `_SNAPSHOT_INDEX` session- scoped list
  populated by `_do_capture_snapshot()` on each successful dump;
  add TODO for future `TRACTOR_TRACE_HOLD=1` pause-on-hang mode
- `_testing/pytest.py`: add `pytest_terminal_summary` hook that
  prints all captured snapshot dirs at end-of-session so paths
  don't get buried in scrollback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`reap(include_descendants=True)` now expands each orphan-root pid
into its full psutil subtree before delivering SIGINT, so a
multi-level leaked actor-tree gets torn down in a single pass
instead of requiring repeated calls (each pass kills the current
`ppid==1` level, the level below becomes init-adopted, etc.).

Falls back to the original flat `pids` list when `psutil` is
unavailable. Emits a log line when expansion adds descendant pids.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Post-yield now also reaps init-adopted (`ppid==1`) tractor procs
that appeared during the test — leaked subactors whose mid-tier
parent died during cascade teardown, reparenting them to init.
Pre-yield snapshot of existing orphans scopes reap to THIS test's
leaks only, avoiding reap of unrelated tractor uses (piker, etc.)
on the box.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Append "Snapshot evidence (2026-05-13)" section to
`cancel_cascade_too_slow_under_main_thread_forkserver_issue.md`
documenting `fail_after_w_trace` diag capture results for
`test_nested_multierrors` under the MTF backend — reproduction cmd,
ptree analysis, observed hang signature, and updated triage plan.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Deats,
- `pytestmark`: enrich `skipon_spawn_backend('subint')` reason with
  conc-anal doc refs + GH#379 link, add `reap_subactors_per_test`,
  `track_orphaned_uds_per_test`,
  `detect_runaway_subactors_per_test` fixtures
- `test_nested_multierrors`: parametrize over `depth` `{1, 3}`, add
  MTF `xfail(strict=False)` with detailed race-window comment
  explaining the BEG shape mismatch, wrap body in
  `fail_after_w_trace` with per-backend timeout budget, bump
  `@tractor_test(timeout=10)`, drop old multiprocessing depth
  special-casing
- `test_multierror_fast_nursery`: wrap in
  `fail_after_w_trace(30.0)`, accept `TooSlowError` in
  `pytest.raises`, surface explicit `pytest.fail` on hang
- `test_cancel_while_childs_child_in_sync_sleep`: swap
  `spawn_backend` param for `is_forking_spawner`, widen
  `fail_after` delay for fork-based spawners
- `test_remote_error`, `test_multierror`,
  `test_cancel_infinite_streamer`, `test_some_cancels_all`: add
  `set_fork_aware_capture` fixture param
- Drop commented-out per-test `skipon_spawn_backend` blocks (now
  covered by module-level `pytestmark`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Replace inline `trio.fail_after` + manual `signal.alarm` guard with the
`_testing.trace` CM helpers that auto-capture a full ptree/wchan/py-spy
diag snapshot to disk on timeout.

Deats,
- inner guard: `trio.fail_after` → `fail_after_w_trace` (async CM,
  captures on `TooSlowError`).
- outer AFK guard: raw `signal.alarm` → `afk_alarm_w_trace` (sync
  CM, captures on `SIGALRM`), only armed under fork backends.
  Extracts `_run_and_match()` helper to keep branching clean.
- bump `fail_after_s` from 4/12 → 8/20 to stop borderline flakes
  while diag harness accumulates evidence.
- drop `_DIAG_CAP_S` var + manual signal import (now internal to
  `afk_alarm_w_trace`).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Only flag `tractor._child` procs as cross-test ghosts of
THIS run if `ppid==1` (init-adopted real leak) or `ppid`
is in the walk's `seen` set (descendant we missed via
race).

Previously, procs whose `ppid` points to some OTHER live non-`pytest`
(in the use of `acli.ptree pytest`) process belong to a different
tractor app (`piker`, another `pytest` shell, a long-running tractor
daemon) and were being falsely flagged as cross-test ghosts.

Deats,
- post-cmdline-match check via `_ppid_from_proc(pid)`,
  short-circuit on `None` (proc died in-flight).
- expand module docstring to spell out the ownership
  filter rule + its rationale.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Per-terminal optimized `watch`-like xonsh alias that
runs an arbitrary callable alias in a loop inside the
alt-screen buffer with flicker-free repaint. Supersedes
the inline `acli.ptree` polling .xsh snippet (removed
from `_ptree` docstr in favor of
`acli.watch acli.ptree pytest`).

Deats,
- alt-screen entry/exit (`\033[?1049h/l`) + cursor-hide
  (`\033[?25l/h`) wrapped in try/finally so Ctrl-C always
  returns to a pristine shell.
- per-frame draw uses cursor-home (`\033[H`) + per-line
  EL (`\033[K` before each `\n`) + post-draw erase-down
  (`\033[J`) → stale tail chars from a longer prior
  frame are obvi cleared; no full-screen flash.
- SIGWINCH-aware: terminal resize sets a flag, next
  frame does a full clear (`\033[H\033[2J`) instead of
  the cheap cursor-home path.
- Ctrl-C handling: install `signal.default_int_handler`
  so `KeyboardInterrupt` lands cleanly; prior handler
  restored on exit.
- Output capture: redirect the alias's stdout to
  `StringIO` per frame so we can post-process the EL
  fix. Aliases writing directly to `sys.stdout.buffer`
  / `os.write(1)` bypass capture — EL-fix won't apply
  but loop still works.
- Alias unwrap: xonsh stores callables as either a bare
  callable OR `[fn, *preset_args]`. Both forms handled;
  subprocess-style aliases rejected w/ a friendly err
  msg.
- `argparse` w/ `-n`/`--interval` (default 0.3s); rest
  of argv forwarded as alias args.
- Reg `'acli.watch': watch` in `_TCLI_ALIASES`.

Other,
- Tn `_ptree` `args: list[str]` param.
- Mod-header `Provides:` block updated w/ `acli.watch`
  entry.
- Top-level imports: `os`, `sys`, `signal`, `time`,
  `typing.Callable`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adopt the `_testing.trace` CM helpers in two MTF-hang-prone
tests so on-timeout we get a fresh
`ptree`/`wchan`/`py-spy` diag snapshot on disk instead of
opaque pytest timeout-kills. Same shape as bd07a95 for
`test_dynamic_pub_sub`.

Deats,
- `test_echoserver_detailed_mechanics`:
  * inner `trio.fail_after` → `fail_after_w_trace`. Adds
    `fail_after_w_trace: FailAfterWTraceFactory` fixture
    param.
  * mv per-backend `timeout` calc to top of test body (was
    interleaved w/ helper defs).
  * factor deep
    `open_nursery`/`open_context`/`open_stream` body into
    `_body()` so the wrapping `main()` stays a 2-liner —
    keeps the nested-CM block at its natural indent level
    instead of pushing it under yet another `async with`.
  * drop `with_timeout: bool` knob + `fa_main()` helper
    (knob was hard-coded `True`).
- `test_sigint_closes_lifetime_stack`:
  * outer `signal.alarm`/`try`/`finally` → single
    `afk_alarm_w_trace(10)` CM. Adds
    `afk_alarm_w_trace: AfkAlarmWTraceFactory` fixture
    param.
  * drop `_AFK_CAP_S` + `armed_alarm` vars (CM owns both).
  * explanatory comment refreshed to mention
    `AFKAlarmTimeout` + the disk-snapshot side effect.

Other,
- Drop debug `return 1e3` short-circuit from `delay()`
  fixture — snuck in as a scratch line, was clobbering the
  proper `debug_mode`-branched return.
- Top-level import: `FailAfterWTraceFactory`,
  `AfkAlarmWTraceFactory` from `tractor._testing.trace`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cancellation SC teardown semantics and anti-zombie semantics enhancement New feature or request python_updates spawning of processes, (shm) threads, tasks on varied (OS-specific) backends the_AIs_are_taking_over slowly conceding to the reality the botz mk us more productive, but we require SC to avoid skynet..

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spawn-time boot-death (rc=2) under rapid same-name spawn against a registrar

1 participant