gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork() #146316

Draft
gpshead wants to merge 1 commit into python:main from gpshead:gh-146313-single

gpshead (Member) commented Mar 23, 2026

Problem

ResourceTracker.__del__ (added in gh-88887) calls os.waitpid(pid, 0)
which blocks indefinitely if a process created via os.fork() still
holds the tracker pipe's write end. The tracker never sees EOF, never
exits, and the parent hangs at interpreter shutdown.

Root cause

Three requirements conflict:

- gh-88887 wants the parent to reap the tracker to prevent zombies
- gh-80849 wants mp.Process(fork) children to reuse the parent's
  tracker via the inherited pipe fd
- gh-146313 shows the parent can't block in waitpid() if a child
  holds the fd -- the tracker won't see EOF until all copies close
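
The EOF behaviour behind the third point can be reproduced outside multiprocessing. The following is a minimal sketch, not code from this PR: the helper names eof_visible and demo are made up for illustration, and it assumes a POSIX system where os.fork() is available. It shows that a reader on a pipe does not see EOF while any forked child still holds a copy of the write end.

```python
import os
import select


def eof_visible(read_fd, timeout=0.2):
    """Return True if read_fd reports EOF within `timeout` seconds."""
    ready, _, _ = select.select([read_fd], [], [], timeout)
    # A pipe that is readable but yields no bytes has no writers left (EOF).
    return bool(ready) and os.read(read_fd, 1) == b""


def demo():
    r, w = os.pipe()                  # stands in for the tracker pipe
    release_r, release_w = os.pipe()  # lets the parent tell the child to exit
    pid = os.fork()
    if pid == 0:
        # Child: inherits a copy of `w` and keeps it open until released.
        os.close(r)
        os.close(release_w)
        os.read(release_r, 1)         # returns once the parent closes release_w
        os._exit(0)
    os.close(release_r)
    os.close(w)                       # parent drops its own write end
    before = eof_visible(r)           # child still holds a copy: no EOF yet
    os.close(release_w)               # release the child; its copy closes on exit
    os.waitpid(pid, 0)
    after = eof_visible(r)            # all write ends are closed now
    os.close(r)
    return before, after
```

In this sketch the reader plays the role of the tracker process: it only sees EOF once the forked child has gone away, which is why a blocking waitpid() on the tracker can hang.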

Fix

Two layers:

Timeout safety-net. _stop_locked() gains a wait_timeout parameter.
When called from __del__, it polls with WNOHANG using exponential
backoff for up to 1 second instead of blocking indefinitely.
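
As an illustration of the polling approach described above, here is a rough sketch of a bounded reap. It is not the code in this PR: the function name reap_with_timeout and the delay parameters are assumptions chosen for the example; only os.waitpid() with os.WNOHANG and the 1 second bound come from the description.

```python
import os
import time


def reap_with_timeout(pid, timeout=1.0, initial_delay=0.001, max_delay=0.1):
    """Poll for child `pid` to exit.

    Returns the wait status if the child was reaped, or None if it was
    still running when `timeout` seconds had elapsed.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        # With WNOHANG, waitpid() returns (0, 0) if the child has not exited.
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return status
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Exponential backoff, capped so we never sleep past the deadline.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

A caller that gets None back would leave the child unreaped rather than block, which matches the "bounded wait" trade-off described here.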

At-fork handler. An os.register_at_fork(after_in_child=...) handler
closes the inherited pipe fd in the child unless a preserve flag is
set. popen_fork.Popen._launch() sets the flag before its fork so
mp.Process(fork) children keep the fd and reuse the parent's tracker
(preserving gh-80849). Raw os.fork() children close the fd, letting
the parent reap promptly.
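
The handler logic can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the class TrackerPipe and its managed_fork() method are hypothetical, and the preserve flag is modelled as a plain attribute, whereas the real change may store it differently (the review thread mentions _fork_intent.preserve_fd). Only os.register_at_fork(after_in_child=...) is taken from the description.

```python
import os


class TrackerPipe:
    """Hypothetical stand-in for the tracker's write-end bookkeeping."""

    def __init__(self, fd):
        self.fd = fd
        # Hypothetical flag: True only around a fork that should keep the fd.
        self.preserve_fd = False
        os.register_at_fork(after_in_child=self._after_fork_in_child)

    def _after_fork_in_child(self):
        # Runs in the child before os.fork() returns there.
        if self.preserve_fd:
            # Managed fork: keep the inherited fd and reset the flag.
            self.preserve_fd = False
            return
        if self.fd is not None:
            # Raw os.fork(): drop our copy so the parent's reader can see EOF.
            try:
                os.close(self.fd)
            except OSError:
                pass
            self.fd = None

    def managed_fork(self):
        """Fork while asking the child to keep the inherited fd."""
        self.preserve_fd = True
        try:
            return os.fork()
        finally:
            # Reset in the parent (and, harmlessly, again in the child).
            self.preserve_fd = False
```

Note that handlers registered with os.register_at_fork() cannot be unregistered, so in this sketch the bound method keeps the TrackerPipe instance alive for the life of the process.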

Result

  Scenario                                       Before      After
  Raw os.fork(), parent exits while child alive  deadlock    ~30ms reap
  mp.Process(fork), parent joins then exits      ~30ms reap  ~30ms reap
  mp.Process(fork), parent exits abnormally      deadlock    1s bounded wait
  No fork (gh-88887 scenario)                    ~30ms reap  ~30ms reap

The at-fork handler makes the timeout unreachable in all well-behaved
paths. The timeout remains as a safety net for abnormal shutdowns.

itamaro (Contributor) left a comment

Mostly nits; overall looks good, thanks!

        os.close(self._fd)
    except OSError:
        pass
    self._fd = None

Does setting this to None after closing expose us to race conditions?

Comment on lines +115 to +118
# - A raw os.fork() leaves the flag unset. We close the fd so
# the parent's __del__ can reap the tracker without waiting
# for us to exit. If we later need a tracker, ensure_running()
# will launch a fresh one.

Suggested change
- # - A raw os.fork() leaves the flag unset. We close the fd so
- # the parent's __del__ can reap the tracker without waiting
- # for us to exit. If we later need a tracker, ensure_running()
- # will launch a fresh one.
+ # - A raw os.fork() leaves the flag unset. We close the fd in the child after forking so
+ # the parent's __del__ can reap the tracker without waiting
+ # for the child to exit. If we later need a tracker, ensure_running()
+ # will launch a fresh one.

Comment on lines +108 to +113
# - multiprocessing.Process with the 'fork' start method sets
# _fork_intent.preserve_fd before forking. The child keeps the
# fd and reuses the parent's tracker (gh-80849). This is safe
# because multiprocessing's atexit handler joins all children
# before the parent's __del__ runs, so by then the fd copies
# are gone and the parent can reap the tracker promptly.

It might be safe, but it's unclear to me why this is desirable in the multiprocessing.Process scenario.
Why do we ever want forked children to share the resource tracker with the parent?

bedevere-app (bot) commented Mar 23, 2026

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.
