fix: daemonize core-agent and remove worker-owned process lifecycle#325

Open
jrothrock wants to merge 3 commits into master from fix/daemonize-core-agent

@jrothrock
Collaborator

Problem

Two related bugs in cluster deployments, both rooted in the same design flaw — workers treating the core-agent as a child process they own:

Always-spawns bug: start() called peerRunning() to check whether a socket existed, but then called startProcess() regardless of the result. In a cluster, every worker that initialised Scout would spawn a new core-agent binary even if one was already running. The second agent would fail to bind the port and exit immediately, leaving that worker with a dead detachedProcess reference.

Cluster-shutdown bug (issue #117): stopProcess() sent SIGKILL to the core-agent process group and removed the socket. When any worker called scout.shutdown() (including on crash/unhandled rejection with allowShutdown: true), it killed the shared core-agent for every other worker in the cluster. Other workers continued serving HTTP requests normally (Scout's send path is async and fire-and-forget after onFinished), but all APM data was silently dropped until a replacement core-agent spawned — exactly the window when visibility is most needed.
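
To make the two failure modes concrete, here is a minimal in-memory sketch of the pre-fix behaviour. It is an illustration only: `FakeMachine`, `LegacyWorker`, and their members are hypothetical names, the methods are synchronous for brevity, and no real process or socket is involved.

```typescript
// Hypothetical stand-in for one machine: at most one core-agent can hold the socket.
class FakeMachine {
  agentAlive = false;
  spawnAttempts = 0;

  socketExists(): boolean {
    return this.agentAlive;
  }

  // Returns true if the new process bound the socket, false if it exited at once.
  spawnAgent(): boolean {
    this.spawnAttempts += 1;
    if (this.agentAlive) {
      return false; // a second agent cannot bind and exits immediately
    }
    this.agentAlive = true;
    return true;
  }

  killAgent(): void {
    this.agentAlive = false;
  }
}

// Simplified model of the pre-fix worker described above.
class LegacyWorker {
  machine: FakeMachine;
  holdsDeadReference = false;

  constructor(machine: FakeMachine) {
    this.machine = machine;
  }

  start(): void {
    this.machine.socketExists(); // the check ran, but its result was ignored
    const bound = this.machine.spawnAgent(); // spawned regardless
    this.holdsDeadReference = !bound;
  }

  shutdown(): void {
    this.machine.killAgent(); // took the shared agent down with the worker
  }
}

const machine = new FakeMachine();
const workerA = new LegacyWorker(machine);
const workerB = new LegacyWorker(machine);
workerA.start();
workerB.start(); // always-spawns bug: a second, doomed spawn attempt
workerA.shutdown(); // cluster-shutdown bug: workerB loses its agent too
```

In this model, `workerB` ends up with a dead reference after the second spawn attempt, and once `workerA` shuts down no agent is left for `workerB` to report to.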

Fix

Mirrors the approach used by the Python agent (scout_apm_python/src/scout_apm/core/agent/manager.py):

  • --daemonize true added to the binary args. The binary forks itself into a true background daemon; the spawned process exits immediately. No PID is retained by the worker.
  • start() returns early when peerRunning() is true — the worker connects to the existing daemon rather than spawning another.
  • stopProcess() is now a no-op. Workers have no PID to kill and should not attempt to manage the daemon's lifecycle.
  • allowShutdown branch removed from Scout.shutdown() — disconnect (draining the socket pool) is the full extent of what a worker does on shutdown.
  • Removed getProcess() and detachedProcess field entirely.
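
The control flow described in the bullets above can be sketched as follows. This is a simplified model and not the code in this PR: `FakeHost`, `WorkerAgent`, and `launchDaemon` are hypothetical names, the methods are synchronous for brevity, and only the `--daemonize true` arguments named in the description are shown.

```typescript
// Hypothetical in-memory stand-in for the host; not part of the real agent.
class FakeHost {
  daemonRunning = false;
  launchCount = 0;
  lastArgs: string[] = [];

  socketExists(): boolean {
    return this.daemonRunning;
  }

  // Models launching the binary: the launcher exits, the daemon stays up.
  launchDaemon(args: string[]): void {
    this.launchCount += 1;
    this.lastArgs = args;
    this.daemonRunning = true;
  }
}

class WorkerAgent {
  host: FakeHost;

  constructor(host: FakeHost) {
    this.host = host;
  }

  peerRunning(): boolean {
    return this.host.socketExists();
  }

  start(): void {
    if (this.peerRunning()) {
      return; // a daemon already exists: connect to it, do not spawn another
    }
    // Only the flag named in this PR is shown; other arguments are omitted.
    this.host.launchDaemon(["--daemonize", "true"]);
  }

  stopProcess(): void {
    // Intentionally empty: the worker holds no PID and does not own the daemon.
  }
}

const sharedHost = new FakeHost();
const workerOne = new WorkerAgent(sharedHost);
const workerTwo = new WorkerAgent(sharedHost);
workerOne.start(); // launches the daemon
workerTwo.start(); // finds the peer and returns early
workerOne.stopProcess(); // leaves the daemon running for workerTwo
```

With two workers sharing one host, only the first `start()` launches anything, and `stopProcess()` leaves the daemon available to the remaining worker.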

Test changes

  • test/util.ts cleanup(): calls agent.disconnect() instead of process.kill()
  • Updated two tests that relied on getProcess() to reflect the new behaviour
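
As a rough illustration of the teardown change, a disconnect-only helper might look like the sketch below. The `TestAgent` interface, the `cleanup` signature, and `FakeAgent` are assumptions made for this example; the actual helper in test/util.ts may differ, and `disconnect()` is synchronous here only to keep the sketch short.

```typescript
// Hypothetical minimal view of the agent as seen by a test helper.
interface TestAgent {
  disconnect(): void; // synchronous here for brevity
}

// Teardown helper: release this test's connections and nothing else.
// It never signals or kills the core-agent daemon.
function cleanup(agent: TestAgent | undefined): void {
  if (agent === undefined) {
    return;
  }
  agent.disconnect();
}

// Fake agent that records how often it was asked to disconnect.
class FakeAgent implements TestAgent {
  disconnectCalls = 0;

  disconnect(): void {
    this.disconnectCalls += 1;
  }
}

const fakeAgent = new FakeAgent();
cleanup(fakeAgent);
cleanup(undefined); // tolerated: nothing to tear down
```

Because teardown only disconnects, tests that share a daemon no longer interfere with each other when one finishes first.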

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jrothrock and others added 3 commits May 29, 2026 16:39
The core-agent is now launched with --daemonize true so the binary
forks itself into a true daemon. No PID is retained by the worker.

Fixes two related cluster bugs:

1. Always-spawns bug: start() now returns early when peerRunning()
   is true instead of calling startProcess() unconditionally. Each
   worker that starts after the first simply connects to the already-
   running daemon.

2. Cluster-shutdown bug (issue #117): stopProcess() is now a no-op.
   Workers can no longer kill the shared core-agent when shutting
   down or crashing, which previously took down Scout for all other
   workers in the cluster until a replacement was spawned.

This mirrors how the Python agent handles the core-agent lifecycle:
launch it, forget the PID, let the daemon manage itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… instantly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rker-owned

stopProcess() is now a no-op; no worker can kill the shared daemon.
allowShutdown had no documented behavior and no remaining effect.
Stripped from ScoutConfiguration type and all test callsites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jrothrock jrothrock changed the base branch from master to docs/update-readme May 29, 2026 22:28
@jrothrock jrothrock force-pushed the fix/daemonize-core-agent branch 2 times, most recently from 6663542 to 25564af Compare May 29, 2026 22:58
@jrothrock jrothrock changed the base branch from docs/update-readme to master May 29, 2026 22:58