fix: daemonize core-agent and remove worker-owned process lifecycle #325
Open
jrothrock wants to merge 3 commits into
The core-agent is now launched with --daemonize true so the binary forks itself into a true daemon. No PID is retained by the worker.

Fixes two related cluster bugs:

1. Always-spawns bug: start() now returns early when peerRunning() is true instead of calling startProcess() unconditionally. Each worker that starts after the first simply connects to the already-running daemon.

2. Cluster-shutdown bug (issue #117): stopProcess() is now a no-op. Workers can no longer kill the shared core-agent when shutting down or crashing, which previously took down Scout for all other workers in the cluster until a replacement was spawned.

This mirrors how the Python agent handles the core-agent lifecycle: launch it, forget the PID, let the daemon manage itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… instantly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rker-owned

stopProcess() is now a no-op; no worker can kill the shared daemon. allowShutdown had no documented behavior and no remaining effect. Stripped from ScoutConfiguration type and all test callsites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6663542 to 25564af
Problem
Two related bugs in cluster deployments, both rooted in the same design flaw: workers treating the core-agent as a child process they own.

Always-spawns bug:
start() called peerRunning() to check if a socket existed, but then called startProcess() unconditionally regardless of the result. In a cluster, every worker that initialised Scout would spawn a new core-agent binary even if one was already running. The second agent would fail to bind the port and exit immediately, leaving that worker with a dead detachedProcess reference.

Cluster-shutdown bug (issue #117):
stopProcess() sent SIGKILL to the core-agent process group and removed the socket. When any worker called scout.shutdown() (including on crash/unhandled rejection with allowShutdown: true), it killed the shared core-agent for every other worker in the cluster. Other workers continued serving HTTP requests normally (Scout's send path is async and fire-and-forget after onFinished), but all APM data was silently dropped until a replacement core-agent spawned, which is exactly the window when visibility is most needed.

Fix
Mirrors the approach used by the Python agent (scout_apm_python/src/scout_apm/core/agent/manager.py):

- --daemonize true added to the binary args. The binary forks itself into a true background daemon; the spawned process exits immediately. No PID is retained by the worker.
- start() returns early when peerRunning() is true: the worker connects to the existing daemon rather than spawning another.
- stopProcess() is now a no-op. Workers have no PID to kill and should not attempt to manage the daemon's lifecycle.
- allowShutdown branch removed from Scout.shutdown(): disconnect (draining the socket pool) is the full extent of what a worker does on shutdown.
- Removed getProcess() and the detachedProcess field entirely.

Test changes
- test/util.ts cleanup(): calls agent.disconnect() instead of process.kill()
- … getProcess() to reflect the new behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
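
For reviewers, the lifecycle described under Fix can be sketched roughly as follows. This is a simplified in-memory model, not the code in this PR: WorkerAgent, Host, and simulateCluster are hypothetical names, and the real spawn of the binary (with --daemonize true among its args) is replaced by a counter so the behaviour can be checked without the core-agent installed.

```typescript
// Illustrative sketch only: every name here is hypothetical, not the agent's real code.

// State shared by all workers on one host.
interface Host {
  daemonUp: boolean;   // is a core-agent daemon currently listening?
  spawnCount: number;  // how many times the binary was launched
}

class WorkerAgent {
  private readonly host: Host;
  private connected = false;

  constructor(host: Host) {
    this.host = host;
  }

  // Stand-in for the socket-existence check.
  peerRunning(): boolean {
    return this.host.daemonUp;
  }

  // Stand-in for launching the binary with "--daemonize true".
  // The launched process exits once the daemon has forked, so no PID is kept.
  private startProcess(): void {
    this.host.spawnCount += 1;
    this.host.daemonUp = true;
  }

  start(): void {
    // Only spawn when no daemon is running; otherwise just connect.
    if (!this.peerRunning()) {
      this.startProcess();
    }
    this.connected = true;
  }

  // Intentionally a no-op: a worker must never kill the shared daemon.
  stopProcess(): void {}

  shutdown(): void {
    // Disconnecting is the full extent of what a worker does on shutdown.
    this.connected = false;
    this.stopProcess();
  }

  isConnected(): boolean {
    return this.connected;
  }
}

// Starts workerCount workers against one host, then shuts the first one down.
function simulateCluster(workerCount: number): Host {
  const host: Host = { daemonUp: false, spawnCount: 0 };
  const workers: WorkerAgent[] = [];
  for (let i = 0; i < workerCount; i++) {
    const worker = new WorkerAgent(host);
    worker.start();
    workers.push(worker);
  }
  const first = workers[0];
  if (first) {
    first.shutdown();
  }
  return host;
}
```

In this model, three workers produce a single spawn, and the daemon stays up after one worker shuts down. Those are the two properties the always-spawns and cluster-shutdown fixes are meant to guarantee.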