[SPARK-57191][YARN] Fix driver hang when MonitorThread encounters unexpected exception #56274
Open
shrirangmhalgi wants to merge 3 commits into
…xpected exception In YARN client mode, YarnClientSchedulerBackend's MonitorThread only catches InterruptedException/InterruptedIOException. If any other exception occurs (e.g., network failure, credential expiration) during application monitoring, the thread dies silently while the driver JVM hangs indefinitely due to non-daemon threads (SparkUI, heartbeats) keeping the process alive. This patch adds a NonFatal catch clause that logs the error and calls sc.stop() to ensure the driver shuts down cleanly instead of hanging.
@pan3793 / @sarutak / @LuciferYang Could you please review this small fix. The |
LuciferYang
reviewed
Jun 2, 2026
…path Wire the reflected thread into backend.monitorThread so that when sc.stop() triggers YarnClientSchedulerBackend.stop(), the full production path (stop -> monitorThread.stopMonitor()) is exercised.
sarutak
reviewed
Jun 2, 2026
it sounds like a good idea to expand |
…comment wording
- Use structured logging API (logError(log"...", e)) per sarutak's review
- Add System.exit(1) when AM_CLIENT_MODE_EXIT_ON_ERROR is set, matching the existing happy-path behavior for FAILED/KILLED states
- Fix test comment: 'fatal error' -> 'unexpected non-fatal error'
Thanks @sarutak and @pan3793 for the reviews! All the feedback is addressed in the latest commit:
What changes were proposed in this pull request?
In YARN client mode, YarnClientSchedulerBackend's MonitorThread only catches InterruptedException/InterruptedIOException. If any other exception occurs during application monitoring (e.g., network failure, credential expiration, or other runtime errors), the thread dies silently. Since the driver JVM has active non-daemon threads (SparkUI, heartbeats), the process hangs indefinitely in a zombie state.
This patch adds a NonFatal catch clause that logs the error and calls sc.stop(), ensuring the driver shuts down cleanly.
Why are the changes needed?
In managed environments (cloud platform agents, workflow schedulers), a hung driver is indistinguishable from one doing legitimate post-execution work. This causes resource leakage, orphaned processes, and extended job timeout durations.
Does this PR introduce any user-facing change?
Yes. Previously, certain failures in the monitor thread caused the driver to hang forever. Now the driver shuts down cleanly with an error log.
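For readers who want to see the shape of the behavior change, here is a minimal, self-contained sketch. It is not the code from this patch: MonitorSketch, newMonitorThread, monitor, and stopContext are hypothetical stand-ins (for the monitor thread, Client.monitorApplication, and sc.stop() respectively), and logging is reduced to a println on stderr. It only illustrates the pattern of keeping the existing interrupt handling and adding a NonFatal branch that logs and stops.

```scala
import java.io.InterruptedIOException
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

object MonitorSketch {
  // Hypothetical helper: `monitor` stands in for Client.monitorApplication,
  // `stopContext` stands in for sc.stop(). Neither is Spark's real API.
  def newMonitorThread(monitor: () => Unit, stopContext: () => Unit): Thread = {
    val t = new Thread(() => {
      try {
        monitor()
      } catch {
        // Interrupts are the expected shutdown signal, so they are swallowed.
        // This case must come first: InterruptedIOException is an IOException
        // and would otherwise match NonFatal below.
        case _: InterruptedException | _: InterruptedIOException => ()
        // Any other non-fatal error: log it and stop instead of dying silently.
        case NonFatal(e) =>
          System.err.println(s"Monitor thread failed unexpectedly: $e")
          stopContext()
      }
    }, "monitor-sketch")
    t.setDaemon(true)
    t
  }

  def main(args: Array[String]): Unit = {
    val stopped = new AtomicBoolean(false)
    val t = newMonitorThread(
      () => throw new RuntimeException("simulated monitoring failure"),
      () => stopped.set(true))
    t.start()
    t.join()
    println(s"stop called: ${stopped.get}")
  }
}
```

Without the NonFatal case, the RuntimeException in this sketch would terminate the thread without ever invoking stopContext, which is the silent-death scenario described above.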
How was this patch tested?
Added a new test in YarnClientSchedulerBackendSuite that mocks Client.monitorApplication to throw a RuntimeException and asserts that sc.stop() is called (via SparkListener.onApplicationEnd).
Was this patch authored or co-authored using generative AI tooling?
Yes.
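The testing approach described under "How was this patch tested?" can be sketched without Spark or a mocking library. This is an illustration under stated assumptions, not the test added to YarnClientSchedulerBackendSuite: EndListener and FakeContext are hypothetical stand-ins for SparkListener and SparkContext, and the mocked Client.monitorApplication is replaced by a direct throw. The point is the technique of observing an asynchronous stop through a listener callback and a latch with a timeout.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

object MonitorFailureTestSketch {
  // Hypothetical stand-in for SparkListener.onApplicationEnd.
  trait EndListener { def onApplicationEnd(): Unit }

  // Hypothetical stand-in for SparkContext: stop() notifies the listener once.
  final class FakeContext(listener: EndListener) {
    private val stopped = new AtomicBoolean(false)
    def stop(): Unit =
      if (stopped.compareAndSet(false, true)) listener.onApplicationEnd()
  }

  /** Returns true if the listener fired within `timeoutMs` milliseconds. */
  def stopObservedAfterFailure(timeoutMs: Long): Boolean = {
    val ended = new CountDownLatch(1)
    val ctx = new FakeContext(new EndListener {
      def onApplicationEnd(): Unit = ended.countDown()
    })
    val monitor = new Thread(() => {
      try {
        // Stand-in for a mocked monitoring call that throws.
        throw new RuntimeException("simulated monitoring failure")
      } catch {
        case NonFatal(_) => ctx.stop()
      }
    }, "monitor-under-test")
    monitor.setDaemon(true)
    monitor.start()
    // The callback runs on another thread, so wait with a bound
    // rather than asserting immediately.
    ended.await(timeoutMs, TimeUnit.MILLISECONDS)
  }

  def main(args: Array[String]): Unit =
    assert(stopObservedAfterFailure(5000L), "expected stop() after monitor failure")
}
```

Waiting on a latch with a timeout keeps the check from hanging forever if the stop path is never reached, which is the very failure mode the patch targets.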