experimental/ssh: surface connect failures instead of hanging #5456
Open
anton-107 wants to merge 4 commits into
Improve diagnostics when `databricks ssh connect` fails.

- Surface bootstrap job-run errors: when the SSH server bootstrap job reaches a terminal/failed state, fetch the run's state message, notebook error/trace, and run-page URL and show them, instead of the generic "server metadata error / metadata.json doesn't exist".
- Guard against hangs when the server is up but the handshake never completes (e.g. the container image has no OpenSSH server, so the server can't launch /usr/sbin/sshd and holds the websocket open). The client now aborts after a handshake timeout with an actionable hint, and exits promptly when the server closes the connection, instead of hanging until ssh's ConnectTimeout.
- Add an openssh-server hint when ssh exits with its connection-failure code (255).

Tests cover the failed-run message formatting, the fast exit on server close, and the handshake timeout.

WIP: the missing-sshd path still incurs a handshake-timeout wait; a server-side pre-flight sshd check (tracked separately) would turn it into an immediate, clear job failure.

Co-authored-by: Isaac
Commit: 6bec572
24 interesting tests: 15 SKIP, 7 KNOWN, 2 flaky
Top 28 slowest tests (at least 2 minutes):
Add FAILURE_MODES.md describing how to reproduce the two
`databricks ssh connect` container failure modes and what the user sees
in each: a missing OpenSSH server (sshd launched lazily, connection
drops after "Connected!") and a container that can't run the Python
bootstrap ("Could not reach driver" before sshd matters). Includes
example Dockerfiles, a cluster
spec, a working control image, how to read the bootstrap job logs, and
the local unit tests that cover the same paths. Linked from README.
Co-authored-by: Isaac
Why

Originated from a customer case: `databricks ssh connect` to a dedicated cluster whose
Docker container image was missing an OpenSSH server (`/usr/sbin/sshd`). The failure
surfaced terribly: either a generic `server metadata error / metadata.json doesn't exist`,
or the client just hung (the local `ssh` waited on its 360s `ConnectTimeout`). The root
cause was buried in the cluster's job-run logs.

This PR improves the diagnostics for `ssh connect` failures.

What
1. Surface bootstrap job-run errors. When the SSH server bootstrap job reaches a
   terminal/failed state, fetch the run's state message, notebook error/trace, and run-page
   URL and show them, both when the task terminates before reaching RUNNING and when it dies
   after, during metadata polling. (`experimental/ssh/internal/client/client.go`)
2. Guard against hangs when the server is up but the handshake never completes. If the
   container image has no `sshd`, the server can't launch `/usr/sbin/sshd` on connect and
   holds the websocket open, so both proxy loops block forever. The client now runs the
   proxy loops in the background and aborts after a handshake timeout (no server response)
   with an actionable hint, and also exits promptly when the server does close the
   connection. (`experimental/ssh/internal/proxy/client.go`)
3. openssh-server hint when `ssh` exits with its connection-failure code (255).
   (`spawnSSHClient`)

Tests
- `client_internal_test.go`: failed-run message formatting (state message + trace + run URL),
  truncation, terminal-state detection (SDK mocks).
- `proxy/client_server_test.go`: fast exit when the server closes the connection; abort on the
  handshake timeout when the server sends nothing.
All `experimental/ssh/...` tests pass; lint clean.

Status / follow-ups (WIP)
- The missing-`sshd` path still incurs a ~30s handshake-timeout wait before failing. The
  cleaner fix is a server-side pre-flight `sshd` check (fail the bootstrap job immediately
  with a clear message), tracked separately. That would turn this case into an instant,
  clear job failure handled by improvement #1.
- The handshake timeout could also be shortened or made configurable.
This pull request and its description were written by Isaac.