Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
149 changes: 149 additions & 0 deletions experimental/ssh/FAILURE_MODES.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
# Reproducing `databricks ssh connect` failure modes

This guide documents container/cluster misconfigurations that make `databricks ssh connect`
fail, how to reproduce each one, the symptom the user sees, and where the real error lives. It
is primarily a testing aid for the SSH feature's error-handling paths.

For the connection flow and architecture, see [README.md](./README.md).

## Background: where failures surface

The bootstrap is a **Python notebook job** that starts `databricks ssh server` on the cluster.
The server publishes its port to the workspace (`metadata.json`), the client reads it, prints
`Connected!`, and spawns `ssh`. The SSH daemon (`/usr/sbin/sshd`) is launched **lazily, per
client connection** (see `internal/server/sshd.go` and `internal/proxy/server.go`). Because of
this ordering, different misconfigurations fail at different stages:

| Stage | Needs | Failure mode if missing |
| --- | --- | --- |
| Bootstrap job runs | a working Databricks **Python** runtime in the image | [Mode 2](#mode-2-container-cant-run-the-python-bootstrap) |
| Per-connection SSH | **`/usr/sbin/sshd`** (OpenSSH server) in the image | [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) |

## Prerequisites

- A workspace with **Databricks Container Services** (custom Docker images) enabled.
- Permission to create a **dedicated (single-user)** cluster.
- A dev build of the CLI. See the *Development* section of [README.md](./README.md):
```shell
./task build snapshot-release
./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist --debug
```
- A container registry the workspace can pull from (e.g. a public Docker Hub repo) to host the
test images below. Build them for the cluster's architecture (`linux/amd64` on most clouds):
```shell
docker buildx build --platform linux/amd64 -t <namespace>/<image>:<tag> --push .
```

The cluster specs below use a single-node dedicated cluster. Adjust `node_type_id` and
`spark_version` for your cloud and DBR version:

```json
{
"cluster_name": "ssh-failure-repro",
"spark_version": "16.4.x-scala2.12",
"node_type_id": "<your-cloud-node-type>",
"num_workers": 0,
"data_security_mode": "SINGLE_USER",
"single_user_name": "<you@example.com>",
"spark_conf": { "spark.databricks.cluster.profile": "singleNode", "spark.master": "local[*, 4]" },
"custom_tags": { "ResourceClass": "SingleNode" },
"autotermination_minutes": 60,
"docker_image": { "url": "<namespace>/<image>:<tag>" }
}
```

Create it with `databricks clusters create --json @cluster.json --no-wait` and wait for the
`RUNNING` state (a custom-container pull can take several minutes).

## Mode 1: container missing the OpenSSH server (`sshd`)

A notebook-capable image that does **not** ship `openssh-server`. Build it by removing the SSH
server from an image that otherwise works:

```dockerfile
FROM databricksruntime/standard:16.4-LTS
RUN (apt-get remove -y openssh-server || true) \
&& rm -f /usr/sbin/sshd /usr/bin/sshd
```

Create a cluster on this image, then:

```shell
./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
```

**Symptom.** The bootstrap job succeeds and publishes metadata, so the client prints
`Connected!` — and then the connection drops. The server can't launch `/usr/sbin/sshd` for the
incoming connection and holds the websocket open, so historically the `ssh` client **hung**
until its `ConnectTimeout`. The real error,
`failed to start SSHD process: ... /usr/sbin/sshd: no such file or directory`, is only written
to the bootstrap job's **stdout logs** while the job is still `RUNNING` — it is never a failed
job state.

**With the error-handling improvements** the client aborts after a handshake timeout (no SSH
banner from the server) with an actionable hint to install `openssh-server`, and exits
promptly instead of hanging.

**Fix.** Install `openssh-server` in the image (`apt-get install -y openssh-server`).

## Mode 2: container can't run the Python bootstrap

A bare/minimal base that lacks a working Databricks Python runtime. The simplest example is
`databricksruntime/rbase:16.4-LTS` used directly as the cluster image (it is an R *base* layer;
notably it has no functioning `/databricks/python` notebook-execution environment).

Create a cluster on `databricksruntime/rbase:16.4-LTS`, then:

```shell
./cli ssh connect --cluster=<cluster-id> --releases-dir=./dist
```

**Symptom.** The bootstrap is a Python notebook, but the image can't execute notebook commands,
so the job fails with `Could not reach driver of cluster <id>`. The SSH server never starts and
never publishes metadata, so the client fails with
`server metadata error / ... metadata.json doesn't exist` — **before** the `sshd` step is ever
reached. (A trivial `print(...)` notebook job submitted to the same cluster fails the same way,
which is a quick way to confirm the image, not the SSH feature, is at fault.)

**With the error-handling improvements** the client fetches the failed run's state message,
notebook error/trace, and run-page URL and shows them instead of the generic metadata error.

**Fix.** Build on a notebook-capable base (e.g. `databricksruntime/standard:...`) or otherwise
provide a working Databricks Python environment, in addition to `openssh-server`.

## Working control

`databricksruntime/standard:16.4-LTS` ships **both** a working Python runtime **and** `sshd`,
so `ssh connect` to a cluster on it succeeds end to end. Use it as a baseline to confirm your
workspace, cluster spec, and dev build are healthy before reproducing a failure mode.

## Inspecting the bootstrap job logs

`ssh connect` prints `Job submitted successfully with run ID: <id>`. Inspect it with:

```shell
databricks jobs get-run <id> # open run_page_url in the UI
databricks jobs get-run-output <task-run-id> # task-run-id = .tasks[0].run_id of the run
```

Caveat: for a **running** server task, `get-run-output`'s `logs`/`error` are not populated —
the `sshd` error from [Mode 1](#mode-1-container-missing-the-openssh-server-sshd) lives in the
live notebook cell stdout / driver logs, not the Jobs run-output API. A failed run from
[Mode 2](#mode-2-container-cant-run-the-python-bootstrap) does populate the run's state message
and error.

## Reproducing locally, without a workspace

The proxy-layer behaviors have unit tests that don't need a cluster:

- `internal/proxy/client_server_test.go`
- `TestClientExitsWhenServerCommandFails` — server can't launch its command and closes the
connection; the client exits promptly.
- `TestClientTimesOutWhenServerSendsNothing` — server holds the connection open and sends
nothing (the Mode 1 shape); the client aborts on the handshake timeout.
- `internal/client/client_internal_test.go` — formatting of a failed bootstrap run's error
(state message, error trace, run-page URL) using SDK mocks.

```shell
go test ./experimental/ssh/...
```
3 changes: 3 additions & 0 deletions experimental/ssh/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@ databricks ssh connect --cluster=id
./cli ssh connect --cluster=<id> --releases-dir=./dist --debug # or modify ssh config accordingly
```

To reproduce and test the known `ssh connect` failure modes (container missing `sshd`, or a
container that can't run the Python bootstrap), see [FAILURE_MODES.md](./FAILURE_MODES.md).

## Design

High level:
Expand Down
135 changes: 123 additions & 12 deletions experimental/ssh/internal/client/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -489,16 +489,19 @@ func getServerMetadata(ctx context.Context, client *databricks.WorkspaceClient,
return wsMetadata.Port, string(bodyBytes), effectiveClusterID, nil
}

func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) error {
// submitSSHTunnelJob submits the bootstrap job and waits for the SSH server task to start.
// It returns the job run ID (when known) so callers can fetch and surface the run's error
// details if the server never comes up.
func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (int64, error) {
sessionID := opts.SessionIdentifier()
contentDir, err := sshWorkspace.GetWorkspaceContentDir(ctx, client, version, sessionID)
if err != nil {
return fmt.Errorf("failed to get workspace content directory: %w", err)
return 0, fmt.Errorf("failed to get workspace content directory: %w", err)
}

err = client.Workspace.MkdirsByPath(ctx, contentDir) //nolint:staticcheck // Deprecated in SDK v0.127.0. Migration to WorkspaceHierarchyService tracked separately.
if err != nil {
return fmt.Errorf("failed to create directory in the remote workspace: %w", err)
return 0, fmt.Errorf("failed to create directory in the remote workspace: %w", err)
}

sshTunnelJobName := "ssh-server-bootstrap-" + sessionID
Expand All @@ -514,7 +517,7 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,
Overwrite: true,
})
if err != nil {
return fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
return 0, fmt.Errorf("failed to create ssh-tunnel notebook: %w", err)
}

baseParams := map[string]string{
Expand Down Expand Up @@ -569,12 +572,13 @@ func submitSSHTunnelJob(ctx context.Context, client *databricks.WorkspaceClient,

waiter, err := client.Jobs.Submit(ctx, submitRequest)
if err != nil {
return fmt.Errorf("failed to submit job: %w", err)
return 0, fmt.Errorf("failed to submit job: %w", err)
}

cmdio.LogString(ctx, fmt.Sprintf("Job submitted successfully with run ID: %d", waiter.RunId))

return waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
// Return the run ID even on error so callers can fetch the run's failure details.
return waiter.RunId, waitForJobToStart(ctx, client, waiter.RunId, opts.TaskStartupTimeout)
}

func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, serverPort int, clusterID string, opts ClientOptions) error {
Expand Down Expand Up @@ -610,7 +614,18 @@ func spawnSSHClient(ctx context.Context, userName, privateKeyPath string, server
sshCmd.Stdout = os.Stdout
sshCmd.Stderr = os.Stderr

return sshCmd.Run()
err = sshCmd.Run()
// ssh reserves exit code 255 for its own connection-level failures (a remote command's exit
// code is passed through as-is, 0-254). The most common cause here is the cluster's container
// image missing an OpenSSH server, so the server can't launch sshd once we connect — the
// connection then drops right after "Connected!". Surface an actionable hint rather than
// leaving the user with ssh's opaque "Connection closed" message.
if exitErr, ok := errors.AsType[*exec.ExitError](err); ok && exitErr.ExitCode() == 255 {
cmdio.LogString(ctx, cmdio.Yellow(ctx, "The SSH connection closed unexpectedly. If it dropped right after connecting, "+
"the cluster's container image is likely missing an OpenSSH server: ensure 'openssh-server' "+
"is installed (it provides /usr/sbin/sshd), then check the SSH server job run logs."))
}
return err
}

func runSSHProxy(ctx context.Context, client *databricks.WorkspaceClient, serverPort int, clusterID string, opts ClientOptions) error {
Expand Down Expand Up @@ -691,9 +706,10 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
return sshTask, nil
}

// Check for terminal failure states
// Check for terminal failure states. Surface the run's actual error (e.g. a notebook
// traceback or "Could not reach driver") instead of a generic message.
if currentState == jobs.RunLifecycleStateV2StateTerminated {
return nil, retries.Halt(errors.New("task terminated before reaching running state"))
return nil, retries.Halt(fmt.Errorf("ssh server bootstrap job failed:\n%s", describeRunFailure(ctx, client, runID)))
}

// Continue polling for other states
Expand All @@ -703,6 +719,94 @@ func waitForJobToStart(ctx context.Context, client *databricks.WorkspaceClient,
return err
}

// maxRunFailureTraceBytes bounds how much of a failed run's error trace we print to the
// terminal; the full output is always available via the run page URL.
const maxRunFailureTraceBytes = 2000

// describeRunFailure fetches a failed bootstrap run's error details and formats them for the
// terminal. It is best-effort: any API error is folded into the returned text rather than
// propagated, so callers can always embed the result in their own error.
func describeRunFailure(ctx context.Context, client *databricks.WorkspaceClient, runID int64) string {
if runID == 0 {
return " (no job run ID available)"
}

run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
if err != nil {
return fmt.Sprintf(" could not fetch job run %d: %v", runID, err)
}

var b strings.Builder

// Locate the SSH server task to read its termination reason and per-task run output.
var sshTask *jobs.RunTask
for i := range run.Tasks {
if run.Tasks[i].TaskKey == sshServerTaskKey {
sshTask = &run.Tasks[i]
break
}
}

if sshTask != nil && sshTask.Status != nil && sshTask.Status.TerminationDetails != nil {
if msg := strings.TrimSpace(sshTask.Status.TerminationDetails.Message); msg != "" {
fmt.Fprintf(&b, " %s\n", msg)
}
}

// The notebook error/traceback carries the real cause (e.g. a Python exception).
outputRunID := runID
if sshTask != nil && sshTask.RunId != 0 {
outputRunID = sshTask.RunId
}
if output, err := client.Jobs.GetRunOutput(ctx, jobs.GetRunOutputRequest{RunId: outputRunID}); err == nil && output != nil {
if e := strings.TrimSpace(output.Error); e != "" {
fmt.Fprintf(&b, " %s\n", e)
}
if trace := strings.TrimSpace(output.ErrorTrace); trace != "" {
fmt.Fprintf(&b, "%s\n", truncateTail(trace, maxRunFailureTraceBytes))
}
}

if run.RunPageUrl != "" {
fmt.Fprintf(&b, " See the full job logs: %s", run.RunPageUrl)
}

if b.Len() == 0 {
return fmt.Sprintf(" job run %d failed; see run details in the workspace", runID)
}
return strings.TrimRight(b.String(), "\n")
}

// runFailureIfTerminated reports whether the bootstrap run has reached a terminal state (so the
// SSH server will never come up), returning a formatted failure description when it has.
func runFailureIfTerminated(ctx context.Context, client *databricks.WorkspaceClient, runID int64) (string, bool) {
if runID == 0 {
return "", false
}
run, err := client.Jobs.GetRun(ctx, jobs.GetRunRequest{RunId: runID})
if err != nil {
return "", false
}
for i := range run.Tasks {
if run.Tasks[i].TaskKey != sshServerTaskKey {
continue
}
if run.Tasks[i].Status != nil && run.Tasks[i].Status.State == jobs.RunLifecycleStateV2StateTerminated {
return describeRunFailure(ctx, client, runID), true
}
return "", false
}
return "", false
}

// truncateTail returns the last maxBytes of s, marking the cut when truncated.
func truncateTail(s string, maxBytes int) string {
if len(s) <= maxBytes {
return s
}
return " ...\n" + s[len(s)-maxBytes:]
}

func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceClient, version, secretScopeName string, opts ClientOptions) (string, int, string, error) {
sessionID := opts.SessionIdentifier()
// For dedicated clusters, use clusterID; for serverless, it will be read from metadata
Expand All @@ -712,7 +816,7 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
if errors.Is(err, errServerMetadata) {
cmdio.LogString(ctx, "Starting SSH server...")

err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
runID, err := submitSSHTunnelJob(ctx, client, version, secretScopeName, opts)
if err != nil {
return "", 0, "", fmt.Errorf("failed to submit and start ssh server job: %w", err)
}
Expand All @@ -729,10 +833,17 @@ func ensureSSHServerIsRunning(ctx context.Context, client *databricks.WorkspaceC
if err == nil {
cmdio.LogString(ctx, "Health check successful, starting ssh WebSocket connection...")
break
} else if retries < maxRetries-1 {
}
// The metadata never appears if the bootstrap job dies after reaching RUNNING.
// Surface the job's actual error instead of waiting out the full timeout with a
// generic "metadata.json doesn't exist" message.
if failure, terminated := runFailureIfTerminated(ctx, client, runID); terminated {
return "", 0, "", fmt.Errorf("ssh server bootstrap job failed:\n%s", failure)
}
if retries < maxRetries-1 {
time.Sleep(2 * time.Second)
} else {
return "", 0, "", fmt.Errorf("failed to start the ssh server: %w", err)
return "", 0, "", fmt.Errorf("failed to start the ssh server: %w\n%s", err, describeRunFailure(ctx, client, runID))
}
}
} else if err != nil {
Expand Down
Loading
Loading