
Add CLI tunnel and auth commands#130

Draft
drewr wants to merge 12 commits into main from cli-tunnel-and-auth

Conversation

@drewr
Contributor

@drewr drewr commented Mar 27, 2026

Summary

This PR ships the CLI client for Datum Connect tunneling — the headless equivalent of the desktop UI. It lets users authenticate, manage projects, and expose local services to public hostnames without launching the GUI.

Building

Rust tooling only (no Nix required):

cargo run -p datum-connect -- --help

Or with Nix:

nix run .#cli -- --help

Commands

auth

datum-connect auth login       # OAuth via browser; prompts to select a project after login
datum-connect auth logout      # Logs out and clears stored credentials
datum-connect auth status      # Shows authenticated user and active org/project
datum-connect auth list        # Shows the current authenticated user
datum-connect auth switch      # Logs out and re-authenticates; prompts to select a project

projects

datum-connect projects list    # Lists all orgs and projects; marks the active one with *
datum-connect projects switch  # Interactive prompt to change the active project

tunnel

datum-connect tunnel listen --endpoint 127.0.0.1:8080
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --label my-tunnel
datum-connect tunnel listen --endpoint 127.0.0.1:8080 --project <project-id>
datum-connect tunnel list
datum-connect tunnel update --id <id> --label new-name
datum-connect tunnel delete --id <id>

tunnel listen runs in the foreground. It creates or reuses a tunnel for the given endpoint, starts the heartbeat agent so the gateway has routing info, enables the tunnel, and polls until it is accepted and programmed before printing the public hostname. Ctrl+C disables the tunnel and exits.

The --project flag overrides the active project for a single invocation without changing the stored selection.
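The "polls until it is accepted and programmed" step can be sketched as a small polling helper. This is a simplified, synchronous stand-in; the real client is async and queries the control plane for the HTTPProxy's accepted/programmed conditions:

```rust
use std::time::{Duration, Instant};

/// Poll until a readiness check passes, or give up after `timeout`.
/// Stands in for the "poll until accepted and programmed" step.
fn wait_until_ready<F: FnMut() -> bool>(
    mut check: F,
    timeout: Duration,
) -> Result<Duration, String> {
    let start = Instant::now();
    loop {
        if check() {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err("timed out waiting for tunnel to be accepted and programmed".into());
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}

fn main() {
    // Simulate a tunnel that becomes accepted + programmed on the third poll.
    let mut polls = 0;
    let ready = wait_until_ready(|| { polls += 1; polls >= 3 }, Duration::from_secs(5));
    assert!(ready.is_ok());
    println!("Tunnel ready after {:?}", ready.unwrap());
}
```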

Project selection

The active project is stored in config.yml (default: ~/.local/share/Datum/config.yml, overridable via $DATUM_CONNECT_REPO). It is set interactively after auth login or auth switch, or explicitly with projects switch.
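As a rough illustration, the stored selection might look like this (the field names below are assumptions for illustration, not the actual schema):

```yaml
# Example ~/.local/share/Datum/config.yml (field names are illustrative;
# check your own file for the exact schema)
active_org: acme-corp
active_project: staging
```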

Example session

$ cargo run -p datum-connect -- auth login
# browser opens for OAuth
Logged in as Jane Smith (jane@example.com)

Select a project:
  [1] Acme Corp / production
  [2] Acme Corp / staging
Enter number [1-2]: 2
Selected project: Acme Corp / staging

$ cargo run -p datum-connect -- tunnel listen --endpoint 127.0.0.1:3000
Created tunnel:
  id: httpp-abc123
  label: f3a9c2e1b047

Your endpoint ID: 30a9ddf5...
Setting up tunnel...
Tunnel ready after 8 sec: https://f3a9c2e1b047.tunnels.datum.net
Press Ctrl+C to stop...

Bug fixes (found during testing)

  • Tunnels created from CLI never route traffic: CLI was missing the HeartbeatAgent that continuously patches status.connectionDetails on the connector. Without it the gateway has no routing info. Fixed: tunnel listen now starts the heartbeat and registers the project before enabling the tunnel.
  • Re-running tunnel listen on an existing endpoint always prompted for update: Random label was generated before checking for an existing tunnel, so it always differed. Fixed: label generation moved into the create-new path; existing tunnels reuse their stored label unless --label is explicitly given.
  • Tunnel delete silently no-ops when connector is missing: delete_project returned early if no connector was found, skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy. Fixed: connector lookup is only needed for post-deletion cleanup and no longer gates resource deletion.
  • Auto-generated label used tunnel-<u16> format: Collided visually with resource ID format. Switched to 12 hex chars of random entropy (e.g. a3f9c2e1b047).
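The new label format from the last fix can be sketched in a few lines. This std-only sketch derives entropy from `RandomState` (seeded by the OS per hasher); the PR itself hex-encodes random bytes via the `hex` crate:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Generate a 12-hex-char label (e.g. "a3f9c2e1b047") that cannot be
/// confused with the tunnel-<u16> resource ID format.
fn random_label() -> String {
    // Each RandomState carries fresh per-instance keys, so finish() on an
    // empty hasher yields a different u64 each call.
    let bits = RandomState::new().build_hasher().finish();
    let mut label = format!("{bits:016x}");
    label.truncate(12);
    label
}

fn main() {
    println!("new tunnel label: {}", random_label());
}
```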

Test plan

  • cargo run -p datum-connect -- auth login completes OAuth and prompts for project selection
  • projects list shows all orgs/projects with active one marked
  • projects switch persists new selection to config.yml
  • tunnel listen --endpoint 127.0.0.1:<port> creates tunnel, prints hostname, disables on Ctrl+C
  • Re-running tunnel listen on the same endpoint reuses the existing tunnel without prompting
  • tunnel listen --project <id> uses the specified project
  • tunnel list shows tunnels in the active project
  • tunnel delete removes a tunnel cleanly

@drewr drewr marked this pull request as draft March 27, 2026 20:14
@zachsmith1
Contributor

Do we want a separate CLI for tunnels, or do we want to bake this functionality into datumctl?

@drewr drewr force-pushed the cli-tunnel-and-auth branch from ea2df66 to 80ffdf7 Compare March 27, 2026 20:39
@drewr
Contributor Author

drewr commented Mar 27, 2026

Yeah, it's why this is a draft. I needed the functionality and didn't want to commit one way or the other yet. I explored doing it in datumctl and it would involve either replicating the Iroh sidecar in go or making the project hybrid with a rust component.

This method uses all the same machinery as the GUI which felt like a better first pass.

drewr added 2 commits March 27, 2026 15:51
- Add 'tunnel' subcommand to datum-connect CLI with:
  - 'tunnel list': read-only listing of tunnels (no side effects)
  - 'tunnel listen': create/update and run tunnel in foreground
  - 'tunnel update': update tunnel label/endpoint
  - 'tunnel delete': delete a tunnel
- Add 'nix run .#connect' app to flake.nix
- Split find_connector_readonly for list operations
- Remove side effects from tunnel list (no patching Connector)
- Listen command:
  - Generates random label if not provided
  - Confirms before updating existing tunnel
  - Handles Ctrl+C to disable tunnel on exit
- Add 'auth' subcommand to CLI with:
  - 'auth status': Show current authentication and selected context
  - 'auth login': Log in via browser OAuth with account picker
  - 'auth logout': Log out and clear credentials
  - 'auth list': Show current authenticated user
  - 'auth switch': Log out current user and prompt for new login

Also add is_authenticated(), login(), logout() methods to DatumCloudClient.
@drewr drewr force-pushed the cli-tunnel-and-auth branch from 80ffdf7 to 01c3ab8 Compare March 27, 2026 20:51
@drewr drewr self-assigned this Mar 27, 2026
@zachsmith1
Contributor

Yeah, the challenge is that the core stuff we need is in Rust, so we'll need some magic to make the UX good

@scotwells
Contributor

How does this interact with the GUI based application? Would auth be shared?

Since the GUI is locked to a specific project (because connectors are project-scoped resources), switching the authenticated user could break existing tunnels without the user knowing and it doesn't seem like we warn the user.

@drewr
Contributor Author

drewr commented Mar 27, 2026

It's all shared. I'll show what it looks like when Rust is done compiling...

drewr and others added 5 commits March 27, 2026 16:37
delete_project returned early when find_connector returned None,
skipping deletion of HTTPProxy/ConnectorAdvertisement/TrafficProtectionPolicy.

Connector lookup is only needed for post-deletion cleanup (deciding
whether to delete the shared connector). Move it into an Option and
gate the cleanup block on Some, so resource deletion always proceeds.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Three interrelated bugs fixed in the tunnel listen command:

- Random label was generated before checking for an existing tunnel,
  so re-running listen on the same endpoint always triggered the update
  prompt. Moved label generation into the create-new path only; existing
  tunnels reuse their stored label unless --label is explicitly provided
  and differs.

- Default label format changed from tunnel-<u16> (collides with resource
  ID format) to 12 hex chars of random entropy (e.g. a3f9c2e1b047).
  Adds hex as a dependency.

- tunnel listen was missing the HeartbeatAgent that continuously patches
  status.connectionDetails on the connector (relay URL, addresses, public
  key). Without it the gateway has no routing info and tunnels never
  carry traffic. Now starts the heartbeat and registers the project before
  enabling the tunnel, then polls until accepted+programmed before
  printing the hostname.

Also simplifies tunnel delete output: connector cleanup is an internal
detail, so "Deleted tunnel <id>" replaces "(connector deleted: false)".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  - After auth login/switch, prompt user to select an org and project
    and persist the selection as the active context
  - Store the selected context in config.yml instead of a separate file
  - Add --project flag to the tunnel command to override the active
    project for a single invocation
  - Add projects list and projects switch commands for managing the
    active project outside of the auth flow
  - Fix tunnel listen to print id and label after creation
@drewr
Contributor Author

drewr commented Mar 31, 2026

Here's a short demo of where I've gotten with this:

headless-tunnels-demo.mp4

@bmertens-datum

@drewr Nice demo.

@zachsmith1
Contributor

@drewr this is slick. I'm planning on splitting the app repo off from the gateway repo, and we should consider where we'd want this CLI to live. The last piece there would be a small enhancement around how we could inject this Rust binary into datumctl (if we want to)

@richardhenwood

I've had a moment to try this - and following your excellent demo video, and typing datum-connect -- tunnel listen --endpoint localhost:8080 I got a 'connector' appearing in the Datum cloud UI. This is a really powerful way to think about connections for me - so I'm very excited to play around :)

[screenshot: connector visible in the Datum Cloud UI]

FYI: this is on my Fedora 43 workstation.

@drewr
Contributor Author

drewr commented Apr 9, 2026

Great feedback @richardhenwood, thanks!

@drewr drewr assigned gianarb and unassigned drewr Apr 9, 2026
@drewr
Contributor Author

drewr commented Apr 9, 2026

@zachsmith1 wrote:

where we'd want this cli to live

I think if we factor out the local process to a standalone rust utility like you're proposing it makes more sense for this to live in datumctl. I originally went that direction but didn't want to either rewrite the iroh integration in go or repackage this in an awkward way.

@drewr
Contributor Author

drewr commented Apr 9, 2026

I've had some instability with this and had both gpt-5.4 and sonnet-4.6 chewing on it:

Found it. The UpstreamProxy authorizes incoming iroh connections by checking self.state.get().proxies, but the CLI tunnel listen flow never calls listen.set_proxy() to register the tunnel in local state. The gateway connects over iroh, the auth handler finds no matching proxy, returns Forbidden, and the gateway sees a connection reset.

Fix incoming.
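A minimal model of the missing registration step, using hypothetical type and method names (the real state and auth handler live in the iroh/proxy code): the auth handler only admits connections whose target is present in local proxy state, so `tunnel listen` must register the tunnel before accepting traffic.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the client's shared state.
#[derive(Default)]
struct LocalState {
    proxies: HashMap<String, String>, // endpoint -> tunnel id
}

impl LocalState {
    /// The step the CLI flow was missing: register the tunnel locally.
    fn set_proxy(&mut self, endpoint: &str, tunnel_id: &str) {
        self.proxies.insert(endpoint.to_string(), tunnel_id.to_string());
    }

    /// Mirrors the incoming-connection auth check: unknown targets are rejected.
    fn authorize(&self, endpoint: &str) -> Result<(), &'static str> {
        if self.proxies.contains_key(endpoint) { Ok(()) } else { Err("Forbidden") }
    }
}

fn main() {
    let mut state = LocalState::default();
    // Before the fix: nothing registered, so the gateway's connection was rejected.
    assert_eq!(state.authorize("127.0.0.1:8080"), Err("Forbidden"));
    // After the fix: register first, then incoming connections pass auth.
    state.set_proxy("127.0.0.1:8080", "httpp-abc123");
    assert_eq!(state.authorize("127.0.0.1:8080"), Ok(()));
    println!("auth model ok");
}
```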

@gianarb
Collaborator

gianarb commented Apr 9, 2026

There is a lot to unpack here in my opinion, much of it around product, so I am not sure I have enough context to help.

Something relevant is an old discussion we had in datum-cloud/enhancements#582, if you look for the ecosystem chapter:

Now my attention turns to "do we want to keep consistency in the ecosystem?". Do we want, for example, Datum Desktop to look at that file as well, so that a context switch in the CLI also switches context in Desktop?

In practice, what I was trying to highlight there is the model Kubernetes and other CLI tools developed: everything you run starts from a single source of truth (for Kubernetes it is the ~/.kube/config file). If we can agree on something similar, it will be a lot easier to bring other CTLs or applications into a consistent state.

It would also be a lot easier to push for a plugin ecosystem like the one kubectl and others developed, where binaries named kubectl-* get called from the main ctl. In this case we could release a binary datumctl-connect that would be callable as datumctl connect.

But if we cannot agree on some common practices, like authentication, the outcome for a user will be pretty poor; in that case I feel like we should just "give up" and release separate binaries, each working its own way.

I am not saying that we should already have the ability to switch and persist between accounts/instances, because I know we do not know that yet (datum-cloud/enhancements#653 (comment)), but since we do not know, maybe we can just take what we have today as the common denominator until we figure out what's next.

So the way I envision the evolution of this PR is a binary that contains only the business logic to manage tunnels and connections, and delegates authentication to the same login used by datumctl (or datumctl changes to use the same login used here and by the desktop app).

This is what I am trying to push toward, but as I said, product-wise I am not sure I have enough context to push in one direction versus another.

The gateway sends `CONNECT localhost:<port>` regardless of whether the
tunnel was registered with `localhost` or `127.0.0.1`, causing auth to
fail with Forbidden and the caller to see "upstream connect error or
disconnect/reset before headers."

Normalize `localhost`, `127.0.0.1`, and `::1` to a canonical form on
both sides of the host comparison in `tcp_proxy_exists`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewr
Contributor Author

drewr commented Apr 9, 2026

Gateway hostname normalization — needs investigation

While debugging an "upstream connect error or disconnect/reset before headers" report, I found the root cause in commit e2d868d: the gateway sends CONNECT localhost:<port> in the iroh HTTP CONNECT request regardless of what address was stored in the ConnectorAdvertisement (in this case 127.0.0.1). The strict string comparison in tcp_proxy_exists failed, returning Forbidden, which the gateway surfaces as a connection reset.

The client-side fix normalizes localhost, 127.0.0.1, and ::1 to a canonical form at comparison time, which handles the mismatch. But the gateway behavior is worth examining:

Observable behavior (from /tmp/datum-2026-04-09T19:45:27+00:00.log, line 8132):

handle request req=HttpRequest { version: HTTP/1.1, headers: {}, uri: localhost:3300, method: CONNECT }

The ConnectorAdvertisement had address: "127.0.0.1", but the gateway sent CONNECT localhost:3300.

Questions for the gateway team:

  • Is this intentional? Does the gateway always normalize 127.0.0.1 → localhost for loopback addresses?
  • Should the gateway instead use the ConnectorAdvertisement address verbatim in the CONNECT target?
  • If the gateway normalizes to localhost, should it normalize to 127.0.0.1 instead (since that's what users typically specify as the --endpoint)?

The client-side fix is defensive and correct either way, but if the gateway is doing unintended normalization, fixing it there would be cleaner and might surface other subtle issues.
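A sketch of the comparison-time normalization described above (function names are illustrative; the real check lives inside `tcp_proxy_exists`):

```rust
/// Canonicalize loopback spellings so the stored endpoint host and the
/// gateway's CONNECT target compare equal.
fn canonical_host(host: &str) -> &str {
    // Strip IPv6 brackets (e.g. "[::1]") before comparing.
    let bare = host.trim_start_matches('[').trim_end_matches(']');
    match bare {
        "localhost" | "127.0.0.1" | "::1" => "localhost",
        other => other,
    }
}

fn hosts_match(a: &str, b: &str) -> bool {
    canonical_host(a).eq_ignore_ascii_case(canonical_host(b))
}

fn main() {
    // The failing case from the log: the advertisement stored 127.0.0.1,
    // but the gateway sent CONNECT localhost:<port>.
    assert!(hosts_match("127.0.0.1", "localhost"));
    assert!(hosts_match("[::1]", "localhost"));
    assert!(!hosts_match("example.com", "localhost"));
    println!("host normalization ok");
}
```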

@drewr
Contributor Author

drewr commented Apr 9, 2026

Follow-up on the gateway hostname question: the gateway sending CONNECT localhost:<port> is the datum-cloud/iroh-gateway service (deployed as ghcr.io/datum-cloud/iroh-gateway:main, daemonset defined in infra/apps/network-services-operator/downstream/). The shared library it uses is n0-computer/iroh-proxy-utils (DownstreamProxy).

The localhost normalization is happening in one of those two places. Worth checking iroh-gateway first since that's where the CONNECT target would be constructed from the ConnectorAdvertisement address field.

@drewr
Contributor Author

drewr commented Apr 10, 2026

Gateway iroh connection establishment delay (~10–18 min)

Confirmed a second, more impactful issue distinct from the localhost normalization fix.

Observed behavior: After tunnel listen reports "Tunnel ready", the gateway takes 10–18 minutes before it establishes an iroh QUIC path to the client endpoint. Any browser request during that window hits a gateway pod that hasn't connected yet, which closes the connection to Envoy before sending headers — producing the "upstream connect error or disconnect/reset before headers. reset reason: connection termination" error the user sees.

Evidence from logs:

  • 2026-04-10T14:50:05 — Tunnel ready (HTTPProxy accepted + programmed, ConnectorAdvertisement exists)
  • 2026-04-10T15:00:17 — First gateway iroh ping received (endpoint=9f7fe80240) — 10 min later
  • No router.accept events until the path is established; browser requests during the gap all fail

Once the gateway has the iroh path up, traffic flows correctly (confirmed working).

Root cause: The gateway appears to poll or process ConnectorAdvertisements with a long interval, delaying proactive iroh connection establishment. The client is reachable immediately via relay, but the gateway doesn't attempt to connect until it processes the advertisement.

Suggested fix (gateway-side): When a new or updated ConnectorAdvertisement is observed, proactively establish the iroh connection to the endpoint rather than waiting for the next poll cycle. This would bring the ready-to-serve time in line with the "Tunnel ready" message the user already sees.

@drewr
Contributor Author

drewr commented Apr 10, 2026

@zachsmith1 Something seems to have changed between Envoy and iroh-gateway. The original code in this PR worked fine. Agents haven't been able to find anything else in the client code to fix. Any ideas as to what it is?

@Frando @b5 Does the agent's diagnosis in the last comment track with reality, and if so, any upstream changes you feel like would cause it?

The gateway appears to poll or process ConnectorAdvertisements with a long interval, delaying proactive iroh connection establishment. The client is reachable immediately via relay, but the gateway doesn't attempt to connect until it processes the advertisement.

@zachsmith1
Contributor

@drewr I'm not sure the agent's comments are correct from an Envoy/iroh-gateway perspective. I don't think we've touched anything on that path in some time. I'm seeing a new proxy take about 90 seconds to start up, though. That's most likely the karmada bug, or maybe enabling TPP added some overhead. My tunnel works immediately once it's up, though.

Taking a step back, is the problem you're seeing that it takes 20 minutes for any tunnel traffic to work for you?

@zachsmith1
Contributor

@drewr the ConnectorAdvertisement isn't the authority for the httpproxy/tunnel. It's actually based on the endpoint configured in the HTTPProxy; that will have the host/port that Envoy CONNECTs to iroh-gateway with. The desktop app today keeps both of those in sync

@drewr
Contributor Author

drewr commented Apr 10, 2026

is the problem you're seeing that it takes 20 minutes for any tunnel traffic to work for you?

Yes, after running tunnel listen I don't get a working tunnel for 15 or 20 minutes. After that, however, it seems to be reliable.

I'm having an agent compare the client implementations. Stay tuned.

@drewr
Contributor Author

drewr commented Apr 10, 2026

Note on tunnel warm-up delay

The CLI and UI share the same repo path (~/.local/share/Datum by default) and therefore the same listen_key — the same iroh node ID. This means:

  • When the UI has been running recently, it connects to n0des (via BUILD_N0DES_API_SECRET baked in at CI build time), pre-warming the relay path for that node ID.
  • If tunnel listen is run shortly after the UI, the iroh-gateway's cached path to that node ID is still warm, and traffic flows within ~1 minute.
  • On a cold start (no recent UI session), the CLI has no n0des connection, so the gateway has to rediscover the node via pkarr DNS. This takes ~18 minutes.

The CLI has never had BUILD_N0DES_API_SECRET baked in, so it has always relied on either the gateway's warm cache (from a prior UI session) or slow cold pkarr discovery. This PR adds CLI binary builds to bundle.yml and manual-release.yml so that distributed binaries get the secret baked in the same way the UI does. For local dev, set N0DES_API_SECRET in your environment — the runtime env var path already exists in n0des_api_secret_from_env().

@zachsmith1
Contributor

@drewr we don't use the n0des API anymore; that's from the non-Datum infra world

6 participants