Skip to content

Commit 9febe89

Browse files
committed
feat(python-sdk): support OIDC Bearer auth on SandboxClient
PR #1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR #935 / PR #1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). The high-level `Sandbox` context manager surfaces the same three kwargs and forwards them through, so callers using the wrapper have parity with `SandboxClient` for OIDC-protected gateways. - `SandboxClient.close()` chains to a `_bearer_close` hook so the `_OidcRefresher`'s underlying `httpx.Client` is released deterministically instead of leaking sockets/FDs until GC runs `__del__`. Idempotent. - `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID + random suffix) instead of a fixed `.oidc_token.json.tmp` path, so two writers racing on the same gateway directory don't trample each other's tmp content. Success path atomically replaces; failure path unlinks the orphan. OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST. Tested: - 44 sandbox_test.py tests pass (5 pre-existing + 39 new across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, security-review regressions, Sandbox-wrapper kwarg forwarding, and lifecycle / concurrency probes). `mise run test:python` → 49 passed total across the python suite. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR #1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. - Lifecycle / concurrency regression tests added: `close()` invokes the `_bearer_close` hook (idempotent), the refresher's `httpx.Client` is marked closed after `SandboxClient.close()`, and 16 concurrent writers don't leave orphan tmp files behind while producing a valid final bundle. The `Sandbox` wrapper has direct forwarding tests proving `auto_refresh`, `write_back`, and `insecure` reach `from_active_cluster` (both explicit values and defaults). - End-to-end against a real OpenShift + Keycloak cluster from inside a pod: real OIDC discovery against `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant POST, atomic write-back of the rotated bundle at 0600, and a follow-up RPC reusing the freshly-rotated in-memory token — full round-trip in ~170ms. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
1 parent 6c7950d commit 9febe89

4 files changed

Lines changed: 1826 additions & 30 deletions

File tree

pyproject.toml

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,25 @@ keywords = ["agent", "sandbox", "nvidia"]
4141
source = "https://github.com/NVIDIA/OpenShell"
4242
documentation = "https://docs.nvidia.com/openshell/latest/"
4343

44+
[project.optional-dependencies]
45+
# OIDC bearer-auth refresh path. Required only when:
46+
# - using SandboxClient.from_active_cluster() against a gateway with
47+
# metadata.auth_mode == "oidc" and auto_refresh=True (the default), OR
48+
# - explicitly constructing _OidcRefresher.
49+
# Plaintext / mTLS-only deployments don't need this extra.
50+
#
51+
# Install with: pip install "openshell[oidc]"
52+
oidc = [
53+
# HTTP client used by the refresh path. Chosen over urllib for
54+
# explicit redirect control (`follow_redirects=False`) and clean
55+
# SSL-verification toggle (`verify=`), both of which are
56+
# security-critical for OIDC discovery and the refresh-token POST.
57+
# The OAuth2 refresh-token grant itself (RFC 6749 §6) is a single
58+
# form-encoded POST — small enough to spell out directly without a
59+
# dedicated OAuth library.
60+
"httpx>=0.27",
61+
]
62+
4463
[dependency-groups]
4564
dev = [
4665
"pytest>=8.0",
@@ -53,6 +72,10 @@ dev = [
5372
"setuptools-scm>=8",
5473
"grpcio-tools>=1.60",
5574
"pyelftools>=0.30",
75+
# The OIDC refresh path (an optional `oidc` extra at runtime) is
76+
# exercised by the test suite via httpx.MockTransport, so httpx must
77+
# be present for tests even though it's not a default runtime dep.
78+
"httpx>=0.27",
5679
]
5780

5881
[tool.uv]

0 commit comments

Comments
 (0)