Skip to content

Commit fb03e38

Browse files
authored
feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)
* feat(python-sdk): support OIDC Bearer auth on SandboxClient PR #1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR #935 / PR #1404) were unreachable from the SDK. Adds: - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments. - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`. - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass: * full mTLS (all three) * CA-only trust (`ca_path` only) * system roots (`TlsConfig()` — for OIDC gateways behind a public CA) - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`: * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS. * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer. - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics. Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`: * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint. * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint. * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI. Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET. - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP. - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). The high-level `Sandbox` context manager surfaces the same three kwargs and forwards them through, so callers using the wrapper have parity with `SandboxClient` for OIDC-protected gateways. - `SandboxClient.close()` chains to a `_bearer_close` hook so the `_OidcRefresher`'s underlying `httpx.Client` is released deterministically instead of leaking sockets/FDs until GC runs `__del__`. Idempotent. - `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID + random suffix) instead of a fixed `.oidc_token.json.tmp` path, so two writers racing on the same gateway directory don't trample each other's tmp content. Success path atomically replaces; failure path unlinks the orphan. OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`): | Library | Native refresh | Writes back | |-------------------------------|----------------|-------------| | google-auth Credentials | yes | no | | botocore SSOTokenProvider | yes | yes | | openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)| OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth. Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST. Tested: - 42 sandbox_test.py tests pass (5 pre-existing + 37 new across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, security-review regressions, Sandbox-wrapper kwarg forwarding, and lifecycle / concurrency probes). `mise run test:python` → 47 passed total across the python suite. - `mise run python:lint` (ruff) clean. - End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR #1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. - Lifecycle / concurrency regression tests added: `close()` invokes the `_bearer_close` hook (idempotent), the refresher's `httpx.Client` is marked closed after `SandboxClient.close()`, and 16 concurrent writers don't leave orphan tmp files behind while producing a valid final bundle. The `Sandbox` wrapper has direct forwarding tests proving `auto_refresh`, `write_back`, and `insecure` reach `from_active_cluster` (both explicit values and defaults). - End-to-end against a real OpenShift + Keycloak cluster from inside a pod: real OIDC discovery against `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant POST, atomic write-back of the rotated bundle at 0600, and a follow-up RPC reusing the freshly-rotated in-memory token — full round-trip in ~170ms. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing _OidcRefresher.current_access_token() only adopted the on-disk oidc_token.json when its access token was still fresh; otherwise it refreshed using the in-memory bundle. With refresh-token rotation enabled (Keycloak with rotation, Entra strict mode), this let a process keep using an invalidated refresh_token: 1. Process A holds a stale in-memory bundle with refresh_token=r1. 2. Process B refreshes first and writes a rotated (r2) but now near-expiry bundle to disk. 3. Process A re-reads disk, sees the access token is not fresh, ignores the disk bundle, and POSTs the stale r1 — which the IdP has already invalidated, yielding invalid_grant. Fix: when the cached bundle is stale, adopt the on-disk bundle if it was refreshed more recently than ours, even when its access token is also stale. "More recently" is decided by expires_at — a refresh mints a new access token with a forward expiry alongside the rotated refresh_token, so the later expiry carries the newest refresh_token. Comparing by expiry (rather than unconditionally preferring disk) preserves the write_back=False case, where the in-memory bundle has already rotated past the on-disk copy and must not be clobbered. When the adopted bundle's issuer differs, the cached token endpoint is reset so the refresh re-discovers against the new issuer. Adds regression tests for the cross-process rotation race and the issuer-change re-discovery path. * fix(python-sdk): recover from invalid_grant on lost rotation race The expiry-based disk re-read narrows but does not fully close the cross-process refresh-token rotation race: two processes sharing a gateway directory can both enter their refresh window, both POST their copy of the refresh_token, and with rotation enabled the IdP invalidates the loser's token (invalid_grant). Neither google-auth nor botocore close this window without an OS file lock; a Python-only flock would not coordinate with the Rust CLI/TUI that also write oidc_token.json, so locking is not worth its cost here. Recover instead of prevent: distinguish an OAuth2 invalid_grant (the refresh_token was rejected) from transport/5xx failures via a private _InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If a peer wrote a different refresh_token (it won the race), adopt and retry with it — returning early if it is already fresh — so the loser succeeds transparently instead of forcing a re-authenticate. If disk offers no new token, the rejection is genuine and surfaces the re-authenticate hint as before. The retry is single-shot; a second invalid_grant propagates. Adds tests for the peer-rotation recovery and the genuine-rejection (no-retry) paths. --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
1 parent d01d106 commit fb03e38

4 files changed

Lines changed: 2009 additions & 30 deletions

File tree

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ requires-python = ">=3.12"
1616
dependencies = [
1717
"cloudpickle>=3.0",
1818
"grpcio>=1.60",
19+
"httpx>=0.27",
1920
"protobuf>=4.25",
2021
]
2122
classifiers = [

0 commit comments

Comments
 (0)