Commit fb03e38
authored
feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)
* feat(python-sdk): support OIDC Bearer auth on SandboxClient
PR #1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR #935 / PR #1404) were unreachable from
the SDK.
Adds:
- `bearer_token: str | Callable[[], str] | None` kwarg on
`SandboxClient`. Static strings or zero-arg callables (the latter
is invoked per RPC, so callers can drop in a refresh loop or
token-file watcher without reconstructing the client). Composes
with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
`grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
Appends `authorization: Bearer <token>` to outgoing metadata.
Implemented as an interceptor (not call credentials) so it works
on both plaintext (`disableTls=true` dev) and TLS channels without
`grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
`key_path`) are now optional with `cert_path` / `key_path`
required-together-or-not-at-all (enforced in `__post_init__`). This
unlocks three transport profiles from one dataclass:
* full mTLS (all three)
* CA-only trust (`ca_path` only)
* system roots (`TlsConfig()` — for OIDC gateways behind a
public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
`build_oidc_channel`:
* For any `https://` gateway, always build a secure channel.
Pick the strongest TLS profile available in `mtls/` (full
mTLS → CA-only → system roots). No more `insecure_channel`
fallback for HTTPS.
* Gate OIDC bearer attachment on
`metadata.json["auth_mode"] == "oidc"`. Matches
`crates/openshell-cli/src/main.rs:132` and the TUI; a stale
`oidc_token.json` next to a non-OIDC gateway no longer causes
the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
modeled on `google.oauth2.credentials.Credentials` and
`botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
RPC; when stale, re-reads disk first (the CLI may have rotated
the bundle), and only then exchanges the refresh_token against
the IdP's token endpoint discovered via OIDC discovery
(`/.well-known/openid-configuration`, cached after first call).
Concurrent RPCs share a single refresh via `threading.Lock` (no
IdP stampede). Honors refresh-token rotation. Surfaces IdP
failures as `SandboxError` with the RFC 6749 error body included
for diagnostics.
Mirrors the Rust CLI's HTTP-policy posture from
`crates/openshell-cli/src/oidc_auth.rs`:
* `follow_redirects=False` so a 3xx during discovery can't
steer us to an attacker-controlled token endpoint.
* Discovery `issuer` is validated against the configured
issuer; a discovery document claiming a different issuer is
rejected, preventing the SDK from POSTing the refresh_token
to a malicious endpoint.
* `insecure: bool` flag plumbed through to httpx's
`verify=` so self-signed-cert deployments work the same way
they do in the Rust CLI.
Built on `httpx` (chosen over `urllib` specifically for
follow_redirects + verify control as kwargs). The OAuth2
refresh-token grant itself (RFC 6749 §6) is one form-encoded
POST — handled inline rather than via a dedicated OAuth library;
tried `authlib`'s `OAuth2Client` first but it auto-injects an
Authorization header on every request, which breaks the
unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
write_back=True, insecure=False)` factory. Defaults to the
refresher path with write-back enabled; `auto_refresh=False`
falls back to the read-only fail-closed behavior for callers that
don't want the SDK to make outbound HTTP calls to the IdP.
`write_back=True` is the default (changed from the first round of
review): IdPs with refresh-token rotation (Keycloak with
rotation, Entra in strict mode) invalidate the old refresh_token
on each refresh, so an in-memory-only refresh would leave the
on-disk bundle pointing at an invalidated value — any second
process starting from disk would `invalid_grant`. With write-back
enabled by default, the SDK keeps the shared cache consistent
with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
`insecure` kwargs (defaults: True / True / False). The
high-level `Sandbox` context manager surfaces the same three
kwargs and forwards them through, so callers using the wrapper
have parity with `SandboxClient` for OIDC-protected gateways.
- `SandboxClient.close()` chains to a `_bearer_close` hook so the
`_OidcRefresher`'s underlying `httpx.Client` is released
deterministically instead of leaking sockets/FDs until GC runs
`__del__`. Idempotent.
- `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID +
random suffix) instead of a fixed `.oidc_token.json.tmp` path,
so two writers racing on the same gateway directory don't
trample each other's tmp content. Success path atomically
replaces; failure path unlinks the orphan.
OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):
| Library | Native refresh | Writes back |
|-------------------------------|----------------|-------------|
| google-auth Credentials | yes | no |
| botocore SSOTokenProvider | yes | yes |
| openshell SandboxClient (here)| yes (opt-out) | yes (opt-out)|
OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.
Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.
Tested:
- 42 sandbox_test.py tests pass (5 pre-existing + 37 new across
the bearer interceptor, fail-closed provider, refresher
behavior, TlsConfig validation, from_active_cluster auth ladder,
security-review regressions, Sandbox-wrapper kwarg forwarding,
and lifecycle / concurrency probes).
`mise run test:python` → 47 passed total across the python
suite.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
* unauthenticated `Health` bypass works
* admin + `openshell:all` reaches user-callable methods
* reader (`sandbox:read`) denied on `CreateSandbox` by scope
* admin + `openshell:all` denied on PR #1596 sandbox-only
methods at the router (the new gate is honored from the SDK)
* full provider CRUD lifecycle via the SDK
* callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
* **Discovery issuer validation** — a discovery document
claiming a different `issuer` than the configured one is
rejected with a clear `SandboxError` before any refresh POST
can reach the attacker-controlled endpoint.
* **Redirect during discovery** — `follow_redirects=False` on
the underlying httpx client means a 3xx during discovery
surfaces as a SandboxError rather than silently chasing the
redirect.
* **Cross-process rotation** — a two-process simulation shows
process B starting from disk and successfully refreshing
with the rotated refresh_token, because process A's
write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
re-read before refresh, OAuth2 exchange against the discovered
token endpoint, refresh-token rotation, atomic write-back at
0600 mode (default), default-on write_back proven by test,
concurrent N-thread coordination (one refresh shared across 8
threads), IdP failure surfaced with the error body, the
client_credentials / no-refresh_token error path, issuer-
mismatch rejection, redirect-during-discovery rejection,
insecure flag plumbing.
- Lifecycle / concurrency regression tests added: `close()`
invokes the `_bearer_close` hook (idempotent), the refresher's
`httpx.Client` is marked closed after `SandboxClient.close()`,
and 16 concurrent writers don't leave orphan tmp files behind
while producing a valid final bundle. The `Sandbox` wrapper has
direct forwarding tests proving `auto_refresh`, `write_back`,
and `insecure` reach `from_active_cluster` (both explicit
values and defaults).
- End-to-end against a real OpenShift + Keycloak cluster from
inside a pod: real OIDC discovery against
`keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant
POST, atomic write-back of the rotated bundle at 0600, and a
follow-up RPC reusing the freshly-rotated in-memory token —
full round-trip in ~170ms.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing
_OidcRefresher.current_access_token() only adopted the on-disk
oidc_token.json when its access token was still fresh; otherwise it
refreshed using the in-memory bundle. With refresh-token rotation
enabled (Keycloak with rotation, Entra strict mode), this let a process
keep using an invalidated refresh_token:
1. Process A holds a stale in-memory bundle with refresh_token=r1.
2. Process B refreshes first and writes a rotated (r2) but now
near-expiry bundle to disk.
3. Process A re-reads disk, sees the access token is not fresh, ignores
the disk bundle, and POSTs the stale r1 — which the IdP has already
invalidated, yielding invalid_grant.
Fix: when the cached bundle is stale, adopt the on-disk bundle if it was
refreshed more recently than ours, even when its access token is also
stale. "More recently" is decided by expires_at — a refresh mints a new
access token with a forward expiry alongside the rotated refresh_token,
so the later expiry carries the newest refresh_token. Comparing by
expiry (rather than unconditionally preferring disk) preserves the
write_back=False case, where the in-memory bundle has already rotated
past the on-disk copy and must not be clobbered. When the adopted
bundle's issuer differs, the cached token endpoint is reset so the
refresh re-discovers against the new issuer.
Adds regression tests for the cross-process rotation race and the
issuer-change re-discovery path.
* fix(python-sdk): recover from invalid_grant on lost rotation race
The expiry-based disk re-read narrows but does not fully close the
cross-process refresh-token rotation race: two processes sharing a
gateway directory can both enter their refresh window, both POST their
copy of the refresh_token, and with rotation enabled the IdP invalidates
the loser's token (invalid_grant). Neither google-auth nor botocore
close this window without an OS file lock; a Python-only flock would not
coordinate with the Rust CLI/TUI that also write oidc_token.json, so
locking is not worth its cost here.
Recover instead of prevent: distinguish an OAuth2 invalid_grant (the
refresh_token was rejected) from transport/5xx failures via a private
_InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If
a peer wrote a different refresh_token (it won the race), adopt and retry
with it — returning early if it is already fresh — so the loser succeeds
transparently instead of forcing a re-authenticate. If disk offers no new
token, the rejection is genuine and surfaces the re-authenticate hint as
before. The retry is single-shot; a second invalid_grant propagates.
Adds tests for the peer-rotation recovery and the genuine-rejection
(no-retry) paths.
---------
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>1 parent d01d106 commit fb03e38
4 files changed
Lines changed: 2009 additions & 30 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
16 | 16 | | |
17 | 17 | | |
18 | 18 | | |
| 19 | + | |
19 | 20 | | |
20 | 21 | | |
21 | 22 | | |
| |||
0 commit comments