feat: LiteLLM API model management + buyer sidecar reload #333
Closed
Two fixes validated with real Base Sepolia x402 payments between two DGX Spark nodes running Nemotron 120B inference.

1. **CA certificate bundle**: The x402-verifier runs in a distroless container with no CA store. TLS verification of the public facilitator (facilitator.x402.rs) fails with "x509: certificate signed by unknown authority". Fix: `obol sell pricing` now reads the host CA bundle and patches it into the `ca-certificates` ConfigMap mounted by the verifier.
2. **Missing Description field**: The facilitator rejects verify requests that lack a `description` field in PaymentRequirement with "invalid_format". Fix: populate Description from the route pattern when building the payment requirement.

## Validated testnet flow

### Alice (seller)

```
obolup.sh  # bootstrap dependencies
obol stack init && obol stack up
obol model setup custom --name nemotron-120b \
  --endpoint http://host.k3d.internal:8000/v1 \
  --model "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
obol sell pricing --wallet 0xC0De...97E --chain base-sepolia
obol sell http nemotron \
  --wallet 0xC0De...97E --chain base-sepolia \
  --per-request 0.001 --namespace llm \
  --upstream litellm --port 4000 \
  --health-path /health/readiness \
  --register --register-name "Nemotron 120B on DGX Spark"
obol tunnel restart
```

### Bob (buyer)

```
# 1. Discover
curl $TUNNEL/.well-known/agent-registration.json
# → name: "Nemotron 120B on DGX Spark", x402Support: true

# 2. Probe
curl -X POST $TUNNEL/services/nemotron/v1/chat/completions
# → 402: payTo=0xC0De...97E, amount=1000, network=base-sepolia

# 3. Sign EIP-712 TransferWithAuthorization + pay
python3 bob_buy.py
# → 200: "The meaning of life is to discover and pursue purpose"
```

### On-chain receipts (Base Sepolia)

| Tx | Description |
|----|-------------|
| 0xd769953b...c231ec0 | x402 settlement: Bob→Alice 0.001 USDC via ERC-3009 |

Balance change: Alice +0.001 USDC, Bob -0.001 USDC.

Facilitator: https://facilitator.x402.rs (real public settlement).
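The probe above reports `amount=1000` for a `--per-request 0.001` price. That is consistent with USDC's 6 decimal places. A minimal sketch of the conversion follows; `to_atomic_units` is a hypothetical helper name used for illustration, not a function from this PR.

```python
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimal places


def to_atomic_units(price: str, decimals: int = USDC_DECIMALS) -> int:
    """Convert a human-readable token price to integer atomic units.

    Hypothetical helper: raises if the price cannot be represented exactly.
    """
    scaled = Decimal(price) * (Decimal(10) ** decimals)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{price} has more than {decimals} decimal places")
    return int(scaled)


print(to_atomic_units("0.001"))  # 1000
```

Using `Decimal` rather than `float` avoids rounding surprises when the price is later compared against the amount in a 402 response.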
Replace the third-party facilitator.x402.rs with the Obol-operated facilitator at x402.gcp.obol.tech. This gives us control over uptime, chain support, and monitoring (Grafana dashboards already deployed in obol-infrastructure). Introduces DefaultFacilitatorURL constant in internal/x402 and updates all references: CLI flag default, config loader, standalone inference gateway, and deployment store. Companion PR in obol-infrastructure adds Base Sepolia (84532) to the facilitator's chain config alongside Base Mainnet (8453).
Address #321: LiteLLM reliability improvements.

1. Hot-add models via /model/new API instead of restarting the deployment. ConfigMap still patched for persistence. Restart only triggered when API keys change (Secret mount requires it).
2. Scale to 2 replicas with RollingUpdate (maxUnavailable: 0, maxSurge: 1) so a new pod is ready before any old pod terminates.
3. PodDisruptionBudget (minAvailable: 1) prevents both replicas from being down simultaneously during voluntary disruptions.
4. preStop hook (sleep 10) gives EndpointSlice time to deregister the terminating pod before SIGTERM, which prevents in-flight request drops during rolling updates.
5. Reloader annotation on litellm-secrets: Stakater Reloader triggers a rolling restart on API key rotation, no manual restart.
6. terminationGracePeriodSeconds: 60, so long inference requests (e.g. Nemotron 120B at 30s+) have time to complete.
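The hot-add path in item 1 can be sketched as follows. This is a minimal sketch, assuming the fork keeps the upstream LiteLLM `/model/new` request shape (`model_name` plus `litellm_params`); `build_model_new_payload` and `needs_restart` are hypothetical helper names, not functions from this PR.

```python
import json


def build_model_new_payload(model_name: str, upstream_model: str, api_base: str) -> str:
    """Build a JSON body for POST /model/new.

    Assumption: the payload shape follows upstream LiteLLM; the fork may differ.
    """
    return json.dumps({
        "model_name": model_name,
        "litellm_params": {"model": upstream_model, "api_base": api_base},
    })


def needs_restart(old_api_keys: dict, new_api_keys: dict) -> bool:
    """Restart only when API keys change, since the Secret mount requires it."""
    return dict(old_api_keys) != dict(new_api_keys)


body = build_model_new_payload(
    "nemotron-120b",
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "http://host.k3d.internal:8000/v1",
)
```

Adding a model leaves the key set untouched, so `needs_restart` returns `False` and the deployment keeps serving while the API call takes effect.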
…issing

The prerequisite check blocked installation entirely when Node.js was not available, even though Docker could extract the openclaw binary from the published image. This prevented bootstrap on minimal servers (e.g. DGX Spark nodes with only Docker + Python).

Changes:
- Prerequisites: only fail if BOTH npm AND docker are missing
- install_openclaw(): try npm first, fall back to Docker image extraction (docker create + docker cp) when npm unavailable
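The selection rule described above (npm first, Docker as fallback, fail only when both are missing) can be sketched in a few lines. The real logic lives in the install script; `pick_install_method` is a hypothetical name, and the `which` parameter exists only to make the sketch testable.

```python
import shutil
from typing import Callable, Optional


def pick_install_method(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return "npm" or "docker", preferring npm; fail only if both are missing."""
    if which("npm"):
        return "npm"
    if which("docker"):
        return "docker"  # fall back to extracting the binary from the image
    raise RuntimeError("neither npm nor docker found; cannot install openclaw")
```

On a node with only Docker installed, `pick_install_method()` returns `"docker"` instead of aborting the bootstrap.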
Introduces PurchaseRequest CRD and extends the serviceoffer-controller to reconcile buy-side purchases. This replaces direct ConfigMap writes from buy.py with a controller-based pattern matching the sell-side.

## New resources

- **PurchaseRequest CRD** (`obol.org/v1alpha1`): declarative intent to buy inference from a remote x402-gated endpoint. Lives in the agent's namespace.

## Controller reconciliation (4 stages)

1. **Probed**: probe endpoint → 402, validate pricing matches spec
2. **AuthsSigned**: call remote-signer via cluster DNS to sign ERC-3009 TransferWithAuthorization vouchers
3. **Configured**: write buyer ConfigMaps in llm namespace with optimistic concurrency, restart LiteLLM
4. **Ready**: verify sidecar loaded auths via pod /status endpoint

## Security

- Agent only creates PurchaseRequest CRs (own namespace, no cross-NS)
- Controller has elevated RBAC for ConfigMaps in llm, pods/list
- Remote-signer accessed via cluster DNS (no port-forward)
- Finalizer handles cleanup on delete (remove upstream from config)

## RBAC

- Added PurchaseRequest read/write to serviceoffer-controller ClusterRole
- Added pods/get/list for sidecar status checks

Addresses #329. Companion to the dual-stack integration test.
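The four reconciliation stages form an ordered chain of conditions. The sketch below shows one way a reconciler could decide which stage to work on next; it assumes each stage maps to a boolean condition of the same name, and `next_stage` is a hypothetical helper rather than the controller's actual code.

```python
from typing import Dict, Optional

# Stage names taken from the commit message, in reconciliation order.
STAGES = ["Probed", "AuthsSigned", "Configured", "Ready"]


def next_stage(conditions: Dict[str, bool]) -> Optional[str]:
    """Return the first stage whose condition is not yet True, or None when done."""
    for stage in STAGES:
        if not conditions.get(stage, False):
            return stage
    return None


print(next_stage({"Probed": True, "AuthsSigned": True}))  # Configured
```

Walking the list in order means a later condition is never acted on while an earlier one is still unmet.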
…rites

Modifies buy.py cmd_buy to create a PurchaseRequest CR in the agent's own namespace instead of writing ConfigMaps cross-namespace. The serviceoffer-controller (PR #330) reconciles the CR: probes the endpoint, signs auths via remote-signer, writes buyer ConfigMaps in llm namespace, and verifies sidecar readiness.

Changes:
- buy.py: replace steps 5-6 (sign + write ConfigMaps) with _create_purchase_request() + _wait_for_purchase_ready()
- Agent RBAC: add PurchaseRequest CRUD to openclaw-monetize-write ClusterRole (agent's own namespace only, no cross-NS access)
- Keep steps 1-4 (probe, wallet, balance, count) for user feedback

The agent SA can now create PurchaseRequests but never writes to ConfigMaps in the llm namespace. All ConfigMap operations are serialized through the controller with optimistic concurrency.
Three fixes discovered during dual-stack testnet validation:

1. **eRPC URL**: `obol sell register` used `http://localhost/rpc` which gets 404 from Traefik (wrong Host header). Changed to `http://obol.stack/rpc` which matches the HTTPRoute hostname.
2. **--private-key-file ignored**: When OpenClaw agent is deployed, sell register always preferred the remote-signer path and silently ignored --private-key-file. Now honours user intent: explicit key file flag takes priority over remote-signer auto-detection.
3. **Flow script**: add --allow-writes for Base Sepolia eRPC (needed for on-chain tx submission), restart eRPC after config change.

Validated: `obol sell register --chain base-sepolia --private-key-file` mints ERC-8004 NFT (Agent ID 3826) on Base Sepolia via eRPC.
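The priority rule in fix 2 can be sketched as a small decision function. The names here (`choose_signer`, the `"key-file"` and `"remote-signer"` labels) are hypothetical and only illustrate the ordering: an explicit flag wins over auto-detection.

```python
from typing import Optional


def choose_signer(private_key_file: Optional[str], remote_signer_available: bool) -> str:
    """Pick the signing path, honouring an explicit --private-key-file first."""
    if private_key_file:
        return "key-file"  # explicit user intent takes priority
    if remote_signer_available:
        return "remote-signer"  # auto-detected fallback
    raise ValueError("no signing method available")
```

Before the fix, the equivalent of the second branch was evaluated first, so a deployed agent always masked the flag.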
Update dual-stack test to verify PurchaseRequest CR exists after the agent runs buy.py. The agent prompt stays the same — buy.py's interface is unchanged, only the backend (CR instead of ConfigMap).
- Fix getSignerAddress to handle string array format from remote-signer
- Fix flow-11: polling for pod readiness, LISTEN port check, anchored sed patterns, auto-fund remote-signer wallet
- Auto-fund Bob's remote-signer with USDC from .env key (shortcut for #331)
- resourceVersion handling for PurchaseRequest 409 Conflict

Known issue: controller's signAuths sends typed-data in a format the remote-signer doesn't accept (empty signature). Needs investigation of the remote-signer's /api/v1/sign/<addr>/typed-data API format.

Workaround: buy.py signs locally, controller only needs to copy auths to buyer ConfigMaps (architectural simplification planned).
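The resourceVersion handling mentioned above follows the usual optimistic-concurrency pattern: read, modify, write, and retry from a fresh read on a 409. The sketch below uses an in-memory stand-in for the API server; `FakeStore`, `Conflict`, and `update_with_retry` are all hypothetical names for illustration.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict from the API server."""


class FakeStore:
    """In-memory object that rejects writes carrying a stale resourceVersion."""

    def __init__(self, data):
        self.obj = {"resourceVersion": 1, "data": dict(data)}

    def get(self):
        return {"resourceVersion": self.obj["resourceVersion"],
                "data": dict(self.obj["data"])}

    def put(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("stale resourceVersion")
        self.obj = {"resourceVersion": obj["resourceVersion"] + 1,
                    "data": dict(obj["data"])}


def update_with_retry(store, mutate, attempts=5):
    """Re-read and re-apply the mutation until the write is accepted."""
    for _ in range(attempts):
        current = store.get()
        mutate(current["data"])
        try:
            store.put(current)
            return current["data"]
        except Conflict:
            continue  # someone else wrote first; start over from a fresh read
    raise Conflict("gave up after repeated conflicts")
```

Re-applying the mutation to a fresh read, rather than re-sending the stale object, is what keeps a concurrent writer's changes from being lost.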
…rets)

Architectural simplification: instead of the controller reading a Secret cross-namespace (security risk), buy.py embeds the pre-signed auths directly in the PurchaseRequest spec.preSignedAuths field.

Flow:
1. buy.py signs auths locally (remote-signer in same namespace)
2. buy.py creates PurchaseRequest CR with auths in spec
3. Controller reads auths from CR spec (same PurchaseRequest RBAC)
4. Controller writes to buyer ConfigMaps in llm namespace

No cross-namespace Secret read. No general secrets RBAC. Controller only needs PurchaseRequest read + ConfigMap write in llm.

Validated: test PurchaseRequest with embedded auth → Probed=True, AuthsSigned=True (loaded from spec), Configured=True (wrote to buyer ConfigMaps). Ready pending sidecar reload (ConfigMap propagation delay).
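Step 2 of the flow can be illustrated with a sketch of the manifest buy.py might submit. Only `apiVersion`, `kind`, and `spec.preSignedAuths` come from the text; the `endpoint` field, the function name, and the sample values are assumptions made for illustration.

```python
def build_purchase_request(name: str, namespace: str, endpoint: str, auths: list) -> dict:
    """Build a PurchaseRequest manifest with pre-signed auths embedded in the spec."""
    return {
        "apiVersion": "obol.org/v1alpha1",
        "kind": "PurchaseRequest",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "endpoint": endpoint,           # hypothetical field name
            "preSignedAuths": list(auths),  # field named in the commit message
        },
    }


manifest = build_purchase_request(
    "buy-nemotron", "agent-ns", "https://seller.example/services/nemotron",
    [{"signature": "0x..."}],
)
```

Because the auths travel inside the CR, the controller needs no Secret access beyond the PurchaseRequest RBAC it already holds.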
…implify agent response validation
The macOS CA bundle (~290KB) exceeds the 262KB annotation limit that kubectl apply requires. The previous implementation used kubectl patch --type=merge which hits the same limit. Switch to "kubectl create --dry-run=client -o yaml | kubectl replace" which bypasses the annotation entirely. Add PipeCommands helper to the kubectl package for this pattern. Tested: obol sell pricing now populates the ca-certificates ConfigMap automatically on both macOS (290KB /etc/ssl/cert.pem) and Linux (220KB /etc/ssl/certs/ca-certificates.crt).
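The size problem can be checked with simple arithmetic. Kubernetes caps the total size of an object's annotations at 262144 bytes, and client-side `kubectl apply` stores a copy of the object in the `kubectl.kubernetes.io/last-applied-configuration` annotation. The sketch below is a rough pre-check under that assumption; `fits_client_side_apply` is a hypothetical helper, and the real threshold is somewhat lower because the annotation also carries JSON overhead.

```python
ANNOTATION_LIMIT_BYTES = 262144  # total annotation size limit (256 KiB)


def fits_client_side_apply(payload: bytes) -> bool:
    """Rough check: can this payload be copied into the last-applied annotation?"""
    return len(payload) < ANNOTATION_LIMIT_BYTES


macos_bundle = bytes(290 * 1024)  # about the size reported for /etc/ssl/cert.pem
linux_bundle = bytes(220 * 1024)  # about the size reported for ca-certificates.crt
print(fits_client_side_apply(macos_bundle), fits_client_side_apply(linux_bundle))
```

When the check fails, the create-then-replace pipeline described above is the safer path because it writes no last-applied annotation.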
The CA ConfigMap is mounted as a volume. Kubernetes may take 60-120s to propagate changes to running pods. The verifier needs TLS to work immediately for the facilitator connection, so trigger a rollout restart right after populating the CA bundle. Validated: fresh stack → obol sell pricing → CA auto-populated (339KB on macOS) → verifier restarted → zero TLS errors.
Replace fragile ConfigMap YAML read-modify-write cycles with HTTP API calls to our LiteLLM fork (ObolNetwork/litellm) for model management.

Model management (internal/model/):
- Add litellmAPIViaExec(): clean kubectl-exec wrapper that fans out API calls to all running litellm pods (replicas:2 consistency)
- Add hotDeleteModel(): live model removal via /model/delete API
- Refactor hotAddModels(): use per-pod fan-out instead of single deployment exec with inline wget command construction
- Refactor RemoveModel(): hot-delete via API + ConfigMap patch for persistence. No more pod restart for model removal.
- Refactor AddCustomEndpoint(): hot-add via API, falls back to restart only on failure

Controller (internal/serviceoffercontroller/):
- Implement removeLiteLLMModelEntry(): was no-op stub, now queries /model/info to resolve model_id then calls /model/delete
- Wire into reconcileDeletingPurchase() for PurchaseRequest cleanup
- Add triggerBuyerReload(): POST /admin/reload on sidecar pods for immediate config pickup (vs 5-second ticker wait)

Buyer sidecar (internal/x402/buyer/):
- Add POST /admin/reload endpoint: triggers immediate config/auth file re-read via buffered channel signal
- Wire ReloadCh() into main ticker goroutine for dual select

Infrastructure:
- Switch LiteLLM image to Obol fork: ghcr.io/obolnetwork/litellm:sha-fe892e3 (config-only /model/new, /model/delete, /model/update without Postgres)
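The buffered-channel reload signal in the sidecar is written in Go; the idea can be sketched in Python with a queue of capacity one. This is an analogy under that assumption, not the sidecar's code, and `ReloadSignal` is a hypothetical name. A second trigger while one is pending is dropped instead of blocking.

```python
import queue


class ReloadSignal:
    """Capacity-one signal: at most one reload can be pending at a time."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def trigger(self) -> str:
        """Called by the HTTP handler for POST /admin/reload."""
        try:
            self._q.put_nowait(True)
            return "reload triggered"
        except queue.Full:
            return "already pending"

    def wait(self, timeout: float) -> bool:
        """Called by the reload loop; True means a reload was requested."""
        try:
            self._q.get(timeout=timeout)
            return True
        except queue.Empty:
            return False  # timed out, equivalent to the periodic ticker firing
```

The loop can treat both outcomes of `wait` the same way (re-read config and auths), which mirrors selecting on a ticker and a reload channel together.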
path = f"/api/v1/namespaces/{ns}/secrets"
try:
    _kube_json("POST", path, token, ssl_ctx, secret)
    print(f" Stored {len(auths)} auths in Secret {ns}/{secret_name}")

existing = _kube_json("GET", f"{path}/{secret_name}", token, ssl_ctx)
secret["metadata"]["resourceVersion"] = existing["metadata"]["resourceVersion"]
_kube_json("PUT", f"{path}/{secret_name}", token, ssl_ctx, secret)
print(f" Updated Secret {ns}/{secret_name} with {len(auths)} auths")
Includes fixes from ObolNetwork/litellm#2:
- P1: stale in-memory config after save_config (sequential write data loss)
- P2: inline ModelInfo imports moved to module-level
- P3: PROXY_ADMIN role check in config-only code paths
Replace `sh -c` + fmt.Sprintf shell command construction with direct argument passing in litellmAPIViaExec() and hotDeleteModel(). JSON body or auth tokens containing single quotes would break the shell wrapper. Now each argument goes as a separate argv element to wget via kubectl exec, bypassing shell interpretation entirely. Also document this pattern in the obol-stack-dev skill gotchas section. Addresses CodeQL finding: "Potentially unsafe quoting" on model.go:292.
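The argv-based pattern can be illustrated outside Go as well. The sketch below builds the argument list for a `kubectl exec ... -- wget ...` call without any shell wrapper, so quotes in the JSON body or token are passed through untouched. The function name is hypothetical, and the exact wget flags available in the container image are an assumption.

```python
from typing import List


def build_exec_argv(namespace: str, pod: str, url: str, body: str, token: str) -> List[str]:
    """Build argv for kubectl exec with no shell in between.

    Each value is its own argv element, so no quoting or escaping is needed.
    """
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "wget", "-q", "-O", "-",
        f"--header=Authorization: Bearer {token}",
        "--header=Content-Type: application/json",
        f"--post-data={body}",
        url,
    ]


argv = build_exec_argv(
    "llm", "litellm-0", "http://localhost:4000/model/new",
    '{"model_name": "bob\'s model"}', "sk-example",
)
```

Such a list can be handed to `subprocess.run(argv)` as-is; passing a list rather than a string is what keeps the shell out of the picture.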
First multiplatform build: linux/amd64 + linux/arm64. Includes all previous fixes (P1 stale config, P2 imports, P3 admin auth).
Superseded by the validated integration branch \ and the \ prerelease cut from it. The release-candidate branch now carries the tested sell → discover → buy → settle path, updated docs/skills, and the final x402/buy-side fixes.
Summary
Replace fragile ConfigMap YAML read-modify-write cycles with HTTP API calls to our LiteLLM fork (ObolNetwork/litellm) for model management. Eliminates pod restarts for model add/remove operations.

- `litellmAPIViaExec()` fans out API calls to all pods, `hotDeleteModel()` for live removal, `RemoveModel()` and `AddCustomEndpoint()` no longer restart
- `removeLiteLLMModelEntry()` implemented (was no-op stub), wired into PurchaseRequest deletion cleanup
- `POST /admin/reload` endpoint for immediate config pickup (vs 5s ticker wait)
- LiteLLM image switched to the Obol fork: `ghcr.io/obolnetwork/litellm:sha-fe892e3`

Architecture
Before vs After
Data Flow: Model Lifecycle
Two Persistence Layers
Buy-Side Payment Flow (PurchaseRequest)
Buyer Sidecar API Surface
CLI Operations Matrix
Code Map
Test plan
- `TestRemoveLiteLLMModelEntry`: mock `/model/info` → `/model/delete` with correct ID
- `TestRemoveLiteLLMModelEntryNoMatch`: no delete when model absent
- `TestRemoveLiteLLMModelEntryServerError`: graceful on 500
- `TestTriggerBuyerReload`: no panic with no pods
- `TestProxy_AdminReload`: 200 + channel signal
- `TestProxy_AdminReloadIdempotent`: "already pending" on double-fire
- `go test ./...` green (29 packages)