Replace monetize.py reconciliation loop with controller-runtime operator #296

@bussyjd

Summary

Replace the cron-driven monetize.py reconciliation loop with a proper Kubernetes controller using controller-runtime. Introduce PaymentRoute and RegistrationRequest child CRDs to eliminate the shared ConfigMap mutation race and isolate on-chain side effects. Keep the x402-verifier as a separate Deployment but switch its config source from file polling to a PaymentRoute informer.

Problem

The current reconciliation model has fundamental coupling and correctness issues:

1. Coupled to obol-agent runtime

The reconciler runs as a Python skill script inside the OpenClaw agent pod (monetize.py). If the agent crashes, restarts, or is undeployed, all ServiceOffer reconciliation stops. Payment routes, HTTPRoutes, and registrations stop converging. This was a deliberate design choice documented in monetisation-architecture-proposal.md#L54 — the cron-based approach was chosen over a Go operator for simplicity — but it has become the wrong tradeoff as the system grows.

2. ConfigMap mutation race

_add_pricing_route() (monetize.py#L699) reads the x402-pricing ConfigMap, appends a route entry via string manipulation, and writes the whole ConfigMap back. Two ServiceOffers reconciling simultaneously can overwrite each other's entries. _remove_pricing_route() (monetize.py#L1705) has the same problem in reverse.
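
The lost update described above can be replayed deterministically. The following is a minimal sketch, not the monetize.py code: `pricingConfig`, `lostUpdate`, and `perOfferRoutes` are hypothetical names, and it assumes the write-back is an unconditional replace of the whole ConfigMap.

```go
package main

import "fmt"

// pricingConfig stands in for the shared x402-pricing ConfigMap: one blob
// holding every route, replaced wholesale on each write (assumption).
type pricingConfig struct {
	routes []string
}

// read returns a private copy, like fetching the ConfigMap.
func (c *pricingConfig) read() []string {
	return append([]string(nil), c.routes...)
}

// write replaces the whole blob with the caller's version.
func (c *pricingConfig) write(routes []string) {
	c.routes = routes
}

// lostUpdate replays the problematic interleaving: both reconcilers read
// before either one writes, so the second write discards the first entry.
func lostUpdate() []string {
	cm := &pricingConfig{}
	snapA := cm.read() // reconciler A reads
	snapB := cm.read() // reconciler B reads the same state
	cm.write(append(snapA, "/services/a/*"))
	cm.write(append(snapB, "/services/b/*")) // overwrites A's entry
	return cm.read()
}

// perOfferRoutes models one object per ServiceOffer: each writer touches
// only its own key, so there is no shared value to overwrite.
func perOfferRoutes() map[string]string {
	routes := map[string]string{}
	routes["a-payment"] = "/services/a/*"
	routes["b-payment"] = "/services/b/*"
	return routes
}

func main() {
	fmt.Println("shared blob:", lostUpdate())
	fmt.Println("per-offer objects:", len(perOfferRoutes()), "routes")
}
```

Only B's route survives in the shared blob, while the per-offer layout keeps both. This is the property the PaymentRoute CRD proposed later in this issue relies on.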

3. Polling latency

The reconciler polls every 10-60 seconds, so a new ServiceOffer CR sits idle until the next poll cycle. The verifier's ConfigMap file watcher (watcher.go#L16) then waits on the kubelet's volume sync, which adds another 60-120s. Total worst-case: ~180 seconds (60s poll + 120s sync) from CR creation to traffic flowing.

4. Imperative stage chain

The reconcile function (monetize.py#L1504) runs 6 stages sequentially. Each stage blocks on the previous. If Stage 3 fails, Stages 4-6 never run, even if they're independent. There's no self-healing — if an HTTPRoute is deleted externally, the reconciler won't recreate it until the ServiceOffer is modified.

5. No finalizer-based cleanup

Deletion cleanup is in the CLI path (monetize.py#L1690), not in the controller. If the CR is deleted directly via kubectl delete, external side effects (pricing routes, ERC-8004 registration) are orphaned.

6. Mixed concerns in verifier

The .well-known/agent-registration.json endpoint is served by the x402-verifier (verifier.go#L192). This is discovery metadata, not payment gating — it doesn't belong in the ForwardAuth service.

Proposed Architecture

Guiding Principles

  1. Derive and observe, don't pipeline. The reconciler computes desired child resources from spec and applies them with server-side apply. No stage ordering.
  2. Consolidate code, not runtime. One repo, one internal package set, optionally one image, but separate Deployments for controller and verifier.
  3. Separate control plane from data plane. The controller writes desired state. The verifier reads PaymentRoute and serves traffic. Different scaling axes, different RBAC, different failure domains.
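
Principle 1 can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than the proposed controller: `offerSpec`, `desiredChildren`, `reconcile`, the `myapi-auth` Middleware name, and the map standing in for the API server are all hypothetical, and plain map assignment stands in for server-side apply.

```go
package main

import "fmt"

// offerSpec is a hypothetical, trimmed-down ServiceOffer spec.
type offerSpec struct {
	Name  string
	Price string
}

// desiredChildren is a pure function of the spec: same input, same output.
// Keys are "<Kind>/<name>"; values stand in for the rendered object.
func desiredChildren(s offerSpec) map[string]string {
	return map[string]string{
		"Middleware/" + s.Name + "-auth":      "forwardAuth",
		"PaymentRoute/" + s.Name + "-payment": "/services/" + s.Name + "/* @ " + s.Price,
		"HTTPRoute/" + s.Name:                 "/services/" + s.Name + "/*",
	}
}

// reconcile overwrites each child with its desired form regardless of what
// is currently stored, which mimics the idempotence of server-side apply.
// It returns the number of objects whose stored form changed.
func reconcile(cluster map[string]string, s offerSpec) int {
	changed := 0
	for key, want := range desiredChildren(s) {
		if cluster[key] != want {
			cluster[key] = want
			changed++
		}
	}
	return changed
}

func main() {
	cluster := map[string]string{}
	spec := offerSpec{Name: "myapi", Price: "10000"}
	fmt.Println("first pass changed:", reconcile(cluster, spec))
	fmt.Println("second pass changed:", reconcile(cluster, spec))
	delete(cluster, "HTTPRoute/myapi") // someone deletes the route out of band
	fmt.Println("after external delete changed:", reconcile(cluster, spec))
}
```

Because nothing depends on stage order, a repeated pass is a no-op and an externally deleted child is restored on the next pass. In a real controller that next pass would have to be triggered by a watch on the owned children; the model above simply calls `reconcile` again.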

Component Layout

┌─────────────────────────────────────────────────────────────────────────┐
│  obol-system namespace                                                  │
│                                                                         │
│  ┌──────────────────────────────┐    ┌───────────────────────────────┐  │
│  │  serviceoffer-controller     │    │  x402-verifier (unchanged ns) │  │
│  │  Deployment (1 replica)      │    │  Deployment (N replicas)      │  │
│  │                              │    │                               │  │
│  │  - Leader election           │    │  - No leader needed           │  │
│  │  - Broad RBAC (write CRDs,   │    │  - Read-only RBAC             │  │
│  │    HTTPRoutes, Middlewares)  │    │    (watch PaymentRoute)       │  │
│  │  - Reconciles ServiceOffer   │    │  - ForwardAuth on :8443       │  │
│  │  - Creates child resources:  │    │  - Builds local route table   │  │
│  │    - PaymentRoute            │    │    from PaymentRoute informer │  │
│  │    - HTTPRoute               │    │  - Calls facilitator          │  │
│  │    - Middleware              │    │  - Exposes /metrics           │  │
│  │    - RegistrationRequest     │    │                               │  │
│  │  - Manages finalizers        │    │  Scales on: request QPS       │  │
│  │                              │    │  Failure: drops ForwardAuth   │  │
│  │  Scales on: CR count         │    │           (user-visible)      │  │
│  │  Failure: stops convergence  │    │                               │  │
│  │           (not user-visible) │    └───────────────────────────────┘  │
│  └──────────────────────────────┘                                       │
└─────────────────────────────────────────────────────────────────────────┘

Why Two Deployments

| Concern | Controller | Verifier |
|---|---|---|
| Scaling axis | Reconcile work (CR count) | Request QPS |
| Replication | Single leader | All replicas active |
| RBAC | Broad write (CRDs, HTTPRoutes, Middlewares) | Read-only (PaymentRoute watch) |
| Failure impact | Stops convergence of new offers | Stops all paid requests |
| Restart cost | Re-list + reconcile (seconds) | Drops in-flight ForwardAuth (user-visible) |

New CRDs

PaymentRoute (owned by ServiceOffer)

Replaces the shared x402-pricing ConfigMap. One CR per monetized route. The verifier watches these via informer instead of polling a file.

apiVersion: obol.org/v1alpha1
kind: PaymentRoute
metadata:
  name: myapi-payment
  namespace: x402
  ownerReferences:
    - apiVersion: obol.org/v1alpha1
      kind: ServiceOffer
      name: myapi
      uid: <service-offer-uid>
spec:
  pattern: "/services/myapi/*"
  price: "10000"                    # atomic USDC units
  payTo: "0x..."                    # seller wallet
  network: "eip155:84532"           # CAIP-2
  facilitatorURL: "https://..."
  priceModel: "per-request"         # per-request | per-mtok
  perMTok: "10000000"               # original per-mtok if applicable
  approxTokensPerRequest: 1000
  description: "My API service"
status:
  admitted: false                   # set by verifier when route is loaded
  lastAdmittedGeneration: 0
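
The example values for price, perMTok, and approxTokensPerRequest happen to be consistent with a simple conversion. The sketch below works that arithmetic through; the formula is an assumption inferred from those three numbers, not something the issue states, and `perRequestPrice` and `atomicToUSDC` are hypothetical helpers. USDC uses 6 decimal places.

```go
package main

import "fmt"

const tokensPerMTok = 1_000_000

// perRequestPrice converts a per-million-token price into a flat
// per-request price, both in atomic USDC units. The formula is an
// assumption inferred from the example values in the PaymentRoute spec.
func perRequestPrice(perMTok, approxTokensPerRequest uint64) uint64 {
	return perMTok * approxTokensPerRequest / tokensPerMTok
}

// atomicToUSDC renders atomic units as a decimal string (6 decimals).
func atomicToUSDC(atomic uint64) string {
	return fmt.Sprintf("%d.%06d", atomic/1_000_000, atomic%1_000_000)
}

func main() {
	p := perRequestPrice(10_000_000, 1000)
	fmt.Println(p, "atomic units =", atomicToUSDC(p), "USDC")
}
```

With perMTok = 10000000 and approxTokensPerRequest = 1000 this yields 10000 atomic units, matching the price field in the example.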

Why a CRD instead of ConfigMap:

  • Eliminates read-modify-write race (each ServiceOffer owns its own PaymentRoute)
  • Event-driven propagation (informer watch, sub-second vs 60-120s ConfigMap sync)
  • OwnerReferences enable automatic GC on ServiceOffer deletion
  • Status field lets the controller observe whether the verifier has loaded the route
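
As a rough idea of what the Go types behind this CRD might look like, here is a sketch that mirrors the YAML fields one for one. It is not the proposed api/v1alpha1/types.go: the Kubernetes metadata types are left out to keep it self-contained, the Go field types are guesses, and `isAdmitted` assumes that lastAdmittedGeneration is meant to be compared against the object's generation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PaymentRouteSpec mirrors the spec fields of the YAML example.
type PaymentRouteSpec struct {
	Pattern                string `json:"pattern"`
	Price                  string `json:"price"` // atomic USDC units
	PayTo                  string `json:"payTo"`
	Network                string `json:"network"` // CAIP-2
	FacilitatorURL         string `json:"facilitatorURL"`
	PriceModel             string `json:"priceModel"` // per-request | per-mtok
	PerMTok                string `json:"perMTok,omitempty"`
	ApproxTokensPerRequest int64  `json:"approxTokensPerRequest,omitempty"`
	Description            string `json:"description,omitempty"`
}

// PaymentRouteStatus is written by the verifier, not the controller.
type PaymentRouteStatus struct {
	Admitted               bool  `json:"admitted"`
	LastAdmittedGeneration int64 `json:"lastAdmittedGeneration"`
}

// isAdmitted reports whether the verifier has loaded the current spec.
// Comparing against the generation is an assumption (see lead-in).
func isAdmitted(generation int64, st PaymentRouteStatus) bool {
	return st.Admitted && st.LastAdmittedGeneration == generation
}

// roundTrip encodes and decodes a spec to check the JSON tags line up.
func roundTrip(in PaymentRouteSpec) (PaymentRouteSpec, error) {
	var out PaymentRouteSpec
	raw, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	spec := PaymentRouteSpec{
		Pattern:    "/services/myapi/*",
		Price:      "10000",
		Network:    "eip155:84532",
		PriceModel: "per-request",
	}
	raw, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
	// A stale admission (generation 1) does not count for generation 2.
	fmt.Println(isAdmitted(2, PaymentRouteStatus{Admitted: true, LastAdmittedGeneration: 1}))
}
```

Checking the generation, and not only the boolean, would let the controller tell "route loaded" apart from "an older version of this route was loaded".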

RegistrationRequest (owned by ServiceOffer)

Isolates the ERC-8004 on-chain transaction from the main reconcile loop. The controller creates the request; a registrar Job or controller executes it.

apiVersion: obol.org/v1alpha1
kind: RegistrationRequest
metadata:
  name: myapi-registration
  namespace: openclaw-obol-agent
  ownerReferences:
    - apiVersion: obol.org/v1alpha1
      kind: ServiceOffer
      name: myapi
      uid: <service-offer-uid>
spec:
  name: "myapi"
  description: "My API service"
  endpoint: "https://tunnel.example.com/services/myapi"
  privateKeySecret:
    name: remote-signer-key
    key: keystore.json
  chain: "base-sepolia"
  registry: "0xEA0fE4FCF9E3017a24d9Db6e0e39B552c8648B9D"
status:
  phase: Confirmed                  # one of: Pending | Submitted | Confirmed | Failed | OffChainOnly
  agentId: "42"
  txHash: "0x..."
  errorMessage: ""

Why a separate resource:

  • On-chain transactions are slow (seconds to minutes), expensive, and can fail for external reasons (no gas, RPC down)
  • The main reconcile loop should never block on a transaction
  • Retries and gas estimation are registration-specific concerns
  • OffChainOnly is a valid terminal state (not a failure), cleanly modeled in status
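
The phase values can be grouped by how a consumer would treat them. The sketch below is one possible reading: the five phase names come from the status example, while the `classify` helper and the decision to treat Failed as its own, possibly retryable bucket are assumptions.

```go
package main

import "fmt"

// Phase is the RegistrationRequest status.phase value.
type Phase string

const (
	PhasePending      Phase = "Pending"
	PhaseSubmitted    Phase = "Submitted"
	PhaseConfirmed    Phase = "Confirmed"
	PhaseFailed       Phase = "Failed"
	PhaseOffChainOnly Phase = "OffChainOnly"
)

// classify groups phases into coarse buckets. OffChainOnly is settled,
// not failed. Whether Failed is terminal or retried is left open here.
func classify(p Phase) string {
	switch p {
	case PhaseConfirmed, PhaseOffChainOnly:
		return "settled"
	case PhasePending, PhaseSubmitted:
		return "in-progress"
	case PhaseFailed:
		return "failed"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []Phase{PhasePending, PhaseSubmitted, PhaseConfirmed, PhaseFailed, PhaseOffChainOnly} {
		fmt.Printf("%-13s %s\n", p, classify(p))
	}
}
```

Keeping this mapping in one place means the main reconcile loop only has to read a bucket from status and never has to wait on the transaction itself.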

Reconciliation Flow

sequenceDiagram
    participant Op as Operator
    participant CLI as obol sell http
    participant K8s as Kubernetes API
    participant Ctrl as ServiceOffer Controller
    participant Verifier as x402-verifier
    participant Traefik
    participant Chain as Base L2

    Op->>CLI: obol sell http myapi --wallet 0x... --price 0.001
    CLI->>CLI: Validate upstream reachable, model ready (precondition)
    CLI->>K8s: Create ServiceOffer CR

    K8s-->>Ctrl: Informer event (Added)
    Ctrl->>Ctrl: Add finalizer, set conditions to Unknown
    Ctrl->>K8s: SSA Middleware (traefik.io ForwardAuth)
    Ctrl->>K8s: SSA PaymentRoute CR
    Ctrl->>K8s: SSA HTTPRoute (/services/myapi/*)

    K8s-->>Verifier: Informer event (PaymentRoute Added)
    Verifier->>Verifier: Build route table entry
    Verifier->>K8s: Patch PaymentRoute status.admitted=true

    K8s-->>Ctrl: Informer event (PaymentRoute updated)
    Ctrl->>Ctrl: Observe: PaymentRoute admitted, HTTPRoute accepted by Gateway
    Ctrl->>K8s: Create RegistrationRequest CR

    Note over Chain: Registrar Job/controller executes

    Chain-->>K8s: RegistrationRequest status: Confirmed, agentId=42
    K8s-->>Ctrl: Informer event (RegistrationRequest updated)
    Ctrl->>K8s: Set ServiceOffer status: Ready=True, observedGeneration=N

    Note over Traefik: /services/myapi/* → ForwardAuth → upstream

    Op->>K8s: Delete ServiceOffer CR
    K8s-->>Ctrl: Informer event (deletionTimestamp set)
    Ctrl->>Chain: Deactivate/tombstone ERC-8004 registration
    Ctrl->>K8s: Remove finalizer → GC cascades to PaymentRoute, HTTPRoute, Middleware
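
The deletion leg of the diagram depends on one ordering rule: external cleanup first, finalizer removal only after it succeeds. A minimal in-memory model of that rule follows. The `object` type, the finalizer name, and `errRPCDown` are hypothetical, and a boolean stands in for a non-zero deletionTimestamp.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a hypothetical value; the issue does not name it.
const finalizerName = "obol.org/serviceoffer-cleanup"

// errRPCDown simulates an external failure during cleanup.
var errRPCDown = errors.New("rpc down")

// object models the two metadata fields that matter for deletion.
type object struct {
	Deleting   bool // stands in for a non-zero deletionTimestamp
	Finalizers []string
}

// removeFinalizer drops name from the list, keeping the other entries.
func removeFinalizer(o *object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// handleDeletion runs external cleanup and removes the finalizer only on
// success. On failure the finalizer stays, so the object is not garbage
// collected and a later attempt can retry the cleanup.
func handleDeletion(o *object, cleanup func() error) error {
	if !o.Deleting {
		return nil
	}
	if err := cleanup(); err != nil {
		return fmt.Errorf("external cleanup: %w", err)
	}
	removeFinalizer(o, finalizerName)
	return nil
}

func main() {
	o := &object{Deleting: true, Finalizers: []string{finalizerName}}
	err := handleDeletion(o, func() error { return errRPCDown })
	fmt.Println("first attempt:", err, "finalizers left:", len(o.Finalizers))
	err = handleDeletion(o, func() error { return nil })
	fmt.Println("second attempt:", err, "finalizers left:", len(o.Finalizers))
}
```

Once the finalizer list is empty, the API server can complete the delete and owner-reference garbage collection takes care of the child resources.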

Controller Design

Generation-driven, not stage-driven. The reconciler always recomputes desired child resources from spec and applies them with server-side apply. No ordered stages.

func (r *ServiceOfferReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var offer obolv1alpha1.ServiceOffer
    if err := r.Get(ctx, req.NamespacedName, &offer); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Finalizer
    if !offer.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, &offer)
    }
    if !controllerutil.ContainsFinalizer(&offer, finalizerName) {
        controllerutil.AddFinalizer(&offer, finalizerName)
        return ctrl.Result{}, r.Update(ctx, &offer)
    }

    // Derive and apply desired child resources (all idempotent via SSA)
    middleware := r.desiredMiddleware(&offer)
    paymentRoute := r.desiredPaymentRoute(&offer)
    httpRoute := r.desiredHTTPRoute(&offer)

    for _, obj := range []client.Object{middleware, paymentRoute, httpRoute} {
        if err := r.Patch(ctx, obj, client.Apply, fieldOwner); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Observe child resource status
    conditions := []metav1.Condition{
        r.computeUpstreamHealthy(ctx, &offer),
        r.computePaymentGateReady(ctx, &offer, paymentRoute),
        r.computeRoutePublished(ctx, &offer, httpRoute),
        r.computeRegistered(ctx, &offer),
    }

    // RegistrationRequest — only create when prerequisites are met
    if allTrue(conditions[:3]) && offer.Spec.Registration.Enabled {
        regReq := r.desiredRegistrationRequest(&offer)
        if err := r.Patch(ctx, regReq, client.Apply, fieldOwner); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Update status
    offer.Status.ObservedGeneration = offer.Generation
    offer.Status.Conditions = conditions
    offer.Status.Phase = computePhase(conditions)
    return ctrl.Result{}, r.Status().Update(ctx, &offer)
}

Key properties:

  • observedGeneration distinguishes "hasn't seen this spec" from "tried and failed"
  • Ready = observedGeneration == generation AND all required conditions true
  • Child resources are owned → deletion cascades automatically (plus finalizer for external side effects)
  • No blocking work in reconcile — registration is a child resource observed asynchronously
  • HTTPRoute readiness comes from Gateway status conditions, not "I created the object"
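
The first two properties translate directly into a small predicate. The following is a sketch with simplified, hypothetical types (`condition`, `offerStatus`). Which conditions count as required is passed in as a parameter, since the issue leaves open whether Registered gates Ready.

```go
package main

import "fmt"

// condition is a cut-down status condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// offerStatus holds the fields the readiness rule looks at.
type offerStatus struct {
	ObservedGeneration int64
	Conditions         []condition
}

// isReady implements: observedGeneration == generation AND every
// required condition is "True". A missing condition counts as not true.
func isReady(generation int64, st offerStatus, required []string) bool {
	if st.ObservedGeneration != generation {
		return false
	}
	status := map[string]string{}
	for _, c := range st.Conditions {
		status[c.Type] = c.Status
	}
	for _, t := range required {
		if status[t] != "True" {
			return false
		}
	}
	return true
}

// explain separates "hasn't seen this spec" from "tried and failed".
func explain(generation int64, st offerStatus, required []string) string {
	switch {
	case st.ObservedGeneration != generation:
		return "spec not yet observed"
	case isReady(generation, st, required):
		return "ready"
	default:
		return "observed, not ready"
	}
}

func main() {
	required := []string{"UpstreamHealthy", "PaymentGateReady", "RoutePublished"}
	st := offerStatus{
		ObservedGeneration: 3,
		Conditions: []condition{
			{"UpstreamHealthy", "True"},
			{"PaymentGateReady", "True"},
			{"RoutePublished", "False"},
		},
	}
	fmt.Println(explain(4, st, required)) // spec changed since last reconcile
	fmt.Println(explain(3, st, required)) // reconciled, but a gate is closed
}
```

A client such as a status command could use the same distinction to avoid reporting a stale Ready after the spec has been edited.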

Verifier Changes

The verifier (cmd/x402-verifier) stays as a separate binary and Deployment. Changes:

  1. Replace ConfigMap file watcher (watcher.go) with a PaymentRoute informer
  2. Build in-memory route table from PaymentRoute CRs instead of parsing YAML
  3. Write status.admitted on each PaymentRoute when loaded (feedback to controller)
  4. Remove .well-known handler: move to controller or dedicated httpd

// internal/x402/source/kube/informer.go
func NewPaymentRouteSource(client client.Client) *PaymentRouteSource {
    // Watches PaymentRoute CRs, builds local route table
    // sync.RWMutex protects reads (ForwardAuth) from writes (informer events)
}

The ForwardAuth handler itself is unchanged — it still matches request paths against routes and calls the facilitator. Only the config source changes.

Shared Packages

One binary family, two commands:

cmd/
  serviceoffer-controller/main.go   # controller-runtime manager
  x402-verifier/main.go             # ForwardAuth HTTP server (existing, modified)

internal/
  paymentroute/                     # PaymentRoute CRD types + deepcopy
    api/v1alpha1/types.go
    api/v1alpha1/zz_generated.deepcopy.go
  registrationrequest/              # RegistrationRequest CRD types
    api/v1alpha1/types.go
  controller/                       # Reconciler implementation
    serviceoffer_controller.go
    serviceoffer_controller_test.go
  x402/
    source/kube/                    # PaymentRoute informer (used by verifier)
      informer.go
    runtime/                        # ForwardAuth handler (existing, refactored)
      handler.go
    translate/                      # Route matching logic (existing)
      matcher.go

Migration Path

Phase 1: Controller + finalizers (no new CRDs)

  • Implement ServiceOfferController in Go with controller-runtime
  • Keep writing to x402-pricing ConfigMap (same as monetize.py does today)
  • Keep verifier unchanged (still reads ConfigMap)
  • Deploy controller as own Deployment
  • Keep monetize.py as fallback, gated behind a feature flag
  • Value: Deterministic reconciliation, independent of agent, idempotent, finalizer cleanup

Phase 2: PaymentRoute CRD

  • Define PaymentRoute CRD
  • Controller creates PaymentRoute CRs instead of mutating ConfigMap
  • Verifier switches from file watcher to PaymentRoute informer
  • Remove x402-pricing ConfigMap from the data path
  • Value: Eliminates ConfigMap race, sub-second propagation, correct deletion

Phase 3: RegistrationRequest CRD

  • Define RegistrationRequest CRD
  • Controller creates RegistrationRequest instead of calling ERC-8004 directly
  • Registrar Job/controller handles on-chain transaction
  • Value: Non-blocking registration, clean retry semantics, OffChainOnly as valid state

Phase 4: Cleanup

  • Remove monetize.py entirely
  • Remove .well-known handler from verifier
  • Remove x402-pricing ConfigMap template from infrastructure
  • Update CLAUDE.md and docs/specs/

What Gets Deleted

| File | Lines | Reason |
|---|---|---|
| internal/embed/skills/sell/scripts/monetize.py | ~1700 | Replaced by Go controller |
| internal/x402/watcher.go | 58 | Replaced by PaymentRoute informer |
| x402-pricing ConfigMap template | ~30 | Replaced by PaymentRoute CRD |
| .well-known handler in verifier.go | ~20 | Moved to controller/httpd |

ServiceOffer CRD Status Changes

Current:

status:
  phase: "Ready"  # single string

Proposed:

status:
  phase: "Ready"
  observedGeneration: 3
  conditions:
    - type: UpstreamHealthy
      status: "True"
      lastTransitionTime: "2026-03-29T10:00:00Z"
      reason: HealthCheckPassed
      message: "GET /health returned 200"
    - type: PaymentGateReady
      status: "True"
      lastTransitionTime: "2026-03-29T10:00:01Z"
      reason: PaymentRouteAdmitted
      message: "PaymentRoute myapi-payment admitted by verifier"
    - type: RoutePublished
      status: "True"
      lastTransitionTime: "2026-03-29T10:00:01Z"
      reason: HTTPRouteAccepted
      message: "HTTPRoute accepted by Gateway traefik-gateway"
    - type: Registered
      status: "True"
      lastTransitionTime: "2026-03-29T10:00:15Z"
      reason: OnChainConfirmed
      message: "ERC-8004 agentId=42, tx=0xabc..."
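
For lastTransitionTime to be meaningful, it should move only when a condition's status actually flips. The sketch below shows that rule with simplified types; it follows the same convention as `meta.SetStatusCondition` in k8s.io/apimachinery but does not use that package. The `condition` type, `setCondition`, the example timestamps, and the Pending and Submitted reason strings are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Example timestamps, taken from the status example above.
var (
	exampleT0 = time.Date(2026, 3, 29, 10, 0, 0, 0, time.UTC)
	exampleT1 = exampleT0.Add(15 * time.Second)
)

// condition is a cut-down status condition.
type condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

// setCondition inserts or updates a condition by Type. The transition
// time changes only when Status changes; Reason is always refreshed.
func setCondition(conds []condition, next condition, now time.Time) []condition {
	for i := range conds {
		if conds[i].Type != next.Type {
			continue
		}
		if conds[i].Status != next.Status {
			conds[i].Status = next.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = next.Reason
		return conds
	}
	next.LastTransitionTime = now
	return append(conds, next)
}

func main() {
	var conds []condition
	conds = setCondition(conds, condition{Type: "Registered", Status: "False", Reason: "Pending"}, exampleT0)
	// Same status, new reason: the transition time stays at exampleT0.
	conds = setCondition(conds, condition{Type: "Registered", Status: "False", Reason: "Submitted"}, exampleT1)
	fmt.Println(conds[0].Reason, conds[0].LastTransitionTime.Format(time.RFC3339))
	// Status flips to True: the transition time moves to exampleT1.
	conds = setCondition(conds, condition{Type: "Registered", Status: "True", Reason: "OnChainConfirmed"}, exampleT1)
	fmt.Println(conds[0].Reason, conds[0].LastTransitionTime.Format(time.RFC3339))
}
```

If a reconciler rebuilds the whole conditions slice on every pass, it would need to carry the previous transition times over in some equivalent way.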

Acceptance Criteria

  1. obol sell http creates a ServiceOffer CR and the controller converges it to Ready without the obol-agent pod running
  2. Deleting a ServiceOffer via kubectl delete cleans up all child resources including pricing routes (finalizer)
  3. Two concurrent ServiceOffers never corrupt each other's pricing (no shared ConfigMap mutation)
  4. Route propagation from CR creation to ForwardAuth active is under 5 seconds (not 60-180s)
  5. Controller restart does not interrupt in-flight ForwardAuth requests on the verifier
  6. obol sell status shows per-condition status (not just a phase string)
  7. ERC-8004 registration failure does not block the service from being Ready for traffic (OffChainOnly is valid)
  8. All reconcile state transitions are testable in Go without a running cluster (envtest)

Test Plan

  • Unit tests: Reconcile function with fake client — test each condition computation, finalizer logic, SSA patch generation
  • envtest integration: Real API server, no kubelet — test full reconcile cycle, deletion cascade, concurrent ServiceOffers
  • E2E: obol sell http → verify PaymentRoute admitted → verify 402 response → verify deletion cleanup
  • Chaos: Kill controller pod during reconciliation → verify convergence on restart
  • Migration: Run monetize.py and controller side-by-side, verify identical outcomes
