Disk Image CI/CD — Operator Guide

End-to-end disk image build pipeline: NodePlatform → Gitea Actions → OCI ingest → publication. Uses the platform's GitOps + CI worker infrastructure with Cosign signing for supply-chain integrity.

Architecture (one-paragraph summary)

A System::NodePlatform carries a build_script that produces a disk image (kernel + initramfs + composefs blob). The build runs on a self-hosted Gitea Actions runner (provisioned via provision_ci_worker) and is triggered by a webhook. After the build, the runner pushes the artifact as an OCI blob (Cosign-signed via the platform's keyless identity) and POSTs an HMAC-signed webhook back to the platform, which ingests the artifact via DiskImagePublicationProcessor. The resulting DiskImagePublication row links the OCI digest to the platform record and its retention policy.

End-to-End Flow

sequenceDiagram
    actor Op as Operator
    participant Runner as Gitea Runner
    participant Plat as Platform
    participant Reg as OCI registry
    participant Ret as Retention service
    participant Agent as powernode-agent

    Op->>Runner: 1. trigger build<br/>(push tag OR dispatch_gitea_workflow)
    Runner->>Runner: 2. run build_script:<br/>apt-mirror, kernel,<br/>composefs blob, initramfs
    Runner->>Reg: oras push artifact<br/>cosign sign keyless
    Runner->>Plat: 3. POST webhook<br/>OCI digest + SBOM<br/>HMAC-signed
    Plat->>Plat: 4. DiskImageWebhook<br/>validates signature
    Plat->>Reg: 5. DiskImagePublicationProcessor<br/>fetch manifest + cosign verify
    Reg-->>Plat: verified manifest
    Plat->>Plat: 6. create DiskImagePublication<br/>update NodePlatform.disk_image_oci_ref
    Ret->>Plat: 7. prune images beyond retention_count
    Op->>Plat: 8. provision instance from Template
    Plat->>Agent: deploy
    Agent->>Reg: fetch OCI artifact at boot
    Reg-->>Agent: kernel + initramfs + composefs blob
    Agent-->>Op: instance booted from custom image

Six artifact families × two architectures

The initramfs builder publishes six artifact families per architecture, each suited to a different deployment context. The Disk Image Manager agent tracks publications per (NodePlatform, artifact_family, architecture) triple.
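The tracking key described above can be enumerated with a short sketch. The family identifiers below are illustrative shorthand for the six families, not names taken from the platform, and the platform id is a placeholder.

```python
from itertools import product

# Illustrative labels for the six artifact families; the platform's own
# identifiers may differ.
ARTIFACT_FAMILIES = [
    "kernel_initramfs",  # kernel + initramfs.cpio.zst
    "raw_img",           # raw disk image .img
    "iso",               # ISO 9660 .iso
    "ipxe",              # iPXE chainload .ipxe
    "qcow2",             # qcow2 image
    "oci",               # bootc-compatible OCI image
]
ARCHITECTURES = ["amd64", "arm64"]


def publication_keys(node_platform_id: str) -> list[tuple[str, str, str]]:
    """Return every (NodePlatform, artifact_family, architecture) triple."""
    return [
        (node_platform_id, family, arch)
        for family, arch in product(ARTIFACT_FAMILIES, ARCHITECTURES)
    ]


keys = publication_keys("platform-123")  # hypothetical platform id
print(len(keys))  # 12
```

Six families across two architectures give twelve tracked publication streams per NodePlatform.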

flowchart LR
    subgraph Build["Build pipeline"]
        BS[build.sh<br/>--arch &lt;arch&gt;]
    end

    subgraph Families["6 artifact families"]
        F1[kernel + initramfs.cpio.zst<br/>iPXE / direct kernel boot]
        F2[raw disk image .img<br/>USB / SD card / dd]
        F3[ISO 9660 .iso<br/>DVD / IPMI virtual media]
        F4[iPXE chainload .ipxe<br/>network boot entry]
        F5[qcow2 image<br/>libvirt / QEMU pre-baked]
        F6[OCI image<br/>bootc-compatible]
    end

    subgraph Arches["Per-arch publication"]
        A1[amd64]
        A2[arm64]
    end

    BS --> F1
    BS --> F2
    BS --> F3
    BS --> F4
    BS --> F5
    BS --> F6
    F1 --> A1
    F1 --> A2
    F2 --> A1
    F2 --> A2
    F3 --> A1
    F3 --> A2
    F4 --> A1
    F4 --> A2
    F5 --> A1
    F5 --> A2
    F6 --> A1
    F6 --> A2

Setup: Initial CI Worker + Webhook

Step 1: Bootstrap the CI worker for an account

platform.bootstrap_disk_image_ci({
  account_id: "<account>",
  // Provisions a Gitea Actions runner registered to the account's
  // disk-image build repo, with appropriate secrets (Cosign keys,
  // OCI registry credentials)
})

This creates:

  • A System::Task of type ci_worker_provision
  • A self-hosted Gitea Actions runner labeled disk-image-builder
  • Repository secrets for Cosign + OCI registry auth (rotated independently)

Step 2: Provision the build webhook

platform.provision_disk_image_webhook({
  node_platform_id: "<platform-id>"
})

Returns:

{
  "webhook_url": "https://platform.powernode.org/api/v1/system/webhooks/disk_image_built",
  "webhook_secret": "shared-secret-for-HMAC-signing"
}

The operator configures this webhook URL and secret in the build repo's CI workflow YAML so the runner can call back after a successful build.
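As a rough illustration of the runner side of that callback, the sketch below signs a payload with the shared secret. It assumes HMAC-SHA256 over the raw request body, a hex-encoded signature, a hypothetical header name `X-Signature-SHA256`, and hypothetical payload fields `oci_digest` and `sbom_url`. None of these are confirmed by this guide, so check the build repo's workflow for the real values.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def build_callback(secret: str, oci_digest: str, sbom_url: str) -> tuple[bytes, dict[str, str]]:
    """Serialize the payload once and sign exactly those bytes."""
    body = json.dumps(
        {"oci_digest": oci_digest, "sbom_url": sbom_url},  # hypothetical field names
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": sign_payload(secret, body),  # hypothetical header name
    }
    return body, headers


body, headers = build_callback(
    "shared-secret-for-HMAC-signing",
    "sha256:" + "0" * 64,
    "https://example.com/sbom.json",
)
```

Signing the serialized bytes, rather than re-serializing on send, keeps the signature stable: any change in whitespace or key order would produce a different HMAC.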

Operator Workflow

Triggering a build

// Direct dispatch
platform.dispatch_gitea_workflow({
  account_id: "<account>",
  repo: "<account>/disk-images",
  workflow: "build-disk-image.yml",
  inputs: { platform_slug: "ubuntu-2404-base" }
})

// Or via git push to the configured branch

Monitoring a build

// List recent runs
platform.list_gitea_workflow_runs({
  account_id: "<account>",
  repo: "<account>/disk-images"
})

// Tail a specific job's logs
platform.get_gitea_job_logs({ run_id: "<run-id>", job_id: "<job-id>" })

Inspecting publications

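# Via REST (host taken from the webhook_url above; substitute your deployment's base URL)
curl "https://platform.powernode.org/api/v1/system/disk_image_publications" \
  -H "Authorization: Bearer $JWT"

# Per-platform recent publications
curl "https://platform.powernode.org/api/v1/system/node_platforms/<id>/disk_image_publications" \
  -H "Authorization: Bearer $JWT"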

Each publication carries:

  • oci_ref — fully-qualified registry path (e.g. registry.example.com/account/disk-images@sha256:...)
  • git_sha — source commit
  • built_at — timestamp
  • cosign_identity — who signed (Gitea Actions OIDC identity)
  • sbom_url — SBOM artifact URL
  • size_bytes, sha256 — artifact integrity
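The integrity fields can be checked against a downloaded artifact with a sketch like the following. It assumes sha256 is the hex digest of the whole artifact file and size_bytes is its length on disk. The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_artifact(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Check a local file against a publication's size_bytes and sha256."""
    # Cheap check first: a size mismatch avoids hashing a large image.
    if path.stat().st_size != expected_size:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Stream in 1 MiB chunks so multi-gigabyte images fit in memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    # Accept either a bare hex digest or an OCI-style "sha256:<hex>" value.
    wanted = expected_sha256.lower().removeprefix("sha256:")
    return digest.hexdigest() == wanted
```

This only confirms that the bytes match the recorded digest. It does not replace Cosign verification, which establishes who produced them.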

Promoting a publication

The latest publication is auto-promoted to current for its NodePlatform when ingest succeeds. To roll back:

// ⚠️ system_revert_disk_image is an aspirational rollback shorthand.
// To revert, use system_set_default_disk_image_publication with the previous publication id.
platform.system_revert_disk_image({
  node_platform_id: "<id>",
  to_publication_id: "<earlier-publication-id>"
})

The next NodeInstance provisioned from a Template using this Platform will fetch the rolled-back image.
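Picking the earlier publication id for a rollback can be sketched as below. This assumes each publication record exposes an `id` and an ISO 8601 `built_at` string; the exact shape of the API response is not documented here, so treat the field access as hypothetical.

```python
from datetime import datetime
from typing import Optional


def previous_publication_id(publications: list[dict], current_id: str) -> Optional[str]:
    """Return the id of the publication built immediately before the current one."""
    # Use explicit offsets such as "+00:00" in built_at; older Python versions
    # do not parse a trailing "Z" in datetime.fromisoformat.
    ordered = sorted(publications, key=lambda p: datetime.fromisoformat(p["built_at"]))
    ids = [p["id"] for p in ordered]
    if current_id not in ids:
        return None
    index = ids.index(current_id)
    return ids[index - 1] if index > 0 else None


pubs = [
    {"id": "pub-2", "built_at": "2024-05-02T10:00:00+00:00"},
    {"id": "pub-1", "built_at": "2024-05-01T10:00:00+00:00"},
    {"id": "pub-3", "built_at": "2024-05-03T10:00:00+00:00"},
]
print(previous_publication_id(pubs, "pub-3"))  # pub-2
```

A rollback target must still exist, so a publication already pruned by the retention policy cannot be selected.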

Retention Policy

NodePlatform.disk_image_retention_count (default: 3) controls how many publications are kept per platform. The DiskImageRetentionService, which runs as a Sidekiq cron job, prunes publications beyond that count, removing both the DB row and the OCI blob from the registry.
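The pruning rule can be pictured with a minimal sketch, assuming publications are ordered by `built_at` and the newest ones are kept. This is an illustration, not the DiskImageRetentionService implementation. In particular, it does not model whether a rolled-back current publication is protected from pruning.

```python
from datetime import datetime


def split_for_retention(
    publications: list[dict], retention_count: int = 3
) -> tuple[list[dict], list[dict]]:
    """Return (kept, pruned); the newest retention_count publications are kept."""
    if retention_count < 0:
        raise ValueError("retention_count must be non-negative")
    newest_first = sorted(
        publications,
        key=lambda p: datetime.fromisoformat(p["built_at"]),
        reverse=True,
    )
    return newest_first[:retention_count], newest_first[retention_count:]


history = [
    {"id": f"pub-{day}", "built_at": f"2024-05-0{day}T00:00:00+00:00"}
    for day in range(1, 6)
]
kept, pruned = split_for_retention(history)
print([p["id"] for p in kept])    # ['pub-5', 'pub-4', 'pub-3']
print([p["id"] for p in pruned])  # ['pub-2', 'pub-1']
```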

To change retention:

# Via API
curl -X PATCH "https://platform.powernode.org/api/v1/system/node_platforms/<id>" \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{"disk_image_retention_count": 5}'

Secret Rotation

Three secret types in this pipeline:

  1. Cosign keyless identity — Gitea Actions OIDC; rotates per-run automatically. No operator action.
  2. OCI registry credentials — used by the Gitea runner to push artifacts. Stored as a Gitea Actions secret. Rotate via:
    platform.set_gitea_action_secret({
      account_id: "<account>",
      repo: "<repo>",
      name: "OCI_REGISTRY_TOKEN",
      value: "<new-token>"
    })
  3. Webhook signing secret — HMAC secret shared between the platform and the build script. Rotate via provision_disk_image_webhook, which issues a new pair; the operator then updates the runner's env.

Troubleshooting

Build succeeds but publication doesn't appear

Either the webhook did not reach the platform (firewall or wrong URL) or the HMAC signature did not match. Check:

# Last webhook attempts
curl "https://platform.powernode.org/api/v1/system/disk_image_webhooks/recent" -H "Authorization: Bearer $JWT"

If the signature did not match, rotate the webhook secret.
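Before rotating, it can help to recompute the signature locally and rule out a body-encoding problem. The sketch below assumes HMAC-SHA256 over the raw body with a hex-encoded signature, which this guide does not confirm. The function name is hypothetical.

```python
import hashlib
import hmac


def signature_matches(secret: str, raw_body: bytes, received_hex: str) -> bool:
    """Recompute the HMAC over the exact bytes sent and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = received_hex.strip().lower()
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))


secret = "shared-secret-for-HMAC-signing"
sent = b'{"oci_digest":"sha256:abc"}'
signature = hmac.new(secret.encode("utf-8"), sent, hashlib.sha256).hexdigest()

print(signature_matches(secret, sent, signature))                            # True
print(signature_matches(secret, b'{"oci_digest": "sha256:abc"}', signature))  # False
print(signature_matches("stale-secret", sent, signature))                    # False
```

The second call fails only because of one added space: a runner that signs one serialization and sends another will mismatch even with the correct secret, and rotating the secret will not fix that.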

Cosign verification fails

DiskImagePublicationProcessor rejects ingests where Cosign verify fails. Likely causes:

  • Build runner used a different Cosign identity than the platform's cosign_identity_regexp config on NodePlatform
  • OCI artifact was tampered post-signing

Inspect:

curl "https://platform.powernode.org/api/v1/system/disk_image_publications/<id>" -H "Authorization: Bearer $JWT"
# Look for publication_status="cosign_verify_failed" + publication_error
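To test whether a signer identity would satisfy a given cosign_identity_regexp, a local check along these lines may help. Whether the platform anchors the pattern (full match) or searches within the identity is an assumption here. The identity strings and pattern are made-up examples.

```python
import re


def identity_allowed(cosign_identity: str, identity_regexp: str) -> bool:
    """True when the whole signer identity matches the configured pattern."""
    # Full-match semantics are an assumption; a search-style match would be looser.
    return re.fullmatch(identity_regexp, cosign_identity) is not None


# Hypothetical pattern and identities, for illustration only.
pattern = r"https://gitea\.example\.com/acme/disk-images/.+"
good = "https://gitea.example.com/acme/disk-images/.gitea/workflows/build-disk-image.yml@refs/heads/main"
other_repo = "https://gitea.example.com/other/disk-images/.gitea/workflows/build-disk-image.yml@refs/heads/main"

print(identity_allowed(good, pattern))        # True
print(identity_allowed(other_repo, pattern))  # False
```

Escaping the dots in the host name matters: an unescaped `.` matches any character, so a loosely written pattern could accept an identity from a lookalike host.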

Runner stuck in pending

The Gitea Actions runner was provisioned but is not online. Check:

platform.system_list_tasks({ task_type: "ci_worker_provision", status: "pending" })

Reprovision if necessary:

platform.bootstrap_disk_image_ci({ account_id: "<account>", force: true })

Source Files

Models:

  • extensions/system/server/app/models/system/disk_image_webhook.rb
  • extensions/system/server/app/models/system/disk_image_publication.rb

Services:

  • extensions/system/server/app/services/system/disk_image_publication_processor.rb — webhook → ingest
  • extensions/system/server/app/services/system/disk_image_oci_ingest_service.rb — OCI manifest fetch + Cosign verify
  • extensions/system/server/app/services/system/disk_image_direct_upload_ingest_service.rb — fallback for non-CI uploads
  • extensions/system/server/app/services/system/disk_image_retention_service.rb — prune past retention count

Controllers:

  • extensions/system/server/app/controllers/api/v1/system/disk_image_publications_controller.rb
  • extensions/system/server/app/controllers/api/v1/system/disk_image_webhooks_controller.rb
  • extensions/system/server/app/controllers/api/v1/system/webhooks/disk_image_built_controller.rb — receives Gitea webhook
  • extensions/system/server/app/controllers/api/v1/system/worker_api/disk_image_publications_controller.rb — runner-facing

MCP tools:

  • server/app/services/ai/tools/disk_image_operator_tool.rb — bootstrap_disk_image_ci, provision_disk_image_webhook, provision_ci_worker
  • server/app/services/ai/tools/gitea_actions_tool.rb — secrets, workflow dispatch, run monitoring

Related Docs

  • extensions/system/initramfs/README.md — multi-arch boot artifact build details
  • docs/system/threat-model.md — supply-chain integrity rationale
  • extensions/system/docs/ARCHITECTURE.md — disk image pipeline subsystem