
feat!(cozystack): wizard chain + 2-plugin consolidation (breaking)#11

Draft
lexfrei wants to merge 38 commits into main from feat/cozystack-install-skill

Conversation

lexfrei (Contributor) commented May 17, 2026

TL;DR

Major refactor: 5 separate plugins consolidated into 2 (cozystack with 10 skills + linstor with 1). 5 existing skills renamed; 6 new skills added; CI validator added; CLAUDE.md added; README rewritten. Breaking change for operators — old /plugin install <name>@cozystack-claude-plugins paths and skill IDs no longer work.

Repository refactor (BREAKING)

Before (main today):

skills/
  cozy-bump/skills/cozy-bump/SKILL.md
  cozy-deploy/skills/cozy-deploy/SKILL.md
  cozy-external-app/skills/cozy-external-app/SKILL.md
  cozystack-upgrade/skills/cozystack-upgrade/SKILL.md
  drbd-recovery/skills/drbd-recovery/SKILL.md

marketplace.json listed five plugins, each with one skill, each at its own install path.

After:

plugins/
  cozystack/                                  # one plugin, 10 skills
    .claude-plugin/plugin.json
    skills/
      wizard/                                 # NEW
      talos-bootstrap/                        # NEW
      talos-reset/                            # NEW
      ubuntu-bootstrap/                       # NEW
      cluster-install/                        # NEW
      debug/                                  # NEW
      cluster-upgrade/                        # RENAMED from cozystack-upgrade + restructured
      package-deploy/                         # RENAMED from cozy-deploy + restructured
      package-bump/                           # RENAMED from cozy-bump + restructured
      external-app-create/                    # RENAMED from cozy-external-app + restructured
  linstor/                                    # one plugin, 1 skill
    .claude-plugin/plugin.json
    skills/
      recover/                                # RENAMED from drbd-recovery + restructured

Skill rename + install-path migration

Old install + skill ID → new install + skill ID:

  • /plugin install cozy-deploy@cozystack-claude-plugins + /cozy-deploy:cozy-deploy
    → /plugin install cozystack@cozystack-claude-plugins + /cozystack:package-deploy
  • /plugin install cozy-bump@cozystack-claude-plugins + /cozy-bump:cozy-bump
    → /cozystack:package-bump (same plugin)
  • /plugin install cozy-external-app@cozystack-claude-plugins + /cozy-external-app:cozy-external-app
    → /cozystack:external-app-create (same plugin)
  • /plugin install cozystack-upgrade@cozystack-claude-plugins + /cozystack-upgrade:cozystack-upgrade
    → /cozystack:cluster-upgrade (same plugin)
  • /plugin install drbd-recovery@cozystack-claude-plugins + /drbd-recovery:drbd-recovery
    → /plugin install linstor@cozystack-claude-plugins + /linstor:recover

Existing operators should uninstall old plugins and install the new bundles once this lands.

Why consolidate

  • Single /cozystack:* namespace is the natural shape: every Cozystack-related skill ships from one plugin install, no per-skill /plugin install dance.
  • The new wizard chain (below) needs sibling skills to cross-reference each other via cozystack:<name>; cross-plugin references add fragility.
  • Renames carry intent: cozy-bump → package-bump says what the skill does; cozystack-upgrade → cluster-upgrade is consistent with cluster-install; drbd-recovery → linstor:recover reflects what the tool actually operates on (LINSTOR-side, with DRBD as one of several layers).

New skills (6)

  • wizard — entry-point orchestrator. Free-form intent intake → parses hints → asks Talos / Ubuntu / Existing → builds a route → dispatches downstream skills via a cluster config directory the operator picks. Phase 4.5 runtime research surfaces known landmines for the specific combination from upstream docs + issue trackers.
  • talos-bootstrap — Talos node prep via talm. Maintenance-mode probe (Talos-1.12-aware), NAT-provider cert-SAN guardrail before first talm apply, multidoc machine-config with per-node VIP-link IPv4 stubs, talm apply + talosctl bootstrap + kubeconfig fetch. Phase 11.5 auto-upgrade to cozystack-tuned image when nodes booted from base Talos.
  • talos-reset — cloud-provider terminate+relaunch helper for OCI / AWS / GCP / Hetzner when nodes are unrecoverable from inside (cert-SAN trap before guardrail, broken machine-config, lost talosconfig). Preserves block volumes, secondary VNICs, NSG memberships. Sequential per-node to maintain etcd quorum.
  • ubuntu-bootstrap — wraps cozystack/ansible-cozystack for Ubuntu / Debian k3s bootstrap (OS prep, kernel modules, k3s install with cozystack-compatible flags).
  • cluster-install — Cozystack on a ready cluster. Node-readiness validation, ZFS pool provisioning via privileged DaemonSet on Talos (hostNetwork: true because CNI is not up yet), extractedprism for kube-apiserver HA, OCI-tag-normalized cozy-installer chart, Platform Package apply, inline tenants/root ingress patch and LINSTOR pool registration during the watch loop, Phase 8.6 default StorageClasses for v1.3.x, Phase 9.1 end-to-end reachability probe.
  • debug — investigate a stuck or broken install. Classifies operator error / config drift / upstream bug / not-yet-supported, applies fixes or workarounds, drafts upstream issues on approval (never opens issues silently).

Renamed-skill changes (non-trivial)

Beyond the directory renames, the 5 carried-over skills received substantive edits:

  • cluster-upgrade (was cozystack-upgrade): 5 references files restructured with --context / --kube-context discipline; multiple known-failure entries added.
  • package-deploy (was cozy-deploy): rewritten to surface ghcr.io/<your-username> placeholder instead of personal registry; --kube-context discipline.
  • package-bump (was cozy-bump): path references updated; --context discipline; consistent printf / tr shell idioms.
  • external-app-create (was cozy-external-app): --context discipline.
  • linstor:recover (was drbd-recovery): every cluster-mutating kubectl call carries --context $CTX.
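
For illustration, the --context / --kube-context discipline the carried-over skills now follow is to pin the context once and thread it through every cluster-mutating call; the context name below is a placeholder:

  CTX=<your-kube-context>   # pinned once at the top of the skill run
  kubectl --context "$CTX" apply --filename cozystack-platform-package.yaml
  helm --kube-context "$CTX" uninstall <release> --namespace <namespace>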

CI + contributor infrastructure

  • tools/check-refs.sh — 5-check validator: references-file existence, <plugin>:<skill> mention resolution, descriptions list every shipped skill, kubectl / helm cluster-mutating invocations carry --context / --kube-context (allow-list for read-only / local operations), no private cluster identifiers in plugin / public content.
  • .github/workflows/validate.yml — jq + check-refs.sh on push / PR.
  • CLAUDE.md — contributor guidance: repository layout, adding a new skill, adding a new plugin, cross-reference discipline, semver bump policy.
  • README.md — rewritten with the new skills table + chain diagram + third-party-dependency section.
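
As a rough sketch (not the shipped script), the cross-reference part of the validator boils down to resolving every /<plugin>:<skill> mention against the on-disk tree:

  # hedged sketch of the mention-resolution check; the real tools/check-refs.sh also covers
  # references-file existence, description completeness, --context discipline, and private identifiers
  grep -rhoE '/[a-z-]+:[a-z-]+' plugins/ README.md | sort -u | while read -r ref; do
    plugin="${ref#/}"; plugin="${plugin%%:*}"; skill="${ref##*:}"
    [ -d "plugins/$plugin/skills/$skill" ] || echo "unresolved skill reference: $ref"
  done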

Real-install hardening

The wizard chain went through multiple end-to-end install runs on a 3-node OCI Talos cluster. Each session surfaced bugs that no code review could catch; each finding became a guardrail in the appropriate skill:

  • cert-SAN trap on NAT'd providers (it caught the same operator twice across consecutive sessions; detection logic sketched after this list). Workstation reaches nodes via public IPs that the cloud fabric rewrites to internal IPs before they hit the interface. Talos issues machine-cert SANs from observed addresses (internal only); the workstation TLS handshake fails with no escape. talos-bootstrap Phase 6.3 detects the NAT signature (reach_mode=public + external_ips_strategy=internal) and auto-adds public IPs to values.yaml.certSANs before the first apply.
  • OCI 1:1 NAT externalIPs mismatch. Cilium's externalIPs BPF matches on packet destination as seen by the host kernel; on NAT'd providers public IPs are never present on any interface. cluster-install Phase 4 refuses external strategy on intent_hints.platform ∈ {oci, gcp-with-nat, aws-with-eip} unless the operator opts in with --allow-external-on-nat-provider.
  • kubectl debug node --image=alpine:3 -- chroot /host zpool create does not work on Talos. Three independent reasons (musl vs glibc loader, no /bin/sh on host rootfs, PSA baseline blocks the debug pod). cluster-install Phase 5.5 uses a privileged Pod from ubuntu:24.04 in a cozy-storage-bootstrap namespace with pod-security.kubernetes.io/enforce=privileged.
  • LINSTOR registration race. HelmRelease Ready does not equal Deployment readyReplicas. STOP GATE 3 requires both that no HelmRelease is non-Ready AND that the storage-pool count matches the storage-node count.
  • Tenants CRD landing after most HRs Ready. Tenant ingress patch is inline in the Phase 8 watch loop, event-driven on CR existence.
  • cozystack v1.3.x ships no StorageClasses. Phase 8.6 applies local (1-replica) + replicated (3-replica, default) for v1.3.x; skipped on v1.4+.
  • OCI registry tag has no v prefix. Normalize with ${INSTALLER_VERSION#v} before passing.
  • End-to-end reachability gate. Phase 9.1 curls https://dashboard.<host>/ from the workstation; on failure writes failed_at: external-reachability with curl + DNS + node-addresses detail.
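
A minimal sketch of the cert-SAN guardrail's detection step, assuming the reach_mode / external_ips_strategy values come from the wizard's state file, yq is available, and the key the public IPs land in follows the skill's talm project layout:

  if [ "$reach_mode" = "public" ] && [ "$external_ips_strategy" = "internal" ]; then
    # NAT signature: the workstation dials public IPs, but machine certs would carry internal SANs only
    for ip in "${PUBLIC_IPS[@]}"; do
      yq -i ".certSANs += [\"$ip\"]" "$CONFIG_DIR/values.yaml"   # before the first talm apply
    done
  fi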

Third-party dependency

cozystack:cluster-install default-installs extractedprism on generic variant clusters (k3s / kubeadm / RKE2) — per-node TCP load balancer that gives generic Linux Kubernetes the same localhost:7445 shape Talos has built-in via KubePrism. BSD-3-Clause; chart at oci://ghcr.io/lexfrei/charts/extractedprism. Maintained independently; reviewed and approved by the Cozystack platform team. Opt-out via --no-extractedprism --api-host=<ip>. Talos and hosted variants do not need extractedprism.
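
For orientation, an install along the lines cluster-install performs on generic variants might look like this (endpoint IPs are placeholders; endpoints is a comma-separated string scalar per the chart's values schema):

  cat > extractedprism-values.yaml <<'EOF'
  endpoints: "10.0.0.11:6443,10.0.0.12:6443,10.0.0.13:6443"
  EOF
  helm --kube-context <your-kube-context> upgrade --install extractedprism \
    oci://ghcr.io/lexfrei/charts/extractedprism \
    --namespace kube-system --values extractedprism-values.yaml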

Validation

  • bash tools/check-refs.sh — all 5 checks green.
  • Multiple branch-review rounds with both human review and second-opinion model passes; final verdict LGTM.
  • Real-install validation on a 3-node OCI Talos cluster from scratch, end-to-end through HTTP 200 on the dashboard ingress.

Reviewer notes

  • This is a breaking refactor of the marketplace structure. Old plugin install paths and skill IDs no longer work; operators must uninstall + reinstall.
  • The cert-SAN guardrail in talos-bootstrap/SKILL.md Phase 6.3 is load-bearing for OCI / GCP-NAT / AWS-EIP operators and has caught the same trap twice in test runs without it.
  • tools/check-refs.sh runs in CI; adding new skills requires updating the descriptions in plugin.json + marketplace.json so the validator stays green.

lexfrei added 30 commits May 15, 2026 17:15
Guided install of Cozystack on an existing Kubernetes cluster
(kubeadm / k3s / RKE2 / managed). Discovers cluster facts,
validates node readiness via kubectl debug, recommends an
installer + platform variant, gathers values interactively,
installs the cozy-installer chart, applies the Platform Package,
waits until every HelmRelease is Ready, and prints a NOTES-style
access summary. On fatal failure, assembles a diagnostic bundle
and drafts an issue body for the appropriate cozystack/* repo
without auto-posting it.

The skill enforces three non-negotiable gates: cluster readiness
(version, CNI, cluster domain, conflicting workloads, CP-label
value vs KubeOVN expectation), values collected, and all expected
HelmReleases Ready. The KubeOVN gate handles the kubeadm vs
k3s/RKE2 difference in the value of
node-role.kubernetes.io/control-plane (empty vs literal 'true'),
which the platform variant pins as a strict match.

References cover the requirements matrix, exact node checks via
kubectl debug, variant picker, canonical Helm and Package values,
HelmRelease polling, known failures with recovery commands, and
upstream issue handoff templates.
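
As a minimal sketch, the "all HelmReleases Ready" gate can be expressed against the Flux HelmRelease Ready condition (the skill's actual loop polls and classifies stuck releases rather than blocking blindly; context and timeout are illustrative):

  kubectl --context <your-kube-context> wait helmrelease --all --all-namespaces \
    --for=condition=Ready --timeout=30m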

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Breaking change for marketplace consumers.

Before: five separate plugins (cozy-bump, cozy-deploy, cozy-external-app,
cozystack-install, cozystack-upgrade), each living under skills/<name>/.

After: a single bundle plugin 'cozystack' under plugins/cozystack/ that
ships five skills addressable as cozystack:bump, cozystack:deploy,
cozystack:external-app, cozystack:install, cozystack:upgrade. The
drbd-recovery skill stays a standalone plugin and moves from
skills/drbd-recovery/ to plugins/drbd-recovery/ for layout consistency.

Inside each SKILL.md:

- frontmatter name: now the short form (install, upgrade, deploy, bump,
  external-app, drbd-recovery) — the plugin name supplies the namespace.
- H1 and every user-facing self-reference and sibling-reference updated
  to the cozystack:<short> form.
- on-disk paths and mktemp suffixes keep the cozystack-<short> form (no
  colon) because some tools choke on ':' in paths.

marketplace.json carries two entries (cozystack, drbd-recovery) instead
of the previous six. README.md is rewritten to document the new
addressing scheme and repo layout.

No backward-compatibility shim — the marketplace is recent and the
breakage cost is low. Existing installations of the old plugin names
need to be reinstalled as 'cozystack' after pulling.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The short forms (install / upgrade / deploy / bump / external-app) read
as synonyms once they live under a single namespace — bump and upgrade
in particular sound like the same operation. Add an object axis to each
name:

  install         → cluster-install         (operates on a k8s cluster)
  upgrade         → cluster-upgrade         (operates on a k8s cluster)
  deploy          → package-deploy          (operates on one monorepo package)
  bump            → package-bump            (operates on one monorepo package)
  external-app    → external-app-create     (creates a new external-apps package)

Invocation form:

  /cozystack:cluster-install
  /cozystack:cluster-upgrade
  /cozystack:package-deploy
  /cozystack:package-bump
  /cozystack:external-app-create

Updates frontmatter name, H1, every user-facing self/sibling reference,
and the aggregate plugin.json + marketplace.json + README.md to match.
On-disk paths and mktemp suffixes keep the cozystack-<short> form (no
colon) — colons in paths choke some tools.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Align with the cozystack:* naming axis introduced in the previous
commits. Old form (drbd-recovery as a one-skill plugin) read as an
inconsistent neighbour next to cozystack:cluster-* / package-*.

Mapping:

  plugin drbd-recovery / skill drbd-recovery → plugin linstor / skill recover
  invocation /drbd-recovery → /linstor:recover

The skill still covers both LINSTOR (orchestrator) and DRBD (kernel
module) layers — picking 'linstor' as the plugin namespace mirrors
the user-facing CLI most operators reach for first ('linstor r l ...').
Leaves room for sibling skills later (linstor:setup, linstor:bench)
without a second rename.

frontmatter: name: recover (short form, plugin name supplies the namespace).
H1 and marketplace / README copy updated to /linstor:recover.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ioning + Talos handoff

Two gaps surfaced while reviewing the skill against k3s and Talos
scenarios:

1. LINSTOR storage backend was asked about as Phase 4 question #10,
   after networking and publishing. Without a pool ready before the
   Platform Package goes in, the cozy-linstor HelmRelease never reaches
   Ready and Phase 8 stalls on a stuck state with no clear cause.
   Nothing in cozy-installer / piraeus-operator / ansible-cozystack
   creates the physical pool — that step has been entirely on the
   operator and undocumented in the install flow.

2. Talos node prep was implicitly conflated with cluster install. A
   vanilla Talos node has no drbd / zfs / openvswitch modules and no
   cozystack LVM filter — the install is guaranteed to fail at storage
   provisioning. Trying to bootstrap Talos from inside cluster-install
   bloats scope and tangles two very different workflows.

Changes:

- Phase 2 lookup adds storage discovery (per-node lsblk, pvs, vgs,
  lvs, zpool list, LVM-tools availability, Talos lsmod and lvm.conf
  filter check). Existing pools are surfaced as Phase 4 defaults.

- Phase 3 adds an early-exit refusal for Talos clusters missing
  cozystack-tuned extensions. The skill points at the future
  cozystack:talos-bootstrap skill and refuses without partial mutation.

- Phase 4 promotes storage to the second question (right after
  bundles), with full collection: backend (LVM thin / LVM thick / ZFS
  / skip-external), per-node disk picking, VG / thin-pool / LINSTOR
  pool names with sane defaults (vg-data / thinpool0 / data).

- New Phase 5.5 — per-node storage provisioning gate. For every node,
  the skill shows the exact pvcreate / vgcreate / lvcreate (or
  zpool create) commands, asks 'Provision / Skip / Cancel', and on
  approve runs them inside kubectl debug --profile=sysadmin with
  chroot /host, then verifies with pvs / vgs / lvs (or zpool status).
  Failure handling per node (Retry / Skip / Cancel), no auto-rollback.

- Phase 5 plan gate, Phase 10 NOTES, and the guardrails section reflect
  the new storage scope and the Talos handoff.

- New reference references/storage-backends.md covers backend
  trade-offs, exact create / verify / backout commands, default names,
  and the LinstorSatelliteConfiguration shapes the skill applies.
  node-checks.md adds storage discovery and Talos module / lvm.conf
  checks. known-failures.md adds the 'LINSTOR HR with no storage pool'
  failure mode. values-template.md adds the
  LinstorSatelliteConfiguration block with lvmThinPool / lvmPool /
  filePool / ZFS-via-CLI examples.

Follow-up out of scope for this commit: a new skill
cozystack:talos-bootstrap that does node-prep (talm / boot-to-talos /
image factory schematic). The current SKILL.md and node-checks.md
reference it by name; that handoff stays a TODO until the skill exists.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…alos-bootstrap

Split the install path into distinct skills and add a meta-skill
that picks and runs the right chain. Until now cozystack:cluster-install
carried the whole burden — it validated the cluster, refused on Talos
without tuned extensions, and pointed at ansible-cozystack for node OS
prep, but never closed the loop. This commit fills in the missing
steps and replaces the implicit handoff with explicit skills.

New skills:

- cozystack:wizard — entry point. Interviews the operator about the
  target environment (bare-metal Talos / bare-metal generic Linux /
  existing Kubernetes / managed Kubernetes), picks a chain, persists
  collected state to /tmp/cozystack-install-<ts>/state.yaml, then
  instructs Claude to invoke each downstream skill in order. After
  each downstream skill it re-reads state.yaml to verify
  status.<skill>.completed_at before continuing. Refuses for existing
  Cozystack clusters with a pointer to cozystack:cluster-upgrade.

- cozystack:k3s-bootstrap — full HA k3s install via SSH. Inventory
  interview (ssh user / key / nodes / roles / optional VIP), SSH
  preflight, per-node OS prep through ssh+sudo, first-CP install with
  --cluster-init for embedded etcd, additional CPs joining via
  --server for HA, agent workers, kubeconfig fetch with rewritten
  server URL (loopback -> CP1 IP or VIP) and optional kubecm merge.
  k3s flags match cozystack/docs/v1.3/install/kubernetes/generic.md
  (flannel off, kube-proxy off, traefik/servicelb/local-storage off,
  cluster-domain cozy.local).

- cozystack:linux-prep — OS prep on an already-running Kubernetes
  cluster via kubectl debug + chroot /host. Same set of fixes
  k3s-bootstrap applies pre-Kubernetes, but post-Kubernetes via
  ephemeral debug containers — no SSH credentials required.
  Detection matrix, four ordered batches (packages / modules /
  sysctl / services+blacklist+swap), per-batch approval, verification
  after each fix. Refuses on Talos nodes (those go through
  talos-bootstrap).

- cozystack:talos-bootstrap — v1 is a guided checklist for booting
  the cozystack-tuned Talos image (ghcr.io/cozystack/cozystack/talos:
  vX.Y.Z) on every node via boot-to-talos and applying machine-config
  with talm, then verifies kernel modules (drbd / zfs / openvswitch)
  and the LVM filter through kubectl debug. Full automation lands in
  v2 when SSH + talm orchestration is in scope.

Routes the wizard builds:

  bare-metal Talos              -> talos-bootstrap -> cluster-install
  bare-metal generic Linux      -> k3s-bootstrap  -> cluster-install
  existing k8s, prep needed     -> linux-prep     -> cluster-install
  existing k8s, ready / managed -> cluster-install

State schema documented in
plugins/cozystack/skills/wizard/references/state-schema.md.
Each skill reads state on start, writes status.<skill>.completed_at or
failed_at + error on exit. Wizard owns route construction and
inter-skill verification; skills do not invoke each other directly.
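
A rough sketch of what that contract reduces to, assuming the field names above (status.<skill>.completed_at / failed_at + error) and yq on the workstation; the shape and values are illustrative, state-schema.md is authoritative:

  # state.yaml shape after a partial run (illustrative values):
  #   status:
  #     talos-bootstrap:
  #       completed_at: "2026-05-15T18:02:11Z"
  #     cluster-install:
  #       failed_at: "2026-05-15T19:40:03Z"
  #       error: "HelmRelease wait loop timed out"
  # the wizard's verification step is then a read like:
  yq '.status["talos-bootstrap"].completed_at' /tmp/cozystack-install-<ts>/state.yaml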

cluster-install SKILL.md gets two cross-references: the entry-point
note at the top mentions the wizard, and Phase 3 refusal points at
/cozystack:linux-prep for the kubectl-debug-based fix path.

plugin.json bumps to 1.1.0 and lists all nine skills.
README.md table reflects the new skill set plus a chain summary.

Verified locally: claude plugin validate passes, reinstall shows
Skills (9); token cost with all skills on is ~2.3k. No real-cluster prove-out yet
(deferred to operator testing); fixes by feedback go in follow-up
commits.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Surfaced by a real install run on a generic Linux cluster.

Two related fixes.

1. Move the tenants/root spec.ingress=true patch from Phase 9 (post-install verification) to a new Phase 7.5 between Platform Package apply and the HR wait loop. The cozy-dashboard chart ships gatekeeper (oauth2-proxy), which on startup does OIDC discovery against the public keycloak FQDN — not an in-cluster service. Without root ingress running, nothing listens on 443, gatekeeper CrashLoopBackOffs, the dashboard HR sits in Unknown until 10m timeout then InstallFailed and retries forever. cozy-fluxcd/flux-plunger has a hard dependency on cozy-dashboard/dashboard and stays False. The whole Phase 8 wait loop becomes unreachable. The Platform Package itself does not patch tenants/root.spec.ingress — that step is documented as manual in cozystack/docs/cozystack-installation.md:160. Phase 7.5 runs it pre-emptively so Phase 8 actually reaches Ready.

2. Add an explicit domain-ownership gate in Phase 4 question 7 (publishing.host). Cozystack uses cert-manager with the HTTP-01 solver by default, which requires the operator to own the domain, have wildcard DNS pointing at the chosen external IPs, and have port 80 reachable from the public internet. nip.io patterns are exempt because nip.io is publicly hosted DNS. Without the gate, an operator picks a domain they don't control, every cert request fails silently, dashboard is unreachable, and the failure mode looks identical to the chicken-and-egg above. The gate also runs a soft DNS pre-flight (dig +short) and surfaces the result.
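
A minimal sketch of that soft DNS pre-flight, assuming the host value is what the operator supplied for publishing.host (the hostname below is a placeholder):

  host="cozy.example.com"
  case "$host" in
    *.nip.io) echo "nip.io host: publicly hosted DNS, ownership gate auto-passes" ;;
    *) echo "dig +short $host -> $(dig +short "$host" | paste -sd, -)"
       echo "HTTP-01 needs wildcard DNS at the chosen external IPs and port 80 reachable from the internet" ;;
  esac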

Plan-gate output (Phase 5) and NOTES (Phase 10) now show domain ownership status, DNS pre-flight outcome, cert solver, and certificate-issuance progress.

Guardrails get two new entries: never accept a custom publishing.host without explicit ownership confirmation; never start Phase 8 before Phase 7.5 on system-bundle installs.

References:
- known-failures.md gets a 'Dashboard / Keycloak / flux-plunger stuck in OIDC chicken-and-egg' section with symptoms, cause, and the recovery procedure for installs that already stalled.

Phase 4 questions renumbered: external IPs moved from #8 to #6 so the publishing.host gate (now #7) can reference them; cert solver added as #8.

Companion docs PR in cozystack/website adds the LVM Thin section that the storage flow assumes (PR #540).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…kube-apiserver HA

Generic Kubernetes (k3s / kubeadm / RKE2) has no localhost:7445 KubePrism the way Talos does, so until this commit the skill defaulted cozystack.apiServerHost to CP1's internal IP — a single point of failure: when CP1 reboots or crashes, Cilium on every other node loses kube-apiserver and the cluster degrades. The cozystack-tuned Talos variant avoided this by pointing Cilium at the built-in localhost:7445; non-Talos had no symmetric story.

This commit makes extractedprism (a per-node TCP load balancer DaemonSet that listens on 127.0.0.1:7445 and proxies to healthy CP endpoints with TCP health checks; oci://ghcr.io/lexfrei/charts/extractedprism, BSD-3-Clause) the default for the generic variant. Talos is unchanged (KubePrism is already there); hosted variant is unchanged (managed provider handles HA).

Changes:

- New Phase 5.6 between storage provisioning and cozy-installer install. Generates the endpoint list from Phase 2's node lookup, helm-installs extractedprism in kube-system with hostNetwork + system-node-critical priority + catch-all toleration, verifies DaemonSet rollout and per-node pod presence.

- Phase 4 question 5 (apiServerHost) restructured. The skill picks per variant — talos: localhost:7445 (hardcoded by cozystack platform); generic + default: 127.0.0.1:7445 via extractedprism; generic + --no-extractedprism: operator supplies --api-host=<ip>; hosted: not set. The plan presentation in Phase 5 surfaces the chosen path and how to flip it.

- New flags: --no-extractedprism (opt-out) and --api-host=<ip> (required when opting out on multi-CP). Single-CP sandboxes still work without extractedprism — the proxy just has one upstream — but the default behaviour stays HA-friendly so multi-CP installs do not silently keep CP1 as a SPOF.

- Phase 5 plan-gate adds an apiServerHost source line and a new step in the action list ('install extractedprism DaemonSet' between storage and cozy-installer).

- Phase 10 NOTES adds an 'api server HA' block with the chosen host, source, and a verify command pointing at the DaemonSet.

- known-failures.md adds a 'Cilium loses API when CP1 dies' section with the temporary fix (re-point operator at a live CP) and the permanent fix (install extractedprism after the fact, re-render cozy-installer with apiServerHost=127.0.0.1:7445).

- values-template.md adds an extractedprism reference section and a per-variant apiServerHost resolution table.

Trade-offs documented inline so the operator can pick external-LB / VIP / kube-vip explicitly with --no-extractedprism instead.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…r + ZFS-only storage + cluster config dir

Three coupled changes driven by user feedback after the first real install attempt.

1. ZFS-only storage. The LVM and LVM-Thin paths were never part of cozystack's documented or validated install — the upstream docs at cozystack.io/docs/next/storage/disk-preparation/ describe ZFS only. Adding LVM in the skill (and in a pending cozystack/website PR, since closed) gave operators an unsupported route that cozystack itself does not stand behind. Dropping LVM removes a footgun and aligns the skill with the upstream contract. cluster-install Phase 4 storage question is reduced from backend-picker to per-node device + layout (single / mirror / raidz) on ZFS. references/storage-backends.md is now a ZFS-only document; values-template.md drops the LinstorSatelliteConfiguration lvmThinPool / lvmPool examples since the CRD has no zfsPool slot anyway — ZFS pools are registered at runtime via linstor storage-pool create zfs in a Phase 8 hook. known-failures.md replaces the LVM-thin failure section with a clarification that RHEL 10 family is unsupported (no OpenZFS RPMs yet) and that operators wanting LVM are on their own with the piraeus-operator CRD.
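
A minimal sketch of the ZFS-only path described above: create the pool on the node (via the privileged bootstrap pod on Talos, or directly on a generic node), then register it with LINSTOR once the satellites are up in the Phase 8 hook; node, device, layout, and pool names are placeholders:

  zpool create -f data mirror /dev/sdb /dev/sdc            # layout: single | mirror | raidz
  linstor storage-pool create zfs <node-name> data data    # LINSTOR pool name, then zpool name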

2. Three-route wizard. Earlier draft had five routes (Talos / bare-metal-generic / existing-k8s-dirty / existing-k8s-ready / managed) which added complexity without informing decisions the operator could not just make later in the chain. The new shape is three routes plus one refusal — Bare-metal Talos / Bare-metal Ubuntu / Existing Kubernetes (self-managed or managed are folded together; cluster-install picks the variant from cluster lookup). Existing Cozystack is a refusal that points at cluster-upgrade. linux-prep is deleted entirely — its work belongs inside ubuntu-bootstrap (the ansible playbook covers the same OS prep concerns more thoroughly).

3. ubuntu-bootstrap wraps cozystack/ansible-cozystack. The previous k3s-bootstrap was a hand-rolled SSH shell flow that re-implemented half of examples/ubuntu/prepare-ubuntu.yml and would drift the moment ansible-cozystack added new prep tasks. The new ubuntu-bootstrap renders inventory.yml into the cluster config directory and invokes prepare-sudo.yml (Ubuntu 26.04+ sudo-rs workaround), prepare-ubuntu.yml (packages including thin-provisioning-tools and lvm2, kernel modules including geneve / vhost_net / kvm_*, sysctl 11 keys including net.ipv6.conf.all.forwarding, services, multipath blacklist, LINBIT PPA + drbd-dkms for Secure Boot, ZFS + KubeVirt modules), and the k3s.orchestration collection — three discrete ansible-playbook invocations gated per step. cozystack_create_platform_package: false is wired in so the ansible role does not install Cozystack itself; that step stays in cluster-install. Renames k3s-bootstrap → ubuntu-bootstrap; renames k3s-install.md → ansible-playbook.md; drops node-prep.md (ansible covers it).

Cluster config directory. All skills in the chain now read and write artifacts under a single operator-picked directory (default $PWD or $PWD/<cluster-name>) instead of /tmp/cozystack-install-<ts>/. The wizard sets it up in Phase 1, writes the .gitignore with cozystack markers (secrets + state excluded), and points downstream skills at it via state.config_dir. Artifacts safe to git commit: inventory.yml (ubuntu), nodes/*.yaml (talos), cozystack-platform-package.yaml, extractedprism-values.yaml, .gitignore. Artifacts gitignored: .state.yaml, kubeconfig.yaml, talosconfig, secrets.yaml. The skill never runs git init / add / commit — git operations are operator-side. The point is the operator can rsync or git push the directory and reproduce the cluster shape.
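
A rough sketch of the cozystack .gitignore block the wizard writes in Phase 1 (marker comment is illustrative; the file split follows the lists above):

  cat > "$CONFIG_DIR/.gitignore" <<'EOF'
  # cozystack: secrets + state stay out of git
  .state.yaml
  kubeconfig.yaml
  talosconfig
  secrets.yaml
  EOF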

plugin.json bumps to 1.2.0 and lists eight skills (linux-prep removed, k3s-bootstrap renamed to ubuntu-bootstrap). README.md skills table reflects the new layout. cluster-install argument-hint adds --config-dir; references update /tmp paths to <config-dir>/ uniformly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
GitOps-friendly add-on to the cluster config directory pattern. With sops opt-in, every secret file the chain writes is encrypted in place with the operator's age public key, so the encrypted forms can be committed to the same repository the operator uses to track inventory.yml, cozystack-platform-package.yaml, and the talos nodes/. Without sops, the previous behaviour stays — secret files land in plain text and .gitignore keeps them out of commits.

wizard adds Phase 1.5 between config-dir selection and target picking. It checks for sops + age binaries, resolves the age public key by walking <config-dir>/.sops.yaml, $SOPS_AGE_KEY_FILE, and ~/.config/sops/age/keys.txt in order — asking the operator to confirm which key to use — and offers age-keygen with an explicit backup warning when no key exists. On opt-in the wizard writes <config-dir>/.sops.yaml with creation_rules covering kubeconfig.yaml, .state.yaml, inventory.yml, cozystack-platform-package.yaml, extractedprism-values.yaml, plus the talos nodes/*.yaml / talosconfig / secrets.yaml so talm's own .sops.yaml lookup finds matching rules. The cozystack .gitignore section is rewritten so secret-file lines drop out (encrypted-in-tree now); only *.tar.gz diagnostic bundles stay ignored.

ubuntu-bootstrap and cluster-install add a sops sanity check phase (refuse if state.sops.enabled is true but sops is missing — no silent plain-write fallback) and a maybe_encrypt / decrypt-edit-encrypt pattern wired into every secret-file write. inventory.yml is decrypted into a temp file for each ansible-playbook --inventory call; cozystack-platform-package.yaml is decrypted into a temp file for kubectl apply --filename. The encrypted on-disk forms stay canonical; temp files are removed immediately after the consuming tool exits.

Talos artefacts (talosconfig, secrets.yaml, nodes/*.yaml) are NOT double-encrypted by the skills. talm respects the shared .sops.yaml and emits encrypted output natively; the wizard's .sops.yaml block includes the right path_regex patterns so talm's lookup matches.

The skill never copies the age private key into the config directory and never invokes git. Operators back up their private key themselves (password manager, hardware token, offline copy); they commit and push the config directory on their own schedule.

wizard frontmatter accepts new flags --sops / --no-sops. state.sops {enabled, recipients, config_path} written by Phase 1.5. references/sops.md documents file-by-file decision matrix, age key resolution order, decrypt/edit workflow for operators, off↔on toggling, and what sops does NOT cover (private key safety, multi-recipient rotation, hardware-token integration — opt-out points for a future cozystack:sops helper skill).
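
As a sketch of the .sops.yaml the wizard writes on opt-in (the recipient variable and the exact path_regex are illustrative; the real rule list covers the files enumerated above, including the talos nodes/ artefacts so talm's lookup matches):

  cat > "$CONFIG_DIR/.sops.yaml" <<EOF
  creation_rules:
    - path_regex: (^|/)(kubeconfig\.yaml|\.state\.yaml|inventory\.yml|cozystack-platform-package\.yaml|extractedprism-values\.yaml|secrets\.yaml|talosconfig|nodes/.*\.yaml)$
      age: "$AGE_RECIPIENT"
  EOF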

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Missed in ebd33b3. Bump to reflect the new sops feature so 'claude plugin details' shows operators which skill bundle they're on.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ream issue

Closes the loop on the chain: when something breaks, /cozystack:debug investigates symptoms via kubectl, checks the operator did the documented step (operator-not-dumb gate), classifies the failure into one of four buckets (operator error / config drift / upstream bug / not-yet-supported), applies a local fix or workaround when one exists, and on explicit approval drafts an upstream issue with the diagnostic bundle. Wizard auto-dispatches the skill on any failed_at in .state.yaml so chain failures get triaged before pausing — symmetric to linstor:recover but cluster-wide.

Hard constraints from the design discussion:
- No PRs. v1 only drafts issue bodies. Patches that the operator wants to send upstream are a manual follow-up.
- No silent filings. Every issue draft is shown for approval; if the operator has no GitHub account, the diagnostic bundle is still useful to hand off elsewhere.
- Doc-check is mandatory before classifying as upstream-bug. Saves duplicate / wrong-repo filings.
- Source location required. 'Something is wrong somewhere in cozystack' isn't enough — Phase 3 must name a file:line in a cozystack/* or upstream-upstream repo.

Wizard integration:
- Phase 5 dispatch loop auto-dispatches /cozystack:debug whenever a downstream skill writes status.<skill>.failed_at.
- After debug, the loop reads status.debug.action and decides — resolved/workaround → retry the failing skill; issue-drafted without workaround → pause and offer Skip/Cancel; no-action → offer Retry/Skip/Cancel.

References ship the working knowledge:
- classification.md — decision tree per bucket with signals.
- diagnostic-bundle.md — symptom-record schema + bundle script shared with cluster-install/issue-templates.md (single source so the script doesn't drift).
- source-search.md — grep recipes by symptom shape (chart fail, operator runtime, package CR rejection, piraeus/Kube-OVN forward routes).
- upstream-routing.md — per-repo routing table (cozystack/cozystack, cozystack/website, cozystack/ansible-cozystack, cozystack/talm, cozystack/boot-to-talos, plus upstream-upstream for piraeus, LINSTOR, Kube-OVN, Cilium, KubeVirt, cert-manager).
- issue-templates.md — per-repo issue body templates with the common Cozystack-version preamble + redaction rules.

plugin.json bumps to 1.4.0 with debug listed; README.md grows to 9 skills.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…oss all skills

Two coupled tweaks shaped by the way operators actually use the wizard in conversation, not in CI.

1. Wizard Phase 0 — free-form context

Before any structured question (config dir, sops, target), the wizard asks the operator to describe what they're installing and what's already in place in their own words. The skill parses the answer for hints (target environment, distribution, node count, GPU vs general workload, public-domain availability, prior failures) and pre-fills the structured questions in Phases 2 and 4 so the operator doesn't repeat themselves. Vague free-form is acknowledged and the structured questions fill the rest — no over-extraction. The recap is echoed back for confirmation before moving on.

state.intent_summary holds the free-form recap, state.intent_hints holds the parsed key/values. state-schema.md documents both, plus the new state.operator_language field.

Phase 2 (target picker) skips the four-option AskUserQuestion entirely when intent_hints.target is already set — operator only confirms.

2. Match the operator's natural language across all 10 skills

Every skill (wizard, talos-bootstrap, ubuntu-bootstrap, cluster-install, debug, cluster-upgrade, package-deploy, package-bump, external-app-create, linstor:recover) now declares 'match the operator's natural language' as a Core Principle. Detection is from conversation context (the language the operator wrote in for any prior turn) or from state.operator_language when a wizard chain is in progress — never via a separate 'which language?' question. The wizard's free-form Phase 0 answer is also the language anchor for the rest of the chain.

Canonical-form exceptions are spelled out per skill so things that have to stay English do:

- Code identifiers, commands, file paths in all skills.
- Ansible task names from upstream playbooks (ubuntu-bootstrap).
- Commit messages, PR body drafts, GitHub-public text (package-bump, debug issue templates, external-app-create generated YAML).
- linstor / drbdadm command outputs (linstor:recover).

plugin.json bumps to 1.5.0 and mentions both the free-form opener and the language-matching principle.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Real-run feedback surfaced two coupled defects.

1. talos-bootstrap had a stale v1-stub guardrail ('NEVER run boot-to-talos / talm apply / any node mutation in v1; v1 is a checklist + verification only') that survived the maintenance-mode-first rewrite. The skill knew how to drive talm but the guardrail told it not to — so it kept asking the operator to run the commands manually and ping back 'done'. Removing the guardrail and reorganising the workflow around 'default = nodes already in maintenance, skill runs talm' lets the skill execute its own commands. The boot-method picker stays for the explicit opt-in path when nodes aren't imaged yet — see Phase 3 needs-help question and Phase 4 probe.

2. Across the bundle, several phases gated on questions like 'I'll wait for you to say done, ok?' or 'Approve OS prep on all nodes / Approve per-node'. Those aren't approval gates; they're friction. When the operator already approved the consolidated plan and there's exactly one valid path forward, the skill should execute it.

The new Core Principle in every operator-facing skill: one valid path → just do it. Approval gates stay only for (a) multi-option choices the operator actually makes (storage layout, network values, publishing host, target picker), (b) destructive operations that have data implications regardless of plan approval (zpool create over data, talosctl reset on configured node, kubectl apply on prod-context, helm upgrade on a live cluster, dangerous DRBD operations), (c) the named STOP GATEs that present the plan before a long batch.

What changed concretely:

- talos-bootstrap Guardrails: 'NEVER run boot-to-talos / talm apply in v1' removed. Replaced with 'ALWAYS run talm init / talm apply / talosctl bootstrap / talosctl kubeconfig / verification automatically after Phase 5 plan approval'.
- talos-bootstrap Phase 7 (per-node review): 'Apply all / Apply per node' collapsed to 'Apply'. Per-node-approval path is gone; once the operator approved the config in Phase 7, every node gets talm apply back-to-back in Phase 8 without intermissions.
- cluster-install Core principles: explicit 'no are-you-ready-to-continue gates between phases with one valid outcome'. Phase 5 STOP GATE 2 plan approval is THE gate; Phase 6 onwards executes.
- ubuntu-bootstrap Core principles: same pattern. Once inventory is approved in Phase 4, ansible-playbook prepare-sudo / prepare-ubuntu / k3s.orchestration.site run back-to-back.
- debug Core principles: when classification + proposed action are unambiguous (operator error with documented fix, config drift with clear restore), apply without an extra ok-to-apply question. Issue / PR drafts still require explicit yes per filing.
- cluster-upgrade Core principles: named STOP GATEs are the only gates; in-flight phases execute.
- linstor:recover Core principle 0a: safe single-path recovery operations run without friction. The dangerous-operations list at principle 8 still asks (linstor node lost, deleting the last replica, drbdadm down on InUse/Primary, --discard-my-data with one diskful copy, drbdadm create-md --force).
- wizard Core principles: same shape for the orchestrator itself.
- plugin.json description for talos-bootstrap rewritten: no longer claims 'v1 guided checklist', states what the skill actually does (maintenance-mode probe → talm init + apply → etcd bootstrap → kubeconfig fetch → verification; opt-in boot-method picker).
- README.md skill row for talos-bootstrap updated to match.

plugin.json bumps to 1.6.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Operators reported feeling dragged through 40 questions across multiple phases. The pattern was unnecessary — most of the answers can be inferred from cluster lookups, wizard Phase 0 intent_hints, or sensible defaults. The questions the operator does need to answer should land on one screen, not ten.

New Core principle across every operator-facing skill: front-load the interview. Every question the skill might ask in any later phase is asked once up front in a single intake phase. Read-only lookups run first; intent_hints from wizard Phase 0 pre-fills wherever possible; the consolidated summary lands as ONE screen with every slot filled (defaults marked) and quick-edit affordances. Operator either approves the lot or names the slot they want to edit. Later phases consume the collected answers without re-prompting, except for destructive STOP GATEs that have to ask by their nature (zpool destroy of an existing pool the operator chose to wipe, talosctl reset on a configured node, etc.).

Concretely:

- cluster-install Phase 4 reframed from 10 sequential AskUserQuestion calls to a single consolidated summary that covers bundles, per-node storage devices and layout, networking (CIDRs, gateway, apiServerHost / port, KubeOVN MASTER_NODES), publishing (external IPs, exposure mode, host with ownership gate auto-passing for nip.io, apiServerEndpoint, exposedServices, cert solver), and operations toggles (storage provisioning, extractedprism, tenant ingress patch). What was Phase 5.5 per-node approve and Phase 5.6 extractedprism opt-in / Phase 7.5 tenant patch confirmation now live in this single screen.

- talos-bootstrap Phase 5 picks up boot method only for nodes the probe found unready, and collects everything else needed for the rest of the flow in the same summary — install disk per node, optional VIP, cluster name, kubeconfig merge target, custom installer override. talm init / talm apply / bootstrap / kubeconfig / verify run end-to-end after Approve.

- ubuntu-bootstrap Phase 4 inventory render now includes the operational knobs (k3s_version, cozystack_flush_iptables, cozystack_enable_zfs, cozystack_enable_drbd_dkms, cluster name, kubeconfig merge target). After Approve, the three ansible-playbook invocations run back-to-back without per-step gates.

- debug Phase 4 collapses to a single screen with symptom, classification, root cause, proposed fix / workaround, and the issue-filing decision (yes / no / show-draft-first). Phases 1-3 (symptom gathering, doc check, classification) are read-only and run before the screen.

- cluster-upgrade keeps its risk-summary approval as the single gate.

- linstor:recover reads the full diagnostic graph first, then presents a recovery plan with ordered operations classified safe / dangerous; dangerous-operation approvals are batched into one screen.

plugin.json bumps to 1.7.0. Each skill's Core principles section documents the new pattern explicitly so the skill can't drift back into question-dribbling.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ill summaries

Real-run feedback: talos-bootstrap final NOTES ended with (in the operator's language): 'Returning control to the wizard. State recorded: status.talos-bootstrap.completed_at = ..., cluster.* filled in. The skill has finished its work; the wizard can dispatch the next one in the chain.' That phrasing was Claude's improvisation based on the chain context, not text from SKILL.md, but the SKILL.md didn't explicitly forbid it either — and so it leaked.

New Core principle on every downstream skill: layer-pure operator output. The skill never says 'returning control to wizard', 'the wizard will dispatch next', or any other orchestration commentary in the operator-facing summary. Whoever invoked the skill (a human running /cozystack:<skill> directly, or the wizard's dispatch loop on auto-dispatch) figures out what's next on their own. The wizard reads .state.yaml and decides; a human reads the printed 'next:' hint at the bottom of the NOTES.

Internal SKILL.md references to cozystack:wizard stay — they're documentation for Claude / future maintainers, not for the operator. The principle restricts only operator-facing text.

Applied to: talos-bootstrap, ubuntu-bootstrap, cluster-install, debug, cluster-upgrade, linstor:recover. plugin.json bumps to 1.7.1 (text-only fix).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
After a successful test run on 3 OCI Talos nodes the operator filed a
retrospective with eight concrete improvements. This commit lands all of them.

Wizard changes:

- /tmp guardrail moved from NEVER to AVOID. Phase 1 now offers a 4th option
  'Scratch dir under $TMPDIR (will be lost on reboot — test runs only)' for
  legitimate scratch / sandbox runs. New --allow-ephemeral flag for CI. The
  Layer-pure principle is unchanged — git-able dirs are still the default for
  real installs.

- Phase 4 intake for bare-metal-talos route expanded from 'node IPs +
  talosctl present' to the full set of things talos-bootstrap actually needs
  to start: node list, reach mode (public/internal/vip), CP endpoint (VIP /
  single-CP IP / external LB), VIP details (address, link, subnet, MTU) if
  applicable, boot method per node when needs-OS-install. Operator answers
  once in wizard Phase 4 instead of being re-asked by talos-bootstrap.

- State schema gains platform (metal/nocloud/aws/oci/azure/gcp) and
  cloud_hint (free-form for downstream routing) so OCI custom image / AWS
  AMI / metal flows pick the right boot path. Plus reach_mode, cp_endpoint,
  and a structured vip {address, link, subnet, mtu} block.

talos-bootstrap changes:

- installer.yaml lookup parameterised. --installer-profile-url for direct
  URL override, --cozystack-repo for non-default checkout, URL fallback to
  raw.githubusercontent.com when no local clone is present. Previously
  hardcoded path made the skill brittle on forks and bare workstations.

- Phase 11 verification switched from kubectl debug node + chroot /host to
  talosctl read /proc/modules + talosctl get extensions + talosctl read
  /etc/lvm/lvm.conf. The kubectl path required a CNI that doesn't exist
  yet at this point in the chain (cluster-install installs Cilium later);
  the verification was effectively broken. apid is always reachable via
  the talos API port regardless of CNI state.

- Phase 11 also adds Talos version drift detection: talm apply does NOT
  reinstall, so if the operator pinned image: ...:v1.12.7 in values.yaml
  but the nodes booted from nocloud-amd64.raw.xz of v1.13.0, the running
  Talos stays on v1.13.0 and effective stack drifts (DRBD 9.3.1 vs 9.2.16,
  etc.). The skill compares running version to pinned and surfaces the
  drift with the talosctl upgrade command to reconcile. Warning, not
  refusal — operator can choose to live with it once they know.
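
For orientation, the Phase 11 checks reduce to talosctl calls of roughly this shape (node/endpoint IPs are placeholders; --talosconfig and -e/-n are passed explicitly per the notes below):

  TC="$CONFIG_DIR/talosconfig"
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> read /proc/modules | grep -E 'drbd|zfs|openvswitch'
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> get extensions
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> read /etc/lvm/lvm.conf
  # drift check: compare the running version against the image tag pinned in values.yaml;
  # talm apply does not reinstall, so a mismatch means drift
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> version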

References / manual-steps.md additions:

- Working nodes/<name>.yaml example for cozystack v1.12+: multidoc
  HostnameConfig + LinkConfig instead of legacy machine.network.interfaces[]
  / machine.network.hostname. Explicit warning about the two error messages
  operators hit when mixing schemas ('multi-doc renderer cannot translate
  legacy ...' and 'static hostname is already set in v1alpha1 config').

- talm template flags note: -e/-n explicit on every call. talm apply
  parses the modeline at the top of nodes/<name>.yaml but talm template -i
  does not. 'failed to determine endpoints' is solved by -e/-n, not by
  --offline (the retrospective specifically called out the --offline
  shortcut as a wrong reflex).

- TALOSCONFIG env note: doesn't persist between Bash tool invocations
  (each is a fresh shell), so always pass --talosconfig explicitly.

plugin.json bumps to 1.8.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e, CI validator

Addresses the nine blockers and four non-blocking items from /branch-review.

Cross-reference drift (six blockers — same root cause):
- marketplace.json plugin description listed 5 skills; actual is 9.
- CLAUDE.md described old flat skills/ layout removed in the bundle refactor.
- cluster-install/references/issue-templates.md routing table called the storage skill 'drbd-recovery' (renamed to linstor:recover).
- package-bump/SKILL.md cross-ref pointed at plugins/cozystack/skills/deploy (correct: package-deploy).
- cluster-install/SKILL.md References table described storage-backends.md as 'LVM thin / LVM thick / ZFS'; the file is ZFS-only per the cozystack platform contract.
- talos-bootstrap drift check read $CONFIG_DIR/values.yaml which doesn't exist in the talos route — talm init writes nodes/<name>.yaml, secrets.yaml, talosconfig, not values.yaml. The drift check would have silently never fired. Now reads nodes/$CP1_NAME.yaml machine.install.image with uniformity verification across all node configs.

Discipline regressions:
- cluster-upgrade had 7+ bare kubectl/helm invocations despite the global --context rule. Added Phase 0 that pins $CTX once and an explicit Guardrail forbidding bare commands. Every kubectl and helm in the skill now passes --context $CTX / --kube-context $CTX.
- package-deploy --registry example showed ghcr.io/lexfrei (maintainer namespace) in a cozystack-branded skill. Replaced with placeholder ghcr.io/<your-username>.
- debug/references/upstream-routing.md called extractedprism's relationship 'author has commit rights through CCP context' — phrase doesn't parse for outside readers. Replaced with a neutral pointer ('independent BSD-3 project; file there for proxy-specific bugs') plus a new README section 'Third-party dependencies' that documents the dependency policy explicitly.

Non-blocking cleanups:
- Sample prompts in wizard / talos-bootstrap / debug had Russian text that confuses English-only contributors reading the source. Translated to English with a top-of-skill meta-note: 'every operator-facing prompt below is written in English; the LLM matches operator language at runtime per Core Principle'.
- .gitignore inconsistency: wizard sops opt-in flow claimed talosconfig + secrets.yaml stay gitignored even with sops on, contradicting the sops.md decision matrix which has talm encrypting them. Aligned — sops on, every secret file is encrypted-in-tree and commit-friendly; only *.tar.gz stays ignored. Talos NOTES summary updated to match.

CI validator (preventing recurrence):
- tools/check-refs.sh walks the plugin tree and validates: (1) every references/<file>.md mentioned in a SKILL.md exists; (2) every /<plugin>:<skill> mention resolves to a real directory under plugins/<plugin>/skills/<skill>/; (3) every plugin description in marketplace.json and the corresponding plugin.json names every skill present on disk. Six of nine blockers above are surface forms of the same 'string in one file doesn't match reality in another' bug — the validator catches all three patterns mechanically.
- .github/workflows/validate.yml runs jq on every JSON manifest and tools/check-refs.sh on every push and PR.

plugin.json bumps to 1.9.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Validator additions:
- check-refs.sh Check 4: enforce --context / --kube-context on kubectl
  and helm cluster-mutating commands; allow-list read-only / local ops
- check-refs.sh Check 5: deny private cluster names in plugin tree

Documentation fixes flushed out by the new checks (61 violations across
cluster-upgrade, cluster-install, debug, external-app-create,
package-bump, package-deploy, talos-bootstrap, linstor/recover):
- add --context / --kube-context to every cluster-mutating invocation
- replace private cluster identifiers with generic placeholders

Operator-facing cleanups:
- talos-bootstrap Phase 5: drop duplicated intro paragraph
- talos-bootstrap Phase 12 NOTES: dedupe artifact list
- cluster-install Phase 10 NOTES: expand 'helm uninstall ... -n' to
  'helm --kube-context <CTX> uninstall ... --namespace ...'

plugin.json: bump 1.9.0 -> 1.10.0

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
cluster-install Phase 5.6: render $CONFIG_DIR/extractedprism-values.yaml
on disk and pass it via --values, instead of inlining --set endpoints=...
The wizard already advertises this artifact (cluster-config layout, sops
opt-in prompt, .gitignore listing, final summary; state-schema.md;
sops.md creation_rules path_regex) — the downstream skill must actually
produce it for the artifact contract to hold.

check-refs.sh: replace GNU-only \b word boundaries with portable
character-class equivalents. \b is silently treated as literal by BSD
grep on macOS, which made Check 4 (kubectl/helm --context discipline)
and Check 5 (private cluster name denylist) no-op locally for any
contributor without GNU grep first in PATH. The validator is documented
as a pre-commit local gate, so a silent local degradation is the worst
failure mode.
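
An illustration of the portability fix (pattern shown for kubectl; the same substitution applies to the other keywords the checks match):

  grep -E '\bkubectl\b' SKILL.md                                  # GNU grep only; BSD grep matches nothing useful
  grep -E '(^|[^[:alnum:]_])kubectl([^[:alnum:]_]|$)' SKILL.md    # portable character-class equivalent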

plugin.json: 1.10.0 -> 1.10.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
extractedprism chart contract: endpoints is a comma-separated string
scalar (values.schema.json type=string). A YAML list shape is rejected
by helm at schema validation. cluster-install Phase 5.6 now renders
endpoints as a single string, matching the chart's published values.

Verified against pulled chart 0.2.0:
  helm template ... --values <(echo 'endpoints: ["x:1"]')
    => Error: at '/endpoints': got array, want string
  helm template ... --values <(echo 'endpoints: "x:1,y:1"')
    => render OK

references/values-template.md: switch documented invocation to --values
matching SKILL.md; call out the string-scalar contract explicitly so
future contributors do not regress.

references/known-failures.md: rewrite the recovery-path snippet to use
--values too. The previous --set form is in fact also broken — helm
--set parser treats ':' and ',' as syntax, so '--set endpoints=a:1,b:1'
fails to parse regardless of quoting. The values-file form sidesteps
this entirely.

Layer-pure refusal text:

- cluster-install SKILL.md Guardrails: drop the operator-facing pointer
  to re-run /cozystack:wizard from the sops-missing refusal; tell the
  operator what to install or which flag to pass instead.
- ubuntu-bootstrap SKILL.md Phase 2.5 and Guardrails: same treatment.

The downstream skill must not name the orchestrator in operator-facing
output. Internal SKILL.md prose and 'See also' references are fine.

plugin.json: 1.10.1 -> 1.10.2.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The previous Guardrail wording pointed operators at
references/requirements.md for SOPS recovery options, but that file
has no SOPS content — the only SOPS recovery doc lives in
wizard/references/sops.md. Pointing at a file that doesn't carry the
content is the same class of doc-drift the cross-ref validator was
built to catch (and the validator only verifies file existence, not
section presence).

Inline the recovery options directly, mirroring the ubuntu-bootstrap
treatment in commit 19a7ba9. Operators hitting the same problem in
either skill now get the same wording.

plugin.json: 1.10.2 -> 1.10.3.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The ZFS-only design has no declarative LinstorSatelliteConfiguration
CR — Phase 5.5 documents this correctly (CRD has no zfsPool slot;
registration goes through the LINSTOR API at Phase 8). Two stale call
sites contradicted that:

- Guardrail bullet claimed Phase 4 writes a 'LinstorSatelliteConfiguration
  block' to the platform package — there is no such block. Rewrite to
  describe the actual storage-state shape: cozystack.storage.nodes[] in
  .state.yaml, replayed by the Phase 8 post-Ready hook.

- References section advertised 'LinstorSatelliteConfiguration shapes'
  in values-template.md — that file documents the ZFS pool registration
  hook, not a CR shape. Repoint accordingly.

plugin.json: 1.10.3 -> 1.10.4.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Real install run on a 3-node Talos / OCI / VIP cluster surfaced three
chain-level gaps that no per-skill review could catch:

  1. Front-load promise was not honoured across the chain boundary.
     Wizard collected the bootstrap minimum, then cluster-install asked
     the operator for publishing.host, externalIPs strategy, bundles,
     cert solver, storage layout, and exposed services after the cluster
     already existed. Many of these are policy decisions known before
     bootstrap; re-asking them post-runtime is friction.

  2. completed_at / failed_at contract was implicit. cluster-install
     never documented the write. On the real run the operator patched
     status.cluster-install by hand after dispatch because the wizard's
     dispatch loop could not find a state transition.

  3. ScheduleWakeup fallback fired on top of downstream task-notification.
     Wizard re-invoked itself, spent a turn re-checking state that did
     not change, and reached an ambiguous dispatch decision.

Changes:

- wizard SKILL.md Phase 4: rewrite as 'full intake — everything
  policy-decidable up front'. Bootstrap-stage slots stay route-keyed.
  New cozystack-stage slots collected up front: bundles, storage layout
  preference, networking CIDRs, publishing.host + host_kind + cert
  solver + exposed services, external_ips strategy with explicit
  internal/external/explicit option (and the OCI-NAT failure mode
  spelled out), extractedprism opt-out. Render a single consolidated
  review screen with Approve all / Edit <slot> / Cancel.

- wizard SKILL.md Phase 5 dispatch loop: forbid the ScheduleWakeup
  fallback; a downstream skill returning without completed_at /
  failed_at is treated as a broken contract: synthesise failed_at and
  route through debug.

- wizard Phase 4 bootstrap slots gain per-node VIP-link static address
  collection — OCI maintenance-mode interfaces commonly lack IPv4 on
  VLAN secondaries, talm template then renders LinkConfig without
  addresses and the cluster splits.

- state-schema.md: new top-level cozystack_intake section documenting
  every policy slot the wizard now collects, with the
  external_ips.strategy=internal default justification.
  status responsibilities table updated; wizard now owns cozystack_intake.*.

- cluster-install SKILL.md Phase 4: rewritten as read-cozystack_intake-
  first, discovery-driven fill, single Approve/Edit gate. Direct
  invocation (no wizard) still works — the skill falls back to inline
  prompts when cozystack_intake is absent.

- cluster-install SKILL.md adds Phase 9.5 — mandatory status write,
  one of completed_at / failed_at, with shape spelled out. Phase 10
  NOTES follows after.

- wizard SKILL.md Guardrails enforce that a run ending with neither
  completed_at nor failed_at is a broken contract, routed through debug.

plugin.json: 1.10.4 -> 1.11.0 (minor — chain contract change, new
state slot).
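
A rough sketch of the status write the contract demands; the key layout
follows state-schema.md as described above, so treat the exact paths
and the reason string as assumptions:

```bash
# success: the skill stamps completed_at under its own status key
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  yq -i '.status."cluster-install".completed_at = strenv(ts)' .state.yaml

# failure: the other branch, never both (hypothetical reason string shown);
# the value names where the run failed
reason="phase-8-watch-timeout" \
  yq -i '.status."cluster-install".failed_at = strenv(reason)' .state.yaml
```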

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Five issues surfaced on a 3-node OCI Talos install that no review
caught because they were about Talos / cloud-provider reality vs the
documented happy path.

Phase 4 maintenance-mode probe:
  replace 'talosctl get machineconfig --insecure' with
  'talosctl version --insecure --short'. Talos v1.13+ returns
  PermissionDenied to insecure per-resource gets even in maintenance
  mode, which the previous probe misread as 'already configured' —
  a false negative that silently skipped Phase 5/6.

Phase 6.5 (new) — stub creation:
  'talm init' writes templates/, secrets.yaml, talosconfig — but not
  nodes/<name>.yaml. The skill now creates the stub files with hostname
  + schema modeline before 'talm template --in-place' fills them.

Phase 6.7 (new) — VIP-link IPv4 guardrail (HA only):
  On cloud-provider VLAN secondaries (canonical case: OCI ens5 in
  maintenance mode), the VIP-carrying link often has no IPv4 — talm
  template then renders LinkConfig with a vip: block and no
  addresses: block. After talm apply, only the VIP-holder is reachable
  in the VIP subnet; other CPs cannot dial the VIP for etcd join, and
  the cluster sits in 'member Preparing' for 10+ min before timing out.
  Skill now detects the empty addresses: block, refuses Phase 7, and
  either auto-patches from wizard intent_hints.vip.per_node or surfaces
  the exact fix.

Phase 8 — quoting + reboot semantics:
  Add explicit guidance to always double-quote shell variables holding
  IPs / node lists ('--nodes "$IPS"' not '--nodes $IPS') — unquoted
  lists with multiple IPs fail with 'unknown command <second-ip>'.
  Clarify the 'Applied configuration without a reboot' message: talm
  applied live; the on-disk image is replaced lazily on next upgrade,
  not on this apply.

Phase 12 — talm.key backup discipline:
  Surface talm.key as a recovery key in the artifacts listing and add
  a separate reminder line before NOTES. talm.key is not sops-encrypted;
  losing it loses the cluster's PKI without a path to regenerate.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… cert diag, e2e gate

Four runtime issues from the OCI install run, all in cluster-install:

#104 — External IPs choice silently picked the wrong set on NAT'd
providers. cluster-install Phase 4 LB/externalIPs now reads
cozystack_intake.external_ips (filled by wizard Phase 4) and validates
against live Node.status.addresses. On intent_hints.platform ∈
{oci, aws-with-eip, gcp-with-nat} the skill refuses the 'external'
strategy and forces 'internal' with an explicit reason, because public
IPs on those platforms are 1:1-NATed before the packet reaches the
node — Cilium externalIPs BPF would never match. The operator can pass
--allow-external-on-nat-provider to override.
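
A sketch of the live-address check the refusal is based on; on a
1:1-NAT provider the ExternalIP column stays empty, so the 'external'
strategy has nothing to match:

```bash
# list each node's InternalIP and ExternalIP as the cluster reports them
kubectl --context "$CTX" get nodes -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\t"}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
```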

#105 — Phase 7.5 root Tenant patch was a fixed pre-Phase-8 wait, but
the tenants.apps.cozystack.io CRD can land after most HRs are Ready
(observed on a real run). Move the patch into the Phase 8 watch loop:
event-driven — as soon as the CR appears, patch it and continue
monitoring. Removes the race entirely. Update SKILL.md cross-refs,
Guardrails, and references/known-failures.md to match.

#106 — 'wait ~5 min for first issuance' was a misleading cert-manager
mitigation that hid real failures. Healthy ACME HTTP-01 resolves in
under 30 s; anything stuck past 2 min has a definite, diagnosable
cause. Phase 9 now prints a per-state diagnostic table (kubectl get
challenges, orders, cert-manager log filter) and a mapping of
state+reason -> likely cause -> fix. nip.io and custom-fqdn diverge
here: nip.io's instant DNS makes any 2-min pending challenge anomalous.
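
The kind of per-state diagnostic sweep described; the resource kinds
are cert-manager's standard ones, while the controller namespace and
deployment name are assumptions:

```bash
# what is stuck, and in which state
kubectl --context "$CTX" get challenges.acme.cert-manager.io -A
kubectl --context "$CTX" get orders.acme.cert-manager.io -A
kubectl --context "$CTX" get certificaterequests.cert-manager.io -A

# controller-side reasons (namespace / deployment name assumed)
kubectl --context "$CTX" -n cert-manager logs deploy/cert-manager --since=15m \
  | grep -Ei 'challenge|order|acme'
```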

#107 — End-to-end reachability gate added as Phase 9.4 (between
Phase 9 verify and Phase 9.5 status write). 'All HRs Ready' is a
cluster-side signal; it says nothing about whether the dashboard is
reachable from outside the cluster. The skill now curls
https://dashboard.<host>/ from the workstation, classifies the result
(200/302/401 pass; exit 7/28, 530, persistent 502/503 fail), and on
failure writes failed_at='external-reachability' with curl + DNS +
node-addresses detail. The cluster is not rolled back; debug picks
up from there. --skip-external-reachability downgrades to warning
for non-routable sandbox installs.
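
A compressed sketch of the probe and its classification; the hostname
is a placeholder and the real skill records far more detail on failure:

```bash
host="dashboard.example.com"   # placeholder for dashboard.<publishing.host>
rc=0
code="$(curl -ks -o /dev/null -w '%{http_code}' --connect-timeout 10 "https://${host}/")" || rc=$?

case "${rc}:${code}" in
  0:200|0:302|0:401) echo "external reachability: pass (HTTP ${code})" ;;
  7:*|28:*|0:530)    echo "fail: ${host} unreachable (curl rc=${rc}, HTTP ${code})"; exit 1 ;;
  # the skill retries before treating 502/503 as persistent
  0:502|0:503)       echo "fail: ${host} answers but backend unhealthy (HTTP ${code})"; exit 1 ;;
  *)                 echo "ambiguous: curl rc=${rc}, HTTP ${code}; inspect manually" ;;
esac
```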

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…linstor hook + StorageClasses

Three storage-path gaps surfaced on the OCI Talos install:

#101 — Phase 5.5 used 'kubectl debug node --image=alpine:3 --
chroot /host zpool create'. None of the parts work on Talos:
- alpine ships musl; the cozystack-tuned image's zpool is glibc-linked
  and the loader (/lib64/ld-linux-x86-64.so.2) lives only in the
  ext-zfs-service namespace, not on the host rootfs visible via chroot
- 'chroot /host /bin/sh' fails — Talos has no /bin/sh
- PSA baseline on default / kube-system rejects the sysadmin debug pod

Split Phase 5.5 by distribution. Talos path now uses a one-shot
privileged Pod from ubuntu:24.04 in a cozy-storage-bootstrap namespace
labelled PSA=privileged: apt-get install zfsutils-linux, sgdisk
manual partitioning (no udev inside the pod, so 'zpool create /dev/sdb'
cannot wait for partition discovery), then 'zpool create -f
${DEVICE}1'. Ubuntu / k3s / kubeadm path keeps the original
kubectl-debug chroot approach. references/storage-backends.md gets
the verbatim Pod manifest and the three-reasons-why-chroot-fails block.

Add a Phase 5.5 step 7: pre-existing-data check before zpool create
(pvs, dmsetup, wipefs probe). 'talosctl reset' does not wipe user
disks by default, and previous LINSTOR LVM-thin pools on /dev/sdb
silently break 'zpool create' with EBUSY or DEGRADED pools.
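
A rough sketch of the Talos-path shape, assuming a hypothetical node
name, /dev/sdb as the data disk, and 'data' as the zpool name; the
canonical manifest lives in references/storage-backends.md and this
only illustrates the moving parts (privileged namespace, one-shot pod,
probe before create):

```bash
DEVICE=/dev/sdb   # illustrative data disk
NODE=node1        # illustrative storage node

kubectl --context "$CTX" create namespace cozy-storage-bootstrap \
  --dry-run=client -o yaml | kubectl --context "$CTX" apply -f -
kubectl --context "$CTX" label namespace cozy-storage-bootstrap \
  pod-security.kubernetes.io/enforce=privileged --overwrite

kubectl --context "$CTX" apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: zfs-bootstrap-${NODE}
  namespace: cozy-storage-bootstrap
spec:
  restartPolicy: Never
  nodeName: ${NODE}
  hostNetwork: true            # Phase 5.5 runs before any CNI exists
  containers:
    - name: bootstrap
      image: ubuntu:24.04
      securityContext:
        privileged: true
      command: ["/bin/bash", "-ceu"]
      args:
        - |
          apt-get update && apt-get install -y zfsutils-linux gdisk lvm2
          # pre-existing-data probe: the real skill stops here if anything shows up
          wipefs -n ${DEVICE} || true
          pvs || true
          # manual partitioning: no udev in the pod to wait for discovery
          sgdisk --zap-all ${DEVICE} && sgdisk -n1:0:0 ${DEVICE}
          zpool create -f data ${DEVICE}1
      volumeMounts:
        - name: dev
          mountPath: /dev      # /dev/zfs is reached through this mount
  volumes:
    - name: dev
      hostPath:
        path: /dev
EOF
```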

#102 — Earlier SKILL.md and values-template.md described a 'Phase 8
hook that runs linstor storage-pool create zfs per node after
controller Ready'. The hook was never implemented; the watch loop
only monitored HR Ready=True. Real installs left LINSTOR with only
DfltDisklessStorPool and operators registered pools by hand.

New Phase 8.5 — Register LINSTOR storage pools — is the actual
implementation. After all HRs Ready, iterate
state.cozystack.storage.nodes[] and exec
'linstor storage-pool create zfs <node> <linstor-pool> <zpool>'
inside the linstor-controller pod. Idempotent (skip when
storage-pool list already shows the entry). On per-node failure,
write failed_at='linstor-storage-pool' and abort — partial
registrations are kept for operator inspection.

references/storage-backends.md rewrites the 'register the pool'
section to point at the actual implementation in SKILL.md Phase 8.5.

#103 — cozystack v1.3.x does not create StorageClasses
automatically; cluster reaches 'all HRs Ready' with zero SCs and
every stateful tenant workload sits in
'pod has unbound immediate PersistentVolumeClaims'. Skip on
v1.4.0+ (the tenants CRD then exposes spec.storageClasses).

New Phase 8.6 — Default StorageClasses — writes
'local' (placementCount=1) and 'replicated' (placementCount=3,
isDefaultClass=true) to storageclasses-default.yaml under
config-dir and applies them. placementCount on 'replicated'
auto-derives from cozystack.storage.nodes[] count when fewer
than 3 storage nodes are present.
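
A sketch of the two classes under the stated parameters; the
provisioner and parameter keys follow the LINSTOR CSI driver's
documented names, and the backing pool name is an assumption:

```bash
cat > "$CONFIG_DIR/storageclasses-default.yaml" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: data      # assumed LINSTOR pool name
  linstor.csi.linbit.com/placementCount: "1"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated
  annotations:
    # 'isDefaultClass=true' expressed as the standard default-class annotation
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: data
  linstor.csi.linbit.com/placementCount: "3"    # auto-derived when <3 storage nodes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

kubectl --context "$CTX" apply -f "$CONFIG_DIR/storageclasses-default.yaml"
```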

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
New references/provider-pitfalls.md collects the things that weren't
deducible from 'all HRs Ready' or from generic k8s docs and were each
a multi-hour debug episode in a real install run:

- OCI 1:1 NAT — externalIPs must be internal, mechanism explained
- GCP NAT'd external IPs — same shape, with exception for direct-EIP VMs
- AWS Elastic IP / NLB proxy-protocol pitfalls
- Talos system-extension binaries live only inside ext-* namespaces;
  why kubectl debug + chroot /host cannot run zpool / drbd userspace
- Pod Security Admission baseline blocks kubectl debug node since
  k8s 1.25; need a dedicated privileged-labelled namespace
- talosctl reset leaves user disks intact; previous-install LINSTOR
  LVM-thin pool state on /dev/sdb silently breaks zpool create
- cozystack v1.3.x does not create StorageClasses automatically
- cozystack v1.3.3 isp-full bundle does not include Keycloak
- api.<host> ingress speaks TCP passthrough — 401 + self-signed cert
  is by design (apiserver terminates TLS itself)
- HelmRelease count varies during install (no fixed expected total)
- linstor CLI requires mTLS client cert; always exec inside the
  controller pod
- HelmRelease cascade warnings ('secret not found', 'rolebinding not
  found') in first 5–10 min are transient race conditions

Each pitfall has Symptom / Mechanism / Fix sections so the skill can
cross-reference a specific entry from Phase 4 publishing or Phase 5.5
storage and the operator gets the why, not just the what.

plugin.json: 1.11.0 -> 1.12.0 (minor — new chain semantics + new
phases 8.5 / 8.6 / 9.4 / 9.5 in cluster-install).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Four blockers + three medium findings from the post-batch-B review.

Blocker — stale Phase 7.5 references in debug skill:
  debug SKILL.md classification example and references/classification.md
  table row both named 'Phase 7.5'. The cluster-install refactor moved
  that work inline into the Phase 8 watch loop; debug now produced
  evidence-narratives pointing at a phase number that no longer exists.
  Re-frame: the failure mode is now 'Phase 8 watch loop never observed
  the tenants/root CR appearing'.

Blocker — Talos privileged-pod wait condition was wrong:
  references/storage-backends.md used --for=condition=Ready on a
  short-running one-shot pod with restartPolicy: Never. A one-shot pod
  transitions to phase=Succeeded with condition=Ready=False, so the
  wait hung the full 300 s timeout on every successful run.
  Switch to --for=jsonpath='{.status.phase}'=Succeeded; on failure
  describe + log + exit 1 so the operator gets the real diagnosis.
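
A sketch of the corrected wait plus the failure branch; pod name and
namespace are the illustrative ones from the storage-bootstrap example:

```bash
POD=zfs-bootstrap-node1
NS=cozy-storage-bootstrap

# a Failed pod exhausts the timeout here; the real check also watches phase=Failed
if ! kubectl --context "$CTX" -n "$NS" wait "pod/$POD" \
    --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s; then
  kubectl --context "$CTX" -n "$NS" describe "pod/$POD"
  kubectl --context "$CTX" -n "$NS" logs "pod/$POD"
  exit 1
fi
```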

Blocker — Phase 8.5 ordering would deadlock real installs:
  Earlier text gated LINSTOR storage-pool registration on 'all HRs
  Ready'. paas / monitoring HRs that request PVCs wait for the
  storage pool to exist, so an all-HRs-Ready gate would block on HRs
  that block on registration that blocks on all-HRs-Ready.
  Fold registration into the Phase 8 watch loop, gated on
  linstor-controller Deployment having >= 1 Ready replica (same shape
  as the root-Tenant ingress patch). Drop the standalone Phase 8.5.
  Update all cross-references (SKILL.md, storage-backends.md,
  provider-pitfalls.md).

Blocker — duplicate /dev/zfs hostPath mount masks itself:
  Mounting hostPath:/dev/zfs at /dev/zfs inside a pod that already
  mounts hostPath:/dev at /dev is nested-mount territory; one of the
  two will mask the other. /dev/zfs is reached through the /dev mount
  (devtmpfs is shared). Drop the second volume + volumeMount.

Medium — curl exit-code semantics in Phase 9.4 reachability probe:
  Phase 9.4 said 'curl exit 7 is almost always externalIPs misconfig
  on a NAT'd provider'. curl exit 7 (CURLE_COULDNT_CONNECT) actually
  covers both ECONNREFUSED (RST — fits the NAT story) and EHOSTUNREACH
  / ENETUNREACH (no route — workstation can't reach the cluster at
  all). Different root causes. Add 'ip route get <ip>' as the
  disambiguator. Also add curl exit 6 (DNS resolve failed) — distinct
  failure mode, common on custom-fqdn before DNS is configured. Note
  the %{exitcode} curl 7.75+ requirement.

Medium — Phase 8 LINSTOR loop iteration safety:
  'for entry in $(...)' word-splits on whitespace; an embedded space
  in a JSON value would silently turn one object into multiple bogus
  entries. Switch to 'jq -c .[] | while IFS= read -r entry' both in
  SKILL.md Phase 8 and references/storage-backends.md.
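
A whitespace-safe sketch of the registration loop; the controller
namespace, label selector, state keys, and LINSTOR pool name are
assumptions layered on the commands named above:

```bash
CONTROLLER="$(kubectl --context "$CTX" -n cozy-linstor get pods \
  -l app.kubernetes.io/component=linstor-controller \
  -o jsonpath='{.items[0].metadata.name}')"   # namespace / label assumed

yq -o=json '.cozystack.storage.nodes' .state.yaml \
  | jq -c '.[]' \
  | while IFS= read -r entry; do
      node="$(jq -r '.name'  <<<"$entry")"    # field names assumed
      zpool="$(jq -r '.zpool' <<<"$entry")"
      # idempotent: skip nodes that already carry a pool entry
      if kubectl --context "$CTX" -n cozy-linstor exec "$CONTROLLER" -- \
           linstor -m storage-pool list | grep -q "\"$node\""; then
        continue
      fi
      kubectl --context "$CTX" -n cozy-linstor exec "$CONTROLLER" -- \
        linstor storage-pool create zfs "$node" data "$zpool"
    done
```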

Medium — talosctl jsonpath empty-output silent pass:
  talosctl get extensions ... | grep -E 'drbd|zfs|openvswitch' silently
  passes when the stream is empty (no match, and nothing verified that
  every expected extension appeared). Replace with a per-extension
  grep -q and an explicit floor — each expected extension must produce
  at least one match, otherwise exit 1.

plugin.json: 1.12.0 -> 1.12.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Three structural contradictions in cluster-install/SKILL.md and one
lying comment in talos-bootstrap/SKILL.md.

Blocker A — STOP GATE 2 referenced as 'Phase 5' but lived inside
Phase 4 (Phase 5 didn't exist as a heading). Three call sites
(lines 27, 219, 267) pointed operators at a non-existent phase.
Promote 'STOP GATE 2 — Plan presentation' to '## Phase 5 — Plan
presentation (STOP GATE 2)'. Phase 4 ends at the intake summary;
Phase 5 owns the plan and the gate.

Blocker B — two sibling H2 headings titled '## Phase 8'. The second
one was rationale + the inline LINSTOR registration block; it isn't
a separate phase, it's a continuation. Demote to '### LINSTOR
storage-pool registration — inline' under the first Phase 8 H2.

Blocker C — Phase numbering skipped 9.1 / 9.2 / 9.3 and jumped
straight to 9.4 / 9.5, leaving the fingerprint of removed-but-not-
renumbered phases. Renumber 9.4 -> 9.1 (reachability probe) and
9.5 -> 9.2 (status write).

Medium — talos-bootstrap Phase 11 extensions check comment claimed
'grep -c floor catches that' while the code uses per-extension
'grep -qE'. Rewrite the comment to describe what the code actually
does.

plugin.json: 1.12.1 -> 1.12.2.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added 8 commits May 16, 2026 03:33
… Phase 7 contribution offer

The skill set now ships a seed library of operator-experience case
studies that capture things not deducible from generic Cozystack /
Kubernetes docs: provider-specific quirks, hardware edge cases, race
conditions only visible on slow clusters, real failure modes operators
debugged for hours and the actual sequence of commands that fixed them.

Layout under plugins/cozystack/skills/wizard/references/case-studies/:

  oci/ (3)             — externalIPs 1:1 NAT, VLAN-VIP IPv4 missing,
                         VNIC-attachment for VIP placement
  hetzner/ (2)         — RobotLB + vSwitch instead of MetalLB,
                         GRUB serial-console blocks Talos boot
  bare-metal/ (1)      — Ubuntu Secure Boot rejects DRBD module
  talos/ (7)           — kubectl debug + chroot fails for zpool,
                         PSA blocks debug pods, talosctl reset leaves
                         user disks, boot-to-talos hardcoded version
                         mismatch, lldpd extension blocks boot,
                         talosctl version mismatch, single-disk
                         UserVolume + --insecure flag window
  storage/ (6)         — LINSTOR controller CrashLoop + CRD-DB recovery,
                         dedicated network resets, multi-DC zone
                         selectors, encryption passphrase re-enter
                         after restart, stale superblock blocks pool
                         detection, TLS rotation staged restart
  networking/ (3)      — kube-ovn DB cleanup DaemonSet, virtual-router
                         port_security annotation, KubeSpan +
                         kube-ovn MTU 1222
  virtualization/ (5)  — Proxmox migration via cdi-uploadproxy,
                         Windows VirtIO bus switch + MTU 1400,
                         MikroTik CHR SATA bus, GPU passthrough
                         resource-name derivation
  cozystack-v1.3/ (2)  — StorageClasses not auto-created,
                         Keycloak absent from isp-full bundle
  cluster-install/ (1) — tenants CRD race with HRs Ready

Sources: a real OCI Talos install run that surfaced 8 issues, plus 22
candidates lifted from cozystack/website docs/v1.3 and blog posts. All
IPs redacted to RFC 5737 or placeholders; hostnames to example.com;
no customer / account / project identifiers preserved.

case-studies/README.md spells out the schema, the body shape
(Symptom / Mechanism / What was tried / What fixed it / Operator
notes), and the redaction rules. New cases follow the schema.

Wizard Phase 7 — contribution offer:

  After Phase 6 final summary, when the run hit something interesting
  (resolved failed_at, triggered guardrail, new platform tag without
  prior coverage, debug drafted upstream issue), the wizard offers
  ONCE to draft a case study. Skipped silently when nothing
  interesting happened.

  Auto-redaction runs BEFORE the operator sees the draft (IPv4/IPv6
  literals, hostnames, OCID/ARN/GCP-project, SSH key paths, secrets,
  emails). The redaction report is printed alongside the draft so
  the operator can audit what was replaced.

  Operator is then explicitly asked to read the draft for things the
  regex can't catch — cloud account IDs, organization names embedded
  in label values, internal hostnames in stack traces, real names
  in operator-notes. Four options: Looks clean / Edit first /
  Save only / Discard. Discard records the decline and the wizard
  never asks again on subsequent --resume runs.

  Looks clean prints the exact 'gh pr create' command for the
  operator to run themselves. The wizard never opens PRs.

Cross-references added from:
  - talos-bootstrap Phase 6.7 VIP-link guardrail → oci/vlan-vip-no-ipv4-in-maintenance
  - cluster-install Phase 4 externalIPs strategy → oci/externalips-1to1-nat
  - cluster-install References section → full case-studies catalog by axis

README.md gains a 'Knowledge base' section above 'Third-party
dependencies' explaining what the seed library covers and how
operators contribute back.

plugin.json: 1.12.2 -> 1.13.0 (minor — new wizard phase + new
top-level reference subtree under wizard/).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e base + Phase 7 contribution offer"

This reverts commit f805c4e.
…cited

Replaces the reverted bundled case-studies knowledge base with a
runtime research pass before dispatch.

Rationale:
A bundled knowledge base of 30 pre-written case studies inside the
plugin had two problems no review caught:
- Data ages (Cozystack ships monthly; Talos ships monthly; providers
  evolve). Pre-written cases pin to a moment in time and rot.
- Skills should ship instructions, not data. A growing community-
  contributed knowledge base mixes review cadences with skill code
  and conflates 'what to do' with 'what others did'.

The right shape: the wizard knows HOW to look for known landmines
for the operator's specific combination at install time, against
current docs and current upstream trackers.

Phase 4.5 — Active research:
- Runs read-only, time-boxed (~2 min), after Phase 4 intake collects
  the full combination (target, platform, version, variant, bundles).
- Sources in trust order: local clones of cozystack/website +
  cozystack/cozystack + cozystack/talm (if present at
  ~/git/github.com/cozystack/), upstream issue trackers via gh,
  web search as last resort.
- Skeptical-verification rules are non-negotiable: every claim must
  cite a traceable source (URL, file path, issue number); 'I recall
  that' / 'I'd expect' / 'might also' are forbidden; recency check
  for blog posts vs current versions; reproduce-vs-speculation flag.
- Per-axis checklist of things to look for (oci, hetzner,
  aws-with-eip, gcp-with-nat, bare-metal + talos / bundle-specific
  axes) — starting points, not an exhaustive matrix.
- Output is a ranked HIGH / MEDIUM / LOW landmines list with the
  source citation per finding; operator gate: Acknowledged / Pause
  / Edit Phase 4 values.
- Empty result is a quiet one-liner, not a gate.
- 24h cache keyed on the combination; --rerun-research forces fresh.

plugin.json: 1.12.2 -> 1.13.0 (minor — new phase in wizard,
new state slot research_cache).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ltidoc handling

Real OCI install run on Talos 1.12 (cozystack v1.3.3) surfaced four
high-severity bugs in talos-bootstrap. The cert-SAN trap caught the
same operator twice in a row — second occurrence makes it a skill
bug, not a one-off.

#1 cert-SAN trap (NEW Phase 6.3 guardrail):

  On NAT-fronted providers (OCI 1:1 NAT, GCP Cloud NAT, AWS EIP) the
  workstation reaches each node by a public IP that the cloud fabric
  rewrites to the internal IP before the packet hits the interface.
  Talos sees only the internal IP and generates the API-server cert
  with internal-only certSANs. The workstation, dialing the public
  IP, gets a cert with no matching SAN; the TLS handshake fails;
  talosctl bootstrap cannot proceed; and there is no insecure escape
  (the bootstrap subcommand has no skip-verify mode, and talosctl
  reset itself needs valid TLS).

  Recovery from this state requires re-imaging the node. The trap
  caught the same operator twice across consecutive sessions.

  Phase 6.3 — NAT-provider cert-SAN guardrail — runs BEFORE the
  first talm apply, when the signature (reach_mode=public AND
  external_ips_strategy=internal AND public_ip set) is detected.
  Auto-populates values.yaml machine.certSANs with:
    127.0.0.1, localhost, every inventory.nodes[].public_ip and
    internal_ip, vip.shared_address, vip.per_node[], api.<host>.
  Surfaces the result and the reason. Skipped silently when no
  NAT signature.
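
  A sketch of what the guardrail writes, with RFC 5737 / example.com
  placeholders standing in for the inventory values:

```bash
# machine.certSANs gets every address the workstation may dial:
# 127.0.0.1/localhost, each node's public_ip and internal_ip,
# vip.shared_address and per-node VIPs, api.<host>
yq -i '.machine.certSANs = [
  "127.0.0.1", "localhost",
  "203.0.113.10", "10.0.0.10",
  "203.0.113.11", "10.0.0.11",
  "10.0.0.100",
  "api.example.com"
]' values.yaml
```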

#4 Talos 1.12 maintenance probe:

  Earlier rewrite (post-dev17.1) switched the probe to
  'talosctl version --insecure --short' to dodge v1.13's
  PermissionDenied on get machineconfig. But Talos 1.12 (cozystack
  v1.3.3's pinned version) returns 'API is not implemented in
  maintenance mode' from the version endpoint — false negative.

  New probe is 'talosctl get disks --insecure' which works across
  Talos 1.12 / 1.13 / 1.14 in maintenance. Use 'get machineconfig
  --insecure' result to distinguish 'maintenance' from 'already
  configured' after the disks probe confirms the node speaks the
  Talos API at all.
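
  A sketch of the probe the message settles on; the node address is a
  placeholder, and the configured-vs-unreachable disambiguation is only
  outlined:

```bash
NODE=203.0.113.10   # placeholder

# 'get disks --insecure' answers in maintenance mode on Talos 1.12-1.14,
# where 'version --insecure' (1.12) and 'get machineconfig --insecure'
# (1.13+) do not.
if talosctl -n "$NODE" -e "$NODE" get disks --insecure >/dev/null 2>&1; then
  echo "node $NODE: maintenance mode, proceed with Phase 5/6"
else
  # not maintenance: either already configured (insecure API rejected)
  # or unreachable; the skill distinguishes the two as described above
  echo "node $NODE: not in maintenance mode" >&2
fi
```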

#2 talm template --in-place strips body overlay:

  talm README states explicitly that 'talm template -f node.yaml
  [--in-place]' produces output byte-identical to running the
  template alone — it does NOT merge body overlay (HostnameConfig,
  per-node LinkConfig with static IPv4 on the VIP link, etc.).
  On Talos 1.12+ multidoc the body overlay is critical for per-node
  network config; --in-place erases it.

  Phase 6.5 no longer runs --in-place. Instead it writes
  nodes/<name>.yaml with the body overlay directly: modeline +
  HostnameConfig doc + LinkConfig doc (with addresses when the
  node carries a per-node VIP static) + Layer2VIPConfig doc
  (once on the first CP). 'talm apply -f node.yaml' merges this
  body overlay on top of the template render correctly.

#3 VIP-link guardrail multidoc-aware:

  The old guardrail check used 'yq .machine.network.interfaces[] |
  select(.vip != null) | .addresses' which only matches the
  legacy single-doc Talos <= 1.11 format. On 1.12+ multidoc the
  VIP lives in a separate Layer2VIPConfig doc and link static
  addresses live in LinkConfig doc — the old yq path always
  returned empty (false trigger) or missed real misconfigurations
  (false negative).

  Phase 6.7 now branches on state.cluster.talos_version: multidoc
  check looks at 'select(.kind == "LinkConfig" and .metadata.name
  == strenv(VIP_LINK)) | .spec.addresses'; legacy check stays for
  Talos <= 1.11.
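
  A sketch of the multidoc branch using the yq path quoted above; the
  file path and interface name are placeholders:

```bash
export VIP_LINK=ens5   # interface carrying the VIP (OCI example)

addresses="$(yq '
  select(.kind == "LinkConfig" and .metadata.name == strenv(VIP_LINK))
  | .spec.addresses
' nodes/node1.yaml)"

if [ -z "$addresses" ] || [ "$addresses" = "null" ]; then
  echo "VIP link $VIP_LINK has no static addresses; refusing Phase 7" >&2
  exit 1
fi
```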

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…sion normalize + hostNetwork + LINSTOR race

#5 talos-bootstrap Phase 11.5 — auto-upgrade to cozystack-tuned image:

  When Phase 11 verification detects missing extensions or running
  image != pinned cozystack-tuned image, the root cause is almost
  always 'operator booted from base Talos image instead of the
  cozystack-tuned artefact'. talm apply does not reinstall — the
  pinned machine.install.image only takes effect on the next
  talosctl upgrade.

  Phase 11.5 surfaces the diagnosis, asks the operator (STOP GATE),
  then runs 'talosctl upgrade --image $PINNED --preserve' per node
  sequentially (etcd quorum maintained). After all nodes upgrade,
  re-runs Phase 11 verification once. Without this step, the
  downstream cluster-install Phase 3 refuses the cluster.

#6 cluster-install Phase 6 — OCI registry version normalize:

  cozystack/cozystack git tags are 'vX.Y.Z' but the OCI registry
  publishes the cozy-installer chart at tag 'X.Y.Z' (no v prefix).
  Passing 'v1.3.3' to --version produces 'not found'. Normalize
  in-place: INSTALLER_VERSION_OCI="${INSTALLER_VERSION#v}".
  Phase 5 plan presentation shows the normalized tag explicitly.

#7 cluster-install Phase 5.5 — hostNetwork: true on storage pod:

  Phase 5.5 runs BEFORE Phase 6 which installs cozy-installer
  (and Cilium). Without CNI a pod on the default pod-network
  cannot get an IP and stays ContainerCreating forever. The bootstrap
  pod doesn't open any listening ports, so hostNetwork: true is the
  right shape — sidesteps CNI dependency entirely. Test run got
  lucky (image pulled before Cilium was Required) but contract
  requires explicit hostNetwork. references/storage-backends.md
  already has it; updated SKILL.md prose to explain why.

#8 cluster-install Phase 8 — LINSTOR race fix:

  Earlier design folded LINSTOR storage-pool registration into the
  watch loop gated on linstor-controller Deployment Ready, to avoid
  the post-watch-deadlock with paas/monitoring HRs that depend on
  PVCs. But: a HelmRelease can report Ready=True (from Flux's
  install-action-succeeded lifecycle) BEFORE the underlying
  Deployment has any ready replicas. The watch loop's outer exit
  condition was 'no HR is non-Ready', which fired before the
  inline registration block had a ready linstor-controller to
  exec against. Pools stayed unregistered; operator had to register
  manually after cluster-install declared success.

  STOP GATE 3 now requires BOTH no-HR-not-Ready AND
  storage-pool count matches storage-node count (queried via
  'linstor storage-pool list' filtered to provider_kind=ZFS,
  unique node names). The watch loop keeps polling — including the
  inline registration — until both conditions hold, without
  re-introducing the post-watch deadlock.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ch storage + bash compat + state schema

Batch D — wizard:

#9 Networking CIDRs read from cozystack source values.yaml:
  Hardcoded 10.42.0.0/16 + 10.43.0.0/16 were k3s defaults, not
  cozystack defaults. cozystack v1.3.3 actually uses 10.244.0.0/16
  + 10.96.0.0/16. Resolution order: local clone of cozystack/cozystack
  at ~/git/.../packages/core/platform/values.yaml, then older layouts,
  then HTTP fallback. cozystack_intake.network now carries
  defaults_source path so the operator can trace which file informed
  the choice.

#10 cozystack_intake.installer_variant + platform_variant kept SEPARATE:
  Older drafts joined into 'isp-full-generic' but the cozy-installer
  chart wants installer_variant for --set cozystackOperator.variant
  AND platform_variant for the values overlay. The joined form
  doesn't map to either CLI flag. cozystack_intake schema split
  explicit: installer_variant ∈ {generic, talos, hosted},
  platform_variant ∈ {isp-full, isp-paas, isp-hosted, minimal}.

#11 Drop separate confirmation rounds in Phase 0/2:
  Operator complained about 5-6 'looks right? yes/no' rounds across
  the wizard. Phase 0 echo now flows directly into Phase 1's question
  (single response moves the chain). Phase 2 no longer asks 'right?'
  when intent_hints.target was already extracted; Phase 4 consolidated
  intake is the single confirmation point.

Batch E — cross-skill UX:

#13 cluster-install Phase 5.5 batch-all-identical:
  When all storage-scope nodes share layout + device shape +
  distribution, surface one STOP GATE that covers all N nodes
  ('Provision all 3 identical nodes') instead of three identical
  per-node prompts. Detect identical-batch shape from
  cozystack_intake.storage_pref + discovered devices. For non-uniform
  configurations (mixed single/mirror, different paths), fall back
  to per-node STOP GATE — those are real choices.

#12 Shell compatibility for emitted helper scripts:
  Documented in references/helmrelease-monitoring.md. Shebang
  #!/usr/bin/env bash; avoid declare -A / mapfile / ${var,,} / **
  globstar; fall back to parallel arrays + linear lookup, or
  awk/jq/yq over a tempfile. Bash-4-only constructs need an explicit
  BASH_VERSINFO guard. A real session hit declare -A failing on
  macOS bash 3.2 and the watch loop never ran.
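
  An illustrative guard and fallback in the spirit of the rule above:

```bash
#!/usr/bin/env bash
set -euo pipefail

if (( BASH_VERSINFO[0] >= 4 )); then
  declare -A hr_state          # associative arrays only exist on bash >= 4
  hr_state[cozystack]=Ready
else
  hr_names=(cozystack)         # parallel arrays + linear lookup on bash 3.2
  hr_states=(Ready)
  lookup_state() {
    local i
    for i in "${!hr_names[@]}"; do
      if [ "${hr_names[$i]}" = "$1" ]; then
        printf '%s\n' "${hr_states[$i]}"
        return 0
      fi
    done
    return 1
  }
fi
```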

#14 state.schema.json — machine-readable JSON Schema:
  Lives at wizard/references/state.schema.json. Validates every
  top-level section (intent_hints, sops, inventory, cluster,
  cozystack_intake, cozystack, research_cache, status) with enum
  constraints on variant/distribution/layout/strategy fields.
  status.<skill> oneOf-constrained to exactly one of
  completed_at / failed_at / (dispatched-only mid-flight).
  Operators can drive yaml-language-server via a header comment
  in .state.yaml; future check-refs.sh extension can validate.
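
  The header-comment hook mentioned above, sketched with an assumed
  path to the installed schema:

```bash
# first line of .state.yaml; editors running yaml-language-server will
# validate against the schema (path is an assumption about the layout)
printf '# yaml-language-server: $schema=%s\n' \
  "plugins/cozystack/skills/wizard/references/state.schema.json" \
  | cat - .state.yaml > .state.yaml.new && mv .state.yaml.new .state.yaml
```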

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…covery

When Talos nodes reach an unrecoverable state from inside the cluster
(cert-SAN trap caught before the guardrail, broken machine-config,
lost talosconfig, accidental wipe), 'talosctl reset' itself needs a
valid TLS handshake — there's no in-cluster escape. Recovery requires
provider-layer intervention: terminate the instance, preserve disks,
relaunch from the cozystack-tuned image, re-attach.

cozystack:talos-reset orchestrates this without losing block volumes,
secondary VNICs, or NSG memberships. Provider-specific CLI sequences
(oci, aws, gcloud, hcloud) live in references/provider-cli.md.

Phase plan:

  1 — Read state, scope (--nodes filter or all-of-failure-signature)
  2 — Provider CLI auth check (refuses on missing/unauthenticated)
  3 — Snapshot per-node cloud state to <config-dir>/talos-reset/<node>/
      (instance + volume-attachments + vnic-attachments + nsg memberships)
  4 — Plan presentation (STOP GATE 1): per-node terminate/relaunch/reattach
      commands with preserved+lost breakdown
  5 — Terminate per node, sequential (etcd quorum maintained on multi-CP)
  6 — Relaunch from cozystack-tuned image + reattach preserved resources
  7 — Verify maintenance mode + preserved zpool intact
  8 — Update state.yaml (new public IPs, reset status.talos-bootstrap)
  9 — NOTES + handoff to /cozystack:talos-bootstrap

Provider-specific gotchas documented:
  - OCI: PARAVIRTUALIZED launch mode mandatory for QCOW2 custom images;
    secondary VNIC OCIDs change (preserve VLAN membership, not VNIC ID).
  - AWS: detach data volumes BEFORE terminate (DeleteOnTermination
    behaviour); EIPs disassociate, re-associate post-relaunch.
  - GCP: --keep-disks=data preserves non-boot disks automatically;
    Cloud-NAT'd VMs handled by Phase 4.5 NAT-signature research.
  - Hetzner: Cloud servers only (Robot not supported by hcloud);
    vSwitch + VLAN preserved at Network attachment level.

Guardrails:
  - Never auto-execute provider CLI; print + wait for Proceed
  - Never terminate >1 node simultaneously on multi-CP (etcd quorum)
  - Never assume new public IP == old (capture from new instance VNIC)
  - Never touch in-cluster state (Cozystack-side is cozystack:debug)

Registered in marketplace.json + plugin.json + README + CLAUDE.md.
plugin.json: 1.13.0 -> 1.14.0 (minor — new skill).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…sion

Review of the dev17.2-feedback batch caught four contradictions
introduced by my fix attempts. All four come from making changes
without verifying against the upstream source.

#1 platform_variant enum was invented, not sourced from upstream:
  cozystack/cozystack/packages/core/platform/ ships exactly
  three overlays: values-isp-full.yaml, values-isp-full-generic.yaml,
  values-isp-hosted.yaml, plus the base values.yaml (default).
  The schema enum was ['isp-full', 'isp-paas', 'isp-hosted', 'minimal']
  — omits real value isp-full-generic and invents two values with
  no upstream backing.

  state.schema.json + state-schema.md now use the real upstream
  enum: ['default', 'isp-full', 'isp-full-generic', 'isp-hosted'].
  The state-schema.md comment that called 'isp-full-generic' a
  'v0 mistake' was itself the mistake; rewritten to document the
  installer_variant ↔ platform_variant pairing as it exists in
  cozystack source (talos↔isp-full, generic↔isp-full-generic,
  hosted↔isp-hosted).

#2 CIDR fix #9 was only half-applied:
  wizard/SKILL.md correctly states cozystack defaults are
  10.244.0.0/16 / 10.96.0.0/16, but cluster-install/SKILL.md:197-199
  consolidated summary still showed 10.42 / 10.43, and
  references/values-template.md kept those as 'k3s default'
  recommendations.

  Updated cluster-install Phase 4 summary to show cozystack defaults.
  Rewrote values-template.md 'CIDR cheat sheet by distribution'
  to 'CIDR defaults from cozystack source' — the k3s/RKE2 distro
  defaults are irrelevant on cozystack because Kube-OVN overlays
  cozystack's defaults regardless. Old table was actively misleading.

#3 Layer confusion in wizard SKILL.md Phase 4.5 trigger:
  Line 420 read 'When cozystack_intake.platform_variant contains
  system bundle' — platform_variant is a scalar enum, bundles is
  the array of bundles. Fixed to 'When cozystack_intake.bundles
  includes system' — the exact thing fix #10 was meant to keep
  separate.

#4 cluster-install Phase 4 summary 'default for variant: isp-full-generic'
  was parseable as the old joined form. Clarified to
  'default for platform_variant: isp-full-generic on installer_variant:
  generic' — explicitly names both axes.

plugin.json: 1.14.0 -> 1.14.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@coderabbitai

coderabbitai Bot commented May 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 828e0b67-66ec-468c-a28d-a9c04ef3a739



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request restructures the repository by consolidating individual skills into two primary plugin bundles, cozystack and linstor, and introduces a wizard orchestrator to streamline the installation process. It adds extensive documentation and reference guides for cluster installation, upgrades, debugging, and node bootstrapping for Talos and Ubuntu environments. A new validation script, check-refs.sh, is also included to maintain cross-reference integrity. Review feedback identifies a bug in the diagnostic bundle's tar command and points out that the skill count for the cozystack plugin needs to be updated for accuracy in the documentation.

kubectl --context $CTX --namespace "$ns" describe hr "$name" > "$DUMP/hr-${ns}-${name}.txt"
done

tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")"

high

The tar command for creating the diagnostic bundle has an incorrect --directory argument. It's set to /tmp, but the directory to be archived (diagnostics-${TS}) is created under $CONFIG_DIR. This will cause the command to fail if $CONFIG_DIR is not /tmp.

Suggested change
tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")"
tar -czf "${DUMP}.tar.gz" --directory "$CONFIG_DIR" "$(basename "$DUMP")"

Comment thread README.md
### cozystack

| Plugin | Description |
Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.

medium

The description for the cozystack plugin states that it provides nine skills, but it actually includes ten. This should be updated for accuracy.

Suggested change
Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.
Platform skills bundle. One install gives you ten skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.

Comment thread README.md

```text
plugins/
cozystack/ # platform bundle (9 skills)

medium

The repository layout description for cozystack incorrectly states it has 9 skills. This should be updated to 10 to match the actual number of skills in the bundle.

Suggested change
cozystack/ # platform bundle (9 skills)
cozystack/ # platform bundle (10 skills)

@lexfrei lexfrei self-assigned this May 17, 2026
