
feat!(cozystack): wizard chain + 2-plugin consolidation (breaking)#11

Draft
lexfrei wants to merge 38 commits into main from feat/cozystack-install-skill

Conversation

lexfrei (Contributor) commented May 17, 2026

TL;DR

Major refactor: 5 separate plugins consolidated into 2 (cozystack with 10 skills + linstor with 1). 5 existing skills renamed; 6 new skills added; CI validator added; CLAUDE.md added; README rewritten. Breaking change for operators — old /plugin install <name>@cozystack-claude-plugins paths and skill IDs no longer work.

Repository refactor (BREAKING)

Before (main today):

skills/
  cozy-bump/skills/cozy-bump/SKILL.md
  cozy-deploy/skills/cozy-deploy/SKILL.md
  cozy-external-app/skills/cozy-external-app/SKILL.md
  cozystack-upgrade/skills/cozystack-upgrade/SKILL.md
  drbd-recovery/skills/drbd-recovery/SKILL.md

marketplace.json listed five plugins, each with one skill, each at its own install path.

After:

plugins/
  cozystack/                                  # one plugin, 10 skills
    .claude-plugin/plugin.json
    skills/
      wizard/                                 # NEW
      talos-bootstrap/                        # NEW
      talos-reset/                            # NEW
      ubuntu-bootstrap/                       # NEW
      cluster-install/                        # NEW
      debug/                                  # NEW
      cluster-upgrade/                        # RENAMED from cozystack-upgrade + restructured
      package-deploy/                         # RENAMED from cozy-deploy + restructured
      package-bump/                           # RENAMED from cozy-bump + restructured
      external-app-create/                    # RENAMED from cozy-external-app + restructured
  linstor/                                    # one plugin, 1 skill
    .claude-plugin/plugin.json
    skills/
      recover/                                # RENAMED from drbd-recovery + restructured

Skill rename + install-path migration

Old install + skill ID → new install + skill ID:

  • /plugin install cozy-deploy@cozystack-claude-plugins + /cozy-deploy:cozy-deploy
    → /plugin install cozystack@cozystack-claude-plugins + /cozystack:package-deploy
  • /plugin install cozy-bump@cozystack-claude-plugins + /cozy-bump:cozy-bump
    → /cozystack:package-bump (same plugin)
  • /plugin install cozy-external-app@cozystack-claude-plugins + /cozy-external-app:cozy-external-app
    → /cozystack:external-app-create (same plugin)
  • /plugin install cozystack-upgrade@cozystack-claude-plugins + /cozystack-upgrade:cozystack-upgrade
    → /cozystack:cluster-upgrade (same plugin)
  • /plugin install drbd-recovery@cozystack-claude-plugins + /drbd-recovery:drbd-recovery
    → /plugin install linstor@cozystack-claude-plugins + /linstor:recover

Existing operators should uninstall old plugins and install the new bundles once this lands.

Why consolidate

  • Single /cozystack:* namespace is the natural shape: every Cozystack-related skill ships from one plugin install, no per-skill /plugin install dance.
  • The new wizard chain (below) needs sibling skills to cross-reference each other via cozystack:<name>; cross-plugin references add fragility.
  • Renames carry intent: cozy-bump → package-bump says what the skill does; cozystack-upgrade → cluster-upgrade is consistent with cluster-install; drbd-recovery → linstor:recover reflects what the tool actually operates on (LINSTOR-side, with DRBD as one of several layers).

New skills (6)

  • wizard — entry-point orchestrator. Free-form intent intake → parses hints → asks Talos / Ubuntu / Existing → builds a route → dispatches downstream skills via a cluster config directory the operator picks. Phase 4.5 runtime research surfaces known landmines for the specific combination from upstream docs + issue trackers.
  • talos-bootstrap — Talos node prep via talm. Maintenance-mode probe (Talos-1.12-aware), NAT-provider cert-SAN guardrail before first talm apply, multidoc machine-config with per-node VIP-link IPv4 stubs, talm apply + talosctl bootstrap + kubeconfig fetch. Phase 11.5 auto-upgrade to cozystack-tuned image when nodes booted from base Talos.
  • talos-reset — cloud-provider terminate+relaunch helper for OCI / AWS / GCP / Hetzner when nodes are unrecoverable from inside (cert-SAN trap before guardrail, broken machine-config, lost talosconfig). Preserves block volumes, secondary VNICs, NSG memberships. Sequential per-node to maintain etcd quorum.
  • ubuntu-bootstrap — wraps cozystack/ansible-cozystack for Ubuntu / Debian k3s bootstrap (OS prep, kernel modules, k3s install with cozystack-compatible flags).
  • cluster-install — Cozystack on a ready cluster. Node-readiness validation, ZFS pool provisioning via privileged DaemonSet on Talos (hostNetwork: true because CNI is not up yet), extractedprism for kube-apiserver HA, OCI-tag-normalized cozy-installer chart, Platform Package apply, inline tenants/root ingress patch and LINSTOR pool registration during the watch loop, Phase 8.6 default StorageClasses for v1.3.x, Phase 9.1 end-to-end reachability probe.
  • debug — investigate a stuck or broken install. Classifies operator error / config drift / upstream bug / not-yet-supported, applies fixes or workarounds, drafts upstream issues on approval (never opens issues silently).

Renamed-skill changes (non-trivial)

Beyond the directory renames, the 5 carried-over skills received substantive edits:

  • cluster-upgrade (was cozystack-upgrade): 5 references files restructured with --context / --kube-context discipline; multiple known-failure entries added.
  • package-deploy (was cozy-deploy): rewritten to surface ghcr.io/<your-username> placeholder instead of personal registry; --kube-context discipline.
  • package-bump (was cozy-bump): path references updated; --context discipline; consistent printf / tr shell idioms.
  • external-app-create (was cozy-external-app): --context discipline.
  • linstor:recover (was drbd-recovery): every cluster-mutating kubectl call carries --context $CTX.
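
For illustration, the --context / --kube-context discipline the carried-over skills now follow is to pin the context once and thread it through every cluster-mutating call; the context name below is a placeholder:

  CTX=<your-kube-context>   # pinned once at the top of the skill run
  kubectl --context "$CTX" apply --filename cozystack-platform-package.yaml
  helm --kube-context "$CTX" uninstall <release> --namespace <namespace>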

CI + contributor infrastructure

  • tools/check-refs.sh — 5-check validator: references-file existence, <plugin>:<skill> mention resolution, descriptions list every shipped skill, kubectl / helm cluster-mutating invocations carry --context / --kube-context (allow-list for read-only / local operations), no private cluster identifiers in plugin / public content.
  • .github/workflows/validate.yml — jq + check-refs.sh on push / PR.
  • CLAUDE.md — contributor guidance: repository layout, adding a new skill, adding a new plugin, cross-reference discipline, semver bump policy.
  • README.md — rewritten with the new skills table + chain diagram + third-party-dependency section.
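
As a rough sketch (not the shipped script), the cross-reference part of the validator boils down to resolving every /<plugin>:<skill> mention against the on-disk tree:

  # hedged sketch of the mention-resolution check; the real tools/check-refs.sh also covers
  # references-file existence, description completeness, --context discipline, and private identifiers
  grep -rhoE '/[a-z-]+:[a-z-]+' plugins/ README.md | sort -u | while read -r ref; do
    plugin="${ref#/}"; plugin="${plugin%%:*}"; skill="${ref##*:}"
    [ -d "plugins/$plugin/skills/$skill" ] || echo "unresolved skill reference: $ref"
  done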

Real-install hardening

The wizard chain went through multiple end-to-end install runs on a 3-node OCI Talos cluster. Each session surfaced bugs that no code review could catch; each finding became a guardrail in the appropriate skill:

  • cert-SAN trap on NAT'd providers (it caught the same operator twice across consecutive sessions; detection logic sketched after this list). Workstation reaches nodes via public IPs that the cloud fabric rewrites to internal IPs before they hit the interface. Talos issues machine-cert SANs from observed addresses (internal only); the workstation TLS handshake fails with no escape. talos-bootstrap Phase 6.3 detects the NAT signature (reach_mode=public + external_ips_strategy=internal) and auto-adds public IPs to values.yaml.certSANs before the first apply.
  • OCI 1:1 NAT externalIPs mismatch. Cilium's externalIPs BPF matches on packet destination as seen by the host kernel; on NAT'd providers public IPs are never present on any interface. cluster-install Phase 4 refuses external strategy on intent_hints.platform ∈ {oci, gcp-with-nat, aws-with-eip} unless the operator opts in with --allow-external-on-nat-provider.
  • kubectl debug node --image=alpine:3 -- chroot /host zpool create does not work on Talos. Three independent reasons (musl vs glibc loader, no /bin/sh on host rootfs, PSA baseline blocks the debug pod). cluster-install Phase 5.5 uses a privileged Pod from ubuntu:24.04 in a cozy-storage-bootstrap namespace with pod-security.kubernetes.io/enforce=privileged.
  • LINSTOR registration race. HelmRelease Ready does not equal Deployment readyReplicas. STOP GATE 3 requires both that no HelmRelease is non-Ready AND that the storage-pool count matches the storage-node count.
  • Tenants CRD landing after most HRs Ready. Tenant ingress patch is inline in the Phase 8 watch loop, event-driven on CR existence.
  • cozystack v1.3.x ships no StorageClasses. Phase 8.6 applies local (1-replica) + replicated (3-replica, default) for v1.3.x; skipped on v1.4+.
  • OCI registry tag has no v prefix. Normalize with ${INSTALLER_VERSION#v} before passing.
  • End-to-end reachability gate. Phase 9.1 curls https://dashboard.<host>/ from the workstation; on failure writes failed_at: external-reachability with curl + DNS + node-addresses detail.
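
A minimal sketch of the cert-SAN guardrail's detection step, assuming the reach_mode / external_ips_strategy values come from the wizard's state file, yq is available, and the key the public IPs land in follows the skill's talm project layout:

  if [ "$reach_mode" = "public" ] && [ "$external_ips_strategy" = "internal" ]; then
    # NAT signature: the workstation dials public IPs, but machine certs would carry internal SANs only
    for ip in "${PUBLIC_IPS[@]}"; do
      yq -i ".certSANs += [\"$ip\"]" "$CONFIG_DIR/values.yaml"   # before the first talm apply
    done
  fi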

Third-party dependency

cozystack:cluster-install default-installs extractedprism on generic variant clusters (k3s / kubeadm / RKE2) — per-node TCP load balancer that gives generic Linux Kubernetes the same localhost:7445 shape Talos has built-in via KubePrism. BSD-3-Clause; chart at oci://ghcr.io/lexfrei/charts/extractedprism. Maintained independently; reviewed and approved by the Cozystack platform team. Opt-out via --no-extractedprism --api-host=<ip>. Talos and hosted variants do not need extractedprism.
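
For orientation, an install along the lines cluster-install performs on generic variants might look like this (endpoint IPs are placeholders; endpoints is a comma-separated string scalar per the chart's values schema):

  cat > extractedprism-values.yaml <<'EOF'
  endpoints: "10.0.0.11:6443,10.0.0.12:6443,10.0.0.13:6443"
  EOF
  helm --kube-context <your-kube-context> upgrade --install extractedprism \
    oci://ghcr.io/lexfrei/charts/extractedprism \
    --namespace kube-system --values extractedprism-values.yaml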

Validation

  • bash tools/check-refs.sh — all 5 checks green.
  • Multiple branch-review rounds with both human review and second-opinion model passes; final verdict LGTM.
  • Real-install validation on a 3-node OCI Talos cluster from scratch, end-to-end through HTTP 200 on the dashboard ingress.

Reviewer notes

  • This is a breaking refactor of the marketplace structure. Old plugin install paths and skill IDs no longer work; operators must uninstall + reinstall.
  • The cert-SAN guardrail in talos-bootstrap/SKILL.md Phase 6.3 is load-bearing for OCI / GCP-NAT / AWS-EIP operators and has caught the same trap twice in test runs without it.
  • tools/check-refs.sh runs in CI; adding new skills requires updating the descriptions in plugin.json + marketplace.json so the validator stays green.

lexfrei added 30 commits May 15, 2026 17:15
Guided install of Cozystack on an existing Kubernetes cluster
(kubeadm / k3s / RKE2 / managed). Discovers cluster facts,
validates node readiness via kubectl debug, recommends an
installer + platform variant, gathers values interactively,
installs the cozy-installer chart, applies the Platform Package,
waits until every HelmRelease is Ready, and prints a NOTES-style
access summary. On fatal failure, assembles a diagnostic bundle
and drafts an issue body for the appropriate cozystack/* repo
without auto-posting it.

The skill enforces three non-negotiable gates: cluster readiness
(version, CNI, cluster domain, conflicting workloads, CP-label
value vs KubeOVN expectation), values collected, and all expected
HelmReleases Ready. The KubeOVN gate handles the kubeadm vs
k3s/RKE2 difference in the value of
node-role.kubernetes.io/control-plane (empty vs literal 'true'),
which the platform variant pins as a strict match.

References cover the requirements matrix, exact node checks via
kubectl debug, variant picker, canonical Helm and Package values,
HelmRelease polling, known failures with recovery commands, and
upstream issue handoff templates.
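
As a minimal sketch, the "all HelmReleases Ready" gate can be expressed against the Flux HelmRelease Ready condition (the skill's actual loop polls and classifies stuck releases rather than blocking blindly; context and timeout are illustrative):

  kubectl --context <your-kube-context> wait helmrelease --all --all-namespaces \
    --for=condition=Ready --timeout=30m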

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Breaking change for marketplace consumers.

Before: five separate plugins (cozy-bump, cozy-deploy, cozy-external-app,
cozystack-install, cozystack-upgrade), each living under skills/<name>/.

After: a single bundle plugin 'cozystack' under plugins/cozystack/ that
ships five skills addressable as cozystack:bump, cozystack:deploy,
cozystack:external-app, cozystack:install, cozystack:upgrade. The
drbd-recovery skill stays a standalone plugin and moves from
skills/drbd-recovery/ to plugins/drbd-recovery/ for layout consistency.

Inside each SKILL.md:

- frontmatter name: now the short form (install, upgrade, deploy, bump,
  external-app, drbd-recovery) — the plugin name supplies the namespace.
- H1 and every user-facing self-reference and sibling-reference updated
  to the cozystack:<short> form.
- on-disk paths and mktemp suffixes keep the cozystack-<short> form (no
  colon) because some tools choke on ':' in paths.

marketplace.json carries two entries (cozystack, drbd-recovery) instead
of the previous six. README.md is rewritten to document the new
addressing scheme and repo layout.

No backward-compatibility shim — the marketplace is recent and the
breakage cost is low. Existing installations of the old plugin names
need to be reinstalled as 'cozystack' after pulling.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The short forms (install / upgrade / deploy / bump / external-app) read
as synonyms once they live under a single namespace — bump and upgrade
in particular sound like the same operation. Add an object axis to each
name:

  install         → cluster-install         (operates on a k8s cluster)
  upgrade         → cluster-upgrade         (operates on a k8s cluster)
  deploy          → package-deploy          (operates on one monorepo package)
  bump            → package-bump            (operates on one monorepo package)
  external-app    → external-app-create     (creates a new external-apps package)

Invocation form:

  /cozystack:cluster-install
  /cozystack:cluster-upgrade
  /cozystack:package-deploy
  /cozystack:package-bump
  /cozystack:external-app-create

Updates frontmatter name, H1, every user-facing self/sibling reference,
and the aggregate plugin.json + marketplace.json + README.md to match.
On-disk paths and mktemp suffixes keep the cozystack-<short> form (no
colon) — colons in paths choke some tools.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Align with the cozystack:* naming axis introduced in the previous
commits. Old form (drbd-recovery as a one-skill plugin) read as an
inconsistent neighbour next to cozystack:cluster-* / package-*.

Mapping:

  plugin drbd-recovery / skill drbd-recovery → plugin linstor / skill recover
  invocation /drbd-recovery → /linstor:recover

The skill still covers both LINSTOR (orchestrator) and DRBD (kernel
module) layers — picking 'linstor' as the plugin namespace mirrors
the user-facing CLI most operators reach for first ('linstor r l ...').
Leaves room for sibling skills later (linstor:setup, linstor:bench)
without a second rename.

frontmatter: name: recover (short form, plugin name supplies the namespace).
H1 and marketplace / README copy updated to /linstor:recover.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ioning + Talos handoff

Two gaps surfaced while reviewing the skill against k3s and Talos
scenarios:

1. LINSTOR storage backend was asked about as Phase 4 question #10,
   after networking and publishing. Without a pool ready before the
   Platform Package goes in, the cozy-linstor HelmRelease never reaches
   Ready and Phase 8 stalls on a stuck state with no clear cause.
   Nothing in cozy-installer / piraeus-operator / ansible-cozystack
   creates the physical pool — that step has been entirely on the
   operator and undocumented in the install flow.

2. Talos node prep was implicitly conflated with cluster install. A
   vanilla Talos node has no drbd / zfs / openvswitch modules and no
   cozystack LVM filter — the install is guaranteed to fail at storage
   provisioning. Trying to bootstrap Talos from inside cluster-install
   bloats scope and tangles two very different workflows.

Changes:

- Phase 2 lookup adds storage discovery (per-node lsblk, pvs, vgs,
  lvs, zpool list, LVM-tools availability, Talos lsmod and lvm.conf
  filter check). Existing pools are surfaced as Phase 4 defaults.

- Phase 3 adds an early-exit refusal for Talos clusters missing
  cozystack-tuned extensions. The skill points at the future
  cozystack:talos-bootstrap skill and refuses without partial mutation.

- Phase 4 promotes storage to the second question (right after
  bundles), with full collection: backend (LVM thin / LVM thick / ZFS
  / skip-external), per-node disk picking, VG / thin-pool / LINSTOR
  pool names with sane defaults (vg-data / thinpool0 / data).

- New Phase 5.5 — per-node storage provisioning gate. For every node,
  the skill shows the exact pvcreate / vgcreate / lvcreate (or
  zpool create) commands, asks 'Provision / Skip / Cancel', and on
  approve runs them inside kubectl debug --profile=sysadmin with
  chroot /host, then verifies with pvs / vgs / lvs (or zpool status).
  Failure handling per node (Retry / Skip / Cancel), no auto-rollback.

- Phase 5 plan gate, Phase 10 NOTES, and the guardrails section reflect
  the new storage scope and the Talos handoff.

- New reference references/storage-backends.md covers backend
  trade-offs, exact create / verify / backout commands, default names,
  and the LinstorSatelliteConfiguration shapes the skill applies.
  node-checks.md adds storage discovery and Talos module / lvm.conf
  checks. known-failures.md adds the 'LINSTOR HR with no storage pool'
  failure mode. values-template.md adds the
  LinstorSatelliteConfiguration block with lvmThinPool / lvmPool /
  filePool / ZFS-via-CLI examples.

Follow-up out of scope for this commit: a new skill
cozystack:talos-bootstrap that does node-prep (talm / boot-to-talos /
image factory schematic). The current SKILL.md and node-checks.md
reference it by name; that handoff stays a TODO until the skill exists.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…alos-bootstrap

Split the install path into distinct skills and add a meta-skill
that picks and runs the right chain. Until now cozystack:cluster-install
carried the whole burden — it validated the cluster, refused on Talos
without tuned extensions, and pointed at ansible-cozystack for node OS
prep, but never closed the loop. This commit fills in the missing
steps and replaces the implicit handoff with explicit skills.

New skills:

- cozystack:wizard — entry point. Interviews the operator about the
  target environment (bare-metal Talos / bare-metal generic Linux /
  existing Kubernetes / managed Kubernetes), picks a chain, persists
  collected state to /tmp/cozystack-install-<ts>/state.yaml, then
  instructs Claude to invoke each downstream skill in order. After
  each downstream skill it re-reads state.yaml to verify
  status.<skill>.completed_at before continuing. Refuses for existing
  Cozystack clusters with a pointer to cozystack:cluster-upgrade.

- cozystack:k3s-bootstrap — full HA k3s install via SSH. Inventory
  interview (ssh user / key / nodes / roles / optional VIP), SSH
  preflight, per-node OS prep through ssh+sudo, first-CP install with
  --cluster-init for embedded etcd, additional CPs joining via
  --server for HA, agent workers, kubeconfig fetch with rewritten
  server URL (loopback -> CP1 IP or VIP) and optional kubecm merge.
  k3s flags match cozystack/docs/v1.3/install/kubernetes/generic.md
  (flannel off, kube-proxy off, traefik/servicelb/local-storage off,
  cluster-domain cozy.local).

- cozystack:linux-prep — OS prep on an already-running Kubernetes
  cluster via kubectl debug + chroot /host. Same set of fixes
  k3s-bootstrap applies pre-Kubernetes, but post-Kubernetes via
  ephemeral debug containers — no SSH credentials required.
  Detection matrix, four ordered batches (packages / modules /
  sysctl / services+blacklist+swap), per-batch approval, verification
  after each fix. Refuses on Talos nodes (those go through
  talos-bootstrap).

- cozystack:talos-bootstrap — v1 is a guided checklist for booting
  the cozystack-tuned Talos image (ghcr.io/cozystack/cozystack/talos:
  vX.Y.Z) on every node via boot-to-talos and applying machine-config
  with talm, then verifies kernel modules (drbd / zfs / openvswitch)
  and the LVM filter through kubectl debug. Full automation lands in
  v2 when SSH + talm orchestration is in scope.

Routes the wizard builds:

  bare-metal Talos              -> talos-bootstrap -> cluster-install
  bare-metal generic Linux      -> k3s-bootstrap  -> cluster-install
  existing k8s, prep needed     -> linux-prep     -> cluster-install
  existing k8s, ready / managed -> cluster-install

State schema documented in
plugins/cozystack/skills/wizard/references/state-schema.md.
Each skill reads state on start, writes status.<skill>.completed_at or
failed_at + error on exit. Wizard owns route construction and
inter-skill verification; skills do not invoke each other directly.
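
A rough sketch of what that contract reduces to, assuming the field names above (status.<skill>.completed_at / failed_at + error) and yq on the workstation; the shape and values are illustrative, state-schema.md is authoritative:

  # state.yaml shape after a partial run (illustrative values):
  #   status:
  #     talos-bootstrap:
  #       completed_at: "2026-05-15T18:02:11Z"
  #     cluster-install:
  #       failed_at: "2026-05-15T19:40:03Z"
  #       error: "HelmRelease wait loop timed out"
  # the wizard's verification step is then a read like:
  yq '.status["talos-bootstrap"].completed_at' /tmp/cozystack-install-<ts>/state.yaml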

cluster-install SKILL.md gets two cross-references: the entry-point
note at the top mentions the wizard, and Phase 3 refusal points at
/cozystack:linux-prep for the kubectl-debug-based fix path.

plugin.json bumps to 1.1.0 and lists all nine skills.
README.md table reflects the new skill set plus a chain summary.

Verified locally: claude plugin validate passes, reinstall shows
Skills (9); token cost with all skills on is ~2.3k. No real-cluster prove-out yet
(deferred to operator testing); fixes by feedback go in follow-up
commits.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Surfaced by a real install run on a generic Linux cluster.

Two related fixes.

1. Move the tenants/root spec.ingress=true patch from Phase 9 (post-install verification) to a new Phase 7.5 between Platform Package apply and the HR wait loop. The cozy-dashboard chart ships gatekeeper (oauth2-proxy), which on startup does OIDC discovery against the public keycloak FQDN — not an in-cluster service. Without root ingress running, nothing listens on 443, gatekeeper CrashLoopBackOffs, the dashboard HR sits in Unknown until 10m timeout then InstallFailed and retries forever. cozy-fluxcd/flux-plunger has a hard dependency on cozy-dashboard/dashboard and stays False. The whole Phase 8 wait loop becomes unreachable. The Platform Package itself does not patch tenants/root.spec.ingress — that step is documented as manual in cozystack/docs/cozystack-installation.md:160. Phase 7.5 runs it pre-emptively so Phase 8 actually reaches Ready.

2. Add an explicit domain-ownership gate in Phase 4 question 7 (publishing.host). Cozystack uses cert-manager with the HTTP-01 solver by default, which requires the operator to own the domain, have wildcard DNS pointing at the chosen external IPs, and have port 80 reachable from the public internet. nip.io patterns are exempt because nip.io is publicly hosted DNS. Without the gate, an operator picks a domain they don't control, every cert request fails silently, dashboard is unreachable, and the failure mode looks identical to the chicken-and-egg above. The gate also runs a soft DNS pre-flight (dig +short) and surfaces the result.
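
A minimal sketch of that soft DNS pre-flight, assuming the host value is what the operator supplied for publishing.host (the hostname below is a placeholder):

  host="cozy.example.com"
  case "$host" in
    *.nip.io) echo "nip.io host: publicly hosted DNS, ownership gate auto-passes" ;;
    *) echo "dig +short $host -> $(dig +short "$host" | paste -sd, -)"
       echo "HTTP-01 needs wildcard DNS at the chosen external IPs and port 80 reachable from the internet" ;;
  esac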

Plan-gate output (Phase 5) and NOTES (Phase 10) now show domain ownership status, DNS pre-flight outcome, cert solver, and certificate-issuance progress.

Guardrails get two new entries: never accept a custom publishing.host without explicit ownership confirmation; never start Phase 8 before Phase 7.5 on system-bundle installs.

References:
- known-failures.md gets a 'Dashboard / Keycloak / flux-plunger stuck in OIDC chicken-and-egg' section with symptoms, cause, and the recovery procedure for installs that already stalled.

Phase 4 questions renumbered: external IPs moved from #8 to #6 so the publishing.host gate (now #7) can reference them; cert solver added as #8.

Companion docs PR in cozystack/website adds the LVM Thin section that the storage flow assumes (PR #540).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…kube-apiserver HA

Generic Kubernetes (k3s / kubeadm / RKE2) has no localhost:7445 KubePrism the way Talos does, so until this commit the skill defaulted cozystack.apiServerHost to CP1's internal IP — a single point of failure: when CP1 reboots or crashes, Cilium on every other node loses kube-apiserver and the cluster degrades. The cozystack-tuned Talos variant avoided this by pointing Cilium at the built-in localhost:7445; non-Talos had no symmetric story.

This commit makes extractedprism (a per-node TCP load balancer DaemonSet that listens on 127.0.0.1:7445 and proxies to healthy CP endpoints with TCP health checks; oci://ghcr.io/lexfrei/charts/extractedprism, BSD-3-Clause) the default for the generic variant. Talos is unchanged (KubePrism is already there); hosted variant is unchanged (managed provider handles HA).

Changes:

- New Phase 5.6 between storage provisioning and cozy-installer install. Generates the endpoint list from Phase 2's node lookup, helm-installs extractedprism in kube-system with hostNetwork + system-node-critical priority + catch-all toleration, verifies DaemonSet rollout and per-node pod presence.

- Phase 4 question 5 (apiServerHost) restructured. The skill picks per variant — talos: localhost:7445 (hardcoded by cozystack platform); generic + default: 127.0.0.1:7445 via extractedprism; generic + --no-extractedprism: operator supplies --api-host=<ip>; hosted: not set. The plan presentation in Phase 5 surfaces the chosen path and how to flip it.

- New flags: --no-extractedprism (opt-out) and --api-host=<ip> (required when opting out on multi-CP). Single-CP sandboxes still work without extractedprism — the proxy just has one upstream — but the default behaviour stays HA-friendly so multi-CP installs do not silently keep CP1 as a SPOF.

- Phase 5 plan-gate adds an apiServerHost source line and a new step in the action list ('install extractedprism DaemonSet' between storage and cozy-installer).

- Phase 10 NOTES adds an 'api server HA' block with the chosen host, source, and a verify command pointing at the DaemonSet.

- known-failures.md adds a 'Cilium loses API when CP1 dies' section with the temporary fix (re-point operator at a live CP) and the permanent fix (install extractedprism after the fact, re-render cozy-installer with apiServerHost=127.0.0.1:7445).

- values-template.md adds an extractedprism reference section and a per-variant apiServerHost resolution table.

Trade-offs documented inline so the operator can pick external-LB / VIP / kube-vip explicitly with --no-extractedprism instead.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…r + ZFS-only storage + cluster config dir

Three coupled changes driven by user feedback after the first real install attempt.

1. ZFS-only storage. The LVM and LVM-Thin paths were never part of cozystack's documented or validated install — the upstream docs at cozystack.io/docs/next/storage/disk-preparation/ describe ZFS only. Adding LVM in the skill (and in a pending cozystack/website PR, since closed) gave operators an unsupported route that cozystack itself does not stand behind. Dropping LVM removes a footgun and aligns the skill with the upstream contract. cluster-install Phase 4 storage question is reduced from backend-picker to per-node device + layout (single / mirror / raidz) on ZFS. references/storage-backends.md is now a ZFS-only document; values-template.md drops the LinstorSatelliteConfiguration lvmThinPool / lvmPool examples since the CRD has no zfsPool slot anyway — ZFS pools are registered at runtime via linstor storage-pool create zfs in a Phase 8 hook. known-failures.md replaces the LVM-thin failure section with a clarification that RHEL 10 family is unsupported (no OpenZFS RPMs yet) and that operators wanting LVM are on their own with the piraeus-operator CRD.
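
A minimal sketch of the ZFS-only path described above: create the pool on the node (via the privileged bootstrap pod on Talos, or directly on a generic node), then register it with LINSTOR once the satellites are up in the Phase 8 hook; node, device, layout, and pool names are placeholders:

  zpool create -f data mirror /dev/sdb /dev/sdc            # layout: single | mirror | raidz
  linstor storage-pool create zfs <node-name> data data    # LINSTOR pool name, then zpool name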

2. Three-route wizard. Earlier draft had five routes (Talos / bare-metal-generic / existing-k8s-dirty / existing-k8s-ready / managed) which added complexity without informing decisions the operator could not just make later in the chain. The new shape is three routes plus one refusal — Bare-metal Talos / Bare-metal Ubuntu / Existing Kubernetes (self-managed or managed are folded together; cluster-install picks the variant from cluster lookup). Existing Cozystack is a refusal that points at cluster-upgrade. linux-prep is deleted entirely — its work belongs inside ubuntu-bootstrap (the ansible playbook covers the same OS prep concerns more thoroughly).

3. ubuntu-bootstrap wraps cozystack/ansible-cozystack. The previous k3s-bootstrap was a hand-rolled SSH shell flow that re-implemented half of examples/ubuntu/prepare-ubuntu.yml and would drift the moment ansible-cozystack added new prep tasks. The new ubuntu-bootstrap renders inventory.yml into the cluster config directory and invokes prepare-sudo.yml (Ubuntu 26.04+ sudo-rs workaround), prepare-ubuntu.yml (packages including thin-provisioning-tools and lvm2, kernel modules including geneve / vhost_net / kvm_*, sysctl 11 keys including net.ipv6.conf.all.forwarding, services, multipath blacklist, LINBIT PPA + drbd-dkms for Secure Boot, ZFS + KubeVirt modules), and the k3s.orchestration collection — three discrete ansible-playbook invocations gated per step. cozystack_create_platform_package: false is wired in so the ansible role does not install Cozystack itself; that step stays in cluster-install. Renames k3s-bootstrap → ubuntu-bootstrap; renames k3s-install.md → ansible-playbook.md; drops node-prep.md (ansible covers it).

Cluster config directory. All skills in the chain now read and write artifacts under a single operator-picked directory (default $PWD or $PWD/<cluster-name>) instead of /tmp/cozystack-install-<ts>/. The wizard sets it up in Phase 1, writes the .gitignore with cozystack markers (secrets + state excluded), and points downstream skills at it via state.config_dir. Artifacts safe to git commit: inventory.yml (ubuntu), nodes/*.yaml (talos), cozystack-platform-package.yaml, extractedprism-values.yaml, .gitignore. Artifacts gitignored: .state.yaml, kubeconfig.yaml, talosconfig, secrets.yaml. The skill never runs git init / add / commit — git operations are operator-side. The point is the operator can rsync or git push the directory and reproduce the cluster shape.
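
A rough sketch of the cozystack .gitignore block the wizard writes in Phase 1 (marker comment is illustrative; the file split follows the lists above):

  cat > "$CONFIG_DIR/.gitignore" <<'EOF'
  # cozystack: secrets + state stay out of git
  .state.yaml
  kubeconfig.yaml
  talosconfig
  secrets.yaml
  EOF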

plugin.json bumps to 1.2.0 and lists eight skills (linux-prep removed, k3s-bootstrap renamed to ubuntu-bootstrap). README.md skills table reflects the new layout. cluster-install argument-hint adds --config-dir; references update /tmp paths to <config-dir>/ uniformly.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
GitOps-friendly add-on to the cluster config directory pattern. With sops opt-in, every secret file the chain writes is encrypted in place with the operator's age public key, so the encrypted forms can be committed to the same repository the operator uses to track inventory.yml, cozystack-platform-package.yaml, and the talos nodes/. Without sops, the previous behaviour stays — secret files land in plain text and .gitignore keeps them out of commits.

wizard adds Phase 1.5 between config-dir selection and target picking. It checks for sops + age binaries, resolves the age public key by walking <config-dir>/.sops.yaml, $SOPS_AGE_KEY_FILE, and ~/.config/sops/age/keys.txt in order — asking the operator to confirm which key to use — and offers age-keygen with an explicit backup warning when no key exists. On opt-in the wizard writes <config-dir>/.sops.yaml with creation_rules covering kubeconfig.yaml, .state.yaml, inventory.yml, cozystack-platform-package.yaml, extractedprism-values.yaml, plus the talos nodes/*.yaml / talosconfig / secrets.yaml so talm's own .sops.yaml lookup finds matching rules. The cozystack .gitignore section is rewritten so secret-file lines drop out (encrypted-in-tree now); only *.tar.gz diagnostic bundles stay ignored.

ubuntu-bootstrap and cluster-install add a sops sanity check phase (refuse if state.sops.enabled is true but sops is missing — no silent plain-write fallback) and a maybe_encrypt / decrypt-edit-encrypt pattern wired into every secret-file write. inventory.yml is decrypted into a temp file for each ansible-playbook --inventory call; cozystack-platform-package.yaml is decrypted into a temp file for kubectl apply --filename. The encrypted on-disk forms stay canonical; temp files are removed immediately after the consuming tool exits.

Talos artefacts (talosconfig, secrets.yaml, nodes/*.yaml) are NOT double-encrypted by the skills. talm respects the shared .sops.yaml and emits encrypted output natively; the wizard's .sops.yaml block includes the right path_regex patterns so talm's lookup matches.

The skill never copies the age private key into the config directory and never invokes git. Operators back up their private key themselves (password manager, hardware token, offline copy); they commit and push the config directory on their own schedule.

wizard frontmatter accepts new flags --sops / --no-sops. state.sops {enabled, recipients, config_path} written by Phase 1.5. references/sops.md documents file-by-file decision matrix, age key resolution order, decrypt/edit workflow for operators, off↔on toggling, and what sops does NOT cover (private key safety, multi-recipient rotation, hardware-token integration — opt-out points for a future cozystack:sops helper skill).
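
As a sketch of the .sops.yaml the wizard writes on opt-in (the recipient variable and the exact path_regex are illustrative; the real rule list covers the files enumerated above, including the talos nodes/ artefacts so talm's lookup matches):

  cat > "$CONFIG_DIR/.sops.yaml" <<EOF
  creation_rules:
    - path_regex: (^|/)(kubeconfig\.yaml|\.state\.yaml|inventory\.yml|cozystack-platform-package\.yaml|extractedprism-values\.yaml|secrets\.yaml|talosconfig|nodes/.*\.yaml)$
      age: "$AGE_RECIPIENT"
  EOF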

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Missed in ebd33b3. Bump to reflect the new sops feature so 'claude plugin details' shows operators which skill bundle they're on.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ream issue

Closes the loop on the chain: when something breaks, /cozystack:debug investigates symptoms via kubectl, checks the operator did the documented step (operator-not-dumb gate), classifies the failure into one of four buckets (operator error / config drift / upstream bug / not-yet-supported), applies a local fix or workaround when one exists, and on explicit approval drafts an upstream issue with the diagnostic bundle. Wizard auto-dispatches the skill on any failed_at in .state.yaml so chain failures get triaged before pausing — symmetric to linstor:recover but cluster-wide.

Hard constraints from the design discussion:
- No PRs. v1 only drafts issue bodies. Patches that the operator wants to send upstream are a manual follow-up.
- No silent filings. Every issue draft is shown for approval; if the operator has no GitHub account, the diagnostic bundle is still useful to hand off elsewhere.
- Doc-check is mandatory before classifying as upstream-bug. Saves duplicate / wrong-repo filings.
- Source location required. 'Something is wrong somewhere in cozystack' isn't enough — Phase 3 must name a file:line in a cozystack/* or upstream-upstream repo.

Wizard integration:
- Phase 5 dispatch loop auto-dispatches /cozystack:debug whenever a downstream skill writes status.<skill>.failed_at.
- After debug, the loop reads status.debug.action and decides — resolved/workaround → retry the failing skill; issue-drafted without workaround → pause and offer Skip/Cancel; no-action → offer Retry/Skip/Cancel.

References ship the working knowledge:
- classification.md — decision tree per bucket with signals.
- diagnostic-bundle.md — symptom-record schema + bundle script shared with cluster-install/issue-templates.md (single source so the script doesn't drift).
- source-search.md — grep recipes by symptom shape (chart fail, operator runtime, package CR rejection, piraeus/Kube-OVN forward routes).
- upstream-routing.md — per-repo routing table (cozystack/cozystack, cozystack/website, cozystack/ansible-cozystack, cozystack/talm, cozystack/boot-to-talos, plus upstream-upstream for piraeus, LINSTOR, Kube-OVN, Cilium, KubeVirt, cert-manager).
- issue-templates.md — per-repo issue body templates with the common Cozystack-version preamble + redaction rules.

plugin.json bumps to 1.4.0 with debug listed; README.md grows to 9 skills.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…oss all skills

Two coupled tweaks shaped by the way operators actually use the wizard in conversation, not in CI.

1. Wizard Phase 0 — free-form context

Before any structured question (config dir, sops, target), the wizard asks the operator to describe what they're installing and what's already in place in their own words. The skill parses the answer for hints (target environment, distribution, node count, GPU vs general workload, public-domain availability, prior failures) and pre-fills the structured questions in Phases 2 and 4 so the operator doesn't repeat themselves. Vague free-form is acknowledged and the structured questions fill the rest — no over-extraction. The recap is echoed back for confirmation before moving on.

state.intent_summary holds the free-form recap, state.intent_hints holds the parsed key/values. state-schema.md documents both, plus the new state.operator_language field.

Phase 2 (target picker) skips the four-option AskUserQuestion entirely when intent_hints.target is already set — operator only confirms.

2. Match the operator's natural language across all 10 skills

Every skill (wizard, talos-bootstrap, ubuntu-bootstrap, cluster-install, debug, cluster-upgrade, package-deploy, package-bump, external-app-create, linstor:recover) now declares 'match the operator's natural language' as a Core Principle. Detection is from conversation context (the language the operator wrote in for any prior turn) or from state.operator_language when a wizard chain is in progress — never via a separate 'which language?' question. The wizard's free-form Phase 0 answer is also the language anchor for the rest of the chain.

Canonical-form exceptions are spelled out per skill so things that have to stay English do:

- Code identifiers, commands, file paths in all skills.
- Ansible task names from upstream playbooks (ubuntu-bootstrap).
- Commit messages, PR body drafts, GitHub-public text (package-bump, debug issue templates, external-app-create generated YAML).
- linstor / drbdadm command outputs (linstor:recover).

plugin.json bumps to 1.5.0 and mentions both the free-form opener and the language-matching principle.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Real-run feedback surfaced two coupled defects.

1. talos-bootstrap had a stale v1-stub guardrail ('NEVER run boot-to-talos / talm apply / any node mutation in v1; v1 is a checklist + verification only') that survived the maintenance-mode-first rewrite. The skill knew how to drive talm but the guardrail told it not to — so it kept asking the operator to run the commands manually and ping back 'done'. Removing the guardrail and reorganising the workflow around 'default = nodes already in maintenance, skill runs talm' lets the skill execute its own commands. The boot-method picker stays for the explicit opt-in path when nodes aren't imaged yet — see Phase 3 needs-help question and Phase 4 probe.

2. Across the bundle, several phases gated on questions like 'I'll wait for you to say done, ok?' or 'Approve OS prep on all nodes / Approve per-node'. Those aren't approval gates; they're friction. When the operator already approved the consolidated plan and there's exactly one valid path forward, the skill should execute it.

The new Core Principle in every operator-facing skill: one valid path → just do it. Approval gates stay only for (a) multi-option choices the operator actually makes (storage layout, network values, publishing host, target picker), (b) destructive operations that have data implications regardless of plan approval (zpool create over data, talosctl reset on configured node, kubectl apply on prod-context, helm upgrade on a live cluster, dangerous DRBD operations), (c) the named STOP GATEs that present the plan before a long batch.

What changed concretely:

- talos-bootstrap Guardrails: 'NEVER run boot-to-talos / talm apply in v1' removed. Replaced with 'ALWAYS run talm init / talm apply / talosctl bootstrap / talosctl kubeconfig / verification automatically after Phase 5 plan approval'.
- talos-bootstrap Phase 7 (per-node review): 'Apply all / Apply per node' collapsed to 'Apply'. Per-node-approval path is gone; once the operator approved the config in Phase 7, every node gets talm apply back-to-back in Phase 8 without intermissions.
- cluster-install Core principles: explicit 'no are-you-ready-to-continue gates between phases with one valid outcome'. Phase 5 STOP GATE 2 plan approval is THE gate; Phase 6 onwards executes.
- ubuntu-bootstrap Core principles: same pattern. Once inventory is approved in Phase 4, ansible-playbook prepare-sudo / prepare-ubuntu / k3s.orchestration.site run back-to-back.
- debug Core principles: when classification + proposed action are unambiguous (operator error with documented fix, config drift with clear restore), apply without an extra ok-to-apply question. Issue / PR drafts still require explicit yes per filing.
- cluster-upgrade Core principles: named STOP GATEs are the only gates; in-flight phases execute.
- linstor:recover Core principle 0a: safe single-path recovery operations run without friction. The dangerous-operations list at principle 8 still asks (linstor node lost, deleting the last replica, drbdadm down on InUse/Primary, --discard-my-data with one diskful copy, drbdadm create-md --force).
- wizard Core principles: same shape for the orchestrator itself.
- plugin.json description for talos-bootstrap rewritten: no longer claims 'v1 guided checklist', states what the skill actually does (maintenance-mode probe → talm init + apply → etcd bootstrap → kubeconfig fetch → verification; opt-in boot-method picker).
- README.md skill row for talos-bootstrap updated to match.

plugin.json bumps to 1.6.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Operators reported feeling dragged through 40 questions across multiple phases. The pattern was unnecessary — most of the answers can be inferred from cluster lookups, wizard Phase 0 intent_hints, or sensible defaults. The questions the operator does need to answer should land on one screen, not ten.

New Core principle across every operator-facing skill: front-load the interview. Every question the skill might ask in any later phase is asked once up front in a single intake phase. Read-only lookups run first; intent_hints from wizard Phase 0 pre-fills wherever possible; the consolidated summary lands as ONE screen with every slot filled (defaults marked) and quick-edit affordances. Operator either approves the lot or names the slot they want to edit. Later phases consume the collected answers without re-prompting, except for destructive STOP GATEs that have to ask by their nature (zpool destroy of an existing pool the operator chose to wipe, talosctl reset on a configured node, etc.).

Concretely:

- cluster-install Phase 4 reframed from 10 sequential AskUserQuestion calls to a single consolidated summary that covers bundles, per-node storage devices and layout, networking (CIDRs, gateway, apiServerHost / port, KubeOVN MASTER_NODES), publishing (external IPs, exposure mode, host with ownership gate auto-passing for nip.io, apiServerEndpoint, exposedServices, cert solver), and operations toggles (storage provisioning, extractedprism, tenant ingress patch). What was Phase 5.5 per-node approve and Phase 5.6 extractedprism opt-in / Phase 7.5 tenant patch confirmation now live in this single screen.

- talos-bootstrap Phase 5 picks up boot method only for nodes the probe found unready, and collects everything else needed for the rest of the flow in the same summary — install disk per node, optional VIP, cluster name, kubeconfig merge target, custom installer override. talm init / talm apply / bootstrap / kubeconfig / verify run end-to-end after Approve.

- ubuntu-bootstrap Phase 4 inventory render now includes the operational knobs (k3s_version, cozystack_flush_iptables, cozystack_enable_zfs, cozystack_enable_drbd_dkms, cluster name, kubeconfig merge target). After Approve, the three ansible-playbook invocations run back-to-back without per-step gates.

- debug Phase 4 collapses to a single screen with symptom, classification, root cause, proposed fix / workaround, and the issue-filing decision (yes / no / show-draft-first). Phases 1-3 (symptom gathering, doc check, classification) are read-only and run before the screen.

- cluster-upgrade keeps its risk-summary approval as the single gate.

- linstor:recover reads the full diagnostic graph first, then presents a recovery plan with ordered operations classified safe / dangerous; dangerous-operation approvals are batched into one screen.

plugin.json bumps to 1.7.0. Each skill's Core principles section documents the new pattern explicitly so the skill can't drift back into question-dribbling.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ill summaries

Real-run feedback: talos-bootstrap final NOTES ended with (in the operator's language): 'Returning control to the wizard. State recorded: status.talos-bootstrap.completed_at = ..., cluster.* filled in. The skill has finished its work; the wizard can dispatch the next one in the chain.' That phrasing was Claude's improvisation based on the chain context, not text from SKILL.md, but the SKILL.md didn't explicitly forbid it either — and so it leaked.

New Core principle on every downstream skill: layer-pure operator output. The skill never says 'returning control to wizard', 'the wizard will dispatch next', or any other orchestration commentary in the operator-facing summary. Whoever invoked the skill (a human running /cozystack:<skill> directly, or the wizard's dispatch loop on auto-dispatch) figures out what's next on their own. The wizard reads .state.yaml and decides; a human reads the printed 'next:' hint at the bottom of the NOTES.

Internal SKILL.md references to cozystack:wizard stay — they're documentation for Claude / future maintainers, not for the operator. The principle restricts only operator-facing text.

Applied to: talos-bootstrap, ubuntu-bootstrap, cluster-install, debug, cluster-upgrade, linstor:recover. plugin.json bumps to 1.7.1 (text-only fix).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
After a successful test run on 3 OCI Talos nodes the operator filed a
retrospective with eight concrete improvements. This commit lands all of them.

Wizard changes:

- /tmp guardrail moved from NEVER to AVOID. Phase 1 now offers a 4th option
  'Scratch dir under $TMPDIR (will be lost on reboot — test runs only)' for
  legitimate scratch / sandbox runs. New --allow-ephemeral flag for CI. The
  Layer-pure principle is unchanged — git-able dirs are still the default for
  real installs.

- Phase 4 intake for bare-metal-talos route expanded from 'node IPs +
  talosctl present' to the full set of things talos-bootstrap actually needs
  to start: node list, reach mode (public/internal/vip), CP endpoint (VIP /
  single-CP IP / external LB), VIP details (address, link, subnet, MTU) if
  applicable, boot method per node when needs-OS-install. Operator answers
  once in wizard Phase 4 instead of being re-asked by talos-bootstrap.

- State schema gains platform (metal/nocloud/aws/oci/azure/gcp) and
  cloud_hint (free-form for downstream routing) so OCI custom image / AWS
  AMI / metal flows pick the right boot path. Plus reach_mode, cp_endpoint,
  and a structured vip {address, link, subnet, mtu} block.

talos-bootstrap changes:

- installer.yaml lookup parameterised. --installer-profile-url for direct
  URL override, --cozystack-repo for non-default checkout, URL fallback to
  raw.githubusercontent.com when no local clone is present. Previously
  hardcoded path made the skill brittle on forks and bare workstations.

- Phase 11 verification switched from kubectl debug node + chroot /host to
  talosctl read /proc/modules + talosctl get extensions + talosctl read
  /etc/lvm/lvm.conf. The kubectl path required a CNI that doesn't exist
  yet at this point in the chain (cluster-install installs Cilium later);
  the verification was effectively broken. apid is always reachable via
  the talos API port regardless of CNI state.

- Phase 11 also adds Talos version drift detection: talm apply does NOT
  reinstall, so if the operator pinned image: ...:v1.12.7 in values.yaml
  but the nodes booted from nocloud-amd64.raw.xz of v1.13.0, the running
  Talos stays on v1.13.0 and effective stack drifts (DRBD 9.3.1 vs 9.2.16,
  etc.). The skill compares running version to pinned and surfaces the
  drift with the talosctl upgrade command to reconcile. Warning, not
  refusal — operator can choose to live with it once they know.
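
For orientation, the Phase 11 checks reduce to talosctl calls of roughly this shape (node/endpoint IPs are placeholders; --talosconfig and -e/-n are passed explicitly per the notes below):

  TC="$CONFIG_DIR/talosconfig"
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> read /proc/modules | grep -E 'drbd|zfs|openvswitch'
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> get extensions
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> read /etc/lvm/lvm.conf
  # drift check: compare the running version against the image tag pinned in values.yaml;
  # talm apply does not reinstall, so a mismatch means drift
  talosctl --talosconfig "$TC" -e <cp1-ip> -n <cp1-ip> version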

References / manual-steps.md additions:

- Working nodes/<name>.yaml example for cozystack v1.12+: multidoc
  HostnameConfig + LinkConfig instead of legacy machine.network.interfaces[]
  / machine.network.hostname. Explicit warning about the two error messages
  operators hit when mixing schemas ('multi-doc renderer cannot translate
  legacy ...' and 'static hostname is already set in v1alpha1 config').

- talm template flags note: -e/-n explicit on every call. talm apply
  parses the modeline at the top of nodes/<name>.yaml but talm template -i
  does not. 'failed to determine endpoints' is solved by -e/-n, not by
  --offline (the retrospective specifically called out the --offline
  shortcut as a wrong reflex).

- TALOSCONFIG env note: doesn't persist between Bash tool invocations
  (each is a fresh shell), so always pass --talosconfig explicitly.

plugin.json bumps to 1.8.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e, CI validator

Addresses the nine blockers and four non-blocking items from /branch-review.

Cross-reference drift (six blockers — same root cause):
- marketplace.json plugin description listed 5 skills; actual is 9.
- CLAUDE.md described old flat skills/ layout removed in the bundle refactor.
- cluster-install/references/issue-templates.md routing table called the storage skill 'drbd-recovery' (renamed to linstor:recover).
- package-bump/SKILL.md cross-ref pointed at plugins/cozystack/skills/deploy (correct: package-deploy).
- cluster-install/SKILL.md References table described storage-backends.md as 'LVM thin / LVM thick / ZFS'; the file is ZFS-only per the cozystack platform contract.
- talos-bootstrap drift check read $CONFIG_DIR/values.yaml which doesn't exist in the talos route — talm init writes nodes/<name>.yaml, secrets.yaml, talosconfig, not values.yaml. The drift check would have silently never fired. Now reads nodes/$CP1_NAME.yaml machine.install.image with uniformity verification across all node configs.

Discipline regressions:
- cluster-upgrade had 7+ bare kubectl/helm invocations despite the global --context rule. Added Phase 0 that pins $CTX once and an explicit Guardrail forbidding bare commands. Every kubectl and helm in the skill now passes --context $CTX / --kube-context $CTX.
- package-deploy --registry example showed ghcr.io/lexfrei (maintainer namespace) in a cozystack-branded skill. Replaced with placeholder ghcr.io/<your-username>.
- debug/references/upstream-routing.md called extractedprism's relationship 'author has commit rights through CCP context' — phrase doesn't parse for outside readers. Replaced with a neutral pointer ('independent BSD-3 project; file there for proxy-specific bugs') plus a new README section 'Third-party dependencies' that documents the dependency policy explicitly.

Non-blocking cleanups:
- Sample prompts in wizard / talos-bootstrap / debug had Russian text that confuses English-only contributors reading the source. Translated to English with a top-of-skill meta-note: 'every operator-facing prompt below is written in English; the LLM matches operator language at runtime per Core Principle'.
- .gitignore inconsistency: wizard sops opt-in flow claimed talosconfig + secrets.yaml stay gitignored even with sops on, contradicting the sops.md decision matrix which has talm encrypting them. Aligned — sops on, every secret file is encrypted-in-tree and commit-friendly; only *.tar.gz stays ignored. Talos NOTES summary updated to match.

CI validator (preventing recurrence):
- tools/check-refs.sh walks the plugin tree and validates: (1) every references/<file>.md mentioned in a SKILL.md exists; (2) every /<plugin>:<skill> mention resolves to a real directory under plugins/<plugin>/skills/<skill>/; (3) every plugin description in marketplace.json and the corresponding plugin.json names every skill present on disk. Six of nine blockers above are surface forms of the same 'string in one file doesn't match reality in another' bug — the validator catches all three patterns mechanically.
- .github/workflows/validate.yml runs jq on every JSON manifest and tools/check-refs.sh on every push and PR.

plugin.json bumps to 1.9.0.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Validator additions:
- check-refs.sh Check 4: enforce --context / --kube-context on kubectl
  and helm cluster-mutating commands; allow-list read-only / local ops
- check-refs.sh Check 5: deny private cluster names in plugin tree

Documentation fixes flushed out by the new checks (61 violations across
cluster-upgrade, cluster-install, debug, external-app-create,
package-bump, package-deploy, talos-bootstrap, linstor/recover):
- add --context / --kube-context to every cluster-mutating invocation
- replace private cluster identifiers with generic placeholders

Operator-facing cleanups:
- talos-bootstrap Phase 5: drop duplicated intro paragraph
- talos-bootstrap Phase 12 NOTES: dedupe artifact list
- cluster-install Phase 10 NOTES: expand 'helm uninstall ... -n' to
  'helm --kube-context <CTX> uninstall ... --namespace ...'

plugin.json: bump 1.9.0 -> 1.10.0

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
cluster-install Phase 5.6: render $CONFIG_DIR/extractedprism-values.yaml
on disk and pass it via --values, instead of inlining --set endpoints=...
The wizard already advertises this artifact (cluster-config layout, sops
opt-in prompt, .gitignore listing, final summary; state-schema.md;
sops.md creation_rules path_regex) — the downstream skill must actually
produce it for the artifact contract to hold.

check-refs.sh: replace GNU-only \b word boundaries with portable
character-class equivalents. \b is silently treated as literal by BSD
grep on macOS, which made Check 4 (kubectl/helm --context discipline)
and Check 5 (private cluster name denylist) no-op locally for any
contributor without GNU grep first in PATH. The validator is documented
as a pre-commit local gate, so a silent local degradation is the worst
failure mode.
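
An illustration of the portability fix (pattern shown for kubectl; the same substitution applies to the other keywords the checks match):

  grep -E '\bkubectl\b' SKILL.md                                  # GNU grep only; BSD grep matches nothing useful
  grep -E '(^|[^[:alnum:]_])kubectl([^[:alnum:]_]|$)' SKILL.md    # portable character-class equivalent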

plugin.json: 1.10.0 -> 1.10.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
extractedprism chart contract: endpoints is a comma-separated string
scalar (values.schema.json type=string). A YAML list shape is rejected
by helm at schema validation. cluster-install Phase 5.6 now renders
endpoints as a single string, matching the chart's published values.

Verified against pulled chart 0.2.0:
  helm template ... --values <(echo 'endpoints: ["x:1"]')
    => Error: at '/endpoints': got array, want string
  helm template ... --values <(echo 'endpoints: "x:1,y:1"')
    => render OK

references/values-template.md: switch documented invocation to --values
matching SKILL.md; call out the string-scalar contract explicitly so
future contributors do not regress.

references/known-failures.md: rewrite the recovery-path snippet to use
--values too. The previous --set form is in fact also broken — helm
--set parser treats ':' and ',' as syntax, so '--set endpoints=a:1,b:1'
fails to parse regardless of quoting. The values-file form sidesteps
this entirely.

Layer-pure refusal text:

- cluster-install SKILL.md Guardrails: drop the operator-facing pointer
  to re-run /cozystack:wizard from the sops-missing refusal; tell the
  operator what to install or which flag to pass instead.
- ubuntu-bootstrap SKILL.md Phase 2.5 and Guardrails: same treatment.

The downstream skill must not name the orchestrator in operator-facing
output. Internal SKILL.md prose and 'See also' references are fine.

plugin.json: 1.10.1 -> 1.10.2.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The previous Guardrail wording pointed operators at
references/requirements.md for SOPS recovery options, but that file
has no SOPS content — the only SOPS recovery doc lives in
wizard/references/sops.md. Pointing at a file that doesn't carry the
content is the same class of doc-drift the cross-ref validator was
built to catch (and the validator only verifies file existence, not
section presence).

Inline the recovery options directly, mirroring the ubuntu-bootstrap
treatment in commit 19a7ba9. Operators hitting the same problem in
either skill now get the same wording.

plugin.json: 1.10.2 -> 1.10.3.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The ZFS-only design has no declarative LinstorSatelliteConfiguration
CR — Phase 5.5 documents this correctly (CRD has no zfsPool slot;
registration goes through the LINSTOR API at Phase 8). Two stale call
sites contradicted that:

- Guardrail bullet claimed Phase 4 writes a 'LinstorSatelliteConfiguration
  block' to the platform package — there is no such block. Rewrite to
  describe the actual storage-state shape: cozystack.storage.nodes[] in
  .state.yaml, replayed by the Phase 8 post-Ready hook.

- References section advertised 'LinstorSatelliteConfiguration shapes'
  in values-template.md — that file documents the ZFS pool registration
  hook, not a CR shape. Repoint accordingly.

plugin.json: 1.10.3 -> 1.10.4.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Real install run on a 3-node Talos / OCI / VIP cluster surfaced three
chain-level gaps that no per-skill review could catch:

  1. Front-load promise was not honoured across the chain boundary.
     Wizard collected the bootstrap minimum, then cluster-install asked
     the operator for publishing.host, externalIPs strategy, bundles,
     cert solver, storage layout, and exposed services after the cluster
     already existed. Many of these are policy decisions known before
     bootstrap; re-asking them post-runtime is friction.

  2. completed_at / failed_at contract was implicit. cluster-install
     never documented the write. On the real run the operator patched
     status.cluster-install by hand after dispatch because the wizard's
     dispatch loop could not find a state transition.

  3. ScheduleWakeup fallback fired on top of downstream task-notification.
     Wizard re-invoked itself, spent a turn re-checking state that did
     not change, and reached an ambiguous dispatch decision.

Changes:

- wizard SKILL.md Phase 4: rewrite as 'full intake — everything
  policy-decidable up front'. Bootstrap-stage slots stay route-keyed.
  New cozystack-stage slots collected up front: bundles, storage layout
  preference, networking CIDRs, publishing.host + host_kind + cert
  solver + exposed services, external_ips strategy with explicit
  internal/external/explicit option (and the OCI-NAT failure mode
  spelled out), extractedprism opt-out. Render a single consolidated
  review screen with Approve all / Edit <slot> / Cancel.

- wizard SKILL.md Phase 5 dispatch loop: forbid the ScheduleWakeup
  fallback; a downstream skill returning without completed_at /
  failed_at is treated as a broken contract: synthesise failed_at and
  route through debug.

- wizard Phase 4 bootstrap slots gain per-node VIP-link static address
  collection — OCI maintenance-mode interfaces commonly lack IPv4 on
  VLAN secondaries, talm template then renders LinkConfig without
  addresses and the cluster splits.

- state-schema.md: new top-level cozystack_intake section documenting
  every policy slot the wizard now collects, with the
  external_ips.strategy=internal default justification.
  status responsibilities table updated; wizard now owns cozystack_intake.*.

- cluster-install SKILL.md Phase 4: rewritten as read-cozystack_intake-
  first, discovery-driven fill, single Approve/Edit gate. Direct
  invocation (no wizard) still works — the skill falls back to inline
  prompts when cozystack_intake is absent.

- cluster-install SKILL.md adds Phase 9.5 — mandatory status write,
  one of completed_at / failed_at, with shape spelled out. Phase 10
  NOTES follows after.

- wizard SKILL.md Guardrails enforce that a run ending with neither
  completed_at nor failed_at is a broken contract, routed through debug.

plugin.json: 1.10.4 -> 1.11.0 (minor — chain contract change, new
state slot).
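
A rough sketch of the status write the contract demands; the key layout
follows state-schema.md as described above, so treat the exact paths
and the reason string as assumptions:

```bash
# success: the skill stamps completed_at under its own status key
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  yq -i '.status."cluster-install".completed_at = strenv(ts)' .state.yaml

# failure: the other branch, never both (hypothetical reason string shown);
# the value names where the run failed
reason="phase-8-watch-timeout" \
  yq -i '.status."cluster-install".failed_at = strenv(reason)' .state.yaml
```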

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Five issues surfaced on a 3-node OCI Talos install that no review
caught because they were about Talos / cloud-provider reality vs the
documented happy path.

Phase 4 maintenance-mode probe:
  replace 'talosctl get machineconfig --insecure' with
  'talosctl version --insecure --short'. Talos v1.13+ returns
  PermissionDenied to insecure per-resource gets even in maintenance
  mode, which the previous probe misread as 'already configured' —
  a false negative that silently skipped Phase 5/6.

Phase 6.5 (new) — stub creation:
  'talm init' writes templates/, secrets.yaml, talosconfig — but not
  nodes/<name>.yaml. The skill now creates the stub files with hostname
  + schema modeline before 'talm template --in-place' fills them.

Phase 6.7 (new) — VIP-link IPv4 guardrail (HA only):
  On cloud-provider VLAN secondaries (canonical case: OCI ens5 in
  maintenance mode), the VIP-carrying link often has no IPv4 — talm
  template then renders LinkConfig with a vip: block and no
  addresses: block. After talm apply, only the VIP-holder is reachable
  in the VIP subnet; other CPs cannot dial the VIP for etcd join, and
  the cluster sits in 'member Preparing' for 10+ min before timing out.
  Skill now detects the empty addresses: block, refuses Phase 7, and
  either auto-patches from wizard intent_hints.vip.per_node or surfaces
  the exact fix.

Phase 8 — quoting + reboot semantics:
  Add explicit guidance to always double-quote shell variables holding
  IPs / node lists ('--nodes "$IPS"' not '--nodes $IPS') — unquoted
  lists with multiple IPs fail with 'unknown command <second-ip>'.
  Clarify the 'Applied configuration without a reboot' message: talm
  applied live; the on-disk image is replaced lazily on next upgrade,
  not on this apply.

Phase 12 — talm.key backup discipline:
  Surface talm.key as a recovery key in the artifacts listing and add
  a separate reminder line before NOTES. talm.key is not sops-encrypted;
  losing it loses the cluster's PKI without a path to regenerate.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
… cert diag, e2e gate

Four runtime issues from the OCI install run, all in cluster-install:

#104 — External IPs choice silently picked the wrong set on NAT'd
providers. cluster-install Phase 4 LB/externalIPs now reads
cozystack_intake.external_ips (filled by wizard Phase 4) and validates
against live Node.status.addresses. On intent_hints.platform ∈
{oci, aws-with-eip, gcp-with-nat} the skill refuses the 'external'
strategy and forces 'internal' with an explicit reason, because public
IPs on those platforms are 1:1-NATed before the packet reaches the
node — Cilium externalIPs BPF would never match. The operator can pass
--allow-external-on-nat-provider to override.
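
A sketch of the live-address check the refusal is based on; on a
1:1-NAT provider the ExternalIP column stays empty, so the 'external'
strategy has nothing to match:

```bash
# list each node's InternalIP and ExternalIP as the cluster reports them
kubectl --context "$CTX" get nodes -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\t"}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
```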

#105 — Phase 7.5 root Tenant patch was a fixed pre-Phase-8 wait, but
the tenants.apps.cozystack.io CRD can land after most HRs are Ready
(observed on a real run). Move the patch into the Phase 8 watch loop:
event-driven — as soon as the CR appears, patch it and continue
monitoring. Removes the race entirely. Update SKILL.md cross-refs,
Guardrails, and references/known-failures.md to match.

#106 — 'wait ~5 min for first issuance' was a misleading cert-manager
mitigation that hid real failures. Healthy ACME HTTP-01 resolves in
under 30 s; anything stuck past 2 min has a definite, diagnosable
cause. Phase 9 now prints a per-state diagnostic table (kubectl get
challenges, orders, cert-manager log filter) and a mapping of
state+reason -> likely cause -> fix. nip.io and custom-fqdn diverge
here: nip.io's instant DNS makes any 2-min pending challenge anomalous.
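
The kind of per-state diagnostic sweep described; the resource kinds
are cert-manager's standard ones, while the controller namespace and
deployment name are assumptions:

```bash
# what is stuck, and in which state
kubectl --context "$CTX" get challenges.acme.cert-manager.io -A
kubectl --context "$CTX" get orders.acme.cert-manager.io -A
kubectl --context "$CTX" get certificaterequests.cert-manager.io -A

# controller-side reasons (namespace / deployment name assumed)
kubectl --context "$CTX" -n cert-manager logs deploy/cert-manager --since=15m \
  | grep -Ei 'challenge|order|acme'
```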

#107 — End-to-end reachability gate added as Phase 9.4 (between
Phase 9 verify and Phase 9.5 status write). 'All HRs Ready' is a
cluster-side signal; it says nothing about whether the dashboard is
reachable from outside the cluster. The skill now curls
https://dashboard.<host>/ from the workstation, classifies the result
(200/302/401 pass; exit 7/28, 530, persistent 502/503 fail), and on
failure writes failed_at='external-reachability' with curl + DNS +
node-addresses detail. The cluster is not rolled back; debug picks
up from there. --skip-external-reachability downgrades to warning
for non-routable sandbox installs.
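
A compressed sketch of the probe and its classification; the hostname
is a placeholder and the real skill records far more detail on failure:

```bash
host="dashboard.example.com"   # placeholder for dashboard.<publishing.host>
rc=0
code="$(curl -ks -o /dev/null -w '%{http_code}' --connect-timeout 10 "https://${host}/")" || rc=$?

case "${rc}:${code}" in
  0:200|0:302|0:401) echo "external reachability: pass (HTTP ${code})" ;;
  7:*|28:*|0:530)    echo "fail: ${host} unreachable (curl rc=${rc}, HTTP ${code})"; exit 1 ;;
  # the skill retries before treating 502/503 as persistent
  0:502|0:503)       echo "fail: ${host} answers but backend unhealthy (HTTP ${code})"; exit 1 ;;
  *)                 echo "ambiguous: curl rc=${rc}, HTTP ${code}; inspect manually" ;;
esac
```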

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…linstor hook + StorageClasses

Three storage-path gaps surfaced on the OCI Talos install:

#101 — Phase 5.5 used 'kubectl debug node --image=alpine:3 --
chroot /host zpool create'. None of the parts work on Talos:
- alpine ships musl; the cozystack-tuned image's zpool is glibc-linked
  and the loader (/lib64/ld-linux-x86-64.so.2) lives only in the
  ext-zfs-service namespace, not on the host rootfs visible via chroot
- 'chroot /host /bin/sh' fails — Talos has no /bin/sh
- PSA baseline on default / kube-system rejects the sysadmin debug pod

Split Phase 5.5 by distribution. Talos path now uses a one-shot
privileged Pod from ubuntu:24.04 in a cozy-storage-bootstrap namespace
labelled PSA=privileged: apt-get install zfsutils-linux, sgdisk
manual partitioning (no udev inside the pod, so 'zpool create /dev/sdb'
cannot wait for partition discovery), then 'zpool create -f
${DEVICE}1'. Ubuntu / k3s / kubeadm path keeps the original
kubectl-debug chroot approach. references/storage-backends.md gets
the verbatim Pod manifest and the three-reasons-why-chroot-fails block.

Add a Phase 5.5 step 7: pre-existing-data check before zpool create
(pvs, dmsetup, wipefs probe). 'talosctl reset' does not wipe user
disks by default, and previous LINSTOR LVM-thin pools on /dev/sdb
silently break 'zpool create' with EBUSY or DEGRADED pools.
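
A rough sketch of the Talos-path shape, assuming a hypothetical node
name, /dev/sdb as the data disk, and 'data' as the zpool name; the
canonical manifest lives in references/storage-backends.md and this
only illustrates the moving parts (privileged namespace, one-shot pod,
probe before create):

```bash
DEVICE=/dev/sdb   # illustrative data disk
NODE=node1        # illustrative storage node

kubectl --context "$CTX" create namespace cozy-storage-bootstrap \
  --dry-run=client -o yaml | kubectl --context "$CTX" apply -f -
kubectl --context "$CTX" label namespace cozy-storage-bootstrap \
  pod-security.kubernetes.io/enforce=privileged --overwrite

kubectl --context "$CTX" apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: zfs-bootstrap-${NODE}
  namespace: cozy-storage-bootstrap
spec:
  restartPolicy: Never
  nodeName: ${NODE}
  hostNetwork: true            # Phase 5.5 runs before any CNI exists
  containers:
    - name: bootstrap
      image: ubuntu:24.04
      securityContext:
        privileged: true
      command: ["/bin/bash", "-ceu"]
      args:
        - |
          apt-get update && apt-get install -y zfsutils-linux gdisk lvm2
          # pre-existing-data probe: the real skill stops here if anything shows up
          wipefs -n ${DEVICE} || true
          pvs || true
          # manual partitioning: no udev in the pod to wait for discovery
          sgdisk --zap-all ${DEVICE} && sgdisk -n1:0:0 ${DEVICE}
          zpool create -f data ${DEVICE}1
      volumeMounts:
        - name: dev
          mountPath: /dev      # /dev/zfs is reached through this mount
  volumes:
    - name: dev
      hostPath:
        path: /dev
EOF
```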

#102 — Earlier SKILL.md and values-template.md described a 'Phase 8
hook that runs linstor storage-pool create zfs per node after
controller Ready'. The hook was never implemented; the watch loop
only monitored HR Ready=True. Real installs left LINSTOR with only
DfltDisklessStorPool and operators registered pools by hand.

New Phase 8.5 — Register LINSTOR storage pools — is the actual
implementation. After all HRs Ready, iterate
state.cozystack.storage.nodes[] and exec
'linstor storage-pool create zfs <node> <linstor-pool> <zpool>'
inside the linstor-controller pod. Idempotent (skip when
storage-pool list already shows the entry). On per-node failure,
write failed_at='linstor-storage-pool' and abort — partial
registrations are kept for operator inspection.

references/storage-backends.md rewrites the 'register the pool'
section to point at the actual implementation in SKILL.md Phase 8.5.

#103 — cozystack v1.3.x does not create StorageClasses
automatically; cluster reaches 'all HRs Ready' with zero SCs and
every stateful tenant workload sits in
'pod has unbound immediate PersistentVolumeClaims'. Skip on
v1.4.0+ (the tenants CRD then exposes spec.storageClasses).

New Phase 8.6 — Default StorageClasses — writes
'local' (placementCount=1) and 'replicated' (placementCount=3,
isDefaultClass=true) to storageclasses-default.yaml under
config-dir and applies them. placementCount on 'replicated'
auto-derives from cozystack.storage.nodes[] count when fewer
than 3 storage nodes are present.
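
A sketch of the two classes under the stated parameters; the
provisioner and parameter keys follow the LINSTOR CSI driver's
documented names, and the backing pool name is an assumption:

```bash
cat > "$CONFIG_DIR/storageclasses-default.yaml" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: data      # assumed LINSTOR pool name
  linstor.csi.linbit.com/placementCount: "1"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated
  annotations:
    # 'isDefaultClass=true' expressed as the standard default-class annotation
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: data
  linstor.csi.linbit.com/placementCount: "3"    # auto-derived when <3 storage nodes
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

kubectl --context "$CTX" apply -f "$CONFIG_DIR/storageclasses-default.yaml"
```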

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
New references/provider-pitfalls.md collects the things that weren't
deducible from 'all HRs Ready' or from generic k8s docs and were each
a multi-hour debug episode in a real install run:

- OCI 1:1 NAT — externalIPs must be internal, mechanism explained
- GCP NAT'd external IPs — same shape, with exception for direct-EIP VMs
- AWS Elastic IP / NLB proxy-protocol pitfalls
- Talos system-extension binaries live only inside ext-* namespaces;
  why kubectl debug + chroot /host cannot run zpool / drbd userspace
- Pod Security Admission baseline blocks kubectl debug node since
  k8s 1.25; need a dedicated privileged-labelled namespace
- talosctl reset leaves user disks intact; previous-install LINSTOR
  LVM-thin pool state on /dev/sdb silently breaks zpool create
- cozystack v1.3.x does not create StorageClasses automatically
- cozystack v1.3.3 isp-full bundle does not include Keycloak
- api.<host> ingress speaks TCP passthrough — 401 + self-signed cert
  is by design (apiserver terminates TLS itself)
- HelmRelease count varies during install (no fixed expected total)
- linstor CLI requires mTLS client cert; always exec inside the
  controller pod
- HelmRelease cascade warnings ('secret not found', 'rolebinding not
  found') in first 5–10 min are transient race conditions

Each pitfall has Symptom / Mechanism / Fix sections so the skill can
cross-reference a specific entry from Phase 4 publishing or Phase 5.5
storage and the operator gets the why, not just the what.

plugin.json: 1.11.0 -> 1.12.0 (minor — new chain semantics + new
phases 8.5 / 8.6 / 9.4 / 9.5 in cluster-install).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Four blockers + three medium findings from the post-batch-B review.

Blocker — stale Phase 7.5 references in debug skill:
  debug SKILL.md classification example and references/classification.md
  table row both named 'Phase 7.5'. The cluster-install refactor moved
  that work inline into the Phase 8 watch loop; debug now produced
  evidence-narratives pointing at a phase number that no longer exists.
  Re-frame: the failure mode is now 'Phase 8 watch loop never observed
  the tenants/root CR appearing'.

Blocker — Talos privileged-pod wait condition was wrong:
  references/storage-backends.md used --for=condition=Ready on a
  short-running one-shot pod with restartPolicy: Never. A one-shot pod
  transitions to phase=Succeeded with condition=Ready=False, so the
  wait hung the full 300 s timeout on every successful run.
  Switch to --for=jsonpath='{.status.phase}'=Succeeded; on failure
  describe + log + exit 1 so the operator gets the real diagnosis.
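
A sketch of the corrected wait plus the failure branch; pod name and
namespace are the illustrative ones from the storage-bootstrap example:

```bash
POD=zfs-bootstrap-node1
NS=cozy-storage-bootstrap

# a Failed pod exhausts the timeout here; the real check also watches phase=Failed
if ! kubectl --context "$CTX" -n "$NS" wait "pod/$POD" \
    --for=jsonpath='{.status.phase}'=Succeeded --timeout=300s; then
  kubectl --context "$CTX" -n "$NS" describe "pod/$POD"
  kubectl --context "$CTX" -n "$NS" logs "pod/$POD"
  exit 1
fi
```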

Blocker — Phase 8.5 ordering would deadlock real installs:
  Earlier text gated LINSTOR storage-pool registration on 'all HRs
  Ready'. paas / monitoring HRs that request PVCs wait for the
  storage pool to exist, so an all-HRs-Ready gate would block on HRs
  that block on registration that blocks on all-HRs-Ready.
  Fold registration into the Phase 8 watch loop, gated on
  linstor-controller Deployment having >= 1 Ready replica (same shape
  as the root-Tenant ingress patch). Drop the standalone Phase 8.5.
  Update all cross-references (SKILL.md, storage-backends.md,
  provider-pitfalls.md).

Blocker — duplicate /dev/zfs hostPath mount masks itself:
  Mounting hostPath:/dev/zfs at /dev/zfs inside a pod that already
  mounts hostPath:/dev at /dev is nested-mount territory; one of the
  two will mask the other. /dev/zfs is reached through the /dev mount
  (devtmpfs is shared). Drop the second volume + volumeMount.

Medium — curl exit-code semantics in Phase 9.4 reachability probe:
  Phase 9.4 said 'curl exit 7 is almost always externalIPs misconfig
  on a NAT'd provider'. curl exit 7 (CURLE_COULDNT_CONNECT) actually
  covers both ECONNREFUSED (RST — fits the NAT story) and EHOSTUNREACH
  / ENETUNREACH (no route — workstation can't reach the cluster at
  all). Different root causes. Add 'ip route get <ip>' as the
  disambiguator. Also add curl exit 6 (DNS resolve failed) — distinct
  failure mode, common on custom-fqdn before DNS is configured. Note
  the %{exitcode} curl 7.75+ requirement.

Medium — Phase 8 LINSTOR loop iteration safety:
  'for entry in $(...)' word-splits on whitespace; an embedded space
  in a JSON value would silently turn one object into multiple bogus
  entries. Switch to 'jq -c .[] | while IFS= read -r entry' both in
  SKILL.md Phase 8 and references/storage-backends.md.
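
A whitespace-safe sketch of the registration loop; the controller
namespace, label selector, state keys, and LINSTOR pool name are
assumptions layered on the commands named above:

```bash
CONTROLLER="$(kubectl --context "$CTX" -n cozy-linstor get pods \
  -l app.kubernetes.io/component=linstor-controller \
  -o jsonpath='{.items[0].metadata.name}')"   # namespace / label assumed

yq -o=json '.cozystack.storage.nodes' .state.yaml \
  | jq -c '.[]' \
  | while IFS= read -r entry; do
      node="$(jq -r '.name'  <<<"$entry")"    # field names assumed
      zpool="$(jq -r '.zpool' <<<"$entry")"
      # idempotent: skip nodes that already carry a pool entry
      if kubectl --context "$CTX" -n cozy-linstor exec "$CONTROLLER" -- \
           linstor -m storage-pool list | grep -q "\"$node\""; then
        continue
      fi
      kubectl --context "$CTX" -n cozy-linstor exec "$CONTROLLER" -- \
        linstor storage-pool create zfs "$node" data "$zpool"
    done
```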

Medium — talosctl jsonpath empty-output silent pass:
  talosctl get extensions ... | grep -E 'drbd|zfs|openvswitch' silently
  passes when the stream is empty (no match, and nothing verified that
  every expected extension appeared). Replace with a per-extension
  grep -q and an explicit floor — each expected extension must produce
  at least one match, otherwise exit 1.

plugin.json: 1.12.0 -> 1.12.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Three structural contradictions in cluster-install/SKILL.md and one
lying comment in talos-bootstrap/SKILL.md.

Blocker A — STOP GATE 2 referenced as 'Phase 5' but lived inside
Phase 4 (Phase 5 didn't exist as a heading). Three call sites
(lines 27, 219, 267) pointed operators at a non-existent phase.
Promote 'STOP GATE 2 — Plan presentation' to '## Phase 5 — Plan
presentation (STOP GATE 2)'. Phase 4 ends at the intake summary;
Phase 5 owns the plan and the gate.

Blocker B — two sibling H2 headings titled '## Phase 8'. The second
one was rationale + the inline LINSTOR registration block; it isn't
a separate phase, it's a continuation. Demote to '### LINSTOR
storage-pool registration — inline' under the first Phase 8 H2.

Blocker C — Phase numbering skipped 9.1 / 9.2 / 9.3 and jumped
straight to 9.4 / 9.5, leaving the fingerprint of removed-but-not-
renumbered phases. Renumber 9.4 -> 9.1 (reachability probe) and
9.5 -> 9.2 (status write).

Medium — talos-bootstrap Phase 11 extensions check comment claimed
'grep -c floor catches that' while the code uses per-extension
'grep -qE'. Rewrite the comment to describe what the code actually
does.

plugin.json: 1.12.1 -> 1.12.2.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
lexfrei added 8 commits May 16, 2026 03:33
… Phase 7 contribution offer

The skill set now ships a seed library of operator-experience case
studies that capture things not deducible from generic Cozystack /
Kubernetes docs: provider-specific quirks, hardware edge cases, race
conditions only visible on slow clusters, real failure modes operators
debugged for hours and the actual sequence of commands that fixed them.

Layout under plugins/cozystack/skills/wizard/references/case-studies/:

  oci/ (3)             — externalIPs 1:1 NAT, VLAN-VIP IPv4 missing,
                         VNIC-attachment for VIP placement
  hetzner/ (2)         — RobotLB + vSwitch instead of MetalLB,
                         GRUB serial-console blocks Talos boot
  bare-metal/ (1)      — Ubuntu Secure Boot rejects DRBD module
  talos/ (7)           — kubectl debug + chroot fails for zpool,
                         PSA blocks debug pods, talosctl reset leaves
                         user disks, boot-to-talos hardcoded version
                         mismatch, lldpd extension blocks boot,
                         talosctl version mismatch, single-disk
                         UserVolume + --insecure flag window
  storage/ (6)         — LINSTOR controller CrashLoop + CRD-DB recovery,
                         dedicated network resets, multi-DC zone
                         selectors, encryption passphrase re-enter
                         after restart, stale superblock blocks pool
                         detection, TLS rotation staged restart
  networking/ (3)      — kube-ovn DB cleanup DaemonSet, virtual-router
                         port_security annotation, KubeSpan +
                         kube-ovn MTU 1222
  virtualization/ (5)  — Proxmox migration via cdi-uploadproxy,
                         Windows VirtIO bus switch + MTU 1400,
                         MikroTik CHR SATA bus, GPU passthrough
                         resource-name derivation
  cozystack-v1.3/ (2)  — StorageClasses not auto-created,
                         Keycloak absent from isp-full bundle
  cluster-install/ (1) — tenants CRD race with HRs Ready

Sources: a real OCI Talos install run that surfaced 8 issues, plus 22
candidates lifted from cozystack/website docs/v1.3 and blog posts. All
IPs redacted to RFC 5737 or placeholders; hostnames to example.com;
no customer / account / project identifiers preserved.

case-studies/README.md spells out the schema, the body shape
(Symptom / Mechanism / What was tried / What fixed it / Operator
notes), and the redaction rules. New cases follow the schema.

Wizard Phase 7 — contribution offer:

  After Phase 6 final summary, when the run hit something interesting
  (resolved failed_at, triggered guardrail, new platform tag without
  prior coverage, debug drafted upstream issue), the wizard offers
  ONCE to draft a case study. Skipped silently when nothing
  interesting happened.

  Auto-redaction runs BEFORE the operator sees the draft (IPv4/IPv6
  literals, hostnames, OCID/ARN/GCP-project, SSH key paths, secrets,
  emails). The redaction report is printed alongside the draft so
  the operator can audit what was replaced.

  Operator is then explicitly asked to read the draft for things the
  regex can't catch — cloud account IDs, organization names embedded
  in label values, internal hostnames in stack traces, real names
  in operator-notes. Four options: Looks clean / Edit first /
  Save only / Discard. Discard records the decline and the wizard
  never asks again on subsequent --resume runs.

  Looks clean prints the exact 'gh pr create' command for the
  operator to run themselves. The wizard never opens PRs.

Cross-references added from:
  - talos-bootstrap Phase 6.7 VIP-link guardrail → oci/vlan-vip-no-ipv4-in-maintenance
  - cluster-install Phase 4 externalIPs strategy → oci/externalips-1to1-nat
  - cluster-install References section → full case-studies catalog by axis

README.md gains a 'Knowledge base' section above 'Third-party
dependencies' explaining what the seed library covers and how
operators contribute back.

plugin.json: 1.12.2 -> 1.13.0 (minor — new wizard phase + new
top-level reference subtree under wizard/).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…e base + Phase 7 contribution offer"

This reverts commit f805c4e.
…cited

Replaces the reverted bundled case-studies knowledge base with a
runtime research pass before dispatch.

Rationale:
A bundled knowledge base of 30 pre-written case studies inside the
plugin had two problems no review caught:
- Data ages (Cozystack ships monthly; Talos ships monthly; providers
  evolve). Pre-written cases pin to a moment in time and rot.
- Skills should ship instructions, not data. A growing community-
  contributed knowledge base mixes review cadences with skill code
  and conflates 'what to do' with 'what others did'.

The right shape: the wizard knows HOW to look for known landmines
for the operator's specific combination at install time, against
current docs and current upstream trackers.

Phase 4.5 — Active research:
- Runs read-only, time-boxed (~2 min), after Phase 4 intake collects
  the full combination (target, platform, version, variant, bundles).
- Sources in trust order: local clones of cozystack/website +
  cozystack/cozystack + cozystack/talm (if present at
  ~/git/github.com/cozystack/), upstream issue trackers via gh,
  web search as last resort.
- Skeptical-verification rules are non-negotiable: every claim must
  cite a traceable source (URL, file path, issue number); 'I recall
  that' / 'I'd expect' / 'might also' are forbidden; recency check
  for blog posts vs current versions; reproduce-vs-speculation flag.
- Per-axis checklist of things to look for (oci, hetzner,
  aws-with-eip, gcp-with-nat, bare-metal + talos / bundle-specific
  axes) — starting points, not an exhaustive matrix.
- Output is a ranked HIGH / MEDIUM / LOW landmines list with the
  source citation per finding; operator gate: Acknowledged / Pause
  / Edit Phase 4 values.
- Empty result is a quiet one-liner, not a gate.
- 24h cache keyed on the combination; --rerun-research forces fresh.

plugin.json: 1.12.2 -> 1.13.0 (minor — new phase in wizard,
new state slot research_cache).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ltidoc handling

Real OCI install run on Talos 1.12 (cozystack v1.3.3) surfaced four
high-severity bugs in talos-bootstrap. The cert-SAN trap caught the
same operator twice in a row — second occurrence makes it a skill
bug, not a one-off.

#1 cert-SAN trap (NEW Phase 6.3 guardrail):

  On NAT-fronted providers (OCI 1:1 NAT, GCP Cloud NAT, AWS EIP) the
  workstation reaches each node by a public IP that the cloud fabric
  rewrites to the internal IP before the packet hits the interface.
  Talos sees only the internal IP and generates the API-server cert
  with internal-only certSANs. The workstation, dialing the public
  IP, gets a cert with no matching SAN; the TLS handshake fails;
  talosctl bootstrap cannot proceed; and there is no insecure escape
  (the bootstrap subcommand has no skip-verify mode, and talosctl
  reset itself needs valid TLS).

  Recovery from this state requires re-imaging the node. The trap
  caught the same operator twice across consecutive sessions.

  Phase 6.3 — NAT-provider cert-SAN guardrail — runs BEFORE the
  first talm apply, when the signature (reach_mode=public AND
  external_ips_strategy=internal AND public_ip set) is detected.
  Auto-populates values.yaml machine.certSANs with:
    127.0.0.1, localhost, every inventory.nodes[].public_ip and
    internal_ip, vip.shared_address, vip.per_node[], api.<host>.
  Surfaces the result and the reason. Skipped silently when no
  NAT signature.
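
  A sketch of what the guardrail writes, with RFC 5737 / example.com
  placeholders standing in for the inventory values:

```bash
# machine.certSANs gets every address the workstation may dial:
# 127.0.0.1/localhost, each node's public_ip and internal_ip,
# vip.shared_address and per-node VIPs, api.<host>
yq -i '.machine.certSANs = [
  "127.0.0.1", "localhost",
  "203.0.113.10", "10.0.0.10",
  "203.0.113.11", "10.0.0.11",
  "10.0.0.100",
  "api.example.com"
]' values.yaml
```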

#4 Talos 1.12 maintenance probe:

  Earlier rewrite (post-dev17.1) switched the probe to
  'talosctl version --insecure --short' to dodge v1.13's
  PermissionDenied on get machineconfig. But Talos 1.12 (cozystack
  v1.3.3's pinned version) returns 'API is not implemented in
  maintenance mode' from the version endpoint — false negative.

  New probe is 'talosctl get disks --insecure' which works across
  Talos 1.12 / 1.13 / 1.14 in maintenance. Use 'get machineconfig
  --insecure' result to distinguish 'maintenance' from 'already
  configured' after the disks probe confirms the node speaks the
  Talos API at all.
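
  A sketch of the probe the message settles on; the node address is a
  placeholder, and the configured-vs-unreachable disambiguation is only
  outlined:

```bash
NODE=203.0.113.10   # placeholder

# 'get disks --insecure' answers in maintenance mode on Talos 1.12-1.14,
# where 'version --insecure' (1.12) and 'get machineconfig --insecure'
# (1.13+) do not.
if talosctl -n "$NODE" -e "$NODE" get disks --insecure >/dev/null 2>&1; then
  echo "node $NODE: maintenance mode, proceed with Phase 5/6"
else
  # not maintenance: either already configured (insecure API rejected)
  # or unreachable; the skill distinguishes the two as described above
  echo "node $NODE: not in maintenance mode" >&2
fi
```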

#2 talm template --in-place strips body overlay:

  talm README states explicitly that 'talm template -f node.yaml
  [--in-place]' produces output byte-identical to running the
  template alone — it does NOT merge body overlay (HostnameConfig,
  per-node LinkConfig with static IPv4 on the VIP link, etc.).
  On Talos 1.12+ multidoc the body overlay is critical for per-node
  network config; --in-place erases it.

  Phase 6.5 no longer runs --in-place. Instead it writes
  nodes/<name>.yaml with the body overlay directly: modeline +
  HostnameConfig doc + LinkConfig doc (with addresses when the
  node carries a per-node VIP static) + Layer2VIPConfig doc
  (once on the first CP). 'talm apply -f node.yaml' merges this
  body overlay on top of the template render correctly.

#3 VIP-link guardrail multidoc-aware:

  The old guardrail check used 'yq .machine.network.interfaces[] |
  select(.vip != null) | .addresses' which only matches the
  legacy single-doc Talos <= 1.11 format. On 1.12+ multidoc the
  VIP lives in a separate Layer2VIPConfig doc and link static
  addresses live in LinkConfig doc — the old yq path always
  returned empty (false trigger) or missed real misconfigurations
  (false negative).

  Phase 6.7 now branches on state.cluster.talos_version: multidoc
  check looks at 'select(.kind == "LinkConfig" and .metadata.name
  == strenv(VIP_LINK)) | .spec.addresses'; legacy check stays for
  Talos <= 1.11.
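
  A sketch of the multidoc branch using the yq path quoted above; the
  file path and interface name are placeholders:

```bash
export VIP_LINK=ens5   # interface carrying the VIP (OCI example)

addresses="$(yq '
  select(.kind == "LinkConfig" and .metadata.name == strenv(VIP_LINK))
  | .spec.addresses
' nodes/node1.yaml)"

if [ -z "$addresses" ] || [ "$addresses" = "null" ]; then
  echo "VIP link $VIP_LINK has no static addresses; refusing Phase 7" >&2
  exit 1
fi
```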

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…sion normalize + hostNetwork + LINSTOR race

#5 talos-bootstrap Phase 11.5 — auto-upgrade to cozystack-tuned image:

  When Phase 11 verification detects missing extensions or running
  image != pinned cozystack-tuned image, the root cause is almost
  always 'operator booted from base Talos image instead of the
  cozystack-tuned artefact'. talm apply does not reinstall — the
  pinned machine.install.image only takes effect on the next
  talosctl upgrade.

  Phase 11.5 surfaces the diagnosis, asks the operator (STOP GATE),
  then runs 'talosctl upgrade --image $PINNED --preserve' per node
  sequentially (etcd quorum maintained). After all nodes upgrade,
  re-runs Phase 11 verification once. Without this step, the
  downstream cluster-install Phase 3 refuses the cluster.

#6 cluster-install Phase 6 — OCI registry version normalize:

  cozystack/cozystack git tags are 'vX.Y.Z' but the OCI registry
  publishes the cozy-installer chart at tag 'X.Y.Z' (no v prefix).
  Passing 'v1.3.3' to --version produces 'not found'. Normalize
  in-place: INSTALLER_VERSION_OCI="${INSTALLER_VERSION#v}".
  Phase 5 plan presentation shows the normalized tag explicitly.

#7 cluster-install Phase 5.5 — hostNetwork: true on storage pod:

  Phase 5.5 runs BEFORE Phase 6 which installs cozy-installer
  (and Cilium). Without CNI a pod on the default pod-network
  cannot get an IP and stays ContainerCreating forever. The bootstrap
  pod doesn't open any listening ports, so hostNetwork: true is the
  right shape — sidesteps CNI dependency entirely. Test run got
  lucky (image pulled before Cilium was Required) but contract
  requires explicit hostNetwork. references/storage-backends.md
  already has it; updated SKILL.md prose to explain why.

#8 cluster-install Phase 8 — LINSTOR race fix:

  Earlier design folded LINSTOR storage-pool registration into the
  watch loop gated on linstor-controller Deployment Ready, to avoid
  the post-watch-deadlock with paas/monitoring HRs that depend on
  PVCs. But: a HelmRelease can report Ready=True (from Flux's
  install-action-succeeded lifecycle) BEFORE the underlying
  Deployment has any ready replicas. The watch loop's outer exit
  condition was 'no HR is non-Ready', which fired before the
  inline registration block had a ready linstor-controller to
  exec against. Pools stayed unregistered; operator had to register
  manually after cluster-install declared success.

  STOP GATE 3 now requires BOTH no-HR-not-Ready AND
  storage-pool count matches storage-node count (queried via
  'linstor storage-pool list' filtered to provider_kind=ZFS,
  unique node names). The watch loop keeps polling — including the
  inline registration — until both conditions hold, without
  re-introducing the post-watch deadlock.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…ch storage + bash compat + state schema

Batch D — wizard:

#9 Networking CIDRs read from cozystack source values.yaml:
  Hardcoded 10.42.0.0/16 + 10.43.0.0/16 were k3s defaults, not
  cozystack defaults. cozystack v1.3.3 actually uses 10.244.0.0/16
  + 10.96.0.0/16. Resolution order: local clone of cozystack/cozystack
  at ~/git/.../packages/core/platform/values.yaml, then older layouts,
  then HTTP fallback. cozystack_intake.network now carries
  defaults_source path so the operator can trace which file informed
  the choice.

#10 cozystack_intake.installer_variant + platform_variant kept SEPARATE:
  Older drafts joined into 'isp-full-generic' but the cozy-installer
  chart wants installer_variant for --set cozystackOperator.variant
  AND platform_variant for the values overlay. The joined form
  doesn't map to either CLI flag. cozystack_intake schema split
  explicit: installer_variant ∈ {generic, talos, hosted},
  platform_variant ∈ {isp-full, isp-paas, isp-hosted, minimal}.

#11 Drop separate confirmation rounds in Phase 0/2:
  Operator complained about 5-6 'looks right? yes/no' rounds across
  the wizard. Phase 0 echo now flows directly into Phase 1's question
  (single response moves the chain). Phase 2 no longer asks 'right?'
  when intent_hints.target was already extracted; Phase 4 consolidated
  intake is the single confirmation point.

Batch E — cross-skill UX:

#13 cluster-install Phase 5.5 batch-all-identical:
  When all storage-scope nodes share layout + device shape +
  distribution, surface one STOP GATE that covers all N nodes
  ('Provision all 3 identical nodes') instead of three identical
  per-node prompts. Detect identical-batch shape from
  cozystack_intake.storage_pref + discovered devices. For non-uniform
  configurations (mixed single/mirror, different paths), fall back
  to per-node STOP GATE — those are real choices.

#12 Shell compatibility for emitted helper scripts:
  Documented in references/helmrelease-monitoring.md. Shebang
  #!/usr/bin/env bash; avoid declare -A / mapfile / ${var,,} / **
  globstar; fall back to parallel arrays + linear lookup, or
  awk/jq/yq over a tempfile. Bash-4-only constructs need an explicit
  BASH_VERSINFO guard. A real session hit declare -A failing on
  macOS bash 3.2 and the watch loop never ran.
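
  An illustrative guard and fallback in the spirit of the rule above:

```bash
#!/usr/bin/env bash
set -euo pipefail

if (( BASH_VERSINFO[0] >= 4 )); then
  declare -A hr_state          # associative arrays only exist on bash >= 4
  hr_state[cozystack]=Ready
else
  hr_names=(cozystack)         # parallel arrays + linear lookup on bash 3.2
  hr_states=(Ready)
  lookup_state() {
    local i
    for i in "${!hr_names[@]}"; do
      if [ "${hr_names[$i]}" = "$1" ]; then
        printf '%s\n' "${hr_states[$i]}"
        return 0
      fi
    done
    return 1
  }
fi
```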

#14 state.schema.json — machine-readable JSON Schema:
  Lives at wizard/references/state.schema.json. Validates every
  top-level section (intent_hints, sops, inventory, cluster,
  cozystack_intake, cozystack, research_cache, status) with enum
  constraints on variant/distribution/layout/strategy fields.
  status.<skill> oneOf-constrained to exactly one of
  completed_at / failed_at / (dispatched-only mid-flight).
  Operators can drive yaml-language-server via a header comment
  in .state.yaml; future check-refs.sh extension can validate.
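
  The header-comment hook mentioned above, sketched with an assumed
  path to the installed schema:

```bash
# first line of .state.yaml; editors running yaml-language-server will
# validate against the schema (path is an assumption about the layout)
printf '# yaml-language-server: $schema=%s\n' \
  "plugins/cozystack/skills/wizard/references/state.schema.json" \
  | cat - .state.yaml > .state.yaml.new && mv .state.yaml.new .state.yaml
```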

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…covery

When Talos nodes reach an unrecoverable state from inside the cluster
(cert-SAN trap caught before the guardrail, broken machine-config,
lost talosconfig, accidental wipe), 'talosctl reset' itself needs a
valid TLS handshake — there's no in-cluster escape. Recovery requires
provider-layer intervention: terminate the instance, preserve disks,
relaunch from the cozystack-tuned image, re-attach.

cozystack:talos-reset orchestrates this without losing block volumes,
secondary VNICs, or NSG memberships. Provider-specific CLI sequences
(oci, aws, gcloud, hcloud) live in references/provider-cli.md.

Phase plan:

  1 — Read state, scope (--nodes filter or all-of-failure-signature)
  2 — Provider CLI auth check (refuses on missing/unauthenticated)
  3 — Snapshot per-node cloud state to <config-dir>/talos-reset/<node>/
      (instance + volume-attachments + vnic-attachments + nsg memberships)
  4 — Plan presentation (STOP GATE 1): per-node terminate/relaunch/reattach
      commands with preserved+lost breakdown
  5 — Terminate per node, sequential (etcd quorum maintained on multi-CP)
  6 — Relaunch from cozystack-tuned image + reattach preserved resources
  7 — Verify maintenance mode + preserved zpool intact
  8 — Update state.yaml (new public IPs, reset status.talos-bootstrap)
  9 — NOTES + handoff to /cozystack:talos-bootstrap

Provider-specific gotchas documented:
  - OCI: PARAVIRTUALIZED launch mode mandatory for QCOW2 custom images;
    secondary VNIC OCIDs change (preserve VLAN membership, not VNIC ID).
  - AWS: detach data volumes BEFORE terminate (DeleteOnTermination
    behaviour); EIPs disassociate, re-associate post-relaunch.
  - GCP: --keep-disks=data preserves non-boot disks automatically;
    Cloud-NAT'd VMs handled by Phase 4.5 NAT-signature research.
  - Hetzner: Cloud servers only (Robot not supported by hcloud);
    vSwitch + VLAN preserved at Network attachment level.

Guardrails:
  - Never auto-execute provider CLI; print + wait for Proceed
  - Never terminate >1 node simultaneously on multi-CP (etcd quorum)
  - Never assume new public IP == old (capture from new instance VNIC)
  - Never touch in-cluster state (Cozystack-side is cozystack:debug)

Registered in marketplace.json + plugin.json + README + CLAUDE.md.
plugin.json: 1.13.0 -> 1.14.0 (minor — new skill).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
…sion

Review of the dev17.2-feedback batch caught four contradictions
introduced by my fix attempts. All four come from making changes
without verifying against the upstream source.

#1 platform_variant enum was invented, not sourced from upstream:
  cozystack/cozystack/packages/core/platform/ ships exactly
  three overlays: values-isp-full.yaml, values-isp-full-generic.yaml,
  values-isp-hosted.yaml, plus the base values.yaml (default).
  The schema enum was ['isp-full', 'isp-paas', 'isp-hosted', 'minimal']
  — omits real value isp-full-generic and invents two values with
  no upstream backing.

  state.schema.json + state-schema.md now use the real upstream
  enum: ['default', 'isp-full', 'isp-full-generic', 'isp-hosted'].
  The state-schema.md comment that called 'isp-full-generic' a
  'v0 mistake' was itself the mistake; rewritten to document the
  installer_variant ↔ platform_variant pairing as it exists in
  cozystack source (talos↔isp-full, generic↔isp-full-generic,
  hosted↔isp-hosted).

#2 CIDR fix #9 was only half-applied:
  wizard/SKILL.md correctly states cozystack defaults are
  10.244.0.0/16 / 10.96.0.0/16, but cluster-install/SKILL.md:197-199
  consolidated summary still showed 10.42 / 10.43, and
  references/values-template.md kept those as 'k3s default'
  recommendations.

  Updated cluster-install Phase 4 summary to show cozystack defaults.
  Rewrote values-template.md 'CIDR cheat sheet by distribution'
  to 'CIDR defaults from cozystack source' — the k3s/RKE2 distro
  defaults are irrelevant on cozystack because Kube-OVN overlays
  cozystack's defaults regardless. Old table was actively misleading.

#3 Layer confusion in wizard SKILL.md Phase 4.5 trigger:
  Line 420 read 'When cozystack_intake.platform_variant contains
  system bundle' — platform_variant is a scalar enum, bundles is
  the array of bundles. Fixed to 'When cozystack_intake.bundles
  includes system' — the exact thing fix #10 was meant to keep
  separate.

#4 cluster-install Phase 4 summary 'default for variant: isp-full-generic'
  was parseable as the old joined form. Clarified to
  'default for platform_variant: isp-full-generic on installer_variant:
  generic' — explicitly names both axes.

plugin.json: 1.14.0 -> 1.14.1.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@coderabbitai

coderabbitai Bot commented May 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 828e0b67-66ec-468c-a28d-a9c04ef3a739



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request restructures the repository by consolidating individual skills into two primary plugin bundles, cozystack and linstor, and introduces a wizard orchestrator to streamline the installation process. It adds extensive documentation and reference guides for cluster installation, upgrades, debugging, and node bootstrapping for Talos and Ubuntu environments. A new validation script, check-refs.sh, is also included to maintain cross-reference integrity. Review feedback identifies a bug in the diagnostic bundle's tar command and points out that the skill count for the cozystack plugin needs to be updated for accuracy in the documentation.

kubectl --context $CTX --namespace "$ns" describe hr "$name" > "$DUMP/hr-${ns}-${name}.txt"
done

tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")"

high

The tar command for creating the diagnostic bundle has an incorrect --directory argument. It's set to /tmp, but the directory to be archived (diagnostics-${TS}) is created under $CONFIG_DIR. This will cause the command to fail if $CONFIG_DIR is not /tmp.

Suggested change
tar -czf "${DUMP}.tar.gz" --directory /tmp "$(basename "$DUMP")"
tar -czf "${DUMP}.tar.gz" --directory "$CONFIG_DIR" "$(basename "$DUMP")"

Comment thread README.md
### cozystack

| Plugin | Description |
Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.

medium

The description for the cozystack plugin states that it provides nine skills, but it actually includes ten. This should be updated for accuracy.

Suggested change
Platform skills bundle. One install gives you nine skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.
Platform skills bundle. One install gives you ten skills, invoked as `/cozystack:<name>`. Start with `/cozystack:wizard` — it asks Talos / Ubuntu / Existing and picks the chain.

Comment thread README.md

```text
plugins/
cozystack/ # platform bundle (9 skills)

medium

The repository layout description for cozystack incorrectly states it has 9 skills. This should be updated to 10 to match the actual number of skills in the bundle.

Suggested change
cozystack/ # platform bundle (9 skills)
cozystack/ # platform bundle (10 skills)

@lexfrei lexfrei self-assigned this May 17, 2026
