docs: add concise ACP HCP etcd design (no ManagedEtcd CRD) #1

Open
jiazhiguang wants to merge 17 commits into release-0.2-alauda from dev/release-0.2-alauda

Conversation

@jiazhiguang

Add docs/design/hcp-etcd-design.md, a trimmed, human-readable design for managing etcd under ACP HCP. It carries over the goals of the original acp-hcp-managed-etcd-crd-design.md but corrects them against the current codebase and drops the high-level ManagedEtcd CRD.

Key decisions:

  • No new high-level CRD. A codebase audit showed the low-level EtcdCluster controller already drives etcd membership directly (MemberList/Add/Remove/PromoteLearner via internal/etcdutils), so the capabilities the original design assigned to ManagedEtcd (status aggregation, scheduling protection, single-member recovery) are generic etcd concerns. They belong on EtcdCluster and are potentially upstreamable, not in a separate CRD.

  • Two tracks instead of a wrapper CRD:

    • Track 1 (generic, upstreamable): enrich EtcdCluster. Populate the currently-empty EtcdClusterStatus (members/leader/health/conditions) from the data already gathered in-process; extend podTemplate with nodeSelector/tolerations/affinity/topologySpreadConstraints; add a PDB and a -client Service; add HyperShift-style single-member recovery (Job-based, gated on quorum + gracePeriod).
    • Track 2 (ACP-specific, optional): publish the Kamaji DataStore. Since track 1 makes the client endpoint and -client-tls secret stable and predictable, "who creates the DataStore" is a replaceable integration point with three options: (A) manual/declarative as a zero-code fallback, (B) a watcher in the Kamaji control-plane provider (preferred — keeps the Kamaji dependency out of etcd-operator), (C) an opt-in reconciler inside the operator. None of them add Kamaji fields to EtcdCluster.spec.
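
To make the recovery gate in Track 1 concrete, here is a minimal Python sketch of a "quorum + gracePeriod" check. It is illustrative only: `MemberState`, `quorum_size`, `recovery_candidate`, and the rule that exactly one member may be unhealthy are assumptions made for this example, not names or logic taken from the operator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemberState:
    """Hypothetical per-member view; not the operator's real status type."""
    name: str
    healthy: bool
    unhealthy_seconds: float = 0.0  # how long the member has been unhealthy


def quorum_size(total_members: int) -> int:
    """Smallest majority of a cluster with `total_members` voting members."""
    return total_members // 2 + 1


def recovery_candidate(members: List[MemberState],
                       grace_period_seconds: float) -> Optional[str]:
    """Return the member to rebuild, or None if recovery must not run."""
    healthy = [m for m in members if m.healthy]
    unhealthy = [m for m in members if not m.healthy]
    # Gate 1: the healthy members must still form a quorum.
    if len(healthy) < quorum_size(len(members)):
        return None
    # Gate 2 (assumption): only a single failed member is handled.
    if len(unhealthy) != 1:
        return None
    # Gate 3: the failure must have outlasted the grace period.
    if unhealthy[0].unhealthy_seconds < grace_period_seconds:
        return None
    return unhealthy[0].name
```

With three members, one of which has been unhealthy for longer than the grace period, the function names that member; if quorum is already lost it returns None, which matches the intent that quorum loss is not handled by single-member recovery.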

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)
  • main
  • master
  • ^\d.x$

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch 11 times, most recently from 541b58e to ec4b51d Compare June 5, 2026 14:59
Add docs/design/hcp-etcd-design.md, a concise design for highly-available
managed etcd under ACP HCP. Goal: keep the hosted control plane's etcd
available while management nodes are upgraded/drained.

Structure: benchmark against OpenShift HCP (HyperShift), inventory our gap,
then close it.

- OCP/HyperShift baseline: hosted etcd runs as a 3-member StatefulSet in the
  management cluster and stays quorum-safe across rolling node upgrades via a
  PodDisruptionBudget (one member evicted at a time), pod anti-affinity +
  topology spread, ordered StatefulSet rollout with readiness gating,
  member-level self-healing, and observable status.

- Gap vs OCP: the low-level EtcdCluster controller already drives etcd
  membership directly (MemberList/Add/Remove/PromoteLearner via
  internal/etcdutils), but the HA layers are mostly missing — EtcdClusterStatus
  is an empty struct, podTemplate only carries metadata labels/annotations
  (no affinity/topology), there is no PDB, no recovery workflow, and only a
  headless Service (no client Service).

- Close the gap in two tracks, no new high-level CRD:
  - Track 1 (generic HA, upstreamable): enrich EtcdCluster — populate status
    from the data already gathered in-process; extend podTemplate with
    scheduling fields; add a PDB and a <name>-client Service; add
    HyperShift-style single-member recovery (Job-based, gated on quorum +
    gracePeriod).
  - Track 2 (ACP-specific, optional): publish the Kamaji DataStore. Track 1
    makes the client endpoint and <name>-client-tls secret stable, so "who
    creates the DataStore" is a replaceable integration point: (A) manual as a
    zero-code fallback, (B) a watcher in the Kamaji control-plane provider
    (preferred — keeps the Kamaji dependency out of etcd-operator), (C) an
    opt-in reconciler inside the operator. None add Kamaji fields to
    EtcdCluster.spec.

No high-level CRD is introduced: the missing pieces are generic etcd HA
features that belong on EtcdCluster itself; only DataStore publishing is
ACP-specific and does not warrant a separate CRD.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
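
The PDB argument in the commit message above reduces to a small piece of arithmetic. The sketch below models it in Python under stated assumptions: `disruptions_allowed` follows the `maxUnavailable` accounting of a PodDisruptionBudget (desired healthy pods = expected pods minus `maxUnavailable`), and `fault_tolerance` is the usual majority-quorum bound. Both function names are hypothetical, and the model ignores everything else the eviction API considers.

```python
def fault_tolerance(members: int) -> int:
    """Members a majority-quorum cluster can lose and still commit writes."""
    return (members - 1) // 2


def disruptions_allowed(expected_pods: int, healthy_pods: int,
                        max_unavailable: int = 1) -> int:
    """Voluntary evictions a maxUnavailable-style budget would still permit."""
    desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)
```

For three members with `maxUnavailable: 1`, one eviction is permitted while all members are healthy and none once a member is already down, so voluntary disruption never exceeds the single failure a 3-member cluster tolerates.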
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch 2 times, most recently from 0672077 to 473f603 Compare June 5, 2026 16:26
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch 5 times, most recently from d165ffa to 9338aa5 Compare June 9, 2026 10:42
Design for running etcd as the datastore of ACP HCP hosted control planes,
kept highly available across management-node rolling upgrades and supporting
etcd version upgrades. Benchmarked against OpenShift HCP (HyperShift).

Contents:
- Benchmark vs OCP and gap analysis; EtcdCluster CRD fields and the
  StorageClass contract (per-member RWO local PV).
- Deploy & upgrade overview: deploy flow (management nodes + TopoLVM →
  operator → EtcdCluster → optional Kamaji DataStore); two upgrade paths —
  node rolling (serialized by PDB) and etcd version (serialized by the readyz
  probe), with the two guarantees: PDB keeps the other members available
  before a member is disrupted, and readyz only reports a started, voting
  member (leader/follower) as Ready.
- Management-plane changes: dedicated CAPI MachineDeployment (label
  cpaas.io/hcp-management-node), MachineConfigPool planning of
  IP/hostname/disk, TopoLVM local storage (TopolvmCluster CR + sc-topolvm-vdc),
  Baremetal Provider reusing the old node's IP/hostname/disk on roll, and the
  nodeDrainTimeout=0 invariant (never drain while a PDB is unsatisfied).
- etcd-operator changes: status, :9980 readyz/healthz probe (serializable
  health plus an added not-learner check), scheduling fields, PDB
  (maxUnavailable:1 + AlwaysAllow), version upgrade (downgrade rejected;
  minor-by-minor enforced by the upgrade flow), conditional reset-member
  initContainer, client Service, single-member recovery.
- DataStore publishing (optional); create & recovery workflows; observability
  & operations (where to read live status, stuck-not-crashed failure behavior,
  when manual intervention is needed, troubleshooting playbook); future work.

Single-member recovery only handles genuine data loss/corruption; on ACP the
PV/disk is reused on node roll, so members rejoin with their data
automatically without recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
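
The readyz rule described in the commit message above (Ready only for a started, voting member) can be sketched as a pure function. This is a hedged illustration: `MemberProbe` and its fields are hypothetical stand-ins for whatever the probe on :9980 actually inspects, and the HTTP status codes are an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MemberProbe:
    """Hypothetical snapshot of one member as seen by the probe."""
    started: bool               # etcd process is up and serving
    serializable_read_ok: bool  # local (serializable) health read succeeded
    is_learner: bool            # member has not been promoted to voter yet


def readyz(probe: MemberProbe) -> Tuple[int, str]:
    """Return (HTTP status, reason); 200 only for a started, voting member."""
    if not probe.started:
        return 503, "member not started"
    if not probe.serializable_read_ok:
        return 503, "serializable health check failed"
    if probe.is_learner:
        # The added not-learner check: learners stay NotReady until promoted.
        return 503, "member is a learner"
    return 200, "ok"
```

Keeping the learner check last means a learner that is otherwise healthy reports a distinct reason, which may help when reading probe failures during a rolling version upgrade.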
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 9338aa5 to a6c3402 Compare June 9, 2026 10:49
zgjia and others added 4 commits June 10, 2026 07:06
Reorganize the doc into Background → Goal/Non-Goal → Core Summary →
OCP Upgrade Benchmark → detailed sections, and add §13 Backup and Restore
(manual etcd snapshot + Velero backup of the hosted control-plane
namespace, restore flow), benchmarked against OCP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e file

Revert hcp-etcd-design.md to the original (a6c3402) and add the
restructured + backup/restore version as hcp-etcd-design-v2.md, so both
revisions coexist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Background ends after the gap list with a one-line wrap-up
- Remove the "no new high-level CRD" item from Goal/Non-Goal
- Rename Core Summary → Summary
- Reword the quorum trade-off sentence in plain language

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Condense explanatory wording across every section; no design content,
tables, diagrams, code blocks, or cross-references removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 02b2168 to e84e2f3 Compare June 10, 2026 07:58
Keep the design discussion (what to back up with which tool, key
trade-offs, restore order); move the concrete runbook out as a separate
deliverable. Retitle to "Approach" and update cross-references accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 6c9e2ab to 9439c3a Compare June 10, 2026 08:20
zgjia and others added 5 commits June 10, 2026 08:24
… handling

PDB-constrained drain is not unique to etcd (any PDB-backed service
constrains drain); reframe from "difference" to "same mechanism, the
burden is on etcd's own PDB + probes".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reframe §4 as four parallel dimensions (deploy / node upgrade / etcd
version upgrade / etcd HA), add the HCP node-topology isolation tiers
(Shared Everything / Shared Nothing / Dedicated Request Serving) table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a four-dimension summary table (deploy topology / node upgrade /
etcd version upgrade / etcd HA), keep the detailed HA-mechanism table.
Correct §4 point 1: shared mgmt node pool across hosted clusters is
Shared Everything (first tier supported), not Shared Nothing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… etcd-operator

- Add a divider after §5: chapters before it are background/benchmarking,
  after it is the formal design.
- Move EtcdCluster CRD into §8.1 (first under etcd-operator), fold the full
  example there as a collapsed <details>.
- Rename HA Changes → Production-Readiness Changes (§7/§8/§9).
- Renumber all sections and cross-references accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 45f5795 to 2251f69 Compare June 10, 2026 09:05
- §7.1: list-format key points + zone-label (topology.kubernetes.io/zone) requirement
- §3 point 3: add spec.deletion.nodeDrainTimeout=0/unset prerequisite
- §12: trim trade-offs to a single restore-order line
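- §13 + §2 Non-Goal: drop permanent machine replacement / local PV migration
- §4: remove the inaccurate "increasing isolation strength" wording
- §5: clarify hard downgrade check → etcd version skew check (blocks downgrades, restricts cross-minor upgrades)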

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
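
As an illustration of the §5 skew check (block downgrades, restrict cross-minor jumps), here is a small Python sketch. `parse_version` and `validate_upgrade` are hypothetical helpers; the sketch assumes plain `MAJOR.MINOR.PATCH` versions with an optional `v` prefix, and rejecting major-version changes is an assumption added for completeness.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'v3.5.17' or '3.5.17' into (3, 5, 17); no pre-release support."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def validate_upgrade(current: str, target: str) -> Optional[str]:
    """Return an error message if the version change is not allowed, else None."""
    cur = parse_version(current)
    tgt = parse_version(target)
    if tgt < cur:
        return f"downgrade from {current} to {target} is rejected"
    if tgt[0] != cur[0]:
        # Assumption: major-version changes are out of scope for this check.
        return f"major version change from {current} to {target} is not supported"
    if tgt[1] - cur[1] > 1:
        return f"upgrade from {current} to {target} skips a minor version"
    return None
```

Under these assumptions, 3.5.17 to 3.6.0 passes, while 3.4.30 to 3.6.0 must be split into two steps through 3.5.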
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 24c8616 to a08c88e Compare June 10, 2026 09:26
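Production-ready etcd design for ACP HCP hosted control planes built on Kamaji + Cluster API, benchmarked against OCP HCP (HyperShift).

- Analysis: background, Goal/Non-Goal, summary, OCP benchmark (deployment topology / node upgrade / version upgrade / etcd HA), gap inventory.
- Approach: overall deployment and upgrade flow; production-readiness changes (management-plane node pool + TopoLVM disk reuse + PDB; etcd-operator covering CRD, status, readyz probe, scheduling fields, PDB, version upgrade, reset-member, client Service, single-member self-healing; DataStore publishing); workflows; observability and operations; backup and restore approach (etcd snapshot + Velero).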

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 5a68ef1 to 3cd2875 Compare June 10, 2026 09:32
Summarize how Hosted Control Planes (HyperShift) handle DR: manual
runbook vs scheduled backup, etcd snapshot and OADP/Velero mechanics,
upgrade-time backup behavior, backup storage location, fleet-scale
backup, and whether OADP can back up and restore the etcd PV.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch 7 times, most recently from 8bdb04a to 20f9c4c Compare June 11, 2026 07:54
…tidy §8/§10

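§2 Goal: keep etcd available while HCP managed nodes are upgraded /
support etcd version upgrades / automatically recover from single-member
etcd failures / define the disaster-recovery approach (§12). Drops the
generic "scaling → high availability" and Kamaji/DataStore bullets and
the "this iteration" qualifiers on the Goal/Non-Goal headers.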

§2 Non-Goal: backup/restore automation; the concrete runbook for
backup-restore and quorum-loss recovery (manual, documented
separately); TLS cert rotation; auto defrag. Quorum-loss recovery is
a manual runbook, not deferred automation.

Management-cluster placement and StatefulSet rendering (§3, §7.1, §8):
- HCP control plane (incl. etcd) must run on management-cluster WORKER
  nodes, never master/control-plane — on a master the OVN
  (ovn-kubernetes) pods can never be evicted, so node drain never
  completes and blocks machine replacement / upgrades.
- etcd pods get a high-priority PriorityClass (e.g.
  system-cluster-critical) so they aren't preempted / node-pressure
  evicted; operator sets a default, overridable via podTemplate.
- One etcd per HCP cluster (one EtcdCluster). The etcd StatefulSet
  uses podManagementPolicy: Parallel — since learners are
  intentionally NotReady (§8.3), the default OrderedReady would stall
  the StatefulSet on a NotReady learner. Membership is serialized by
  the operator (one learner at a time) and readyz, not by pod ordering.
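
The reasoning for `podManagementPolicy: Parallel` can be seen in a toy model of StatefulSet scale-up. The function below is a simplification written for this note (`ordinals_to_create` is a hypothetical name, and a real controller also considers Running state, termination, and updates): OrderedReady creates the next ordinal only when every existing pod is Ready, while Parallel creates all missing ordinals at once.

```python
from typing import List


def ordinals_to_create(policy: str, existing_ready: List[bool],
                       replicas: int) -> List[int]:
    """Ordinals a simplified StatefulSet controller would create right now.

    existing_ready[i] is the readiness of the already-created pod with ordinal i.
    """
    next_ordinal = len(existing_ready)
    if next_ordinal >= replicas:
        return []
    if policy == "Parallel":
        # Readiness of earlier pods does not gate creation.
        return list(range(next_ordinal, replicas))
    if policy == "OrderedReady":
        # Every predecessor must be Ready before the next pod is created.
        return [next_ordinal] if all(existing_ready) else []
    raise ValueError(f"unknown podManagementPolicy: {policy}")
```

In this model a NotReady learner at ordinal 1 blocks ordinal 2 under OrderedReady but not under Parallel, which is the stall the commit message describes.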

§12 is now tiered disaster recovery rather than just etcd backup:
- Single-member failure (quorum intact) self-heals via §10.2 — no
  snapshot, no downtime. Quorum loss / control-plane resource loss
  restores from backup, hosted apiserver unavailable meanwhile.
- Two tools: Velero backs up the control-plane namespace resources
  (EtcdCluster CR, TLS Secret, DataStore) AND PV volumes (incl.
  etcd's PVC/PV), batching many hosted clusters via includedNamespaces
  (mirrors OADP); etcd snapshot gives a single-cluster consistent
  point-in-time image.
- Restore order "resources first, then data"; backups land in
  S3-compatible object storage.

§8.2 status conditions: SingleMemberRecoveryActive is the live
"recovery in progress" signal; status.recovery keeps only history
(lastResult, lastRecoveredMember), dropping the duplicate active
field. SingleMemberDegraded dropped as derivable from members[].healthy
+ QuorumAvailable. §10.2/§11.3 reference SingleMemberRecoveryActive /
recovery.lastResult accordingly.
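
The claim that SingleMemberDegraded is derivable can be illustrated with a short derivation. The sketch below assumes a status shape with a per-member `healthy` flag and a `QuorumAvailable` boolean; the function name and the exact rule (exactly one unhealthy member while quorum holds) are assumptions, since the text above does not spell the rule out.

```python
from typing import List


def single_member_degraded(members_healthy: List[bool],
                           quorum_available: bool) -> bool:
    """Derive the dropped condition from members[].healthy + QuorumAvailable."""
    unhealthy = sum(1 for healthy in members_healthy if not healthy)
    # Assumed rule: degraded means one bad member and quorum still intact.
    return quorum_available and unhealthy == 1
```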

§8.1 example / §8.4: drop the hostname topologySpread that duplicated
the hostname podAntiAffinity; anti-affinity alone enforces one member
per node. Zone spread (ScheduleAnyway) is the opt-in for failure-domain
distribution.

§10.2: NOSPACE is no longer a single-member rebuild trigger — it's a
cluster-wide quota/fragmentation alarm whose fix is compact + defrag +
disarm (needs the §13 auto-defrag, not in scope this iteration), so it
only alarms. Auto-rebuild handles CORRUPT / member-missing /
db-load-failure only.
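
The trigger split described for §10.2 can be written down as a small lookup. This is a sketch, not the operator's code: the trigger strings are taken from the paragraph above, while `Action`, the set names, and `action_for` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    REBUILD_MEMBER = "rebuild-member"  # single-member auto-rebuild
    ALARM_ONLY = "alarm-only"          # surface the problem, do not rebuild
    NONE = "none"                      # unknown trigger: take no action


# Triggers that the text says auto-rebuild handles.
AUTO_REBUILD_TRIGGERS = frozenset({"CORRUPT", "member-missing", "db-load-failure"})
# NOSPACE is cluster-wide, so rebuilding one member would not fix it.
ALARM_ONLY_TRIGGERS = frozenset({"NOSPACE"})


def action_for(trigger: str) -> Action:
    """Map a detected trigger to the action the recovery flow would take."""
    if trigger in AUTO_REBUILD_TRIGGERS:
        return Action.REBUILD_MEMBER
    if trigger in ALARM_ONLY_TRIGGERS:
        return Action.ALARM_ONLY
    return Action.NONE
```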

Cross-references (§1, part intro, §11.3) say "disaster recovery" instead of "backup and restore".
§14 adds the OCP HCP DR research doc plus the manual snapshot and OADP
runbooks. §10.2 health-check Job ETCD_POD_SELECTOR uses app=<name>,
matching operator pod labels (utils.go:163) and the §8.1 example.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jiazhiguang jiazhiguang force-pushed the dev/release-0.2-alauda branch from 20f9c4c to f4e3bef Compare June 11, 2026 10:00