docs: add concise ACP HCP etcd design (no ManagedEtcd CRD) #1
Open
jiazhiguang wants to merge 17 commits into
541b58e to
ec4b51d
Compare
Add docs/design/hcp-etcd-design.md, a concise design for highly-available
managed etcd under ACP HCP. Goal: keep the hosted control plane's etcd
available while management nodes are upgraded/drained.
Structure: benchmark against OpenShift HCP (HyperShift), inventory our gap,
then close it.
- OCP/HyperShift baseline: hosted etcd runs as a 3-member StatefulSet in the
management cluster and stays quorum-safe across rolling node upgrades via a
PodDisruptionBudget (one member evicted at a time), pod anti-affinity +
topology spread, ordered StatefulSet rollout with readiness gating,
member-level self-healing, and observable status.
- Gap vs OCP: the low-level EtcdCluster controller already drives etcd
membership directly (MemberList/Add/Remove/PromoteLearner via
internal/etcdutils), but the HA layers are mostly missing — EtcdClusterStatus
is an empty struct, podTemplate only carries metadata labels/annotations
(no affinity/topology), there is no PDB, no recovery workflow, and only a
headless Service (no client Service).
- Close the gap in two tracks, no new high-level CRD:
  - Track 1 (generic HA, upstreamable): enrich EtcdCluster: populate status
    from the data already gathered in-process; extend podTemplate with
    scheduling fields; add a PDB and a <name>-client Service; add
    HyperShift-style single-member recovery (Job-based, gated on quorum +
    gracePeriod).
  - Track 2 (ACP-specific, optional): publish the Kamaji DataStore. Track 1
    makes the client endpoint and <name>-client-tls secret stable, so "who
    creates the DataStore" is a replaceable integration point: (A) manual as a
    zero-code fallback, (B) a watcher in the Kamaji control-plane provider
    (preferred, since it keeps the Kamaji dependency out of etcd-operator),
    (C) an opt-in reconciler inside the operator. None of these adds Kamaji
    fields to EtcdCluster.spec.
No high-level CRD is introduced: the missing pieces are generic etcd HA
features that belong on EtcdCluster itself; only DataStore publishing is
ACP-specific and does not warrant a separate CRD.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
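The "one member evicted at a time" guarantee in the commit message above can be illustrated with a small model. This is only a sketch in Python, not the operator's code: the function names are hypothetical, the cluster is reduced to a list of Ready flags, and only the eviction of a healthy member under a `maxUnavailable` budget is modeled.

```python
def quorum(size: int) -> int:
    """Votes needed by an etcd cluster with `size` voting members."""
    return size // 2 + 1


def pdb_allows_eviction(members_ready: list[bool], max_unavailable: int = 1) -> bool:
    """Hypothetical model of a PDB with maxUnavailable: evicting one more
    healthy member is allowed only if the unavailable count stays in budget."""
    unavailable = members_ready.count(False)
    return unavailable + 1 <= max_unavailable


def quorum_survives_eviction(members_ready: list[bool]) -> bool:
    """True if the cluster still has a quorum after one Ready member is evicted."""
    ready_after = members_ready.count(True) - 1
    return ready_after >= quorum(len(members_ready))


if __name__ == "__main__":
    healthy = [True, True, True]
    degraded = [True, True, False]
    print(pdb_allows_eviction(healthy), quorum_survives_eviction(healthy))
    print(pdb_allows_eviction(degraded), quorum_survives_eviction(degraded))
```

For a 3-member cluster the two checks agree: the budget blocks an eviction exactly when that eviction would drop the cluster below quorum. That is why a drain has to wait for the previous member to become Ready again.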
0672077 to
473f603
Compare
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d165ffa to
9338aa5
Compare
Design for running etcd as the datastore of ACP HCP hosted control planes, kept highly available across management-node rolling upgrades and supporting etcd version upgrades. Benchmarked against OpenShift HCP (HyperShift).
Contents:
- Benchmark vs OCP and gap analysis; EtcdCluster CRD fields and the StorageClass contract (per-member RWO local PV).
- Deploy & upgrade overview: deploy flow (management nodes + TopoLVM → operator → EtcdCluster → optional Kamaji DataStore); two upgrade paths, node rolling (serialized by PDB) and etcd version (serialized by the readyz probe), with two guarantees: the PDB keeps the other members available before a member is disrupted, and readyz reports only a started, voting member (leader/follower) as Ready.
- Management-plane changes: dedicated CAPI MachineDeployment (label cpaas.io/hcp-management-node), MachineConfigPool planning of IP/hostname/disk, TopoLVM local storage (TopolvmCluster CR + sc-topolvm-vdc), Baremetal Provider reusing the old node's IP/hostname/disk on roll, and the nodeDrainTimeout=0 invariant (never drain while a PDB is unsatisfied).
- etcd-operator changes: status, :9980 readyz/healthz probe (serializable health plus an added not-learner check), scheduling fields, PDB (maxUnavailable:1 + AlwaysAllow), version upgrade (downgrade rejected; minor-by-minor enforced by the upgrade flow), conditional reset-member initContainer, client Service, single-member recovery.
- DataStore publishing (optional); create & recovery workflows; observability & operations (where to read live status, stuck-not-crashed failure behavior, when manual intervention is needed, troubleshooting playbook); future work.
Single-member recovery only handles genuine data loss/corruption; on ACP the PV/disk is reused on node roll, so members rejoin with their data automatically without recovery.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
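The version rule mentioned above (downgrade rejected, upgrades one minor at a time) can be sketched as a single check. This is an illustration under assumptions, not the operator's implementation: `parse_version` and `check_upgrade` are hypothetical names, versions are assumed to be plain `MAJOR.MINOR.PATCH` strings, and the two rules are merged into one function even though the design places the minor-by-minor rule in the upgrade flow.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'v3.5.9' or '3.5.9' into (3, 5, 9). Pre-release tags are not handled."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def check_upgrade(current: str, target: str) -> str:
    """Return 'allowed' or a rejection reason for moving current -> target."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt < cur:
        # Tuples compare element by element, so this covers any lower version.
        return "rejected: downgrade"
    if tgt[0] != cur[0] or tgt[1] - cur[1] > 1:
        return "rejected: not a single-minor step"
    return "allowed"


if __name__ == "__main__":
    for cur, tgt in [("3.5.9", "3.5.12"), ("3.5.9", "3.6.0"),
                     ("3.5.9", "3.4.30"), ("3.4.30", "3.6.0")]:
        print(cur, "->", tgt, ":", check_upgrade(cur, tgt))
```

A patch bump or a single minor step passes; anything lower than the running version, or more than one minor ahead, is refused before any member is restarted.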
9338aa5 to
a6c3402
Compare
Reorganize the doc into Background → Goal/Non-Goal → Core Summary → OCP Upgrade Benchmark → detailed chapters, and add §13 Backup and Restore (manual etcd snapshot + Velero backup of the hosted control-plane namespace, restore flow), benchmarked against OCP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e file
Revert hcp-etcd-design.md to the original (a6c3402) and add the restructured + backup/restore version as hcp-etcd-design-v2.md, so both revisions coexist.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Background ends after the gap list with a one-line wrap-up
- Remove the "no new high-level CRD" item from Goal/Non-Goal
- Rename "Core Summary" → "Summary"
- Reword the quorum trade-off sentence in plain language
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Condense explanatory wording across every section; no design content, tables, diagrams, code blocks, or cross-references removed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
02b2168 to
e84e2f3
Compare
Keep the design discussion (what to back up with which tool, key trade-offs, restore order); move the concrete runbook out as a separate deliverable. Retitle the section to "Plan" and update cross-references accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6c9e2ab to
9439c3a
Compare
… handling
PDB-constrained drain is not unique to etcd (any PDB-backed service constrains drain); reframe from "difference" to "same mechanism, the burden is on etcd's own PDB + probes".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reframe §4 as four parallel dimensions (deploy / node upgrade / etcd version upgrade / etcd HA), and add a table of the HCP node-topology isolation tiers (Shared Everything / Shared Nothing / Dedicated Request Serving).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a four-dimension summary table (deploy topology / node upgrade / etcd version upgrade / etcd HA) and keep the detailed HA-mechanism table. Correct §4 point 1: a management node pool shared across hosted clusters is Shared Everything (the first supported tier), not Shared Nothing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… etcd-operator
- Add a divider after §5: chapters before it are background/benchmarking, after it is the formal design.
- Move EtcdCluster CRD into §8.1 (first under etcd-operator), fold the full example there as a collapsed <details>.
- Rename "High-availability changes" → "Production-readiness changes" (§7/§8/§9).
- Renumber all sections and cross-references accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
45f5795 to
2251f69
Compare
- §7.1: list-format key points + zone-label (topology.kubernetes.io/zone) requirement
- §3 point 3: add spec.deletion.nodeDrainTimeout=0/unset prerequisite
- §12: trim trade-offs to a single restore-order line
- §13 + §2 Non-Goal: drop permanent machine replacement / local PV migration
- §4: remove the inaccurate "isolation strength increases" claim
- §5: clarify "hard downgrade check" → "etcd version skew check" (blocks downgrades, restricts skipping minor versions)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
24c8616 to
a08c88e
Compare
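Production-readiness design for the etcd of ACP HCP hosted control planes, based on Kamaji + Cluster API and benchmarked against OCP HCP (HyperShift).
- Analysis: background, Goal/Non-Goal, summary, OCP benchmark (deployment topology / node upgrade / version upgrade / etcd HA), gap inventory.
- Design: overall deployment and upgrade flow; production-readiness changes (management node pool + TopoLVM disk reuse + PDB; etcd-operator covering CRD, status, readyz probe, scheduling fields, PDB, version upgrade, reset-member, client Service, single-member self-healing; DataStore publishing); workflows; observability and operations; backup and restore plan (etcd snapshot + Velero).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>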
5a68ef1 to
3cd2875
Compare
Summarize how Hosted Control Planes (HyperShift) handle DR: manual runbook vs scheduled backup, etcd snapshot and OADP/Velero mechanics, upgrade-time backup behavior, backup storage location, fleet-scale backup, and whether OADP can back up and restore the etcd PV.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8bdb04a to
20f9c4c
Compare
…tidy §8/§10
§2 Goal: keep etcd available while HCP managed nodes are upgraded / support etcd version upgrades / automatically recover from single-member etcd failures / define the disaster-recovery plan (§12). Drops the generic "scaling → high availability" and Kamaji/DataStore bullets and the "this iteration" qualifiers on the Goal/Non-Goal headers.
§2 Non-Goal: backup/restore automation; the concrete runbook for backup-restore and quorum-loss recovery (manual, documented separately); TLS cert rotation; auto defrag. Quorum-loss recovery is a manual runbook, not deferred automation.
Management-cluster placement and StatefulSet rendering (§3, §7.1, §8):
- HCP control plane (incl. etcd) must run on management-cluster WORKER nodes, never master/control-plane: on a master the OVN (ovn-kubernetes) pods can never be evicted, so node drain never completes and blocks machine replacement / upgrades.
- etcd pods get a high-priority PriorityClass (e.g. system-cluster-critical) so they aren't preempted / node-pressure evicted; the operator sets a default, overridable via podTemplate.
- One etcd per HCP cluster (one EtcdCluster). The etcd StatefulSet uses podManagementPolicy: Parallel. Since learners are intentionally NotReady (§8.3), the default OrderedReady would stall the StatefulSet on a NotReady learner. Membership is serialized by the operator (one learner at a time) and readyz, not by pod ordering.
§12 is now tiered disaster recovery rather than just etcd backup:
- Single-member failure (quorum intact) self-heals via §10.2, with no snapshot and no downtime. Quorum loss / control-plane resource loss restores from backup; the hosted apiserver is unavailable meanwhile.
- Two tools: Velero backs up the control-plane namespace resources (EtcdCluster CR, TLS Secret, DataStore) AND PV volumes (incl. etcd's PVC/PV), batching many hosted clusters via includedNamespaces (mirrors OADP); etcd snapshot gives a single-cluster consistent point-in-time image.
- Restore order "resources first, then data"; backups land in S3-compatible object storage.
§8.2 status conditions: SingleMemberRecoveryActive is the live "recovery in progress" signal; status.recovery keeps only history (lastResult, lastRecoveredMember), dropping the duplicate active field. SingleMemberDegraded is dropped as derivable from members[].healthy + QuorumAvailable. §10.2/§11.3 reference SingleMemberRecoveryActive / recovery.lastResult accordingly.
§8.1 example / §8.4: drop the hostname topologySpread that duplicated the hostname podAntiAffinity; anti-affinity alone enforces one member per node. Zone spread (ScheduleAnyway) is the opt-in for failure-domain distribution.
§10.2: NOSPACE is no longer a single-member rebuild trigger. It is a cluster-wide quota/fragmentation alarm whose fix is compact + defrag + disarm (needs the §13 auto-defrag, not in scope this iteration), so it only alarms. Auto-rebuild handles CORRUPT / member-missing / db-load-failure only.
Cross-references (§1, part intro, §11.3) say "disaster recovery" instead of "backup and restore". §14 adds the OCP HCP DR research doc plus the manual snapshot and OADP runbooks.
§10.2 health-check Job ETCD_POD_SELECTOR uses app=<name>, matching operator pod labels (utils.go:163) and the §8.1 example.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
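The §10.2 trigger rules described above (NOSPACE only alarms; CORRUPT / member-missing / db-load-failure rebuild a single member while quorum is intact; quorum loss falls back to the manual runbook) can be summarized as a decision function. This is a sketch under assumptions: the `Action` enum, the `decide` function and the symptom strings are hypothetical labels chosen for illustration, and the gracePeriod gate is left out.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALARM_ONLY = "alarm-only"
    REBUILD_MEMBER = "rebuild-member"
    MANUAL_RUNBOOK = "manual-runbook"


# Symptom labels are illustrative; the real detection signals may differ.
REBUILD_TRIGGERS = {"CORRUPT", "member-missing", "db-load-failure"}
ALARM_ONLY_TRIGGERS = {"NOSPACE"}


def decide(symptom: str, quorum_available: bool) -> Action:
    """Map a detected symptom to the recovery action described in the design."""
    if symptom in ALARM_ONLY_TRIGGERS:
        # Cluster-wide quota/fragmentation issue: rebuilding one member does not help.
        return Action.ALARM_ONLY
    if symptom in REBUILD_TRIGGERS:
        # Automatic single-member rebuild is only attempted while quorum is intact.
        return Action.REBUILD_MEMBER if quorum_available else Action.MANUAL_RUNBOOK
    return Action.NONE


if __name__ == "__main__":
    print(decide("NOSPACE", True))
    print(decide("CORRUPT", True))
    print(decide("CORRUPT", False))
```

Keeping NOSPACE out of the rebuild set reflects the reasoning in the commit message: the condition affects the whole cluster, so replacing one member would not clear it.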
20f9c4c to
f4e3bef
Compare
Add docs/design/hcp-etcd-design.md, a trimmed, human-readable design for managing etcd under ACP HCP. It carries over the goals of the original acp-hcp-managed-etcd-crd-design.md but corrects them against the current codebase and drops the high-level ManagedEtcd CRD.
Key decisions:
No new high-level CRD. A codebase audit showed that the low-level EtcdCluster controller already drives etcd membership directly (MemberList/Add/Remove/PromoteLearner via internal/etcdutils), so the capabilities the original design assigned to ManagedEtcd (status aggregation, scheduling protection, single-member recovery) are generic etcd concerns. They belong on EtcdCluster, where they are potentially upstreamable, rather than in a separate CRD.
Two tracks instead of a wrapper CRD: