From 2e083b04013d5e9009a0df6f11f8f95fe67c339e Mon Sep 17 00:00:00 2001
From: Ava
Date: Tue, 5 May 2026 13:30:23 +0800
Subject: [PATCH] docs(smoke-test-preflight): add smoke test pre-flight
 checklist guide v1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

7-section preflight for KB addon smoke runs covering: (1) BackupRepo
precondition, (2) ImagePullSecret provisioning, (3) default StorageClass
check, (4) live ACR pull probe, (5) vcluster API stability / NO_PROXY,
(6) alpine init-container autopatcher, (7) vcluster CoreDNS image
preflight, newly added based on the 2026-05-05 idc4 incident.

Section 7 doctrine: if dataprotection / cross-pod-network test cases fail
on the first run while the cluster is Running and smoke T01-T07 PASS,
check coredns BEFORE the addon code. The symptom is pod-level DNS
resolution failure, but the cluster surface looks healthy because
exec-based smoke tests do not need DNS.

Case study appendix: the Oracle 19c T08 ORA-12154 investigation traced to
a CoreDNS root cause, fixed by an image swap on idc4
(docker.io/coredns/coredns:1.10.1 in ImagePullBackOff swapped to
registry.aliyuncs.com/google_containers/coredns:1.10.1, Running 1/1 in
9s). Backup o19-i4-8854-rman19c-w7verify2 Status=Completed, 553MB, 2m32s
after the fix.

One-shot preflight script updated: a 7-item check covering all sections,
with coredns Running validation as item 7.

This guide is the proactive "before-smoke" counterpart to the
first-blocker / smoke result classification doctrine (PR #69).
Cross-refs to:
- addon-vanilla-vcluster-bootstrap-guide.md (autopatcher + dual-image setup)
- addon-idc-vcluster-migration-checklist-guide.md (Alice, IDC checklist owner)
- addon-kb-schema-version-preflight-guide.md (schema-side preflight, PR #70)

Co-Authored-By: Claude Opus 4.7
---
 ...n-smoke-test-pre-flight-checklist-guide.md | 307 ++++++++++++++++++
 1 file changed, 307 insertions(+)
 create mode 100644 docs/addon-smoke-test-pre-flight-checklist-guide.md

diff --git a/docs/addon-smoke-test-pre-flight-checklist-guide.md b/docs/addon-smoke-test-pre-flight-checklist-guide.md
new file mode 100644
index 0000000..f2914fb
--- /dev/null
+++ b/docs/addon-smoke-test-pre-flight-checklist-guide.md
@@ -0,0 +1,307 @@
# Addon Smoke Test Pre-Flight Checklist Guide

> **Audience**: addon dev / test, especially anyone running smoke in a vcluster / private idc / environment without cloud-managed resources
> **Status**: draft v0.1 (2026-05-05)
> **Applies to**: smoke / functional / compatibility testing of any KB addon
> **Applies to KB version**: any
> **Sibling docs**:
> - `addon-test-script-preflight-guide.md`: client-side state (kube context / proxy)
> - `addon-test-environment-gate-hygiene-guide.md`: single-line single-environment post-restart gate
> - `addon-vanilla-vcluster-bootstrap-guide.md`: bootstrap of the vcluster environment itself

## What This Guide Solves

Many addon smoke / functional tests assume preconditions that only a **cloud-managed KB cluster** provides, for example:

- `BackupPolicy` is auto-bound to a **default BackupRepo** (cluster-scoped), so a `Backup` CR executes as soon as it is created
- `imagePullSecrets` were propagated to every namespace at KB install time
- images are already mirrored on the default nodes
- coredns / the vcluster syncer never rewrite init container images

Move the smoke scripts to a **bare vcluster / private idc / self-deployed KB** and these assumptions fail wholesale. What you see is backup/restore cases such as `T08`/`T11` failing, `ImagePullBackOff`, and init container `cp: can't stat '/bin/k3s'`, while the addon code itself is fine.

This guide is a **7-step pre-flight checklist to run before smoke**, freezing these environment preconditions up front so that environment gaps are not misdiagnosed as addon bugs.

## Goals

- Finish the checklist in 5 minutes with nothing missed
- One cluster-scoped resource manifest reused across all engines
- On failure, immediately distinguish "environment gap" from "addon bug"

## The 7-Step Checklist

### 1. BackupRepo provisioning (required for dataprotection cases)

**Symptom if missing**: the Backup CR goes phase=Failed after creation with `failureReason: no default BackupRepo found`. The entire backup path fails, while the cluster stays Running.

BackupRepo is a **cluster-scoped** resource. KB auto-creates the `BackupPolicy` (namespace-scoped, generated from the cluster template), but the BackupRepo must be created **manually** and marked default. On most cloud-managed KB it is pre-provisioned by the control plane; in a private idc you must build it yourself.

**Minimal working manifest** (PVC + local hostpath SC):

```yaml
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: BackupRepo
metadata:
  name: oracle-backup-repo-pvc   # the name does not affect behavior
  annotations:
    dataprotection.kubeblocks.io/is-default-repo: "true"
spec:
  storageProviderRef: pvc        # see §1.1 for choosing a provider
  pvReclaimPolicy: Retain
  volumeCapacity: 50Gi
  config:
    storageClassName: <an SC present in your environment>   # e.g. apelocal-hostpath-default
```

#### 1.1 Choosing storageProviderRef

```bash
kubectl get storageprovider
```

How to read the output:
- `pvc` Ready: always the first choice (uses a local SC, no external object-storage dependency)
- `s3` / `oss` / `cos` Ready: available on cloud; usually NotReady in a private idc
- `ftp` / `azureblob` Ready: only if your idc really has an ftp/azure endpoint

**The `pvc` provider is the most reliable choice for private-idc smoke**: PVs are provisioned by a local SC with no network dependency.

#### 1.2 Verifying Ready

```bash
kubectl get backuprepo
# NAME                     STATUS   STORAGEPROVIDER   ACCESSMETHOD   DEFAULT
# oracle-backup-repo-pvc   Ready    pvc               Mount          true

kubectl get backuprepo oracle-backup-repo-pvc -o jsonpath='{.status.conditions[*].type}'
# all 5 conditions True: StorageProviderReady ParametersChecked StorageClassCreated PVCTemplateChecked PreCheckPassed
```

The repo is usable only when all 5 conditions are True. Running a backup with any one missing will fail.

### 2. ImagePullSecret provisioning

**Symptom if missing**: oracle/mysql/etc. workload pods in the cluster are stuck in `ImagePullBackOff`; events show `pulling from <registry> failed: pull access denied`.

**Doctrine**: the pullSecret for the private ACR / image registry must also exist in the **test namespace**, not only in `kb-system`.

Minimal operation:

```bash
# 1. kb-system already has the pullSecret (created at KB install time); copy it over
kubectl get secret apecloud-registry-cred -n kb-system -o yaml \
  | sed 's/namespace: kb-system/namespace: oracle-test/' \
  | kubectl apply -f -

# 2. verify
kubectl get secret apecloud-registry-cred -n oracle-test
```

If running inside a vcluster, the same secret must also be created in the vcluster syncer's default namespace on the host k8s (`oracle-runner-host` or similar); see §6.

### 3. StorageClass provisioning

**Symptom if missing**: PVCs stay Pending after cluster creation; events show `no persistent volumes available for this claim`.

```bash
kubectl get sc
# expect at least one SC carrying the default annotation:
# storageclass.kubernetes.io/is-default-class=true
```

If there is no default SC, the cluster chart must pass `--set storageClass=<sc-name>` explicitly.

### 4. Live ACR pull probe

**Symptom if missing**: even with the §2 secret in place, pods still sit in ImagePullBackOff, which may mean the **registry network path** is blocked (private idcs often need a mirror / proxy / allowlist).

Launch a throwaway probe pod up front to verify the pull:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: acr-pull-probe
  namespace: oracle-test
spec:
  imagePullSecrets:
    - name: apecloud-registry-cred
  containers:
    - name: probe
      image: <the image you are about to test, e.g. apecloud-registry.../oracle:12.2.0.1-ee>
      command: ["sleep", "3600"]
  restartPolicy: Never
```

How to read the result:
- Running within 2 minutes: the path is clear; smoke can proceed
- still ImagePullBackOff: check events for `Failed to pull image`, then hand off to `addon-idc-image-registry-mirror-guide.md`

### 5. vcluster API stability (mandatory whenever smoke runs on a vcluster)

**Symptom if missing**: kubectl reports `TLS handshake timeout` mid-smoke; the cluster behaves normally, but evidence collection is interrupted.

**Doctrine**: every evidence-collection command should **retry 3 times with a short backoff**, treating vcluster API flapping / port-forward drops / TLS handshake failures as environment noise rather than product bugs.

```bash
# example retry wrapper for grabbing events
for i in 1 2 3; do
  if kubectl --request-timeout=10s get events -n "$NS" --field-selector involvedObject.name="$POD" -o yaml > "$EVIDENCE/events.yaml" 2>/dev/null; then
    break
  fi
  sleep $((i*2))
done
```

Also, you **must `export NO_PROXY=127.0.0.1,localhost,*.local`**; otherwise HTTPS_PROXY intercepts the port-forward's localhost TLS and you get `LibreSSL SSL_connect: SSL_ERROR_SYSCALL`.

### 6. vcluster syncer alpine init-container autopatch (mandatory on vcluster 0.19.x)

**Symptom if missing**: after the cluster comes up, the host-side vcluster pods' init container is stuck at `Init:0/1`; events show `Failed to pull image "alpine:3.13.1"`.

**Root cause**: when the vcluster 0.19.x syncer rewrites pods on the host side, it injects an alpine init container for volume permission fixes; that init container image is hardcoded to docker.io/library/alpine:3.13.1, **bypasses imagePullSecrets**, and **is not mirrored on the host nodes**. A private idc that cannot reach docker.io hangs here outright.

**Fix path**: run a background autopatcher daemon that rewrites the vcluster pods' init container to a mirrored image.
See the reference implementation in `addon-vanilla-vcluster-bootstrap-guide.md` §5, appendix B-4.

Minimal quick check:

```bash
# check on the host k8s before running smoke
kubectl --kubeconfig=$HOST_KUBECONFIG get pods -n <vcluster-ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' \
  | grep -v daocloud | grep alpine

# if any non-mirrored alpine shows up, start the autopatcher:
nohup ./alpine-autopatcher.sh > autopatcher.log 2>&1 &
```

### 7. vcluster CoreDNS image preflight (mandatory on vcluster for dataprotection / any DNS-dependent case)

**Symptom if missing**: workload pods are Running and smoke T01-T07 all PASS, but **anything that needs DNS resolution from inside a pod** fails:

- dataprotection backup (RMAN/expdp/mysqldump/etc.) fails against the local instance with ORA-12154 / connection timed out / unknown host;
- DG broker / replication cross-pod communication fails;
- liveness/readiness probes that use short hostnames misfire.

Meanwhile, everything run **directly via `kubectl exec`** works, because exec goes through kube-apiserver and never touches in-pod DNS.

**Root cause**: the coredns Deployment shipped with vcluster defaults to the image `coredns/coredns:1.10.1` (docker.io). Private idc nodes cannot pull from docker.io, so the coredns Deployment sits in ImagePullBackOff, the kube-dns Service has zero endpoints, and in-pod nslookup simply times out. Short hostnames (the standard StatefulSet `<pod>.<svc>` form) can only resolve via search domains + DNS, so they die with DNS. But a pod's own FQDN is in `/etc/hosts` (injected by kubelet), so **some** paths still look fine; this **symptom fragmentation** is exactly what makes this preflight item so valuable.

**Minimal detection**:

```bash
# 1. is coredns Running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# expect: 1/1 Running, restart count stable
# not:    0/1 ImagePullBackOff / ErrImagePull

# 2. nslookup from inside a workload pod
kubectl exec -n <ns> <pod> -- nslookup kubernetes.default.svc.cluster.local
# expect: returns the ClusterIP (10.x.x.x)
# not:    ;; connection timed out; no servers could be reached
```

**Fix path** (private-idc image swap):

```bash
kubectl set image deployment/coredns -n kube-system \
  coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1
# or your company's own mirror address
# coredns reaches 1/1 Running within ~9 seconds and DNS recovers immediately
```

**Triage doctrine**:

> **If dataprotection / cross-pod network cases fail right from the first run, while the cluster itself is Running and the first few smoke cases PASS, check coredns before suspecting the addon code.**

Otherwise you will burn hours inside the addon scripts chasing "why can't EZ-Connect resolve the short hostname", which is a symptom, not the root cause.

## One-Shot Preflight Script Skeleton

```bash
#!/bin/bash
# smoke-preflight.sh: 7 checks that must pass before running smoke
set -euo pipefail

export NO_PROXY="127.0.0.1,localhost,*.local"
export KUBECONFIG="${KUBECONFIG:?must set}"
export NAMESPACE="${NAMESPACE:?must set}"

fail() { echo "[FAIL] $*"; exit 1; }
pass() { echo "[PASS] $*"; }

# 1. BackupRepo
br=$(kubectl get backuprepo -o jsonpath='{.items[?(@.status.phase=="Ready")].metadata.name}')
[ -n "$br" ] || fail "no Ready BackupRepo (default)"
pass "BackupRepo Ready: $br"

# 2. ImagePullSecret
kubectl get secret apecloud-registry-cred -n "$NAMESPACE" >/dev/null 2>&1 \
  || fail "missing imagePullSecret apecloud-registry-cred in $NAMESPACE"
pass "imagePullSecret exists"

# 3. StorageClass
sc=$(kubectl get sc -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}')
[ -n "$sc" ] || fail "no default StorageClass"
pass "default SC: $sc"

# 4. ACR pull probe (skip if already verified today)
[ -f /tmp/acr-pull-verified-$(date +%F) ] || fail "ACR pull probe not run today, run §4 manually"
pass "ACR pull probed today"

# 5. NO_PROXY
[[ "${NO_PROXY:-}" == *"127.0.0.1"* ]] || fail "NO_PROXY missing 127.0.0.1 (HTTPS_PROXY will break port-forward)"
pass "NO_PROXY set"

# 6. vcluster alpine init (only if running in vcluster)
if [ -n "${HOST_KUBECONFIG:-}" ]; then
  bad=$(kubectl --kubeconfig="$HOST_KUBECONFIG" get pods -n "${HOST_NS:-vcluster}" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' 2>/dev/null \
    | grep -v daocloud | grep alpine | wc -l)
  [ "$bad" -eq 0 ] || fail "$bad host-side vcluster pods have an unmirrored alpine init container; start the autopatcher"
  pass "no unmirrored alpine init containers"
fi

# 7. CoreDNS Running (DNS-dependent ops: dataprotection, cross-pod replication, etc.)
cdns=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].status.phase}' 2>/dev/null)
[[ "$cdns" == *Running* ]] || fail "coredns not Running ($cdns); DNS-dependent ops will fail. fix image: kubectl set image deployment/coredns -n kube-system coredns=<mirror>/coredns:1.10.1"
pass "coredns Running"

echo "=== preflight PASSED ==="
```

## Relationship to Other Docs

| Doc | Focus |
| --- | --- |
| `addon-smoke-test-pre-flight-checklist-guide.md` (this doc) | environment preconditions before the smoke harness starts |
| `addon-test-script-preflight-guide.md` | client-state locking of the test script / runner itself |
| `addon-test-environment-gate-hygiene-guide.md` | single-line post-restart gate |
| `addon-vanilla-vcluster-bootstrap-guide.md` | bootstrap of the vcluster itself |
| `addon-idc-image-registry-mirror-guide.md` | root-cause workflow for ACR pull failures |
| `addon-multi-ns-registry-scan-preflight-guide.md` | multi-ns pull scan |

## Case Study Appendix

- Oracle 12c smoke on idc4 vcluster (2026-05-05):
  - T08 RMAN + expdp all failed with `failureReason: no default BackupRepo found`
  - after adding `oracle-backup-repo-pvc` (storageProviderRef=pvc, SC=apelocal-hostpath-default) and waiting for Ready, the error changed from `no default BackupRepo` to `failed to get target pods` (cluster Updating; everything downstream normal)
  - Sediment: checklist item 1, BackupRepo provisioning, is mandatory
  - Evidence: `phase2-12c-smoke-20260505-113300/T08-manual/`

- vcluster API TLS handshake timeout (2026-05-05 12:32):
  - while recreating the cluster for the 12c smoke, helm install reported `cluster reachability check failed: TLS handshake timeout`
  - root cause: `HTTPS_PROXY=http://127.0.0.1:6666` intercepted the TLS handshake to localhost:18443
  - fix: `export NO_PROXY="127.0.0.1,localhost,*.local"`
  - Sediment: item 5, the NO_PROXY check, is mandatory

- Oracle 19c T08 ORA-12154: looked like an addon bug, was actually vcluster CoreDNS ImagePullBackOff (2026-05-05 13:08):
  - 19c standalone smoke T01-T07 all PASS (exec-based cases); T08 RMAN backups all failed with ORA-12154 TNS:could not resolve the connect identifier specified
  - first-layer hypothesis: the addon dataprotection scripts reference `${ORACLE_PORT}` / `${ORACLE_UNIQUE_NAME}` but the KB Job only injects `DP_DB_*` env, so the W7 patch injected `ORACLE_SID`/`ORACLE_PORT`
  - after that static fix it was **still ORA-12154**, so the hunt for the real cause continued
  - repro path: inside the oracle pod, `rman target sys/pwd@host:1521/SVC` gave ORA-12154 while `rman target sys/pwd@<ip>:1521/SVC` in the same pod SUCCEEDED
  - digging further: in-pod `nslookup` returned `connection timed out`; `kubectl get pods -n kube-system` showed `coredns 0/1 ImagePullBackOff`, already 108 minutes old
  - root cause: vcluster's default coredns image is `coredns/coredns:1.10.1` (docker.io), unreachable from the private idc
  - fix: `kubectl set image deployment/coredns -n kube-system coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1`; coredns Running in 9s; the re-issued Backup CR `o19-i4-8854-rman19c-w7verify2` went **Completed** in 2m32s (528MB)
  - Sediment: item 7, the CoreDNS Running check, plus the triage doctrine "when DNS-dependent cases fail, look at coredns first, then suspect addon code"
  - Lesson: cluster Running + partial smoke PASS does **not** guarantee the cluster network stack is intact; DNS must be actively verified with layered probes
  - Evidence: `phase2-19c-smoke-20260505-115356/T08-r1/` + `W7-verify-evidence/` (James commit `345bfef9` + live-patch sequence)
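
The triage doctrine distilled in the 19c case reduces to a small decision rule. A minimal sketch in bash, for illustration only: the `classify_smoke_failure` helper, its name, and its four string inputs are hypothetical and not part of `smoke-preflight.sh` or any existing script in this repo.

```shell
#!/bin/bash
# Illustrative sketch of the Section 7 triage doctrine (hypothetical helper).
# Inputs are plain strings as observed via kubectl: cluster phase, the verdict
# of the early exec-based smoke cases, the verdict of the DNS-dependent cases,
# and the coredns pod phase.
classify_smoke_failure() {
  local cluster_phase="$1"   # e.g. Running
  local early_smoke="$2"     # PASS / FAIL for exec-based cases (T01-T07)
  local dns_case="$3"        # PASS / FAIL for dataprotection / cross-pod cases
  local coredns_phase="$4"   # e.g. Running / ImagePullBackOff

  if [ "$dns_case" = "FAIL" ] && [ "$cluster_phase" = "Running" ] && [ "$early_smoke" = "PASS" ]; then
    # Fragmented symptoms: the cluster surface looks healthy, yet
    # DNS-dependent cases fail. Check the DNS layer before the addon.
    if [ "$coredns_phase" != "Running" ]; then
      echo "env-gap: fix coredns image first"
    else
      echo "suspect-addon: DNS healthy, investigate addon scripts"
    fi
  else
    echo "first-blocker: classify per smoke result doctrine (PR #69)"
  fi
}
```

The 19c case above maps to `classify_smoke_failure Running PASS FAIL ImagePullBackOff`, which routes straight to the coredns image fix before any addon-side debugging.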