From 2e083b04013d5e9009a0df6f11f8f95fe67c339e Mon Sep 17 00:00:00 2001
From: Ava
Date: Tue, 5 May 2026 13:30:23 +0800
Subject: [PATCH] docs(smoke-test-preflight): add smoke test pre-flight
 checklist guide v1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

7-section preflight for KB addon smoke runs covering: (1) BackupRepo
precondition, (2) ImagePullSecret provisioning, (3) default StorageClass
check, (4) live ACR pull probe, (5) vcluster API stability / NO_PROXY,
(6) alpine init-container autopatcher, (7) vcluster CoreDNS image
preflight, newly added based on the 2026-05-05 idc4 incident.

Section 7 doctrine: if dataprotection / cross-pod-network test cases fail
on the first run while the cluster is Running and smoke T01-T07 PASS,
check coredns BEFORE the addon code. The symptom is pod-level DNS
resolution failure, but the cluster surface looks healthy because
exec-based smoke tests do not need DNS.

Case study appendix: the Oracle 19c T08 ORA-12154 investigation traced to
a CoreDNS root cause, fixed by an image swap on idc4
(docker.io/coredns/coredns:1.10.1 in ImagePullBackOff swapped to
registry.aliyuncs.com/google_containers/coredns:1.10.1, Running 1/1 in
9s). Backup o19-i4-8854-rman19c-w7verify2 Status=Completed, 553MB, 2m32s
after the fix.

One-shot preflight script updated: a 7-item check covering all sections,
with coredns Running validation as item 7.

This guide is the proactive "before-smoke" counterpart to the
first-blocker / smoke result classification doctrine (PR #69).
Cross-refs to:
- addon-vanilla-vcluster-bootstrap-guide.md (autopatcher + dual-image setup)
- addon-idc-vcluster-migration-checklist-guide.md (Alice, IDC checklist owner)
- addon-kb-schema-version-preflight-guide.md (schema-side preflight, PR #70)

Co-Authored-By: Claude Opus 4.7
---
 ...n-smoke-test-pre-flight-checklist-guide.md | 307 ++++++++++++++++++
 1 file changed, 307 insertions(+)
 create mode 100644 docs/addon-smoke-test-pre-flight-checklist-guide.md

diff --git a/docs/addon-smoke-test-pre-flight-checklist-guide.md b/docs/addon-smoke-test-pre-flight-checklist-guide.md
new file mode 100644
index 0000000..f2914fb
--- /dev/null
+++ b/docs/addon-smoke-test-pre-flight-checklist-guide.md
@@ -0,0 +1,307 @@
# Addon Smoke Test Pre-Flight Checklist Guide

> **Audience**: addon dev / test, especially anyone running smoke in a vcluster / private idc / environment without cloud-managed resources
> **Status**: draft v0.1 (2026-05-05)
> **Applies to**: smoke / functional / compatibility testing of any KB addon
> **Applies to KB version**: any
> **Sibling docs**:
> - `addon-test-script-preflight-guide.md`: client-side state (kube context / proxy)
> - `addon-test-environment-gate-hygiene-guide.md`: single-line single-environment post-restart gate
> - `addon-vanilla-vcluster-bootstrap-guide.md`: bootstrap of the vcluster environment itself

## What This Guide Solves

Many addon smoke / functional tests assume preconditions that only a **cloud-managed KB cluster** provides, for example:

- `BackupPolicy` is auto-bound to a **default BackupRepo** (cluster-scoped), so a `Backup` CR executes as soon as it is created
- `imagePullSecrets` were propagated to every namespace at KB install time
- images are already mirrored on the default nodes
- coredns / the vcluster syncer never rewrite init container images

Move the smoke scripts to a **bare vcluster / private idc / self-deployed KB** and these assumptions fail wholesale. What you see is backup/restore cases such as `T08`/`T11` failing, `ImagePullBackOff`, and init container `cp: can't stat '/bin/k3s'`, while the addon code itself is fine.

This guide is a **7-step pre-flight checklist to run before smoke**, freezing these environment preconditions up front so that environment gaps are not misdiagnosed as addon bugs.

## Goals

- Finish the checklist in 5 minutes with nothing missed
- One cluster-scoped resource manifest reused across all engines
- On failure, immediately distinguish "environment gap" from "addon bug"

## The 7-Step Checklist

### 1. BackupRepo provisioning (required for dataprotection cases)

**Symptom if missing**: the Backup CR goes phase=Failed after creation with `failureReason: no default BackupRepo found`. The entire backup path fails, while the cluster stays Running.

BackupRepo is a **cluster-scoped** resource. KB auto-creates the `BackupPolicy` (namespace-scoped, generated from the cluster template), but the BackupRepo must be created **manually** and marked default. On most cloud-managed KB it is pre-provisioned by the control plane; in a private idc you must build it yourself.

**Minimal working manifest** (PVC + local hostpath SC):

```yaml
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: BackupRepo
metadata:
  name: oracle-backup-repo-pvc   # the name does not affect behavior
  annotations:
    dataprotection.kubeblocks.io/is-default-repo: "true"
spec:
  storageProviderRef: pvc        # see §1.1 for choosing a provider
  pvReclaimPolicy: Retain
  volumeCapacity: 50Gi
  config:
    storageClassName: <an SC present in your environment>   # e.g. apelocal-hostpath-default
```

#### 1.1 Choosing storageProviderRef

```bash
kubectl get storageprovider
```

How to read the output:
- `pvc` Ready: always the first choice (uses a local SC, no external object-storage dependency)
- `s3` / `oss` / `cos` Ready: available on cloud; usually NotReady in a private idc
- `ftp` / `azureblob` Ready: only if your idc really has an ftp/azure endpoint

**The `pvc` provider is the most reliable choice for private-idc smoke**: PVs are provisioned by a local SC with no network dependency.

#### 1.2 Verifying Ready

```bash
kubectl get backuprepo
# NAME                     STATUS   STORAGEPROVIDER   ACCESSMETHOD   DEFAULT
# oracle-backup-repo-pvc   Ready    pvc               Mount          true

kubectl get backuprepo oracle-backup-repo-pvc -o jsonpath='{.status.conditions[*].type}'
# all 5 conditions True: StorageProviderReady ParametersChecked StorageClassCreated PVCTemplateChecked PreCheckPassed
```

The repo is usable only when all 5 conditions are True. Running a backup with any one missing will fail.

### 2. ImagePullSecret provisioning

**Symptom if missing**: oracle/mysql/etc. workload pods in the cluster are stuck in `ImagePullBackOff`; events show `pulling from <registry> failed: pull access denied`.

**Doctrine**: the pullSecret for the private ACR / image registry must also exist in the **test namespace**, not only in `kb-system`.

Minimal operation:

```bash
# 1. kb-system already has the pullSecret (created at KB install time); copy it over
kubectl get secret apecloud-registry-cred -n kb-system -o yaml \
  | sed 's/namespace: kb-system/namespace: oracle-test/' \
  | kubectl apply -f -

# 2. verify
kubectl get secret apecloud-registry-cred -n oracle-test
```

If running inside a vcluster, the same secret must also be created in the vcluster syncer's default namespace on the host k8s (`oracle-runner-host` or similar); see §6.

### 3. StorageClass provisioning

**Symptom if missing**: PVCs stay Pending after cluster creation; events show `no persistent volumes available for this claim`.

```bash
kubectl get sc
# expect at least one SC carrying the default annotation:
# storageclass.kubernetes.io/is-default-class=true
```

If there is no default SC, the cluster chart must pass `--set storageClass=<sc-name>` explicitly.

### 4. Live ACR pull probe

**Symptom if missing**: even with the §2 secret in place, pods still sit in ImagePullBackOff, which may mean the **registry network path** is blocked (private idcs often need a mirror / proxy / allowlist).

Launch a throwaway probe pod up front to verify the pull:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: acr-pull-probe
  namespace: oracle-test
spec:
  imagePullSecrets:
    - name: apecloud-registry-cred
  containers:
    - name: probe
      image: <the image you are about to test, e.g. apecloud-registry.../oracle:12.2.0.1-ee>
      command: ["sleep", "3600"]
  restartPolicy: Never
```

How to read the result:
- Running within 2 minutes: the path is clear; smoke can proceed
- still ImagePullBackOff: check events for `Failed to pull image`, then hand off to `addon-idc-image-registry-mirror-guide.md`

### 5. vcluster API stability (mandatory whenever smoke runs on a vcluster)

**Symptom if missing**: kubectl reports `TLS handshake timeout` mid-smoke; the cluster behaves normally, but evidence collection is interrupted.

**Doctrine**: every evidence-collection command should **retry 3 times with a short backoff**, treating vcluster API flapping / port-forward drops / TLS handshake failures as environment noise rather than product bugs.

```bash
# example retry wrapper for grabbing events
for i in 1 2 3; do
  if kubectl --request-timeout=10s get events -n "$NS" --field-selector involvedObject.name="$POD" -o yaml > "$EVIDENCE/events.yaml" 2>/dev/null; then
    break
  fi
  sleep $((i*2))
done
```

Also, you **must `export NO_PROXY=127.0.0.1,localhost,*.local`**; otherwise HTTPS_PROXY intercepts the port-forward's localhost TLS and you get `LibreSSL SSL_connect: SSL_ERROR_SYSCALL`.

### 6. vcluster syncer alpine init-container autopatch (mandatory on vcluster 0.19.x)

**Symptom if missing**: after the cluster comes up, the host-side vcluster pods' init container is stuck at `Init:0/1`; events show `Failed to pull image "alpine:3.13.1"`.

**Root cause**: when the vcluster 0.19.x syncer rewrites pods on the host side, it injects an alpine init container for volume permission fixes; that init container image is hardcoded to docker.io/library/alpine:3.13.1, **bypasses imagePullSecrets**, and **is not mirrored on the host nodes**. A private idc that cannot reach docker.io hangs here outright.

**Fix path**: run a background autopatcher daemon that rewrites the vcluster pods' init container to a mirrored image.
See the reference implementation in `addon-vanilla-vcluster-bootstrap-guide.md` §5, appendix B-4.

Minimal quick check:

```bash
# check on the host k8s before running smoke
kubectl --kubeconfig=$HOST_KUBECONFIG get pods -n <vcluster-ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' \
  | grep -v daocloud | grep alpine

# if any non-mirrored alpine shows up, start the autopatcher:
nohup ./alpine-autopatcher.sh > autopatcher.log 2>&1 &
```

### 7. vcluster CoreDNS image preflight (mandatory on vcluster for dataprotection / any DNS-dependent case)

**Symptom if missing**: workload pods are Running and smoke T01-T07 all PASS, but **anything that needs DNS resolution from inside a pod** fails:

- dataprotection backup (RMAN/expdp/mysqldump/etc.) fails against the local instance with ORA-12154 / connection timed out / unknown host;
- DG broker / replication cross-pod communication fails;
- liveness/readiness probes that use short hostnames misfire.

Meanwhile, everything run **directly via `kubectl exec`** works, because exec goes through kube-apiserver and never touches in-pod DNS.

**Root cause**: the coredns Deployment shipped with vcluster defaults to the image `coredns/coredns:1.10.1` (docker.io). Private idc nodes cannot pull from docker.io, so the coredns Deployment sits in ImagePullBackOff, the kube-dns Service has zero endpoints, and in-pod nslookup simply times out. Short hostnames (the standard StatefulSet `<pod>.<svc>` form) can only resolve via search domains + DNS, so they die with DNS. But a pod's own FQDN is in `/etc/hosts` (injected by kubelet), so **some** paths still look fine; this **symptom fragmentation** is exactly what makes this preflight item so valuable.

**Minimal detection**:

```bash
# 1. is coredns Running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# expect: 1/1 Running, restart count stable
# not:    0/1 ImagePullBackOff / ErrImagePull

# 2. nslookup from inside a workload pod
kubectl exec -n <ns> <pod> -- nslookup kubernetes.default.svc.cluster.local
# expect: returns the ClusterIP (10.x.x.x)
# not:    ;; connection timed out; no servers could be reached
```

**Fix path** (private-idc image swap):

```bash
kubectl set image deployment/coredns -n kube-system \
  coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1
# or your company's own mirror address
# coredns reaches 1/1 Running within ~9 seconds and DNS recovers immediately
```

**Triage doctrine**:

> **If dataprotection / cross-pod network cases fail right from the first run, while the cluster itself is Running and the first few smoke cases PASS, check coredns before suspecting the addon code.**

Otherwise you will burn hours inside the addon scripts chasing "why can't EZ-Connect resolve the short hostname", which is a symptom, not the root cause.

## One-Shot Preflight Script Skeleton

```bash
#!/bin/bash
# smoke-preflight.sh: 7 checks that must pass before running smoke
set -euo pipefail

export NO_PROXY="127.0.0.1,localhost,*.local"
export KUBECONFIG="${KUBECONFIG:?must set}"
export NAMESPACE="${NAMESPACE:?must set}"

fail() { echo "[FAIL] $*"; exit 1; }
pass() { echo "[PASS] $*"; }

# 1. BackupRepo
br=$(kubectl get backuprepo -o jsonpath='{.items[?(@.status.phase=="Ready")].metadata.name}')
[ -n "$br" ] || fail "no Ready BackupRepo (default)"
pass "BackupRepo Ready: $br"

# 2. ImagePullSecret
kubectl get secret apecloud-registry-cred -n "$NAMESPACE" >/dev/null 2>&1 \
  || fail "missing imagePullSecret apecloud-registry-cred in $NAMESPACE"
pass "imagePullSecret exists"

# 3. StorageClass
sc=$(kubectl get sc -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}')
[ -n "$sc" ] || fail "no default StorageClass"
pass "default SC: $sc"

# 4. ACR pull probe (skip if already verified today)
[ -f /tmp/acr-pull-verified-$(date +%F) ] || fail "ACR pull probe not run today, run §4 manually"
pass "ACR pull probed today"

# 5. NO_PROXY
[[ "${NO_PROXY:-}" == *"127.0.0.1"* ]] || fail "NO_PROXY missing 127.0.0.1 (HTTPS_PROXY will break port-forward)"
pass "NO_PROXY set"

# 6. vcluster alpine init (only if running in vcluster)
if [ -n "${HOST_KUBECONFIG:-}" ]; then
  bad=$(kubectl --kubeconfig="$HOST_KUBECONFIG" get pods -n "${HOST_NS:-vcluster}" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' 2>/dev/null \
    | grep -v daocloud | grep alpine | wc -l)
  [ "$bad" -eq 0 ] || fail "$bad host-side vcluster pods have an unmirrored alpine init container; start the autopatcher"
  pass "no unmirrored alpine init containers"
fi

# 7. CoreDNS Running (DNS-dependent ops: dataprotection, cross-pod replication, etc.)
cdns=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].status.phase}' 2>/dev/null)
[[ "$cdns" == *Running* ]] || fail "coredns not Running ($cdns); DNS-dependent ops will fail. fix image: kubectl set image deployment/coredns -n kube-system coredns=<mirror>/coredns:1.10.1"
pass "coredns Running"

echo "=== preflight PASSED ==="
```

## Relationship to Other Docs

| Doc | Focus |
| --- | --- |
| `addon-smoke-test-pre-flight-checklist-guide.md` (this doc) | environment preconditions before the smoke harness starts |
| `addon-test-script-preflight-guide.md` | client-state locking of the test script / runner itself |
| `addon-test-environment-gate-hygiene-guide.md` | single-line post-restart gate |
| `addon-vanilla-vcluster-bootstrap-guide.md` | bootstrap of the vcluster itself |
| `addon-idc-image-registry-mirror-guide.md` | root-cause workflow for ACR pull failures |
| `addon-multi-ns-registry-scan-preflight-guide.md` | multi-ns pull scan |

## Case Study Appendix

- Oracle 12c smoke on idc4 vcluster (2026-05-05):
  - T08 RMAN + expdp all failed with `failureReason: no default BackupRepo found`
  - after adding `oracle-backup-repo-pvc` (storageProviderRef=pvc, SC=apelocal-hostpath-default) and waiting for Ready, the error changed from `no default BackupRepo` to `failed to get target pods` (cluster Updating; everything downstream normal)
  - Sediment: checklist item 1, BackupRepo provisioning, is mandatory
  - Evidence: `phase2-12c-smoke-20260505-113300/T08-manual/`

- vcluster API TLS handshake timeout (2026-05-05 12:32):
  - while recreating the cluster for the 12c smoke, helm install reported `cluster reachability check failed: TLS handshake timeout`
  - root cause: `HTTPS_PROXY=http://127.0.0.1:6666` intercepted the TLS handshake to localhost:18443
  - fix: `export NO_PROXY="127.0.0.1,localhost,*.local"`
  - Sediment: item 5, the NO_PROXY check, is mandatory

- Oracle 19c T08 ORA-12154: looked like an addon bug, was actually vcluster CoreDNS ImagePullBackOff (2026-05-05 13:08):
  - 19c standalone smoke T01-T07 all PASS (exec-based cases); T08 RMAN backups all failed with ORA-12154 TNS:could not resolve the connect identifier specified
  - first-layer hypothesis: the addon dataprotection scripts reference `${ORACLE_PORT}` / `${ORACLE_UNIQUE_NAME}` but the KB Job only injects `DP_DB_*` env, so the W7 patch injected `ORACLE_SID`/`ORACLE_PORT`
  - after that static fix it was **still ORA-12154**, so the hunt for the real cause continued
  - repro path: inside the oracle pod, `rman target sys/pwd@host:1521/SVC` gave ORA-12154 while `rman target sys/pwd@<ip>:1521/SVC` in the same pod SUCCEEDED
  - digging further: in-pod `nslookup` returned `connection timed out`; `kubectl get pods -n kube-system` showed `coredns 0/1 ImagePullBackOff`, already 108 minutes old
  - root cause: vcluster's default coredns image is `coredns/coredns:1.10.1` (docker.io), unreachable from the private idc
  - fix: `kubectl set image deployment/coredns -n kube-system coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1`; coredns Running in 9s; the re-issued Backup CR `o19-i4-8854-rman19c-w7verify2` went **Completed** in 2m32s (528MB)
  - Sediment: item 7, the CoreDNS Running check, plus the triage doctrine "when DNS-dependent cases fail, look at coredns first, then suspect addon code"
  - Lesson: cluster Running + partial smoke PASS does **not** guarantee the cluster network stack is intact; DNS must be actively verified with layered probes
  - Evidence: `phase2-19c-smoke-20260505-115356/T08-r1/` + `W7-verify-evidence/` (James commit `345bfef9` + live-patch sequence)
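
The triage doctrine distilled in the 19c case reduces to a small decision rule. A minimal sketch in bash, for illustration only: the `classify_smoke_failure` helper, its name, and its four string inputs are hypothetical and not part of `smoke-preflight.sh` or any existing script in this repo.

```shell
#!/bin/bash
# Illustrative sketch of the Section 7 triage doctrine (hypothetical helper).
# Inputs are plain strings as observed via kubectl: cluster phase, the verdict
# of the early exec-based smoke cases, the verdict of the DNS-dependent cases,
# and the coredns pod phase.
classify_smoke_failure() {
  local cluster_phase="$1"   # e.g. Running
  local early_smoke="$2"     # PASS / FAIL for exec-based cases (T01-T07)
  local dns_case="$3"        # PASS / FAIL for dataprotection / cross-pod cases
  local coredns_phase="$4"   # e.g. Running / ImagePullBackOff

  if [ "$dns_case" = "FAIL" ] && [ "$cluster_phase" = "Running" ] && [ "$early_smoke" = "PASS" ]; then
    # Fragmented symptoms: the cluster surface looks healthy, yet
    # DNS-dependent cases fail. Check the DNS layer before the addon.
    if [ "$coredns_phase" != "Running" ]; then
      echo "env-gap: fix coredns image first"
    else
      echo "suspect-addon: DNS healthy, investigate addon scripts"
    fi
  else
    echo "first-blocker: classify per smoke result doctrine (PR #69)"
  fi
}
```

The 19c case above maps to `classify_smoke_failure Running PASS FAIL ImagePullBackOff`, which routes straight to the coredns image fix before any addon-side debugging.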