# Addon Smoke Test Pre-Flight Checklist Guide

> **Audience**: addon dev / test, especially anyone running smoke tests on vcluster / private IDC / environments without cloud-managed resources
> **Status**: draft v0.1 (2026-05-05)
> **Applies to**: smoke / functional / compatibility tests of any KB addon
> **Applies to KB version**: any
> **Sibling docs**:
> - `addon-test-script-preflight-guide.md` — client-side state (kube context / proxy)
> - `addon-test-environment-gate-hygiene-guide.md` — single-line, single-environment post-restart gate
> - `addon-vanilla-vcluster-bootstrap-guide.md` — bootstrap of the vcluster environment itself

## What this guide solves

Many addon smoke / functional tests assume preconditions that only exist on a **cloud-managed KB cluster**, for example:

- `BackupPolicy` is automatically bound to a **default BackupRepo** (cluster-scoped), so a `Backup` CR executes as soon as it is created
- `imagePullSecrets` were already propagated to every namespace when KB was installed
- images are already mirrored on the default nodes
- coredns / the vcluster syncer does not rewrite init container images

Move the smoke script to a **bare vcluster / private IDC / self-deployed KB** and these assumptions break en masse. The symptoms: backup/restore cases such as smoke `T08`/`T11` fail, pods hit `ImagePullBackOff`, init containers report `cp: can't stat '/bin/k3s'` — while the addon code itself is fine.

This document is a **7-step pre-flight checklist to run before starting smoke**, pinning these environment preconditions down up front so that environment gaps are not misdiagnosed as addon bugs.

## Goals

- Complete the checklist in 5 minutes with nothing missed
- One cluster-scoped resource manifest reusable across all engines
- When something fails, immediately distinguish an "environment gap" from an "addon bug"

## The 7-step checklist

### 1. Provision a BackupRepo (required for dataprotection cases)

**Symptom when missing**: after a Backup CR is created, phase=Failed with `failureReason: no default BackupRepo found`. The entire backup path fails even though the cluster itself is Running normally.

BackupRepo is a **cluster-scoped** resource. KB auto-creates the `BackupPolicy` (namespace-scoped, generated from the cluster template), but the BackupRepo must be created **manually** and annotated as default. Most cloud-hosted KB deployments have it pre-provisioned by the control plane; in a private IDC you have to create it yourself.

**Minimal working manifest** (PVC + local hostpath SC):

```yaml
apiVersion: dataprotection.kubeblocks.io/v1alpha1
kind: BackupRepo
metadata:
  name: oracle-backup-repo-pvc   # the name does not affect behavior
  annotations:
    dataprotection.kubeblocks.io/is-default-repo: "true"
spec:
  storageProviderRef: pvc        # see §1.1 for choosing a provider
  pvReclaimPolicy: Retain
  volumeCapacity: 50Gi
  config:
    storageClassName: <an SC present in your environment>  # e.g. apelocal-hostpath-default
```

#### 1.1 Choosing storageProviderRef

```bash
kubectl get storageprovider
```

How to read the result:
- `pvc` Ready → always the first choice (uses a local SC, no dependency on external object storage)
- `s3` / `oss` / `cos` Ready → present on cloud; usually NotReady in a private IDC
- `ftp` / `azureblob` Ready → depends on whether your IDC really has an ftp/azure endpoint

**The `pvc` provider is the most reliable choice for private-IDC smoke runs**: the PV is provisioned by a local SC with no network dependency.

#### 1.2 Verifying Ready

```bash
kubectl get backuprepo
# NAME STATUS STORAGEPROVIDER ACCESSMETHOD DEFAULT
# oracle-backup-repo-pvc Ready pvc Mount true

kubectl get backuprepo oracle-backup-repo-pvc -o jsonpath='{.status.conditions[*].type}'
# all 5 conditions True: StorageProviderReady ParametersChecked StorageClassCreated PVCTemplateChecked PreCheckPassed
```

The repo is usable only when all 5 conditions are True. Run a backup with even one of them missing and it will fail.
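The five-condition check can be scripted. A minimal sketch, assuming you feed it the space-separated statuses from `kubectl get backuprepo <name> -o jsonpath='{.status.conditions[*].status}'` (the helper name `check_backuprepo_conditions` is ours, not part of KB):

```shell
# Hypothetical helper: pass the space-separated condition statuses, e.g. from
#   kubectl get backuprepo <name> -o jsonpath='{.status.conditions[*].status}'
# Fails unless exactly 5 conditions are reported and all are True.
check_backuprepo_conditions() {
  local statuses="$1"
  local total=0 true_count=0
  for s in $statuses; do
    total=$((total + 1))
    if [ "$s" = "True" ]; then
      true_count=$((true_count + 1))
    fi
  done
  if [ "$total" -eq 5 ] && [ "$true_count" -eq 5 ]; then
    echo "OK: all 5 conditions True"
  else
    echo "NOT READY: $true_count/$total conditions True"
    return 1
  fi
}
```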

### 2. Provision the ImagePullSecret

**Symptom when missing**: business pods (oracle/mysql/etc.) in the cluster are stuck in `ImagePullBackOff`, with `pulling from <registry> failed: pull access denied` in the events.

**Doctrine**: the pullSecret for a private ACR / image registry must exist in the **test namespace** as well, not only in `kb-system`.

Minimal steps:

```bash
# 1. kb-system already holds the pullSecret (created at KB install time); copy it over
kubectl get secret apecloud-registry-cred -n kb-system -o yaml \
  | sed 's/namespace: kb-system/namespace: oracle-test/' \
  | kubectl apply -f -

# 2. Verify
kubectl get secret apecloud-registry-cred -n oracle-test
```

If you run inside a vcluster, the secret must also be created in the vcluster syncer's default ns on the host k8s (something like `oracle-runner-host`); see §6.

### 3. Provision a StorageClass

**Symptom when missing**: PVCs stay Pending after cluster creation, with `no persistent volumes available for this claim` in the events.

```bash
kubectl get sc
# Expect at least one SC carrying the default annotation:
# storageclass.kubernetes.io/is-default-class=true
```

If there is no default SC, the cluster chart must be installed with an explicit `--set storageClass=<name>`.
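When scripting this check, the default-SC lookup can be isolated into a small helper. A sketch under stated assumptions: the name `find_default_sc` and the tab-separated input format are ours; the kubectl command in the comment shows one way to produce that input.

```shell
# Hypothetical helper: reads "name<TAB>is-default" lines, e.g. produced by
#   kubectl get sc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'
# Prints the default SC name, or fails so the caller knows to pass
# --set storageClass=<name> explicitly.
find_default_sc() {
  local name is_default
  while IFS=$'\t' read -r name is_default; do
    if [ "$is_default" = "true" ]; then
      echo "$name"
      return 0
    fi
  done
  echo "no default StorageClass found" >&2
  return 1
}
```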

### 4. Live ACR pull probe

**Symptom when missing**: with the §2 secret already in place, pods are still stuck in ImagePullBackOff — the **registry network path** may be unreachable (private IDCs often require a mirror / proxy / allowlist).

Create a throwaway probe pod in advance to verify the pull:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: acr-pull-probe
  namespace: oracle-test
spec:
  imagePullSecrets:
    - name: apecloud-registry-cred
  containers:
    - name: probe
      image: <the image you are about to test, e.g. apecloud-registry.../oracle:12.2.0.1-ee>
      command: ["sleep", "3600"]
  restartPolicy: Never
```

How to read the result:
- Running within 2 minutes → the path works; smoke can proceed
- still ImagePullBackOff → inspect the `Failed to pull image` events and move to `addon-idc-image-registry-mirror-guide.md`
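The "Running within 2 minutes" criterion can be automated with a generic polling helper. A minimal sketch (`poll_until` is our name; the kubectl line in the trailing comment is only an illustration of how it might be called against the probe pod):

```shell
# Minimal polling helper: retries a command every INTERVAL seconds for up
# to TIMEOUT seconds; succeeds as soon as the command does.
poll_until() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if "$@"; then
      echo "ready after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s"
  return 1
}

# Illustrative usage against the probe pod (adjust names to your env):
#   poll_until 120 5 sh -c \
#     'kubectl get pod acr-pull-probe -n oracle-test -o jsonpath="{.status.phase}" | grep -q Running'
```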

### 5. vcluster API stability (mandatory when smoke runs on vcluster)

**Symptom when missing**: mid-smoke, kubectl reports `TLS handshake timeout`; the cluster behaves normally but evidence collection is interrupted.

**Doctrine**: every evidence-collection command should **retry 3 times with a short backoff**, treating vcluster API flapping / port-forward interruptions / TLS handshake failures as environment noise, not product bugs.

```bash
# Example retry wrapper for capturing events
for i in 1 2 3; do
  if kubectl --request-timeout=10s get events -n "$NS" --field-selector involvedObject.name="$POD" -o yaml > "$EVIDENCE/events.yaml" 2>/dev/null; then
    break
  fi
  sleep $((i*2))
done
```

In addition you **must `export NO_PROXY=127.0.0.1,localhost,*.local`** — otherwise HTTPS_PROXY intercepts the port-forward's localhost TLS and you get `LibreSSL SSL_connect: SSL_ERROR_SYSCALL`.
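This guard can run at the top of any smoke wrapper. A sketch (the function name `proxy_sanity` is ours):

```shell
# Warn when HTTPS_PROXY is set but NO_PROXY does not exempt 127.0.0.1 —
# exactly the combination that breaks kubectl port-forward TLS handshakes.
proxy_sanity() {
  if [ -n "${HTTPS_PROXY:-}" ] && [[ "${NO_PROXY:-}" != *127.0.0.1* ]]; then
    echo "WARN: HTTPS_PROXY set but NO_PROXY lacks 127.0.0.1"
    return 1
  fi
  echo "proxy env OK"
}
```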

### 6. vcluster syncer alpine init container auto-patch (mandatory on vcluster 0.19.x)

**Symptom when missing**: after the cluster comes up, the host-side vcluster pod is stuck at `Init:0/1`, with `Failed to pull image "alpine:3.13.1"` in the events.

**Root cause**: when the vcluster 0.19.x syncer rewrites pods on the host side, it injects an alpine init container for volume permission fixes; that init container image is hardcoded to docker.io/library/alpine:3.13.1, **does not use imagePullSecrets**, and **is not covered by the host node mirror**. A private IDC that cannot reach docker.io is stuck for good.

**Fix path**: run a background autopatcher daemon that rewrites the vcluster pods' init container to a mirrored image.
See the reference implementation in `addon-vanilla-vcluster-bootstrap-guide.md` §5 appendix B-4.

Minimal quick check:

```bash
# Check on the host k8s before running smoke
kubectl --kubeconfig=$HOST_KUBECONFIG get pods -n <vcluster-ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' \
  | grep -v daocloud | grep alpine

# If any non-mirrored alpine shows up, start the autopatcher:
nohup ./alpine-autopatcher.sh > autopatcher.log 2>&1 &
```

### 7. vcluster CoreDNS image preflight (mandatory on vcluster for dataprotection / any case that needs DNS)

**Symptom when missing**: business pods are Running and smoke T01-T07 all PASS, but **everything that performs DNS resolution from inside a pod** fails:

- dataprotection backups (RMAN/expdp/mysqldump/etc.) cannot connect to the local instance: ORA-12154 / connection timed out / unknown host;
- DG broker / replication cross-pod communication fails;
- liveness/readiness probes using short hostnames go false-positive.

Meanwhile **running commands directly via `kubectl exec`** works fine — exec goes through kube-apiserver and never touches in-pod DNS.

**Root cause**: the coredns Deployment bundled with vcluster defaults to the image `coredns/coredns:1.10.1` (docker.io). Private-IDC nodes cannot pull from docker.io → the coredns Deployment sits in ImagePullBackOff → the kube-dns Service has zero endpoints → in-pod nslookup simply times out. Short hostnames (the standard StatefulSet `<pod>.<headless-svc>`) depend entirely on search domains + DNS resolution, so they die when DNS dies. A pod's own FQDN, however, is in `/etc/hosts` (injected by kubelet), so **some** interfaces still look healthy — this **symptom fragmentation** is exactly why this preflight check is so valuable.

**Minimal detection**:

```bash
# 1. Is coredns Running?
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Expect: 1/1 Running, restart count stable
# Not: 0/1 ImagePullBackOff / ErrImagePull

# 2. nslookup from inside a business pod
kubectl exec -n <ns> <any-business-pod> -- nslookup kubernetes.default.svc.cluster.local
# Expect: a ClusterIP comes back (10.x.x.x)
# Not: ;; connection timed out; no servers could be reached
```

**Fix path** (image replacement for a private IDC):

```bash
kubectl set image deployment/coredns -n kube-system \
  coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1
# or your own corporate mirror address
# coredns reaches 1/1 Running within ~9s; DNS recovers immediately
```

**Triage doctrine**:

> **If dataprotection / cross-pod network cases fail right away while the cluster itself is Running and the first few smoke cases PASS, check coredns first; only then suspect the addon code.**

Otherwise you will burn hours inside the addon scripts chasing "why can't EZ-Connect resolve the short hostname" — which is only a symptom, not the root cause.
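The doctrine can be encoded as a tiny triage helper. A sketch under stated assumptions: the name `classify_dns_failure` and its two inputs (the exit code of an in-pod nslookup, and the coredns pod phase as reported by kubectl) are ours:

```shell
# Sketch triage: given the exit code of an in-pod nslookup and the coredns
# pod phase, decide whether to blame the environment or keep debugging the addon.
classify_dns_failure() {
  local nslookup_rc="$1" coredns_phase="$2"
  if [ "$nslookup_rc" -ne 0 ] && [ "$coredns_phase" != "Running" ]; then
    echo "environment: coredns not Running — fix DNS before touching addon scripts"
    return 0
  fi
  if [ "$nslookup_rc" -ne 0 ]; then
    echo "DNS broken but coredns Running — inspect kube-dns Service endpoints"
    return 0
  fi
  echo "DNS healthy — now it is fair to suspect addon code"
}
```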

## One-shot preflight script skeleton

```bash
#!/bin/bash
# smoke-preflight.sh — 7 checks that must pass before running smoke
set -euo pipefail

export NO_PROXY="127.0.0.1,localhost,*.local"
export KUBECONFIG="${KUBECONFIG:?must set}"
export NAMESPACE="${NAMESPACE:?must set}"

fail() { echo "[FAIL] $*"; exit 1; }
pass() { echo "[PASS] $*"; }

# 1. BackupRepo
br=$(kubectl get backuprepo -o jsonpath='{.items[?(@.status.phase=="Ready")].metadata.name}')
[ -n "$br" ] || fail "no Ready BackupRepo (also make sure one is annotated as default, see §1)"
pass "BackupRepo Ready: $br"

# 2. ImagePullSecret
kubectl get secret apecloud-registry-cred -n "$NAMESPACE" >/dev/null 2>&1 \
  || fail "missing imagePullSecret apecloud-registry-cred in $NAMESPACE"
pass "imagePullSecret exists"

# 3. StorageClass
sc=$(kubectl get sc -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}')
[ -n "$sc" ] || fail "no default StorageClass"
pass "default SC: $sc"

# 4. ACR pull probe (expects a daily marker file; create it after running §4 manually)
[ -f "/tmp/acr-pull-verified-$(date +%F)" ] || fail "ACR pull probe not run today, run §4 manually"
pass "ACR pull probed today"

# 5. NO_PROXY
[[ "${NO_PROXY:-}" == *"127.0.0.1"* ]] || fail "NO_PROXY missing 127.0.0.1 (HTTPS_PROXY will break port-forward)"
pass "NO_PROXY set"

# 6. vcluster alpine init (only if running in vcluster)
if [ -n "${HOST_KUBECONFIG:-}" ]; then
  # "|| true" keeps set -e / pipefail from aborting when grep finds no match
  bad=$(kubectl --kubeconfig="$HOST_KUBECONFIG" get pods -n "${HOST_NS:-vcluster}" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.initContainers[0].image}{"\n"}{end}' 2>/dev/null \
    | grep -v daocloud | grep -c alpine || true)
  [ "$bad" -eq 0 ] || fail "$bad host-side vcluster pods have unmirrored alpine init container — start autopatcher"
  pass "no unmirrored alpine init containers"
fi

# 7. CoreDNS Running (DNS-dependent ops: dataprotection, cross-pod replication, etc.)
cdns=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].status.phase}' 2>/dev/null)
[[ "$cdns" == *Running* ]] || fail "coredns not Running ($cdns) — DNS-dependent ops will fail. fix image: kubectl set image deployment/coredns -n kube-system coredns=<mirror>/coredns:1.10.1"
pass "coredns Running"

echo "=== preflight PASSED ==="
```

## Relationship to other docs

| Doc | Focus |
| --- | --- |
| `addon-smoke-test-pre-flight-checklist-guide.md` (this doc) | environment preconditions before the smoke harness starts |
| `addon-test-script-preflight-guide.md` | client-state locking of the test script / runner itself |
| `addon-test-environment-gate-hygiene-guide.md` | single-line post-restart gate |
| `addon-vanilla-vcluster-bootstrap-guide.md` | bootstrap of the vcluster itself |
| `addon-idc-image-registry-mirror-guide.md` | root-cause workflow for ACR pull failures |
| `addon-multi-ns-registry-scan-preflight-guide.md` | multi-namespace pull scan |

## Case appendix

- Oracle 12c smoke on idc4 vcluster (2026-05-05):
  - T08 RMAN + expdp all failed: `failureReason: no default BackupRepo found`
  - After adding `oracle-backup-repo-pvc` (storageProviderRef=pvc, SC=apelocal-hostpath-default) and it reached Ready, the error changed from `no default BackupRepo` to `failed to get target pods` (cluster Updating; downstream path normal)
  - Sediment: checklist item 1, BackupRepo provisioning, is mandatory
  - Evidence: `phase2-12c-smoke-20260505-113300/T08-manual/`

- vcluster API TLS handshake timeout (2026-05-05 12:32):
  - While recreating the cluster for the 12c smoke, helm install reported `cluster reachability check failed: TLS handshake timeout`
  - root cause: `HTTPS_PROXY=http://127.0.0.1:6666` intercepted the TLS handshake to localhost:18443
  - fix: `export NO_PROXY="127.0.0.1,localhost,*.local"`
  - Sediment: checklist item 5, the NO_PROXY check, is mandatory

- Oracle 19c T08 ORA-12154 — looked like an addon bug, was actually vcluster CoreDNS ImagePullBackOff (2026-05-05 13:08):
  - 19c standalone smoke T01-T07 all PASS (exec-based cases); T08 RMAN backups all failed with ORA-12154 TNS:could not resolve the connect identifier specified
  - First-layer hypothesis: the addon dataprotection scripts reference `${ORACLE_PORT}` / `${ORACLE_UNIQUE_NAME}` but the KB Job only injects `DP_DB_*` env → the W7 patch injected `ORACLE_SID`/`ORACLE_PORT`
  - After that static fix, **still ORA-12154** — kept chasing the real cause
  - Repro path: inside the oracle pod, `rman target sys/pwd@host:1521/SVC` fails with ORA-12154, while `rman target sys/pwd@<FQDN>:1521/SVC` in the same pod SUCCEEDS
  - Digging further: in-pod `nslookup` → `connection timed out`; `kubectl get pods -n kube-system` → `coredns 0/1 ImagePullBackOff`, already for 108 minutes
  - root cause: vcluster's default coredns image is `coredns/coredns:1.10.1` (docker.io), unreachable from the private IDC
  - fix: `kubectl set image deployment/coredns -n kube-system coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1` → coredns Running in 9s → immediately re-issued Backup CR `o19-i4-8854-rman19c-w7verify2`, **Completed** in 2m32s (528MB)
  - Sediment: checklist item 7, CoreDNS Running, is mandatory, plus the triage doctrine "when DNS-dependent cases fail, look at coredns first, then suspect addon code"
  - Lesson: cluster Running + partial smoke PASS does **not** guarantee an intact cluster network stack; DNS must be actively verified with layered probes
  - Evidence: `phase2-19c-smoke-20260505-115356/T08-r1/` + `W7-verify-evidence/` (James commit `345bfef9` + live-patch sequence)