From e6d93028f162b1ed9a2369adaaa3301eca3e8ee4 Mon Sep 17 00:00:00 2001 From: nayutah Date: Wed, 6 May 2026 13:15:44 +0800 Subject: [PATCH] docs: add lifecycle script exit code reliability guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds addon-lifecycle-script-exit-code-reliability-guide.md — a cross-engine reference for writing and reviewing KubeBlocks lifecycle action scripts. Covers: - The exit-0 silent-failure root cause and why KubeBlocks cannot detect it - Three mandatory verification paths: happy path, forced failure, runtime readback - kbagent 60s hard clamp and retryPolicy budget calculation - Pre-merge 7-item checklist Appendix A: Oracle W8b sqlplus WHENEVER SQLERROR / CDB open-mode wait case (contributed by Sophia). Appendix B: redis-cli exit-0 pitfalls and output-inspection pattern. Adds one entry to SKILL-INDEX.md Section 1 (设计/开发新 addon). --- docs/SKILL-INDEX.md | 1 + ...ycle-script-exit-code-reliability-guide.md | 321 ++++++++++++++++++ 2 files changed, 322 insertions(+) create mode 100644 docs/addon-lifecycle-script-exit-code-reliability-guide.md diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md index a84cdfd..f6f7334 100644 --- a/docs/SKILL-INDEX.md +++ b/docs/SKILL-INDEX.md @@ -27,6 +27,7 @@ - [`addon-clusterdef-topology-componentdef-regex-guide.md`](addon-clusterdef-topology-componentdef-regex-guide.md) — 新增 `serviceVersion` / `ComponentDefinition` family 时,同步更新 `ClusterDefinition.spec.topologies[].components[].compDef` 正则;避免 Cluster `PreCheckFailed` 后无 pod 创建,且 `ClusterDefinition` 自身仍显示 Available 的误判 - [`addon-cmpd-image-override-jsonpath-guide.md`](addon-cmpd-image-override-jsonpath-guide.md) — CMPD 与 ComponentVersion 的两层镜像解析规则、各 container slot 正确的 `kubectl -o jsonpath` 表达式、Oracle 12c/19c/23ai 镜像位置矩阵、T01 sentinel 断言写法、`spec.releases` vs `spec.versions` 陷阱 - [`addon-pvc-rebind-via-workload-intent-guide.md`](addon-pvc-rebind-via-workload-intent-guide.md) — 当一条 OpsRequest 需要把同名 PVC 从一块 PV 改绑到另一块(rebuild / restore-into-place / PV migration),用 Workload CR annotation 把意图交给 Workload 控制器(唯一写者),避免 OpsRequest 控制器、Workload 控制器、动态 provisioner 三方抢同名 PVC 所有权造成 `PersistentVolume "" not found` 或绑错 PV +- [`addon-lifecycle-script-exit-code-reliability-guide.md`](addon-lifecycle-script-exit-code-reliability-guide.md) — lifecycle action 脚本 exit code 可靠性:exit 0 ≠ 业务成功的根因、三路径验证口径(happy path / forced failure / runtime readback)、kbagent 60s clamp 与 retryPolicy 预算、合并前 7 条 checklist;附录含 Oracle W8b sqlplus 静默失败案例 + Redis redis-cli exit 0 陷阱 ### 2. 写新 smoke / chaos 测试 diff --git a/docs/addon-lifecycle-script-exit-code-reliability-guide.md b/docs/addon-lifecycle-script-exit-code-reliability-guide.md new file mode 100644 index 0000000..85e2fba --- /dev/null +++ b/docs/addon-lifecycle-script-exit-code-reliability-guide.md @@ -0,0 +1,321 @@ +# Addon Lifecycle Script Exit Code Reliability Guide + +> **Audience**: addon dev / TL / test engineer,为任何引擎编写或验收 lifecycle action 脚本 +> **Status**: stable +> **Applies to**: 任何 KubeBlocks Addon(accountProvision / memberJoin / switchover / preStop / postStart 等 lifecycle actions) +> **Applies to KB version**: any +> **Affected by version skew**: 不受 KB 版本影响 — exit code 可靠性原则跨 KB 版本一致;kbagent 60s clamp 适用于所有使用 lifecycle action 的 KB 版本 + +本文面向需要编写或验收 KubeBlocks Addon lifecycle action 脚本的工程师。核心论题:外部客户端命令(数据库 CLI、HTTP client、shell 工具)频繁在操作失败时仍然返回 exit code 0,导致 KubeBlocks 误判为成功。引擎相关命令和案例只放在附录中,正文保持 engine-neutral。 + +## 先用白话理解这篇文档 + +### 这篇文档解决什么问题 + +最容易踩的不是"操作失败了",而是"看起来成功了,但操作根本没发生"。 + +典型场景:`accountProvision` 脚本调用数据库 CLI 创建用户,CLI 因为数据库未完全就绪而失败,但 CLI 返回 exit 0。kbagent 记录"Action Executed Successfully",KubeBlocks 继续下一步。几秒后,依赖该用户的备份操作失败 —— 用户从未被创建过,但没有任何地方记录了这个失败。 + +这是一个**两层静默**问题: +1. CLI 吞掉了错误(exit 0) +2. kbagent 没有机制验证副作用,只信任 exit code + +### 读完你能做什么决策 + +- **写 lifecycle 脚本时**:知道哪些客户端命令不可信赖 exit code,知道如何加 output inspection +- **测试 lifecycle 脚本时**:知道必须测 forced failure path,不能只测 happy path +- **配置 kbagent 超时时**:知道 60s clamp 的存在,知道如何用 retryPolicy 扩展预算 +- **做合并前 review 时**:用 Section 4 的 checklist 逐条核对 + +--- + +## 1. The Core Problem: Exit 0 ≠ Business Action Succeeded + +External client commands (database CLIs, HTTP clients, shell utilities) frequently return exit code 0 even when the underlying business operation failed. This is a widespread issue — not specific to any engine — and has caused silent lifecycle failures across multiple addons. + +**Why this matters for KubeBlocks:** +- KubeBlocks determines lifecycle action success entirely by the script's exit code. +- If a script exits 0 after a failed operation, KubeBlocks records the action as "Executed" and moves on. +- There is no automatic runtime readback: the operator trusts the exit code completely. +- Consequences: users that should have been created were never created; joins that should have happened never happened; switchover targets that should have been set were never set — all silently. + +**Pattern of failure:** +``` +script runs DB client → DB client exits 0 despite error → KubeBlocks marks action Succeeded +→ cluster in partially-initialized state → downstream features fail with opaque errors +``` + +--- + +## 2. Verification Criteria: Three Paths You Must Test + +Every lifecycle script must be tested against all three paths, not just the happy path. + +### 2.1 Happy Path + +Run the action in a clean environment and verify: +- Script exits 0 +- The intended side effect actually occurred (verified by readback, not by absence of error) + +``` +Criterion: exit_code == 0 AND readback_confirms_effect == true +``` + +### 2.2 Forced Failure Path + +Construct a scenario where the operation must fail, then verify: +- Script exits **non-zero** +- The error is surfaced, not silently swallowed + +This is the path most teams skip. Without it, you cannot know whether your error-handling code works at all. + +**How to construct forced failure scenarios by action type:** + +| Action | How to Force Failure | +|---|---| +| `accountProvision` | Use invalid credentials, an unreachable DB endpoint, or insufficient permissions — NOT a duplicate account (idempotent implementations treat that as a no-op, not a failure) | +| `memberJoin` | Point the join target at an unreachable or invalid primary address | +| `switchover` | Trigger from a non-primary node; expect the script to detect and reject | +| `preStop` / `postStart` | Remove required connection params or point at a non-existent resource | + +``` +Criterion: exit_code != 0 (the script must propagate the failure) +``` + +### 2.3 Runtime Readback + +After a successful action (exit 0), independently verify the side effect occurred: +- Query the engine directly (not through the script's own logic) +- Use a separate tool or session +- Compare expected state vs actual state + +``` +Criterion: independent_query_confirms_effect == true +``` + +**Runtime readback is not optional.** Exit code alone is insufficient evidence of success. Exit code + readback together are the minimum acceptable standard. + +--- + +## 3. KubeBlocks kbagent Timeout and retryPolicy Implications + +### 3.1 The 60-Second Hard Clamp + +KubeBlocks upstream enforces `maxActionCallTimeout = 60s`. Any `timeoutSeconds` value in your chart that exceeds 60s is silently truncated to 60s. + +**Consequence:** If your lifecycle script can silently hang (waiting for a connection that never comes, or looping on an operation that keeps returning the wrong exit code), it will be killed at 60s — and KubeBlocks will record the action as timed out, which may be treated as failure or retried depending on your `retryPolicy`. + +**Rule:** Lifecycle scripts must fail fast and explicitly. A script that swallows errors and exits 0 after 55 seconds of silent retrying is worse than one that fails immediately with a clear error message. + +### 3.2 retryPolicy as the Extended Budget + +If your action genuinely needs more than 60s total (e.g., account provisioning on a cold database), use `retryPolicy`: + +```yaml +retryPolicy: + maxRetries: 30 + retryInterval: 10s + # 30 retries × 10s = 300s total window, within each retry's 60s clamp +``` + +**Key distinction:** `retryPolicy` retries the script on non-zero exit. It does not help if the script exits 0 incorrectly. Correct exit codes are a prerequisite for `retryPolicy` to work at all. + +### 3.3 Observability Requirement + +A script must be observable within the 60s window. This means: +- Print progress to stdout/stderr at meaningful checkpoints +- Exit non-zero with a descriptive message on any failure branch +- Do not use bare `|| true` or `|| exit 0` patterns that discard failures + +--- + +## 4. Validation Checklist (Self-Review Before Merging) + +Before merging any lifecycle script, confirm all of the following: + +- [ ] **Happy path tested**: Script exits 0, readback confirms the operation took effect +- [ ] **Forced failure path tested**: A constructed failure scenario produces non-zero exit +- [ ] **No bare error suppression**: No `|| true`, `|| exit 0`, `2>/dev/null` that discards actionable errors +- [ ] **Client exit codes verified**: For each external CLI call, confirmed what exit codes it returns on error (do not assume non-zero on failure) +- [ ] **Runtime readback included**: An independent check of the side effect is present in the test suite (not just exit code inspection) +- [ ] **Timeout-aware**: Script fails explicitly within 60s on error paths; no silent hanging +- [ ] **retryPolicy configured**: If the action needs >60s total, retryPolicy budget is calculated: `maxRetries × retryInterval` + +--- + +## Appendix A: Oracle W8b — accountProvision Silent Failure + +> *Contributed by Sophia (Oracle Addon test engineer)* + +### Root Cause + +Oracle's `sqlplus` exits 0 by default even when SQL statements fail. Without the `WHENEVER SQLERROR EXIT SQL.SQLCODE` directive, a `CREATE USER` failure (e.g., due to database not open, invalid credentials, or permission errors) causes `sqlplus` to complete the heredoc and exit 0. Combined with `set -euo pipefail` in the surrounding shell script, the silent success propagates — kbagent records "Action Executed Successfully" while the user was never created. + +Secondary issue: the `accountProvision` lifecycle action fires while Oracle is still initializing (`OPEN_MODE` not yet `READ WRITE`). Without an explicit open-mode wait, every SQL in the block fails with `ORA-01109: database not open`, but the script still exits 0. + +### Observed Symptom + +- kbagent lifecycle log: `Action Executed` for `accountProvision` +- The common user `c##kbdataprotection` never created in the CDB +- Backup operations fail immediately after cluster creation: `ORA-01017: invalid username/password` +- `kubectl exec ... sqlplus` as `c##kbdataprotection` returns `ORA-01017` — user does not exist + +This was discovered in W8b (2026-05-05) after observing that backup smoke tests always failed on fresh clusters, even though `accountProvision` was reported as successful by kbagent across multiple retry cycles. + +### Fix + +**1. Add `WHENEVER SQLERROR EXIT SQL.SQLCODE`** as the first line inside the sqlplus heredoc: + +```sql +WHENEVER SQLERROR EXIT SQL.SQLCODE +``` + +This causes sqlplus to propagate any SQL error as a non-zero exit code. With `set -euo pipefail` in the shell script, the non-zero exit is caught, kbagent records the failure, and retries are triggered. + +**2. Wait for CDB to reach OPEN status** before issuing DDL: + +```bash +for attempt in $(seq 1 10); do + open_mode=$(sqlplus -S / as sysdba <<'SQLEOF' 2>/dev/null | tr -d ' \r\n' +SET FEEDBACK OFF HEADING OFF VERIFY OFF PAGES 0 LINESIZE 1000 +select open_mode from v$database; +exit; +SQLEOF + ) + if [[ "$open_mode" == "READWRITE" ]]; then + break + fi + echo "Waiting for CDB OPEN_MODE=READWRITE (attempt $attempt, current: $open_mode)" >&2 + sleep 5 +done + +if [[ "$open_mode" != "READWRITE" ]]; then + echo "CDB did not reach READWRITE in time, aborting accountProvision" >&2 + exit 1 +fi +``` + +**3. Idempotent user creation** — wrap `CREATE USER` in a PL/SQL `DECLARE` block with `PRAGMA EXCEPTION_INIT` so re-runs don't fail if the user already exists. All other exceptions propagate normally (and exit non-zero via `WHENEVER SQLERROR EXIT`): + +```sql +WHENEVER SQLERROR EXIT SQL.SQLCODE +DECLARE + user_exists EXCEPTION; + PRAGMA EXCEPTION_INIT(user_exists, -1920); +BEGIN + EXECUTE IMMEDIATE q'[CREATE USER c##kbdataprotection IDENTIFIED BY "..." CONTAINER=ALL]'; +EXCEPTION + WHEN user_exists THEN NULL; -- ORA-01920: user already exists → idempotent no-op +END; +/ +``` + +### Forced Failure Verification + +Rather than staging a complex "DB not open" scenario, verify `WHENEVER SQLERROR EXIT` propagation directly using an intentional SQL error (`SELECT 1/0`). This is simpler, stable, and definitively proves exit code propagation. + +**With fix (expected: non-zero exit — ORA-01476):** + +```bash +kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc " +sqlplus -S / as sysdba <<'SQL' +WHENEVER SQLERROR EXIT SQL.SQLCODE +SELECT 1/0 FROM dual; +EXIT; +SQL" +echo "Exit code: $?" +# Expected: exit 1476 (ORA-01476: divisor is equal to zero) +``` + +**Without fix (expected: exit 0 — old silent behavior):** + +```bash +kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc " +sqlplus -S / as sysdba <<'SQL' +SELECT 1/0 FROM dual; +EXIT; +SQL" +echo "Exit code: $?" +# Old behavior: exit 0 — SQL error silently swallowed +``` + +Running both together constitutes a complete verification of sqlplus error propagation. Note: this confirms that `WHENEVER SQLERROR EXIT` is wired correctly — account runtime readback (verifying the user actually exists) is still required as a separate step. + +### Lesson + +- Never trust `sqlplus` (or any database CLI) to return non-zero on SQL errors without explicit configuration. +- The kbagent retry cycle masks silent failures: if a script exits 0 silently, all retries also "succeed", and the problem is only discovered downstream when a dependent operation fails. +- Always run a forced-failure path test for each lifecycle action before merge. + +--- + +## Appendix B: Redis — redis-cli Exit Code Pitfalls + +**Context:** Redis lifecycle scripts (accountProvision, memberJoin, switchover) use `redis-cli`. The `redis-cli` command does not always return non-zero on failure. + +**Known cases where redis-cli exits 0 on failure:** +- `redis-cli SET` on a read-only replica: exits 0, prints `(error) READONLY` +- `redis-cli CONFIG SET` with an invalid parameter: exits 0, prints `(error) ERR` +- `redis-cli AUTH` / `-a` with wrong password: exits 0, prints `(error) WRONGPASS` — **not a reliable forced-failure trigger; must use output inspection instead** + +**Required pattern — explicit output inspection:** + +```bash +# Wrong: trusts exit code only +redis-cli CONFIG SET maxmemory 512mb +echo "Config applied" # may print even on failure + +# Correct: inspect output +RESULT=$(redis-cli CONFIG SET maxmemory 512mb) +if echo "$RESULT" | grep -q "^(error)"; then + echo "CONFIG SET failed: $RESULT" >&2 + exit 1 +fi +echo "Config applied: $RESULT" +``` + +**Forced failure test:** + +The only reliable option across all Redis versions is an unreachable endpoint: + +```bash +# Unreachable endpoint: always exits non-zero (connection refused) +redis-cli -h 127.0.0.1 -p 9 PING +# Expected: Could not connect to Redis at 127.0.0.1:9: Connection refused +# Exit code: non-zero (confirmed across Redis 6.x, 7.x) +``` + +Do not use wrong-auth as a forced-failure trigger. As noted above, `redis-cli -a wrong PING` exits 0 even on auth failure — it is an output-inspection case, not an exit-code case. + +**Note on `ACL SETUSER` syntax:** When scripting Redis ACL commands in shell, always quote the password argument to prevent shell redirection. Write `'>password'` (single-quoted) or `\>password`, not `>password`: + +```bash +# Wrong: >testpass is parsed as stdout redirection by the shell +redis-cli ACL SETUSER alice on >testpass ~* +@all + +# Correct: single-quote the password argument +redis-cli ACL SETUSER alice on '>testpass' ~* +@all +``` + +**Runtime readback for accountProvision:** + +```bash +# After provisioning, verify user exists and can authenticate +redis-cli --no-auth-warning --user "$USERNAME" -a "$PASSWORD" PING +# Expected: PONG (confirms auth succeeds) + +redis-cli ACL GETUSER "$USERNAME" +# Expected: non-empty response showing permissions, flags, etc. +# Empty or "nil" response means the user was never created +``` + +**Switchover — detecting non-primary execution:** + +```bash +# Before executing switchover, verify this pod is actually primary +ROLE=$(redis-cli ROLE | head -1) +if [[ "$ROLE" != "master" ]]; then + echo "This pod is not primary (role=$ROLE), cannot initiate switchover" >&2 + exit 1 +fi +```