From e6d93028f162b1ed9a2369adaaa3301eca3e8ee4 Mon Sep 17 00:00:00 2001
From: nayutah <dullboy@apecloud.com>
Date: Wed, 6 May 2026 13:15:44 +0800
Subject: [PATCH] docs: add lifecycle script exit code reliability guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds addon-lifecycle-script-exit-code-reliability-guide.md — a
cross-engine reference for writing and reviewing KubeBlocks lifecycle
action scripts. Covers:

- The exit-0 silent-failure root cause and why KubeBlocks cannot detect it
- Three mandatory verification paths: happy path, forced failure, runtime readback
- kbagent 60s hard clamp and retryPolicy budget calculation
- Pre-merge 7-item checklist

Appendix A: Oracle W8b sqlplus WHENEVER SQLERROR / CDB open-mode wait
case (contributed by Sophia).
Appendix B: redis-cli exit-0 pitfalls and output-inspection pattern.

Adds one entry to SKILL-INDEX.md Section 1 (设计/开发新 addon).
---
 docs/SKILL-INDEX.md                           |   1 +
 ...ycle-script-exit-code-reliability-guide.md | 321 ++++++++++++++++++
 2 files changed, 322 insertions(+)
 create mode 100644 docs/addon-lifecycle-script-exit-code-reliability-guide.md

diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
index a84cdfd..f6f7334 100644
--- a/docs/SKILL-INDEX.md
+++ b/docs/SKILL-INDEX.md
@@ -27,6 +27,7 @@
 - [`addon-clusterdef-topology-componentdef-regex-guide.md`](addon-clusterdef-topology-componentdef-regex-guide.md) — 新增 `serviceVersion` / `ComponentDefinition` family 时，同步更新 `ClusterDefinition.spec.topologies[].components[].compDef` 正则；避免 Cluster `PreCheckFailed` 后无 pod 创建，且 `ClusterDefinition` 自身仍显示 Available 的误判
 - [`addon-cmpd-image-override-jsonpath-guide.md`](addon-cmpd-image-override-jsonpath-guide.md) — CMPD 与 ComponentVersion 的两层镜像解析规则、各 container slot 正确的 `kubectl -o jsonpath` 表达式、Oracle 12c/19c/23ai 镜像位置矩阵、T01 sentinel 断言写法、`spec.releases` vs `spec.versions` 陷阱
 - [`addon-pvc-rebind-via-workload-intent-guide.md`](addon-pvc-rebind-via-workload-intent-guide.md) — 当一条 OpsRequest 需要把同名 PVC 从一块 PV 改绑到另一块（rebuild / restore-into-place / PV migration），用 Workload CR annotation 把意图交给 Workload 控制器（唯一写者），避免 OpsRequest 控制器、Workload 控制器、动态 provisioner 三方抢同名 PVC 所有权造成 `PersistentVolume "" not found` 或绑错 PV
+- [`addon-lifecycle-script-exit-code-reliability-guide.md`](addon-lifecycle-script-exit-code-reliability-guide.md) — lifecycle action 脚本 exit code 可靠性：exit 0 ≠ 业务成功的根因、三路径验证口径（happy path / forced failure / runtime readback）、kbagent 60s clamp 与 retryPolicy 预算、合并前 7 条 checklist；附录含 Oracle W8b sqlplus 静默失败案例 + Redis redis-cli exit 0 陷阱
 
 ### 2. 写新 smoke / chaos 测试
 
diff --git a/docs/addon-lifecycle-script-exit-code-reliability-guide.md b/docs/addon-lifecycle-script-exit-code-reliability-guide.md
new file mode 100644
index 0000000..85e2fba
--- /dev/null
+++ b/docs/addon-lifecycle-script-exit-code-reliability-guide.md
@@ -0,0 +1,321 @@
+# Addon Lifecycle Script Exit Code Reliability Guide
+
+> **Audience**: addon dev / TL / test engineer，为任何引擎编写或验收 lifecycle action 脚本
+> **Status**: stable
+> **Applies to**: 任何 KubeBlocks Addon（accountProvision / memberJoin / switchover / preStop / postStart 等 lifecycle actions）
+> **Applies to KB version**: any
+> **Affected by version skew**: 不受 KB 版本影响 — exit code 可靠性原则跨 KB 版本一致；kbagent 60s clamp 适用于所有使用 lifecycle action 的 KB 版本
+
+本文面向需要编写或验收 KubeBlocks Addon lifecycle action 脚本的工程师。核心论题：外部客户端命令（数据库 CLI、HTTP client、shell 工具）频繁在操作失败时仍然返回 exit code 0，导致 KubeBlocks 误判为成功。引擎相关命令和案例只放在附录中，正文保持 engine-neutral。
+
+## 先用白话理解这篇文档
+
+### 这篇文档解决什么问题
+
+最容易踩的不是"操作失败了"，而是"看起来成功了，但操作根本没发生"。
+
+典型场景：`accountProvision` 脚本调用数据库 CLI 创建用户，CLI 因为数据库未完全就绪而失败，但 CLI 返回 exit 0。kbagent 记录"Action Executed Successfully"，KubeBlocks 继续下一步。几秒后，依赖该用户的备份操作失败 —— 用户从未被创建过，但没有任何地方记录了这个失败。
+
+这是一个**两层静默**问题：
+1. CLI 吞掉了错误（exit 0）
+2. kbagent 没有机制验证副作用，只信任 exit code
+
+### 读完你能做什么决策
+
+- **写 lifecycle 脚本时**：知道哪些客户端命令不可信赖 exit code，知道如何加 output inspection
+- **测试 lifecycle 脚本时**：知道必须测 forced failure path，不能只测 happy path
+- **配置 kbagent 超时时**：知道 60s clamp 的存在，知道如何用 retryPolicy 扩展预算
+- **做合并前 review 时**：用 Section 4 的 checklist 逐条核对
+
+---
+
+## 1. The Core Problem: Exit 0 ≠ Business Action Succeeded
+
+External client commands (database CLIs, HTTP clients, shell utilities) frequently return exit code 0 even when the underlying business operation failed. This is a widespread issue — not specific to any engine — and has caused silent lifecycle failures across multiple addons.
+
+**Why this matters for KubeBlocks:**
+- KubeBlocks determines lifecycle action success entirely by the script's exit code.
+- If a script exits 0 after a failed operation, KubeBlocks records the action as "Executed" and moves on.
+- There is no automatic runtime readback: the operator trusts the exit code completely.
+- Consequences: users that should have been created were never created; joins that should have happened never happened; switchover targets that should have been set were never set — all silently.
+
+**Pattern of failure:**
+```
+script runs DB client → DB client exits 0 despite error → KubeBlocks marks action Succeeded
+→ cluster in partially-initialized state → downstream features fail with opaque errors
+```
+
+---
+
+## 2. Verification Criteria: Three Paths You Must Test
+
+Every lifecycle script must be tested against all three paths, not just the happy path.
+
+### 2.1 Happy Path
+
+Run the action in a clean environment and verify:
+- Script exits 0
+- The intended side effect actually occurred (verified by readback, not by absence of error)
+
+```
+Criterion: exit_code == 0  AND  readback_confirms_effect == true
+```
+
+### 2.2 Forced Failure Path
+
+Construct a scenario where the operation must fail, then verify:
+- Script exits **non-zero**
+- The error is surfaced, not silently swallowed
+
+This is the path most teams skip. Without it, you cannot know whether your error-handling code works at all.
+
+**How to construct forced failure scenarios by action type:**
+
+| Action | How to Force Failure |
+|---|---|
+| `accountProvision` | Use invalid credentials, an unreachable DB endpoint, or insufficient permissions — NOT a duplicate account (idempotent implementations treat that as a no-op, not a failure) |
+| `memberJoin` | Point the join target at an unreachable or invalid primary address |
+| `switchover` | Trigger from a non-primary node; expect the script to detect and reject |
+| `preStop` / `postStart` | Remove required connection params or point at a non-existent resource |
+
+```
+Criterion: exit_code != 0  (the script must propagate the failure)
+```
+
+### 2.3 Runtime Readback
+
+After a successful action (exit 0), independently verify the side effect occurred:
+- Query the engine directly (not through the script's own logic)
+- Use a separate tool or session
+- Compare expected state vs actual state
+
+```
+Criterion: independent_query_confirms_effect == true
+```
+
+**Runtime readback is not optional.** Exit code alone is insufficient evidence of success. Exit code + readback together are the minimum acceptable standard.
+
+---
+
+## 3. KubeBlocks kbagent Timeout and retryPolicy Implications
+
+### 3.1 The 60-Second Hard Clamp
+
+KubeBlocks upstream enforces `maxActionCallTimeout = 60s`. Any `timeoutSeconds` value in your chart that exceeds 60s is silently truncated to 60s.
+
+**Consequence:** If your lifecycle script can silently hang (waiting for a connection that never comes, or looping on an operation that keeps returning the wrong exit code), it will be killed at 60s — and KubeBlocks will record the action as timed out, which may be treated as failure or retried depending on your `retryPolicy`.
+
+**Rule:** Lifecycle scripts must fail fast and explicitly. A script that swallows errors and exits 0 after 55 seconds of silent retrying is worse than one that fails immediately with a clear error message.
+
+### 3.2 retryPolicy as the Extended Budget
+
+If your action genuinely needs more than 60s total (e.g., account provisioning on a cold database), use `retryPolicy`:
+
+```yaml
+retryPolicy:
+  maxRetries: 30
+  retryInterval: 10s
+  # 30 retries × 10s = 300s total window, within each retry's 60s clamp
+```
+
+**Key distinction:** `retryPolicy` retries the script on non-zero exit. It does not help if the script exits 0 incorrectly. Correct exit codes are a prerequisite for `retryPolicy` to work at all.
+
+### 3.3 Observability Requirement
+
+A script must be observable within the 60s window. This means:
+- Print progress to stdout/stderr at meaningful checkpoints
+- Exit non-zero with a descriptive message on any failure branch
+- Do not use bare `|| true` or `|| exit 0` patterns that discard failures
+
+---
+
+## 4. Validation Checklist (Self-Review Before Merging)
+
+Before merging any lifecycle script, confirm all of the following:
+
+- [ ] **Happy path tested**: Script exits 0, readback confirms the operation took effect
+- [ ] **Forced failure path tested**: A constructed failure scenario produces non-zero exit
+- [ ] **No bare error suppression**: No `|| true`, `|| exit 0`, `2>/dev/null` that discards actionable errors
+- [ ] **Client exit codes verified**: For each external CLI call, confirmed what exit codes it returns on error (do not assume non-zero on failure)
+- [ ] **Runtime readback included**: An independent check of the side effect is present in the test suite (not just exit code inspection)
+- [ ] **Timeout-aware**: Script fails explicitly within 60s on error paths; no silent hanging
+- [ ] **retryPolicy configured**: If the action needs >60s total, retryPolicy budget is calculated: `maxRetries × retryInterval`
+
+---
+
+## Appendix A: Oracle W8b — accountProvision Silent Failure
+
+> *Contributed by Sophia (Oracle Addon test engineer)*
+
+### Root Cause
+
+Oracle's `sqlplus` exits 0 by default even when SQL statements fail. Without the `WHENEVER SQLERROR EXIT SQL.SQLCODE` directive, a `CREATE USER` failure (e.g., due to database not open, invalid credentials, or permission errors) causes `sqlplus` to complete the heredoc and exit 0. Combined with `set -euo pipefail` in the surrounding shell script, the silent success propagates — kbagent records "Action Executed Successfully" while the user was never created.
+
+Secondary issue: the `accountProvision` lifecycle action fires while Oracle is still initializing (`OPEN_MODE` not yet `READ WRITE`). Without an explicit open-mode wait, every SQL in the block fails with `ORA-01109: database not open`, but the script still exits 0.
+
+### Observed Symptom
+
+- kbagent lifecycle log: `Action Executed` for `accountProvision`
+- The common user `c##kbdataprotection` never created in the CDB
+- Backup operations fail immediately after cluster creation: `ORA-01017: invalid username/password`
+- `kubectl exec ... sqlplus` as `c##kbdataprotection` returns `ORA-01017` — user does not exist
+
+This was discovered in W8b (2026-05-05) after observing that backup smoke tests always failed on fresh clusters, even though `accountProvision` was reported as successful by kbagent across multiple retry cycles.
+
+### Fix
+
+**1. Add `WHENEVER SQLERROR EXIT SQL.SQLCODE`** as the first line inside the sqlplus heredoc:
+
+```sql
+WHENEVER SQLERROR EXIT SQL.SQLCODE
+```
+
+This causes sqlplus to propagate any SQL error as a non-zero exit code. With `set -euo pipefail` in the shell script, the non-zero exit is caught, kbagent records the failure, and retries are triggered.
+
+**2. Wait for CDB to reach OPEN status** before issuing DDL:
+
+```bash
+for attempt in $(seq 1 10); do
+  open_mode=$(sqlplus -S / as sysdba <<'SQLEOF' 2>/dev/null | tr -d ' \r\n'
+SET FEEDBACK OFF HEADING OFF VERIFY OFF PAGES 0 LINESIZE 1000
+select open_mode from v$database;
+exit;
+SQLEOF
+  )
+  if [[ "$open_mode" == "READWRITE" ]]; then
+    break
+  fi
+  echo "Waiting for CDB OPEN_MODE=READWRITE (attempt $attempt, current: $open_mode)" >&2
+  sleep 5
+done
+
+if [[ "$open_mode" != "READWRITE" ]]; then
+  echo "CDB did not reach READWRITE in time, aborting accountProvision" >&2
+  exit 1
+fi
+```
+
+**3. Idempotent user creation** — wrap `CREATE USER` in a PL/SQL `DECLARE` block with `PRAGMA EXCEPTION_INIT` so re-runs don't fail if the user already exists. All other exceptions propagate normally (and exit non-zero via `WHENEVER SQLERROR EXIT`):
+
+```sql
+WHENEVER SQLERROR EXIT SQL.SQLCODE
+DECLARE
+  user_exists EXCEPTION;
+  PRAGMA EXCEPTION_INIT(user_exists, -1920);
+BEGIN
+  EXECUTE IMMEDIATE q'[CREATE USER c##kbdataprotection IDENTIFIED BY "..." CONTAINER=ALL]';
+EXCEPTION
+  WHEN user_exists THEN NULL;  -- ORA-01920: user already exists → idempotent no-op
+END;
+/
+```
+
+### Forced Failure Verification
+
+Rather than staging a complex "DB not open" scenario, verify `WHENEVER SQLERROR EXIT` propagation directly using an intentional SQL error (`SELECT 1/0`). This is simpler, stable, and definitively proves exit code propagation.
+
+**With fix (expected: non-zero exit — ORA-01476):**
+
+```bash
+kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc "
+sqlplus -S / as sysdba <<'SQL'
+WHENEVER SQLERROR EXIT SQL.SQLCODE
+SELECT 1/0 FROM dual;
+EXIT;
+SQL"
+echo "Exit code: $?"
+# Expected: exit 1476 (ORA-01476: divisor is equal to zero)
+```
+
+**Without fix (expected: exit 0 — old silent behavior):**
+
+```bash
+kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc "
+sqlplus -S / as sysdba <<'SQL'
+SELECT 1/0 FROM dual;
+EXIT;
+SQL"
+echo "Exit code: $?"
+# Old behavior: exit 0 — SQL error silently swallowed
+```
+
+Running both together constitutes a complete verification of sqlplus error propagation. Note: this confirms that `WHENEVER SQLERROR EXIT` is wired correctly — account runtime readback (verifying the user actually exists) is still required as a separate step.
+
+### Lesson
+
+- Never trust `sqlplus` (or any database CLI) to return non-zero on SQL errors without explicit configuration.
+- The kbagent retry cycle masks silent failures: if a script exits 0 silently, all retries also "succeed", and the problem is only discovered downstream when a dependent operation fails.
+- Always run a forced-failure path test for each lifecycle action before merge.
+
+---
+
+## Appendix B: Redis — redis-cli Exit Code Pitfalls
+
+**Context:** Redis lifecycle scripts (accountProvision, memberJoin, switchover) use `redis-cli`. The `redis-cli` command does not always return non-zero on failure.
+
+**Known cases where redis-cli exits 0 on failure:**
+- `redis-cli SET` on a read-only replica: exits 0, prints `(error) READONLY`
+- `redis-cli CONFIG SET` with an invalid parameter: exits 0, prints `(error) ERR`
+- `redis-cli AUTH` / `-a` with wrong password: exits 0, prints `(error) WRONGPASS` — **not a reliable forced-failure trigger; must use output inspection instead**
+
+**Required pattern — explicit output inspection:**
+
+```bash
+# Wrong: trusts exit code only
+redis-cli CONFIG SET maxmemory 512mb
+echo "Config applied"  # may print even on failure
+
+# Correct: inspect output
+RESULT=$(redis-cli CONFIG SET maxmemory 512mb)
+if echo "$RESULT" | grep -q "^(error)"; then
+  echo "CONFIG SET failed: $RESULT" >&2
+  exit 1
+fi
+echo "Config applied: $RESULT"
+```
+
+**Forced failure test:**
+
+The only reliable option across all Redis versions is an unreachable endpoint:
+
+```bash
+# Unreachable endpoint: always exits non-zero (connection refused)
+redis-cli -h 127.0.0.1 -p 9 PING
+# Expected: Could not connect to Redis at 127.0.0.1:9: Connection refused
+# Exit code: non-zero (confirmed across Redis 6.x, 7.x)
+```
+
+Do not use wrong-auth as a forced-failure trigger. As noted above, `redis-cli -a wrong PING` exits 0 even on auth failure — it is an output-inspection case, not an exit-code case.
+
+**Note on `ACL SETUSER` syntax:** When scripting Redis ACL commands in shell, always quote the password argument to prevent shell redirection. Write `'>password'` (single-quoted) or `\>password`, not `>password`:
+
+```bash
+# Wrong: >testpass is parsed as stdout redirection by the shell
+redis-cli ACL SETUSER alice on >testpass ~* +@all
+
+# Correct: single-quote the password argument
+redis-cli ACL SETUSER alice on '>testpass' ~* +@all
+```
+
+**Runtime readback for accountProvision:**
+
+```bash
+# After provisioning, verify user exists and can authenticate
+redis-cli --no-auth-warning --user "$USERNAME" -a "$PASSWORD" PING
+# Expected: PONG (confirms auth succeeds)
+
+redis-cli ACL GETUSER "$USERNAME"
+# Expected: non-empty response showing permissions, flags, etc.
+# Empty or "nil" response means the user was never created
+```
+
+**Switchover — detecting non-primary execution:**
+
+```bash
+# Before executing switchover, verify this pod is actually primary
+ROLE=$(redis-cli ROLE | head -1)
+if [[ "$ROLE" != "master" ]]; then
+  echo "This pod is not primary (role=$ROLE), cannot initiate switchover" >&2
+  exit 1
+fi
+```