1 change: 1 addition & 0 deletions docs/SKILL-INDEX.md
@@ -27,6 +27,7 @@
- [`addon-clusterdef-topology-componentdef-regex-guide.md`](addon-clusterdef-topology-componentdef-regex-guide.md): when adding a new `serviceVersion` / `ComponentDefinition` family, update the `ClusterDefinition.spec.topologies[].components[].compDef` regex in the same change; this avoids the misleading state where the Cluster hits `PreCheckFailed` and no pods are created while the `ClusterDefinition` itself still shows Available
- [`addon-cmpd-image-override-jsonpath-guide.md`](addon-cmpd-image-override-jsonpath-guide.md): the two-layer image resolution rules across CMPD and ComponentVersion, the correct `kubectl -o jsonpath` expression for each container slot, an image location matrix for Oracle 12c/19c/23ai, how to write the T01 sentinel assertion, and the `spec.releases` vs `spec.versions` pitfall
- [`addon-pvc-rebind-via-workload-intent-guide.md`](addon-pvc-rebind-via-workload-intent-guide.md): when an OpsRequest needs to rebind a PVC of the same name from one PV to another (rebuild / restore-into-place / PV migration), pass the intent to the Workload controller (the sole writer) through a Workload CR annotation, so that the OpsRequest controller, the Workload controller, and the dynamic provisioner do not race for ownership of the same PVC and cause `PersistentVolume "" not found` or a bind to the wrong PV
- [`addon-lifecycle-script-exit-code-reliability-guide.md`](addon-lifecycle-script-exit-code-reliability-guide.md): exit code reliability for lifecycle action scripts: the root cause of exit 0 ≠ business success, the three-path verification criteria (happy path / forced failure / runtime readback), the kbagent 60s clamp and the retryPolicy budget, and a 7-item pre-merge checklist; the appendices cover the Oracle W8b sqlplus silent-failure case and the Redis redis-cli exit 0 pitfalls

### 2. Writing new smoke / chaos tests

321 changes: 321 additions & 0 deletions docs/addon-lifecycle-script-exit-code-reliability-guide.md
@@ -0,0 +1,321 @@
# Addon Lifecycle Script Exit Code Reliability Guide

> **Audience**: addon developers, TLs, and test engineers who write or accept lifecycle action scripts for any engine
> **Status**: stable
> **Applies to**: any KubeBlocks Addon (lifecycle actions such as accountProvision / memberJoin / switchover / preStop / postStart)
> **Applies to KB version**: any
> **Affected by version skew**: not affected by the KB version. The exit code reliability principles hold across KB versions, and the kbagent 60s clamp applies to every KB version that uses lifecycle actions

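This document is for engineers who write or accept KubeBlocks Addon lifecycle action scripts. The core thesis: external client commands (database CLIs, HTTP clients, shell utilities) frequently return exit code 0 even when the operation failed, which leads KubeBlocks to record a success that never happened. Engine-specific commands and case studies live only in the appendices; the main text stays engine-neutral.

## A Plain-Language Overview

### What problem this document solves

The easiest trap to fall into is not "the operation failed" but "it looks like it succeeded, yet the operation never happened".

A typical scenario: the `accountProvision` script calls a database CLI to create a user. The CLI fails because the database is not fully ready, but it still returns exit 0. kbagent records "Action Executed Successfully" and KubeBlocks moves on to the next step. A few seconds later, a backup operation that depends on that user fails. The user was never created, yet nothing recorded the failure.

This is a **two-layer silent failure**:
1. The CLI swallows the error (exit 0)
2. kbagent has no mechanism to verify side effects; it trusts only the exit code

### What decisions you can make after reading

- **When writing lifecycle scripts**: know which client commands have untrustworthy exit codes, and how to add output inspection
- **When testing lifecycle scripts**: know that you must test the forced failure path, not only the happy path
- **When configuring kbagent timeouts**: know that the 60s clamp exists, and how to extend the budget with retryPolicy
- **When doing pre-merge review**: go through the checklist in Section 4 item by item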

---

## 1. The Core Problem: Exit 0 ≠ Business Action Succeeded

External client commands (database CLIs, HTTP clients, shell utilities) frequently return exit code 0 even when the underlying business operation failed. This is a widespread issue — not specific to any engine — and has caused silent lifecycle failures across multiple addons.

**Why this matters for KubeBlocks:**
- KubeBlocks determines lifecycle action success entirely by the script's exit code.
- If a script exits 0 after a failed operation, KubeBlocks records the action as "Executed" and moves on.
- There is no automatic runtime readback: the operator trusts the exit code completely.
- Consequences: users that should have been created were never created; joins that should have happened never happened; switchover targets that should have been set were never set — all silently.

**Pattern of failure:**
```
script runs DB client → DB client exits 0 despite error → KubeBlocks marks action Succeeded
→ cluster in partially-initialized state → downstream features fail with opaque errors
```
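One way to break this chain is to route every client call through a wrapper that treats error text in the output as a failure, regardless of the exit code. The sketch below is illustrative only: `run_checked` and the error pattern passed to it are hypothetical names, not part of KubeBlocks or of any client, and the pattern has to be adapted to the real error format of the client in use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail when a command exits non-zero OR when its
# output matches an error pattern (for clients that exit 0 on failure).
# Usage: run_checked <extended-regex> <command> [args...]
run_checked() {
  local pattern=$1
  shift
  local out rc=0
  # Capture stdout and stderr together; remember the real exit code.
  out=$("$@" 2>&1) || rc=$?
  printf '%s\n' "$out"
  if [ "$rc" -ne 0 ]; then
    echo "command failed with exit code $rc: $*" >&2
    return "$rc"
  fi
  # Exit code was 0: still refuse to call it a success if the output
  # contains something that looks like an error.
  if printf '%s\n' "$out" | grep -Eq -- "$pattern"; then
    echo "command exited 0 but reported an error: $*" >&2
    return 1
  fi
  return 0
}
```

A call such as `run_checked 'ORA-[0-9]+' some_client ...` would then return non-zero for both loud and silent failures, which is what KubeBlocks needs to see.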

---

## 2. Verification Criteria: Three Paths You Must Test

Every lifecycle script must be tested against all three paths, not just the happy path.

### 2.1 Happy Path

Run the action in a clean environment and verify:
- Script exits 0
- The intended side effect actually occurred (verified by readback, not by absence of error)

```
Criterion: exit_code == 0 AND readback_confirms_effect == true
```

### 2.2 Forced Failure Path

Construct a scenario where the operation must fail, then verify:
- Script exits **non-zero**
- The error is surfaced, not silently swallowed

This is the path most teams skip. Without it, you cannot know whether your error-handling code works at all.

**How to construct forced failure scenarios by action type:**

| Action | How to Force Failure |
|---|---|
| `accountProvision` | Use invalid credentials, an unreachable DB endpoint, or insufficient permissions — NOT a duplicate account (idempotent implementations treat that as a no-op, not a failure) |
| `memberJoin` | Point the join target at an unreachable or invalid primary address |
| `switchover` | Trigger from a non-primary node; expect the script to detect and reject |
| `preStop` / `postStart` | Remove required connection params or point at a non-existent resource |

```
Criterion: exit_code != 0 (the script must propagate the failure)
```
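In a test suite, the criterion above can be expressed as an assertion that passes only when the command under test fails. This is a minimal sketch; `expect_failure` is a hypothetical helper name, and the command it wraps would be your lifecycle script invoked against the broken scenario from the table.

```bash
# Hypothetical test helper: passes only when the command under test fails.
# Usage: expect_failure <command> [args...]
expect_failure() {
  local rc=0
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "FAIL: expected non-zero exit, got 0: $*" >&2
    return 1
  fi
  echo "PASS: exit code $rc: $*"
  return 0
}
```

If this assertion fails for a scenario that cannot possibly succeed, the script is swallowing the error somewhere.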

### 2.3 Runtime Readback

After a successful action (exit 0), independently verify the side effect occurred:
- Query the engine directly (not through the script's own logic)
- Use a separate tool or session
- Compare expected state vs actual state

```
Criterion: independent_query_confirms_effect == true
```

**Runtime readback is not optional.** Exit code alone is insufficient evidence of success. Exit code + readback together are the minimum acceptable standard.
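The combined standard can be sketched as a single check that distinguishes "the action failed" from "the action exited 0 but had no effect". `verify_action` is a hypothetical helper; both arguments are shell command strings that you supply, and the readback command should query the engine through a path independent of the script itself.

```bash
# Hypothetical helper: an action counts as verified only when it exits 0
# AND an independent readback command also succeeds.
# Usage: verify_action "<action command>" "<readback command>"
# Returns 1 if the action fails, 2 if the action exits 0 but the readback fails.
verify_action() {
  local action=$1 readback=$2
  if ! bash -c "$action"; then
    echo "FAIL: action exited non-zero" >&2
    return 1
  fi
  if ! bash -c "$readback"; then
    echo "FAIL: action exited 0 but readback found no effect" >&2
    return 2
  fi
  echo "PASS: exit code and readback agree"
}
```

A return value of 2 is the silent-failure signature this guide is about.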

---

## 3. KubeBlocks kbagent Timeout and retryPolicy Implications

### 3.1 The 60-Second Hard Clamp

KubeBlocks upstream enforces `maxActionCallTimeout = 60s`. Any `timeoutSeconds` value in your chart that exceeds 60s is silently truncated to 60s.

**Consequence:** If your lifecycle script can silently hang (waiting for a connection that never comes, or looping on an operation that keeps returning the wrong exit code), it will be killed at 60s — and KubeBlocks will record the action as timed out, which may be treated as failure or retried depending on your `retryPolicy`.

**Rule:** Lifecycle scripts must fail fast and explicitly. A script that swallows errors and exits 0 after 55 seconds of silent retrying is worse than one that fails immediately with a clear error message.
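A fail-fast wait can be sketched as a polling loop with its own deadline, chosen to be comfortably below the 60s clamp so the script reports the timeout itself instead of being killed. `wait_until` is a hypothetical helper; the deadline and interval values are for you to choose.

```bash
# Hypothetical helper: poll a check until it passes or a deadline expires.
# Usage: wait_until <deadline_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local deadline=$1 interval=$2
  shift 2
  local start=$SECONDS   # bash built-in: seconds since the shell started
  while true; do
    if "$@"; then
      return 0
    fi
    # Stop if another sleep would take us past the deadline.
    if [ $((SECONDS - start + interval)) -gt "$deadline" ]; then
      echo "timed out after $((SECONDS - start))s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, `wait_until 45 5 check_engine_ready || exit 1` (with `check_engine_ready` being your own probe) leaves headroom for the rest of the script and exits non-zero with a message when the engine never becomes ready.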

### 3.2 retryPolicy as the Extended Budget

If your action genuinely needs more than 60s total (e.g., account provisioning on a cold database), use `retryPolicy`:

```yaml
retryPolicy:
  maxRetries: 30
  retryInterval: 10s
  # 30 retries × 10s interval ≈ 300s of retry window; each attempt is still capped by the 60s clamp
```

**Key distinction:** `retryPolicy` retries the script on non-zero exit. It does not help if the script exits 0 incorrectly. Correct exit codes are a prerequisite for `retryPolicy` to work at all.
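The budget arithmetic can be made explicit with a small calculation. This sketch rests on assumptions that are not confirmed by the text above: attempts run sequentially, each attempt can take up to the 60s clamp, and `retryInterval` is the wait between attempts. `retry_budget` is a hypothetical helper.

```bash
# Hypothetical helper: estimate the retry window in seconds.
# Assumes attempts run sequentially, each capped at per_attempt_cap seconds,
# with retry_interval seconds of waiting between attempts.
# Usage: retry_budget <max_retries> <retry_interval_s> [per_attempt_cap_s]
retry_budget() {
  local max_retries=$1 retry_interval=$2 per_attempt_cap=${3:-60}
  local waiting=$((max_retries * retry_interval))
  local worst_case=$((max_retries * (retry_interval + per_attempt_cap)))
  echo "waiting=${waiting}s worst_case=${worst_case}s"
}

retry_budget 30 10   # prints: waiting=300s worst_case=2100s
```

The first figure is the `maxRetries × retryInterval` budget; the second shows how far the wall-clock time can stretch if every attempt runs into the clamp, which is another reason for scripts to fail fast.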

### 3.3 Observability Requirement

A script must be observable within the 60s window. This means:
- Print progress to stdout/stderr at meaningful checkpoints
- Exit non-zero with a descriptive message on any failure branch
- Do not use bare `|| true` or `|| exit 0` patterns that discard failures
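These requirements can be met with a few lines of scaffolding at the top of a script. The sketch below is one possible shape, not a KubeBlocks convention: `log`, `die`, and the `ACTION_NAME` variable are hypothetical names.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helpers for a lifecycle script: timestamped checkpoints on
# stderr, and a single exit point for failures.
log() {
  printf '%s [%s] %s\n' "$(date -u +%H:%M:%SZ)" "${ACTION_NAME:-action}" "$*" >&2
}

die() {
  log "FAILED: $*"
  exit 1
}

# Report the failing line for any error that was not handled explicitly.
trap 'log "unhandled error at line $LINENO (exit code $?)"' ERR
```

A failure branch then reads `some_client ... || die "client call failed"` rather than `|| true`, so every exit path leaves a message and a non-zero status.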

---

## 4. Validation Checklist (Self-Review Before Merging)

Before merging any lifecycle script, confirm all of the following:

- [ ] **Happy path tested**: Script exits 0, readback confirms the operation took effect
- [ ] **Forced failure path tested**: A constructed failure scenario produces non-zero exit
- [ ] **No bare error suppression**: No `|| true`, `|| exit 0`, `2>/dev/null` that discards actionable errors
- [ ] **Client exit codes verified**: For each external CLI call, confirmed what exit codes it returns on error (do not assume non-zero on failure)
- [ ] **Runtime readback included**: An independent check of the side effect is present in the test suite (not just exit code inspection)
- [ ] **Timeout-aware**: Script fails explicitly within 60s on error paths; no silent hanging
- [ ] **retryPolicy configured**: If the action needs >60s total, retryPolicy budget is calculated: `maxRetries × retryInterval`
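The "no bare error suppression" item can be partly automated with a grep-based lint. This is a heuristic sketch: `lint_error_suppression` is a hypothetical helper, the pattern will produce false positives (for example a deliberate `2>/dev/null` on a probe), and every hit still needs a human decision.

```bash
# Hypothetical pre-merge lint: list lines that may discard failures.
# Prints file:line:text for each hit.
# Returns 1 if anything was found, 0 if clean, 2 if grep itself failed.
# Usage: lint_error_suppression <script> [more scripts...]
lint_error_suppression() {
  local pattern='\|\|[[:space:]]*(true|:|exit[[:space:]]+0)([^[:alnum:]_]|$)|2>[[:space:]]*/dev/null'
  local rc=0
  grep -nHE -- "$pattern" "$@" || rc=$?
  case $rc in
    0) return 1 ;;   # grep matched: suppression patterns present
    1) return 0 ;;   # grep found nothing
    *) return 2 ;;   # grep error (for example an unreadable file)
  esac
}
```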

---

## Appendix A: Oracle W8b — accountProvision Silent Failure

> *Contributed by Sophia (Oracle Addon test engineer)*

### Root Cause

Oracle's `sqlplus` exits 0 by default even when SQL statements fail. Without the `WHENEVER SQLERROR EXIT SQL.SQLCODE` directive, a `CREATE USER` failure (e.g., due to database not open, invalid credentials, or permission errors) causes `sqlplus` to complete the heredoc and exit 0. `set -euo pipefail` in the surrounding shell script does not help, because there is no non-zero status for it to catch: kbagent records "Action Executed Successfully" while the user was never created.

Secondary issue: the `accountProvision` lifecycle action fires while Oracle is still initializing (`OPEN_MODE` not yet `READ WRITE`). Without an explicit open-mode wait, every SQL in the block fails with `ORA-01109: database not open`, but the script still exits 0.

### Observed Symptom

- kbagent lifecycle log: `Action Executed` for `accountProvision`
- The common user `c##kbdataprotection` never created in the CDB
- Backup operations fail immediately after cluster creation: `ORA-01017: invalid username/password`
- `kubectl exec ... sqlplus` as `c##kbdataprotection` returns `ORA-01017` — user does not exist

This was discovered in W8b (2026-05-05) after observing that backup smoke tests always failed on fresh clusters, even though `accountProvision` was reported as successful by kbagent across multiple retry cycles.

### Fix

**1. Add `WHENEVER SQLERROR EXIT SQL.SQLCODE`** as the first line inside the sqlplus heredoc:

```sql
WHENEVER SQLERROR EXIT SQL.SQLCODE
```

This causes sqlplus to propagate any SQL error as a non-zero exit code. With `set -euo pipefail` in the shell script, the non-zero exit is caught, kbagent records the failure, and retries are triggered.

**2. Wait for CDB to reach OPEN status** before issuing DDL:

```bash
for attempt in $(seq 1 10); do
  open_mode=$(sqlplus -S / as sysdba <<'SQLEOF' 2>/dev/null | tr -d ' \r\n'
SET FEEDBACK OFF HEADING OFF VERIFY OFF PAGES 0 LINESIZE 1000
select open_mode from v$database;
exit;
SQLEOF
)
  if [[ "$open_mode" == "READWRITE" ]]; then
    break
  fi
  echo "Waiting for CDB OPEN_MODE=READWRITE (attempt $attempt, current: $open_mode)" >&2
  sleep 5
done

if [[ "$open_mode" != "READWRITE" ]]; then
  echo "CDB did not reach READWRITE in time, aborting accountProvision" >&2
  exit 1
fi
```

**3. Idempotent user creation** — wrap `CREATE USER` in a PL/SQL `DECLARE` block with `PRAGMA EXCEPTION_INIT` so re-runs don't fail if the user already exists. All other exceptions propagate normally (and exit non-zero via `WHENEVER SQLERROR EXIT`):

```sql
WHENEVER SQLERROR EXIT SQL.SQLCODE
DECLARE
  user_exists EXCEPTION;
  PRAGMA EXCEPTION_INIT(user_exists, -1920);
BEGIN
  EXECUTE IMMEDIATE q'[CREATE USER c##kbdataprotection IDENTIFIED BY "..." CONTAINER=ALL]';
EXCEPTION
  WHEN user_exists THEN NULL; -- ORA-01920: user name conflicts with an existing user or role → treated as an idempotent no-op
END;
/
```

### Forced Failure Verification

Rather than staging a complex "DB not open" scenario, verify `WHENEVER SQLERROR EXIT` propagation directly using an intentional SQL error (`SELECT 1/0`). This is simpler, stable, and definitively proves exit code propagation.

**With fix (expected: non-zero exit — ORA-01476):**

```bash
kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc "
sqlplus -S / as sysdba <<'SQL'
WHENEVER SQLERROR EXIT SQL.SQLCODE
SELECT 1/0 FROM dual;
EXIT;
SQL"
echo "Exit code: $?"
# Expected: non-zero. Linux keeps only the low 8 bits of an exit status, so
# ORA-01476 (divisor is equal to zero) surfaces as 196 (1476 mod 256), not 1476
```

**Without fix (expected: exit 0 — old silent behavior):**

```bash
kubectl exec -i -n $NAMESPACE $POD -c oracle -- bash -lc "
sqlplus -S / as sysdba <<'SQL'
SELECT 1/0 FROM dual;
EXIT;
SQL"
echo "Exit code: $?"
# Old behavior: exit 0 — SQL error silently swallowed
```

Running both together constitutes a complete verification of sqlplus error propagation. Note: this confirms that `WHENEVER SQLERROR EXIT` is wired correctly — account runtime readback (verifying the user actually exists) is still required as a separate step.
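One caveat when asserting on the exact status: exit statuses on Linux are 8 bits wide, so the value passed to `EXIT SQL.SQLCODE` reaches the shell modulo 256. The calculation below shows what the shell observes for a few Oracle error numbers; `sqlcode_exit_status` is a hypothetical helper used only for the arithmetic.

```bash
# Exit statuses on Linux are 8 bits wide, so an Oracle error number passed
# to EXIT SQL.SQLCODE is reported modulo 256.
# Usage: sqlcode_exit_status <oracle_error_number>
sqlcode_exit_status() {
  echo $(( $1 % 256 ))
}

sqlcode_exit_status 1476   # ORA-01476 -> 196
sqlcode_exit_status 1109   # ORA-01109 -> 85
sqlcode_exit_status 1024   # a multiple of 256 -> 0, indistinguishable from success
```

Tests should therefore assert "non-zero" rather than a specific ORA number. Because an error number that is a multiple of 256 would wrap to 0, `WHENEVER SQLERROR EXIT FAILURE`, which returns a fixed failure status, is worth considering where that edge case matters.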

### Lesson

- Never trust `sqlplus` (or any database CLI) to return non-zero on SQL errors without explicit configuration.
- The kbagent retry cycle masks silent failures: if a script exits 0 silently, all retries also "succeed", and the problem is only discovered downstream when a dependent operation fails.
- Always run a forced-failure path test for each lifecycle action before merge.

---

## Appendix B: Redis — redis-cli Exit Code Pitfalls

**Context:** Redis lifecycle scripts (accountProvision, memberJoin, switchover) use `redis-cli`. The `redis-cli` command does not always return non-zero on failure.

**Known cases where redis-cli exits 0 on failure:**
- `redis-cli SET` on a read-only replica: exits 0, prints `(error) READONLY`
- `redis-cli CONFIG SET` with an invalid parameter: exits 0, prints `(error) ERR`
- `redis-cli AUTH` / `-a` with wrong password: exits 0, prints `(error) WRONGPASS` — **not a reliable forced-failure trigger; must use output inspection instead**

**Required pattern — explicit output inspection:**

```bash
# Wrong: trusts exit code only
redis-cli CONFIG SET maxmemory 512mb
echo "Config applied" # may print even on failure

# Correct: inspect output
# --no-raw keeps the "(error)" prefix; without it, redis-cli switches to raw
# output when stdout is not a terminal and drops the prefix
RESULT=$(redis-cli --no-raw CONFIG SET maxmemory 512mb)
if echo "$RESULT" | grep -q "^(error)"; then
  echo "CONFIG SET failed: $RESULT" >&2
  exit 1
fi
echo "Config applied: $RESULT"
```

**Forced failure test:**

The only reliable option across all Redis versions is an unreachable endpoint:

```bash
# Unreachable endpoint: always exits non-zero (connection refused)
redis-cli -h 127.0.0.1 -p 9 PING
# Expected: Could not connect to Redis at 127.0.0.1:9: Connection refused
# Exit code: non-zero (confirmed across Redis 6.x, 7.x)
```

Do not use wrong-auth as a forced-failure trigger. As noted above, `redis-cli -a wrong PING` exits 0 even on auth failure — it is an output-inspection case, not an exit-code case.

**Note on `ACL SETUSER` syntax:** When scripting Redis ACL commands in shell, always quote the password argument to prevent shell redirection. Write `'>password'` (single-quoted) or `\>password`, not `>password`:

```bash
# Wrong: >testpass is parsed as stdout redirection by the shell
redis-cli ACL SETUSER alice on >testpass ~* +@all

# Correct: single-quote the password argument
redis-cli ACL SETUSER alice on '>testpass' ~* +@all
```

**Runtime readback for accountProvision:**

```bash
# After provisioning, verify user exists and can authenticate
redis-cli --no-auth-warning --user "$USERNAME" -a "$PASSWORD" PING
# Expected: PONG (confirms auth succeeds)

redis-cli ACL GETUSER "$USERNAME"
# Expected: non-empty response showing permissions, flags, etc.
# Empty or "nil" response means the user was never created
```
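Because `redis-cli` can exit 0 on an auth failure, the readback has to compare the reply text rather than rely on the exit status. The sketch below wraps the `PING` check above; `verify_redis_auth` and the `REDIS_CLI` override variable are hypothetical names introduced for illustration (the override exists only so the function can be exercised without a live server).

```bash
# Hypothetical readback helper. REDIS_CLI is an illustrative override hook
# (useful for tests); it defaults to the real redis-cli binary.
# Usage: verify_redis_auth <username> <password>
verify_redis_auth() {
  local user=$1 pass=$2 reply
  # Capture stderr too, so any auth warning makes the reply differ from PONG.
  reply=$("${REDIS_CLI:-redis-cli}" --no-auth-warning --user "$user" -a "$pass" PING 2>&1) || {
    echo "readback failed: redis-cli exited non-zero: $reply" >&2
    return 1
  }
  if [ "$reply" != "PONG" ]; then
    echo "readback failed: expected PONG, got: $reply" >&2
    return 1
  fi
  echo "readback ok: $user can authenticate"
}
```

Requiring an exact `PONG` means an extra warning line or an error reply both count as failures.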

**Switchover — detecting non-primary execution:**

```bash
# Before executing switchover, verify this pod is actually primary
ROLE=$(redis-cli ROLE | head -1)
if [[ "$ROLE" != "master" ]]; then
  echo "This pod is not primary (role=$ROLE), cannot initiate switchover" >&2
  exit 1
fi
```