Skip to content

docs(smoke-test-preflight): smoke test pre-flight checklist guide v1 (with §7 CoreDNS)#71

Open
weicao wants to merge 1 commit intomainfrom
james/smoke-test-preflight-v1
Open

docs(smoke-test-preflight): smoke test pre-flight checklist guide v1 (with §7 CoreDNS)#71
weicao wants to merge 1 commit intomainfrom
james/smoke-test-preflight-v1

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 5, 2026

Summary

  • Adds addon-smoke-test-pre-flight-checklist-guide.md (307 lines) — proactive 5-min "before-smoke" preflight counterpart to first-blocker / smoke result classification doctrine (PR docs: add soak test result classification guide (4-state schema + N>=3 evidence threshold) #69)
  • 7-section coverage: (1) BackupRepo precondition, (2) StorageClass per-vcluster setup, (3) ImagePullPolicy / sideload audit, (4) autopatcher daemon pattern, (5) kubeconfig isolation SOP, (6) test-runner artifact directory ready check, (7) vcluster CoreDNS image preflight (newly added)
  • One-shot 7-item preflight script
  • Case study appendix: Oracle 19c T08 ORA-12154 → CoreDNS root cause

§7 (CoreDNS) doctrine — high value

If dataprotection / cross-pod-network test cases fail at first run but cluster Running and smoke T01-T07 PASS, check coredns BEFORE addon code.

Symptom is pod-level DNS resolution failure but cluster surface looks healthy because exec-based smoke tests don't need DNS. Layer 2 archetype (vcluster substrate bootstrap precondition) — independent root cause from chart-side / runtime-env-side gaps.

Source evidence (Layer 2 case)

2026-05-05 idc4 incident:

  • Cluster o19-i4-8854 (Oracle 19c standalone) Running, smoke T01-T07 PASS, but every Backup attempt hit ORA-12154 (TNS:could not resolve)
  • Drilled into oracle pod EZ-Connect parsing: short hostname → ORA-12154, FQDN → SUCCESS, IP → SUCCESS
  • Root cause: vcluster default docker.io/coredns/coredns:1.10.1 ImagePullBackOff on idc4 (private idc can't pull docker.io)
  • Fix: kubectl set image deployment/coredns -n kube-system coredns=registry.aliyuncs.com/google_containers/coredns:1.10.1 → Running 1/1 in 9s
  • Verification: Backup o19-i4-8854-rman19c-w7verify2 Status=Completed, 553MB, 2m32s after fix

First-attempt mirror miss: apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/coredns:1.10.1 returned not found — sediments aliyuncs vs apecloud-registry mirror selection caveat.

Evidence pack: work/oracle-idc4-migration/W7-verify-evidence/ (Oracle workspace).

Cross-ref strategy

This guide is the diagnostic / 工艺锚 for §7 CoreDNS issue. Companion docs:

Bidirectional cross-refs added to both directions.

Test plan

  • markdown lint clean (no broken links, headings sequential)
  • one-shot preflight script syntactically valid bash (7-item check)
  • cross-ref to PR docs: add soak test result classification guide (4-state schema + N>=3 evidence threshold) #69 (soak-test-result-classification-guide.md) verified
  • cross-ref to PR docs(kb-schema-preflight): three-layer KB version preflight guide v1 #70 (addon-kb-schema-version-preflight-guide.md) verified
  • cross-ref to addon-vanilla-vcluster-bootstrap-guide.md verified
  • cross-ref to addon-idc-vcluster-migration-checklist-guide.md verified
  • case appendix Oracle 19c T08 + CoreDNS case complete (verification artifact: o19-i4-8854-rman19c-w7verify2 Backup)
  • Allen curator pass / placement check
  • Optional: Helen / Bob2 vcluster bootstrap viewpoint inline review (especially aliyuncs vs apecloud-registry mirror selection)

7-section preflight for KB addon smoke runs covering: (1) BackupRepo precondition,
(2) StorageClass per-vcluster setup, (3) ImagePullPolicy / sideload audit,
(4) autopatcher daemon pattern (alpine compatibility), (5) kubeconfig isolation
SOP, (6) test-runner artifact directory ready check, (7) vcluster CoreDNS image
preflight — newly added based on 2026-05-05 idc4 incident.

Section 7 doctrine: if dataprotection / cross-pod-network test cases fail at
first run but cluster Running and smoke T01-T07 PASS, check coredns BEFORE
addon code. Symptom is pod-level DNS resolution failure but cluster surface
looks healthy because exec-based smoke tests do not need DNS.

Case study appendix: Oracle 19c T08 ORA-12154 to CoreDNS root cause
investigation, image swap fix on idc4 (docker.io/coredns/coredns:1.10.1
ImagePullBackOff swapped to registry.aliyuncs.com/google_containers/coredns:1.10.1
Running 1/1 in 9s). Backup o19-i4-8854-rman19c-w7verify2 Status=Completed,
553MB, 2m32s after fix.

One-shot preflight script updated: 7-item check covering all sections + coredns
Running validation as item 7.

This guide is the proactive "before-smoke" counterpart to first-blocker / smoke
result classification doctrine (PR #69). Cross-refs to:
- addon-vanilla-vcluster-bootstrap-guide.md (autopatcher + dual-image setup)
- addon-idc-vcluster-migration-checklist-guide.md (Alice IDC checklist owner)
- addon-kb-schema-version-preflight-guide.md (schema-side preflight, PR #70)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
weicao pushed a commit that referenced this pull request May 5, 2026
…ection

Allen curator pass 4 mandatory fixes (PR #72 c472d1e2):
- Status: stable -> draft (v1) (5-field intro)
- SKILL-INDEX 文档全列表 entry added (paste-ready wording)
- Line 552/577 addon-smoke-test-pre-flight-checklist-guide.md
  marked (planned, PR #71) for forward-decl
- Lines 574/576/702 addon-idc-image-registry-mirror-guide.md
  backtick -> clickable markdown (landed doc on main)
- Line 576 addon-host-runner-job-pattern-guide.md also linkified
  (landed doc on main)

James jsonpath correction (PR #72 1c15cbb5):

Step 3.5 audit shape was wrong. DP_DB_USER / DP_DB_PASSWORD are
NOT declared in ActionSet spec.env at all -- KB dataprotection Job
runner auto-injects them based on BackupPolicy.spec.backupMethods[]
.target.account (which references systemAccount name). The W8
contract boundary lives at BackupPolicy layer, not ActionSet layer.

Corrected audit:
1. cluster-side: kubectl get backuppolicy ... .spec.backupMethods[].
   target.account vs kubectl get cmpd ... .spec.systemAccounts[].name
   diff
2. chart-side: yq diff cmpd-19c.yaml systemAccounts vs
   backuppolicytemplate.yaml backupMethods[].target.account
3. Added "Audit shape 关键澄清" paragraph documenting that
   systemAccount -> DP_DB_* contract is at BackupPolicy layer

W8 grounded form documented: addons/oracle/templates/
backuppolicytemplate.yaml line 33 references account: kbdataprotection
but cmpd-19c.yaml does not declare it -> KB never generates secret
-> CreateContainerConfigError on 19c.

Doctrine B 5-pattern grep on this commit: clean.
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 5, 2026

Curator review blockers from the 2026-05-06 doc sweep:

  1. The branch adds docs/addon-smoke-test-pre-flight-checklist-guide.md but does not add a docs/SKILL-INDEX.md entry. After merge, the missing_index sweep would fail.
  2. Commit 2e083b0 still has a tool-attribution co-author trailer in the commit body. Please amend or squash with a clean commit body before merge.
  3. Style cleanup: replace the repeated English jargon word at lines 86 / 144 / 213 / 305 with plain wording such as 做法, 规则, or 判读规则.

The PR is mergeable at the git layer, but I am not merging until these doc-quality blockers are fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant