<feature>[gpu]: add GPU XID error event alarm#4019

Open
zstack-robot-2 wants to merge 1 commit into
feature-5.5.22-aios from
sync/xinhao.huang/fix/ZSTAC-85055@@2

Conversation

@zstack-robot-2
Collaborator

Summary

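  • Adds a GPU XID error event alarm that detects NVIDIA XID errors by parsing the dmesg kernel log
  • Adds the HOST_PHYSICAL_GPU_XID_ERROR CanonicalEvent path and data class
  • Adds the GPU_XID enum value to HostHardware; KVMHostFactory routes it to the extension point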
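
For readers unfamiliar with XID log lines, here is a minimal sketch of the kind of dmesg parsing the summary refers to. It is not code from this PR: the class `XidLogParser`, its `XidRecord` type, and the regular expression are hypothetical, and the sample line assumes the NVIDIA driver's usual `NVRM: Xid (PCI:<address>): <code>, <text>` layout. The PR description does not show where the actual parsing runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of this PR): pulls XID records out of kernel log lines.
class XidLogParser {
    // Assumed line layout, e.g.
    //   [ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.
    private static final Pattern XID_LINE =
            Pattern.compile("NVRM: Xid \\(PCI:([0-9a-fA-F:.]+)\\): (\\d+),\\s*(.*)");

    // Field names mirror the data classes described in the PR; the type itself is illustrative.
    static final class XidRecord {
        final String pciDeviceAddress;
        final int xidCode;
        final String message;

        XidRecord(String pciDeviceAddress, int xidCode, String message) {
            this.pciDeviceAddress = pciDeviceAddress;
            this.xidCode = xidCode;
            this.message = message;
        }
    }

    // Returns one record per line that contains an XID report; other lines are ignored.
    static List<XidRecord> parse(List<String> lines) {
        List<XidRecord> records = new ArrayList<>();
        for (String line : lines) {
            // find() rather than matches(), because dmesg prefixes each line with a timestamp.
            Matcher m = XID_LINE.matcher(line);
            if (m.find()) {
                records.add(new XidRecord(m.group(1), Integer.parseInt(m.group(2)), m.group(3)));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
                "[ 1235.000000] usb 1-1: new high-speed USB device");
        for (XidRecord r : parse(sample)) {
            System.out.println(r.pciDeviceAddress + " xid=" + r.xidCode + " " + r.message);
        }
    }
}
```

Each parsed record carries the three values that the new event data classes expose alongside the host or VM UUID: the PCI device address, the XID code, and the message.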

Resolves: ZSTAC-85055

sync from gitlab !9913

@coderabbitai

coderabbitai Bot commented May 19, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: caaa918d-d769-40a1-b716-8daba9e94699

📥 Commits

Reviewing files that changed from the base of the PR and between 48a1af3 and 4505935.

📒 Files selected for processing (6)
  • header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java
  • header/src/main/java/org/zstack/header/host/HostHardware.java
  • header/src/main/java/org/zstack/header/vm/VmCanonicalEvents.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
🚧 Files skipped from review as they are similar to previous changes (3)
  • header/src/main/java/org/zstack/header/host/HostHardware.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
  • header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java

Walkthrough

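Adds a GPU XID hardware type together with host-level and VM-level event constants and data classes, adds a KVM agent DTO and path constant, and adds a GPU_XID branch to the KVM host event dispatch that invokes the extension point handlers.

Changes

GPU XID hardware event support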

| Layer / File(s) | Summary |
| --- | --- |
| Hardware type and event contracts: header/src/main/java/org/zstack/header/host/HostHardware.java, header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java, header/src/main/java/org/zstack/header/vm/VmCanonicalEvents.java | Adds HostHardware.GPU_XID; adds the HOST_PHYSICAL_GPU_XID_ERROR constant and HostPhysicalGpuXidErrorData (hostUuid, pcideviceAddress, xidCode, message) to HostCanonicalEvents; adds the VM_GPU_XID_ERROR constant and VmGpuXidErrorData (vmUuid, pciDeviceAddress, xidCode, message) to VmCanonicalEvents. |
| Agent DTO and KVM constants: plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java, plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java | Adds KVMAgentCommands.VmEventAlarmCmd (hostUuid, vmUuid, eventType, properties) and the KVMConstant.HOST_VM_EVENT_ALARM path constant. |
| Event dispatch integration: plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java | Adds a HostHardware.GPU_XID branch to the switch that handles physical hardware status alarm events; the branch iterates over the KvmHardwareStatusHandlerExtensionPoint implementations and calls handleKvmHardwareStatus with HostHardware.GPU_XID. |
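
The dispatch step in the last row can be pictured with the sketch below. It uses simplified stand-in types so that it compiles on its own: the nested `HostHardware` enum, the placeholder constant `OTHER`, the extension point's parameter list, and the `register`/`dispatch` methods are all assumptions made for illustration, not the signatures in KVMHostFactory.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the real logic lives in KVMHostFactory and may differ.
class GpuXidDispatchSketch {
    // Stand-in for org.zstack.header.host.HostHardware. Only GPU_XID is taken from
    // the PR; OTHER is a placeholder for the existing hardware types.
    enum HostHardware { OTHER, GPU_XID }

    // Stand-in for KvmHardwareStatusHandlerExtensionPoint. The parameters are assumed.
    interface KvmHardwareStatusHandlerExtensionPoint {
        void handleKvmHardwareStatus(HostHardware type, String hostUuid, String detail);
    }

    private final List<KvmHardwareStatusHandlerExtensionPoint> extensions = new ArrayList<>();

    // Hypothetical registration hook; the real code obtains extensions from its plugin registry.
    void register(KvmHardwareStatusHandlerExtensionPoint ext) {
        extensions.add(ext);
    }

    // Routes a hardware status alarm; returns true when the type was handled here.
    boolean dispatch(HostHardware type, String hostUuid, String detail) {
        switch (type) {
            case GPU_XID:
                // Hand the alarm to every registered handler, tagged as GPU_XID.
                for (KvmHardwareStatusHandlerExtensionPoint ext : extensions) {
                    ext.handleKvmHardwareStatus(HostHardware.GPU_XID, hostUuid, detail);
                }
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        GpuXidDispatchSketch sketch = new GpuXidDispatchSketch();
        sketch.register((type, hostUuid, detail) ->
                System.out.println(type + " on " + hostUuid + ": " + detail));
        sketch.dispatch(HostHardware.GPU_XID, "host-1", "xid 79");
    }
}
```

Routing through an extension point keeps the factory unaware of what a handler does with the alarm, so the alarm behaviour can live in a separate module.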

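Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🐰 GPU error codes come flying in,
XID tracking has been added,
enums and event contracts are settled,
the agent and dispatch are wired up,
and the rabbit applauds for monitoring!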

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
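| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and accurately summarizes the main change, adding a GPU XID error event alarm, and is fully relevant to the code changes. |
| Description check | ✅ Passed | The pull request description is closely related to the changeset and details the new GPU XID error event alarm, the new CanonicalEvent path, the GPU_XID enum value, and the routing to the extension point. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |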

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/xinhao.huang/fix/ZSTAC-85055@@2

Comment @coderabbitai help to get the list of available commands and usage tips.

Resolves: ZSTAC-85055

Change-Id: Ifc3701d5052af98f6f76054890acd4d27edfb90d
@MatheMatrix MatheMatrix force-pushed the sync/xinhao.huang/fix/ZSTAC-85055@@2 branch from 48a1af3 to 4505935 Compare May 19, 2026 10:36