hw-native-sys · HecreReed · May 19, 2026
diff --git a/.claude/skills/ptoas-usability-eval/README.md b/.claude/skills/ptoas-usability-eval/README.md
@@ -0,0 +1,33 @@
+# PTOAS Usability Eval Skill
+
+This is the shared Skill source directory inside the PTOAS repository.
+
+Supported client entry points:
+- Codex: `.codex/skills/ptoas-usability-eval/`
+- Cursor: `.cursor/skills/ptoas-usability-eval/`
+- Trae: `.trae/skills/ptoas-usability-eval/`
+- Claude Code: `.claude/skills/ptoas-usability-eval/`
+
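+Evaluation scenes currently covered:
+- `01 operator reproduction and deployment`
+- `02 operator migration and deployment`, PTOAS-supported subset
+- `04 basic operator functionality implementation`, PTOAS engineering subset
+- `05 performance optimization for specific shapes`, PTOAS-supported subset
+- `06 performance optimization for generalized shapes`, PTOAS-supported subset
+
+Aggregation rules currently bundled:
+- Raw metrics keep their original `10-point / 100-point` scales
+- Aggregation normalizes everything to a `100`-point scale
+- By default, output `Total score (support)` and `Total score (measured)`
+- `untested/N/A` items are excluded from the total-score denominator
+
+Current evaluation logic:
+- First select the touch-points applicable to PTOAS from the `touch-point` system
+- Then score by scene `01/02/04/05/06`
+- By default, only applicable `Core Touch-Points` count toward the repo-level total
+
+Conventions:
+- `skills/ptoas-usability-eval/` serves as the shared primary copy inside the repository
+- Each client directory provides a directly discoverable copy so that different tools work out of the box
+- When modifying the Skill content, update the four client directories above in sync
+- For `L3/L4` metrics that depend on Linux/CANN/NPU, mark them `untested` when not measured; do not give PTOAS a low score just because the current machine lacks the environment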
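A minimal sketch of the aggregation rule the README describes: raw scores keep their 10-point or 100-point scale, are normalized to 100 only when aggregated, and untested or N/A items drop out of the denominator. The `MetricScore` type and the equal-weight mean are assumptions for illustration; the actual weighting lives in `references/scoring.md`, which this diff does not show.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricScore:
    """One scored metric (hypothetical type, not part of the Skill)."""
    name: str
    raw: Optional[float]  # None stands for an untested or N/A item
    scale: int            # original scale of the metric: 10 or 100


def normalize(score: MetricScore) -> Optional[float]:
    """Map a raw score onto the 100-point scale; keep None as None."""
    if score.raw is None:
        return None
    return score.raw * 100.0 / score.scale


def total(scores: List[MetricScore]) -> Optional[float]:
    """Equal-weight mean of normalized scores (assumed weighting).

    Untested / N/A items are skipped, so they never enter the denominator.
    """
    counted = [v for v in map(normalize, scores) if v is not None]
    if not counted:
        return None
    return sum(counted) / len(counted)
```

With one 10-point metric scored 8, one 100-point metric scored 60, and one untested metric, the total is the mean of 80 and 60, not of three values.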
diff --git a/.claude/skills/ptoas-usability-eval/SKILL.md b/.claude/skills/ptoas-usability-eval/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: ptoas-usability-eval
+description: Evaluate PTOAS repository usability across scene 01 as the primary template plus the PTOAS-supported subsets of scenes 02, 04, 05, and 06. Always classify the evaluation by environment layer first, use only repo-native docs/scripts/samples/CI as primary evidence, keep the user's mixed 10-point and 100-point scoring rules, compute normalized support and measured totals, and mark unsupported or untested dimensions as untested or N/A.
+---
+
+# PTOAS Usability Eval
+
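+Use this Skill when the user wants to evaluate the usability of `hw-native-sys/PTOAS`, or to score PTOAS by scenes `01/02/04/05/06`.
+
+## Default scope
+
+- `01 operator reproduction and deployment`: the primary evaluation scene, scored normally.
+- `02 operator migration and deployment`: included, but only scores the migration entry points, samples, IR/scripts, and compile/validation chain that the PTOAS repository directly supports.
+- `04 basic operator functionality implementation`: included, but only scores the example, compile, validation, and feedback sub-chains that PTOAS directly covers.
+- `05 performance optimization for specific shapes`: included, but only scores PTOAS's support through docs, samples, performance-data entry points, compile validation, and accuracy/performance validation.
+- `06 performance optimization for generalized shapes`: included, but only scores PTOAS's dynamic/valid-shape support, multi-shape samples, and validation support.
+- `03 builtin operator customization`: not quantified by default and marked `N/A`; describe the gap only when necessary.
+
+First read [references/touchpoint-selection.md](references/touchpoint-selection.md) to select the applicable touch-points, then read [references/scope.md](references/scope.md) to confirm each scene's boundaries and the `untested/N/A` rules.
+
+## Determine the layer first
+
+Before scoring, declare which layer this evaluation reaches. Without a declared layer, do not mix scores across layers.
+
+Available layers:
+- `L1 documentation review layer`: only reads repo docs, scripts, samples, and CI; nothing is run.
+- `L2 local minimal run layer`: the current machine already has `ptoas` / `ptobc` / Python bindings and can run minimal command checks.
+- `L3 Linux compile-only layer`: requires Linux + CANN/bisheng + `PTO_ISA_ROOT`; no NPU card is required.
+- `L4 on-NPU layer`: requires Linux with an NPU card, drivers, permissions, `/dev/davinci*`, and the matching user group.
+
+Constraints:
+- If a layer was not reached, record that layer's metrics as `untested`; do not give PTOAS a low score just because the current machine lacks the environment.
+- `bisheng` / CANN compile-only generally belongs to `L3` and should not be forced into a low score on a local Mac.
+- On-card execution, device permissions, drivers, ACL, and user groups belong to `L4`.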
+
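+## Evidence sources
+
+Prefer in-repo evidence only; do not treat experience from outside the repository as primary evidence. Fixed entry points are listed in [references/evidence-checklist.md](references/evidence-checklist.md).
+
+High-priority evidence: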
+- `README.md`
+- `docs/no_npu_compile_only_guide_zh.md`
+- `docs/PTO_IR_manual.md`
+- `test/samples/runop.sh`
+- `test/npu_validation/scripts/generate_testcase.py`
+- `test/npu_validation/scripts/run_remote_npu_validation.sh`
+- `test/samples/PyPTOIRParser/README.md`
+- `test/samples/FlashAttention/`, `test/samples/GQA/`, `test/samples/FFN/`
+- `test/samples/SetValidShape/`, `test/samples/LayoutInference/`, `test/samples/Partition5D/`, `test/samples/planmemory/`
+- `.github/workflows/ci.yml`
+- `.github/ISSUE_TEMPLATE/performance_issue.yml`
+
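+## Workflow
+
+1. First determine which of scenes `01`, `02`, `04`, `05`, `06` the user wants; default to `01` when unspecified.
+2. Then determine the layer covered this time: `L1/L2/L3/L4`. State it explicitly in the output.
+3. Read [references/touchpoint-selection.md](references/touchpoint-selection.md) first, and select this run's `Core / Conditional / Excluded` touch-points per scene.
+4. Collect evidence from inside the repository, recording each search round, document jump count, executed command, elapsed time, and success/failure result.
+5. For scene `01`, read [references/metrics-01.md](references/metrics-01.md).
+6. For scene `02`, read [references/metrics-02.md](references/metrics-02.md).
+7. For scene `04`, read [references/metrics-04.md](references/metrics-04.md).
+8. For scene `05`, read [references/metrics-05.md](references/metrics-05.md).
+9. For scene `06`, read [references/metrics-06.md](references/metrics-06.md).
+10. When a total score is needed, read [references/scoring.md](references/scoring.md).
+11. For every metric, output: raw observation, score, evidence path, and explanation. Do not guess data that was not measured; record it as `untested` or `N/A`.
+12. Clearly distinguish:
+    - Capabilities the PTOAS repository already provides
+    - External prerequisites, such as LLVM, CANN, `pto-isa`, NPU, drivers/permissions, and business baselines
+13. If the documentation conflicts with actual execution, trust the actual command results and point out where the conflict is.
+14. By default give two totals: `Total score (support)` and `Total score (measured)`. If the user only wants per-item scores, the totals are not required.
+
+## Measurement rules
+
+- Keep the user's original scoring basis; do not override each metric's original scale.
+- When aggregating into a total score, however, normalize according to [references/scoring.md](references/scoring.md).
+- `search rounds`: each new targeted search or locating attempt counts as 1 round.
+- `document jump count`: after hitting the first target document, each hop to another document/README/script entry counts as 1 jump.
+- `elapsed time`: record real wall-clock time where possible; if unavailable, write `untested` rather than guessing.
+- `success rate`: computed only from results actually executed or actually located in the current task.
+- `untested`: the current session did not reach the corresponding environment layer, or that layer's prerequisites are absent, or a before/after comparison baseline is missing.
+- `N/A`: used only for items beyond PTOAS's capability boundary, or items the current task explicitly excludes from this evaluation.
+
+## Output format
+
+Output in the following order:
+
+1. `Evaluation scope`
+2. `Touch-point selection`
+3. `Evaluation layer`
+4. `Total score (support)`
+5. `Total score (measured)`
+6. `Per-scene scores`
+7. `Coverage notes`
+8. `Key evidence`
+9. `Main weaknesses`
+10. `Recommended actions`
+
+If the user only wants a brief conclusion, still keep at least: scene classification, evaluation layer, overall rating, lowest-scoring items, and evidence paths.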
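The layer gating in SKILL.md can be sketched as a small check: a metric whose scorable layers start above the layer the evaluation reached is reported as untested rather than scored low. The helper names are hypothetical, and the sketch assumes layers are cumulative (reaching `L3` implies `L1` and `L2` were available), which the Skill text does not state explicitly.

```python
from typing import Set


def parse_layers(spec: str) -> Set[int]:
    """Parse a layer spec such as 'L2-L4' or 'L3' into layer numbers."""
    spec = spec.strip()
    if "-" in spec:
        low, high = spec.split("-")
        return set(range(int(low.lstrip("L")), int(high.lstrip("L")) + 1))
    return {int(spec.lstrip("L"))}


def metric_status(scorable: str, reached: int) -> str:
    """Return 'scorable' when the reached layer covers the metric's lowest layer.

    Otherwise return 'untested', so a missing environment never becomes
    a low score.
    """
    lowest_required = min(parse_layers(scorable))
    return "scorable" if reached >= lowest_required else "untested"
```

For example, a download-time metric marked `L3-L4` stays untested in an `L2` evaluation, while a documentation metric marked `L1-L4` is scorable at every layer.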
diff --git a/.claude/skills/ptoas-usability-eval/agents/openai.yaml b/.claude/skills/ptoas-usability-eval/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "PTOAS Usability Eval"
+  short_description: "Evaluate PTOAS usability across scenes 01/02/04/05/06"
+  default_prompt: "Use $ptoas-usability-eval to evaluate this PTOAS repo for scene 01 operator reproduction/deployment and the PTOAS-supported subsets of scenes 02, 04, 05, and 06."
diff --git a/.claude/skills/ptoas-usability-eval/references/evidence-checklist.md b/.claude/skills/ptoas-usability-eval/references/evidence-checklist.md
@@ -0,0 +1,86 @@
+# Evidence checklist
+
+## Fixed evidence entry points
+
+| Path | Main purpose | Scenes | Layers | Main touch-points |
+| --- | --- | --- | --- | --- |
+| `README.md` | Official build, environment variables, CLI, Python bindings, sample runs; main entry for compile-only/on-NPU validation | `01`, `02`, `04`, `05`, `06` | `L1-L4` | `TP001-005, TP016-020, TP038, TP049, TP053, TP058` |
+| `docs/no_npu_compile_only_guide_zh.md` | No-card compile-only flow, batch validation flow, `pto-isa`/CANN dependency notes | `01`, `02`, `04`, `05`, `06` | `L1`, `L3` | `TP005, TP017-018, TP049, TP051-052, TP058-060, TP062-064` |
+| `docs/PTO_IR_manual.md` | IR levels, tile/view/valid-shape, layout, dynamic shape, Level-2/3 semantics | `02`, `04`, `05`, `06` | `L1-L4` | `TP019-020, TP028, TP032, TP039, TP045-047` |
+| `test/samples/runop.sh` | Batch sample generation, `ptoas`/`ptobc` runs, A3/A5 default parameter policy | `01`, `02`, `04`, `05`, `06` | `L1-L4` | `TP018, TP034-039, TP049, TP051-052, TP062-064` |
+| `test/npu_validation/scripts/generate_testcase.py` | Generates a validation project from `*-pto.cpp`; use it to inspect golden/compare/compatibility-layer handling | `01`, `02`, `04`, `05`, `06` | `L1`, `L3`, `L4` | `TP037-039, TP045-047, TP049, TP052` |
+| `test/npu_validation/scripts/run_remote_npu_validation.sh` | compile-only / sim / npu run chain, log format, device and `pto-isa` checks | `01`, `02`, `04`, `05`, `06` | `L1`, `L3`, `L4` | `TP049, TP051-052, TP058-060, TP062-064` |
+| `test/samples/PyPTOIRParser/README.md` | Notes on the vendored `.pto` snapshot from pypto `ir_parser` | `02` | `L1` | `TP035-037, TP039, TP044, TP046-047` |
+| `test/samples/MatMul/` | Baseline sample referenced directly by the README; suitable as the default reproduction template for `01` | `01`, `04`, `05` | `L1-L4` | `TP018, TP034-039, TP042-047` |
+| `test/samples/FlashAttention/` | Specific-shape performance sample | `05` | `L1-L4` | `TP035-041, TP044, TP046` |
+| `test/samples/GQA/` | Specific-shape / attention-related sample | `05` | `L1-L4` | `TP035-041, TP044, TP046` |
+| `test/samples/FFN/` | Specific-shape / operator-composition sample | `05` | `L1-L4` | `TP035-041, TP044, TP046` |
+| `test/samples/SetValidShape/` | dynamic/valid-shape related sample | `06` | `L1-L4` | `TP035-039, TP044, TP046-047` |
+| `test/samples/LayoutInference/` | Layout-inference related sample | `06` | `L1-L4` | `TP019-020, TP035-039, TP046-047` |
+| `test/samples/Partition5D/` | Multi-dimensional partition / shape-generalization related sample | `02`, `06` | `L1-L4` | `TP035-039, TP044, TP046-047, TP061` |
+| `test/samples/planmemory/` | alias/planmemory/shape related sample | `06` | `L1-L4` | `TP035-039, TP044, TP046-047` |
+| `.github/workflows/ci.yml` | Reference configuration for the LLVM/PTOAS build, lit, sample test, and remote validation in CI | `01`, `02`, `04`, `05`, `06` | `L1`, `L3`, `L4` | `TP003, TP011, TP049, TP052-053, TP058, TP060-064` |
+| `.github/ISSUE_TEMPLATE/performance_issue.yml` | Performance issue intake template; usable for evaluating the completeness of performance data/reproduction requirements | `05`, `06` | `L1` | `TP005, TP017, TP040-041, TP062-063` |
+
+Notes:
+- Do not treat sample READMEs that do not exist on the current branch as fixed evidence sources.
+- For `02/05/06`, without a before/after comparison baseline, do not force a calculation of migration completeness or performance improvement.
+
+## Recommended search order
+
+1. `README.md`
+2. `docs/no_npu_compile_only_guide_zh.md`
+3. `docs/PTO_IR_manual.md`
+4. `test/samples/MatMul/` or a user-specified sample directory
+5. `test/samples/PyPTOIRParser/`, `FlashAttention/`, `GQA/`, `FFN/`, `SetValidShape/`, `LayoutInference/`, `Partition5D/`, `planmemory/`
+6. `test/samples/runop.sh`
+7. `test/npu_validation/scripts/*.py` / `*.sh`
+8. `.github/workflows/ci.yml`
+9. `.github/ISSUE_TEMPLATE/performance_issue.yml`
+
+## Recommended search commands
+
+```bash
+rg -n "构建|运行测试|compile-only|runop|generate_testcase|run_remote_npu_validation|level3" README.md docs test .github
+rg -n "valid_shape|layout|partition|reshape|dynamic shape|Level-2|Level-3" docs/PTO_IR_manual.md docs test
+rg -n "FlashAttention|GQA|FFN|MatMul|SetValidShape|LayoutInference|Partition5D|planmemory" test .github
+rg --files test/samples
+find test/samples -maxdepth 2 -type f \( -name '*.py' -o -name '*.pto' -o -name 'README.md' \)
+```
+
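+## Additional records for migration / performance scenes
+
+When evaluating `02/05/06`, additionally record:
+- Whether pre-migration/post-migration comparison artifacts exist
+- Whether a performance baseline exists
+- The baseline source path
+- Whether a physical NPU is required
+- Which layer the evaluation currently stops at
+- Which scores are measured and which are documentation-side support scores
+
+## Recording requirements
+
+Each scored item must record at least these evidence fields:
+
+- `evidence path`
+- `search/execution command`
+- `search rounds`
+- `document jump count`
+- `evaluation layer`
+- `elapsed time`
+- `result`
+- `score`
+- `notes`
+
+## Default samples
+
+If the user has not specified a concrete operator or sample, prefer: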
+
+- `test/samples/MatMul/tmatmulk.py`
+- `test/samples/MatMul/tmatmulk.pto`
+- `test/samples/Addc/addc.py`
+- `test/samples/PyPTOIRParser/`
+- `test/samples/FlashAttention/`
+- `test/samples/SetValidShape/`
+
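+Rationale: these paths are either referenced directly by `README.md` or strongly related to migration/performance/shape-generalization evaluation, and can be found reliably on the current mainline.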
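The evidence fields required by the checklist above map naturally onto a small record type. This is a sketch only: `EvidenceRecord` and `to_row` are hypothetical names, and rendering each record as a Markdown table row is an assumed output shape, not something the Skill prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    """One scored item with the evidence fields listed in the checklist."""
    evidence_path: str
    command: str
    search_rounds: int
    doc_jumps: int
    layer: str
    elapsed_seconds: Optional[float]  # None when wall-clock time was not measured
    result: str
    score: Optional[float]            # None when the item is untested
    notes: str = ""

    def to_row(self) -> str:
        """Render the record as a Markdown table row; unmeasured values read 'untested'."""
        elapsed = "untested" if self.elapsed_seconds is None else f"{self.elapsed_seconds:.0f}s"
        score = "untested" if self.score is None else f"{self.score:g}"
        cells = [
            self.evidence_path, self.command, str(self.search_rounds),
            str(self.doc_jumps), self.layer, elapsed, self.result, score, self.notes,
        ]
        return "| " + " | ".join(cells) + " |"
```

Keeping unmeasured time and score as `None` rather than `0` makes it hard to mistake a missing measurement for a poor one.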
diff --git a/.claude/skills/ptoas-usability-eval/references/metrics-01.md b/.claude/skills/ptoas-usability-eval/references/metrics-01.md
@@ -0,0 +1,119 @@
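+# Metrics for `01 operator reproduction and deployment`
+
+## Default touch-points for this scene
+
+- `Materials/docs`: `TP001-005, TP008-018`
+- `Source code & examples`: `TP034-035, TP038, TP042, TP044`
+- `Tools`: `TP049, TP051-052`
+- `Versions`: `TP053, TP058`
+- `Runtime feedback`: `TP062-064`
+- `Conditional`: `TP006, TP054, TP057, TP059-061`
+
+By default, treat PTOAS as a toolchain repository that "starts from a sample or `.pto` input and completes build, compile, and compile-only or on-NPU validation".
+
+Note: keep the user's original scoring basis; the `10-point` and `100-point` scales coexist, and per-metric scores are not normalized (normalization applies only when aggregating totals).
+
+## Declare the layer first
+
+In scene `01`, first declare the layer this evaluation covers:
+
+- `L1 documentation review layer`
+- `L2 local minimal run layer`
+- `L3 Linux compile-only layer`
+- `L4 on-NPU layer`
+
+If the corresponding layer was not reached, mark the item `untested`; do not treat a missing environment as a negative score for PTOAS.
+
+## Default task template
+
+If the user has not specified a concrete case, prefer `test/samples/MatMul/` as the reproduction template: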
+
+```bash
+python3 test/samples/MatMul/tmatmulk.py > /tmp/tmatmulk.pto
+./build/tools/ptoas/ptoas /tmp/tmatmulk.pto -o /tmp/tmatmulk.cpp
+```
+
+When no-card compile-only is needed, continue with:
+
+```bash
+python3 test/npu_validation/scripts/generate_testcase.py \
+  --input /tmp/tmatmulk.cpp \
+  --run-mode npu \
+  --soc-version Ascend910
+```
+
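+## Easy to learn
+
+### Obtaining documentation
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Search hit success rate | Count the search rounds needed to find the "build + sample run + compile-only/on-NPU validation" entry points | `README.md`, `docs/no_npu_compile_only_guide_zh.md` | `L1-L4` | 1 round `10`; 2 rounds `8`; 3 rounds `6`; 4-5 rounds `4`; >5 rounds `2` |
+| Doc search time | Time from starting to look for docs to locating the correct path | Same as above | `L1-L4` | `2 minutes=10`; deduct `1` point for each additional `1` minute |
+| Doc retrieval success rate | Whether all target documents can be found inside the repository | Same as above | `L1-L4` | `100%=10`, scaled proportionally |
+
+### Learning from documentation
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Doc error density | When following the docs, check whether commands, paths, versions, links, or explanations are wrong | `README.md`, `docs/no_npu_compile_only_guide_zh.md`, `.github/workflows/ci.yml` | `L1-L4` | Zero errors `100`; a few minor errors `80`; moderate errors `60`; obvious errors `40`; severely unusable `20` |
+| Document jump count | Number of cross-document jumps from the first hit document to assembling the complete execution path | `README.md -> docs/... -> test/samples/...` | `L1-L4` | 1 jump `100`; 2 jumps `80`; 3 jumps `60`; 4-5 jumps `40`; >5 jumps `20` |
+| Content coverage gap rate | Whether the prerequisites for local run, Linux compile-only, on-NPU, and other layers are clearly distinguished | `README.md`, `docs/...`, `.github/workflows/ci.yml` | `L1-L4` | No gaps `100`; `<=5%` `80`; `5%-15%` `60`; `15%-30%` `40`; `>30%` `20` |
+| Doc learning time | Time from starting to read to being able to give an execution plan | Same as above | `L1-L4` | `5 minutes=10`; deduct `1` point for each additional `1` minute |
+| Doc-based problem resolution rate | Of the problems that need to be solved through docs, how many the docs actually answer | Same as above | `L1-L4` | `100%=10`, scaled proportionally |
+
+## Easy to deploy
+
+### Environment download
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Environment acquisition scenario coverage | Check whether the repo documents the environments needed for source build, no-card compile-only, and on-NPU validation | `README.md`, `docs/no_npu_compile_only_guide_zh.md` | `L1-L4` | `100%=10`; `80%-99%=8`; `60%-79%=6`; `40%-59%=4`; `<40%=2` |
+| Environment download time | Actual time to download LLVM/CANN/Python dependencies/`pto-isa` | Actual execution record | `L3-L4` | `<=5 minutes=10`; `5-7=8`; `7-9=6`; `9-11=4`; `>11=2` |
+| Environment download success rate | Whether the download steps complete in one pass | Actual execution record | `L3-L4` | `100%=10`, scaled proportionally |
+
+Note: if the current task did not actually download dependencies in a Linux/CANN environment, write `untested` for the last two items.
+
+### Environment installation
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Hardware/software compatibility rate | Check whether the repo specifies the LLVM version, Python package versions, and CANN / `pto-isa` dependency relationships | `README.md`, `docs/no_npu_compile_only_guide_zh.md`, `ci.yml` | `L1-L4` | `100%=10`; `80%-99%=8`; `60%-79%=6`; `40%-59%=4`; `<40%=2` |
+| Deployment time | Time from starting installation to a runnable `ptoas`/Python binding | Actual execution record | `L2-L4` | `<=5 minutes=10`; `5-7=8`; `7-9=6`; `9-11=4`; `>11=2` |
+| Number of steps | Number of commands/steps needed to complete installation | `README.md` build steps | `L1-L4` | `<=8 steps=10`; `9-10=8`; `11-12=6`; `13-14=4`; `>14=2` |
+| Environment installation success rate | Whether the LLVM + PTOAS build and install can be completed | Actual execution record | `L2-L4` | `100%=10`, scaled proportionally |
+
+### Environment verification
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Post-install verification success rate | Whether there are explicit verification actions, e.g. `ptoas --version`, Python import, sample compile | `README.md`, `ci.yml` | `L2-L4` | `100%=10`, scaled proportionally |
+
+Recommended minimal verification commands: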
+
+```bash
+./build/tools/ptoas/ptoas --version
+python3 -c "from mlir.dialects import pto; print('ok')"
+```
+
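+## Easy to develop, easy to evolve
+
+### Obtaining example code
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Search hit success rate | Count the rounds needed to locate the default sample directory | `test/samples/MatMul/`, `test/samples/Addc/`, `README.md` | `L1-L4` | 1 round `10`; 2 rounds `8`; 3 rounds `6`; 4-5 rounds `4`; >5 rounds `2` |
+
+### Running example code
+
+| Metric | How to measure in PTOAS | Main evidence | Scorable layers | Scoring rule |
+| --- | --- | --- | --- | --- |
+| Example run time | Time from running the sample to obtaining the `.pto` / `.cpp` / compile-only result | `README.md`, `test/samples/MatMul/` | `L2-L4` | `<=2 minutes=10`; `2-4=8`; `4-6=6`; `6-8=4`; `>8=2` |
+| Example run success rate | Whether `py -> .pto -> .cpp` or compile-only / validation succeeds | Same as above | `L2-L4` | `100%=10`, scaled proportionally |
+
+## When to record `untested` / `N/A`
+
+- `untested`: the current task did not reach the corresponding environment layer, e.g. evaluating compile-only on a local Mac without Linux/CANN/bisheng.
+- `N/A`: the item is beyond PTOAS's capability boundary, or the current task explicitly excludes it.
+
+Do not confuse the two.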
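Two of the scoring rules in `metrics-01.md` can be sketched as plain functions: the banded search-round rule (1 round `10`, 2 rounds `8`, 3 rounds `6`, 4-5 rounds `4`, more than 5 rounds `2`) and the doc-search-time rule (`10` at 2 minutes, minus 1 per additional minute). The function names are hypothetical. Rounding partial minutes up and flooring the score at 0 are assumptions, since the table states neither.

```python
import math


def rounds_score(rounds: int) -> int:
    """Banded 10-point score for the number of search rounds."""
    if rounds < 1:
        raise ValueError("at least one search round is required")
    if rounds == 1:
        return 10
    if rounds == 2:
        return 8
    if rounds == 3:
        return 6
    if rounds <= 5:
        return 4
    return 2


def search_time_score(minutes: float, full_marks_at: float = 2.0) -> int:
    """10 points up to `full_marks_at` minutes, minus 1 per additional minute.

    Assumptions: a started minute counts as a full minute, and the score
    never drops below 0.
    """
    extra_minutes = max(0.0, minutes - full_marks_at)
    return max(0, 10 - math.ceil(extra_minutes))
```

The doc-learning-time rule in the same file has the same shape with `full_marks_at=5.0`.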