add Chinese documentation

4b893124 · whlwhlwhl · bb293875 · 4b893124
Commit 4b893124 authored May 21, 2026 by whlwhlwhl

Showing with 563 additions and 0 deletions

docs/lightop-skills.zh-CN.md +563 -0

--- a/docs/lightop-skills.zh-CN.md
+++ b/docs/lightop-skills.zh-CN.md
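+# LightOp Skills: Chinese Reading Edition
+
+This document is the Chinese reading edition of three skills, `lightop-kernel-agent-loop`,
+`lightop-kernel-knowledge`, and `dcu-profiler-report`. It is meant to help people understand
+them and write prompts.
+
+The files that are actually installed and triggered for Claude/Codex are still the `SKILL.md`
+files in each skill's own directory. This document does not replace `SKILL.md`; it only
+translates the rules in those files and reorganizes them for easier reading.
+
+## What Each of the Three Skills Does
+
+| Skill | When to use it | Core output |
+| --- | --- | --- |
+| `lightop-kernel-agent-loop` | Having an agent add or optimize a DCU/ROCm operator in LightOp | Plan, source changes, build, correctness tests, benchmark, profile, tuning records |
+| `lightop-kernel-knowledge` | You need local LightOp patterns, ROCm/DCU upstream evidence, or transferable optimization ideas | Research digest stating where the evidence comes from and which changes it can guide |
+| `dcu-profiler-report` | The benchmark alone cannot explain the bottleneck and the next change needs profiler evidence | Profile digest that must end in one concrete next code change |
+
+When an agent does operator work on LightOp, installing these three skills is usually enough.
+The other commands in the `humanize` directory are tools for the underlying loop, plan
+generation, review gates, and external-model assistance; they are not the main entry points to
+invoke by hand each time.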
+
+## lightop-kernel-agent-loop
+
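+This is the main-loop skill. It should be triggered when the user says something like "add an
+operator to LightOp" or "optimize a LightOp kernel".
+
+It is not meant for generic NVIDIA/CUDA kernel tuning. The default direction is
+DCU/ROCm/HIP/DTK, and existing LightOp conventions take priority. Optional routes include
+MIOpen, rocBLAS, hipBLASLt, Composable Kernel, and Triton AMD, but only if the current
+repository already supports them or the user explicitly asks for them.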
+
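+### Input Contract K/R/W/E
+
+Before starting work, the agent needs to recover or define four kinds of information:
+
+- `K`: operator semantics, including input and output tensors, dtype, layout, mode, epsilon,
+  and similar constraints.
+- `R`: correctness reference, usually the native PyTorch implementation, an existing LightOp
+  path, or a small hand-written oracle.
+- `W`: workload distribution, including target shapes, dtype, model scenario, gfx arch,
+  contiguity, reduction axis, bandwidth formula, and latency or throughput targets.
+- `E`: execution environment, including Docker container or host, the LightOp path inside the
+  container, visible DCUs, DTK/ROCm/PyTorch versions, build/test/benchmark/profile commands,
+  and pass thresholds.
+
+If this information is missing but can be safely inferred from existing LightOp tests,
+benchmarks, or config tables, the agent should infer it first. It should ask the user only
+when safe inference is not possible.
+
+### Locating LightOp
+
+A LightOp root directory usually contains: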
+
+```text
+setup.py
+lightop/__init__.py
+lightop/csrc/export.cpp
+test/
+```
+
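+The search order is:
+
+1. A path the user gives explicitly.
+2. The current working directory.
+3. A sibling directory named `lightop` near the current workspace.
+
+If no LightOp checkout is found, the agent needs to ask the user for the path.
+
+### Execution Environment
+
+Build, correctness test, benchmark, and profiling must all run in the same environment. When
+the user specifies a Docker container, Docker is part of the acceptance contract. Typical
+command format: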
+
+```bash
+docker exec <container> bash -lc 'cd /path/in/container/lightop && <command>'
+```
+
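+Record the following before starting work:
+
+- Container name or image name; record `direct-host` when Docker is not used.
+- The LightOp path as seen by the commands, that is, the path inside the container and not
+  just the host path.
+- `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
+- `hy-smi` or `rocm-smi` output before performance tests, the idle card selected, HCU
+  utilization, and memory usage.
+- `PYTORCH_ROCM_ARCH`, DTK/ROCm/PyTorch/HIP versions, device name, and `gcnArchName`.
+- Exact commands for build, import smoke, correctness, benchmark, and profiler.
+- Correctness tolerances and performance targets.
+- Benchmark stability rules, including warmup, repeat, the statistic used, and the noise band.
+
+Do not mix host builds with in-container tests unless the user explicitly asks for it and both
+paths really point to the same compiled extension.
+
+### LightOp Integration Points
+
+When adding or modifying an operator, first inspect the nearest operator family and follow its
+style. Common locations: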
+
+```text
+lightop/csrc/<family>/*.cu|*.cuh|*.h|*.cpp
+lightop/csrc/export.cpp
+lightop/<python_wrapper>.py
+lightop/__init__.py
+setup.py
+test/test_<op>.py
+test/<family>/*benchmark*.py
+lightop/config*.py
+```
+
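+When adding a fused operator, the agent should first search for the unfused single operators
+and related fused implementations, and use the local LightOp paths as the main baseline for
+API, validation, benchmark, and performance expectations. Only when no local baseline is found
+should it record that the search came up empty and fall back to PyTorch or a literal oracle.
+
+### New Operator Checklist
+
+- Search for the wrapper, binding, kernel, test, benchmark, and config of each constituent
+  operator.
+- If a local unfused path exists, prefer the unfused LightOp composition as the baseline.
+- Implement the HIP/C++ kernel or launcher under the nearest `lightop/csrc/` family.
+- Expose the C++ symbol via `m.def(...)` in `lightop/csrc/export.cpp`.
+- Add a Python wrapper that keeps LightOp's existing tensor validation style.
+- Change `lightop/__init__.py` only if the operator needs to be public to users.
+- Change `setup.py` only when a new `csrc/<family>` is added and the `setup.py` glob does not
+  cover it.
+- Change `lightop/config*.py` or the dispatch table only when shape/gfx specialization is
+  needed.
+- Add focused correctness tests and a benchmark for the target workload.
+
+### Existing Operator Optimization Checklist
+
+When optimizing an existing operator, tune the target kernel directly and do not detour into
+adding public API:
+
+- Find the current Python wrapper, C++ binding, kernel launch, config/dispatcher, test, and
+  benchmark.
+- Keep the public API unchanged unless the user explicitly asks for a breaking change.
+- Do not add public API.
+- Do not create unrelated operator files.
+- Do not create unrelated operator families.
+- Do not modify unrelated operator families.
+- Prefer changing only the target operator's kernel, launcher, necessary config, focused
+  tests, and benchmark.
+- Install, test, benchmark, profile, and tune directly in the specified environment.
+- If an optimization line regresses or only helps non-target shapes, record the reject reason.
+
+### DCU/ROCm Default Rules
+
+- In LightOp tests, treat PyTorch's `torch.cuda` namespace as a facade over the ROCm runtime.
+- Prefer `hipcc` and the ROCm extension build.
+- Respect `PYTORCH_ROCM_ARCH`; when it is not set, infer it from `gcnArchName`.
+- Do not introduce NVIDIA-specific assumptions such as CUDA-only headers, PTX/SASS,
+  CUTLASS/CuTe, Nsight Compute, or TMA/WGMMA.
+- When borrowing from CUDA material, translate it into ROCm/DCU concepts and record that it
+  is only cross-platform inspiration.
+
+Environment probe commands: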
+
+```bash
+python - <<'PY'
+import torch
+print("torch:", torch.__version__)
+print("hip:", torch.version.hip)
+print("device:", torch.cuda.get_device_name(0))
+print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
+PY
+hipcc --version
+```
+
+### Performance Device Gate
+
+Before every benchmark or profiling run, check card status in the target environment first:
+
+```bash
+hy-smi || rocm-smi
+HIP_VISIBLE_DEVICES=<idle-card> python test/<family>/benchmark_<op>.py
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py
+```
+
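+Choose a card with low HCU utilization and low memory usage. Use the same card for baseline
+and candidate whenever possible. If device status was not recorded, the performance numbers
+should not be treated as final, trustworthy evidence.
+
+### Workflow
+
+Stage 1: Inspect and plan.
+
+- Locate the LightOp root and the target operator family.
+- Recover `K/R/W/E`, the target gfx arch, the baseline command, and the success threshold.
+- Inspect wrapper, binding, kernel, config, test, and benchmark.
+- When adding a fused operator, search for the unfused single operators and related fused
+  kernels.
+- Before the first serious implementation, querying `lightop-kernel-knowledge` is recommended.
+- Write `.humanize/lightop-agent/research-digest.md`.
+- Define the benchmark contract before writing code: shape, dtype, layout, axis, epsilon,
+  bandwidth formula, warmup/repeat, the statistic used, noise band, and command.
+
+Stage 2: Implement and verify.
+
+- Before the first optimization, build a baseline matrix: correctness, benchmark, card status,
+  p50/p90/mean, effective bandwidth, and noise estimate.
+- Test only one main optimization hypothesis per round.
+- Build with `python setup.py install`.
+- Run import smoke, target correctness, and benchmark in the same environment.
+- For every candidate that passes correctness, record shape, dtype, config, bandwidth or
+  latency, the comparison baseline, and the keep/reject/inconclusive reason.
+
+Stage 3: Profiling and tuning.
+
+- When the first correct candidate misses the performance target, move into profiling/tuning;
+  do not stop.
+- When two consecutive correct candidates miss the target, both a `lightop-kernel-knowledge`
+  research conclusion and a `dcu-profiler-report` profile digest are required before the next
+  kernel/dispatch edit.
+- If the second correct optimization attempt improves on the parent or baseline by less than
+  5%, the next step must be a deep profile covering PMC, SQTT, `dccobjdump`, code-object
+  resources, and an LDS/register/occupancy explanation, or a record of why the tools are
+  unavailable.
+
+Stage 4: Wrap-up.
+
+- Completion requires at least install, import smoke, target correctness, and a benchmark
+  comparison.
+- Profiler evidence is required when results are near the threshold or anomalous.
+- If the performance target is not met, do not declare performance-complete; report only the
+  current best result, bottleneck evidence, failed lines, and suggested next steps.
+
+### Performance Target Discipline
+
+The performance target is an acceptance contract, not a best-effort goal. The rules are:
+
+- A performance candidate is valid only after correctness passes.
+- Every correct candidate must have benchmark data.
+- An improvement must exceed the noise band defined in the plan; anything within the noise
+  band can only be recorded as inconclusive or plateau.
+- When the first correct candidate misses the target, profiling and tuning are mandatory.
+- Try at least 3 evidence-backed optimization lineages, unless the profiler proves the target
+  is unreachable.
+- When two consecutive correct candidates miss the target, the next round must start with
+  knowledge research and a profiler digest.
+- If the target is ultimately not reached, the final report may only state the current best
+  result, the bottleneck, and next steps; it must not say the task is complete.
+
+### Humanize Loop State
+
+Record files live locally in the LightOp repository by default and should not be committed: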
+
+```text
+.humanize/lightop-agent/refined-plan.md
+.humanize/lightop-agent/research-digest.md
+.humanize/lightop-agent/attempt-ledger.md
+.humanize/lightop-agent/kernel_opt_readme.md
+.humanize/lightop-agent/rlcr-fallback.md
+.humanize/lightop-agent/optimization-ledger.md
+.humanize/lightop-agent/lineage.jsonl
+.humanize/lightop-agent/performance-map.json
+.humanize/lightop-agent/tuning-decisions.md
+.humanize/lightop-agent/profile-artifacts/
+```
+
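+`kernel_opt_readme.md` must be updated after every benchmarked candidate, recording the
+baseline, hypothesis, modified files, build/test/benchmark commands, performance table,
+profile evidence, keep-or-reject reason, and next step.
+
+### Build, Test, Benchmark
+
+The LightOp build always uses: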
+
+```bash
+PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
+```
+
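+Regardless of the PyTorch version, do not switch to `setup_torch29.py`. In the normal tuning
+loop, do not delete `build/`, so incremental compilation results can be reused. Clean only
+when the user explicitly asks for a clean build or the build cache is shown to be corrupted.
+
+After install, run an import smoke: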
+
+```bash
+python - <<'PY'
+import torch, lightop
+print("torch:", torch.__version__)
+print("hip:", torch.version.hip)
+print("lightop:", getattr(lightop, "__file__", "unknown"))
+print("device:", torch.cuda.get_device_name(0))
+print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
+PY
+```
+
+Run the narrowest test first:
+
+```bash
+cd test
+python test_<op>.py
+```
+
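+If there is no benchmark, add a small one. It must include warmup, fixed shapes, a fixed seed,
+and explicit `torch.cuda.synchronize()` calls before and after timing.
+
+### RLCR Launch and Failure Fallback
+
+Once the refined plan is written and `.humanize*` is confirmed to be ignored, launch from the
+LightOp root: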
+
+```bash
+"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/lightop-agent/refined-plan.md --yolo
+```
+
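+If this fails, stop and report the error. The exception is when the `codex` CLI is
+unavailable: a manual fallback is allowed, but the following files must be written: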
+
+```text
+.humanize/lightop-agent/rlcr-fallback.md
+.humanize/lightop-agent/refined-plan.md
+.humanize/lightop-agent/research-digest.md
+.humanize/lightop-agent/attempt-ledger.md
+.humanize/lightop-agent/kernel_opt_readme.md
+```
+
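+The fallback must still follow the build/test/benchmark/profile, device selection, evidence
+recording, performance target, and logging rules; it just cannot claim that the
+Humanize/Codex review gate is enabled.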
+
+## lightop-kernel-knowledge
+
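+This skill turns LightOp/DCU operator questions into citable implementation evidence. The
+priority order is:
+
+1. Local LightOp source, tests, configs, and benchmarks.
+2. ROCm/DCU upstream and official documentation.
+3. The bundled CUDA PR corpus, as cross-platform inspiration only.
+
+It is suited to answering:
+
+- How are similar operators exposed and tested in LightOp?
+- Which `lightop/csrc/<family>` should a new operator go in?
+- Which config or dispatcher branch selects the current kernel?
+- Do ROCm, PyTorch ROCm, Triton AMD, SGLang, or vLLM have a similar DCU path?
+- Which profiler signal can guide the next change?
+- Can a CUDA PR idea be ported to HIP/DCU, and what needs translating?
+
+### Route A: Local LightOp Source
+
+This is the first priority. Common searches: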
+
+```bash
+rg -n "<op_name>|<binding_name>|<kernel_name>" lightop test setup.py setup_torch29.py
+rg -n "m\.def\(\"<op_name>|<wrapper_name>" lightop/csrc/export.cpp lightop/*.py
+rg -n "get_.*config|gfx|gcnArchName|PYTORCH_ROCM_ARCH" lightop/config*.py setup.py setup_torch29.py
+```
+
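+Record:
+
+- The Python wrapper path and public API.
+- The `export.cpp` binding line.
+- The kernel files under `lightop/csrc/`.
+- The config/dispatcher branch.
+- The test and benchmark files.
+- The baseline command and result.
+
+### Route B: ROCm/DCU Upstream and Official Documentation
+
+Use this for DCU-specific implementation or profiling questions. Prefer official
+documentation and upstream source:
+
+- SourceFind LightOp MR evidence.
+- DCU Toolkit flash-attention-cutlass MR evidence.
+- Hygon/DCU HIP optimization references under `sources/refs/`.
+- The SourceFind DCU/DTK performance analysis tool guide.
+- ROCm/HIP profiler documentation.
+- PyTorch ROCm, MIOpen, rocBLAS, hipBLASLt, Composable Kernel, AITer.
+- Triton AMD backend.
+- SGLang/vLLM AMD paths and tests.
+
+Record the URL or local path, commit/version, the specific file/function/config, and the
+impact on the LightOp plan. If code is copied or adapted, also record the license/notice.
+
+### Route C: Bundled PR Corpus
+
+This corpus contains a lot of CUDA/NVIDIA kernel engineering experience, but use it only when
+local and DCU-route evidence is insufficient. Common commands: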
+
+```bash
+python3 scripts/query.py "<operator> <dtype> <symptom>" --compact --limit 30
+python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --compact --limit 20
+python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
+python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
+python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
+python3 scripts/get_page.py <page-id>
+```
+
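+Porting rules:
+
+- CUTLASS/CuTe/PTX/SASS/TMA/WGMMA/tcgen05/Nsight evidence is by default not directly portable.
+- Translate CUDA concepts into ROCm/DCU concepts explicitly, for example SM to CU, warp to
+  wavefront, shared memory to LDS, and tensor core paths to MFMA or the target ROCm backend.
+- Hygon/DCU-specific conclusions should preferably be proven with target compilation and
+  profiler/ISA evidence.
+- Do not invent builtin names from mnemonics or generic AMD documentation.
+- Do not treat CUDA profiler metrics as evidence of a DCU bottleneck.
+
+### Route D: External Source Map
+
+`index.json` may list supplementary source repositories. Clone first, then grep locally: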
+
+```bash
+python3 scripts/clone-index-repos.py
+python3 scripts/search-index-repos.py <term1> <term2> [<term3>]
+```
+
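+Use this only when the source map is relevant to the current LightOp operator.
+
+### Citation Checklist
+
+Each finding should be recorded in a uniform way:
+
+- Route name.
+- Source path or URL.
+- Commit/version.
+- Key function, binding, config, test, or benchmark.
+- How it changes the LightOp plan.
+- Whether it is directly usable on DCU or only cross-platform inspiration.
+- If code is copied or adapted, the license/notice.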
+
+## dcu-profiler-report
+
+This skill is for the situation where benchmark numbers are not enough and profiler evidence
+is needed to decide what to change next.
+
+Core rule:
+
+```text
+Profile first, then diagnose, then optimize
+```
+
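+The output must not just say "memory-bound" or "launch overhead". It must reason from measured
+evidence to a likely mechanism, and then land on one actionable LightOp change.
+
+### When to Invoke
+
+Invoke it in the following cases:
+
+- The baseline benchmark has passed, but there is no baseline profile digest yet.
+- A correct candidate is within +/-2% of the baseline or prior best.
+- The second correct optimization attempt improves on the parent or baseline by less than 5%.
+- A candidate regresses on an important shape.
+- A candidate is suspiciously fast and needs an explanation.
+- The benchmark has plateaued and the next change is unclear.
+- The Humanize/RLCR reviewer asks for profiler evidence.
+- The kernel may be affected by launch overhead, host-device copy, global memory layout, LDS
+  bank conflict, VGPR/occupancy, wavefront divergence, MFMA underuse, or shape dispatch.
+
+When correctness fails, generally do not profile; fix correctness first.
+
+### Required Artifacts
+
+By default they go in: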
+
+```text
+.humanize/lightop-agent/profile-artifacts/<version>/
+  device-status.txt
+  benchmark.log
+  hipprof.txt
+  hipprof-pmc-all/
+  sqtt-json/
+  rocprof.csv
+  rocprof-stats.csv
+  rocprof-compute/
+  code-object-metadata.txt
+  amdgpu-isa.txt
+  resource-usage.txt
+  digest.md
+```
+
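+When comparing candidates, the digest must reference the baseline or parent artifact path.
+
+### Profile Workflow
+
+1. Pick one representative shape first, ideally the smallest shape that exposes the
+   regression, plateau, or bottleneck.
+2. Make sure the benchmark harness has warmup, fixed dtype/shape/seed, and explicit
+   synchronization.
+3. Run the device selection gate and pin the card with `HIP_VISIBLE_DEVICES=<idle-card>`.
+4. Run the plain benchmark first, so that profiler overhead does not become the performance
+   conclusion.
+5. Use `hipprof` for first-stage API/kernel/memcpy timing.
+6. For deep analysis, collect PMC, SQTT, `dccobjdump`, code-object resource, and
+   LDS/register/occupancy evidence.
+7. If `hipprof` only shows the hot kernel but cannot explain why, move to a deeper ROCm/DTK
+   profiler.
+8. If codegen is suspected, inspect the AMDGPU ISA or code-object metadata.
+9. Compare the candidate against the baseline or parent; do not look only at absolute numbers.
+10. Write `digest.md`; it must end with exactly one clear next edit.
+
+### Common Commands
+
+First stage: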
+
+```bash
+mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/test_<op>.py 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
+```
+
+Benchmark script:
+
+```bash
+mkdir -p .humanize/lightop-agent/profile-artifacts/v001_candidate
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof python test/<family>/benchmark_<op>.py 2>&1 \
+  | tee .humanize/lightop-agent/profile-artifacts/v001_candidate/hipprof.txt
+```
+
+Deep analysis:
+
+```bash
+mkdir -p .humanize/lightop-agent/profile-artifacts/v002_deep/{hipprof-pmc-all,sqtt-json}
+hy-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt || \
+  rocm-smi 2>&1 | tee .humanize/lightop-agent/profile-artifacts/v002_deep/device-status.txt
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/lightop-agent/profile-artifacts/v002_deep/hipprof-pmc-all/pmc \
+  python test/<family>/benchmark_<op>.py
+dccobjdump --inputs=<binary-or-so> --show-sass --show-instruction-encoding \
+  --separate-functions > .humanize/lightop-agent/profile-artifacts/v002_deep/amdgpu-isa.txt
+hipprof --codeobj-analyze <binary-or-so> \
+  > .humanize/lightop-agent/profile-artifacts/v002_deep/resource-usage.txt
+```
+
+If a command is not supported by the current DTK, record the failing command and its error
+output; do not substitute guesses for missing evidence.
+
+### What the Digest Must Contain
+
+- Runtime identity: commit, operator, API, shape, dtype, gfx, DTK/ROCm/PyTorch, build command.
+- Device selection: `hy-smi` or `rocm-smi`, the chosen `HIP_VISIBLE_DEVICES`, HCU, memory.
+- Benchmark timing: warmup, repeat, p50/p90/mean, sync points, variance, delta.
+- `hipprof` timing: HIP API, kernel, memcpy/memset, launch overhead, top kernels.
+- Kernel launch: grid/block, waves, occupancy hints.
+- Memory path: bytes read and written, coalescing, alignment/vector width, L2/HBM pressure.
+- LDS path: LDS usage, bank conflict, barrier.
+- Compute path: MFMA/vector ALU, conversion/quantization, epilogue fusion.
+- Resource pressure: VGPR/SGPR, spill/scratch, LDS per block, occupancy limits.
+- Dispatch/config: whether the wrong shape branch was chosen, fallback, gfx specialization.
+- ISA/code object: hot instruction window, vector width, MFMA, scalarization, scratch.
+- Final section: exactly one next LightOp change.
+
+## Recommended Prompt Templates
+
+### Adding an Operator
+
+```text
+@lightop-kernel-agent-loop
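+Host LightOp path: /public/home/wanghl6/pr/lightop
+Verification Docker: wanghl_lightop209
+In-container LightOp path: /home/pr/lightop
+
+Task: add the <operator> operator to LightOp.
+Correctness reference: <PyTorch/native LightOp reference>
+Performance target: <latency or effective bandwidth for the target shape/dtype>
+
+Requirements:
+- All build/test/benchmark/profile steps must run inside the container at /home/pr/lightop.
+- Use the command: docker exec wanghl_lightop209 bash -lc 'cd /home/pr/lightop && <command>'
+- Follow the existing LightOp wrapper, export.cpp, csrc, test, and benchmark style.
+- If the first version is correct but misses the performance target, move into profiling/tuning; do not stop.
+- Before finishing, give the actual commands and results for build, smoke test, correctness test, and benchmark.
+```
+
+### Optimizing an Existing Operator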
+
+```text
+@lightop-kernel-agent-loop
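+Host LightOp path: /public/home/wanghl6/pr/lightop
+Verification Docker: wanghl_lightop209
+In-container LightOp path: /home/pr/lightop
+
+Task: optimize the existing operator <operator>.
+Target source: lightop/csrc/<family>/<file>.cu
+Validation file: test/test_<op>.py
+Performance target: peak effective bandwidth reaches <target> TB/s.
+
+Authorized scope:
+- May modify source, tests, benchmarks, and necessary config in this LightOp repository that relate to this task.
+- May create/write record files under .humanize/lightop-agent/.
+- May run docker exec commands for build, test, benchmark, and profiling.
+- May run python setup.py install and hipprof/rocprof.
+- Must not delete build/.
+- Must not git reset, clean the repository, force push, or delete large directories.
+- No git commit/push is needed.
+
+This is an optimization task for an existing operator:
+- Do not add public API.
+- Do not create unrelated operator files.
+- Do not change unrelated operator families.
+- Prefer modifying only source, tests, benchmarks, or necessary config related to the target operator.
+- Run install, correctness test, benchmark, profiling, and tuning directly inside Docker.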
+```
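
The document above asks for a benchmark with warmup, explicit synchronization around timing, p50/p90/mean statistics, and a noise band that decides between keep, reject, and inconclusive. A minimal sketch of that contract follows. It is not taken from LightOp or the skills themselves: the function names `summarize`, `bench`, and `classify`, the nearest-rank percentile, and the 2% default noise band are all assumptions made for illustration. The synchronization hook is pluggable so the sketch also runs without a device; on a ROCm build of PyTorch one would presumably pass `torch.cuda.synchronize`.

```python
import statistics
import time


def summarize(samples_ms):
    """Return p50/p90/mean for a list of latency samples (nearest-rank percentiles)."""
    ordered = sorted(samples_ms)

    def pct(p):
        # Nearest-rank index into the sorted samples; an illustrative choice.
        return ordered[int(round(p * (len(ordered) - 1)))]

    return {"p50": pct(0.5), "p90": pct(0.9), "mean": statistics.fmean(ordered)}


def bench(fn, sync=lambda: None, warmup=10, repeat=50):
    """Time fn() with warmup and explicit synchronization before and after each run.

    `sync` is a hypothetical hook; pass torch.cuda.synchronize on a real device.
    """
    for _ in range(warmup):
        fn()
    sync()
    samples_ms = []
    for _ in range(repeat):
        sync()
        start = time.perf_counter()
        fn()
        sync()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return summarize(samples_ms)


def classify(baseline_ms, candidate_ms, noise_band=0.02):
    """Label a correct candidate; noise_band=0.02 is an assumed placeholder value."""
    speedup = baseline_ms / candidate_ms - 1.0
    if abs(speedup) <= noise_band:
        return "inconclusive"
    return "keep" if speedup > 0 else "reject"
```

A typical use would compare the `p50` of `bench(baseline_fn)` and `bench(candidate_fn)` with `classify`, on the same card and with the noise band taken from the refined plan rather than the placeholder default.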
+