<div align="center">

# LightOp KernelPilot

**A Humanize-based skill pack for developing and optimizing LightOp/DCU operators.**

</div>

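LightOp KernelPilot is an adaptation of the KernelPilot workflow for the LightOp
DCU operator library. It keeps the parts of an autonomous operator-development
loop that prove useful in practice: explicit operator semantics, a correctness
reference, the workload distribution, benchmark evidence, profile digests, an
attempt ledger, an optimization log, and iteration behind a review gate.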

## Skills

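| Skill | Purpose |
| --- | --- |
| [`lightop-kernel-agent-loop`](humanize/skills/humanize-kernel-agent-loop/) | Main loop. Adds or optimizes a LightOp operator: recovers `K/R/W/E`, inspects the wrapper/binding/kernel/test/config/benchmark, implements the HIP/ROCm changes, runs build, test, benchmark, profile, and tune, and starts Humanize RLCR. |
| [`lightop-kernel-knowledge`](knowledge/SKILL.md) | Evidence retrieval. Searches the local LightOp source first, then ROCm/DCU upstream code and official documentation, and only as a last resort treats the bundled CUDA PR corpus as cross-platform inspiration. |
| [`dcu-profiler-report`](humanize/skills/ncu-report/) | Performance profiling. Consolidates `hipprof` output, ROCm/DTK profiler output, benchmark logs, and optional AMDGPU ISA evidence into a reproducible digest, and recommends one concrete next LightOp change. |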

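The on-disk directory names `humanize-kernel-agent-loop` and `ncu-report` are kept
for compatibility with the upstream Humanize installer; the skill names actually
exposed to the agent are the ones in the table above, as declared in each skill's
frontmatter.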

If you only want to read a Chinese description of these three skills, see
[`docs/lightop-skills.zh-CN.md`](docs/lightop-skills.zh-CN.md).

## Manual Installation

If you do not have the Claude or Codex CLI installer, you can install the three
LightOp/DCU skills directly:

```bash
./install-lightop-skills-manual.sh --target both
```

Install for Claude only:

```bash
./install-lightop-skills-manual.sh --target claude
```

Install for Codex only:

```bash
./install-lightop-skills-manual.sh --target codex
```

Default paths:

```text
Claude: ~/.claude/skills
Codex:  ${CODEX_HOME:-~/.codex}/skills
```

The script will:

- symlink `lightop-kernel-knowledge`
- symlink `dcu-profiler-report`
- hydrate the `{{HUMANIZE_RUNTIME_ROOT}}` and `{{KERNELPILOT_ROOT}}` placeholders
  in `lightop-kernel-agent-loop`
- optionally install `knowledge/requirements.txt`

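## Request Format

A clear request should ideally include: the operator name, correctness reference,
workload, execution environment, target DCU/gfx arch, baseline, benchmark method,
and success threshold.

Example prompt: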

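```text
@lightop-kernel-agent-loop
- Host LightOp path: /path/to/lightop
- Validation Docker container: <lightop-container>
- LightOp path inside the container: /workspace/lightop

Task: add a 1-pass layer norm operator.
- Correctness reference: PyTorch layer_norm.
- Performance target: peak effective bandwidth of 1.1 TB/s.
- All build/test/benchmark/profile steps must run inside the container under /workspace/lightop.
- Use this command format:
  docker exec <lightop-container> bash -lc 'cd /workspace/lightop && <command>'
- If the shape, dtype, API, or whether to return mean/rstd is unspecified, first infer it from the
  existing LightOp layernorm/rmsnorm tests and benchmarks; ask only if it cannot be inferred.

Create these first:
- .humanize/lightop-agent/refined-plan.md
- .humanize/lightop-agent/research-digest.md
- .humanize/lightop-agent/attempt-ledger.md
Then start the Humanize loop or record its state.

Authorization scope:
- May modify source, tests, benchmarks, and necessary config in this LightOp repository that relate to this task.
- May create and write record files under .humanize/lightop-agent/.
- May run the docker exec commands above for build, test, benchmark, and profiling.
- May run python setup.py install and hipprof/rocprof.
- Must not delete build/.
- Must not run git reset, clean the repository, force push, or delete large directories.
- No git commit/push is needed.
```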

## LightOp Integration Points

The main loop works inside a LightOp checkout, which typically contains:

```text
setup.py
setup_torch29.py
lightop/__init__.py
lightop/csrc/export.cpp
lightop/csrc/<family>/
lightop/config*.py
test/
```

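When adding an operator, the agent typically inspects or modifies:

- `lightop/csrc/<family>/`: HIP/C++ kernels and launchers. Source files use the
  `.cu` extension; do not hand-write `.hip` operator files, and treat any `.hip`
  file that appears as an auto-generated build artifact
- `lightop/csrc/export.cpp`: `m.def(...)` bindings
- `lightop/<op>.py`: Python wrapper
- `lightop/__init__.py`: public API exports
- `setup.py`: change only when a new `csrc` family needs to be covered by the glob
- `test/test_<op>.py`: correctness test
- benchmark script: performance test
- `lightop/config*.py`: change only when shape/gfx-aware dispatch is needed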

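All changes must follow LightOp's existing development conventions. Before writing
code, the agent must find the nearest implementation in the same family and use it
as the reference, following its directory layout, file naming, C++
namespace/include/launch helper usage, wrapper argument validation, `export.cpp`
bindings, config/dispatcher, and test and benchmark style. Do not introduce
unrelated dependencies, directory structures from external projects, bulk
reformatting, generated source, or changes to unrelated operator families unless
the user explicitly asks for them and the plan explains why. Before delivery, list
which local LightOp files each modified file was modeled on, and confirm that there
is no hand-written `.hip` source, no unrelated change, and no more than one final
task test entry point under `test/`.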

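The final validation script must live under `test/`. A new operator needs a
`test/test_<op>.py`. When optimizing an existing operator, use the test file the
user specifies; if none is specified, infer the existing one or create
`test/test_<op>.py`. Each task keeps exactly one formal test entry point under
`test/`; every other benchmark, candidate test, and parse/sweep script goes under
`.humanize/lightop-agent/`. The test script verifies accuracy first and then
measures performance. The performance run uses a fixed 10 warmup iterations and
100 timed iterations, and reports the average latency in µs and the effective
bandwidth. Present the final validation results in a short table.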
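
The reporting rule above can be sketched in a few lines of Python. This is a minimal illustration, not LightOp's actual test code: the helper names `layer_norm_bytes`, `effective_bandwidth_gbps`, and `format_result_row` are hypothetical, and the byte count assumes a layer norm that reads the input once and writes the output once, ignoring weight, bias, and any mean/rstd outputs.

```python
def layer_norm_bytes(rows: int, cols: int, dtype_size: int) -> int:
    """Bytes moved by one call. Assumption: one read of x plus one write of y."""
    return 2 * rows * cols * dtype_size


def effective_bandwidth_gbps(bytes_moved: int, avg_us: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) from an average latency in us."""
    if avg_us <= 0:
        raise ValueError("avg_us must be positive")
    return bytes_moved / (avg_us * 1e-6) / 1e9


def format_result_row(shape: str, dtype: str, avg_us: float, gbps: float) -> str:
    """One row of the short Markdown result table."""
    return f"| {shape} | {dtype} | {avg_us:.2f} | {gbps:.1f} |"
```

A test script would print a header such as `| shape | dtype | avg us | GB/s |` followed by one `format_result_row(...)` line per workload. Adjust the byte count to whatever the operator really reads and writes, since that choice determines the reported bandwidth.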

The build rule for LightOp KernelPilot is fixed:

```bash
python setup.py install
```

If the user specifies a Docker container, the build must also run inside that container:

```bash
docker exec <container> bash -lc 'cd <container-lightop> && python setup.py install'
```

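Regardless of the PyTorch version, never switch to `setup_torch29.py`. During the
normal tuning loop, do not delete `build/`, nor the `build/bdist.*`, `build/lib.*`,
and `build/temp.*` subdirectories that `python setup.py install` normally
generates; keep them for reuse so that each round does not trigger a full rebuild.
Clean up only when the user explicitly asks for a clean build or the cache is shown
to be corrupted.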

## DCU Profiling

`dcu-profiler-report` uses the SourceFind DCU performance analysis tools guide as its official reference for first-stage profiling:

```text
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```

Default first-stage profiling:

```bash
cd /path/to/lightop
bash /path/to/lightop-skills/humanize/scripts/measure-device-bandwidth.sh \
  --docker <container> \
  --workdir <container-lightop> \
  --hip-visible-devices <idle-card> \
  --output .humanize/lightop-agent/device-bandwidth.txt
mkdir -p .humanize/lightop-agent/profile-artifacts/v000_baseline
python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/benchmark.log
hipprof python test/test_<op>.py 2>&1 \
  | tee .humanize/lightop-agent/profile-artifacts/v000_baseline/hipprof.txt
```

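Before starting any optimization, measure the actual read/write/copy bandwidth of
the currently selected card; it serves as the reference for later
effective-bandwidth targets.

Within the optimization loop, every candidate that passes correctness gets an
additional `hipprof --pmc` run after the regular benchmark, to examine cache
behavior, LDS/bank conflicts, and occupancy/resource pressure, and to turn the
results into a concrete next kernel edit. If the current DTK does not support a
counter, record the actual command and error message rather than substituting a
guess.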

When `hipprof` and the benchmark log are not enough to explain a problem, you can go further with:

```text
rocprof
rocprofv3
rocprof-compute
AMDGPU ISA / code-object inspection
```

A profile digest is more than a one-line "memory-bound" verdict. It must state:

```text
signal measured -> likely mechanism -> why the other hypotheses are weaker -> what exactly to change next
```
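
One way to keep digests from collapsing into a one-word verdict is to make the four parts explicit fields and reject empty ones. The sketch below is an assumption about how that could look, not part of the `dcu-profiler-report` skill; the class name `DigestEntry` and its field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DigestEntry:
    signal: str      # what was measured (counter, timing, or log line)
    mechanism: str   # the likely mechanism behind that signal
    ruled_out: str   # why the competing hypotheses are weaker
    next_edit: str   # the concrete kernel change to try next

    def __post_init__(self):
        # Refuse entries that leave any of the four parts blank.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("digest entry is missing: " + ", ".join(missing))

    def render(self) -> str:
        """Join the four parts in the order the digest requires."""
        return " -> ".join(getattr(self, f.name) for f in fields(self))
```

Calling `DigestEntry(...).render()` produces a single line in the order shown above, and construction fails loudly when, for example, `next_edit` is left empty.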

## Build And Test Defaults

Build, install, test, benchmark, and profiling must all happen in the same
environment. If the target environment is Docker, prefer repeatable,
non-interactive commands:

```bash
docker exec <container> bash -lc 'cd /workspace/lightop && <command>'
```
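
When these commands are driven from a script rather than typed by hand, building the argument list programmatically avoids quoting mistakes. The helper below is a hypothetical sketch (the name `docker_exec_argv` is not part of this repository); it quotes only the working directory and passes `command` through unchanged as a shell snippet.

```python
import shlex


def docker_exec_argv(container: str, workdir: str, command: str) -> list[str]:
    """Build argv for: docker exec <container> bash -lc 'cd <workdir> && <command>'."""
    # The whole script is a single argv element, so no outer quoting is needed here.
    script = f"cd {shlex.quote(workdir)} && {command}"
    return ["docker", "exec", container, "bash", "-lc", script]
```

The result can be passed to `subprocess.run(argv, check=True)`, and `shlex.join(argv)` gives a copy-pasteable form for logging in the attempt ledger.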

From the LightOp root in that environment, first probe the versions and the device:

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
hipcc --version
```

Then build:

```bash
PYTORCH_ROCM_ARCH='gfx928;gfx936;gfx938' python setup.py install
```

Then run an import smoke test:

```bash
python - <<'PY'
import torch, lightop
print("lightop:", getattr(lightop, "__file__", "unknown"))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
PY
```

Run the narrowest targeted test first:

```bash
cd test
python test_<op>.py
```

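A benchmark must include warmup and must explicitly call
`torch.cuda.synchronize()` or an equivalent HIP synchronization before and after
the timed region. Never claim a speedup from Python wall-clock timing of
asynchronous launches alone.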
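
A minimal sketch of such a timing loop follows. It is an illustration under stated assumptions rather than LightOp's benchmark code: the function name `time_kernel_us` is hypothetical, and the synchronization call is injected as `sync` so the same loop works with `torch.cuda.synchronize` or any other blocking device sync.

```python
import time
from typing import Callable


def time_kernel_us(fn: Callable[[], object],
                   sync: Callable[[], None],
                   warmup: int = 10,
                   iters: int = 100) -> float:
    """Average latency of fn in microseconds, with syncs around the timed region."""
    for _ in range(warmup):
        fn()
    sync()  # drain warmup work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for every timed launch to finish before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed * 1e6 / iters
```

On a DCU, `fn` would be a closure that calls the operator and `sync` would be `torch.cuda.synchronize`. Without the second sync, the loop would only measure how long it takes to enqueue the launches.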

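A LightOp operator change is not complete just because the build passes.
Completion requires at least:

- a successful install
- a passing import smoke test
- a passing targeted correctness test
- a completed benchmark comparison against the baseline
- profiler evidence when the result is near the threshold or anomalous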
## Standard Installation

Install the Humanize skill pack from the current checkout:

```bash
cd /path/to/kernel-pilot
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```

Claude Code users can use:

```bash
./humanize/scripts/install-skills-claude.sh --kernelpilot-root "$PWD"
```

After installation, the relevant skill names are:

```text
lightop-kernel-agent-loop
lightop-kernel-knowledge
dcu-profiler-report
triton-kernel-agent-loop
triton-kernel-knowledge
triton-dcu-profiler-report
```

For Triton/DCU usage notes, see [`docs/triton-skills.md`](docs/triton-skills.md)
or the Chinese version [`docs/triton-skills.zh-CN.md`](docs/triton-skills.zh-CN.md).
The Triton skills support a framework mode for vLLM/SGLang as well as a
direct-file mode, in which the user points directly at a Triton Python file.

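## Evidence Rules

- Local LightOp source, tests, configs, and benchmarks are first-priority evidence.
- Official ROCm/DCU documentation and upstream source are second-priority evidence.
- The bundled CUDA PR corpus serves only as cross-platform inspiration unless it
  has been explicitly translated and validated on DCU.
- When copying or adapting external source, record the source path/URL, commit or
  version, license/notice, and the optimization gain.
- A profile digest must end with one concrete next change; if profiling cannot be
  run, explain why.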