Commit 60c75a2f authored by whlwhlwhl

add triton-kernel-skill

parent 6889486d
@@ -3,12 +3,12 @@
   "owner": {
     "name": "BBuf"
   },
-  "description": "Claude Code marketplace for the Humanize LightOp/DCU operator development and optimization loop.",
+  "description": "Claude Code marketplace for Humanize DCU kernel development loops, including LightOp HIP/C++ and Triton vLLM/SGLang or direct-file optimization.",
   "plugins": [
     {
       "name": "humanize",
       "source": "./humanize",
-      "description": "Humanize plus LightOp/DCU operator workflows, dcu-profiler-report profiling, and review-gated iteration.",
+      "description": "Humanize plus LightOp/DCU and Triton/DCU kernel workflows for frameworks or direct files, profiler digests, and review-gated iteration.",
       "version": "1.17.0"
     }
   ]
......
@@ -115,7 +115,8 @@ test/
 When adding an operator, the agent usually inspects or modifies:
-- `lightop/csrc/<family>/`: HIP/C++ kernel and launcher
+- `lightop/csrc/<family>/`: HIP/C++ kernel and launcher; sources use `.cu`. Do not hand-write
+  `.hip` operator files; any `.hip` file that appears is treated as a generated build artifact
 - `lightop/csrc/export.cpp`: `m.def(...)` binding
 - `lightop/<op>.py`: Python wrapper
 - `lightop/__init__.py`: public API export
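@@ -124,6 +125,21 @@ test/
 - benchmark scripts: performance tests
 - `lightop/config*.py`: change only when shape/gfx-aware dispatch is needed
+All changes must follow LightOp's existing development conventions. Before writing code, the agent
+must find the nearest implementation in the same family and use it as the reference, following its
+directory layout, file naming, C++ namespace/include/launch helper, wrapper argument validation,
+`export.cpp` binding, config/dispatcher, test, and benchmark style.
+Do not introduce unrelated dependencies, external project directory layouts, bulk reformatting,
+generated sources, or changes to unrelated operator families unless the user explicitly asks for
+them and the plan states the reason. Before delivery, list which local LightOp files each modified
+file was modeled on, and confirm that no `.hip` source was hand-written, that there are no unrelated
+changes, and that `test/` does not contain multiple final task test entry points.
+The final verification script must live under `test/`. A new operator needs `test/test_<op>.py`;
+when optimizing an existing operator, use the test file the user specifies, and infer or create
+`test/test_<op>.py` only when none is specified. Each task keeps exactly one official test entry
+point under `test/`; all other benchmark, candidate-test, and parse/sweep scripts go under
+`.humanize/lightop-agent/`. The test script runs accuracy verification first, then the performance
+test; the performance test uses a fixed 10 warmup iterations and 100 timed iterations and reports
+mean latency in us and effective bandwidth. Present the final verification results in a short table.
 The LightOp KernelPilot build rule is fixed as: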
```bash ```bash
...@@ -137,8 +153,9 @@ docker exec <container> bash -lc 'cd <container-lightop> && python setup.py inst ...@@ -137,8 +153,9 @@ docker exec <container> bash -lc 'cd <container-lightop> && python setup.py inst
``` ```
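 Regardless of the PyTorch version, never switch to `setup_torch29.py`. During the normal tuning loop, also do not delete
-`build/`, so that incremental build results can be reused; clean it only when the user explicitly requests a clean build
-or the cache is proven corrupted.
+`build/`, nor the `build/bdist.*`, `build/lib.*`, and `build/temp.*` subdirectories that
+`python setup.py install` normally generates; keep them for reuse so that each iteration does not trigger a full rebuild.
+Clean them only when the user explicitly requests a clean build or the cache is proven corrupted.
 ## DCU Profiling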
@@ -264,8 +281,16 @@ Claude Code users can use:
 lightop-kernel-agent-loop
 lightop-kernel-knowledge
 dcu-profiler-report
+triton-kernel-agent-loop
+triton-kernel-knowledge
+triton-dcu-profiler-report
``` ```
+For Triton/DCU usage notes, see [`docs/triton-skills.md`](docs/triton-skills.md)
+or the Chinese version [`docs/triton-skills.zh-CN.md`](docs/triton-skills.zh-CN.md).
+The Triton skills support a framework mode for vLLM/SGLang and a direct-file mode in which
+the user points directly at a Triton Python file.
 ## Evidence Rules
 - Local LightOp source, tests, configs, and benchmarks are first-priority evidence.
......
# Triton/DCU Skills
Chinese reading version: [`triton-skills.zh-CN.md`](triton-skills.zh-CN.md).
KernelPilot includes an independent Triton/DCU skill pack for optimizing Triton
kernels inside vLLM, inside SGLang, or in a user-specified Triton Python file.
It is separate from the LightOp skill pack and uses its own task state
directory:
```text
.humanize/triton-agent/
```
## Skills
| Skill | Purpose |
| --- | --- |
| `triton-kernel-agent-loop` | Main loop for vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing, small JIT kernels, and direct Triton files on DCU/ROCm. |
| `triton-kernel-knowledge` | Evidence search for local vLLM/SGLang source, direct-file call sites and harnesses, KernelPilot PR corpus, Triton/ROCm/DTK/DCU docs, and portable cross-platform ideas. |
| `triton-dcu-profiler-report` | DCU profiler digest for framework or standalone Triton JIT kernels, including backend/call-site proof, hipprof/rocprofv3/rocprof-compute evidence, Triton cache/IR dumps, and AMDGPU ISA/code-object clues. |
## Open Kernel References
The Triton knowledge route includes source-reference pages for high-value open
Triton kernel libraries:
```text
ref-rocm-aiter
ref-rocm-aotriton
ref-stackav-conch
ref-flaggems
ref-liger-kernel
ref-huggingface-kernels
ref-triton-distributed
```
Use them as reference implementations or discovery routes, then validate
correctness, benchmark, profiler names, Triton cache/IR, and DCU ISA locally
before promoting any idea.
## Install
The standard installers include both LightOp and Triton skill packs:
```bash
./install-lightop-skills-manual.sh --target both
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```
## Example Prompt: Framework Mode
```text
@triton-kernel-agent-loop
Framework: vLLM
Repo path: /path/to/vllm
Container: <container-name>
Repo path inside container: /workspace/vllm
Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing framework path or PyTorch reference
Performance target: p50 latency improvement above benchmark noise band
Requirements:
- Prove that the Triton backend, not AITER/FlashInfer/TRTLLM/C++ fallback, is selected.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
- Use hy-smi or rocm-smi before performance runs.
- Use hipprof/rocprofv3 and Triton cache/IR/ISA evidence when benchmark results are close or below target.
```
## Example Prompt: Direct File Mode
```text
@triton-kernel-agent-loop
Target mode: direct-file
Target file: /path/to/project/kernels/my_triton_kernel.py
Target function or wrapper: <kernel_name_or_wrapper>
Project root or workdir: /path/to/project
Container: <container-name>
Workdir inside container: /workspace/project
Task: optimize this Triton kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing Python/PyTorch reference, test, or oracle
Workload: shape/dtype/layout distribution and representative benchmark command
Performance target: p50 latency improvement above benchmark noise band
Requirements:
- First identify the @triton.jit function, launch wrapper, grid, configs, caller, and harness.
- If no harness exists, create a temporary correctness/benchmark harness under .humanize/triton-agent/.
- Prove the direct call reaches the target Triton kernel with profiler kernel names, Triton cache/dumps, or temporary instrumentation.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
```
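# Triton/DCU Skills Guide
This document is the reading guide for `triton-kernel-agent-loop`, `triton-kernel-knowledge`, and
`triton-dcu-profiler-report`, intended to help people understand the skills and write prompts. What
actually gets installed and triggered by Codex/Claude is still the `SKILL.md` in each skill's directory.
This skill set is independent of the LightOp skills. It targets Triton JIT kernels inside vLLM or
SGLang and also supports a Triton Python file specified directly by the user. It does not assume the
current task lives in the LightOp repository and does not use the `.humanize/lightop-agent/` state directory.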
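## What Each of the Three Skills Does
| Skill | When to use | Core output |
| --- | --- | --- |
| `triton-kernel-agent-loop` | Optimizing or adding a Triton kernel in vLLM/SGLang, or directly optimizing a specified Triton Python file, targeting DCU/ROCm/DTK | K/R/W/E/B/P, backend/call-site map, source changes, correctness, benchmark, profile, tuning records |
| `triton-kernel-knowledge` | You need evidence from vLLM, SGLang, direct-file call sites, Triton, ROCm, AITER, or DTK/DCU | Research digest stating the evidence source, applicability, and whether it transfers directly to DCU |
| `triton-dcu-profiler-report` | The benchmark alone cannot explain the bottleneck and the next step needs profiler, Triton IR, ISA, or code-object evidence | Profile digest that must end in one explicit next Triton/config/backend/call-site change |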
## Open-Source Triton Kernel Reference Libraries Included
`triton-kernel-knowledge` can now search these source-reference pages directly:
```text
ref-rocm-aiter
ref-rocm-aotriton
ref-stackav-conch
ref-flaggems
ref-liger-kernel
ref-huggingface-kernels
ref-triton-distributed
```
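They work well as reference implementations or discovery entry points. Before using any of them for DCU
optimization, you must still produce local evidence for correctness, benchmark, profiler kernel names, Triton cache/IR, and DCU ISA.
Process records for all Triton tasks go under the target vLLM/SGLang repository by default; for a directly
specified file, they go under the nearest project root or the user-specified workdir: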
```text
.humanize/triton-agent/
```
Do not mix it with LightOp's `.humanize/lightop-agent/`.
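## triton-kernel-agent-loop
This is the main-loop skill. It should trigger when the user says things like "optimize vLLM's Triton
MLA decode", "tune the config for SGLang Triton MoE", "tune the Triton FP8 quant kernel on DCU", or
"optimize the Triton kernel in `/path/to/kernel.py`".
It is not suited to LightOp HIP/C++ operator tasks. LightOp tasks continue to use
`lightop-kernel-agent-loop`.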
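### Input Contract K/R/W/E/B/P
Before starting, recover or define six kinds of information:
- `K`: kernel or operator semantics, including tensor shape, stride, layout, dtype, mask,
causal/window flag, quant scale, page table, KV cache, expert routing, and the output contract.
- `R`: correctness reference, usually the framework's existing path, a PyTorch reference, an existing
Triton path, an AITER path, or a small-scale literal oracle.
- `W`: workload distribution, including model scenario, prefill/decode/batch mix, seq length,
hidden/head dim, expert/topk, quant mode, target gfx/DCU, and latency/throughput metrics.
- `E`: execution environment, including host or Docker, repo path inside the container, visible DCUs,
DTK/ROCm/PyTorch/Triton/AITER versions, and the install, test, benchmark, and profile commands.
- `B`: backend or call-site contract, including vLLM/SGLang, direct-file, or scratch mode; V0/V1 or
runtime path; env flags; CLI arguments; backend registry branch; fallback branch; the directly called
wrapper/harness; Triton cache/JIT config; and proof that execution really reaches the target
Triton kernel.
- `P`: profiling contract, including the artifact directory, benchmark stability rules, profiler depth,
Triton IR/ISA dump requirements, and the minimum evidence required before the next round of changes.
When information is missing, the agent should first infer it from the repository, tests, benchmarks, and
configs. Ask the user only when the inference would change correctness, backend selection, or performance acceptance.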
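The six-part contract can be tracked with a small record that reports which parts are still undefined before work starts. This is a minimal sketch: the `TaskContract` class and its `missing` helper are hypothetical names chosen for illustration and are not part of the skill itself.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TaskContract:
    """Six-part input contract; field layout is illustrative, not the skill's schema."""
    K: Optional[str] = None  # kernel or operator semantics
    R: Optional[str] = None  # correctness reference
    W: Optional[str] = None  # workload distribution
    E: Optional[str] = None  # execution environment
    B: Optional[str] = None  # backend or call-site contract
    P: Optional[str] = None  # profiling contract

    def missing(self) -> List[str]:
        """Return the contract keys that are still undefined or blank."""
        return [f.name for f in fields(self)
                if not (getattr(self, f.name) or "").strip()]


# Example: three parts recovered from the repo, three still to infer or ask about.
contract = TaskContract(
    K="fused MoE, fp16, topk=2",
    R="PyTorch reference",
    E="Docker container, gfx936",
)
```

Calling `contract.missing()` lists the keys that still need to be inferred from the repository or, if the inference would change acceptance, asked of the user.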
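### Locating the Target
Three target modes are supported:
```text
framework: a vLLM or SGLang repository; backend routing must be proven.
direct-file: the user points directly at one or more Triton Python files; call sites and a harness must be recovered.
scratch: a new reproducer or experiment outside any framework.
```
In direct-file mode, the file the user provides is the target. Locate the nearest project root first, for imports,
tests, and benchmarks; if there is no project root, use the file's parent directory or the user-specified workdir.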
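The workdir lookup described above can be sketched as a walk up from the target file. This is an illustration under assumptions: the marker names in `ROOT_MARKERS` and the function name `resolve_workdir` are hypothetical, and the fallback order (user workdir before the file's parent) is one reasonable reading of the rule.

```python
from pathlib import Path
from typing import Optional, Sequence

# Hypothetical root markers; replace them with whatever identifies your project.
ROOT_MARKERS = ("pyproject.toml", "setup.py", ".git")


def resolve_workdir(target_file: str,
                    user_workdir: Optional[str] = None,
                    markers: Sequence[str] = ROOT_MARKERS) -> Path:
    """Return the directory used for imports, tests, and benchmarks."""
    target = Path(target_file).resolve()
    for parent in target.parents:  # nearest ancestor first
        if any((parent / marker).exists() for marker in markers):
            return parent
    if user_workdir:  # no project root found: prefer the explicit workdir
        return Path(user_workdir).resolve()
    return target.parent  # last resort: the file's own directory
```

The returned directory is also where `.humanize/triton-agent/` would be created for a direct-file task.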
Common direct-file searches:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]" <target-file-or-dir>
rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
rg -n "pytest|benchmark|do_bench|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
```
Common vLLM root markers:
```text
pyproject.toml
vllm/
tests/
benchmarks/
```
Common SGLang root markers:
```text
pyproject.toml
python/sglang/
sgl-kernel/
test/
benchmark/
```
Common searches:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|VLLM_ROCM|AITER|gcnArchName|triton_mla|fused_moe" vllm tests benchmarks
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|SGLANG_USE_AITER|attention_backend|moe_runner|fused_moe|triton_ops" python/sglang sgl-kernel test benchmark
```
### Backend Gate
Do not optimize a Triton kernel without backend proof. First prove that the request actually selects the
target Triton path rather than AITER, FlashInfer, TRTLLM, a framework C++ kernel, the PyTorch
native fallback, or some other backend.
Direct-file focus:
- Record the target file, `@triton.jit` function, launch wrapper, grid function, and autotune/config table.
- Find the caller, test, or benchmark harness that passes real tensors.
- If no harness exists, first create a temporary correctness/benchmark harness under `.humanize/triton-agent/`.
- Use profiler kernel names, Triton cache/dumps, temporary instrumentation, or a minimal reproducer
to prove the direct call really enters the target kernel.
- Treat framework backend flags as mandatory only when the file is still invoked through a framework path.
vLLM focus:
- Record the V0/V1 path and the attention, MLA, MoE, and quant backend selection.
- Record env vars such as `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MOE`,
`VLLM_ROCM_USE_AITER_MLA`, `VLLM_ROCM_USE_AITER_MHA`,
`VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`, and `VLLM_ROCM_USE_AITER_TRITON_GEMM`.
- For MLA/attention, confirm prefill/decode, block size, page table, FP8 KV cache,
and fallback.
- For MoE/quant, confirm whether the expert backend, topk, scale layout, and config JSON or autotune
table take effect.
SGLang focus:
- Record `SGLANG_USE_AITER`, `--attention-backend`, `--moe-runner-backend`,
quant mode, and model config.
- Confirm which of the AITER, Triton, FlashInfer, TRTLLM, and sgl-kernel C++ paths actually wins.
- For Triton attention, distinguish decode, extend/prefill, target verify, MLA,
FP8 KV cache, and page/cache layout.
- For Triton MoE, confirm `moe_runner`, tuned config, topk, expert count, block shape,
EP/TP/DP shape, and fallback.
Acceptable backend proof includes logs, temporary instrumentation, profiler kernel names, a microbenchmark
that calls the target wrapper directly, or Triton cache/dump artifacts.
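Part of the backend proof is simply a faithful record of the routing-related environment. The following sketch shows one way to capture it; the helper names, the prefix list, and the text layout of the proof file are assumptions made for illustration, not a format the skill prescribes.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Prefixes taken from the env vars named above; extend as needed.
BACKEND_ENV_PREFIXES = ("VLLM_ROCM_USE_AITER", "SGLANG_USE_AITER")


def snapshot_backend_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect backend-routing env vars in sorted order."""
    env = os.environ if env is None else env
    return {k: env[k] for k in sorted(env) if k.startswith(BACKEND_ENV_PREFIXES)}


def format_backend_proof(env_snapshot: Mapping[str, str],
                         kernel_names: Iterable[str]) -> str:
    """Render a plain-text proof body (hypothetical layout for backend-proof.txt)."""
    lines = ["# backend env"]
    lines += [f"{key}={value}" for key, value in env_snapshot.items()]
    lines.append("# profiler kernel names")
    lines += list(kernel_names)
    return "\n".join(lines) + "\n"
```

An empty snapshot is itself evidence: it shows that no AITER switch was set for the run, which should then be cross-checked against the kernel names the profiler reports.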
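## triton-kernel-knowledge
This skill turns questions into citable evidence. The priority order is:
1. Local target source: vLLM/SGLang source, tests, benchmarks, configs, the current diff, or the call
sites and harness of a directly specified file.
2. The local KernelPilot PR corpus, focusing on vLLM, SGLang, Triton, PyTorch ROCm,
AITER, FlashAttention, and FlashInfer.
3. Official documentation and upstream source, focusing on Triton, ROCm, DTK/DCU, vLLM, SGLang, and profilers.
4. CUDA/NVIDIA material serves only as cross-platform inspiration; it must be explicitly translated and validated on DCU.
Common corpus queries: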
```bash
python3 scripts/query.py "vllm rocm aiter triton <operator> <dtype>" --compact --limit 30
python3 scripts/query.py "sglang amd aiter triton <operator> <dtype>" --compact --limit 30
python3 scripts/query.py "triton amd backend rocm <operator> <symptom>" --compact --limit 30
python3 scripts/search-pr-diffs.py triton rocm <operator> --any --limit 100
```
Record each finding in this format:
```text
Route:
Source path or URL:
Commit/version/date:
Relevant file/function/config/test:
Finding:
Impact on plan:
Portability: direct DCU evidence | ROCm upstream evidence | cross-platform idea
License/notice if copied or adapted:
```
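The finding template above maps naturally onto a small record type that also rejects unknown portability labels. This is a sketch only; the `Finding` class and its field names are hypothetical, while the rendered labels and the three portability values are copied from the template.

```python
from dataclasses import dataclass

# The three portability levels named in the template.
PORTABILITY_LEVELS = (
    "direct DCU evidence",
    "ROCm upstream evidence",
    "cross-platform idea",
)


@dataclass
class Finding:
    route: str
    source: str
    version: str
    relevant: str
    finding: str
    impact: str
    portability: str
    license_notice: str = "n/a"

    def render(self) -> str:
        """Render the finding in the template's field order."""
        if self.portability not in PORTABILITY_LEVELS:
            raise ValueError(f"unknown portability level: {self.portability!r}")
        return "\n".join([
            f"Route: {self.route}",
            f"Source path or URL: {self.source}",
            f"Commit/version/date: {self.version}",
            f"Relevant file/function/config/test: {self.relevant}",
            f"Finding: {self.finding}",
            f"Impact on plan: {self.impact}",
            f"Portability: {self.portability}",
            f"License/notice if copied or adapted: {self.license_notice}",
        ])
```

Forcing the portability field into one of the three values keeps CUDA-derived ideas from being recorded as if they were DCU evidence.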
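### Rules for Porting CUDA Material
What can usually be borrowed directly are Triton language-level patterns, such as online softmax, masking,
stride addressing, shape-keyed configs, and microbenchmark style.
What must be re-validated includes tiling, pipeline depth, dot layout, FP8/FP4, and cache behavior.
What is not portable by default includes Nsight metrics, PTX/SASS conclusions, TMA, WGMMA, CUDA shared-memory
bank rules, and NVIDIA device-name config tables.
Write the concept translation out explicitly: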
```text
SM -> CU
warp -> wavefront
shared memory -> LDS
tensor core -> MFMA/MMAC or the target ROCm backend
SASS/PTX -> AMDGPU ISA/code object
Nsight Compute -> hipprof/rocprofv3/rocprof-compute/DCU profiler
```
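## triton-dcu-profiler-report
This skill profiles and explains the bottlenecks of Triton JIT kernels on DCU. Its rule is:
```text
prove the hot path is the target Triton kernel -> profile -> inspect Triton IR/ISA when needed -> give exactly one next edit
```
Do not just write "memory-bound" or "launch overhead". Write it as:
```text
what signal was measured -> what the likely mechanism is -> why other explanations are weaker -> exactly where to change next
```
### When to Invoke
- The baseline benchmark has passed, but there is no Triton/DCU profile digest yet.
- A correctness-passing candidate has just finished its benchmark.
- Candidate and baseline are within +/-2% or inside the noise band.
- The second correct candidate improves less than 5% over its parent or baseline.
- An important shape regresses.
- The backend or target Triton kernel is still uncertain.
- JIT compile time, Triton cache churn, dynamic shapes, or graph capture are suspected of affecting the data.
- The next step may involve `BLOCK_*`, `num_warps`, `num_stages`, `waves_per_eu`, dot layout,
vectorization, LDS, fusion, or backend routing.
### Required Artifacts
Default location: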
```text
.humanize/triton-agent/profile-artifacts/<version>/
device-status.txt
env.txt
backend-proof.txt
benchmark.log
hipprof.txt
rocprofv3/
rocprof-compute/
hipprof-pmc-all/
triton-cache/
triton-dumps/
amdgpu-isa.txt
code-object-metadata.txt
resource-usage.txt
digest.md
```
Before profiling, run `hy-smi` or `rocm-smi` in the same environment, pick an idle card, and pin it with
`HIP_VISIBLE_DEVICES=<idle-card>` or `HSA_VISIBLE_DEVICES=<idle-card>`.
### Common Command Shapes
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v000_baseline
hy-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1 || \
rocm-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1
HIP_VISIBLE_DEVICES=<idle-card> <benchmark-command> \
2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
HIP_VISIBLE_DEVICES=<idle-card> hipprof <benchmark-command> \
2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/hipprof.txt
```
Triton dump example:
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-dumps
mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-cache
HIP_VISIBLE_DEVICES=<idle-card> \
TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/v001/triton-cache \
MLIR_ENABLE_DUMP=1 \
AMDGCN_ENABLE_DUMP=1 \
<benchmark-command> \
> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stdout.txt \
2> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stderr.txt
```
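If the current DTK/ROCm/Triton does not support a flag or counter, keep the failed command and its error output; do not substitute guesses for evidence.
## DCU Triton Tuning Discipline
When tuning configs, record every candidate in `.humanize/triton-agent/tuning-decisions.md`: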
```text
shape/dtype/backend/gfx
config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
correctness: pass/fail and tolerance
latency: p50/p90/mean, repeats, selected card
profile clue: launch/memory/LDS/resource/compute/dispatch
decision: keep/reject/inconclusive
```
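Promote a candidate only when all of these hold:
- Correctness passes for the shape family in the requested scope.
- The improvement exceeds the benchmark noise band.
- Important serving shapes show no unacceptable regression.
- Backend proof still shows that the target Triton kernel is used.
- Profiler or ISA evidence shows that the generated DCU code path is reasonable.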
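The five promotion conditions can be checked mechanically once the measurements exist. The sketch below is one possible encoding: the `Candidate` fields, the function name, and the `max_regression` threshold (the text only says "no unacceptable regression") are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    correctness_pass: bool
    baseline_p50_us: float
    candidate_p50_us: float
    noise_band: float              # e.g. 0.02 for a +/-2% band
    worst_shape_regression: float  # fractional slowdown on the worst important shape
    backend_proof_ok: bool
    codegen_evidence_ok: bool


def promotion_blockers(c: Candidate, max_regression: float = 0.0) -> List[str]:
    """Return the reasons a candidate cannot be promoted (empty list means promote)."""
    blockers = []
    if not c.correctness_pass:
        blockers.append("correctness")
    gain = (c.baseline_p50_us - c.candidate_p50_us) / c.baseline_p50_us
    if gain <= c.noise_band:
        blockers.append("gain within noise band")
    if c.worst_shape_regression > max_regression:
        blockers.append("shape regression")
    if not c.backend_proof_ok:
        blockers.append("backend proof")
    if not c.codegen_evidence_ok:
        blockers.append("codegen evidence")
    return blockers
```

Returning the list of blockers rather than a bare boolean makes the `decision: keep/reject/inconclusive` line in the tuning record easier to justify.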
## Recommended Prompt Templates
```text
@triton-kernel-agent-loop
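Framework: vLLM
Host repo path: /path/to/vllm
Validation container: <container-name>
Repo path inside container: /workspace/vllm
Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing framework path or PyTorch reference
Performance target: p50 latency improvement must exceed the benchmark noise band
Requirements:
- First prove that the current request actually selects the Triton backend rather than AITER, FlashInfer, TRTLLM, or a C++ fallback.
- Keep all loop state and temporary evidence under `.humanize/triton-agent/`.
- Run correctness verification before the benchmark.
- Use `hy-smi` or `rocm-smi` to pick an idle card before every performance run.
- When benchmark results are close to baseline, below target, or regressing, use `hipprof`, `rocprofv3`, Triton cache, IR, or ISA evidence to justify the next edit.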
```
When pointing directly at a file, you can write:
```text
@triton-kernel-agent-loop
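Target mode: direct-file
Target file: /path/to/project/kernels/my_triton_kernel.py
Target function or wrapper: <kernel_name_or_wrapper>
Project root or workdir: /path/to/project
Validation container: <container-name>
Workdir inside container: /workspace/project
Task: optimize the Triton kernel in this file on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing Python/PyTorch reference, test, or small-scale oracle
Workload: shape, dtype, and layout distribution, plus a representative benchmark command
Performance target: p50 latency improvement must exceed the benchmark noise band
Requirements:
- First locate the `@triton.jit` function, launch wrapper, grid, config, caller, and harness.
- If no harness exists, first create a temporary correctness/benchmark harness under `.humanize/triton-agent/`.
- Use profiler kernel names, Triton cache/dumps, or temporary instrumentation to prove the direct call really enters the target Triton kernel.
- Keep all loop state and temporary evidence under `.humanize/triton-agent/`.
- Run correctness verification before the benchmark.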
```
 {
   "name": "humanize",
-  "description": "Humanize packaged with LightOp/DCU operator-loop skills for adding, optimizing, benchmarking, and profiling fused ROCm operators with review-gated iteration.",
+  "description": "Humanize packaged with KernelPilot DCU skills for LightOp HIP/C++ operators and Triton kernels in vLLM, SGLang, or direct Python files, with benchmark/profile evidence and review-gated iteration.",
   "version": "1.17.0",
   "author": {
     "name": "PolyArch"
@@ -18,6 +18,10 @@
     "lightop",
     "dcu",
     "rocm",
-    "hip"
+    "hip",
+    "triton",
+    "direct-file",
+    "vllm",
+    "sglang"
   ]
 }
@@ -220,8 +220,6 @@ log "codex config dir: $CODEX_CONFIG_DIR"
 log "runtime root: $RUNTIME_ROOT"
 log "hooks file: $HOOKS_FILE"
-require_native_hooks_support
 if [[ "$DRY_RUN" == "true" ]]; then
     log "DRY-RUN merge $HOOKS_TEMPLATE -> $HOOKS_FILE"
     if [[ "$ENABLE_FEATURE" == "true" ]]; then
@@ -230,6 +228,8 @@ if [[ "$DRY_RUN" == "true" ]]; then
     exit 0
 fi
+require_native_hooks_support
 merge_hooks_json "$HOOKS_FILE" "$HOOKS_TEMPLATE" "$RUNTIME_ROOT"
 enable_feature "$CODEX_CONFIG_DIR"
......
@@ -46,6 +46,10 @@ SKILL_NAMES=(
     # lightop-kernel-agent-loop and dcu-profiler-report.
     "humanize-kernel-agent-loop"
     "ncu-report"
+    # Independent Triton/DCU skills for vLLM, SGLang, and direct Triton files.
+    "triton-kernel-agent-loop"
+    "triton-kernel-knowledge"
+    "triton-dcu-profiler-report"
 )
 KERNEL_KNOWLEDGE_SKILL_NAME="lightop-kernel-knowledge"
@@ -64,7 +68,7 @@ Options:
   --codex-skills-dir PATH Codex skills dir (default: ${CODEX_HOME:-~/.codex}/skills)
   --codex-config-dir PATH Codex config dir for hooks/config.toml (default: ${CODEX_HOME:-~/.codex})
   --command-bin-dir PATH Install helper command shims here (default: ~/.local/bin)
-  --kernelpilot-root PATH Root of the LightOp/DCU knowledge pack used by kernel-agent skill
+  --kernelpilot-root PATH Root of the KernelPilot knowledge pack used by kernel skills
   --dry-run Print actions without writing
   -h, --help Show help
 EOF
@@ -138,7 +142,7 @@ resolve_kernelpilot_root() {
 }
 validate_kernelpilot_root() {
-    [[ -n "$KERNELPILOT_ROOT" ]] || die "LightOp/DCU kernel-pilot root not found; run from the kernel-pilot/humanize checkout or pass --kernelpilot-root PATH"
+    [[ -n "$KERNELPILOT_ROOT" ]] || die "KernelPilot root not found; run from the kernel-pilot/humanize checkout or pass --kernelpilot-root PATH"
     [[ -d "$KERNELPILOT_ROOT" ]] || die "KernelPilot root is not a directory: $KERNELPILOT_ROOT"
     [[ -f "$KERNELPILOT_ROOT/knowledge/SKILL.md" ]] || die "LightOp kernel knowledge skill not found: $KERNELPILOT_ROOT/knowledge/SKILL.md"
 }
...@@ -312,17 +316,17 @@ install_codex_user_config() { ...@@ -312,17 +316,17 @@ install_codex_user_config() {
local user_config_file="$user_config_dir/config.json" local user_config_file="$user_config_dir/config.json"
local default_config_file="$runtime_root/config/default_config.json" local default_config_file="$runtime_root/config/default_config.json"
if [[ "$DRY_RUN" == "true" ]]; then
log "DRY-RUN seed Codex-friendly BitLesson config in $user_config_file"
return
fi
[[ -f "$default_config_file" ]] || die "missing default config: $default_config_file" [[ -f "$default_config_file" ]] || die "missing default config: $default_config_file"
if ! command -v python3 >/dev/null 2>&1; then if ! command -v python3 >/dev/null 2>&1; then
die "python3 is required to update Humanize user config for Codex installs" die "python3 is required to update Humanize user config for Codex installs"
fi fi
if [[ "$DRY_RUN" == "true" ]]; then
log "DRY-RUN seed Codex-friendly BitLesson config in $user_config_file"
return
fi
mkdir -p "$user_config_dir" mkdir -p "$user_config_dir"
python3 - "$default_config_file" "$user_config_file" "$install_target" <<'PY' python3 - "$default_config_file" "$user_config_file" "$install_target" <<'PY'
......
 #!/usr/bin/env bash
 #
-# Install/upgrade LightOp/DCU KernelPilot Humanize for Claude Code.
+# Install/upgrade KernelPilot Humanize for Claude Code.
 #
 # Claude Code plugin installation copies the plugin into ~/.claude/plugins/cache
 # but does not hydrate SKILL.md placeholders. This wrapper performs the normal
@@ -21,7 +21,7 @@ DRY_RUN="false"
 usage() {
     cat <<'EOF'
-Install LightOp/DCU KernelPilot Humanize for Claude Code.
+Install KernelPilot Humanize for Claude Code.
 Usage:
   humanize/scripts/install-skills-claude.sh [options]
......
---
name: triton-dcu-profiler-report
description: "Profile and explain Triton JIT kernels running on DCU/ROCm/DTK in vLLM, SGLang, or a user-specified Triton Python file: capture device state, synchronized benchmark logs, hipprof/rocprofv3/rocprof-compute timing and counters, Triton cache/IR/compiler dumps, AMDGPU ISA or code-object resource evidence, then produce exactly one concrete next Triton/config/backend/call-site edit. Use when Triton benchmark numbers are ambiguous, close to baseline, regressing, below target, or need proof before the next optimization."
---
# Triton DCU Profiler Report
Use this skill when benchmark numbers are not enough and the next Triton kernel,
`triton.Config`, autotune table, framework backend, or direct call-site edit
should be driven by DCU evidence. This skill is independent from the LightOp
profiler skill and stores artifacts under `.humanize/triton-agent/`.
The rule:
```text
prove the hot Triton path -> profile -> inspect generated code when needed -> choose one next edit
```
The output must be an inference chain from measured evidence to a likely
mechanism to exactly one actionable edit. Do not stop at "memory-bound" or
"launch overhead".
## When To Invoke
Invoke this skill when any of these hold:
- Baseline benchmark passed but no Triton/DCU profile digest exists.
- A correctness-passing candidate has just been benchmarked.
- Candidate and baseline are within +/-2% or inside the declared noise band.
- The second correctness-passing optimization improves less than 5% over its
parent or baseline.
- A candidate regresses on important shapes.
- The selected backend or target Triton kernel is uncertain.
- JIT compile time, Triton cache churn, dynamic shapes, or graph capture may be
polluting benchmark results.
- The next edit might involve `BLOCK_*`, `num_warps`, `num_stages`,
`waves_per_eu`, dot layout, vectorization, LDS use, fusion, or backend
routing.
Do not profile correctness-failing candidates unless profiling is needed to
debug a profiler or compile failure.
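The numeric triggers in the list above (the +/-2% or noise-band window, the under-5% second step, and regressions) can be evaluated directly from benchmark numbers. This is a sketch under assumptions: `profile_triggers` and its parameter names are hypothetical, and it covers only the triggers that can be computed from latencies plus a digest flag.

```python
from typing import List, Optional


def profile_triggers(baseline_us: float,
                     candidate_us: float,
                     *,
                     parent_us: Optional[float] = None,
                     noise_band: float = 0.02,
                     is_second_passing_candidate: bool = False,
                     has_baseline_digest: bool = True) -> List[str]:
    """Return the reasons to invoke the profiler skill for this candidate."""
    band = max(noise_band, 0.02)  # "+/-2% or the declared noise band"
    triggers = []
    if not has_baseline_digest:
        triggers.append("no baseline profile digest")
    gain = (baseline_us - candidate_us) / baseline_us  # > 0 means faster
    if abs(gain) <= band:
        triggers.append("within noise band of baseline")
    elif gain < 0:
        triggers.append("regression versus baseline")
    # The 5% rule compares against the parent when one exists, else the baseline.
    reference_us = parent_us if parent_us is not None else baseline_us
    step_gain = (reference_us - candidate_us) / reference_us
    if is_second_passing_candidate and step_gain < 0.05:
        triggers.append("second candidate gained less than 5%")
    return triggers
```

An empty list does not mean profiling is forbidden; backend uncertainty or suspected JIT pollution are judgment calls that this function does not model.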
## Required Artifacts
Use the target repo, nearest project root, or user-provided direct-file workdir:
```text
.humanize/triton-agent/profile-artifacts/<version>/
device-status.txt
env.txt
backend-proof.txt
benchmark.log
hipprof.txt
rocprofv3/
rocprof-compute/
hipprof-pmc-all/
triton-cache/
triton-dumps/
amdgpu-isa.txt
code-object-metadata.txt
resource-usage.txt
digest.md
```
When comparing a candidate, cite the baseline or parent artifact path in
`digest.md`.
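The artifact layout above can be created up front so that missing evidence is visible before the digest is written. The sketch below assumes nothing beyond the listed names; the helper functions `init_artifacts` and `missing_artifacts` are hypothetical.

```python
from pathlib import Path
from typing import List

# File and directory names copied from the artifact tree above.
ARTIFACT_FILES = (
    "device-status.txt", "env.txt", "backend-proof.txt", "benchmark.log",
    "hipprof.txt", "amdgpu-isa.txt", "code-object-metadata.txt",
    "resource-usage.txt", "digest.md",
)
ARTIFACT_DIRS = (
    "rocprofv3", "rocprof-compute", "hipprof-pmc-all",
    "triton-cache", "triton-dumps",
)


def init_artifacts(workdir: str, version: str) -> Path:
    """Create the per-version artifact directory and its subdirectories."""
    base = Path(workdir) / ".humanize" / "triton-agent" / "profile-artifacts" / version
    for name in ARTIFACT_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base


def missing_artifacts(base: Path) -> List[str]:
    """List the expected files that have not been written yet."""
    return [name for name in ARTIFACT_FILES if not (base / name).is_file()]
```

Not every run needs every file; the point of `missing_artifacts` is to make the gaps explicit in `digest.md` rather than silent.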
## Device And Backend Gate
Before timing or profiling:
- Run `hy-smi` or `rocm-smi` in the same environment.
- Choose an idle card and pin `HIP_VISIBLE_DEVICES=<idle-card>` or
`HSA_VISIBLE_DEVICES=<idle-card>`.
- Record framework/project identity, Triton version, ROCm/DTK/HIP version,
device name, `gcnArchName`, relevant env vars, and backend or harness args.
- Prove the framework or direct harness selected the target Triton path with
logs, temporary instrumentation, Triton cache entries, or profiler kernel
names.
- Warm up enough to exclude JIT compile time unless compile time is the target.
## Workflow
1. Pick one representative shape first. Prefer the shape that exposes the
regression, plateau, launch overhead, or suspected bottleneck.
2. Confirm correctness and backend selection.
3. Capture normal benchmark output with explicit synchronization.
4. Capture first-pass `hipprof` or `rocprofv3` timing.
5. If timing is insufficient, collect supported counters with `hipprof --pmc`,
`rocprofv3`, or `rocprof-compute`.
6. If the issue is codegen-sensitive, collect Triton cache, MLIR/LLVM dumps,
code-object metadata, resource usage, and AMDGPU ISA.
7. Compare candidate against baseline or parent, not just absolute time.
8. Interpret using [metrics.md](references/metrics.md).
9. Use command variants from [examples.md](references/examples.md).
10. Use [triton-ir-isa.md](references/triton-ir-isa.md) when inspecting
generated code.
11. Write `digest.md`; the final section must contain exactly one next edit.
12. Update `.humanize/triton-agent/attempt-ledger.md` with digest path and
decision.
## Digest Template
```markdown
# Triton DCU Profile Digest
## Runtime Identity
- Framework or project:
- Commit:
- Target mode:
- Operator/kernel:
- Shape/dtype:
- Backend:
- Device/gfx:
- Versions:
- Command:
## Backend Proof
- Evidence:
- Kernel names:
- Caveats:
## Benchmark
- Warmup/repeats:
- p50/p90/mean:
- Baseline or parent:
- Delta:
- Noise band:
## Profiler Evidence
- API/kernel/memcpy timing:
- Top kernels:
- Launch overhead:
- Memory/cache:
- LDS/barrier:
- Compute/MFMA/MMAC:
- VGPR/SGPR/LDS/resource:
- Triton cache/dump/ISA:
## Diagnosis
- Most likely mechanism:
- Why alternative explanations are weaker:
- Risk:
## Next Edit
Exactly one concrete edit:
```
If a tool is unavailable, record the exact command, error, and what evidence is
missing. Do not silently replace missing evidence with speculation.
interface:
  display_name: "Triton DCU Profiler"
  short_description: "Profile Triton JIT kernels on DCU and choose the next edit"
  default_prompt: "Use $triton-dcu-profiler-report to turn DCU profiler, Triton IR, and ISA artifacts into one next framework or standalone Triton kernel edit."
# Triton/DCU Profiler Command Examples
Adjust paths and command names to the target environment. Preserve the exact
commands in `.humanize/triton-agent/profile-artifacts/<version>/`.
## First-Pass Capture
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v000_baseline
hy-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1 || \
rocm-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1
HIP_VISIBLE_DEVICES=<idle-card> <benchmark-command> \
2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
HIP_VISIBLE_DEVICES=<idle-card> hipprof <benchmark-command> \
2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/hipprof.txt
```
## rocprofv3 Capture
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v001/rocprofv3
HIP_VISIBLE_DEVICES=<idle-card> rocprofv3 \
--hip-trace --kernel-trace --stats \
-d .humanize/triton-agent/profile-artifacts/v001/rocprofv3 \
<benchmark-command>
```
If the installed DTK/ROCm uses different flags, run `rocprofv3 --help`, save
the command used, and note the version in the digest.
## Counter Capture
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
-o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc \
<benchmark-command>
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
-o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc-read \
<benchmark-command>
HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
-o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc-write \
<benchmark-command>
```
Some DTK/ROCm versions expose counters differently. Keep failed command output
as evidence instead of guessing.
## Triton Dump Capture
```bash
mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-dumps
mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-cache
HIP_VISIBLE_DEVICES=<idle-card> \
TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/v001/triton-cache \
MLIR_ENABLE_DUMP=1 \
AMDGCN_ENABLE_DUMP=1 \
<benchmark-command> \
> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stdout.txt \
2> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stderr.txt
```
Use the exact dump variables supported by the installed Triton/ROCm build. If
the variables do nothing, save that result and inspect `TRITON_CACHE_DIR`.
## End-To-End Serving Proof
Microbenchmarks can lie when routing, graph capture, or batching differs from
serving. For serving-sensitive changes, pair the microbenchmark with a small
framework-level request benchmark and record:
```text
model
server args
backend args
request mix
prefill/decode split
latency or throughput metric
target kernel presence in profiler
```
## Direct-File Harness Proof
When the user provides a standalone Triton file, keep the harness command and
proof local to `.humanize/triton-agent/`:
```bash
HIP_VISIBLE_DEVICES=<idle-card> python .humanize/triton-agent/bench_<kernel>.py \
--shape <shape> --dtype <dtype> \
2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
```
Record the wrapper name, grid, constexpr/config values, correctness tolerance,
and profiler or Triton-cache evidence that the target JIT function compiled and
ran.
# Metrics And Interpretation
Use this checklist when writing `digest.md`.
## Runtime Identity
- Framework or project, target mode, commit, local diff state.
- Operator, wrapper, Triton JIT function, profiler kernel name.
- Shape, dtype, layout, strides, sequence length, topk, expert count, block
size, quant mode, and cache/page layout.
- Device name, `gcnArchName`, DTK/ROCm/HIP/PyTorch/Triton/AITER versions.
- Backend env vars, direct harness args, and CLI args.
## Timing
- JIT compile time versus warmed kernel time.
- p50, p90, mean, min, max, and repeat count.
- CPU-side routing, allocation, graph capture, or sync overhead.
- HIP API time, kernel time, memcpy/memset time, and launch overhead.
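The timing summary listed above can be produced with the standard library alone. The following is a minimal sketch, assuming the harness has already collected warmed per-iteration latencies in microseconds. `summarize_latencies` is a hypothetical helper name, not part of any framework.

```python
import statistics


def summarize_latencies(samples_us):
    """Summarize warmed per-iteration latencies given in microseconds."""
    if len(samples_us) < 2:
        raise ValueError("need at least two warmed samples")
    # quantiles(n=10) returns nine cut points; index 8 is the 90th percentile.
    deciles = statistics.quantiles(samples_us, n=10, method="inclusive")
    return {
        "repeats": len(samples_us),
        "min": min(samples_us),
        "p50": statistics.median(samples_us),
        "p90": deciles[8],
        "mean": statistics.fmean(samples_us),
        "max": max(samples_us),
    }
```

Keep JIT compile time out of `samples_us` and report it as a separate number, so the first-call cost does not inflate the mean or p90.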
## Kernel Launch Shape
- Grid dimensions, program ids, block sizes, and active program count.
- Tiny-kernel launch overhead relative to work.
- Number of programs relative to CUs.
- Divergent mask or page-table branches.
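To put a number on "programs relative to CUs", the sketch below assumes a 2D tiled launch with one program per output tile. The CU count is an input. On a PyTorch ROCm build it can usually be read from `torch.cuda.get_device_properties(0).multi_processor_count`, but verify that on the target DCU. `launch_shape` is a hypothetical helper, and the rounds model ignores that a CU can hold several programs at once.

```python
def cdiv(a, b):
    """Ceiling division, matching how a Triton grid is usually sized."""
    return (a + b - 1) // b


def launch_shape(m, n, block_m, block_n, num_cus):
    """Relate the program count of a 2D tiled launch to the CU count."""
    programs = cdiv(m, block_m) * cdiv(n, block_n)
    # Simplified model: one program per CU per round.
    full_rounds, tail_programs = divmod(programs, num_cus)
    return {
        "programs": programs,
        "programs_per_cu": programs / num_cus,
        "full_rounds": full_rounds,
        "tail_programs": tail_programs,
    }
```

A launch with fewer programs than CUs, or with a small tail after many full rounds, is a hint to check launch overhead or tile size before tuning the kernel body.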
## Memory Path
- Global load/store volume and redundant traffic.
- Coalescing and vector width.
- Alignment and padding.
- Cache reuse and page locality.
- Temporary tensors or unfused epilogues causing extra global traffic.
## LDS And Synchronization
- LDS allocation per program.
- Bank conflict clues when available.
- Barriers and staging overhead.
- Whether LDS use improves coalescing or only adds pressure.
## Compute Path
- Dot-heavy versus scalar/vector ALU-heavy.
- MFMA/MMAC instruction presence when expected.
- Conversion, dequantization, scale loads, and saturation overhead.
- FP32 accumulation cost versus required numerical tolerance.
- Epilogue fusion opportunities.
## Resource Pressure
- VGPR and SGPR count.
- LDS per block/program.
- Spills or scratch memory.
- Occupancy limiters.
- Over-unroll or too many live masks/pointers.
## Dispatch And Config
- Wrong backend selected.
- Wrong shape branch or config JSON selected.
- Autotune cache stale or too broad.
- Device-name config table misses DCU/gfx target.
- Dynamic shapes causing repeated compile or many cache entries.
## Diagnosis Discipline
A good diagnosis says:
```text
measured signal -> likely mechanism -> rejected alternatives -> one next edit
```
Examples:
```text
high launch overhead + tiny kernel + no memory pressure -> fuse with adjacent
epilogue or move work into existing Triton kernel.
same kernel time but worse p90 + many Triton cache entries -> stabilize shape
specialization or backend config selection.
lower occupancy + higher VGPR + no memory traffic reduction -> reduce block
tile/unroll or split live ranges.
high global traffic + separate dequant/store kernels -> fuse dequant or epilogue
if correctness and backend constraints allow.
```
# Triton IR, Cache, And ISA Evidence
Use this reference when profiler timing does not explain the next edit.
## Triton Cache
Set a task-local cache directory when reproducing:
```bash
TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/<version>/triton-cache
```
Record:
- number of cache entries before and after warmup;
- kernel source path/function when visible;
- config key and constexpr values;
- whether dynamic shapes generate repeated compilations;
- whether candidate and baseline compile to distinct code.
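Counting cache entries before and after warmup can be scripted. This is a minimal sketch, assuming the cache directory holds one subdirectory per compiled entry. That layout is an assumption about the installed Triton build, so confirm it against the real `TRITON_CACHE_DIR`. `snapshot_cache` and `cache_growth` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path


def snapshot_cache(cache_dir):
    """Return top-level entry names and a per-suffix file count."""
    root = Path(cache_dir)
    if not root.is_dir():
        return {"entries": set(), "by_suffix": {}}
    entries = {p.name for p in root.iterdir() if p.is_dir()}
    by_suffix = Counter(
        p.suffix or "<none>" for p in root.rglob("*") if p.is_file()
    )
    return {"entries": entries, "by_suffix": dict(by_suffix)}


def cache_growth(before, after):
    """Names of entries that appeared between two snapshots."""
    return sorted(after["entries"] - before["entries"])
```

Take one snapshot before the benchmark and one after warmup. If the set keeps growing across later iterations, suspect repeated compilation from dynamic shapes.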
## Compiler Dumps
Try the dump variables supported by the installed Triton/ROCm stack:
```bash
MLIR_ENABLE_DUMP=1
AMDGCN_ENABLE_DUMP=1
```
Capture stdout/stderr. If the variables are unsupported or silent, record that
and inspect `TRITON_CACHE_DIR`, code objects, or generated temporary files.
## ISA And Code Object
Use available tools in this order:
```bash
which dccobjdump || true
which llvm-objdump || true
which roc-objdump || true
which hipprof || true
```
Possible captures:
```bash
dccobjdump --inputs=<code-object-or-binary> --show-sass --show-instruction-encoding \
--separate-functions > .humanize/triton-agent/profile-artifacts/<version>/amdgpu-isa.txt
hipprof --codeobj-analyze <code-object-or-binary> \
> .humanize/triton-agent/profile-artifacts/<version>/resource-usage.txt
```
Tool names and flags vary across DTK/ROCm versions. Preserve failures.
## What To Look For
- Expected MFMA/MMAC instructions for dot-heavy kernels.
- Excessive scalarization or address arithmetic.
- Scratch/spill loads and stores.
- Vector width and memory instruction pattern.
- LDS instructions and barriers.
- Conversion/dequantization instruction clusters.
- Register count and LDS usage that explain occupancy limits.
Treat ISA evidence as target-specific. Do not use AMD docs alone to claim a
Hygon/DCU instruction path; verify with the compiled artifact when possible.
---
name: triton-kernel-agent-loop
description: "Run an autonomous DCU/ROCm Triton-kernel optimization loop inside vLLM, SGLang, or a user-specified Triton Python file: recover operator semantics, framework or direct-call dispatch, Triton JIT/autotune configs, correctness references, workloads, DCU/gfx environment, benchmark and profile evidence, then implement scoped Triton or routing changes with reproducible tests. Use for optimizing vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, sampler, routing, cache, small JIT kernels, or standalone Triton files on DCU. Do not use for LightOp HIP/C++ operators; use the LightOp skills for that."
---
# Triton Kernel Agent Loop
Use this flow when the user wants to optimize or add a Triton JIT kernel inside
vLLM, SGLang, or a user-specified Triton Python file for DCU/ROCm/DTK. This
skill is independent from the LightOp skills. It may reuse the KernelPilot
evidence corpus and Humanize runtime, but it must not assume a LightOp checkout,
LightOp wrappers, or LightOp tests.
Default target: Triton Python kernels and framework dispatch code in vLLM or
SGLang. Prefer local framework patterns, AITER/Triton backend contracts,
ROCm/DCU profiler evidence, and target-compiled Triton IR/ISA over generic
CUDA or NVIDIA tuning advice. When the user gives a specific file path, operate
in direct-file mode and treat that file, its call sites, and its harness as the
source of truth.
## Input Contract
Recover or define these before implementation:
```text
K: kernel/operator semantics, tensor shapes, strides, layouts, dtype, masks,
   causal/window flags, quant scales, cache/page tables, expert routing, and
   output contract.
R: correctness reference, usually framework native path, PyTorch reference,
   existing Triton path, AITER path, or a small literal oracle for edge cases.
W: workload distribution: model scenario, prefill/decode/batch mix, seq lengths,
   hidden/head dims, experts/topk, quant mode, block sizes, contiguity,
   target gfx/DCU, latency/throughput metric, and success threshold.
E: execution environment: host or Docker, framework root in that environment,
   selected DCU, DTK/ROCm/PyTorch/Triton/AITER versions, build/install command,
   test command, benchmark command, profiler command, and pass threshold.
B: backend or call-site contract: framework, direct-file, or scratch mode;
   V0/V1/runtime path, env flags, CLI args, backend registry branch, fallback
   branch, direct wrapper/harness, Triton cache/JIT config, and proof that the
   target Triton kernel is the hot path.
P: profiling contract: artifact directory, benchmark stability rule,
   profiler depth, required Triton IR/ISA dumps, and the minimum evidence
   needed before the next edit.
```
Ask the user only when the missing value cannot be inferred from the target
repo or local benchmark/test files and the assumption would change correctness,
backend selection, or the performance target.
## Locate The Target
Support three target modes:
```text
framework: a vLLM or SGLang checkout with backend routing to prove.
direct-file: one or more user-specified Triton Python files and their call sites.
scratch: a new reproducer or experiment outside a framework.
```
For direct-file mode, locate the nearest project root if one exists, but do not
require vLLM/SGLang markers. Use the user-provided file path as the target, then
recover imports, `@triton.jit` functions, launch wrappers, call sites, tests,
and benchmark harnesses around it.
Direct-file search:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\\[grid\\]" <target-file-or-dir>
rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
```
For framework mode, operate on a vLLM or SGLang checkout. Identify the root
before editing.
vLLM root markers:
```text
pyproject.toml
vllm/
tests/
benchmarks/
```
SGLang root markers:
```text
pyproject.toml
python/sglang/
sgl-kernel/
test/
benchmark/
```
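The root markers above can be checked mechanically before editing. The sketch below walks upward from a starting path and treats the package directory (`vllm/` or `python/sglang/`) as the deciding marker, then reports which of the other listed markers are missing. That weighting is an assumption made for illustration, and `detect_target_root` is a hypothetical helper.

```python
from pathlib import Path

# framework -> (deciding package directory, other markers listed above)
MARKERS = {
    "vllm": ("vllm", ("pyproject.toml", "tests", "benchmarks")),
    "sglang": ("python/sglang", ("pyproject.toml", "sgl-kernel", "test", "benchmark")),
}


def detect_target_root(start):
    """Return (mode, root, missing_markers) for the nearest framework root."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        for framework, (package_dir, others) in MARKERS.items():
            if (candidate / package_dir).is_dir():
                missing = [m for m in others if not (candidate / m).exists()]
                return framework, candidate, missing
    # No framework markers found: fall back to direct-file mode.
    return "direct-file", None, []
```

A non-empty `missing` list does not rule the checkout out. Record it in `backend-map.md` so a reviewer can see how the local tree differs from the expected layout.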
Useful searches:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|VLLM_ROCM|AITER|current_platform\.is_rocm|gcnArchName|triton_mla|fused_moe" vllm tests benchmarks
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|SGLANG_USE_AITER|attention_backend|moe_runner|is_rocm|is_hip|fused_moe|triton_ops" python/sglang sgl-kernel test benchmark
```
Load [framework-entrypoints-vllm-sglang.md](references/framework-entrypoints-vllm-sglang.md)
when selecting files, backend flags, tests, or benchmarks.
## Execution Environment
Run build/install, correctness, benchmark, and profiling in one consistent
environment. If the user names a Docker container or image, treat that as part
of the acceptance contract. Prefer non-interactive commands:
```bash
docker exec <container> bash -lc 'cd <repo-in-container> && <command>'
```
Record before the first serious benchmark:
- Framework root from the command's point of view.
- Container/image or `direct-host`.
- `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
- `hy-smi` or `rocm-smi` output and selected idle card.
- `torch`, `torch.version.hip`, `triton`, AITER if installed, DTK/ROCm/HIP,
device name, and `gcnArchName`.
- `PYTORCH_ROCM_ARCH`, relevant backend env vars, and CLI backend flags.
- Exact install, import smoke, correctness, benchmark, profiler, and dump
commands.
- Warmup/repeat counts, p50/p90/mean policy, acceptable noise band, and the
minimum delta that counts as an optimization.
Useful probe:
```bash
python - <<'PY'
import importlib.util, os, torch
print("torch:", torch.__version__)
print("hip:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))
print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
for name in ("triton", "aiter", "vllm", "sglang"):
    spec = importlib.util.find_spec(name)
    if spec is None:
        print(name + ": not installed")
    else:
        mod = __import__(name)
        print(name + ":", getattr(mod, "__version__", getattr(mod, "__file__", "unknown")))
print("HIP_VISIBLE_DEVICES:", os.getenv("HIP_VISIBLE_DEVICES"))
print("HSA_VISIBLE_DEVICES:", os.getenv("HSA_VISIBLE_DEVICES"))
PY
hipcc --version || true
```
## Backend Or Call-Site Gate
Do not optimize a Triton kernel until you have proof that the request actually
selects that Triton kernel.
For direct-file mode:
- Record the target file, `@triton.jit` function, launch wrapper, grid function,
and any `@triton.autotune` or manual config table.
- Identify the caller or harness that passes real tensors into the kernel.
- If no harness exists, create a temporary correctness/benchmark harness under
`.humanize/triton-agent/` before editing the target file.
- Prove the direct call reaches the target kernel through profiler kernel names,
Triton cache entries, temporary print/logging, or a minimal reproducer.
- Treat framework backend flags as optional unless the file is still called
from a framework path.
For vLLM:
- Record V0/V1 path and attention/MoE/quant backend selection.
- Capture relevant `VLLM_ROCM_*`, `VLLM_USE_*`, and quantization env vars.
- Check whether AITER, FlashAttention, FlashInfer, CUTLASS/TRTLLM, or Triton
wins the dispatch branch.
- For Triton MLA/attention, confirm prefill/decode mode, block size, page table
layout, FP8 KV cache, and fallback behavior.
- For MoE/quantization, confirm expert backend, topk, group quant scales,
per-token/per-channel/per-block scale layout, and whether config JSON or
autotune tables are active.
For SGLang:
- Record `--attention-backend`, `--moe-runner-backend`, `SGLANG_USE_AITER`, and
model quantization mode.
- Check whether AITER, Triton, FlashInfer, TRTLLM, or sgl-kernel C++ paths win.
- For Triton attention, separate decode, extend/prefill, target verify, MLA,
FP8 KV cache, and page/cache layout.
- For Triton MoE, confirm `moe_runner`, tuned config file, expert parallelism,
topk, dtype, block shape, and fallback branch.
Backend proof can come from logs, explicit instrumentation, profiler kernel
names, a minimal reproducer that calls the target function, or framework debug
output. If no proof exists, add temporary local instrumentation under
`.humanize/triton-agent/` or use profiler kernel-name filtering before editing.
## State And Artifacts
Store loop state in the target repo and do not mix it with LightOp artifacts:
```text
.humanize/triton-agent/refined-plan.md
.humanize/triton-agent/research-digest.md
.humanize/triton-agent/backend-map.md
.humanize/triton-agent/attempt-ledger.md
.humanize/triton-agent/kernel_opt_readme.md
.humanize/triton-agent/lineage.jsonl
.humanize/triton-agent/performance-map.json
.humanize/triton-agent/tuning-decisions.md
.humanize/triton-agent/profile-artifacts/
.humanize/triton-agent/triton-artifacts/
```
Load [state-and-artifacts.md](references/state-and-artifacts.md) for templates.
## Workflow
Stage 1: inspect and plan.
- Locate the target repo, direct file, or scratch workspace.
- Recover `K/R/W/E/B/P`.
- Identify wrapper, backend registry, dispatch branch, Triton JIT function,
config/autotune source, tests, benchmarks, and fallback path.
- In direct-file mode, identify the launch wrapper, caller/harness, config
surface, imports, reference path, and any project-level tests.
- Use `triton-kernel-knowledge` before the first serious implementation or
whenever backend evidence is thin.
- Write `.humanize/triton-agent/research-digest.md` and
`.humanize/triton-agent/backend-map.md`.
- Define a representative microbenchmark and, when relevant, an end-to-end
serving benchmark that proves the kernel change matters outside a toy call.
Stage 2: baseline.
- Run import smoke and backend-selection smoke.
- Run correctness against `R` before timing.
- Run a stable benchmark with explicit `torch.cuda.synchronize()`.
- Capture device status and actual Triton kernel names in `hipprof` or
`rocprofv3`.
- If the baseline is already ambiguous, invoke `triton-dcu-profiler-report`.
Stage 3: edit and tune.
- Make one main optimization hypothesis per attempt.
- Prefer config/autotune/dispatch edits before rewriting kernel structure when
profiler evidence points to a bad shape branch.
- For code edits, preserve framework public API and local style. Avoid broad
rewrites, unrelated backend churn, new dependencies, or NVIDIA-only concepts.
- Re-run correctness, benchmark, and backend proof after every candidate.
- For every correctness-passing candidate, record keep/reject/inconclusive
with timing, shape, config, backend, selected card, and profiler evidence.
- When a second correct candidate improves less than 5% over parent or
baseline, run deep profiling with Triton IR/ISA evidence before the next edit.
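The 5% gate in Stage 3 can be written down so it is applied the same way on every attempt. This sketch reads the rule as: trigger deep profiling once two correctness-passing candidates have each improved by less than the threshold over the parent. That reading is an assumption about an ambiguous sentence, and both function names are hypothetical.

```python
def improvement_pct(parent_us, candidate_us):
    """Positive when the candidate is faster than its parent."""
    return 100.0 * (parent_us - candidate_us) / parent_us


def needs_deep_profile(parent_us, correct_candidates_us, threshold_pct=5.0):
    """True once two correct candidates each improved by less than the threshold.

    Assumption: every latency in correct_candidates_us already passed
    correctness against R and was measured on the same card and shape.
    """
    weak = [
        c for c in correct_candidates_us
        if improvement_pct(parent_us, c) < threshold_pct
    ]
    return len(weak) >= 2
```

For example, with a 100 us parent, candidates at 98 us and 97 us trip the gate, while one at 80 us followed by one at 97 us does not.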
Stage 4: close out.
- Completion requires install/import smoke, targeted correctness, benchmark
against baseline, backend proof, and profile/dump evidence when results are
close, surprising, or below target.
- If the performance target is not reached, report current best, bottleneck
evidence, rejected lineages, and the next edit. Do not claim performance
completion.
## DCU Triton Rules
Load [triton-dcu-optimization.md](references/triton-dcu-optimization.md) before
changing `triton.Config`, block sizes, `num_warps`, `num_stages`,
`waves_per_eu`, dot layout, or memory tiling.
Core rules:
- Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade.
- Respect `PYTORCH_ROCM_ARCH` and target `gcnArchName`.
- Prefer `gfx928`, `gfx936`, `gfx938`, or the user-provided DCU/BW target over
NVIDIA architecture names.
- Translate CUDA terms explicitly: SM to CU, warp to wavefront, shared memory
to LDS, tensor core to MFMA/MMAC or the selected ROCm backend.
- Do not import Nsight/NVIDIA profiler conclusions as DCU evidence.
- Do not assume Hopper/Blackwell TMA, WGMMA, warp specialization, SASS, PTX,
or CUDA shared-memory bank rules.
- Borrow algorithmic Triton patterns from NVIDIA examples only after validating
the generated AMDGPU code object or profiler counters on the target DCU.
## Validation Commands
Use local framework tests first. Examples:
```bash
python -m pytest tests/kernels -q -k "<operator-or-backend>"
python benchmarks/kernels/<benchmark>.py <args>
python -m pytest test -q -k "<operator-or-backend>"
python benchmark/kernels/<benchmark>.py <args>
```
If no suitable test exists, create a minimal correctness and benchmark harness
under `.humanize/triton-agent/` first. Add a repo test only when the behavior is
stable enough to belong to the framework's test suite.
When ready to start Humanize RLCR from the target repo:
```bash
"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/triton-agent/refined-plan.md --yolo
```
If the runtime is unavailable, continue manually but keep the same state files,
profiling gates, and evidence discipline.
interface:
display_name: "Triton DCU Agent Loop"
short_description: "Optimize vLLM, SGLang, or standalone Triton kernels on DCU"
default_prompt: "Use $triton-kernel-agent-loop to optimize a vLLM, SGLang, or user-specified Triton file on DCU with correctness, benchmark, and profiling evidence."
# Framework Entrypoints For DCU Triton Work
Use this file after `triton-kernel-agent-loop` identifies the target framework
or direct-file mode. Prefer local source over this map when paths differ.
## Direct File Mode
Use direct-file mode when the user points at one or more Triton Python files
instead of a vLLM or SGLang checkout. The file path is the target. The nearest
project root is useful for imports, tests, and benchmarks, but it is not a
precondition.
Search commands:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]" <target-file-or-dir>
rg -n "<jit_function>|<wrapper>|<kernel_name>|<call_site>" <project-root-or-target-dir>
rg -n "pytest|unittest|do_bench|benchmark|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
```
Record:
- Target file, project root or standalone parent directory, and import path.
- `@triton.jit` function, launch wrapper, grid function, constexpr/config
surface, `@triton.autotune` keys, and cache behavior.
- Caller, harness, or proof that no harness exists yet.
- Correctness reference, tolerance, shape/dtype/layout workload, and benchmark
command.
- Any framework branch that still calls the file.
Minimal direct-call proof:
```text
1. A correctness/benchmark harness directly calls the wrapper with real tensors.
2. Profiler output, Triton cache, or dump artifact contains the target kernel.
3. Temporary instrumentation proves the wrapper and config branch were reached.
4. Warmed timing excludes first-run JIT compile time unless compile time is the target.
```
Create temporary harnesses under `.humanize/triton-agent/` when the repo has no
usable test or benchmark. Promote a harness to a real test only after the
correctness contract and workload are stable.
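A temporary harness needs to keep first-call JIT time apart from warmed timing. The following is a minimal host-timer sketch using only the standard library. `time_warmed` is a hypothetical helper, and `sync` defaults to a no-op so the sketch runs without a GPU. On a device, pass `sync=torch.cuda.synchronize` so each sample covers the kernel and not just the launch. Device-side event timing may be more precise and is not shown here.

```python
import time


def time_warmed(fn, warmup=10, repeats=50, sync=lambda: None):
    """Return (first_call_us, warmed_samples_us) for a zero-argument callable."""
    # The first call is timed on its own because it may include JIT compilation.
    sync()
    start = time.perf_counter()
    fn()
    sync()
    first_call_us = (time.perf_counter() - start) * 1e6

    # Untimed warmup iterations, followed by one sync to drain queued work.
    for _ in range(warmup):
        fn()
    sync()

    samples_us = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()  # wait for the device before stopping the clock
        samples_us.append((time.perf_counter() - start) * 1e6)
    return first_call_us, samples_us
```

Wrap the launch in a closure, for example `time_warmed(lambda: wrapper(x, y), sync=torch.cuda.synchronize)`, where `wrapper`, `x`, and `y` stand in for the real launch wrapper and its tensors.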
## vLLM
Common Triton/DCU search surface:
```text
vllm/envs.py
vllm/triton_utils/
vllm/attention/
vllm/v1/attention/
vllm/model_executor/layers/
vllm/model_executor/layers/fused_moe/
vllm/model_executor/layers/quantization/
vllm/model_executor/layers/rotary_embedding/
tests/kernels/
benchmarks/kernels/
```
Search commands:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|Triton|triton_mla|fused_moe|scaled_mm|fp8|fp4" vllm tests benchmarks
rg -n "attention_backend|backend registry|AttentionBackend|MLA|decode|prefill" vllm/attention vllm/v1/attention
```
Backend signals to record:
- `VLLM_ROCM_USE_AITER`
- `VLLM_ROCM_USE_AITER_PAGED_ATTN`
- `VLLM_ROCM_USE_AITER_LINEAR`
- `VLLM_ROCM_USE_AITER_MOE`
- `VLLM_ROCM_USE_AITER_RMSNORM`
- `VLLM_ROCM_USE_AITER_MLA`
- `VLLM_ROCM_USE_AITER_MHA`
- `VLLM_ROCM_USE_AITER_FP8BMM`
- `VLLM_ROCM_USE_AITER_FP4BMM`
- `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`
- `VLLM_ROCM_USE_AITER_TRITON_GEMM`
- `VLLM_ROCM_FP8_PADDING`
- `VLLM_ROCM_MOE_PADDING`
- `VLLM_ROCM_CUSTOM_PAGED_ATTN`
- `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT`
- `VLLM_ROCM_FP8_MFMA_PAGE_ATTN`
- quantization-specific vars such as W8A8 backend selectors in the local tree.
Typical target families:
- Triton MLA and decode attention: prove V0/V1 path, prefill/decode mode, page
table layout, block size, FP8 KV cache, and fallback backend.
- Fused MoE: prove expert implementation, routing/topk, dtype, quant scales,
block shape, config source, and whether LoRA or EP changes the path.
- Quantized linear/scaled MM: prove scale layout, group size, block sizes,
AITER/Triton selection, and whether padding or shape heuristics are active.
- Fused elementwise/norm/rope/cache kernels: prove call site, tensor layout,
graph capture constraints, and whether a framework C++ kernel already wins.
## SGLang
Common Triton/DCU search surface:
```text
python/sglang/srt/layers/attention/
python/sglang/srt/layers/attention/triton_ops/
python/sglang/srt/layers/moe/
python/sglang/srt/layers/moe/moe_runner/
python/sglang/srt/layers/quantization/
python/sglang/srt/lora/
python/sglang/jit_kernel/
sgl-kernel/
test/
benchmark/
```
Search commands:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
rg -n "decode_attention|extend_attention|MLA|KV cache|target_verify|moe_align|topk|tuning" python/sglang benchmark test
```
Backend signals to record:
- `SGLANG_USE_AITER`
- `--attention-backend`
- `--moe-runner-backend`
- quantization mode and model config
- EP/DP/TP mode when MoE routing or expert parallelism changes shape
- tuned Triton config file selected by model, dtype, device name, or block shape
Typical target families:
- Triton attention backend: separate decode, extend/prefill, target verify,
sliding window, MLA, and FP8 KV cache paths.
- Triton MoE runner: prove runner backend, tuned config, topk, expert count,
block shape, EP shape, and fallback.
- LoRA Triton kernels: prove rank, batching, and dynamic shape behavior.
- JIT diffusion or multimodal kernels: prove generated kernel path and compile
cache stability before tuning.
## Minimal Backend Proof Pattern
Use at least one of these before editing:
```text
1. Framework debug log or explicit backend print showing the selected backend.
2. Microbenchmark that directly calls the target wrapper and reports the
expected Triton kernel name in profiler output.
3. Temporary local instrumentation under .humanize/triton-agent/ that prints
the selected dispatch branch.
4. Triton cache/dump artifact whose source hash points to the target function.
5. End-to-end profiler trace with the target kernel dominating the relevant
request segment.
```
Remove temporary instrumentation before finalizing unless it is intentionally
converted into useful framework logging or tests.
# Triton Agent State And Artifact Templates
Keep Triton loop records under `.humanize/triton-agent/` in the target vLLM or
SGLang checkout, in the nearest project root for direct-file mode, or in the
user-provided workdir when the Triton file is standalone. Do not use
`.humanize/lightop-agent/` for Triton work.
## refined-plan.md
```markdown
# Refined Plan
## Target
- Framework:
- Repo path:
- Target mode:
- Target file:
- Project root/workdir:
- Commit:
- Operator/kernel:
- User goal:
## K/R/W/E/B/P
- K:
- R:
- W:
- E:
- B:
- P:
## Baseline Commands
- install:
- import smoke:
- backend proof:
- correctness:
- benchmark:
- profile:
## Candidate Plan
1. Baseline and backend proof.
2. First edit:
3. Validation:
4. Next gate:
## Stop Conditions
- Correctness:
- Performance:
- Profiling:
- User constraints:
```
## backend-map.md
```markdown
# Backend Map
## Framework Or Call-Site Routing
- Entry point:
- Backend registry:
- Env vars:
- CLI args:
- Direct caller/harness:
- Selected backend:
- Fallback backend:
## Triton Kernel
- Python file/function:
- JIT/autotune decorators:
- Config source:
- Grid function:
- Kernel name in profiler:
- Triton cache or dump path:
## Proof
- Command:
- Evidence:
- Caveats:
```
## attempt-ledger.md
```markdown
# Attempt Ledger
| Version | Hypothesis | Files | Correct | p50 us | p90 us | Delta | Profile | Decision |
| --- | --- | --- | --- | ---: | ---: | ---: | --- | --- |
| v000 | baseline | none | pass | | | | | baseline |
```
## kernel_opt_readme.md
```markdown
# Kernel Optimization Report
## Summary
- Best version:
- Baseline:
- Current best:
- Target:
- Selected card:
- Backend:
## Validation Matrix
| Test | Command | Result |
| --- | --- | --- |
## Performance Matrix
| Shape | Dtype | Backend | Config | Baseline us | Candidate us | Delta | Notes |
| --- | --- | --- | --- | ---: | ---: | ---: | --- |
## Profile Evidence
- Baseline artifact:
- Candidate artifact:
- Bottleneck:
- Next edit:
## Rejected Lineages
| Version | Reason |
| --- | --- |
```
## lineage.jsonl
Write one JSON object per benchmarked candidate:
```json
{"version":"v001","parent":"v000","hypothesis":"tune BLOCK_N for decode","correct":true,"decision":"keep","artifact":".humanize/triton-agent/profile-artifacts/v001/digest.md"}
```
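Appending records by hand tends to produce malformed lines. Here is a minimal sketch of an append helper that checks the fields used in the example record before writing. The allowed decision values are taken from the keep/reject/inconclusive wording used in the workflow, plus `baseline` from the ledger template. `append_lineage` and `read_lineage` are hypothetical names.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("version", "parent", "hypothesis", "correct", "decision", "artifact")
DECISIONS = {"baseline", "keep", "reject", "inconclusive"}


def append_lineage(path, record):
    """Validate one candidate record and append it as a single JSON line."""
    missing = [k for k in REQUIRED_FIELDS if k not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["decision"] not in DECISIONS:
        raise ValueError(f"unknown decision: {record['decision']!r}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_lineage(path):
    """Load every non-empty line of the lineage file as a dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```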
# Triton DCU Optimization Notes
Use these notes for DCU/ROCm Triton work. They adapt common Triton skill
patterns to DCU and intentionally avoid NVIDIA-only assumptions.
## First Questions
Before changing code, answer:
- Is the kernel memory-bound, launch-bound, LDS/resource-bound, dot/MFMA-bound,
or dispatch-bound?
- Is the selected Triton config correct for target `M/N/K`, sequence length,
topk, head dim, block size, and dtype?
- Is the benchmark measuring compile time, graph warmup, framework routing,
memory allocation, or the kernel itself?
- Does the profiler prove the target Triton kernel is hot?
## Tunable Surface
Prefer tuning in this order unless evidence says otherwise:
1. Backend dispatch and shape-specific config selection.
2. `BLOCK_M`, `BLOCK_N`, `BLOCK_K`, head/block/page dimensions.
3. `num_warps`, `num_stages`, `waves_per_eu`, `matrix_instr_nonkdim` when the
installed Triton AMD backend supports them.
4. Load/store vectorization, alignment, mask shape, contiguous layout, and
redundant global traffic.
5. Accumulator dtype, dot layout, dequant placement, and epilogue fusion.
6. Split, persistent, or stateful kernel designs, only when profiling shows
that launch overhead or intermediate memory traffic dominates.
## DCU-Specific Heuristics
Treat these as hypotheses to verify on the target DCU:
- Keep `num_stages` small on ROCm unless a deeper pipeline proves better.
Single-GEMM kernels often start at `num_stages=2`; fused attention or two-GEMM
loops often start at `num_stages=1`.
- Tune `waves_per_eu` instead of assuming NVIDIA warp occupancy rules.
- Balance `num_warps` against VGPR pressure and LDS use. More waves can lose
when spills or LDS bank pressure increase.
- For dot-heavy kernels, inspect generated ISA for the expected MFMA/MMAC path
before claiming compute utilization.
- For small decode kernels, launch overhead and framework dispatch can dominate;
consider fusion or routing changes only after profiler evidence.
- For page-table and KV-cache kernels, coalescing, page layout, and mask shape
often matter more than arithmetic.
- For MoE, separate token routing, alignment, expert grouping, GEMM, and
epilogue timing. A faster GEMM config can regress total MoE if routing or
padding grows.
- For FP8/FP4, validate data format, scale layout, saturation constants, and
ROCm/Triton support. Do not assume NVIDIA E4M3/E5M2 behavior maps exactly to
the current DTK target.
## Patterns Worth Borrowing Carefully
From general Triton skill material, these are portable when revalidated:
- Online softmax for attention.
- Boundary masks written once and reused.
- Stride-based addressing instead of contiguous assumptions.
- Shape-keyed config maps.
- Direct microbench functions around a JIT kernel.
- Fused norm, activation, scale, and store epilogues when they remove global
reads/writes.
- Dynamic launcher tiling based on sequence length, head dim, dtype, and topk.
These are not portable without DCU proof:
- Nsight Compute metrics.
- PTX/SASS conclusions.
- Hopper/Blackwell TMA, WGMMA, warp specialization, or CUDA shared-memory bank
rules.
- NVIDIA device-name config tables.
- CUDA-only FP4/FP8 assumptions.
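Of the portable patterns, a shape-keyed config map is the easiest to show without a GPU. The sketch below keys configs by gfx target, dtype, and a power-of-two bucket of `M`, and falls back to a default when there is no entry. Every config value here is a placeholder for illustration and has not been tuned on any DCU. `select_config`, `CONFIGS`, and `DEFAULT_CONFIG` are hypothetical names.

```python
def next_pow2(x):
    """Smallest power of two that is >= x (1 for x <= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


# Placeholder values only: replace them with configs measured on the target DCU.
DEFAULT_CONFIG = {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}
CONFIGS = {
    ("gfx936", "float16", 16): {"BLOCK_M": 16, "BLOCK_N": 64, "num_warps": 2, "num_stages": 1},
    ("gfx936", "float16", 128): {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2},
}


def select_config(gcn_arch_name, dtype, m):
    """Pick a config by (gfx, dtype, M bucket), falling back to the default."""
    # gcnArchName may carry feature suffixes such as ":sramecc+:xnack-".
    gfx = gcn_arch_name.split(":")[0]
    return CONFIGS.get((gfx, dtype, next_pow2(m)), DEFAULT_CONFIG)
```

Keying on the gfx target rather than a device name avoids the failure mode where a device-name table silently misses the DCU.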
## Config Sweep Discipline
When tuning configs, record every tested candidate in
`.humanize/triton-agent/tuning-decisions.md`:
```text
shape/dtype/backend/gfx
config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
correctness: pass/fail and tolerance
latency: p50/p90/mean, repeats, selected card
profile clue: launch/memory/LDS/resource/compute/dispatch
decision: keep/reject/inconclusive
```
Only promote a config when correctness passes, improvement exceeds the noise
band, no important serving shape regresses outside the accepted tradeoff,
backend proof still selects the target Triton kernel, and the generated DCU
code path is plausible from profiler or ISA evidence.
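The promotion rule above has five conditions, and it is easy to skip one under time pressure. This is a minimal sketch that returns the unmet conditions as a list, so an empty list means the config may be promoted. The field names are hypothetical, and the boolean evidence flags still have to be set by a person who has read the profiler or ISA output.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateResult:
    correct: bool
    baseline_p50_us: float
    candidate_p50_us: float
    noise_band_pct: float
    regressed_shapes: list = field(default_factory=list)
    backend_proof: bool = False
    code_path_plausible: bool = False


def promotion_blockers(result):
    """List every promotion condition that the candidate does not meet."""
    blockers = []
    gain_pct = (
        100.0
        * (result.baseline_p50_us - result.candidate_p50_us)
        / result.baseline_p50_us
    )
    if not result.correct:
        blockers.append("correctness failed")
    if gain_pct <= result.noise_band_pct:
        blockers.append("improvement within noise band")
    if result.regressed_shapes:
        blockers.append("serving shapes regressed: " + ", ".join(result.regressed_shapes))
    if not result.backend_proof:
        blockers.append("no proof the target Triton kernel is still selected")
    if not result.code_path_plausible:
        blockers.append("no profiler or ISA evidence for the generated code path")
    return blockers
```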
## Benchmark Shape Coverage
For attention:
```text
prefill: representative prompt lengths, batch sizes, head dims
decode: batch sizes, page/block sizes, kv lengths, topk or speculative paths
MLA: q/nope/pe dims, cache dtype, block size, split prefill/decode behavior
```
For MoE:
```text
tokens per expert distribution
topk and expert count
hidden/intermediate dims
quant mode and scale layout
small-batch decode and large prefill separately
EP/TP/DP shape when enabled
```
For quantized GEMM:
```text
M/N/K sweep around model shapes
scale granularity
padding and alignment
batch-invariant or graph-captured paths
```
---
name: triton-kernel-knowledge
description: "Find citable evidence for DCU/ROCm Triton kernel work in vLLM, SGLang, or a user-specified Triton Python file: inspect local framework source/tests/benchmarks or direct file call sites, query the KernelPilot PR corpus for vLLM/SGLang/Triton/ROCm/AITER changes, consult official Triton/ROCm/DTK/DCU docs, and translate CUDA/NVIDIA examples only as cross-platform inspiration. Use before implementing or tuning Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing kernels, standalone Triton files, or direct microbench harnesses on DCU."
---
# Triton Kernel Knowledge
Use this skill to turn a vLLM/SGLang or direct-file Triton-on-DCU question into
citable implementation evidence. This skill is independent from LightOp. It may
query the existing KernelPilot corpus, but its findings must be about Triton,
framework routing or direct call sites, ROCm/DCU tooling, or portable
algorithmic ideas.
## Evidence Priority
Use routes in this order unless the user asks for a different source:
1. Local target checkout or direct file: vLLM/SGLang source, tests, benchmarks,
config files, docs, current diffs, direct call sites, and harnesses.
2. KernelPilot local PR corpus and source maps for vLLM, SGLang, Triton,
PyTorch ROCm, AITER, AOTriton, Conch, FlagGems, Liger Kernel, Hugging Face
kernels, Triton-distributed, FlashAttention, FlashInfer, and related repos.
3. Official docs and upstream source for Triton, ROCm, DTK/DCU, vLLM, SGLang,
PyTorch ROCm, AITER, rocprof, rocprof-compute, and profiler tooling.
4. CUDA/NVIDIA examples only as cross-platform inspiration after translation.
Write findings into `.humanize/triton-agent/research-digest.md` or the user's
requested note before allowing them to drive code changes.
## Route A: Local Source
For a user-specified Triton file:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\\[grid\\]" <target-file-or-dir>
rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
rg -n "pytest|benchmark|do_bench|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
```
Record:
- Target file and project root, if any.
- Triton JIT function, launch wrapper, grid function, configs, and autotune keys.
- Caller, test, benchmark, or missing harness.
- Correctness reference and shape/dtype workload.
- Any framework path that still calls this file.
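As a complement to the `rg` commands above, the JIT functions in a direct file can be listed with a small standard-library scan. This is a minimal sketch, not part of the skill: the helper name `find_triton_kernels` and the `SAMPLE` source are hypothetical, and the line-based matching assumes decorators are written as `@triton.jit` or `@triton.autotune` at the start of a line.

```python
import re

# Matches a @triton.jit or @triton.autotune decorator line.
DECORATOR = re.compile(r"^\s*@triton\.(jit|autotune)\b")
# Matches the function definition that follows the decorators.
DEF = re.compile(r"^\s*def\s+(\w+)\s*\(")


def find_triton_kernels(source: str) -> list:
    """Return one record per function decorated with @triton.jit."""
    kernels = []
    pending = set()  # Triton decorators seen since the last def
    for lineno, line in enumerate(source.splitlines(), start=1):
        decorator = DECORATOR.match(line)
        if decorator:
            pending.add(decorator.group(1))
            continue
        definition = DEF.match(line)
        if definition:
            if "jit" in pending:
                kernels.append({
                    "name": definition.group(1),
                    "line": lineno,
                    "autotuned": "autotune" in pending,
                })
            pending = set()  # decorators apply only to the next def
    return kernels


# Hypothetical source text, used only to exercise the scan.
SAMPLE = '''
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 128}, num_warps=4)],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pass

def add(x, y):
    pass

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    pass
'''

if __name__ == "__main__":
    for kernel in find_triton_kernels(SAMPLE):
        print(kernel)
```

The scan only inventories kernels and autotune usage. Launch wrappers, grid functions, and callers still need the `rg` searches or manual reading.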
For vLLM:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|triton_mla|fused_moe|scaled_mm|fp8|fp4|attention backend" vllm tests benchmarks
rg -n "<operator>|<kernel>|<backend>|<env_var>" vllm tests benchmarks
```
For SGLang:
```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
rg -n "<operator>|<kernel>|<backend>|<env_var>" python/sglang sgl-kernel test benchmark
```
Record:
- Framework commit and local diff state.
- Public entry point and backend registry branch.
- Triton JIT function, launch wrapper, grid function, and config/autotune table.
- Correctness tests and benchmarks.
- Backend env vars or CLI args needed to select the target path.
- Fallback branch and competing backend such as AITER, FlashInfer, TRTLLM,
framework C++ kernels, or PyTorch native.
## Route B: KernelPilot Corpus
Run from `{{KERNELPILOT_ROOT}}/knowledge` when installed through KernelPilot,
or from the local `kernel-pilot/knowledge` checkout.
```bash
python3 scripts/query.py "vllm rocm aiter triton <operator> <dtype>" --compact --limit 30
python3 scripts/query.py "sglang amd aiter triton <operator> <dtype>" --compact --limit 30
python3 scripts/query.py "triton amd backend rocm <operator> <symptom>" --compact --limit 30
python3 scripts/query.py "aiter conch flaggems liger triton <operator>" --type source-reference --compact --limit 30
python3 scripts/search-pr-diffs.py triton rocm <operator> --any --limit 100
python3 scripts/get_page.py <page-id>
```
Load [sources-and-queries.md](references/sources-and-queries.md) for query
families, relevant PR themes, and citation shape.
When a result names an artifact bundle, inspect the underlying `review.diff`,
`source-snapshot`, `upstream.json`, and `ORIGIN.yaml` before copying or
adapting ideas.
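A quick completeness check on a bundle can be sketched as follows. The four artifact names are taken from the paragraph above; the helper name `missing_artifacts` is hypothetical, and the sketch assumes the artifacts sit directly inside one bundle directory, which may not match the real corpus layout.

```python
import tempfile
from pathlib import Path

# Artifact names come from the text above. Treating them as direct children
# of one bundle directory is an assumption made for this sketch.
REQUIRED_ARTIFACTS = ("review.diff", "source-snapshot", "upstream.json", "ORIGIN.yaml")


def missing_artifacts(bundle_dir) -> list:
    """Return the required artifact names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).exists()]


if __name__ == "__main__":
    # Demo on a throwaway directory that holds only two of the four artifacts.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "review.diff").write_text("")
        (Path(tmp) / "ORIGIN.yaml").write_text("")
        print(missing_artifacts(tmp))
```

An empty result only means the files exist. Their contents still have to be read before any idea is copied or adapted.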
## Route C: Official Docs And Upstream
Prefer current official documentation or upstream source for claims about:
- Triton language, `triton.Config`, `@triton.autotune`, compiler dumps, and AMD
backend options.
- ROCm/DTK profiler commands, counter availability, SQTT, rocprofv3,
rocprof-compute, and code object inspection.
- vLLM ROCm install, AITER integration, V1 attention backends, MoE, and
quantized kernels.
- SGLang AMD install, attention backend, MoE runner backend, AITER/Triton
selection, and quantization support.
- Hygon/DCU and SourceFind DTK tool behavior.
Useful official URLs:
```text
https://triton-lang.org/main/index.html
https://triton-lang.org/main/python-api/generated/triton.autotune.html
https://rocm.docs.amd.com/en/docs-6.3.3/how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.html
https://rocm.docs.amd.com/en/develop/how-to/rocm-for-ai/inference-optimization/workload.html
https://docs.vllm.ai/en/latest/getting_started/installation/gpu/
https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html
https://docs.sglang.io/docs/hardware-platforms/amd_gpu
https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
```
Browse or clone when the local corpus is stale, when the user asks for the
latest behavior, or when a claim is version-sensitive.
## Route D: Cross-Platform Inspiration
Use general Triton or CUDA/NVIDIA material only after classifying it:
```text
directly portable: Triton language pattern with no hardware-specific claim
needs DCU proof: tiling, pipeline depth, dot layout, FP8/FP4, cache behavior
not portable: Nsight metrics, PTX/SASS, TMA, WGMMA, CUDA bank rules
```
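The three classes above can be applied as a first-pass triage over research notes. This is a rough keyword heuristic under stated assumptions: the function name `classify` is hypothetical, the keyword lists only mirror the examples in the block above, and a "directly portable" result is a default that still needs human review.

```python
import re

# Keywords mirror the examples in the classification block; they are
# illustrative, not an exhaustive list of hardware-specific terms.
NOT_PORTABLE = ("Nsight", "PTX", "SASS", "TMA", "WGMMA")
NEEDS_DCU_PROOF = ("tiling", "pipeline depth", "dot layout", "FP8", "FP4", "cache")


def _mentions(note: str, terms) -> bool:
    """True if any term appears in the note as a whole word, ignoring case."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", note, re.IGNORECASE) for term in terms
    )


def classify(note: str) -> str:
    """Map a research note to one of the three portability classes."""
    if _mentions(note, NOT_PORTABLE):
        return "not portable"
    if _mentions(note, NEEDS_DCU_PROOF):
        return "needs DCU proof"
    # Default only: no known hardware-specific keyword was found.
    return "directly portable"


if __name__ == "__main__":
    for text in (
        "Use TMA loads with WGMMA",
        "Increase pipeline depth to 3 stages",
        "Mask the tail block with tl.load(mask=...)",
    ):
        print(f"{classify(text)}: {text}")
```

The strictest class wins when a note mentions several keywords, so a note that mixes a Triton pattern with a PTX detail is treated as not portable until it is split.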
Translate terms explicitly:
```text
SM -> CU
warp -> wavefront where appropriate
shared memory -> LDS
tensor core -> MFMA/MMAC or selected ROCm backend
SASS/PTX -> AMDGPU ISA/code object
Nsight Compute -> hipprof/rocprofv3/rocprof-compute/DCU profiler
```
Do not invent Hygon/DCU builtins, counter names, or ISA expectations from CUDA
mnemonics. Prefer evidence from compiling and profiling on the target.
## Research Digest Format
Every finding should contain:
```text
Route:
Source path or URL:
Commit/version/date when available:
Relevant file/function/config/test:
Finding:
Impact on plan:
Portability: direct DCU evidence | ROCm upstream evidence | cross-platform idea
License/notice if code is copied or adapted:
```
If a route is empty, record the search terms and why the absence matters. Thin
evidence should trigger another route before implementation.
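One way to keep digest entries uniform is to render them from a small record type. This is a minimal sketch: the class name `Finding`, its attribute names, and the `n/a` default are hypothetical, while the field labels and the three portability values are copied from the format above.

```python
from dataclasses import dataclass

# Allowed values copied from the "Portability" line of the digest format.
PORTABILITY = ("direct DCU evidence", "ROCm upstream evidence", "cross-platform idea")


@dataclass
class Finding:
    route: str
    source: str
    version: str
    relevant: str
    finding: str
    impact: str
    portability: str
    license_notice: str = "n/a"  # hypothetical default when nothing is copied

    def render(self) -> str:
        """Return the finding as a text block in the digest format."""
        if self.portability not in PORTABILITY:
            raise ValueError(f"unknown portability: {self.portability!r}")
        return "\n".join([
            f"Route: {self.route}",
            f"Source path or URL: {self.source}",
            f"Commit/version/date when available: {self.version}",
            f"Relevant file/function/config/test: {self.relevant}",
            f"Finding: {self.finding}",
            f"Impact on plan: {self.impact}",
            f"Portability: {self.portability}",
            f"License/notice if code is copied or adapted: {self.license_notice}",
        ])


if __name__ == "__main__":
    # Placeholder values only; fill them from the actual search results.
    example = Finding(
        route="A",
        source="<target-file>",
        version="<commit>",
        relevant="<kernel_name>",
        finding="<what the source shows>",
        impact="<what changes in the plan>",
        portability="direct DCU evidence",
    )
    print(example.render())
```

Rejecting unknown portability values at render time keeps the three evidence tiers from drifting as the digest grows.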
interface:
display_name: "Triton DCU Knowledge"
short_description: "Find Triton, ROCm, DCU, vLLM, and SGLang evidence"
default_prompt: "Use $triton-kernel-knowledge to research DCU evidence for a vLLM, SGLang, or direct-file Triton kernel optimization."