add triton-kernel-skill

Commit 60c75a2f authored May 28, 2026 by whlwhlwhl
20 changed files
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -3,12 +3,12 @@
  "owner": {
    "name": "BBuf"
  },
-  "description": "Claude Code marketplace for the Humanize LightOp/DCU operator development and optimization loop.",
+  "description": "Claude Code marketplace for Humanize DCU kernel development loops, including LightOp HIP/C++ and Triton vLLM/SGLang or direct-file optimization.",
  "plugins": [
    {
      "name": "humanize",
      "source": "./humanize",
-      "description": "Humanize plus LightOp/DCU operator workflows, dcu-profiler-report profiling, and review-gated iteration.",
+      "description": "Humanize plus LightOp/DCU and Triton/DCU kernel workflows for frameworks or direct files, profiler digests, and review-gated iteration.",
      "version": "1.17.0"
    }
  ]

--- a/README.md
+++ b/README.md
@@ -115,7 +115,8 @@ test/
 When adding a new operator, the agent usually inspects or modifies:
-- `lightop/csrc/<family>/`: HIP/C++ kernel and launcher
+- `lightop/csrc/<family>/`: HIP/C++ kernel and launcher; sources use `.cu`, do not hand-write
+  `.hip` operator files; any `.hip` file that appears is treated as an auto-generated build artifact
 - `lightop/csrc/export.cpp`: `m.def(...)` binding
 - `lightop/<op>.py`: Python wrapper
 - `lightop/__init__.py`: public API export
@@ -124,6 +125,21 @@ test/
 - benchmark script: performance testing
 - `lightop/config*.py`: change only when shape/gfx-aware dispatch is needed
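+All changes must follow LightOp's existing development conventions. Before writing code, the agent must
+first find the nearest implementation in the same family and follow its directory layout, file naming,
+C++ namespace/include/launch helper, wrapper argument validation, `export.cpp` binding,
+config/dispatcher, test, and benchmark style. Do not introduce unrelated dependencies, external project
+directory structures, bulk reformatting, generated sources, or changes to unrelated operator families
+unless the user explicitly asks for them and the plan states why. Before delivery, list which local
+LightOp files each modified file was modeled on, and confirm that there are no hand-written `.hip`
+sources, no unrelated changes, and no more than one final task test entry point under `test/`.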
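+The final validation script must live under `test/`. A new operator needs a `test/test_<op_name>.py`; when
+optimizing an existing operator, use the test file the user specifies, and infer or create
+`test/test_<op_name>.py` only when none is specified. Each task keeps exactly one official test entry
+point under `test/`; all other benchmark, candidate test, and parse/sweep scripts go under
+`.humanize/lightop-agent/`. The test script verifies accuracy first, then measures performance; the
+performance test uses a fixed 10 warmup iterations and 100 timed iterations, and reports the mean
+latency in us and the effective bandwidth. Present the final validation results as a short table.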
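Editor's sketch of the timing rule above (10 warmup iterations, 100 timed iterations, mean latency in us, effective bandwidth). It is not part of the commit. The function name `measure`, the `bytes_moved` argument, and the `sync` hook are assumptions for illustration; on a real device the caller would pass a device synchronization function as `sync` so that asynchronous launches are included in the timing.

```python
import time
from typing import Callable


def measure(fn: Callable[[], object], bytes_moved: int,
            warmup: int = 10, iters: int = 100,
            sync: Callable[[], None] = lambda: None) -> dict:
    """Time fn and derive mean latency (us) and effective bandwidth (GB/s).

    bytes_moved is the number of bytes one call reads plus writes; it is
    supplied by the caller because only the caller knows the operator.
    """
    # Warmup runs are excluded from timing (JIT, caches, clocks).
    for _ in range(warmup):
        fn()
    sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for queued work before stopping the clock
    elapsed_s = time.perf_counter() - start

    mean_us = elapsed_s / iters * 1e6
    gbps = bytes_moved / (mean_us * 1e-6) / 1e9 if mean_us > 0 else float("inf")
    return {"mean_us": mean_us, "gbps": gbps, "warmup": warmup, "iters": iters}
```

The returned dictionary maps directly onto one row of the short results table the paragraph asks for.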
 The LightOp KernelPilot build rule is fixed as:
 ```bash
@@ -137,8 +153,9 @@ docker exec <container> bash -lc 'cd <container-lightop> && python setup.py inst
 ```
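 Regardless of the PyTorch version, never switch to `setup_torch29.py`. During the normal tuning loop, do not delete
-`build/`, so that incremental build results can be reused; clean it only when the user explicitly requests a clean build or the cache
+`build/`, and do not delete `build/bdist.*`, `build/lib.*`, or `build/temp.*`, the
-is shown to be corrupted.
+subdirectories that `python setup.py install` normally generates; keep them for reuse so that each round does not trigger a full rebuild.
+Clean only when the user explicitly requests a clean build or the cache is shown to be corrupted.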
 ## DCU Profiling
@@ -264,8 +281,16 @@ Claude Code users can use:
 lightop-kernel-agent-loop
 lightop-kernel-knowledge
 dcu-profiler-report
+triton-kernel-agent-loop
+triton-kernel-knowledge
+triton-dcu-profiler-report
 ```
+For Triton/DCU usage notes, see [`docs/triton-skills.md`](docs/triton-skills.md)
+or the Chinese version [`docs/triton-skills.zh-CN.md`](docs/triton-skills.zh-CN.md).
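+The Triton skills support a framework mode for vLLM/SGLang as well as a direct-file mode in which the
+user points directly at a Triton Python file.
 ## Evidence Rules
 - Local LightOp source, tests, configs, and benchmarks are the highest-priority evidence.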

--- a/docs/triton-skills.md
+++ b/docs/triton-skills.md
+# Triton/DCU Skills
+Chinese reading version: [`triton-skills.zh-CN.md`](triton-skills.zh-CN.md).
+KernelPilot includes an independent Triton/DCU skill pack for optimizing Triton
+kernels inside vLLM, inside SGLang, or in a user-specified Triton Python file.
+It is separate from the LightOp skill pack and uses its own task state
+directory:
+```text
+.humanize/triton-agent/
+```
+## Skills
+| Skill | Purpose |
+| --- | --- |
+| `triton-kernel-agent-loop` | Main loop for vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing, small JIT kernels, and direct Triton files on DCU/ROCm. |
+| `triton-kernel-knowledge` | Evidence search for local vLLM/SGLang source, direct-file call sites and harnesses, KernelPilot PR corpus, Triton/ROCm/DTK/DCU docs, and portable cross-platform ideas. |
+| `triton-dcu-profiler-report` | DCU profiler digest for framework or standalone Triton JIT kernels, including backend/call-site proof, hipprof/rocprofv3/rocprof-compute evidence, Triton cache/IR dumps, and AMDGPU ISA/code-object clues. |
+## Open Kernel References
+The Triton knowledge route includes source-reference pages for high-value open
+Triton kernel libraries:
+```text
+ref-rocm-aiter
+ref-rocm-aotriton
+ref-stackav-conch
+ref-flaggems
+ref-liger-kernel
+ref-huggingface-kernels
+ref-triton-distributed
+```
+Use them as reference implementations or discovery routes, then validate
+correctness, benchmark, profiler names, Triton cache/IR, and DCU ISA locally
+before promoting any idea.
+## Install
+The standard installers include both LightOp and Triton skill packs:
+```bash
+./install-lightop-skills-manual.sh --target both
+./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
+```
+## Example Prompt: Framework Mode
+```text
+@triton-kernel-agent-loop
+Framework: vLLM
+Repo path: /path/to/vllm
+Container: <container-name>
+Repo path inside container: /workspace/vllm
+Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
+Target arch: gfx936 or gfx938
+Correctness reference: existing framework path or PyTorch reference
+Performance target: p50 latency improvement above benchmark noise band
+Requirements:
+- Prove that the Triton backend, not AITER/FlashInfer/TRTLLM/C++ fallback, is selected.
+- Store all loop state under .humanize/triton-agent/.
+- Run correctness before benchmark.
+- Use hy-smi or rocm-smi before performance runs.
+- Use hipprof/rocprofv3 and Triton cache/IR/ISA evidence when benchmark results are close or below target.
+```
+## Example Prompt: Direct File Mode
+```text
+@triton-kernel-agent-loop
+Target mode: direct-file
+Target file: /path/to/project/kernels/my_triton_kernel.py
+Target function or wrapper: <kernel_name_or_wrapper>
+Project root or workdir: /path/to/project
+Container: <container-name>
+Workdir inside container: /workspace/project
+Task: optimize this Triton kernel on DCU.
+Target arch: gfx936 or gfx938
+Correctness reference: existing Python/PyTorch reference, test, or oracle
+Workload: shape/dtype/layout distribution and representative benchmark command
+Performance target: p50 latency improvement above benchmark noise band
+Requirements:
+- First identify the @triton.jit function, launch wrapper, grid, configs, caller, and harness.
+- If no harness exists, create a temporary correctness/benchmark harness under .humanize/triton-agent/.
+- Prove the direct call reaches the target Triton kernel with profiler kernel names, Triton cache/dumps, or temporary instrumentation.
+- Store all loop state under .humanize/triton-agent/.
+- Run correctness before benchmark.
+```
--- a/docs/triton-skills.zh-CN.md
+++ b/docs/triton-skills.zh-CN.md
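+# Triton/DCU Skills: Chinese Guide
+This document is the Chinese reading version of `triton-kernel-agent-loop`, `triton-kernel-knowledge`, and
+`triton-dcu-profiler-report`, meant to help people understand the skills and write prompts. What actually
+gets installed for Codex/Claude to trigger is still the `SKILL.md` in each skill directory.
+This skill group is independent of the LightOp skills. It targets Triton JIT kernels in vLLM or SGLang
+and also supports a Triton Python file that the user specifies directly; it does not assume the current
+task is in a LightOp repository, and it does not use the `.humanize/lightop-agent/` state directory.
+## What Each of the Three Skills Does
+| Skill | When to use | Core output |
+| --- | --- | --- |
+| `triton-kernel-agent-loop` | Optimizing or adding a Triton kernel in vLLM/SGLang, or directly optimizing a specified Triton Python file, targeting DCU/ROCm/DTK | K/R/W/E/B/P, backend/call-site map, source changes, correctness, benchmark, profile, tuning log |
+| `triton-kernel-knowledge` | Evidence is needed from vLLM, SGLang, direct-file call sites, Triton, ROCm, AITER, or DTK/DCU | Research digest stating the evidence source, its applicability, and whether it transfers directly to DCU |
+| `triton-dcu-profiler-report` | The benchmark alone does not explain the bottleneck and the next step needs profiler, Triton IR, ISA, or code-object evidence | Profile digest that must end in one concrete next Triton/config/backend/call-site edit |
+## Open-Source Triton Kernel Reference Libraries Included
+`triton-kernel-knowledge` can now search these source-reference pages directly: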
+```text
+ref-rocm-aiter
+ref-rocm-aotriton
+ref-stackav-conch
+ref-flaggems
+ref-liger-kernel
+ref-huggingface-kernels
+ref-triton-distributed
+```
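+They are suitable as reference implementations or discovery entry points. Before using any of them for
+DCU optimization, you must still collect correctness, benchmark, profiler kernel name, Triton cache/IR,
+and DCU ISA evidence locally.
+By default, the process records for all Triton tasks live under the target vLLM/SGLang repository; for a
+directly specified file, they live under the nearest project root or the user-specified workdir: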
+```text
+.humanize/triton-agent/
+```
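+Do not mix it with LightOp's `.humanize/lightop-agent/`.
+## triton-kernel-agent-loop
+This is the main-loop skill. It should trigger when the user says "optimize vLLM's Triton MLA decode",
+"tune the config for the SGLang Triton MoE", "tune the Triton FP8 quant kernel on DCU", or "optimize the
+Triton kernel in `/path/to/kernel.py`".
+It is not suitable for LightOp HIP/C++ operator tasks. LightOp tasks continue to use
+`lightop-kernel-agent-loop`.
+### Input Contract K/R/W/E/B/P
+Before starting, recover or define six kinds of information:
+- `K`: kernel or operator semantics, including tensor shape, stride, layout, dtype, mask,
+  causal/window flag, quant scale, page table, KV cache, expert routing, and the output contract.
+- `R`: correctness reference, usually the existing framework path, a PyTorch reference, an existing
+  Triton path, an AITER path, or a small literal oracle.
+- `W`: workload distribution, including model scenario, prefill/decode/batch mix, seq length,
+  hidden/head dim, expert/topk, quant mode, target gfx/DCU, and latency/throughput metrics.
+- `E`: execution environment, including host or Docker, repo path inside the container, visible DCUs,
+  DTK/ROCm/PyTorch/Triton/AITER versions, and the install, test, benchmark, and profile commands.
+- `B`: backend or call-site contract, including vLLM/SGLang, direct-file, or scratch mode,
+  V0/V1 or runtime path, env flags, CLI arguments, backend registry branch, fallback branch,
+  the directly called wrapper/harness, Triton cache/JIT config, and proof that execution
+  actually reaches the target Triton kernel.
+- `P`: profiling contract, including the artifact directory, benchmark stability rules, profiler depth,
+  Triton IR/ISA dump requirements, and the minimum evidence required before the next edit.
+When information is missing, the agent should first infer it from the repository, tests, benchmarks, and
+configs. Ask the user only when the inference would change correctness, backend selection, or performance acceptance.
+### Locating the Target
+Three target modes are supported: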
+```text
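+framework: a vLLM or SGLang repository; backend routing must be proven.
+direct-file: the user directly specifies one or more Triton Python files; call sites and the harness must be recovered.
+scratch: a new reproducer or experiment outside any framework.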
+```
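+In direct-file mode, the file the user gives is the target. Locate the nearest project root first, for
+imports, tests, and benchmarks; if there is no project root, use the file's parent directory or the user-specified workdir.
+Common direct-file searches: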
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]" <target-file-or-dir>
+rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
+rg -n "pytest|benchmark|do_bench|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
+```
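Editor's sketch, not part of the commit: when `rg` is unavailable or a structural answer is preferred, the first search above can be approximated with Python's standard `ast` module. The helper name `find_triton_kernels` is hypothetical, and the sketch only recognizes decorators spelled exactly `triton.jit` or `triton.autotune`; aliased imports such as `from triton import jit` would need extra handling.

```python
import ast


def find_triton_kernels(source: str) -> list[tuple[str, int, str]]:
    """Return (function name, def line, decorator) for Triton-decorated functions."""
    targets = {"triton.jit", "triton.autotune"}
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # @triton.autotune(...) parses as a Call; @triton.jit is a bare attribute.
            expr = dec.func if isinstance(dec, ast.Call) else dec
            name = ast.unparse(expr)
            if name in targets:
                found.append((node.name, node.lineno, name))
    return sorted(found, key=lambda item: (item[1], item[2]))
```

Because the file is only parsed and never imported, this works on a host without Triton or a DCU installed.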
+Common vLLM root-directory markers:
+```text
+pyproject.toml
+vllm/
+tests/
+benchmarks/
+```
+Common SGLang root-directory markers:
+```text
+pyproject.toml
+python/sglang/
+sgl-kernel/
+test/
+benchmark/
+```
+Common searches:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|VLLM_ROCM|AITER|gcnArchName|triton_mla|fused_moe" vllm tests benchmarks
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|SGLANG_USE_AITER|attention_backend|moe_runner|fused_moe|triton_ops" python/sglang sgl-kernel test benchmark
+```
+### Backend Gate
+Do not optimize a Triton kernel without backend proof. First prove that this request actually selects
+the target Triton path rather than AITER, FlashInfer, TRTLLM, a framework C++ kernel, the PyTorch
+native fallback, or some other backend.
+Direct-file focus:
+- Record the target file, `@triton.jit` function, launch wrapper, grid function, and autotune/config table.
+- Find the caller, test, or benchmark harness that actually passes tensors.
+- If there is no harness, first create a temporary correctness/benchmark harness under `.humanize/triton-agent/`.
+- Use profiler kernel names, Triton cache/dumps, temporary instrumentation, or a minimal reproducer
+  to prove that the direct call really reaches the target kernel.
+- Treat framework backend flags as mandatory only when the file is still called through a framework path.
+vLLM focus:
+- Record the V0/V1 path and the attention, MLA, MoE, and quant backend selection.
+- Record env vars such as `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MOE`,
+  `VLLM_ROCM_USE_AITER_MLA`, `VLLM_ROCM_USE_AITER_MHA`,
+  `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`, and
+  `VLLM_ROCM_USE_AITER_TRITON_GEMM`.
+- For MLA/attention, confirm prefill/decode, block size, page table, FP8 KV cache,
+  and fallback.
+- For MoE/quant, confirm whether the expert backend, topk, scale layout, and config JSON or
+  autotune table take effect.
+SGLang focus:
+- Record `SGLANG_USE_AITER`, `--attention-backend`, `--moe-runner-backend`,
+  quant mode, and model config.
+- Confirm which of the AITER, Triton, FlashInfer, TRTLLM, and sgl-kernel C++ paths actually wins.
+- For Triton attention, distinguish decode, extend/prefill, target verify, MLA,
+  FP8 KV cache, and page/cache layout.
+- For Triton MoE, confirm `moe_runner`, tuned config, topk, expert count, block shape,
+  EP/TP/DP shape, and fallback.
+Acceptable backend proof includes logs, temporary instrumentation, profiler kernel names, a
+microbenchmark that directly calls the target wrapper, or Triton cache/dump artifacts.
+## triton-kernel-knowledge
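+This skill turns a question into citable evidence. Priority order:
+1. Local target source: vLLM/SGLang source, tests, benchmarks, configs, the current diff, or the call
+   sites and harness of a directly specified file.
+2. The local KernelPilot PR corpus, focusing on vLLM, SGLang, Triton, PyTorch ROCm,
+   AITER, FlashAttention, and FlashInfer.
+3. Official docs and upstream source, focusing on Triton, ROCm, DTK/DCU, vLLM, SGLang, and profilers.
+4. CUDA/NVIDIA material serves only as cross-platform inspiration; it must be explicitly translated and validated on DCU.
+Common corpus queries: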
+```bash
+python3 scripts/query.py "vllm rocm aiter triton <operator> <dtype>" --compact --limit 30
+python3 scripts/query.py "sglang amd aiter triton <operator> <dtype>" --compact --limit 30
+python3 scripts/query.py "triton amd backend rocm <operator> <symptom>" --compact --limit 30
+python3 scripts/search-pr-diffs.py triton rocm <operator> --any --limit 100
+```
+Record each finding in this format:
+```text
+Route:
+Source path or URL:
+Commit/version/date:
+Relevant file/function/config/test:
+Finding:
+Impact on plan:
+Portability: direct DCU evidence | ROCm upstream evidence | cross-platform idea
+License/notice if copied or adapted:
+```
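+### Rules for Porting CUDA Material
+What can usually be borrowed directly are Triton language-level patterns, such as online softmax, masks,
+stride addressing, shape-keyed configs, and microbenchmark style.
+What must be re-validated includes tiling, pipeline depth, dot layout, FP8/FP4, and cache behavior.
+Not directly portable by default: Nsight metrics, PTX/SASS conclusions, TMA, WGMMA, CUDA shared-memory
+bank rules, and NVIDIA device-name config tables.
+Spell out the concept translation explicitly: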
+```text
+SM -> CU
+warp -> wavefront
+shared memory -> LDS
+tensor core -> MFMA/MMAC or the target ROCm backend
+SASS/PTX -> AMDGPU ISA/code object
+Nsight Compute -> hipprof/rocprofv3/rocprof-compute/DCU profiler
+```
+## triton-dcu-profiler-report
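+This skill profiles and explains the bottlenecks of Triton JIT kernels on DCU. Its rule is:
+```text
+first prove the hot path is the target Triton kernel -> then profile -> inspect Triton IR/ISA when needed -> finally give exactly one next edit
+```
+Do not write just "memory-bound" or "launch overhead". Write it as:
+```text
+what signal was measured -> what the likely mechanism is -> why other explanations are weaker -> where exactly to edit next
+```
+### When to Invoke
+- The baseline benchmark has passed, but there is no Triton/DCU profile digest yet.
+- A correctness-passing candidate has just been benchmarked.
+- The candidate and baseline are within +/-2% or inside the noise band.
+- The second correct candidate improves less than 5% over its parent or baseline.
+- An important shape regresses.
+- The backend or the target Triton kernel is still uncertain.
+- JIT compile time, Triton cache churn, dynamic shapes, or graph capture are suspected of affecting the data.
+- The next step may involve `BLOCK_*`, `num_warps`, `num_stages`, `waves_per_eu`, dot layout,
+  vectorization, LDS, fusion, or backend routing.
+### Required Artifacts
+Default location: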
+```text
+.humanize/triton-agent/profile-artifacts/<version>/
+  device-status.txt
+  env.txt
+  backend-proof.txt
+  benchmark.log
+  hipprof.txt
+  rocprofv3/
+  rocprof-compute/
+  hipprof-pmc-all/
+  triton-cache/
+  triton-dumps/
+  amdgpu-isa.txt
+  code-object-metadata.txt
+  resource-usage.txt
+  digest.md
+```
+Before profiling, run `hy-smi` or `rocm-smi` in the same environment, pick an idle card, and pin it with
+`HIP_VISIBLE_DEVICES=<idle-card>` or `HSA_VISIBLE_DEVICES=<idle-card>`.
+### Common Command Shapes
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v000_baseline
+hy-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1 || \
+  rocm-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1
+HIP_VISIBLE_DEVICES=<idle-card> <benchmark-command> \
+  2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
+HIP_VISIBLE_DEVICES=<idle-card> hipprof <benchmark-command> \
+  2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/hipprof.txt
+```
+Triton dump example:
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-dumps
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-cache
+HIP_VISIBLE_DEVICES=<idle-card> \
+TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/v001/triton-cache \
+MLIR_ENABLE_DUMP=1 \
+AMDGCN_ENABLE_DUMP=1 \
+<benchmark-command> \
+  > .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stdout.txt \
+  2> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stderr.txt
+```
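+If the current DTK/ROCm/Triton does not support a flag or counter, keep the failed command and its error output; do not substitute guesses for evidence.
+## DCU Triton Tuning Discipline
+When tuning configs, record every candidate in `.humanize/triton-agent/tuning-decisions.md`: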
+```text
+shape/dtype/backend/gfx
+config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
+correctness: pass/fail and tolerance
+latency: p50/p90/mean, repeats, selected card
+profile clue: launch/memory/LDS/resource/compute/dispatch
+decision: keep/reject/inconclusive
+```
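+Promote a config to candidate only when all of these hold:
+- Correctness passes for the shape family in the requested scope.
+- The improvement exceeds the benchmark noise band.
+- No important serving shape has an unacceptable regression.
+- The backend proof still shows that the target Triton kernel is reached.
+- Profiler or ISA evidence shows that the generated DCU code path is reasonable.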
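Editor's sketch, not part of the commit: the five promotion conditions above can be written as a small gate that returns the same `keep/reject/inconclusive` vocabulary used in the tuning record. The class and field names are hypothetical, and `max_regression_pct` is an assumed knob, since the source does not say what counts as an unacceptable regression.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    correctness_pass: bool             # requested shape family passes
    baseline_p50_us: float
    candidate_p50_us: float
    noise_band_pct: float              # declared benchmark noise band, percent
    worst_shape_regression_pct: float  # largest slowdown on an important serving shape
    max_regression_pct: float          # assumed threshold for "unacceptable"
    backend_proof_ok: bool             # target Triton kernel still reached
    codegen_evidence_ok: bool          # profiler or ISA evidence looks reasonable


def decide(c: Candidate) -> str:
    """Map the promotion conditions to keep / reject / inconclusive."""
    if not c.correctness_pass or not c.backend_proof_ok:
        return "reject"
    if c.worst_shape_regression_pct > c.max_regression_pct:
        return "reject"
    gain_pct = (c.baseline_p50_us - c.candidate_p50_us) / c.baseline_p50_us * 100.0
    if gain_pct <= c.noise_band_pct:
        return "inconclusive"  # cannot be distinguished from noise
    if not c.codegen_evidence_ok:
        return "inconclusive"  # faster, but the code path is unexplained
    return "keep"
```

Treating a gain inside the noise band as `inconclusive` rather than `reject` is a design choice of this sketch; it leaves room for a profiler pass before the candidate is discarded.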
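+## Recommended Prompt Templates
+```text
+@triton-kernel-agent-loop
+Framework: vLLM
+Host repo path: /path/to/vllm
+Validation container: <container-name>
+Repo path inside container: /workspace/vllm
+Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
+Target arch: gfx936 or gfx938
+Correctness reference: existing framework path or PyTorch reference
+Performance target: p50 latency improvement must exceed the benchmark noise band
+Requirements:
+- First prove that the current request actually selects the Triton backend rather than AITER, FlashInfer, TRTLLM, or a C++ fallback.
+- Keep all process state and temporary evidence under `.humanize/triton-agent/`.
+- Run correctness validation before the benchmark.
+- Before every performance run, use `hy-smi` or `rocm-smi` to pick an idle card.
+- When benchmark results are close to baseline, below target, or regressing, use `hipprof`, `rocprofv3`, Triton cache, IR, or ISA evidence to justify the next edit.
+```
+For a directly specified file, the prompt can look like this:
+```text
+@triton-kernel-agent-loop
+Target mode: direct-file
+Target file: /path/to/project/kernels/my_triton_kernel.py
+Target function or wrapper: <kernel_name_or_wrapper>
+Project root or workdir: /path/to/project
+Validation container: <container-name>
+Workdir inside container: /workspace/project
+Task: optimize the Triton kernel in this file on DCU.
+Target arch: gfx936 or gfx938
+Correctness reference: existing Python/PyTorch reference, test, or small oracle
+Workload: shape, dtype, and layout distribution, plus a representative benchmark command
+Performance target: p50 latency improvement must exceed the benchmark noise band
+Requirements:
+- First locate the `@triton.jit` function, launch wrapper, grid, config, caller, and harness.
+- If there is no existing harness, first create a temporary correctness/benchmark harness under `.humanize/triton-agent/`.
+- Use profiler kernel names, Triton cache/dumps, or temporary instrumentation to prove that the direct call really reaches the target Triton kernel.
+- Keep all process state and temporary evidence under `.humanize/triton-agent/`.
+- Run correctness validation before the benchmark.
+```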
--- a/humanize/.claude-plugin/plugin.json
+++ b/humanize/.claude-plugin/plugin.json
 {
  "name": "humanize",
-  "description": "Humanize packaged with LightOp/DCU operator-loop skills for adding, optimizing, benchmarking, and profiling fused ROCm operators with review-gated iteration.",
+  "description": "Humanize packaged with KernelPilot DCU skills for LightOp HIP/C++ operators and Triton kernels in vLLM, SGLang, or direct Python files, with benchmark/profile evidence and review-gated iteration.",
  "version": "1.17.0",
  "author": {
    "name": "PolyArch"
@@ -18,6 +18,10 @@
    "lightop",
    "dcu",
    "rocm",
-    "hip"
+    "hip",
+    "triton",
+    "direct-file",
+    "vllm",
+    "sglang"
  ]
 }
--- a/humanize/scripts/install-codex-hooks.sh
+++ b/humanize/scripts/install-codex-hooks.sh
@@ -220,8 +220,6 @@ log "codex config dir: $CODEX_CONFIG_DIR"
 log "runtime root: $RUNTIME_ROOT"
 log "hooks file: $HOOKS_FILE"
-require_native_hooks_support
 if [[ "$DRY_RUN" == "true" ]]; then
    log "DRY-RUN merge $HOOKS_TEMPLATE -> $HOOKS_FILE"
    if [[ "$ENABLE_FEATURE" == "true" ]]; then
@@ -230,6 +228,8 @@ if [[ "$DRY_RUN" == "true" ]]; then
    exit 0
 fi
+require_native_hooks_support
 merge_hooks_json "$HOOKS_FILE" "$HOOKS_TEMPLATE" "$RUNTIME_ROOT"
 enable_feature "$CODEX_CONFIG_DIR"

--- a/humanize/scripts/install-skill.sh
+++ b/humanize/scripts/install-skill.sh
@@ -46,6 +46,10 @@ SKILL_NAMES=(
    # lightop-kernel-agent-loop and dcu-profiler-report.
    "humanize-kernel-agent-loop"
    "ncu-report"
+    # Independent Triton/DCU skills for vLLM, SGLang, and direct Triton files.
+    "triton-kernel-agent-loop"
+    "triton-kernel-knowledge"
+    "triton-dcu-profiler-report"
 )
 KERNEL_KNOWLEDGE_SKILL_NAME="lightop-kernel-knowledge"
@@ -64,7 +68,7 @@ Options:
  --codex-skills-dir PATH Codex skills dir (default: ${CODEX_HOME:-~/.codex}/skills)
  --codex-config-dir PATH Codex config dir for hooks/config.toml (default: ${CODEX_HOME:-~/.codex})
  --command-bin-dir PATH  Install helper command shims here (default: ~/.local/bin)
-  --kernelpilot-root PATH Root of the LightOp/DCU knowledge pack used by kernel-agent skill
+  --kernelpilot-root PATH Root of the KernelPilot knowledge pack used by kernel skills
  --dry-run               Print actions without writing
  -h, --help              Show help
 EOF
@@ -138,7 +142,7 @@ resolve_kernelpilot_root() {
 }
 validate_kernelpilot_root() {
-    [[ -n "$KERNELPILOT_ROOT" ]] || die "LightOp/DCU kernel-pilot root not found; run from the kernel-pilot/humanize checkout or pass --kernelpilot-root PATH"
+    [[ -n "$KERNELPILOT_ROOT" ]] || die "KernelPilot root not found; run from the kernel-pilot/humanize checkout or pass --kernelpilot-root PATH"
    [[ -d "$KERNELPILOT_ROOT" ]] || die "KernelPilot root is not a directory: $KERNELPILOT_ROOT"
    [[ -f "$KERNELPILOT_ROOT/knowledge/SKILL.md" ]] || die "LightOp kernel knowledge skill not found: $KERNELPILOT_ROOT/knowledge/SKILL.md"
 }
@@ -312,17 +316,17 @@ install_codex_user_config() {
    local user_config_file="$user_config_dir/config.json"
    local default_config_file="$runtime_root/config/default_config.json"
+    if [[ "$DRY_RUN" == "true" ]]; then
+        log "DRY-RUN seed Codex-friendly BitLesson config in $user_config_file"
+        return
+    fi
    [[ -f "$default_config_file" ]] || die "missing default config: $default_config_file"
    if ! command -v python3 >/dev/null 2>&1; then
        die "python3 is required to update Humanize user config for Codex installs"
    fi
-    if [[ "$DRY_RUN" == "true" ]]; then
-        log "DRY-RUN seed Codex-friendly BitLesson config in $user_config_file"
-        return
-    fi
    mkdir -p "$user_config_dir"
    python3 - "$default_config_file" "$user_config_file" "$install_target" <<'PY'

--- a/humanize/scripts/install-skills-claude.sh
+++ b/humanize/scripts/install-skills-claude.sh
 #!/usr/bin/env bash
 #
-# Install/upgrade LightOp/DCU KernelPilot Humanize for Claude Code.
+# Install/upgrade KernelPilot Humanize for Claude Code.
 #
 # Claude Code plugin installation copies the plugin into ~/.claude/plugins/cache
 # but does not hydrate SKILL.md placeholders. This wrapper performs the normal
@@ -21,7 +21,7 @@ DRY_RUN="false"
 usage() {
    cat <<'EOF'
-Install LightOp/DCU KernelPilot Humanize for Claude Code.
+Install KernelPilot Humanize for Claude Code.
 Usage:
  humanize/scripts/install-skills-claude.sh [options]

--- a/humanize/skills/triton-dcu-profiler-report/SKILL.md
+++ b/humanize/skills/triton-dcu-profiler-report/SKILL.md
+---
+name: triton-dcu-profiler-report
+description: "Profile and explain Triton JIT kernels running on DCU/ROCm/DTK in vLLM, SGLang, or a user-specified Triton Python file: capture device state, synchronized benchmark logs, hipprof/rocprofv3/rocprof-compute timing and counters, Triton cache/IR/compiler dumps, AMDGPU ISA or code-object resource evidence, then produce exactly one concrete next Triton/config/backend/call-site edit. Use when Triton benchmark numbers are ambiguous, close to baseline, regressing, below target, or need proof before the next optimization."
+---
+# Triton DCU Profiler Report
+Use this skill when benchmark numbers are not enough and the next Triton kernel,
+`triton.Config`, autotune table, framework backend, or direct call-site edit
+should be driven by DCU evidence. This skill is independent from the LightOp
+profiler skill and stores artifacts under `.humanize/triton-agent/`.
+The rule:
+```text
+prove the hot Triton path -> profile -> inspect generated code when needed -> choose one next edit
+```
+The output must be an inference chain from measured evidence to a likely
+mechanism to exactly one actionable edit. Do not stop at "memory-bound" or
+"launch overhead".
+## When To Invoke
+Invoke this skill when any of these hold:
+- Baseline benchmark passed but no Triton/DCU profile digest exists.
+- A correctness-passing candidate has just been benchmarked.
+- Candidate and baseline are within +/-2% or inside the declared noise band.
+- The second correctness-passing optimization improves less than 5% over its
+  parent or baseline.
+- A candidate regresses on important shapes.
+- The selected backend or target Triton kernel is uncertain.
+- JIT compile time, Triton cache churn, dynamic shapes, or graph capture may be
+  polluting benchmark results.
+- The next edit might involve `BLOCK_*`, `num_warps`, `num_stages`,
+  `waves_per_eu`, dot layout, vectorization, LDS use, fusion, or backend
+  routing.
+Do not profile correctness-failing candidates unless profiling is needed to
+debug a profiler or compile failure.
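Editor's sketch, not part of the commit: the numeric triggers in the list above (the +/-2% or noise-band rule, the under-5% rule for the second passing candidate, and regressions) can be checked mechanically before deciding whether to invoke the skill. The function name and return strings are hypothetical; the non-numeric triggers (missing digest, uncertain backend, JIT or cache pollution) still need human or agent judgment.

```python
from typing import List, Optional


def profile_triggers(baseline_us: float, candidate_us: float, *,
                     parent_us: Optional[float] = None,
                     noise_band_pct: float = 2.0,
                     is_second_candidate: bool = False) -> List[str]:
    """Return the numeric reasons, if any, to run a profiler pass."""
    reasons = []
    delta_pct = (baseline_us - candidate_us) / baseline_us * 100.0
    if abs(delta_pct) <= noise_band_pct:
        reasons.append("within noise band")
    elif delta_pct < 0:
        reasons.append("regression")
    if is_second_candidate:
        # Compare against the parent when one exists, otherwise the baseline.
        reference_us = parent_us if parent_us is not None else baseline_us
        gain_pct = (reference_us - candidate_us) / reference_us * 100.0
        if 0 <= gain_pct < 5.0:
            reasons.append("gain under 5%")
    return reasons
```

An empty list means none of the numeric rules fired; it does not mean profiling is unnecessary.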
+## Required Artifacts
+Use the target repo, nearest project root, or user-provided direct-file workdir:
+```text
+.humanize/triton-agent/profile-artifacts/<version>/
+  device-status.txt
+  env.txt
+  backend-proof.txt
+  benchmark.log
+  hipprof.txt
+  rocprofv3/
+  rocprof-compute/
+  hipprof-pmc-all/
+  triton-cache/
+  triton-dumps/
+  amdgpu-isa.txt
+  code-object-metadata.txt
+  resource-usage.txt
+  digest.md
+```
+When comparing a candidate, cite the baseline or parent artifact path in
+`digest.md`.
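Editor's sketch, not part of the commit: the artifact layout above can be created and audited with `pathlib`. The helper names are hypothetical; the file and directory names are copied from the tree listed above.

```python
from pathlib import Path
from typing import List

ARTIFACT_DIRS = ["rocprofv3", "rocprof-compute", "hipprof-pmc-all",
                 "triton-cache", "triton-dumps"]
ARTIFACT_FILES = ["device-status.txt", "env.txt", "backend-proof.txt",
                  "benchmark.log", "hipprof.txt", "amdgpu-isa.txt",
                  "code-object-metadata.txt", "resource-usage.txt", "digest.md"]


def init_profile_artifacts(root: str, version: str) -> Path:
    """Create the per-version artifact directory and its subdirectories."""
    base = Path(root) / ".humanize" / "triton-agent" / "profile-artifacts" / version
    for name in ARTIFACT_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base


def missing_artifacts(base: Path) -> List[str]:
    """List expected files that have not been written yet."""
    return [name for name in ARTIFACT_FILES if not (base / name).is_file()]
```

Running `missing_artifacts` before writing `digest.md` gives a quick list of evidence that still has to be collected or explicitly recorded as unavailable.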
+## Device And Backend Gate
+Before timing or profiling:
+- Run `hy-smi` or `rocm-smi` in the same environment.
+- Choose an idle card and pin `HIP_VISIBLE_DEVICES=<idle-card>` or
+  `HSA_VISIBLE_DEVICES=<idle-card>`.
+- Record framework/project identity, Triton version, ROCm/DTK/HIP version,
+  device name, `gcnArchName`, relevant env vars, and backend or harness args.
+- Prove the framework or direct harness selected the target Triton path with
+  logs, temporary instrumentation, Triton cache entries, or profiler kernel
+  names.
+- Warm up enough to exclude JIT compile time unless compile time is the target.
+## Workflow
+1. Pick one representative shape first. Prefer the shape that exposes the
+   regression, plateau, launch overhead, or suspected bottleneck.
+2. Confirm correctness and backend selection.
+3. Capture normal benchmark output with explicit synchronization.
+4. Capture first-pass `hipprof` or `rocprofv3` timing.
+5. If timing is insufficient, collect supported counters with `hipprof --pmc`,
+   `rocprofv3`, or `rocprof-compute`.
+6. If the issue is codegen-sensitive, collect Triton cache, MLIR/LLVM dumps,
+   code-object metadata, resource usage, and AMDGPU ISA.
+7. Compare candidate against baseline or parent, not just absolute time.
+8. Interpret using [metrics.md](references/metrics.md).
+9. Use command variants from [examples.md](references/examples.md).
+10. Use [triton-ir-isa.md](references/triton-ir-isa.md) when inspecting
+    generated code.
+11. Write `digest.md`; the final section must contain exactly one next edit.
+12. Update `.humanize/triton-agent/attempt-ledger.md` with digest path and
+    decision.
+## Digest Template
+```markdown
+# Triton DCU Profile Digest
+## Runtime Identity
+- Framework or project:
+- Commit:
+- Target mode:
+- Operator/kernel:
+- Shape/dtype:
+- Backend:
+- Device/gfx:
+- Versions:
+- Command:
+## Backend Proof
+- Evidence:
+- Kernel names:
+- Caveats:
+## Benchmark
+- Warmup/repeats:
+- p50/p90/mean:
+- Baseline or parent:
+- Delta:
+- Noise band:
+## Profiler Evidence
+- API/kernel/memcpy timing:
+- Top kernels:
+- Launch overhead:
+- Memory/cache:
+- LDS/barrier:
+- Compute/MFMA/MMAC:
+- VGPR/SGPR/LDS/resource:
+- Triton cache/dump/ISA:
+## Diagnosis
+- Most likely mechanism:
+- Why alternative explanations are weaker:
+- Risk:
+## Next Edit
+Exactly one concrete edit:
+```
+If a tool is unavailable, record the exact command, error, and what evidence is
+missing. Do not silently replace missing evidence with speculation.
--- a/humanize/skills/triton-dcu-profiler-report/agents/openai.yaml
+++ b/humanize/skills/triton-dcu-profiler-report/agents/openai.yaml
+interface:
+  display_name: "Triton DCU Profiler"
+  short_description: "Profile Triton JIT kernels on DCU and choose the next edit"
+  default_prompt: "Use $triton-dcu-profiler-report to turn DCU profiler, Triton IR, and ISA artifacts into one next framework or standalone Triton kernel edit."
--- a/humanize/skills/triton-dcu-profiler-report/references/examples.md
+++ b/humanize/skills/triton-dcu-profiler-report/references/examples.md
+# Triton/DCU Profiler Command Examples
+Adjust paths and command names to the target environment. Preserve the exact
+commands in `.humanize/triton-agent/profile-artifacts/<version>/`.
+## First-Pass Capture
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v000_baseline
+hy-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1 || \
+  rocm-smi > .humanize/triton-agent/profile-artifacts/v000_baseline/device-status.txt 2>&1
+HIP_VISIBLE_DEVICES=<idle-card> <benchmark-command> \
+  2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
+HIP_VISIBLE_DEVICES=<idle-card> hipprof <benchmark-command> \
+  2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/hipprof.txt
+```
+## rocprofv3 Capture
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/rocprofv3
+HIP_VISIBLE_DEVICES=<idle-card> rocprofv3 \
+  --hip-trace --kernel-trace --stats \
+  -d .humanize/triton-agent/profile-artifacts/v001/rocprofv3 \
+  <benchmark-command>
+```
+If the installed DTK/ROCm uses different flags, run `rocprofv3 --help`, save
+the command used, and note the version in the digest.
+## Counter Capture
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc --pmc-type 3 \
+  -o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc \
+  <benchmark-command>
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-read --pmc-type 3 \
+  -o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc-read \
+  <benchmark-command>
+HIP_VISIBLE_DEVICES=<idle-card> hipprof --pmc-write --pmc-type 3 \
+  -o .humanize/triton-agent/profile-artifacts/v001/hipprof-pmc-all/pmc-write \
+  <benchmark-command>
+```
+Some DTK/ROCm versions expose counters differently. Keep failed command output
+as evidence instead of guessing.
+## Triton Dump Capture
+```bash
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-dumps
+mkdir -p .humanize/triton-agent/profile-artifacts/v001/triton-cache
+HIP_VISIBLE_DEVICES=<idle-card> \
+TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/v001/triton-cache \
+MLIR_ENABLE_DUMP=1 \
+AMDGCN_ENABLE_DUMP=1 \
+<benchmark-command> \
+  > .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stdout.txt \
+  2> .humanize/triton-agent/profile-artifacts/v001/triton-dumps/stderr.txt
+```
+Use the exact dump variables supported by the installed Triton/ROCm build. If
+the variables do nothing, save that result and inspect `TRITON_CACHE_DIR`.
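Editor's sketch, not part of the commit: one way to inspect `TRITON_CACHE_DIR` after a run is to count cache entries and artifact types. This assumes the cache holds one subdirectory per compiled variant, with artifacts distinguished by file suffix; that layout is an assumption and should be checked against the installed Triton build. The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_triton_cache(cache_dir: str) -> dict:
    """Count top-level cache entries and files per suffix under cache_dir."""
    root = Path(cache_dir)
    # Assumed layout: each immediate subdirectory is one compiled variant.
    entries = [p for p in root.iterdir() if p.is_dir()]
    suffixes = Counter(p.suffix or "<none>"
                       for p in root.rglob("*") if p.is_file())
    return {"entries": len(entries), "by_suffix": dict(suffixes)}
```

A count that keeps growing across repeated runs of the same shape is a hint of repeated compilation, which is worth recording in the digest alongside the dump results.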
+## End-To-End Serving Proof
+Microbenchmarks can lie when routing, graph capture, or batching differs from
+serving. For serving-sensitive changes, pair the microbenchmark with a small
+framework-level request benchmark and record:
+```text
+model
+server args
+backend args
+request mix
+prefill/decode split
+latency or throughput metric
+target kernel presence in profiler
+```
+## Direct-File Harness Proof
+When the user provides a standalone Triton file, keep the harness command and
+proof local to `.humanize/triton-agent/`:
+```bash
+HIP_VISIBLE_DEVICES=<idle-card> python .humanize/triton-agent/bench_<kernel>.py \
+  --shape <shape> --dtype <dtype> \
+  2>&1 | tee .humanize/triton-agent/profile-artifacts/v000_baseline/benchmark.log
+```
+Record the wrapper name, grid, constexpr/config values, correctness tolerance,
+and profiler or Triton-cache evidence that the target JIT function compiled and
+ran.
--- a/humanize/skills/triton-dcu-profiler-report/references/metrics.md
+++ b/humanize/skills/triton-dcu-profiler-report/references/metrics.md
+# Metrics And Interpretation
+Use this checklist when writing `digest.md`.
+## Runtime Identity
+- Framework or project, target mode, commit, local diff state.
+- Operator, wrapper, Triton JIT function, profiler kernel name.
+- Shape, dtype, layout, strides, sequence length, topk, expert count, block
+  size, quant mode, and cache/page layout.
+- Device name, `gcnArchName`, DTK/ROCm/HIP/PyTorch/Triton/AITER versions.
+- Backend env vars, direct harness args, and CLI args.
+## Timing
+- JIT compile time versus warmed kernel time.
+- p50, p90, mean, min, max, and repeat count.
+- CPU-side routing, allocation, graph capture, or sync overhead.
+- HIP API time, kernel time, memcpy/memset time, and launch overhead.
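The timing fields in this checklist can be produced from a list of warmed per-iteration latencies. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the choice of the inclusive quantile method are assumptions for illustration, not part of the skill.

```python
import statistics


def summarize(samples_us):
    """Return p50, p90, mean, min, max, and repeat count for warmed timings.

    samples_us: per-iteration latencies in microseconds, collected after
    warmup so that JIT compile time is not mixed in.
    """
    if len(samples_us) < 2:
        raise ValueError("need at least two warmed samples")
    # quantiles(n=10) returns nine cut points: index 4 is p50, index 8 is p90.
    cuts = statistics.quantiles(samples_us, n=10, method="inclusive")
    return {
        "p50": cuts[4],
        "p90": cuts[8],
        "mean": statistics.fmean(samples_us),
        "min": min(samples_us),
        "max": max(samples_us),
        "repeats": len(samples_us),
    }
```

Reporting the repeat count next to the percentiles makes it easier to judge whether a p90 shift is meaningful or only reflects a small sample.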
+## Kernel Launch Shape
+- Grid dimensions, program ids, block sizes, and active program count.
+- Tiny-kernel launch overhead relative to work.
+- Number of programs relative to CUs.
+- Divergent mask or page-table branches.
+## Memory Path
+- Global load/store volume and redundant traffic.
+- Coalescing and vector width.
+- Alignment and padding.
+- Cache reuse and page locality.
+- Temporary tensors or unfused epilogues causing extra global traffic.
+## LDS And Synchronization
+- LDS allocation per program.
+- Bank conflict clues when available.
+- Barriers and staging overhead.
+- Whether LDS use improves coalescing or only adds pressure.
+## Compute Path
+- Dot-heavy versus scalar/vector ALU-heavy.
+- MFMA/MMAC instruction presence when expected.
+- Conversion, dequantization, scale loads, and saturation overhead.
+- FP32 accumulation cost versus required numerical tolerance.
+- Epilogue fusion opportunities.
+## Resource Pressure
+- VGPR and SGPR count.
+- LDS per block/program.
+- Spills or scratch memory.
+- Occupancy limiters.
+- Over-unroll or too many live masks/pointers.
+## Dispatch And Config
+- Wrong backend selected.
+- Wrong shape branch or config JSON selected.
+- Autotune cache stale or too broad.
+- Device-name config table misses DCU/gfx target.
+- Dynamic shapes causing repeated compile or many cache entries.
+## Diagnosis Discipline
+A good diagnosis says:
+```text
+measured signal -> likely mechanism -> rejected alternatives -> one next edit
+```
+Examples:
+```text
+high launch overhead + tiny kernel + no memory pressure -> fuse with adjacent
+epilogue or move work into existing Triton kernel.
+same kernel time but worse p90 + many Triton cache entries -> stabilize shape
+specialization or backend config selection.
+lower occupancy + higher VGPR + no memory traffic reduction -> reduce block
+tile/unroll or split live ranges.
+high global traffic + separate dequant/store kernels -> fuse dequant or epilogue
+if correctness and backend constraints allow.
+```
--- a/humanize/skills/triton-dcu-profiler-report/references/triton-ir-isa.md
+++ b/humanize/skills/triton-dcu-profiler-report/references/triton-ir-isa.md
+# Triton IR, Cache, And ISA Evidence
+Use this reference when profiler timing does not explain the next edit.
+## Triton Cache
+Set a task-local cache directory when reproducing:
+```bash
+TRITON_CACHE_DIR=$PWD/.humanize/triton-agent/profile-artifacts/<version>/triton-cache
+```
+Record:
+- number of cache entries before and after warmup;
+- kernel source path/function when visible;
+- config key and constexpr values;
+- whether dynamic shapes generate repeated compilations;
+- whether candidate and baseline compile to distinct code.
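Counting cache entries before and after warmup can be scripted. The sketch below assumes that each compiled variant lands in its own top-level subdirectory of `TRITON_CACHE_DIR`; verify that layout against the installed Triton build before relying on it. The helper names `cache_entries` and `new_entries` are hypothetical.

```python
from pathlib import Path


def cache_entries(cache_dir):
    """Return the names of top-level subdirectories in a cache directory.

    Assumption: one subdirectory per compiled kernel variant. Stray files
    at the top level are ignored.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return set()
    return {p.name for p in root.iterdir() if p.is_dir()}


def new_entries(before, after):
    """Return the sorted entry names that appeared between two snapshots."""
    return sorted(after - before)
```

Taking one snapshot before the first call and one after warmup, then a third after the timed repeats, shows whether dynamic shapes keep adding entries during the measured region.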
+## Compiler Dumps
+Try the dump variables supported by the installed Triton/ROCm stack:
+```bash
+MLIR_ENABLE_DUMP=1
+AMDGCN_ENABLE_DUMP=1
+```
+Capture stdout/stderr. If the variables are unsupported or silent, record that
+and inspect `TRITON_CACHE_DIR`, code objects, or generated temporary files.
+## ISA And Code Object
+Probe for the available tools in this order and use the first one found:
+```bash
+which dccobjdump || true
+which llvm-objdump || true
+which roc-objdump || true
+which hipprof || true
+```
+Possible captures:
+```bash
+dccobjdump --inputs=<code-object-or-binary> --show-sass --show-instruction-encoding \
+  --separate-functions > .humanize/triton-agent/profile-artifacts/<version>/amdgpu-isa.txt
+hipprof --codeobj-analyze <code-object-or-binary> \
+  > .humanize/triton-agent/profile-artifacts/<version>/resource-usage.txt
+```
+Tool names and flags vary across DTK/ROCm versions. When a command fails, save
+the command and its error output alongside the other artifacts.
+## What To Look For
+- Expected MFMA/MMAC instructions for dot-heavy kernels.
+- Excessive scalarization or address arithmetic.
+- Scratch/spill loads and stores.
+- Vector width and memory instruction pattern.
+- LDS instructions and barriers.
+- Conversion/dequantization instruction clusters.
+- Register count and LDS usage that explain occupancy limits.
+Treat ISA evidence as target-specific. Do not use AMD docs alone to claim a
+Hygon/DCU instruction path; verify with the compiled artifact when possible.
--- a/humanize/skills/triton-kernel-agent-loop/SKILL.md
+++ b/humanize/skills/triton-kernel-agent-loop/SKILL.md
+---
+name: triton-kernel-agent-loop
+description: "Run an autonomous DCU/ROCm Triton-kernel optimization loop inside vLLM, SGLang, or a user-specified Triton Python file: recover operator semantics, framework or direct-call dispatch, Triton JIT/autotune configs, correctness references, workloads, DCU/gfx environment, benchmark and profile evidence, then implement scoped Triton or routing changes with reproducible tests. Use for optimizing vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, sampler, routing, cache, small JIT kernels, or standalone Triton files on DCU. Do not use for LightOp HIP/C++ operators; use the LightOp skills for that."
+---
+# Triton Kernel Agent Loop
+Use this flow when the user wants to optimize or add a Triton JIT kernel inside
+vLLM, SGLang, or a user-specified Triton Python file for DCU/ROCm/DTK. This
+skill is independent from the LightOp skills. It may reuse the KernelPilot
+evidence corpus and Humanize runtime, but it must not assume a LightOp checkout,
+LightOp wrappers, or LightOp tests.
+Default target: Triton Python kernels and framework dispatch code in vLLM or
+SGLang. Prefer local framework patterns, AITER/Triton backend contracts,
+ROCm/DCU profiler evidence, and target-compiled Triton IR/ISA over generic
+CUDA or NVIDIA tuning advice. When the user gives a specific file path, operate
+in direct-file mode and treat that file, its call sites, and its harness as the
+source of truth.
+## Input Contract
+Recover or define these before implementation:
+```text
+K: kernel/operator semantics, tensor shapes, strides, layouts, dtype, masks,
+   causal/window flags, quant scales, cache/page tables, expert routing, and
+   output contract.
+R: correctness reference, usually framework native path, PyTorch reference,
+   existing Triton path, AITER path, or a small literal oracle for edge cases.
+W: workload distribution: model scenario, prefill/decode/batch mix, seq lengths,
+   hidden/head dims, experts/topk, quant mode, block sizes, contiguity,
+   target gfx/DCU, latency/throughput metric, and success threshold.
+E: execution environment: host or Docker, framework root in that environment,
+   selected DCU, DTK/ROCm/PyTorch/Triton/AITER versions, build/install command,
+   test command, benchmark command, profiler command, and pass threshold.
+B: backend or call-site contract: framework, direct-file, or scratch mode;
+   V0/V1/runtime path, env flags, CLI args, backend registry branch, fallback
+   branch, direct wrapper/harness, Triton cache/JIT config, and proof that the
+   target Triton kernel is the hot path.
+P: profiling contract: artifact directory, benchmark stability rule,
+   profiler depth, required Triton IR/ISA dumps, and the minimum evidence
+   needed before the next edit.
+```
+Ask the user only when the missing value cannot be inferred from the target
+repo or local benchmark/test files and the assumption would change correctness,
+backend selection, or the performance target.
+## Locate The Target
+Support three target modes:
+```text
+framework: a vLLM or SGLang checkout with backend routing to prove.
+direct-file: one or more user-specified Triton Python files and their call sites.
+scratch: a new reproducer or experiment outside a framework.
+```
+For direct-file mode, locate the nearest project root if one exists, but do not
+require vLLM/SGLang markers. Use the user-provided file path as the target, then
+recover imports, `@triton.jit` functions, launch wrappers, call sites, tests,
+and benchmark harnesses around it.
+Direct-file search:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\\[grid\\]" <target-file-or-dir>
+rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
+```
+For framework mode, operate on a vLLM or SGLang checkout. Identify the root
+before editing.
+vLLM root markers:
+```text
+pyproject.toml
+vllm/
+tests/
+benchmarks/
+```
+SGLang root markers:
+```text
+pyproject.toml
+python/sglang/
+sgl-kernel/
+test/
+benchmark/
+```
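The root markers listed above can be checked mechanically. This is a minimal sketch that treats every listed marker as required, which is an assumption; a checkout that lacks one of them (for example `benchmarks/`) would need a looser rule. The names `detect_framework` and `find_root` are hypothetical.

```python
from pathlib import Path

# Marker lists copied from the text above; requiring all of them is an assumption.
VLLM_MARKERS = ("pyproject.toml", "vllm", "tests", "benchmarks")
SGLANG_MARKERS = ("pyproject.toml", "python/sglang", "sgl-kernel", "test", "benchmark")


def detect_framework(root):
    """Return "sglang", "vllm", or None for a candidate root directory."""
    root = Path(root)

    def has_all(markers):
        return all((root / marker).exists() for marker in markers)

    if has_all(SGLANG_MARKERS):
        return "sglang"
    if has_all(VLLM_MARKERS):
        return "vllm"
    return None


def find_root(start):
    """Walk upward from start and return (root, framework) or (None, None)."""
    here = Path(start).resolve()
    for candidate in (here, *here.parents):
        kind = detect_framework(candidate)
        if kind is not None:
            return candidate, kind
    return None, None
```

A `(None, None)` result is the signal to fall back to direct-file or scratch mode instead of assuming a framework checkout.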
+Useful searches:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|VLLM_ROCM|AITER|current_platform\.is_rocm|gcnArchName|triton_mla|fused_moe" vllm tests benchmarks
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|SGLANG_USE_AITER|attention_backend|moe_runner|is_rocm|is_hip|fused_moe|triton_ops" python/sglang sgl-kernel test benchmark
+```
+Load [framework-entrypoints-vllm-sglang.md](references/framework-entrypoints-vllm-sglang.md)
+when selecting files, backend flags, tests, or benchmarks.
+## Execution Environment
+Run build/install, correctness, benchmark, and profiling in one consistent
+environment. If the user names a Docker container or image, treat that as part
+of the acceptance contract. Prefer non-interactive commands:
+```bash
+docker exec <container> bash -lc 'cd <repo-in-container> && <command>'
+```
+Record before the first serious benchmark:
+- Framework root from the command's point of view.
+- Container/image or `direct-host`.
+- `HIP_VISIBLE_DEVICES` or `HSA_VISIBLE_DEVICES`.
+- `hy-smi` or `rocm-smi` output and selected idle card.
+- `torch`, `torch.version.hip`, `triton`, AITER if installed, DTK/ROCm/HIP,
+  device name, and `gcnArchName`.
+- `PYTORCH_ROCM_ARCH`, relevant backend env vars, and CLI backend flags.
+- Exact install, import smoke, correctness, benchmark, profiler, and dump
+  commands.
+- Warmup/repeat counts, p50/p90/mean policy, acceptable noise band, and the
+  minimum delta that counts as an optimization.
+Useful probe:
+```bash
+python - <<'PY'
+import importlib.util, os, torch
+print("torch:", torch.__version__)
+print("hip:", torch.version.hip)
+print("device:", torch.cuda.get_device_name(0))
+print("gcn:", torch.cuda.get_device_properties(0).gcnArchName)
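+for name in ("triton", "aiter", "vllm", "sglang"):
+    if importlib.util.find_spec(name) is None:
+        print(name + ": not installed")
+        continue
+    try:
+        mod = importlib.import_module(name)
+    except Exception as exc:  # a broken install should not abort the probe
+        print(name + ": import failed:", exc)
+        continue
+    print(name + ":", getattr(mod, "__version__", getattr(mod, "__file__", "unknown")))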
+print("HIP_VISIBLE_DEVICES:", os.getenv("HIP_VISIBLE_DEVICES"))
+print("HSA_VISIBLE_DEVICES:", os.getenv("HSA_VISIBLE_DEVICES"))
+PY
+hipcc --version || true
+```
+## Backend Or Call-Site Gate
+Do not optimize a Triton kernel until you have proof that the request actually
+selects that Triton kernel.
+For direct-file mode:
+- Record the target file, `@triton.jit` function, launch wrapper, grid function,
+  and any `@triton.autotune` or manual config table.
+- Identify the caller or harness that passes real tensors into the kernel.
+- If no harness exists, create a temporary correctness/benchmark harness under
+  `.humanize/triton-agent/` before editing the target file.
+- Prove the direct call reaches the target kernel through profiler kernel names,
+  Triton cache entries, temporary print/logging, or a minimal reproducer.
+- Treat framework backend flags as optional unless the file is still called
+  from a framework path.
+For vLLM:
+- Record V0/V1 path and attention/MoE/quant backend selection.
+- Capture relevant `VLLM_ROCM_*`, `VLLM_USE_*`, and quantization env vars.
+- Check whether AITER, FlashAttention, FlashInfer, CUTLASS/TRTLLM, or Triton
+  wins the dispatch branch.
+- For Triton MLA/attention, confirm prefill/decode mode, block size, page table
+  layout, FP8 KV cache, and fallback behavior.
+- For MoE/quantization, confirm expert backend, topk, group quant scales,
+  per-token/per-channel/per-block scale layout, and whether config JSON or
+  autotune tables are active.
+For SGLang:
+- Record `--attention-backend`, `--moe-runner-backend`, `SGLANG_USE_AITER`, and
+  model quantization mode.
+- Check whether AITER, Triton, FlashInfer, TRTLLM, or sgl-kernel C++ paths win.
+- For Triton attention, separate decode, extend/prefill, target verify, MLA,
+  FP8 KV cache, and page/cache layout.
+- For Triton MoE, confirm `moe_runner`, tuned config file, expert parallelism,
+  topk, dtype, block shape, and fallback branch.
+Backend proof can come from logs, explicit instrumentation, profiler kernel
+names, a minimal reproducer that calls the target function, or framework debug
+output. If no proof exists, add temporary local instrumentation under
+`.humanize/triton-agent/` or use profiler kernel-name filtering before editing.
+## State And Artifacts
+Store loop state in the target repo and do not mix it with LightOp artifacts:
+```text
+.humanize/triton-agent/refined-plan.md
+.humanize/triton-agent/research-digest.md
+.humanize/triton-agent/backend-map.md
+.humanize/triton-agent/attempt-ledger.md
+.humanize/triton-agent/kernel_opt_readme.md
+.humanize/triton-agent/lineage.jsonl
+.humanize/triton-agent/performance-map.json
+.humanize/triton-agent/tuning-decisions.md
+.humanize/triton-agent/profile-artifacts/
+.humanize/triton-agent/triton-artifacts/
+```
+Load [state-and-artifacts.md](references/state-and-artifacts.md) for templates.
+## Workflow
+Stage 1: inspect and plan.
+- Locate the target repo, direct file, or scratch workspace.
+- Recover `K/R/W/E/B/P`.
+- Identify wrapper, backend registry, dispatch branch, Triton JIT function,
+  config/autotune source, tests, benchmarks, and fallback path.
+- In direct-file mode, identify the launch wrapper, caller/harness, config
+  surface, imports, reference path, and any project-level tests.
+- Use `triton-kernel-knowledge` before the first serious implementation or
+  whenever backend evidence is thin.
+- Write `.humanize/triton-agent/research-digest.md` and
+  `.humanize/triton-agent/backend-map.md`.
+- Define a representative microbenchmark and, when relevant, an end-to-end
+  serving benchmark that proves the kernel change matters outside a toy call.
+Stage 2: baseline.
+- Run import smoke and backend-selection smoke.
+- Run correctness against `R` before timing.
+- Run a stable benchmark with explicit `torch.cuda.synchronize()`.
+- Capture device status and actual Triton kernel names in `hipprof` or
+  `rocprofv3`.
+- If the baseline is already ambiguous, invoke `triton-dcu-profiler-report`.
+Stage 3: edit and tune.
+- Make one main optimization hypothesis per attempt.
+- Prefer config/autotune/dispatch edits before rewriting kernel structure when
+  profiler evidence points to a bad shape branch.
+- For code edits, preserve framework public API and local style. Avoid broad
+  rewrites, unrelated backend churn, new dependencies, or NVIDIA-only concepts.
+- Re-run correctness, benchmark, and backend proof after every candidate.
+- For every correctness-passing candidate, record keep/reject/inconclusive
+  with timing, shape, config, backend, selected card, and profiler evidence.
+- When a second correct candidate improves by less than 5% over its parent or
+  the baseline, run deep profiling with Triton IR/ISA evidence before the next
+  edit.
+Stage 4: close out.
+- Completion requires install/import smoke, targeted correctness, benchmark
+  against baseline, backend proof, and profile/dump evidence when results are
+  close, surprising, or below target.
+- If the performance target is not reached, report current best, bottleneck
+  evidence, rejected lineages, and the next edit. Do not claim performance
+  completion.
+## DCU Triton Rules
+Load [triton-dcu-optimization.md](references/triton-dcu-optimization.md) before
+changing `triton.Config`, block sizes, `num_warps`, `num_stages`,
+`waves_per_eu`, dot layout, or memory tiling.
+Core rules:
+- Treat PyTorch's `torch.cuda` namespace as the ROCm runtime facade.
+- Respect `PYTORCH_ROCM_ARCH` and target `gcnArchName`.
+- Prefer `gfx928`, `gfx936`, `gfx938`, or the user-provided DCU/BW target over
+  NVIDIA architecture names.
+- Translate CUDA terms explicitly: SM to CU, warp to wavefront, shared memory
+  to LDS, tensor core to MFMA/MMAC or the selected ROCm backend.
+- Do not import Nsight/NVIDIA profiler conclusions as DCU evidence.
+- Do not assume Hopper/Blackwell TMA, WGMMA, warp specialization, SASS, PTX,
+  or CUDA shared-memory bank rules.
+- Borrow algorithmic Triton patterns from NVIDIA examples only after validating
+  the generated AMDGPU code object or profiler counters on the target DCU.
+## Validation Commands
+Use local framework tests first. Examples:
+```bash
+python -m pytest tests/kernels -q -k "<operator-or-backend>"
+python benchmarks/kernels/<benchmark>.py <args>
+python -m pytest test -q -k "<operator-or-backend>"
+python benchmark/kernels/<benchmark>.py <args>
+```
+If no suitable test exists, create a minimal correctness and benchmark harness
+under `.humanize/triton-agent/` first. Add a repo test only when the behavior is
+stable enough to belong to the framework's test suite.
+When ready to start Humanize RLCR from the target repo:
+```bash
+"{{HUMANIZE_RUNTIME_ROOT}}/scripts/setup-rlcr-loop.sh" .humanize/triton-agent/refined-plan.md --yolo
+```
+If the runtime is unavailable, continue manually but keep the same state files,
+profiling gates, and evidence discipline.
--- a/humanize/skills/triton-kernel-agent-loop/agents/openai.yaml
+++ b/humanize/skills/triton-kernel-agent-loop/agents/openai.yaml
+interface:
+  display_name: "Triton DCU Agent Loop"
+  short_description: "Optimize vLLM, SGLang, or standalone Triton kernels on DCU"
+  default_prompt: "Use $triton-kernel-agent-loop to optimize a vLLM, SGLang, or user-specified Triton file on DCU with correctness, benchmark, and profiling evidence."
--- a/humanize/skills/triton-kernel-agent-loop/references/framework-entrypoints-vllm-sglang.md
+++ b/humanize/skills/triton-kernel-agent-loop/references/framework-entrypoints-vllm-sglang.md
+# Framework Entrypoints For DCU Triton Work
+Use this file after `triton-kernel-agent-loop` identifies the target framework
+or direct-file mode. Prefer local source over this map when paths differ.
+## Direct File Mode
+Use direct-file mode when the user points at one or more Triton Python files
+instead of a vLLM or SGLang checkout. The file path is the target. The nearest
+project root is useful for imports, tests, and benchmarks, but it is not a
+precondition.
+Search commands:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]" <target-file-or-dir>
+rg -n "<jit_function>|<wrapper>|<kernel_name>|<call_site>" <project-root-or-target-dir>
+rg -n "pytest|unittest|do_bench|benchmark|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
+```
+Record:
+- Target file, project root or standalone parent directory, and import path.
+- `@triton.jit` function, launch wrapper, grid function, constexpr/config
+  surface, `@triton.autotune` keys, and cache behavior.
+- Caller, harness, or proof that no harness exists yet.
+- Correctness reference, tolerance, shape/dtype/layout workload, and benchmark
+  command.
+- Any framework branch that still calls the file.
+Minimal direct-call proof:
+```text
+1. A correctness/benchmark harness directly calls the wrapper with real tensors.
+2. Profiler output, Triton cache, or dump artifact contains the target kernel.
+3. Temporary instrumentation proves the wrapper and config branch were reached.
+4. Warmed timing excludes first-run JIT compile time unless compile time is the target.
+```
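Item 4 of the proof list asks for warmed timing that excludes the first-run JIT compile. One way to keep the two numbers apart is sketched below with a host-side timer. The `sync` parameter is a placeholder: on a GPU it would typically be `torch.cuda.synchronize`, and the default no-op is only there so the sketch runs without a device. Host timers also include launch overhead, so treat the result as a coarse signal next to profiler kernel time.

```python
import time


def time_warmed(fn, warmup=10, repeats=50, sync=lambda: None):
    """Time fn and report the first call separately from warmed repeats.

    Returns (first_call_us, samples_us). The first call counts as one of
    the warmup iterations and is where JIT compilation would normally land.
    """
    sync()
    start = time.perf_counter()
    fn()
    sync()
    first_call_us = (time.perf_counter() - start) * 1e6

    # Remaining warmup iterations are run but not recorded.
    for _ in range(max(warmup - 1, 0)):
        fn()
    sync()

    samples_us = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()  # wait for queued device work before stopping the clock
        samples_us.append((time.perf_counter() - start) * 1e6)
    return first_call_us, samples_us
```

A large gap between `first_call_us` and the warmed samples is expected; a gap that reappears in later samples suggests recompilation for new shapes or configs.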
+Create temporary harnesses under `.humanize/triton-agent/` when the repo has no
+usable test or benchmark. Promote a harness to a real test only after the
+correctness contract and workload are stable.
+## vLLM
+Common Triton/DCU search surface:
+```text
+vllm/envs.py
+vllm/triton_utils/
+vllm/attention/
+vllm/v1/attention/
+vllm/model_executor/layers/
+vllm/model_executor/layers/fused_moe/
+vllm/model_executor/layers/quantization/
+vllm/model_executor/layers/rotary_embedding/
+tests/kernels/
+benchmarks/kernels/
+```
+Search commands:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
+rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|Triton|triton_mla|fused_moe|scaled_mm|fp8|fp4" vllm tests benchmarks
+rg -n "attention_backend|backend registry|AttentionBackend|MLA|decode|prefill" vllm/attention vllm/v1/attention
+```
+Backend signals to record:
+- `VLLM_ROCM_USE_AITER`
+- `VLLM_ROCM_USE_AITER_PAGED_ATTN`
+- `VLLM_ROCM_USE_AITER_LINEAR`
+- `VLLM_ROCM_USE_AITER_MOE`
+- `VLLM_ROCM_USE_AITER_RMSNORM`
+- `VLLM_ROCM_USE_AITER_MLA`
+- `VLLM_ROCM_USE_AITER_MHA`
+- `VLLM_ROCM_USE_AITER_FP8BMM`
+- `VLLM_ROCM_USE_AITER_FP4BMM`
+- `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`
+- `VLLM_ROCM_USE_AITER_TRITON_GEMM`
+- `VLLM_ROCM_FP8_PADDING`
+- `VLLM_ROCM_MOE_PADDING`
+- `VLLM_ROCM_CUSTOM_PAGED_ATTN`
+- `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT`
+- `VLLM_ROCM_FP8_MFMA_PAGE_ATTN`
+- quantization-specific vars such as W8A8 backend selectors in the local tree.
+Typical target families:
+- Triton MLA and decode attention: prove V0/V1 path, prefill/decode mode, page
+  table layout, block size, FP8 KV cache, and fallback backend.
+- Fused MoE: prove expert implementation, routing/topk, dtype, quant scales,
+  block shape, config source, and whether LoRA or EP changes the path.
+- Quantized linear/scaled MM: prove scale layout, group size, block sizes,
+  AITER/Triton selection, and whether padding or shape heuristics are active.
+- Fused elementwise/norm/rope/cache kernels: prove call site, tensor layout,
+  graph capture constraints, and whether a framework C++ kernel already wins.
+## SGLang
+Common Triton/DCU search surface:
+```text
+python/sglang/srt/layers/attention/
+python/sglang/srt/layers/attention/triton_ops/
+python/sglang/srt/layers/moe/
+python/sglang/srt/layers/moe/moe_runner/
+python/sglang/srt/layers/quantization/
+python/sglang/srt/lora/
+python/sglang/jit_kernel/
+sgl-kernel/
+test/
+benchmark/
+```
+Search commands:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
+rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
+rg -n "decode_attention|extend_attention|MLA|KV cache|target_verify|moe_align|topk|tuning" python/sglang benchmark test
+```
+Backend signals to record:
+- `SGLANG_USE_AITER`
+- `--attention-backend`
+- `--moe-runner-backend`
+- quantization mode and model config
+- EP/DP/TP mode when MoE routing or expert parallelism changes shape
+- tuned Triton config file selected by model, dtype, device name, or block shape
+Typical target families:
+- Triton attention backend: separate decode, extend/prefill, target verify,
+  sliding window, MLA, and FP8 KV cache paths.
+- Triton MoE runner: prove runner backend, tuned config, topk, expert count,
+  block shape, EP shape, and fallback.
+- LoRA Triton kernels: prove rank, batching, and dynamic shape behavior.
+- JIT diffusion or multimodal kernels: prove generated kernel path and compile
+  cache stability before tuning.
+## Minimal Backend Proof Pattern
+Use at least one of these before editing:
+```text
+1. Framework debug log or explicit backend print showing the selected backend.
+2. Microbenchmark that directly calls the target wrapper and reports the
+   expected Triton kernel name in profiler output.
+3. Temporary local instrumentation under .humanize/triton-agent/ that prints
+   the selected dispatch branch.
+4. Triton cache/dump artifact whose source hash points to the target function.
+5. End-to-end profiler trace with the target kernel dominating the relevant
+   request segment.
+```
+Remove temporary instrumentation before finalizing unless it is intentionally
+converted into useful framework logging or tests.
--- a/humanize/skills/triton-kernel-agent-loop/references/state-and-artifacts.md
+++ b/humanize/skills/triton-kernel-agent-loop/references/state-and-artifacts.md
+# Triton Agent State And Artifact Templates
+Keep Triton loop records under `.humanize/triton-agent/` in the target vLLM or
+SGLang checkout, in the nearest project root for direct-file mode, or in the
+user-provided workdir when the Triton file is standalone. Do not use
+`.humanize/lightop-agent/` for Triton work.
+## refined-plan.md
+```markdown
+# Refined Plan
+## Target
+- Framework:
+- Repo path:
+- Target mode:
+- Target file:
+- Project root/workdir:
+- Commit:
+- Operator/kernel:
+- User goal:
+## K/R/W/E/B/P
+- K:
+- R:
+- W:
+- E:
+- B:
+- P:
+## Baseline Commands
+- install:
+- import smoke:
+- backend proof:
+- correctness:
+- benchmark:
+- profile:
+## Candidate Plan
+1. Baseline and backend proof.
+2. First edit:
+3. Validation:
+4. Next gate:
+## Stop Conditions
+- Correctness:
+- Performance:
+- Profiling:
+- User constraints:
+```
+## backend-map.md
+```markdown
+# Backend Map
+## Framework Or Call-Site Routing
+- Entry point:
+- Backend registry:
+- Env vars:
+- CLI args:
+- Direct caller/harness:
+- Selected backend:
+- Fallback backend:
+## Triton Kernel
+- Python file/function:
+- JIT/autotune decorators:
+- Config source:
+- Grid function:
+- Kernel name in profiler:
+- Triton cache or dump path:
+## Proof
+- Command:
+- Evidence:
+- Caveats:
+```
+## attempt-ledger.md
+```markdown
+# Attempt Ledger
+| Version | Hypothesis | Files | Correct | p50 us | p90 us | Delta | Profile | Decision |
+| --- | --- | --- | --- | ---: | ---: | ---: | --- | --- |
+| v000 | baseline | none | pass | | | | | baseline |
+```
+## kernel_opt_readme.md
+```markdown
+# Kernel Optimization Report
+## Summary
+- Best version:
+- Baseline:
+- Current best:
+- Target:
+- Selected card:
+- Backend:
+## Validation Matrix
+| Test | Command | Result |
+| --- | --- | --- |
+## Performance Matrix
+| Shape | Dtype | Backend | Config | Baseline us | Candidate us | Delta | Notes |
+| --- | --- | --- | --- | ---: | ---: | ---: | --- |
+## Profile Evidence
+- Baseline artifact:
+- Candidate artifact:
+- Bottleneck:
+- Next edit:
+## Rejected Lineages
+| Version | Reason |
+| --- | --- |
+```
+## lineage.jsonl
+Write one JSON object per benchmarked candidate:
+```json
+{"version":"v001","parent":"v000","hypothesis":"tune BLOCK_N for decode","correct":true,"decision":"keep","artifact":".humanize/triton-agent/profile-artifacts/v001/digest.md"}
+```
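Appending and reading these records needs only the standard library. The sketch below assumes the six keys from the example record are required and that a baseline record has a `parent` that is absent from the file (for example `null`); both are assumptions, as are the helper names.

```python
import json
from pathlib import Path

# Keys taken from the example record above; treating them as required is an assumption.
REQUIRED_KEYS = {"version", "parent", "hypothesis", "correct", "decision", "artifact"}


def append_lineage(path, record):
    """Append one candidate record as a single JSON line."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError("missing keys: " + ", ".join(sorted(missing)))
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def load_lineage(path):
    """Read all records, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def ancestry(records, version):
    """Return the chain of versions from `version` back to its oldest known parent."""
    by_version = {record["version"]: record for record in records}
    chain = []
    while version in by_version and version not in chain:
        chain.append(version)
        version = by_version[version]["parent"]
    return chain
```

Walking the chain with `ancestry` helps when a report needs to list which kept candidates led to the current best and which lineages were rejected.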
--- a/humanize/skills/triton-kernel-agent-loop/references/triton-dcu-optimization.md
+++ b/humanize/skills/triton-kernel-agent-loop/references/triton-dcu-optimization.md
+# Triton DCU Optimization Notes
+Use these notes for DCU/ROCm Triton work. They adapt common Triton skill
+patterns to DCU and intentionally avoid NVIDIA-only assumptions.
+## First Questions
+Before changing code, answer:
+- Is the kernel memory-bound, launch-bound, LDS/resource-bound, dot/MFMA-bound,
+  or dispatch-bound?
+- Is the selected Triton config correct for target `M/N/K`, sequence length,
+  topk, head dim, block size, and dtype?
+- Is the benchmark measuring compile time, graph warmup, framework routing,
+  memory allocation, or the kernel itself?
+- Does the profiler prove the target Triton kernel is hot?
+## Tunable Surface
+Prefer tuning in this order unless evidence says otherwise:
+1. Backend dispatch and shape-specific config selection.
+2. `BLOCK_M`, `BLOCK_N`, `BLOCK_K`, head/block/page dimensions.
+3. `num_warps`, `num_stages`, `waves_per_eu`, `matrix_instr_nonkdim` when the
+   installed Triton AMD backend supports them.
+4. Load/store vectorization, alignment, mask shape, contiguous layout, and
+   redundant global traffic.
+5. Accumulator dtype, dot layout, dequant placement, and epilogue fusion.
+6. Split persistent/stateful kernels only when profiling shows launch overhead
+   or intermediate memory traffic dominates.
+## DCU-Specific Heuristics
+Treat these as hypotheses to verify on the target DCU:
+- Keep `num_stages` small on ROCm unless a deeper pipeline proves better.
+  Single-GEMM kernels often start at `num_stages=2`; fused attention or two-GEMM
+  loops often start at `num_stages=1`.
+- Tune `waves_per_eu` instead of assuming NVIDIA warp occupancy rules.
+- Balance `num_warps` against VGPR pressure and LDS use. More waves can lose
+  when spills or LDS bank pressure increase.
+- For dot-heavy kernels, inspect generated ISA for the expected MFMA/MMAC path
+  before claiming compute utilization.
+- For small decode kernels, launch overhead and framework dispatch can dominate;
+  consider fusion or routing changes only after profiler evidence.
+- For page-table and KV-cache kernels, coalescing, page layout, and mask shape
+  often matter more than arithmetic.
+- For MoE, separate token routing, alignment, expert grouping, GEMM, and
+  epilogue timing. A faster GEMM config can regress total MoE if routing or
+  padding grows.
+- For FP8/FP4, validate data format, scale layout, saturation constants, and
+  ROCm/Triton support. Do not assume NVIDIA E4M3/E5M2 behavior maps exactly to
+  the current DTK target.
+## Patterns Worth Borrowing Carefully
+From general Triton skill material, these are portable when revalidated:
+- Online softmax for attention.
+- Boundary masks written once and reused.
+- Stride-based addressing instead of contiguous assumptions.
+- Shape-keyed config maps.
+- Direct microbench functions around a JIT kernel.
+- Fused norm, activation, scale, and store epilogues when they remove global
+  reads/writes.
+- Dynamic launcher tiling based on sequence length, head dim, dtype, and topk.
+These are not portable without DCU proof:
+- Nsight Compute metrics.
+- PTX/SASS conclusions.
+- Hopper/Blackwell TMA, WGMMA, warp specialization, or CUDA shared-memory bank
+  rules.
+- NVIDIA device-name config tables.
+- CUDA-only FP4/FP8 assumptions.
+## Config Sweep Discipline
+When tuning configs, record every tested candidate in
+`.humanize/triton-agent/tuning-decisions.md`:
+```text
+shape/dtype/backend/gfx
+config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
+correctness: pass/fail and tolerance
+latency: p50/p90/mean, repeats, selected card
+profile clue: launch/memory/LDS/resource/compute/dispatch
+decision: keep/reject/inconclusive
+```
+Only promote a config when correctness passes, improvement exceeds the noise
+band, no important serving shape regresses outside the accepted tradeoff,
+backend proof still selects the target Triton kernel, and the generated DCU
+code path is plausible from profiler or ISA evidence.
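The promotion rule above can be written as an explicit gate. This is a sketch under stated assumptions: the field names are hypothetical, the noise band is expressed as a fraction of baseline p50, and the mapping of failed conditions to `reject` versus `inconclusive` is an illustrative choice that the text does not prescribe.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    correct: bool               # correctness passed within tolerance
    baseline_p50_us: float
    candidate_p50_us: float
    noise_band: float           # accepted noise as a fraction, e.g. 0.03
    regressed_shapes: int       # important serving shapes outside the tradeoff
    backend_proof: bool         # target Triton kernel is still selected
    code_path_plausible: bool   # profiler or ISA evidence supports the result


def decide(c):
    """Return "keep", "reject", or "inconclusive" for one tested config."""
    if not c.correct:
        return "reject"
    gain = (c.baseline_p50_us - c.candidate_p50_us) / c.baseline_p50_us
    if gain < -c.noise_band:
        return "reject"          # slower than baseline beyond noise
    if gain <= c.noise_band:
        return "inconclusive"    # change is inside the noise band
    if c.regressed_shapes or not c.backend_proof or not c.code_path_plausible:
        return "inconclusive"    # faster, but a promotion condition is unmet
    return "keep"
```

Only `keep` corresponds to promoting a config; the other two outcomes still belong in `tuning-decisions.md` so that rejected settings are not retested.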
+## Benchmark Shape Coverage
+For attention:
+```text
+prefill: representative prompt lengths, batch sizes, head dims
+decode: batch sizes, page/block sizes, kv lengths, topk or speculative paths
+MLA: q/nope/pe dims, cache dtype, block size, split prefill/decode behavior
+```
+For MoE:
+```text
+tokens per expert distribution
+topk and expert count
+hidden/intermediate dims
+quant mode and scale layout
+small-batch decode and large prefill separately
+EP/TP/DP shape when enabled
+```
+For quantized GEMM:
+```text
+M/N/K sweep around model shapes
+scale granularity
+padding and alignment
+batch-invariant or graph-captured paths
+```
--- a/humanize/skills/triton-kernel-knowledge/SKILL.md
+++ b/humanize/skills/triton-kernel-knowledge/SKILL.md
+---
+name: triton-kernel-knowledge
+description: "Find citable evidence for DCU/ROCm Triton kernel work in vLLM, SGLang, or a user-specified Triton Python file: inspect local framework source/tests/benchmarks or direct file call sites, query the KernelPilot PR corpus for vLLM/SGLang/Triton/ROCm/AITER changes, consult official Triton/ROCm/DTK/DCU docs, and translate CUDA/NVIDIA examples only as cross-platform inspiration. Use before implementing or tuning Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing kernels, standalone Triton files, or direct microbench harnesses on DCU."
+---
+# Triton Kernel Knowledge
+Use this skill to turn a vLLM/SGLang or direct-file Triton-on-DCU question into
+citable implementation evidence. This skill is independent of LightOp. It may
+query the existing KernelPilot corpus, but its findings must concern Triton,
+framework routing or direct call sites, ROCm/DCU tooling, or portable
+algorithmic ideas.
+## Evidence Priority
+Use routes in this order unless the user asks for a different source:
+1. Local target checkout or direct file: vLLM/SGLang source, tests, benchmarks,
+   config files, docs, current diffs, direct call sites, and harnesses.
+2. KernelPilot local PR corpus and source maps for vLLM, SGLang, Triton,
+   PyTorch ROCm, AITER, AOTriton, Conch, FlagGems, Liger Kernel, Hugging Face
+   kernels, Triton-distributed, FlashAttention, FlashInfer, and related repos.
+3. Official docs and upstream source for Triton, ROCm, DTK/DCU, vLLM, SGLang,
+   PyTorch ROCm, AITER, rocprof, rocprof-compute, and profiler tooling.
+4. CUDA/NVIDIA examples only as cross-platform inspiration after translation.
+Write findings into `.humanize/triton-agent/research-digest.md`, or into the
+note the user requests, before they are allowed to drive code changes.
+## Route A: Local Source
+For a user-specified Triton file:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\\[grid\\]" <target-file-or-dir>
+rg -n "<kernel_name>|<wrapper_name>|<function_name>" <project-root-or-target-dir>
+rg -n "pytest|benchmark|do_bench|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
+```
+Record:
+- Target file and project root, if any.
+- Triton JIT function, launch wrapper, grid function, configs, and autotune keys.
+- Caller, test, and benchmark, or a note that no harness exists yet.
+- Correctness reference and shape/dtype workload.
+- Any framework path that still calls this file.
+For vLLM:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
+rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|triton_mla|fused_moe|scaled_mm|fp8|fp4|attention backend" vllm tests benchmarks
+rg -n "<operator>|<kernel>|<backend>|<env_var>" vllm tests benchmarks
+```
+For SGLang:
+```bash
+rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
+rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
+rg -n "<operator>|<kernel>|<backend>|<env_var>" python/sglang sgl-kernel test benchmark
+```
+Record:
+- Framework commit and local diff state.
+- Public entry point and backend registry branch.
+- Triton JIT function, launch wrapper, grid function, and config/autotune table.
+- Correctness tests and benchmarks.
+- Backend env vars or CLI args needed to select the target path.
+- Fallback branch and competing backend such as AITER, FlashInfer, TRTLLM,
+  framework C++ kernels, or PyTorch native.
+## Route B: KernelPilot Corpus
+Run from `{{KERNELPILOT_ROOT}}/knowledge` when installed through KernelPilot,
+or from the local `kernel-pilot/knowledge` checkout.
+```bash
+python3 scripts/query.py "vllm rocm aiter triton <operator> <dtype>" --compact --limit 30
+python3 scripts/query.py "sglang amd aiter triton <operator> <dtype>" --compact --limit 30
+python3 scripts/query.py "triton amd backend rocm <operator> <symptom>" --compact --limit 30
+python3 scripts/query.py "aiter conch flaggems liger triton <operator>" --type source-reference --compact --limit 30
+python3 scripts/search-pr-diffs.py triton rocm <operator> --any --limit 100
+python3 scripts/get_page.py <page-id>
+```
+Load [sources-and-queries.md](references/sources-and-queries.md) for query
+families, relevant PR themes, and citation shape.
+When a result names an artifact bundle, inspect the underlying `review.diff`,
+`source-snapshot`, `upstream.json`, and `ORIGIN.yaml` before copying or
+adapting ideas.
+## Route C: Official Docs And Upstream
+Prefer current official documentation or upstream source for claims about:
+- Triton language, `triton.Config`, `@triton.autotune`, compiler dumps, and AMD
+  backend options.
+- ROCm/DTK profiler commands, counter availability, SQTT, rocprofv3,
+  rocprof-compute, and code object inspection.
+- vLLM ROCm install, AITER integration, V1 attention backends, MoE, and
+  quantized kernels.
+- SGLang AMD install, attention backend, MoE runner backend, AITER/Triton
+  selection, and quantization support.
+- Hygon/DCU and SourceFind DTK tool behavior.
+Useful official URLs:
+```text
+https://triton-lang.org/main/index.html
+https://triton-lang.org/main/python-api/generated/triton.autotune.html
+https://rocm.docs.amd.com/en/docs-6.3.3/how-to/rocm-for-ai/inference-optimization/optimizing-triton-kernel.html
+https://rocm.docs.amd.com/en/develop/how-to/rocm-for-ai/inference-optimization/workload.html
+https://docs.vllm.ai/en/latest/getting_started/installation/gpu/
+https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html
+https://docs.sglang.io/docs/hardware-platforms/amd_gpu
+https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html
+```
+Browse the official docs or clone the upstream repository when the local corpus
+is stale, when the user asks for the latest behavior, or when a claim is
+version-sensitive.
+## Route D: Cross-Platform Inspiration
+Use general Triton or CUDA/NVIDIA material only after classifying it:
+```text
+directly portable: Triton language pattern with no hardware-specific claim
+needs DCU proof: tiling, pipeline depth, dot layout, FP8/FP4, cache behavior
+not portable: Nsight metrics, PTX/SASS, TMA, WGMMA, CUDA bank rules
+```
+Translate terms explicitly:
+```text
+SM -> CU
+warp -> wavefront where appropriate
+shared memory -> LDS
+tensor core -> MFMA/MMAC or selected ROCm backend
+SASS/PTX -> AMDGPU ISA/code object
+Nsight Compute -> hipprof/rocprofv3/rocprof-compute/DCU profiler
+```
+Do not invent Hygon/DCU builtins, counter names, or ISA expectations from CUDA
+mnemonics. Prefer evidence from compiling for the target and from profiling on
+it.
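A minimal sketch of how the term translation above could be applied when reading CUDA/NVIDIA notes: it flags CUDA-side terms and pairs each with its entry from the table. The function name and the matching rules are assumptions, and the mapped values are review hints rather than API names.

```python
import re

# Copied from the translation table above; the right-hand side is a hint only.
CUDA_TO_DCU = {
    "Nsight Compute": "hipprof/rocprofv3/rocprof-compute/DCU profiler",
    "shared memory": "LDS",
    "tensor core": "MFMA/MMAC or selected ROCm backend",
    "SASS": "AMDGPU ISA/code object",
    "PTX": "AMDGPU ISA/code object",
    "warp": "wavefront where appropriate",
    "SM": "CU",
}


def flag_cuda_terms(text: str) -> list:
    """Return (cuda_term, dcu_hint) pairs found in text, in table order."""
    hits = []
    for cuda, dcu in CUDA_TO_DCU.items():
        # Acronyms match case-sensitively so "SM" does not match "sm" in prose.
        flags = 0 if cuda.isupper() else re.IGNORECASE
        # Word boundaries keep "SM" out of "SMEM" and "warp" out of "num_warps".
        if re.search(rf"\b{re.escape(cuda)}s?\b", text, flags):
            hits.append((cuda, dcu))
    return hits
```

Because `_` counts as a word character, a Triton parameter such as `num_warps` is deliberately not flagged; only free-standing CUDA terms in prose or comments are reported for manual translation.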
+## Research Digest Format
+Every finding should contain:
+```text
+Route:
+Source path or URL:
+Commit/version/date when available:
+Relevant file/function/config/test:
+Finding:
+Impact on plan:
+Portability: direct DCU evidence | ROCm upstream evidence | cross-platform idea
+License/notice if code is copied or adapted:
+```
+If a route returns nothing, record the search terms and why the absence
+matters. When evidence is thin, consult another route before implementing.
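The digest format above can be enforced with a small formatter, sketched here. The function name, the set of fields treated as required, and the `n/a` filler for missing optional fields are assumptions; only the field labels and the three portability values come from the template.

```python
DIGEST_FIELDS = [
    "Route",
    "Source path or URL",
    "Commit/version/date when available",
    "Relevant file/function/config/test",
    "Finding",
    "Impact on plan",
    "Portability",
    "License/notice if code is copied or adapted",
]
PORTABILITY = {"direct DCU evidence", "ROCm upstream evidence", "cross-platform idea"}
# Assumed minimum for a usable finding; adjust to the project's own policy.
REQUIRED = {"Route", "Source path or URL", "Finding", "Impact on plan", "Portability"}


def format_finding(entry: dict) -> str:
    """Render one finding in template order, rejecting malformed entries."""
    unknown = set(entry) - set(DIGEST_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    missing = REQUIRED - set(entry)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if entry["Portability"] not in PORTABILITY:
        raise ValueError(f"Portability must be one of {sorted(PORTABILITY)}")
    lines = [f"{name}: {entry.get(name, 'n/a')}" for name in DIGEST_FIELDS]
    return "\n".join(lines) + "\n"
```

Appending the returned text to the research digest keeps every finding in the same field order, which makes thin or unclassified evidence easy to spot during review.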
--- a/humanize/skills/triton-kernel-knowledge/agents/openai.yaml
+++ b/humanize/skills/triton-kernel-knowledge/agents/openai.yaml
+interface:
+  display_name: "Triton DCU Knowledge"
+  short_description: "Find Triton, ROCm, DCU, vLLM, and SGLang evidence"
+  default_prompt: "Use $triton-kernel-knowledge to research DCU evidence for a vLLM, SGLang, or direct-file Triton kernel optimization."