
# Triton/DCU Skills

Chinese version: [`triton-skills.zh-CN.md`](triton-skills.zh-CN.md).

KernelPilot includes an independent Triton/DCU skill pack for optimizing Triton
kernels inside vLLM, inside SGLang, or in a user-specified Triton Python file.
It is separate from the LightOp skill pack and uses its own task state
directory:

```text
.humanize/triton-agent/
```

## Skills

| Skill | Purpose |
| --- | --- |
| `triton-kernel-agent-loop` | Main loop for vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing, small JIT kernels, and direct Triton files on DCU/ROCm. |
| `triton-kernel-knowledge` | Evidence search for local vLLM/SGLang source, direct-file call sites and harnesses, KernelPilot PR corpus, Triton/ROCm/DTK/DCU docs, and portable cross-platform ideas. |
| `triton-dcu-profiler-report` | DCU profiler digest for framework or standalone Triton JIT kernels, including backend/call-site proof, hipprof/rocprofv3/rocprof-compute evidence, Triton cache/IR dumps, and AMDGPU ISA/code-object clues. |

## Open Kernel References

The Triton knowledge route includes source-reference pages for high-value open
Triton kernel libraries:

```text
ref-rocm-aiter
ref-rocm-aotriton
ref-stackav-conch
ref-flaggems
ref-liger-kernel
ref-huggingface-kernels
ref-triton-distributed
```

Use them as reference implementations or discovery routes, then validate
correctness, benchmark results, profiler kernel names, Triton cache/IR, and DCU
ISA locally before promoting any idea.

## Install

The standard installers include both the LightOp and Triton skill packs:

```bash
./install-lightop-skills-manual.sh --target both
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```

## Example Prompt: Framework Mode

```text
@triton-kernel-agent-loop
Framework: vLLM
Repo path: /path/to/vllm
Container: <container-name>
Repo path inside container: /workspace/vllm

Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing framework path or PyTorch reference
Performance target: p50 latency improvement above benchmark noise band

Requirements:
- Prove that the Triton backend, not AITER/FlashInfer/TRTLLM/C++ fallback, is selected.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
- Use hy-smi or rocm-smi before performance runs.
- Use hipprof/rocprofv3 and Triton cache/IR/ISA evidence when benchmark results are close to or below target.
```
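
The performance target above ("p50 latency improvement above benchmark noise band") can be made concrete in more than one way. The sketch below shows one possible reading, not the skill pack's own definition: it assumes the noise band is half the spread of p50 across repeated baseline runs, and the helper names `p50`, `noise_band`, and `beats_noise` are hypothetical.

```python
import statistics


def p50(samples):
    """Median latency of one run's per-iteration timings."""
    return statistics.median(samples)


def noise_band(baseline_runs):
    """Half the spread of p50 across repeated baseline runs.

    Assumption: this is an illustrative definition, not one taken
    from the skill pack itself.
    """
    medians = [p50(run) for run in baseline_runs]
    return (max(medians) - min(medians)) / 2


def beats_noise(baseline_runs, candidate_runs):
    """True if the candidate's p50 gain exceeds the baseline noise band."""
    base = statistics.median([p50(run) for run in baseline_runs])
    cand = statistics.median([p50(run) for run in candidate_runs])
    return (base - cand) > noise_band(baseline_runs)
```

Each run is a list of latencies in the same unit (for example, microseconds). Repeating the baseline several times before comparing helps keep a single lucky run from being reported as a speedup.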

## Example Prompt: Direct File Mode

```text
@triton-kernel-agent-loop
Target mode: direct-file
Target file: /path/to/project/kernels/my_triton_kernel.py
Target function or wrapper: <kernel_name_or_wrapper>
Project root or workdir: /path/to/project
Container: <container-name>
Workdir inside container: /workspace/project

Task: optimize this Triton kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing Python/PyTorch reference, test, or oracle
Workload: shape/dtype/layout distribution and representative benchmark command
Performance target: p50 latency improvement above benchmark noise band

Requirements:
- First identify the @triton.jit function, launch wrapper, grid, configs, caller, and harness.
- If no harness exists, create a temporary correctness/benchmark harness under .humanize/triton-agent/.
- Prove the direct call reaches the target Triton kernel with profiler kernel names, Triton cache/dumps, or temporary instrumentation.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
```