# Triton/DCU Skills

Chinese reading version: [`triton-skills.zh-CN.md`](triton-skills.zh-CN.md).

KernelPilot includes an independent Triton/DCU skill pack for optimizing Triton kernels inside vLLM, inside SGLang, or in a user-specified Triton Python file. It is separate from the LightOp skill pack and uses its own task state directory:

```text
.humanize/triton-agent/
```

## Skills

| Skill | Purpose |
| --- | --- |
| `triton-kernel-agent-loop` | Main loop for vLLM/SGLang Triton attention, MLA, MoE, quantization, fused norm, cache, sampler, routing, small JIT kernels, and direct Triton files on DCU/ROCm. |
| `triton-kernel-knowledge` | Evidence search for local vLLM/SGLang source, direct-file call sites and harnesses, KernelPilot PR corpus, Triton/ROCm/DTK/DCU docs, and portable cross-platform ideas. |
| `triton-dcu-profiler-report` | DCU profiler digest for framework or standalone Triton JIT kernels, including backend/call-site proof, hipprof/rocprofv3/rocprof-compute evidence, Triton cache/IR dumps, and AMDGPU ISA/code-object clues. |

## Open Kernel References

The Triton knowledge route includes source-reference pages for high-value open Triton kernel libraries:

```text
ref-rocm-aiter
ref-rocm-aotriton
ref-stackav-conch
ref-flaggems
ref-liger-kernel
ref-huggingface-kernels
ref-triton-distributed
```

Use them as reference implementations or discovery routes, then validate correctness, benchmark, profiler names, Triton cache/IR, and DCU ISA locally before promoting any idea.

## Install

The standard installers include both LightOp and Triton skill packs:

```bash
./install-lightop-skills-manual.sh --target both
./humanize/scripts/install-skill.sh --target codex --kernelpilot-root "$PWD"
```

## Example Prompt: Framework Mode

```text
@triton-kernel-agent-loop

Framework: vLLM
Repo path: /path/to/vllm
Container:
Repo path inside container: /workspace/vllm
Task: optimize the Triton MLA decode or fused MoE kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing framework path or PyTorch reference
Performance target: p50 latency improvement above benchmark noise band

Requirements:
- Prove that the Triton backend, not AITER/FlashInfer/TRTLLM/C++ fallback, is selected.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
- Use hy-smi or rocm-smi before performance runs.
- Use hipprof/rocprofv3 and Triton cache/IR/ISA evidence when benchmark results are close or below target.
```

## Example Prompt: Direct File Mode

```text
@triton-kernel-agent-loop

Target mode: direct-file
Target file: /path/to/project/kernels/my_triton_kernel.py
Target function or wrapper:
Project root or workdir: /path/to/project
Container:
Workdir inside container: /workspace/project
Task: optimize this Triton kernel on DCU.
Target arch: gfx936 or gfx938
Correctness reference: existing Python/PyTorch reference, test, or oracle
Workload: shape/dtype/layout distribution and representative benchmark command
Performance target: p50 latency improvement above benchmark noise band

Requirements:
- First identify the @triton.jit function, launch wrapper, grid, configs, caller, and harness.
- If no harness exists, create a temporary correctness/benchmark harness under .humanize/triton-agent/.
- Prove the direct call reaches the target Triton kernel with profiler kernel names, Triton cache/dumps, or temporary instrumentation.
- Store all loop state under .humanize/triton-agent/.
- Run correctness before benchmark.
```
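
Both example prompts ask for correctness before benchmarking and for a p50 latency improvement above the benchmark noise band. The sketch below shows one way a temporary harness could encode that gate. It is a minimal illustration, not KernelPilot's implementation: the function names (`p50`, `noise_band`, `outputs_match`, `gate`), the tolerances, and the noise-band definition `(p90 - p10) / p50` of the baseline are all assumptions made for this example. It takes already-collected latency samples in milliseconds; on a real DCU run those samples would need to come from a timer that synchronizes with the device (for example a helper such as `triton.testing.do_bench`), since plain wall-clock timing of asynchronous kernel launches is misleading.

```python
import math
import statistics
from typing import Sequence


def p50(samples_ms: Sequence[float]) -> float:
    """Median latency in milliseconds."""
    return statistics.median(samples_ms)


def noise_band(samples_ms: Sequence[float]) -> float:
    """Relative run-to-run spread of the samples: (p90 - p10) / p50.

    Assumption: this definition is illustrative only; pick whatever
    noise measure your benchmark setup actually justifies.
    """
    deciles = statistics.quantiles(samples_ms, n=10)  # 9 cut points: p10..p90
    return (deciles[-1] - deciles[0]) / p50(samples_ms)


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rel_tol: float = 1e-3, abs_tol: float = 1e-5) -> bool:
    """Element-wise comparison against the correctness reference."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def gate(reference_out: Sequence[float], candidate_out: Sequence[float],
         baseline_ms: Sequence[float], candidate_ms: Sequence[float]) -> str:
    """Check correctness first, then compare p50 gain to the noise band."""
    if not outputs_match(reference_out, candidate_out):
        return "correctness-failed"  # never look at timings of a wrong kernel
    gain = 1.0 - p50(candidate_ms) / p50(baseline_ms)
    return "promote" if gain > noise_band(baseline_ms) else "within-noise"
```

A `within-noise` result corresponds to the case where the Framework Mode prompt calls for profiler and Triton cache/IR/ISA evidence instead of relying on the benchmark number alone.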