# Triton DCU Optimization Notes

Use these notes for DCU/ROCm Triton work. They adapt common Triton skill patterns to DCU and intentionally avoid NVIDIA-only assumptions.

## First Questions

Before changing code, answer:

- Is the kernel memory-bound, launch-bound, LDS/resource-bound, dot/MFMA-bound, or dispatch-bound?
- Is the selected Triton config correct for target `M/N/K`, sequence length, topk, head dim, block size, and dtype?
- Is the benchmark measuring compile time, graph warmup, framework routing, memory allocation, or the kernel itself?
- Does the profiler prove the target Triton kernel is hot?

## Tunable Surface

Prefer tuning in this order unless evidence says otherwise:

1. Backend dispatch and shape-specific config selection.
2. `BLOCK_M`, `BLOCK_N`, `BLOCK_K`, head/block/page dimensions.
3. `num_warps`, `num_stages`, `waves_per_eu`, `matrix_instr_nonkdim` when the installed Triton AMD backend supports them.
4. Load/store vectorization, alignment, mask shape, contiguous layout, and redundant global traffic.
5. Accumulator dtype, dot layout, dequant placement, and epilogue fusion.
6. Split persistent/stateful kernels only when profiling shows launch overhead or intermediate memory traffic dominates.

## DCU-Specific Heuristics

Treat these as hypotheses to verify on the target DCU:

- Keep `num_stages` small on ROCm unless a deeper pipeline proves better. Single-GEMM kernels often start at `num_stages=2`; fused attention or two-GEMM loops often start at `num_stages=1`.
- Tune `waves_per_eu` instead of assuming NVIDIA warp occupancy rules.
- Balance `num_warps` against VGPR pressure and LDS use. More waves can lose when spills or LDS bank pressure increase.
- For dot-heavy kernels, inspect generated ISA for the expected MFMA/MMAC path before claiming compute utilization.
- For small decode kernels, launch overhead and framework dispatch can dominate; consider fusion or routing changes only after profiler evidence.
- For page-table and KV-cache kernels, coalescing, page layout, and mask shape often matter more than arithmetic.
- For MoE, separate token routing, alignment, expert grouping, GEMM, and epilogue timing. A faster GEMM config can regress total MoE if routing or padding grows.
- For FP8/FP4, validate data format, scale layout, saturation constants, and ROCm/Triton support. Do not assume NVIDIA E4M3/E5M2 behavior maps exactly to the current DTK target.

## Patterns Worth Borrowing Carefully

From general Triton skill material, these are portable when revalidated:

- Online softmax for attention.
- Boundary masks written once and reused.
- Stride-based addressing instead of contiguous assumptions.
- Shape-keyed config maps.
- Direct microbench functions around a JIT kernel.
- Fused norm, activation, scale, and store epilogues when they remove global reads/writes.
- Dynamic launcher tiling based on sequence length, head dim, dtype, and topk.

These are not portable without DCU proof:

- Nsight Compute metrics.
- PTX/SASS conclusions.
- Hopper/Blackwell TMA, WGMMA, warp specialization, or CUDA shared-memory bank rules.
- NVIDIA device-name config tables.
- CUDA-only FP4/FP8 assumptions.

## Config Sweep Discipline

When tuning configs, record every tested candidate in `.humanize/triton-agent/tuning-decisions.md`:

```text
shape/dtype/backend/gfx
config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
correctness: pass/fail and tolerance
latency: p50/p90/mean, repeats, selected card
profile clue: launch/memory/LDS/resource/compute/dispatch
decision: keep/reject/inconclusive
```

Only promote a config when correctness passes, improvement exceeds the noise band, no important serving shape regresses outside the accepted tradeoff, backend proof still selects the target Triton kernel, and the generated DCU code path is plausible from profiler or ISA evidence.
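
The promotion rule can be sketched as a small gate function that maps one tested candidate to a `keep/reject/inconclusive` decision. This is a minimal sketch, not part of any existing tool: `Candidate`, `should_promote`, and every field name are hypothetical, and the 3% default noise band is an assumed placeholder to replace with the run-to-run variation measured on the target card.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One tested config for one shape (all field names are hypothetical)."""
    correctness_pass: bool
    baseline_p50_us: float
    candidate_p50_us: float
    # Relative p50 slowdown on other serving shapes, e.g. {"decode_bs1": 0.01} = 1% slower.
    other_shape_regressions: dict = field(default_factory=dict)
    backend_selects_target_kernel: bool = False
    isa_path_plausible: bool = False


def should_promote(c, noise_band=0.03, accepted_regression=0.0):
    """Return (decision, reason) by checking the five promotion conditions."""
    if not c.correctness_pass:
        return "reject", "correctness failed"
    if not c.backend_selects_target_kernel:
        return "inconclusive", "no proof the target Triton kernel is selected"
    if not c.isa_path_plausible:
        return "inconclusive", "generated code path not confirmed"
    # Relative p50 improvement; assumes baseline_p50_us > 0.
    speedup = (c.baseline_p50_us - c.candidate_p50_us) / c.baseline_p50_us
    if speedup <= noise_band:
        return "reject", f"improvement {speedup:.1%} is within the noise band"
    worst = max(c.other_shape_regressions.values(), default=0.0)
    if worst > accepted_regression:
        return "reject", f"another serving shape regresses by {worst:.1%}"
    return "keep", f"improvement {speedup:.1%} exceeds the noise band"
```

Missing backend or ISA evidence yields `inconclusive` rather than `reject`, so a fast but unproven config is re-examined instead of silently dropped.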
## Benchmark Shape Coverage

For attention:

```text
prefill: representative prompt lengths, batch sizes, head dims
decode: batch sizes, page/block sizes, kv lengths, topk or speculative paths
MLA: q/nope/pe dims, cache dtype, block size, split prefill/decode behavior
```

For MoE:

```text
tokens per expert distribution
topk and expert count
hidden/intermediate dims
quant mode and scale layout
small-batch decode and large prefill separately
EP/TP/DP shape when enabled
```

For quantized GEMM:

```text
M/N/K sweep around model shapes
scale granularity
padding and alignment
batch-invariant or graph-captured paths
```
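
One way to build the quantized GEMM sweep is to enumerate `M` values around each model `(N, K)` pair and include the off-by-one neighbors, so padded and unaligned paths are timed next to the aligned ones. The sketch below is illustrative only: `gemm_m_sweep` is a hypothetical helper, the `(N, K)` values in the usage lines are placeholders rather than real model dimensions, and `align=16` is an assumed tile multiple, not a DCU requirement.

```python
from itertools import product


def gemm_m_sweep(model_nk, m_values, align=16):
    """Yield (M, N, K, is_aligned) tuples around the given model shapes.

    For each base M, M - 1 and M + 1 are also emitted so that unaligned,
    padded paths are benchmarked alongside the aligned one.
    """
    seen = set()
    for (n, k), m in product(model_nk, m_values):
        for cand in (m - 1, m, m + 1):
            # Skip non-positive M and shapes already emitted.
            if cand < 1 or (cand, n, k) in seen:
                continue
            seen.add((cand, n, k))
            yield cand, n, k, cand % align == 0


# Placeholder N/K; substitute the projection shapes of the served model.
shapes = list(gemm_m_sweep([(4096, 4096)], [1, 16]))
aligned = [s for s in shapes if s[3]]
```

Keeping the alignment flag in each tuple makes it easy to report aligned and padded latencies separately in the tuning record.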