# Triton DCU Optimization Notes

Use these notes for DCU/ROCm Triton work. They adapt common Triton skill
patterns to DCU and intentionally avoid NVIDIA-only assumptions.

## First Questions

Before changing code, answer:

- Is the kernel memory-bound, launch-bound, LDS/resource-bound, dot/MFMA-bound,
  or dispatch-bound?
- Is the selected Triton config correct for target `M/N/K`, sequence length,
  topk, head dim, block size, and dtype?
- Is the benchmark measuring compile time, graph warmup, framework routing,
  memory allocation, or the kernel itself?
- Does the profiler prove the target Triton kernel is hot?
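
For the memory-bound versus compute-bound part of the first question, a roofline estimate gives a quick prior before any profiling. The sketch below is a minimal example for a plain GEMM, assuming each operand is read once and the output is written once. The peak throughput and bandwidth values are hypothetical placeholders, not DCU specifications, and the function names are illustrative.

```python
def gemm_arithmetic_intensity(m, n, k, in_bytes=2, out_bytes=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n], assuming ideal reuse."""
    flops = 2 * m * n * k
    traffic = (m * k + k * n) * in_bytes + m * n * out_bytes
    return flops / traffic


def classify_gemm(m, n, k, peak_tflops, peak_gbps, in_bytes=2, out_bytes=2):
    """Return 'memory-bound' or 'compute-bound' from a simple roofline model."""
    # Ridge point: the intensity at which compute and bandwidth limits meet.
    ridge = (peak_tflops * 1e12) / (peak_gbps * 1e9)
    intensity = gemm_arithmetic_intensity(m, n, k, in_bytes, out_bytes)
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical placeholder peaks; replace with measured numbers for the target card.
PEAK_TFLOPS = 100.0
PEAK_GBPS = 1000.0

print(classify_gemm(1, 4096, 4096, PEAK_TFLOPS, PEAK_GBPS))     # decode-like shape
print(classify_gemm(4096, 4096, 4096, PEAK_TFLOPS, PEAK_GBPS))  # prefill-like shape
```

This model says nothing about launch-bound, LDS/resource-bound, or dispatch-bound behavior; those still need profiler evidence.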

## Tunable Surface

Prefer tuning in this order unless evidence says otherwise:

1. Backend dispatch and shape-specific config selection.
2. `BLOCK_M`, `BLOCK_N`, `BLOCK_K`, head/block/page dimensions.
3. `num_warps`, `num_stages`, `waves_per_eu`, `matrix_instr_nonkdim` when the
   installed Triton AMD backend supports them.
4. Load/store vectorization, alignment, mask shape, contiguous layout, and
   redundant global traffic.
5. Accumulator dtype, dot layout, dequant placement, and epilogue fusion.
6. Kernel restructuring (split, persistent, or stateful kernels), only when
   profiling shows that launch overhead or intermediate memory traffic dominates.
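
Item 1 (shape-specific config selection) and the backend guard in item 3 can be sketched as a small lookup. This is a minimal illustration, not a tuned table: the bucket boundaries, every config value, and the names `CONFIGS_BY_MAX_M`, `AMD_EXTRAS`, and `select_config` are hypothetical. Whether the AMD-specific keys are accepted depends on the installed Triton backend, so the caller passes that in explicitly.

```python
import bisect

# Hypothetical table keyed by an upper bound on M; values are illustrative only.
CONFIGS_BY_MAX_M = {
    16: {"BLOCK_M": 16, "BLOCK_N": 64, "BLOCK_K": 128, "num_warps": 4, "num_stages": 2},
    256: {"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 64, "num_warps": 4, "num_stages": 2},
    4096: {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32, "num_warps": 8, "num_stages": 2},
}

# Backend-specific knobs are kept separate so they can be dropped when unsupported.
AMD_EXTRAS = {"waves_per_eu": 2, "matrix_instr_nonkdim": 16}


def select_config(m, amd_extras_supported):
    """Pick the smallest bucket whose bound covers m; clamp to the largest bucket."""
    bounds = sorted(CONFIGS_BY_MAX_M)
    index = min(bisect.bisect_left(bounds, m), len(bounds) - 1)
    config = dict(CONFIGS_BY_MAX_M[bounds[index]])  # copy so callers cannot mutate the table
    if amd_extras_supported:
        config.update(AMD_EXTRAS)
    return config


print(select_config(1, amd_extras_supported=False))
print(select_config(1000, amd_extras_supported=True))
```

Keying on a single dimension keeps the sketch short; a real map would likely also key on `N`, `K`, dtype, and the gfx target.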

## DCU-Specific Heuristics

Treat these as hypotheses to verify on the target DCU:

- Keep `num_stages` small on ROCm unless a deeper pipeline proves better.
  Single-GEMM kernels often start at `num_stages=2`; fused attention or two-GEMM
  loops often start at `num_stages=1`.
- Tune `waves_per_eu` instead of assuming NVIDIA warp occupancy rules.
- Balance `num_warps` against VGPR pressure and LDS use. Higher occupancy can be
  slower overall when it causes register spills or more LDS bank conflicts.
- For dot-heavy kernels, inspect generated ISA for the expected MFMA/MMAC path
  before claiming compute utilization.
- For small decode kernels, launch overhead and framework dispatch can dominate;
  consider fusion or routing changes only after profiler evidence.
- For page-table and KV-cache kernels, coalescing, page layout, and mask shape
  often matter more than arithmetic.
- For MoE, separate token routing, alignment, expert grouping, GEMM, and
  epilogue timing. A faster GEMM config can regress total MoE if routing or
  padding grows.
- For FP8/FP4, validate data format, scale layout, saturation constants, and
  ROCm/Triton support. Do not assume NVIDIA E4M3/E5M2 behavior maps exactly to
  the current DTK target.

## Patterns Worth Borrowing Carefully

From general Triton skill material, these are portable when revalidated:

- Online softmax for attention.
- Boundary masks written once and reused.
- Stride-based addressing instead of contiguous assumptions.
- Shape-keyed config maps.
- Direct microbench functions around a JIT kernel.
- Fused norm, activation, scale, and store epilogues when they remove global
  reads/writes.
- Dynamic launcher tiling based on sequence length, head dim, dtype, and topk.
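
As a reference for the first portable pattern, online softmax can be written in plain Python. This is a scalar sketch of the running-max and running-sum recurrence, not a Triton kernel; the block size and function name are illustrative.

```python
import math


def online_softmax(scores, block=4):
    """Softmax over a list, consuming `block` elements at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new max, then add this block.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in chunk)
        running_max = new_max
    return [math.exp(s - running_max) / running_sum for s in scores]


probs = online_softmax([1.0, 3.0, -2.0, 0.5, 7.0, 2.0], block=2)
print(probs, sum(probs))
```

In an attention kernel the same rescale factor is also applied to the output accumulator, which is what lets the kernel avoid materializing the full score row.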

These are not portable without DCU proof:

- Nsight Compute metrics.
- PTX/SASS conclusions.
- Hopper/Blackwell TMA, WGMMA, warp specialization, or CUDA shared-memory bank
  rules.
- NVIDIA device-name config tables.
- CUDA-only FP4/FP8 assumptions.

## Config Sweep Discipline

When tuning configs, record every tested candidate in
`.humanize/triton-agent/tuning-decisions.md`:

```text
shape/dtype/backend/gfx
config: BLOCK_*, num_warps, num_stages, waves_per_eu, matrix_instr_nonkdim
correctness: pass/fail and tolerance
latency: p50/p90/mean, repeats, selected card
profile clue: launch/memory/LDS/resource/compute/dispatch
decision: keep/reject/inconclusive
```

Only promote a config when all of the following hold:

- Correctness passes.
- Improvement exceeds the noise band.
- No important serving shape regresses outside the accepted tradeoff.
- Backend proof still selects the target Triton kernel.
- The generated DCU code path is plausible from profiler or ISA evidence.
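
The latency fields in the record and the noise-band condition can be computed mechanically. The sketch below is a minimal example using only the standard library. The 3% noise band and the function names are assumptions, and it covers only the correctness and noise-band conditions; shape regressions, backend proof, and ISA evidence still need separate checks.

```python
import statistics


def summarize(latencies_us):
    """Return p50/p90/mean/repeats for a list of per-run latencies."""
    # quantiles(n=10) yields 9 cut points: index 4 is p50, index 8 is p90.
    deciles = statistics.quantiles(latencies_us, n=10, method="inclusive")
    return {
        "p50": deciles[4],
        "p90": deciles[8],
        "mean": statistics.fmean(latencies_us),
        "repeats": len(latencies_us),
    }


def beats_noise_band(baseline, candidate, correctness_ok, noise_band=0.03):
    """True when the candidate is correct and its p50 gain exceeds the noise band."""
    if not correctness_ok:
        return False
    gain = (baseline["p50"] - candidate["p50"]) / baseline["p50"]
    return gain > noise_band


baseline = summarize([100.0, 101.0, 99.0, 100.5, 102.0])
candidate = summarize([90.0, 91.0, 89.5, 90.5, 92.0])
print(baseline, candidate, beats_noise_band(baseline, candidate, correctness_ok=True))
```

The noise band itself should come from repeated runs of the unchanged baseline on the selected card, not from a fixed constant.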

## Benchmark Shape Coverage

For attention:

```text
prefill: representative prompt lengths, batch sizes, head dims
decode: batch sizes, page/block sizes, kv lengths, topk or speculative paths
MLA: q/nope/pe dims, cache dtype, block size, split prefill/decode behavior
```

For MoE:

```text
tokens per expert distribution
topk and expert count
hidden/intermediate dims
quant mode and scale layout
small-batch decode and large prefill separately
EP/TP/DP shape when enabled
```
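
The tokens-per-expert distribution matters because each expert's token count is typically rounded up to a tile multiple before the grouped GEMM. The sketch below estimates that padding for a given `BLOCK_M`, assuming simple round-up per expert. The function names and token counts are illustrative.

```python
def padded_tokens(tokens_per_expert, block_m):
    """Total rows processed after rounding each expert's count up to block_m."""
    return sum(-(-count // block_m) * block_m for count in tokens_per_expert)


def padding_overhead(tokens_per_expert, block_m):
    """Ratio of processed rows to real rows; 1.0 means no padding waste."""
    real = sum(tokens_per_expert)
    if real == 0:
        return 0.0
    return padded_tokens(tokens_per_expert, block_m) / real


decode_like = [3, 1, 0, 12]  # skewed small-batch routing, illustrative
for block_m in (16, 64):
    print(block_m, padding_overhead(decode_like, block_m))
```

A larger `BLOCK_M` may speed up the GEMM on balanced prefill while multiplying wasted rows on skewed decode routing, which is why the two are benchmarked separately.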

For quantized GEMM:

```text
M/N/K sweep around model shapes
scale granularity
padding and alignment
batch-invariant or graph-captured paths
```
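
A sweep around model shapes that also exercises padding and alignment can be generated rather than hand-written. This is a minimal sketch: the alignment of 16, the `(N, K)` pairs, and the function names are hypothetical, and the real list should come from the served model.

```python
import itertools


def neighbours(value, align):
    """Return value itself plus the aligned size at or above it and its unaligned neighbours."""
    aligned = -(-value // align) * align
    return sorted({value, max(1, aligned - 1), aligned, aligned + 1})


def gemm_sweep(m_values, model_nk, align=16):
    """Cross M candidates (with unaligned neighbours) against model (N, K) pairs."""
    shapes = set()
    for (n, k), m in itertools.product(model_nk, m_values):
        for m_near in neighbours(m, align):
            shapes.add((m_near, n, k))
    return sorted(shapes)


# Hypothetical (N, K) pairs; replace with the projection shapes of the target model.
MODEL_NK = [(4096, 4096), (11008, 4096)]
for shape in gemm_sweep([1, 32, 100], MODEL_NK):
    print(shape)
```

Including the sizes just below and above an aligned `M` forces the masked and padded paths to show up in both the correctness and latency results.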