# Framework Entrypoints For DCU Triton Work

Use this file after `triton-kernel-agent-loop` identifies the target framework
or direct-file mode. Prefer local source over this map when paths differ.

## Direct File Mode

Use direct-file mode when the user points at one or more Triton Python files
instead of a vLLM or SGLang checkout. The file path is the target. The nearest
project root is useful for imports, tests, and benchmarks, but it is not a
precondition.

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]" <target-file-or-dir>
rg -n "<jit_function>|<wrapper>|<kernel_name>|<call_site>" <project-root-or-target-dir>
rg -n "pytest|unittest|do_bench|benchmark|torch\.cuda\.synchronize|TRITON_CACHE_DIR" <project-root-or-target-dir>
```

Record:

- Target file, project root or standalone parent directory, and import path.
- `@triton.jit` function, launch wrapper, grid function, constexpr/config
  surface, `@triton.autotune` keys, and cache behavior.
- Caller, harness, or proof that no harness exists yet.
- Correctness reference, tolerance, shape/dtype/layout workload, and benchmark
  command.
- Any framework branch that still calls the file.
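One way to keep the record above honest is to store it as a structure that can report what is still unknown. This is a sketch only: the class name `KernelTargetRecord` and every field name are hypothetical labels for the items in the list, not an existing schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class KernelTargetRecord:
    """Hypothetical record for one direct-file target; None means 'not recorded yet'."""
    target_file: Optional[str] = None
    import_path: Optional[str] = None
    jit_function: Optional[str] = None
    launch_wrapper: Optional[str] = None
    autotune_keys: Optional[list] = None       # [] means "no autotune", None means unknown
    caller_or_harness: Optional[str] = None
    correctness_reference: Optional[str] = None
    tolerance: Optional[str] = None
    workload: Optional[str] = None             # shape/dtype/layout description
    benchmark_command: Optional[str] = None
    framework_callers: Optional[list] = None   # [] means "no framework branch calls it"

    def missing(self):
        """Return the names of fields that have not been filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

An empty `missing()` result is a reasonable gate before starting to edit the kernel.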

Minimal direct-call proof:

```text
1. A correctness/benchmark harness directly calls the wrapper with real tensors.
2. Profiler output, Triton cache, or dump artifact contains the target kernel.
3. Temporary instrumentation proves the wrapper and config branch were reached.
4. Warmed timing excludes first-run JIT compile time unless compile time is the target.
```
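Item 4 can be sketched as a timing helper that reports the first call separately from warmed calls. The helper name `time_warmed` and its parameters are hypothetical. `sync` is an optional callable that blocks until queued device work has finished; on a PyTorch-based harness this would typically be `torch.cuda.synchronize`, which is an assumption about the local setup.

```python
import statistics
import time


def time_warmed(fn, warmup=3, iters=10, sync=None):
    """Return (first_call_seconds, median_warm_seconds) for fn().

    The first call always runs and counts as one warmup call.
    """
    def timed_call():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued device work before stopping the clock
        return time.perf_counter() - start

    first = timed_call()  # on a cold cache this includes JIT compile time
    for _ in range(max(warmup - 1, 0)):
        timed_call()      # remaining warmup calls, results discarded
    warm = [timed_call() for _ in range(iters)]
    return first, statistics.median(warm)
```

Report both numbers: a large gap between them is expected on a cold cache, and the warmed median is the figure to compare across kernel variants.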

Create temporary harnesses under `.humanize/triton-agent/` when the repo has no
usable test or benchmark. Promote a harness to a real test only after the
correctness contract and workload are stable.

## vLLM

Common Triton/DCU search surface:

```text
vllm/envs.py
vllm/triton_utils/
vllm/attention/
vllm/v1/attention/
vllm/model_executor/layers/
vllm/model_executor/layers/fused_moe/
vllm/model_executor/layers/quantization/
vllm/model_executor/layers/rotary_embedding/
tests/kernels/
benchmarks/kernels/
```

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|Triton|triton_mla|fused_moe|scaled_mm|fp8|fp4" vllm tests benchmarks
rg -n "attention_backend|backend registry|AttentionBackend|MLA|decode|prefill" vllm/attention vllm/v1/attention
```

Backend signals to record:

- `VLLM_ROCM_USE_AITER`
- `VLLM_ROCM_USE_AITER_PAGED_ATTN`
- `VLLM_ROCM_USE_AITER_LINEAR`
- `VLLM_ROCM_USE_AITER_MOE`
- `VLLM_ROCM_USE_AITER_RMSNORM`
- `VLLM_ROCM_USE_AITER_MLA`
- `VLLM_ROCM_USE_AITER_MHA`
- `VLLM_ROCM_USE_AITER_FP8BMM`
- `VLLM_ROCM_USE_AITER_FP4BMM`
- `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`
- `VLLM_ROCM_USE_AITER_TRITON_GEMM`
- `VLLM_ROCM_FP8_PADDING`
- `VLLM_ROCM_MOE_PADDING`
- `VLLM_ROCM_CUSTOM_PAGED_ATTN`
- `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT`
- `VLLM_ROCM_FP8_MFMA_PAGE_ATTN`
- Quantization-specific variables, such as W8A8 backend selectors, as defined
  in the local tree.
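Recording these signals can be as simple as snapshotting the process environment next to each benchmark result. The sketch below uses a hypothetical helper `snapshot_env`. It captures only variables that are explicitly set, so anything absent from the snapshot takes whatever default the local tree defines.

```python
import os


def snapshot_env(prefixes=("VLLM_ROCM_",), environ=None):
    """Return a name-sorted dict of variables whose names start with any prefix."""
    env = os.environ if environ is None else environ
    wanted = tuple(prefixes)
    return {name: env[name] for name in sorted(env) if name.startswith(wanted)}
```

Pass extra prefixes to cover quantization-specific variables once their names are confirmed in the local source.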

Typical target families:

- Triton MLA and decode attention: prove V0/V1 path, prefill/decode mode, page
  table layout, block size, FP8 KV cache, and fallback backend.
- Fused MoE: prove expert implementation, routing/topk, dtype, quant scales,
  block shape, config source, and whether LoRA or EP changes the path.
- Quantized linear/scaled MM: prove scale layout, group size, block sizes,
  AITER/Triton selection, and whether padding or shape heuristics are active.
- Fused elementwise/norm/rope/cache kernels: prove call site, tensor layout,
  graph capture constraints, and whether an existing framework C++ kernel is
  already faster than the Triton path.

## SGLang

Common Triton/DCU search surface:

```text
python/sglang/srt/layers/attention/
python/sglang/srt/layers/attention/triton_ops/
python/sglang/srt/layers/moe/
python/sglang/srt/layers/moe/moe_runner/
python/sglang/srt/layers/quantization/
python/sglang/srt/lora/
python/sglang/jit_kernel/
sgl-kernel/
test/
benchmark/
```

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
rg -n "decode_attention|extend_attention|MLA|KV cache|target_verify|moe_align|topk|tuning" python/sglang benchmark test
```

Backend signals to record:

- `SGLANG_USE_AITER`
- `--attention-backend`
- `--moe-runner-backend`
- quantization mode and model config
- EP/DP/TP mode when MoE routing or expert parallelism changes shape
- tuned Triton config file selected by model, dtype, device name, or block shape

Typical target families:

- Triton attention backend: separate decode, extend/prefill, target verify,
  sliding window, MLA, and FP8 KV cache paths.
- Triton MoE runner: prove runner backend, tuned config, topk, expert count,
  block shape, EP shape, and fallback.
- LoRA Triton kernels: prove rank, batching, and dynamic shape behavior.
- JIT diffusion or multimodal kernels: prove generated kernel path and compile
  cache stability before tuning.

## Minimal Backend Proof Pattern

Obtain at least one of these proofs before editing a kernel:

```text
1. Framework debug log or explicit backend print showing the selected backend.
2. Microbenchmark that directly calls the target wrapper and reports the
   expected Triton kernel name in profiler output.
3. Temporary local instrumentation under .humanize/triton-agent/ that prints
   the selected dispatch branch.
4. Triton cache/dump artifact whose source hash points to the target function.
5. End-to-end profiler trace with the target kernel dominating the relevant
   request segment.
```

Remove temporary instrumentation before finalizing unless it is intentionally
converted into useful framework logging or tests.
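A first pass at the cache-artifact proof (item 4) can be sketched as a name-based search of the Triton cache directory. The helper `find_cache_artifacts` is hypothetical. It assumes the cache lives at `TRITON_CACHE_DIR` or, failing that, `~/.triton/cache`, and that artifacts are named `<kernel_name>.<ext>`; a DCU Triton fork may differ on both points. A name match is weaker than a source-hash match, so inspect the matched artifact before relying on it.

```python
import os
from pathlib import Path


def find_cache_artifacts(kernel_name, cache_dir=None):
    """Return sorted cache files whose name before the first dot equals kernel_name."""
    root = Path(
        cache_dir
        or os.environ.get("TRITON_CACHE_DIR")
        or Path.home() / ".triton" / "cache"  # assumed default location
    )
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.name.split(".")[0] == kernel_name
    )
```

Pointing `TRITON_CACHE_DIR` at a fresh directory for the proof run keeps stale artifacts from earlier builds out of the result.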