# Framework Entrypoints For DCU Triton Work

Use this file after `triton-kernel-agent-loop` identifies the target framework or direct-file mode. Prefer local source over this map when paths differ.

## Direct File Mode

Use direct-file mode when the user points at one or more Triton Python files instead of a vLLM or SGLang checkout. The file path is the target. The nearest project root is useful for imports, tests, and benchmarks, but it is not a precondition.

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store|\[grid\]"
rg -n "|||"
rg -n "pytest|unittest|do_bench|benchmark|torch\.cuda\.synchronize|TRITON_CACHE_DIR"
```

Record:

- Target file, project root or standalone parent directory, and import path.
- `@triton.jit` function, launch wrapper, grid function, constexpr/config surface, `@triton.autotune` keys, and cache behavior.
- Caller, harness, or proof that no harness exists yet.
- Correctness reference, tolerance, shape/dtype/layout workload, and benchmark command.
- Any framework branch that still calls the file.

Minimal direct-call proof:

```text
1. A correctness/benchmark harness directly calls the wrapper with real tensors.
2. Profiler output, Triton cache, or dump artifact contains the target kernel.
3. Temporary instrumentation proves the wrapper and config branch were reached.
4. Warmed timing excludes first-run JIT compile time unless compile time is the target.
```

Create temporary harnesses under `.humanize/triton-agent/` when the repo has no usable test or benchmark. Promote a harness to a real test only after the correctness contract and workload are stable.

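Item 4 of the direct-call proof (warmed timing) can be sketched as a small helper for a temporary harness. This is a minimal sketch, not a prescribed implementation: `time_warmed` and its parameters are hypothetical names, and the `sync` argument is assumed to be a device barrier such as `torch.cuda.synchronize`, which this sketch does not import so that it stays runnable without a GPU.

```python
import time
from statistics import median
from typing import Callable, Optional


def time_warmed(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time fn() after warmup so first-call JIT compilation is excluded.

    sync is an optional device barrier (assumed: torch.cuda.synchronize)
    called around each timed call, because kernel launches are asynchronous.
    """
    if iters < 1:
        raise ValueError("iters must be >= 1")
    barrier = sync if sync is not None else (lambda: None)

    # Warmup runs absorb JIT compilation and autotuning; they are not timed.
    for _ in range(warmup):
        fn()
    barrier()

    samples_ms = []
    for _ in range(iters):
        barrier()
        start = time.perf_counter()
        fn()
        barrier()  # wait for launched work to finish before reading the clock
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return {
        "warmup": warmup,
        "iters": iters,
        "median_ms": median(samples_ms),
        "min_ms": min(samples_ms),
    }
```

In a real harness the call would look roughly like `time_warmed(lambda: wrapper(x, y), sync=torch.cuda.synchronize)`, where `wrapper`, `x`, and `y` stand in for the target launch wrapper and its tensors. Where `triton.testing.do_bench` is available it is usually the more robust choice. The helper above only makes the warmup/measure split explicit.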
## vLLM

Common Triton/DCU search surface:

```text
vllm/envs.py
vllm/triton_utils/
vllm/attention/
vllm/v1/attention/
vllm/model_executor/layers/
vllm/model_executor/layers/fused_moe/
vllm/model_executor/layers/quantization/
vllm/model_executor/layers/rotary_embedding/
tests/kernels/
benchmarks/kernels/
```

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" vllm tests benchmarks
rg -n "VLLM_ROCM|AITER|is_rocm|gcnArchName|Triton|triton_mla|fused_moe|scaled_mm|fp8|fp4" vllm tests benchmarks
rg -n "attention_backend|backend registry|AttentionBackend|MLA|decode|prefill" vllm/attention vllm/v1/attention
```

Backend signals to record:

- `VLLM_ROCM_USE_AITER`
- `VLLM_ROCM_USE_AITER_PAGED_ATTN`
- `VLLM_ROCM_USE_AITER_LINEAR`
- `VLLM_ROCM_USE_AITER_MOE`
- `VLLM_ROCM_USE_AITER_RMSNORM`
- `VLLM_ROCM_USE_AITER_MLA`
- `VLLM_ROCM_USE_AITER_MHA`
- `VLLM_ROCM_USE_AITER_FP8BMM`
- `VLLM_ROCM_USE_AITER_FP4BMM`
- `VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`
- `VLLM_ROCM_USE_AITER_TRITON_GEMM`
- `VLLM_ROCM_FP8_PADDING`
- `VLLM_ROCM_MOE_PADDING`
- `VLLM_ROCM_CUSTOM_PAGED_ATTN`
- `VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT`
- `VLLM_ROCM_FP8_MFMA_PAGE_ATTN`
- quantization-specific vars such as W8A8 backend selectors in the local tree.

Typical target families:

- Triton MLA and decode attention: prove V0/V1 path, prefill/decode mode, page table layout, block size, FP8 KV cache, and fallback backend.
- Fused MoE: prove expert implementation, routing/topk, dtype, quant scales, block shape, config source, and whether LoRA or EP changes the path.
- Quantized linear/scaled MM: prove scale layout, group size, block sizes, AITER/Triton selection, and whether padding or shape heuristics are active.
- Fused elementwise/norm/rope/cache kernels: prove call site, tensor layout, graph capture constraints, and whether a framework C++ kernel already wins.

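The backend signals listed above can be captured in one step and stored with the run record. The sketch below is illustrative only: `snapshot_backend_env` is a hypothetical helper, and it reports only variables that are explicitly set in the process environment. Defaults applied inside the local `vllm/envs.py` are not visible this way and still have to be read from source.

```python
import json
import os
from typing import Iterable, Mapping, Optional


def snapshot_backend_env(
    environ: Optional[Mapping[str, str]] = None,
    prefixes: Iterable[str] = ("VLLM_ROCM_",),
) -> dict:
    """Return explicitly set variables whose names start with one of prefixes.

    Unset variables are absent from the result; their effective defaults
    must be checked in the local source tree.
    """
    env = os.environ if environ is None else environ
    wanted = tuple(prefixes)
    # Sort names so two snapshots can be diffed line by line.
    return {name: env[name] for name in sorted(env) if name.startswith(wanted)}


if __name__ == "__main__":
    # Print the snapshot so it can be pasted into the run record.
    print(json.dumps(snapshot_backend_env(), indent=2))
```

Passing a different `prefixes` tuple, for example `("SGLANG_",)`, reuses the same helper for other frameworks. Taking `environ` as a parameter keeps the function testable without mutating the real environment.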
## SGLang

Common Triton/DCU search surface:

```text
python/sglang/srt/layers/attention/
python/sglang/srt/layers/attention/triton_ops/
python/sglang/srt/layers/moe/
python/sglang/srt/layers/moe/moe_runner/
python/sglang/srt/layers/quantization/
python/sglang/srt/lora/
python/sglang/jit_kernel/
sgl-kernel/
test/
benchmark/
```

Search commands:

```bash
rg -n "@triton\.jit|@triton\.autotune|triton\.Config|tl\.dot|tl\.load|tl\.store" python/sglang sgl-kernel test benchmark
rg -n "SGLANG_USE_AITER|attention_backend|moe_runner|triton_ops|fused_moe|is_rocm|is_hip|fp8|fp4" python/sglang sgl-kernel test benchmark
rg -n "decode_attention|extend_attention|MLA|KV cache|target_verify|moe_align|topk|tuning" python/sglang benchmark test
```

Backend signals to record:

- `SGLANG_USE_AITER`
- `--attention-backend`
- `--moe-runner-backend`
- quantization mode and model config
- EP/DP/TP mode when MoE routing or expert parallelism changes shape
- tuned Triton config file selected by model, dtype, device name, or block shape

Typical target families:

- Triton attention backend: separate decode, extend/prefill, target verify, sliding window, MLA, and FP8 KV cache paths.
- Triton MoE runner: prove runner backend, tuned config, topk, expert count, block shape, EP shape, and fallback.
- LoRA Triton kernels: prove rank, batching, and dynamic shape behavior.
- JIT diffusion or multimodal kernels: prove generated kernel path and compile cache stability before tuning.

## Minimal Backend Proof Pattern

Use at least one of these before editing:

```text
1. Framework debug log or explicit backend print showing the selected backend.
2. Microbenchmark that directly calls the target wrapper and reports the expected Triton kernel name in profiler output.
3. Temporary local instrumentation under .humanize/triton-agent/ that prints the selected dispatch branch.
4. Triton cache/dump artifact whose source hash points to the target function.
5. End-to-end profiler trace with the target kernel dominating the relevant request segment.
```

Remove temporary instrumentation before finalizing unless it is intentionally converted into useful framework logging or tests.
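Proof item 4 (a cache artifact for the target kernel) can be partly automated by scanning the Triton cache directory. The sketch below rests on assumptions that should be checked against the local Triton version: that the cache lives under `TRITON_CACHE_DIR` or `~/.triton/cache`, and that artifacts are stored as `<hash>/<kernel_name>.<ext>`. `find_kernel_artifacts` is a hypothetical helper. It shows that the kernel was compiled, not that a particular dispatch branch launched it.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_kernel_artifacts(kernel_name: str, cache_dir: Optional[str] = None) -> List[Path]:
    """List cache files whose base name (before the first dot) equals kernel_name.

    ASSUMPTION: artifacts are laid out as <cache_dir>/<hash>/<kernel_name>.<ext>.
    Verify this layout for the Triton version in use before relying on it.
    """
    root = Path(
        cache_dir
        or os.environ.get("TRITON_CACHE_DIR")
        or Path.home() / ".triton" / "cache"  # assumed default location
    )
    if not root.is_dir():
        return []
    # Exact match on the base name avoids hits on kernels that merely
    # share a prefix with the target.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.split(".", 1)[0] == kernel_name
    )
```

Pointing `TRITON_CACHE_DIR` at an empty directory under `.humanize/triton-agent/` before the run makes the result easier to interpret, since every artifact found then comes from that run.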