Commit 60c75a2f authored by whlwhlwhl

add triton-kernel-skill

parent 6889486d
# Triton/DCU Source Routes And Query Patterns
Use this file with `triton-kernel-knowledge` after choosing the target mode,
framework, and operator family.
## KernelPilot Corpus Topics
The local KernelPilot corpus already tracks vLLM, SGLang, Triton, PyTorch,
FlashAttention, FlashInfer, and other kernel PRs. The useful themes for
Triton/DCU work are:
```text
vllm rocm aiter triton attention mla
vllm triton fused moe fp8 w8a8 scaled mm
vllm triton decode attention kv cache fp8
vllm rocm custom paged attention aiter
sglang amd aiter triton attention backend
sglang triton fused moe tuning config
sglang fp8 kv cache triton attention
sglang moe_runner triton_utils fused_moe
triton amd backend rocm autotune num_warps num_stages
triton rocm waves_per_eu matrix_instr_nonkdim
rocprof triton kernel cache LDS occupancy VGPR
aiter aotriton conch flaggems liger huggingface kernels triton distributed
```
Use compact search for triage:
```bash
python3 scripts/query.py "vllm triton mla fp8 kv cache" --compact --limit 20
python3 scripts/query.py "sglang triton fused moe tuning" --compact --limit 20
python3 scripts/query.py "triton amd backend waves_per_eu" --compact --limit 20
python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact --limit 20
python3 scripts/query.py "conch triton paged attention" --type source-reference --compact --limit 20
python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact --limit 20
python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact --limit 20
```
Use PR diff search when a term should appear in changed files:
```bash
python3 scripts/search-pr-diffs.py triton_mla rocm --any --limit 100
python3 scripts/search-pr-diffs.py fused_moe triton config --any --limit 100
python3 scripts/search-pr-diffs.py SGLANG_USE_AITER triton --any --limit 100
```
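The triage and diff-search commands above share one shape, so they can be assembled programmatically. The following is a minimal sketch, assuming only the scripts and flags shown above (`--type`, `--compact`, `--limit`, `--any`); the helper names `build_query_cmd` and `build_diff_search_cmd` are hypothetical and not part of the corpus tooling.

```python
import shlex


def build_query_cmd(terms, source_type=None, limit=20):
    """Build an argv list for scripts/query.py (hypothetical helper)."""
    cmd = ["python3", "scripts/query.py", terms]
    if source_type is not None:
        # For example "source-reference" to restrict results to reference pages.
        cmd += ["--type", source_type]
    cmd += ["--compact", "--limit", str(limit)]
    return cmd


def build_diff_search_cmd(terms, match_any=True, limit=100):
    """Build an argv list for scripts/search-pr-diffs.py (hypothetical helper)."""
    cmd = ["python3", "scripts/search-pr-diffs.py", *terms]
    if match_any:
        cmd.append("--any")
    cmd += ["--limit", str(limit)]
    return cmd


# Print shell-ready command lines; pass the lists to subprocess.run to execute.
print(shlex.join(build_query_cmd("aiter triton mla rocm", "source-reference")))
print(shlex.join(build_diff_search_cmd(["triton_mla", "rocm"])))
```

Building argv lists rather than shell strings avoids quoting problems when a query contains spaces.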
## Open Triton Kernel Libraries
Use these source-reference pages when local vLLM/SGLang or direct-file evidence
is thin and the task needs a good open Triton kernel pattern:
| Source | Best Use | Query |
| --- | --- | --- |
| `ref-rocm-aiter` | ROCm AITER/Triton dispatch, attention/MLA, MoE, GEMM, quant, communication | `python3 scripts/query.py "aiter triton <operator>" --type source-reference --compact` |
| `ref-rocm-aotriton` | ROCm AOT Triton, FlashAttention, SDPA, codegen-sensitive attention | `python3 scripts/query.py "aotriton flash attention" --type source-reference --compact` |
| `ref-stackav-conch` | Direct-file Triton harnesses, paged/varlen attention, RMSNorm, rotary, KV cache, quant | `python3 scripts/query.py "conch triton <operator>" --type source-reference --compact` |
| `ref-flaggems` | PyTorch-style operator replacement, elementwise, reduction, normalization, portable wrappers | `python3 scripts/query.py "flaggems triton <operator>" --type source-reference --compact` |
| `ref-liger-kernel` | LLM training kernels, RMSNorm, RoPE, SwiGLU, cross entropy, fused loss | `python3 scripts/query.py "liger triton <operator>" --type source-reference --compact` |
| `ref-huggingface-kernels` | Small packageable kernels from kernels-community, MoE, scaled MM, rotary, RMSNorm | `python3 scripts/query.py "kernels-community triton <operator>" --type source-reference --compact` |
| `ref-triton-distributed` | Distributed Triton, GEMM/all-reduce, all-gather/GEMM, MoE communication overlap | `python3 scripts/query.py "triton distributed <operator>" --type source-reference --compact` |
Treat these as reference implementations or discovery routes. Before copying or
adapting code, inspect the source revision, tests, license/notice, and whether
the repo has actual ROCm/DCU evidence for the target kernel. CUDA-only tuning
constants remain hypotheses until validated on DCU.
## Local Source Patterns
Direct-file call surface:
```text
@triton.jit
@triton.autotune
triton.Config
[grid]
do_bench
pytest
torch.cuda.synchronize
TRITON_CACHE_DIR
MLIR_ENABLE_DUMP
AMDGCN_ENABLE_DUMP
```
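To locate this call surface in a direct Triton file, a plain text scan is often enough for triage. The sketch below assumes simple substring matching; the `find_call_surface` helper and the launch regex are hypothetical, and the regex is only a heuristic for `kernel[grid](...)` launches that can also match unrelated subscript-then-call expressions.

```python
import re

# Markers taken from the call-surface list above.
CALL_SURFACE = [
    "@triton.jit",
    "@triton.autotune",
    "triton.Config",
    "do_bench",
    "torch.cuda.synchronize",
    "TRITON_CACHE_DIR",
    "MLIR_ENABLE_DUMP",
    "AMDGCN_ENABLE_DUMP",
]

# Heuristic for launches such as `add_kernel[grid](x, y)`.
LAUNCH_RE = re.compile(r"\b\w+\[[^\]\n]+\]\s*\(")


def find_call_surface(source_text):
    """Return {marker: [line numbers]} for every marker found in the text."""
    hits = {}
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for marker in CALL_SURFACE:
            if marker in line:
                hits.setdefault(marker, []).append(lineno)
        if LAUNCH_RE.search(line):
            hits.setdefault("launch", []).append(lineno)
    return hits
```

Run it on the file contents, for example `find_call_surface(open(path).read())`, and start reading at the reported lines.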
Attention:
```text
decode_attention
extend_attention
triton_mla
merge_attn_states
compressed_metadata
fp8 kv cache
page table
sliding window
target verify
```
MoE:
```text
fused_moe
moe_align_block_size
moe_runner
triton_utils
topk
expert parallel
tuning_fused_moe
```
Quantization:
```text
w8a8
scaled_mm
block_scaled
per_token_group
fp8_utils
mxfp8
fp4
online quant
```
Framework routing:
```text
attention backend
backend registry
AITER
ROCm
current_platform.is_rocm
gcnArchName
envs.py
server args
```
## Evidence Quality
Strong evidence:
- Local framework code path and profiler kernel name agree.
- Official docs match the installed version or the installed source.
- Benchmark reproduces with stable warmup/repeats on the selected DCU.
- IR/ISA/profile artifacts support the mechanism.
Weak evidence:
- NVIDIA-only PR with no ROCm compile proof.
- Old docs that predate the installed Triton/ROCm/DTK.
- Benchmark without backend proof.
- Speedup within noise band.
Record weak evidence as inspiration, not proof.
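One way to decide whether a speedup falls within the noise band is to compare the mean improvement against the run-to-run spread. This is a minimal sketch, not a policy from this file: the `speedup_is_significant` helper is hypothetical, and the threshold of `k` standard deviations is an assumption to adjust per harness.

```python
import statistics


def speedup_is_significant(baseline_ms, candidate_ms, k=2.0):
    """Return (speedup, significant) for two lists of repeated timings.

    Assumption: the noise band is k times the larger sample standard
    deviation of the two runs. Each list needs at least two samples.
    """
    base_mean = statistics.mean(baseline_ms)
    cand_mean = statistics.mean(candidate_ms)
    noise = k * max(statistics.stdev(baseline_ms), statistics.stdev(candidate_ms))
    speedup = base_mean / cand_mean
    significant = (base_mean - cand_mean) > noise
    return speedup, significant
```

A result with `significant == False` belongs under weak evidence, however large the headline ratio looks.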
 #!/usr/bin/env bash
 #
-# Manual install script for LightOp KernelPilot skills.
-# Installs the three LightOp/DCU skills directly into Claude and/or Codex
+# Manual install script for KernelPilot skills.
+# Installs LightOp/DCU and Triton/DCU skills directly into Claude and/or Codex
 # skill directories without requiring the Claude or Codex CLI tools.
 #
 # Usage:
@@ -11,7 +11,7 @@
 # ./install-lightop-skills-manual.sh --target both
 #
 # Optional:
-# KERNELPILOT_ROOT=/path/to/lightop-skills ./install-lightop-skills-manual.sh
+# KERNELPILOT_ROOT=/path/to/kernel-pilot ./install-lightop-skills-manual.sh
 # CLAUDE_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target claude
 # CODEX_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target codex
 #
@@ -31,20 +31,26 @@ CODEX_SKILLS_DIR="${CODEX_SKILLS_DIR:-${CODEX_HOME:-${HOME}/.codex}/skills}"
 KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
 NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
 AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
+TRITON_AGENT_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-agent-loop"
+TRITON_KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-knowledge"
+TRITON_PROFILER_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-dcu-profiler-report"
 HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
 # ---- Skill definitions ----
 # Each entry: "name|source_type|source_path|extra"
-# source_type: "symlink" or "hydrate"
+# source_type: "symlink", "hydrate", or "hydrate-dir"
 SKILLS=(
     "lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
     "dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
     "lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
+    "triton-kernel-agent-loop|hydrate-dir|${TRITON_AGENT_SRC}|${HUMANIZE_RUNTIME}"
+    "triton-kernel-knowledge|hydrate-dir|${TRITON_KNOWLEDGE_SRC}|${HUMANIZE_RUNTIME}"
+    "triton-dcu-profiler-report|hydrate-dir|${TRITON_PROFILER_SRC}|${HUMANIZE_RUNTIME}"
 )
 usage() {
     cat <<'EOF'
-Install LightOp KernelPilot skills manually.
+Install KernelPilot skills manually.
 Usage:
 install-lightop-skills-manual.sh [options]
@@ -60,8 +66,8 @@ Options:
 EOF
 }
-log() { printf '[install-lightop-skills] %s\n' "$*"; }
-die() { printf '[install-lightop-skills] Error: %s\n' "$*" >&2; exit 1; }
+log() { printf '[install-kernelpilot-skills] %s\n' "$*"; }
+die() { printf '[install-kernelpilot-skills] Error: %s\n' "$*" >&2; exit 1; }
 resolve_kernelpilot_root() {
     KERNELPILOT_ROOT="$(cd "$KERNELPILOT_ROOT" 2>/dev/null && pwd || true)"
@@ -70,22 +76,40 @@ resolve_kernelpilot_root() {
     KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
     NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
     AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
+    TRITON_AGENT_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-agent-loop"
+    TRITON_KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-knowledge"
+    TRITON_PROFILER_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-dcu-profiler-report"
     HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
     SKILLS=(
         "lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
         "dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
         "lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
+        "triton-kernel-agent-loop|hydrate-dir|${TRITON_AGENT_SRC}|${HUMANIZE_RUNTIME}"
+        "triton-kernel-knowledge|hydrate-dir|${TRITON_KNOWLEDGE_SRC}|${HUMANIZE_RUNTIME}"
+        "triton-dcu-profiler-report|hydrate-dir|${TRITON_PROFILER_SRC}|${HUMANIZE_RUNTIME}"
     )
 }
 preflight() {
     local path
-    for path in "$KNOWLEDGE_SRC/SKILL.md" "$NCUREPORT_SRC/SKILL.md" "$AGENTLOOP_SRC/SKILL.md"; do
+    for path in "$KNOWLEDGE_SRC/SKILL.md" "$NCUREPORT_SRC/SKILL.md" "$AGENTLOOP_SRC/SKILL.md" "$TRITON_AGENT_SRC/SKILL.md" "$TRITON_KNOWLEDGE_SRC/SKILL.md" "$TRITON_PROFILER_SRC/SKILL.md"; do
         [[ -e "$path" ]] || die "not found: $path"
     done
 }
+hydrate_file() {
+    local file="$1"
+    local runtime="$2"
+    local tmp
+    tmp="${file}.tmp"
+    sed \
+        "s|{{HUMANIZE_RUNTIME_ROOT}}|${runtime}|g; s|{{KERNELPILOT_ROOT}}|${KERNELPILOT_ROOT}|g" \
+        "$file" > "$tmp"
+    mv "$tmp" "$file"
+}
 install_skill_dir() {
     local skills_dir="$1"
     local label="$2"
@@ -131,6 +155,16 @@ install_skill_dir() {
                     "$src" > "${target}/SKILL.md"
             fi
             ;;
+        hydrate-dir)
+            if [[ "$DRY_RUN" == "true" ]]; then
+                log "DRY-RUN copy ${name} with hydrated paths"
+            else
+                log "copying ${name} with hydrated paths"
+                mkdir -p "$target"
+                cp -a "$src/." "$target/"
+                hydrate_file "${target}/SKILL.md" "$runtime"
+            fi
+            ;;
         *)
             die "unknown install kind: ${kind}"
             ;;
@@ -221,7 +255,7 @@ install_python_deps
 cat <<EOF
-Done. Installed LightOp skills:
+Done. Installed KernelPilot skills:
 $(for entry in "${SKILLS[@]}"; do
     IFS='|' read -r name kind _ <<< "$entry"
@@ -232,6 +266,9 @@ $(for entry in "${SKILLS[@]}"; do
         hydrate)
             printf " - %-30s (hydrated SKILL.md)\n" "$name"
             ;;
+        hydrate-dir)
+            printf " - %-30s (hydrated directory)\n" "$name"
+            ;;
     esac
 done)
 ...
-# LightOp Kernel Knowledge
+# KernelPilot Kernel Knowledge
-This directory backs the `lightop-kernel-knowledge` skill. For LightOp/DCU
-work, use evidence in this order:
+This directory backs the KernelPilot knowledge skills, including
+`lightop-kernel-knowledge` and `triton-kernel-knowledge`. For DCU kernel work,
+use evidence in this order:
-1. Local LightOp source, wrappers, bindings, config tables, tests, and
-benchmarks.
+1. Local target source, wrappers, bindings, config tables, tests, and
+benchmarks. For Triton work this includes vLLM/SGLang source or a
+user-specified direct Triton file and its harness.
 2. ROCm/DCU official docs and upstream source: SourceFind DCU/DTK docs,
 ROCm/HIP, MIOpen, rocBLAS, hipBLASLt, Composable Kernel, Triton AMD,
-PyTorch ROCm, SGLang/vLLM AMD paths, the Hygon HIP optimizer reference,
-protected DCU Toolkit AMD knowledge-base pointer, plus bundled MR evidence
-from SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
+PyTorch ROCm, SGLang/vLLM AMD paths, AITER, AOTriton, Conch, FlagGems,
+Liger Kernel, Hugging Face kernels, Triton-distributed, the Hygon HIP
+optimizer reference, protected DCU Toolkit AMD knowledge-base pointer, plus
+bundled MR evidence from SourceFind LightOp and DCU Toolkit
+flash-attention-cutlass.
 3. The bundled CUDA-oriented PR corpus, only as cross-platform inspiration
 after translating and validating the idea on DCU.
@@ -30,6 +34,11 @@ python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --co
 python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
 python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
 python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
+python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact --limit 20
+python3 scripts/query.py "conch triton paged attention" --type source-reference --compact --limit 20
+python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact --limit 20
+python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact --limit 20
+python3 scripts/query.py "triton distributed allreduce gemm" --type source-reference --compact --limit 20
 python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
 python3 scripts/get_page.py <page-id>
 ```
 ...
@@ -98,6 +98,37 @@ ck-tile:
   - "CK Tile"
   - Composable Kernel
+aiter:
+  - AITER
+  - "AI Tensor Engine"
+aotriton:
+  - AOTriton
+  - "AOT Triton"
+  - "ahead of time triton"
+conch:
+  - Conch
+  - conch-triton-kernels
+flaggems:
+  - FlagGems
+  - flag_gems
+liger-kernel:
+  - Liger
+  - "Liger Kernel"
+huggingface-kernels:
+  - "Hugging Face kernels"
+  - kernels-community
+  - hf-kernels
+triton-distributed:
+  - Triton-distributed
+  - triton_dist
+  - "triton distributed"
 # Kernel types
 moe:
   - MoE
 ...
 {
 "schema_version": 1,
-"description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
+"description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references, and open Triton kernel libraries such as AITER, AOTriton, Conch, FlagGems, Liger Kernel, Hugging Face kernels, and Triton-distributed); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
 "frameworks": [
 {
 "id": "nvidia-code-samples",
@@ -150,6 +150,196 @@
 "dccobjdump",
 "sqtt"
 ]
+},
+{
+"id": "rocm-aiter",
+"name": "AITER AI Tensor Engine for ROCm",
+"repo": "ROCm/aiter",
+"url": "https://github.com/ROCm/aiter",
+"kernel_paths": [
+"aiter",
+"op_tests",
+"docs",
+"gradlib",
+"csrc",
+"requirements-triton-comms.txt",
+".github/scripts/install_triton.sh",
+"README.md"
+],
+"tags": [
+"triton",
+"rocm",
+"aiter",
+"attention",
+"mla",
+"paged-attention",
+"fused-moe",
+"gemm",
+"rmsnorm",
+"quantization",
+"communication"
+]
+},
+{
+"id": "rocm-aotriton",
+"name": "AOTriton Ahead-of-Time Triton Math Library",
+"repo": "ROCm/aotriton",
+"url": "https://github.com/ROCm/aotriton",
+"kernel_paths": [
+"v2python",
+"v2src",
+"v3python",
+"v3src",
+"tritonsrc",
+"include/aotriton",
+"test",
+"docs",
+"README.md"
+],
+"tags": [
+"triton",
+"rocm",
+"aot",
+"aotriton",
+"flash-attention",
+"sdpa",
+"attention",
+"compiler",
+"codegen"
+]
+},
+{
+"id": "stackav-conch",
+"name": "Conch Triton Kernel Standard Library",
+"repo": "stackav-oss/conch",
+"url": "https://github.com/stackav-oss/conch",
+"kernel_paths": [
+"conch",
+"tests",
+"benchmarks",
+"README.md",
+"pyproject.toml"
+],
+"tags": [
+"triton",
+"rocm",
+"standard-library",
+"paged-attention",
+"varlen-attention",
+"rmsnorm",
+"rotary",
+"kv-cache",
+"fp8",
+"int8",
+"quantization",
+"vllm"
+]
+},
+{
+"id": "flaggems",
+"name": "FlagGems Triton Operator Library",
+"repo": "flagos-ai/FlagGems",
+"url": "https://github.com/flagos-ai/FlagGems",
+"kernel_paths": [
+"src/flag_gems",
+"benchmark",
+"tests",
+"modules_tests",
+"experimental_tests",
+"triton_src",
+"docs",
+"README.md"
+],
+"tags": [
+"triton",
+"pytorch",
+"operator-library",
+"llm",
+"backend-neutral",
+"multi-backend",
+"aten",
+"normalization",
+"reduction",
+"elementwise",
+"quantization"
+]
+},
+{
+"id": "liger-kernel",
+"name": "Liger Kernel Triton Kernels for LLM Training",
+"repo": "linkedin/Liger-Kernel",
+"url": "https://github.com/linkedin/Liger-Kernel",
+"kernel_paths": [
+"src/liger_kernel",
+"test",
+"benchmark",
+"examples",
+"docs",
+"README.md"
+],
+"tags": [
+"triton",
+"llm-training",
+"rmsnorm",
+"rope",
+"swiglu",
+"cross-entropy",
+"fused-linear-cross-entropy",
+"loss",
+"amd"
+]
+},
+{
+"id": "huggingface-kernels",
+"name": "Hugging Face Kernels and kernels-community Hub",
+"repo": "huggingface/kernels",
+"url": "https://github.com/huggingface/kernels",
+"kernel_paths": [
+"src",
+"kernel-builder",
+"tests",
+"README.md",
+"https://huggingface.co/kernels",
+"https://huggingface.co/kernels-community"
+],
+"tags": [
+"triton",
+"kernel-hub",
+"kernels-community",
+"paged-attention",
+"triton-moe",
+"triton-scaled-mm",
+"rmsnorm",
+"rotary",
+"quantization"
+]
+},
+{
+"id": "triton-distributed",
+"name": "Triton-distributed",
+"repo": "ByteDance-Seed/Triton-distributed",
+"url": "https://github.com/ByteDance-Seed/Triton-distributed",
+"kernel_paths": [
+"python",
+"lib",
+"include",
+"csrc",
+"docs",
+"tests",
+"README.md"
+],
+"tags": [
+"triton",
+"distributed",
+"communication-overlap",
+"gemm-allreduce",
+"allgather-gemm",
+"reduce-scatter",
+"moe",
+"flash-decode",
+"amd",
+"nvidia"
+]
 }
 ],
 "topics": [
@@ -161,7 +351,12 @@
 "simveit-load-and-store",
 "colfax-article-src",
 "colfax-cutlass-kernels",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"rocm-aotriton",
+"stackav-conch",
+"huggingface-kernels",
+"triton-distributed"
 ],
 "tags": [
 "attention",
@@ -181,7 +376,12 @@
 "simveit-load-and-store",
 "colfax-article-src",
 "colfax-cutlass-kernels",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"stackav-conch",
+"flaggems",
+"huggingface-kernels",
+"triton-distributed"
 ],
 "tags": [
 "gemm",
@@ -200,7 +400,10 @@
 "simveit-load-and-store",
 "colfax-article-src",
 "colfax-cutlass-kernels",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"huggingface-kernels",
+"triton-distributed"
 ],
 "tags": [
 "moe",
@@ -217,7 +420,12 @@
 "applies_to": [
 "nvidia-code-samples",
 "simveit-effective-transpose",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"stackav-conch",
+"flaggems",
+"liger-kernel",
+"huggingface-kernels"
 ],
 "tags": [
 "rmsnorm",
@@ -232,7 +440,12 @@
 "name": "Activation / element-wise fusion",
 "applies_to": [
 "simveit-effective-transpose",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"stackav-conch",
+"flaggems",
+"liger-kernel",
+"huggingface-kernels"
 ],
 "tags": [
 "silu",
@@ -250,7 +463,11 @@
 "simveit-load-and-store",
 "colfax-article-src",
 "colfax-cutlass-kernels",
-"hygon-hip-kernel-optimizer"
+"hygon-hip-kernel-optimizer",
+"rocm-aiter",
+"stackav-conch",
+"flaggems",
+"huggingface-kernels"
 ],
 "tags": [
 "fp8",
@@ -262,6 +479,48 @@
 "per-tensor",
 "per-channel"
 ]
+},
+{
+"id": "triton-open-kernel-libraries",
+"name": "Open Triton kernel libraries",
+"applies_to": [
+"rocm-aiter",
+"rocm-aotriton",
+"stackav-conch",
+"flaggems",
+"liger-kernel",
+"huggingface-kernels",
+"triton-distributed"
+],
+"tags": [
+"triton",
+"open-source",
+"llm",
+"dcu",
+"rocm",
+"benchmark",
+"unit-test",
+"reference-implementation"
+]
+},
+{
+"id": "distributed-triton",
+"name": "Distributed Triton and compute-communication overlap",
+"applies_to": [
+"rocm-aiter",
+"triton-distributed"
+],
+"tags": [
+"triton",
+"distributed",
+"communication",
+"allreduce",
+"allgather",
+"reduce-scatter",
+"moe",
+"tensor-parallel",
+"expert-parallel"
+]
 }
 ]
 }
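The knowledge map ties each topic to framework ids through `applies_to`, and each framework entry carries a `repo`. A minimal sketch of resolving a topic to upstream repos follows; it inlines a two-framework excerpt rather than reading the real file, and the `repos_for_topic` helper is hypothetical.

```python
import json

# Small excerpt mirroring the "frameworks" and "topics" keys of the map.
KNOWLEDGE_MAP = json.loads("""
{
  "frameworks": [
    {"id": "rocm-aiter", "repo": "ROCm/aiter"},
    {"id": "triton-distributed", "repo": "ByteDance-Seed/Triton-distributed"}
  ],
  "topics": [
    {"id": "distributed-triton",
     "applies_to": ["rocm-aiter", "triton-distributed"]}
  ]
}
""")


def repos_for_topic(knowledge_map, topic_id):
    """Return the upstream repos listed for a topic, in applies_to order."""
    by_id = {fw["id"]: fw for fw in knowledge_map["frameworks"]}
    for topic in knowledge_map["topics"]:
        if topic["id"] == topic_id:
            # Skip ids that have no framework entry instead of raising.
            return [by_id[fid]["repo"] for fid in topic["applies_to"] if fid in by_id]
    return []


print(repos_for_topic(KNOWLEDGE_MAP, "distributed-triton"))
```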
---
id: ref-flaggems
repo: flagos-ai/FlagGems
title: FlagGems Triton Operator Library
url: https://github.com/flagos-ai/FlagGems
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- nvidia
- rocm
- dcu
tags:
- triton
- flaggems
- pytorch
- operator-library
- backend-neutral
- multi-backend
- aten
- normalization
- reduction
- elementwise
- quantization
techniques:
- pytorch-operator-replacement
- backend-neutral-triton
- test-matrix
- benchmark-matrix
- operator-coverage
hardware_features:
- wavefront
- lds
- vectorization
- cache
kernel_types:
- normalization
- reduction
- elementwise
- activation
- quantization
- gemm
languages:
- python
- triton
captured_at: '2026-05-26'
license: not-captured
source_paths:
- src/flag_gems
- benchmark
- tests
- modules_tests
- experimental_tests
- triton_src
- docs
- README.md
---
# FlagGems Triton Operator Library
- Repository: `flagos-ai/FlagGems`
- Source: [flagos-ai/FlagGems](https://github.com/flagos-ai/FlagGems)
## Route Fit
Use FlagGems when the Triton task is a PyTorch-style operator, normalization,
reduction, activation, elementwise fusion, or backend-neutral replacement. It is
less LLM-serving-specific than AITER or Conch, but it is valuable for portable
Triton operator structure, tests, and benchmark organization.
## What To Inspect
- `src/flag_gems` and `triton_src` for operator implementations.
- `tests`, `modules_tests`, and `experimental_tests` for dtype/shape coverage.
- `benchmark` for performance harness layout and comparison policy.
## DCU Use Notes
Treat FlagGems constants as hypotheses. Its portability makes it useful for
syntax and wrapper design, but final tuning still needs DCU profiler, IR/ISA,
and target-version proof.
## Query Hooks
```bash
python3 scripts/query.py "flaggems triton rmsnorm reduction" --type source-reference --compact
python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact
python3 scripts/get_page.py ref-flaggems
```
---
id: ref-huggingface-kernels
repo: huggingface/kernels
title: Hugging Face Kernels and kernels-community Hub
url: https://github.com/huggingface/kernels
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- nvidia
- rocm
- dcu
tags:
- triton
- huggingface
- kernels-community
- kernel-hub
- paged-attention
- triton-moe
- triton-scaled-mm
- rmsnorm
- rotary
- quantization
techniques:
- kernel-package
- direct-file-harness
- reference-discovery
- hub-source-search
hardware_features:
- wavefront
- lds
- mfma
- cache
kernel_types:
- attention
- moe
- gemm
- normalization
- rotary
- quantization
languages:
- python
- triton
captured_at: '2026-05-26'
license: not-captured
source_paths:
- src
- kernel-builder
- tests
- README.md
- https://huggingface.co/kernels
- https://huggingface.co/kernels-community
---
# Hugging Face Kernels And kernels-community Hub
- Repository: `huggingface/kernels`
- Source: [huggingface/kernels](https://github.com/huggingface/kernels)
- Hub: [huggingface.co/kernels](https://huggingface.co/kernels)
## Route Fit
Use Hugging Face kernels as a discovery route for small, packageable Triton
kernels and direct-file references. It is useful for finding paged attention,
Triton MoE, scaled MM, RMSNorm, rotary, and quantization examples that are
easier to inspect than a full serving framework.
## What To Inspect
- Kernel package metadata and source links on the Hub.
- `tests` and package examples for minimal wrappers.
- Whether the package declares ROCm/AMD support or only works on CUDA.
## DCU Use Notes
Treat Hub kernels as candidates, not proof. Before adapting one, capture the
source URL, package revision, license/notice, and then run direct correctness,
benchmark, and Triton cache/profiler proof on DCU.
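The capture-then-prove checklist above can be kept as a small record per candidate kernel. This is a sketch under assumptions: the `KernelProvenance` class and its field names are hypothetical, not a schema used by the Hub or by this corpus.

```python
from dataclasses import dataclass, field


def _default_checks():
    # Proof steps named in the note above; all start unproven.
    return {"correctness": False, "benchmark": False, "cache_or_profiler": False}


@dataclass
class KernelProvenance:
    """Hypothetical record of what was captured and proven for a Hub kernel."""
    source_url: str
    revision: str
    license: str
    checks: dict = field(default_factory=_default_checks)

    def missing(self):
        """List captured fields that are empty and proof steps not yet done."""
        gaps = [name for name in ("source_url", "revision", "license")
                if not getattr(self, name)]
        gaps += [name for name, done in self.checks.items() if not done]
        return gaps

    def ready_to_adapt(self):
        return not self.missing()
```

An entry with a non-empty `missing()` list is still a candidate rather than proof.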
## Query Hooks
```bash
python3 scripts/query.py "huggingface kernels triton moe" --type source-reference --compact
python3 scripts/query.py "kernels-community paged attention triton" --type source-reference --compact
python3 scripts/query.py "triton scaled mm kernels community" --type source-reference --compact
python3 scripts/get_page.py ref-huggingface-kernels
```
---
id: ref-liger-kernel
repo: linkedin/Liger-Kernel
title: Liger Kernel Triton Kernels for LLM Training
url: https://github.com/linkedin/Liger-Kernel
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- nvidia
- rocm
- dcu
tags:
- triton
- liger-kernel
- llm-training
- rmsnorm
- rope
- swiglu
- cross-entropy
- fused-linear-cross-entropy
- loss
- amd
techniques:
- llm-training-kernel
- autograd-wrapper
- fused-epilogue
- memory-reduction
- benchmark
hardware_features:
- wavefront
- vectorization
- cache
kernel_types:
- normalization
- rotary
- activation
- loss
- fused-linear
languages:
- python
- triton
captured_at: '2026-05-26'
license: not-captured
source_paths:
- src/liger_kernel
- test
- benchmark
- examples
- docs
- README.md
---
# Liger Kernel Triton Kernels For LLM Training
- Repository: `linkedin/Liger-Kernel`
- Source: [linkedin/Liger-Kernel](https://github.com/linkedin/Liger-Kernel)
## Route Fit
Use Liger Kernel when the Triton task is training-side LLM work: RMSNorm, RoPE,
SwiGLU, cross entropy, fused linear + loss, or memory-saving fused epilogues.
It is most useful for direct-file Triton structure, PyTorch autograd wrappers,
and correctness/benchmark patterns around training kernels.
## What To Inspect
- `src/liger_kernel` for wrappers and Triton kernels.
- `test` for numerical tolerance and training-shape coverage.
- `benchmark` and `examples` for measuring memory and runtime tradeoffs.
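When reading the tolerance tests, it helps to have a plain reference to compare a kernel against. The sketch below uses the standard RMSNorm formula in pure Python; it is not Liger's implementation, and the `eps`, `rtol`, and `atol` defaults are assumptions to replace with the values the target tests use.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i for one row of values."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allclose(actual, expected, rtol=1e-3, atol=1e-5):
    """Elementwise |a - e| <= atol + rtol * |e|, as in common test helpers."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )
```

In a real harness the kernel output would take the place of `actual`, with tolerances chosen per dtype.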
## DCU Use Notes
Use Liger as an algorithmic and harness reference. When porting to DCU, verify
dtype support, generated ISA, and resource pressure on the target card before
claiming a performance win.
## Query Hooks
```bash
python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact
python3 scripts/query.py "liger fused linear cross entropy" --type source-reference --compact
python3 scripts/get_page.py ref-liger-kernel
```
---
id: ref-rocm-aiter
repo: ROCm/aiter
title: AITER AI Tensor Engine for ROCm
url: https://github.com/ROCm/aiter
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- rocm
- dcu
tags:
- triton
- rocm
- aiter
- vllm
- sglang
- attention
- mla
- paged-attention
- fused-moe
- gemm
- rmsnorm
- quantization
- communication
techniques:
- backend-dispatch
- triton-kernel-reference
- triton-comms
- aiter-fallback-map
- rocm-first-validation
hardware_features:
- wavefront
- lds
- mfma
- mmac
- gfx
kernel_types:
- attention
- mla
- moe
- gemm
- normalization
- quantization
- communication
languages:
- python
- triton
- cpp
- hip
captured_at: '2026-05-26'
license: MIT
source_paths:
- aiter
- op_tests
- docs
- gradlib
- csrc
- requirements-triton-comms.txt
- .github/scripts/install_triton.sh
- README.md
---
# AITER AI Tensor Engine For ROCm
- Repository: `ROCm/aiter`
- Source: [ROCm/aiter](https://github.com/ROCm/aiter)
- License: `MIT`
## Route Fit
Use AITER as first-choice upstream evidence when optimizing Triton kernels that
compete with, wrap, or fall back to AITER paths in vLLM or SGLang. It is
especially useful for ROCm/DCU-facing dispatch, attention/MLA, fused MoE,
RMSNorm, quantized GEMM, and communication-related Triton kernels.
## What To Inspect
- Backend selection and fallback logic around AITER versus Triton.
- `op_tests` for shape coverage, tolerances, and reproducible correctness.
- Triton communication docs and install scripts when a task touches distributed
overlap, all-reduce, or tensor/expert parallel serving.
- Kernel names and wrapper APIs that vLLM or SGLang may already recognize.
## DCU Use Notes
Treat AITER as ROCm-strong evidence, but still prove the exact local framework
path and the generated Triton kernel on the target DCU. Do not assume an AITER
kernel is selected unless backend logs, env flags, profiler names, or a direct
wrapper benchmark prove it.
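One lightweight form of that proof is to check exported profiler kernel names for the substrings you expect from each backend. This is a minimal sketch: the `backend_evidence` helper is hypothetical, and the kernel names in the usage lines are made up for illustration, not real AITER or vLLM symbols.

```python
def backend_evidence(kernel_names, expected_substrings):
    """Map each expected substring to the profiler kernel names containing it.

    An empty list means the profile gives no evidence for that backend.
    Matching is case-insensitive.
    """
    return {
        needle: [name for name in kernel_names if needle.lower() in name.lower()]
        for needle in expected_substrings
    }


# Illustrative names only; real names come from the profiler export.
profiled = ["_fwd_kernel_stage1", "example_aiter_mla_decode", "elementwise_add"]
print(backend_evidence(profiled, ["aiter", "triton_mla"]))
```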
## Query Hooks
```bash
python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact
python3 scripts/query.py "aiter fused moe triton" --type source-reference --compact
python3 scripts/query.py "aiter triton comms allreduce" --type source-reference --compact
python3 scripts/get_page.py ref-rocm-aiter
```
---
id: ref-rocm-aotriton
repo: ROCm/aotriton
title: AOTriton Ahead-of-Time Triton Math Library
url: https://github.com/ROCm/aotriton
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- rocm
- dcu
tags:
- triton
- rocm
- aotriton
- aot
- flash-attention
- sdpa
- attention
- compiler
- codegen
- pytorch
techniques:
- ahead-of-time-triton
- flash-attention-reference
- sdpa-reference
- compiler-artifact-inspection
- rocm-codegen
hardware_features:
- wavefront
- lds
- mfma
- gfx
kernel_types:
- attention
- flash-attention
- sdpa
languages:
- python
- triton
- cpp
- cmake
captured_at: '2026-05-26'
license: not-captured
source_paths:
- v2python
- v2src
- v3python
- v3src
- tritonsrc
- include/aotriton
- test
- docs
- README.md
---
# AOTriton Ahead-of-Time Triton Math Library
- Repository: `ROCm/aotriton`
- Source: [ROCm/aotriton](https://github.com/ROCm/aotriton)
## Route Fit
Use AOTriton when the task is about ROCm Triton codegen, FlashAttention,
PyTorch SDPA integration, generated code objects, or version/toolchain-sensitive
attention behavior. It is not a drop-in vLLM/SGLang optimization recipe, but it
is strong upstream evidence for how AMD packages and validates Triton attention
kernels.
## What To Inspect
- `tritonsrc`, `v2python`, and `v3python` for Triton source patterns.
- `v2src`, `v3src`, and `include/aotriton` for generated/AOT integration.
- Tests and docs for shape constraints, dtype support, and ROCm build knobs.
## DCU Use Notes
Borrow codegen and attention structure only after checking the installed
Triton/ROCm/DTK version. For runtime Triton JIT work, prefer evidence from the
target's kernel cache, IR/ISA dumps, and profiler over assuming that
AOTriton-generated artifacts match the local environment.
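To gather the cache evidence mentioned above, one option is to count the
artifacts the Triton JIT has written for the local target. This is a minimal
sketch, assuming the cache lives at `TRITON_CACHE_DIR` or falls back to
`~/.triton/cache`, and that the suffixes listed are the ones the installed AMD
backend emits. Both assumptions depend on the Triton version, and the helper
names are hypothetical.

```python
import os
from collections import Counter
from pathlib import Path

# Suffixes assumed to matter for an AMD-style backend; adjust per Triton version.
ARTIFACT_SUFFIXES = {".ttir", ".ttgir", ".llir", ".amdgcn", ".hsaco", ".json"}


def triton_cache_dir(env=None):
    """Resolve the cache directory from TRITON_CACHE_DIR or the assumed default."""
    env = os.environ if env is None else env
    override = env.get("TRITON_CACHE_DIR")
    return Path(override) if override else Path.home() / ".triton" / "cache"


def summarize_cache(cache_dir):
    """Count cached artifacts by suffix under cache_dir (empty if missing)."""
    counts = Counter()
    root = Path(cache_dir)
    if not root.is_dir():
        return counts
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in ARTIFACT_SUFFIXES:
            counts[path.suffix] += 1
    return counts


if __name__ == "__main__":
    print(dict(summarize_cache(triton_cache_dir())))
```

An empty result after running the workload suggests the kernel was not JIT
compiled locally, or that the cache is somewhere else.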
## Query Hooks
```bash
python3 scripts/query.py "aotriton flash attention rocm" --type source-reference --compact
python3 scripts/query.py "aotriton triton codegen sdpa" --type source-reference --compact
python3 scripts/get_page.py ref-rocm-aotriton
```
---
id: ref-stackav-conch
repo: stackav-oss/conch
title: Conch Triton Kernel Standard Library
url: https://github.com/stackav-oss/conch
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- rocm
- nvidia
- dcu
tags:
- triton
- conch
- standard-library
- rocm
- paged-attention
- varlen-attention
- rmsnorm
- rotary
- kv-cache
- fp8
- int8
- quantization
- vllm
techniques:
- pytorch-reference
- microbenchmark
- unit-test
- direct-file-harness
- kernel-wrapper-pattern
hardware_features:
- wavefront
- lds
- mfma
- cache
kernel_types:
- attention
- paged-attention
- normalization
- rotary
- quantization
- kv-cache
languages:
- python
- triton
captured_at: '2026-05-26'
license: Apache-2.0
source_paths:
- conch
- tests
- benchmarks
- README.md
- pyproject.toml
---
# Conch Triton Kernel Standard Library
- Repository: `stackav-oss/conch`
- Source: [stackav-oss/conch](https://github.com/stackav-oss/conch)
- Package: [conch-triton-kernels](https://pypi.org/project/conch-triton-kernels/)
- License: `Apache-2.0`
## Route Fit
Use Conch as a high-quality open Triton kernel reference for direct-file mode
and vLLM-adjacent serving kernels. It is useful when the task needs a PyTorch
reference, unit test, microbenchmark, launch wrapper, or standalone Triton file
that can be adapted into `.humanize/triton-agent/` harnesses.
## What To Inspect
- Paged attention, varlen attention, rotary, RMSNorm, KV-cache, and quantized
utility kernels.
- `tests` for correctness tolerances and edge cases.
- `benchmarks` for warmed timing and direct wrapper invocation patterns.
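The reference, tolerance, and warmed-timing pattern from the list above can be
sketched in plain Python. This is not Conch's harness: `rmsnorm_reference`,
`max_abs_error`, and `warmed_median_ms` are hypothetical names, the RMSNorm
formula is the common `x / sqrt(mean(x^2) + eps) * weight` definition, and the
optional `sync` callback stands in for a device synchronize call when timing
GPU work.

```python
import math
import statistics
import time


def rmsnorm_reference(x, weight, eps=1e-6):
    """Pure-Python RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def max_abs_error(reference, candidate):
    """Largest elementwise absolute difference between two equal-length rows."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def warmed_median_ms(fn, warmup=5, iters=20, sync=None):
    """Median wall time of fn in milliseconds, measured after warmup calls."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

In a real harness the candidate would be the Triton wrapper under test, and the
accepted error would come from the library's own test tolerances rather than a
value chosen here.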
## DCU Use Notes
Conch is useful because it is closer to direct-file Triton work than framework
backend code. Still verify ROCm/DCU behavior with local profiler names, cache
entries, and target `gcnArchName`; do not copy tuning constants blindly.
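A small helper can record the target `gcnArchName` next to benchmark results.
The sketch assumes a ROCm-style PyTorch build whose device properties expose
`gcnArchName` and a target ID string of the form `gfx942:sramecc+:xnack-`; on
other builds the attribute may be absent, so the lookup returns `None`. Both
function names are hypothetical.

```python
def parse_gcn_arch(name):
    """Split a target ID such as 'gfx942:sramecc+:xnack-' into (base, flags)."""
    base, *features = name.strip().split(":")
    flags = {}
    for feature in features:
        if feature.endswith("+"):
            flags[feature[:-1]] = True
        elif feature.endswith("-"):
            flags[feature[:-1]] = False
    return base, flags


def local_gcn_arch(device_index=0):
    """Return the local gcnArchName string, or None when it cannot be read."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    # Attribute assumed to exist only on ROCm-style builds.
    return getattr(props, "gcnArchName", None)


if __name__ == "__main__":
    arch = local_gcn_arch()
    print(parse_gcn_arch(arch) if arch else "gcnArchName not available")
```

Storing the parsed base architecture with each timing makes it harder to reuse
tuning constants that were measured on a different target.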
## Query Hooks
```bash
python3 scripts/query.py "conch triton paged attention" --type source-reference --compact
python3 scripts/query.py "conch rmsnorm rotary fp8" --type source-reference --compact
python3 scripts/query.py "conch direct file harness triton" --type source-reference --compact
python3 scripts/get_page.py ref-stackav-conch
```
---
id: ref-triton-distributed
repo: ByteDance-Seed/Triton-distributed
title: Triton-distributed
url: https://github.com/ByteDance-Seed/Triton-distributed
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- nvidia
- rocm
- dcu
tags:
- triton
- distributed
- communication-overlap
- allreduce
- allgather
- reduce-scatter
- gemm
- moe
- flash-decode
- amd
- nvidia
techniques:
- compute-communication-overlap
- gemm-allreduce
- allgather-gemm
- reduce-scatter-overlap
- distributed-kernel
hardware_features:
- wavefront
- lds
- mfma
- interconnect
kernel_types:
- gemm
- attention
- moe
- communication
languages:
- python
- triton
- cpp
captured_at: '2026-05-26'
license: not-captured
source_paths:
- python
- lib
- include
- csrc
- docs
- tests
- README.md
---
# Triton-distributed
- Repository: `ByteDance-Seed/Triton-distributed`
- Source: [ByteDance-Seed/Triton-distributed](https://github.com/ByteDance-Seed/Triton-distributed)
- Docs: [Triton-distributed kernels](https://triton-distributed.readthedocs.io/en/latest/kernels/index.html)
## Route Fit
Use Triton-distributed when the optimization touches tensor parallelism, expert
parallelism, MoE communication, GEMM + all-reduce, all-gather + GEMM,
reduce-scatter overlap, or distributed flash decode. It is not the first source
for single-kernel tuning, but it is a strong reference for overlap-aware Triton
design.
## What To Inspect
- Distributed kernel examples and docs for communication overlap patterns.
- Tests for shape and process-group assumptions.
- Backend support notes; separate AMD-compatible ideas from NVIDIA-only paths.
## DCU Use Notes
For DCU, prove the communication backend and process topology, and confirm that
the expected kernels appear in the profiler, before reusing overlap patterns.
Treat NVIDIA-specific launch or interconnect assumptions as cross-platform
inspiration only.
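Before profiling, it can help to log the process topology and the active
communication backend from inside each rank. This is a minimal sketch, assuming
a torchrun-style launcher that exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`,
`MASTER_ADDR`, and `MASTER_PORT`. Other launchers may use different variables,
and the helper names are hypothetical.

```python
import os

# Variables assumed to be exported by a torchrun-style launcher.
TOPOLOGY_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def launch_topology(env=None):
    """Return (values, missing) for the launcher variables in TOPOLOGY_VARS."""
    env = os.environ if env is None else env
    values = {name: env.get(name) for name in TOPOLOGY_VARS}
    missing = [name for name, value in values.items() if value is None]
    return values, missing


def active_backend():
    """Return the initialized torch.distributed backend name, or None."""
    try:
        import torch.distributed as dist
    except ImportError:
        return None
    if not dist.is_available() or not dist.is_initialized():
        return None
    return dist.get_backend()


if __name__ == "__main__":
    values, missing = launch_topology()
    print("topology:", values)
    print("missing:", missing)
    print("backend:", active_backend())
```

Missing variables or a `None` backend mean the run is not the distributed
configuration the overlap pattern was designed for.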
## Query Hooks
```bash
python3 scripts/query.py "triton distributed allreduce gemm" --type source-reference --compact
python3 scripts/query.py "triton distributed moe reduce scatter" --type source-reference --compact
python3 scripts/get_page.py ref-triton-distributed
```