add triton-kernel-skill

Commit 60c75a2f authored May 28, 2026 by whlwhlwhl
12 changed files
--- a/humanize/skills/triton-kernel-knowledge/references/sources-and-queries.md
+++ b/humanize/skills/triton-kernel-knowledge/references/sources-and-queries.md
+# Triton/DCU Source Routes And Query Patterns
+Use this file with `triton-kernel-knowledge` after choosing the target mode,
+framework, and operator family.
+## KernelPilot Corpus Topics
+The local KernelPilot corpus already tracks vLLM, SGLang, Triton, PyTorch,
+FlashAttention, FlashInfer, and other kernel PRs. The useful themes for
+Triton/DCU work are:
+```text
+vllm rocm aiter triton attention mla
+vllm triton fused moe fp8 w8a8 scaled mm
+vllm triton decode attention kv cache fp8
+vllm rocm custom paged attention aiter
+sglang amd aiter triton attention backend
+sglang triton fused moe tuning config
+sglang fp8 kv cache triton attention
+sglang moe_runner triton_utils fused_moe
+triton amd backend rocm autotune num_warps num_stages
+triton rocm waves_per_eu matrix_instr_nonkdim
+rocprof triton kernel cache LDS occupancy VGPR
+aiter aotriton conch flaggems liger huggingface kernels triton distributed
+```
+Use compact search for triage:
+```bash
+python3 scripts/query.py "vllm triton mla fp8 kv cache" --compact --limit 20
+python3 scripts/query.py "sglang triton fused moe tuning" --compact --limit 20
+python3 scripts/query.py "triton amd backend waves_per_eu" --compact --limit 20
+python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact --limit 20
+python3 scripts/query.py "conch triton paged attention" --type source-reference --compact --limit 20
+python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact --limit 20
+python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact --limit 20
+```
+Use PR diff search when a term should appear in changed files:
+```bash
+python3 scripts/search-pr-diffs.py triton_mla rocm --any --limit 100
+python3 scripts/search-pr-diffs.py fused_moe triton config --any --limit 100
+python3 scripts/search-pr-diffs.py SGLANG_USE_AITER triton --any --limit 100
+```
+## Open Triton Kernel Libraries
+Use these source-reference pages when local vLLM/SGLang or direct-file evidence
+is thin and the task needs a good open Triton kernel pattern:
+| Source | Best Use | Query |
+| --- | --- | --- |
+| `ref-rocm-aiter` | ROCm AITER/Triton dispatch, attention/MLA, MoE, GEMM, quant, communication | `python3 scripts/query.py "aiter triton <operator>" --type source-reference --compact` |
+| `ref-rocm-aotriton` | ROCm AOT Triton, FlashAttention, SDPA, codegen-sensitive attention | `python3 scripts/query.py "aotriton flash attention" --type source-reference --compact` |
+| `ref-stackav-conch` | Direct-file Triton harnesses, paged/varlen attention, RMSNorm, rotary, KV cache, quant | `python3 scripts/query.py "conch triton <operator>" --type source-reference --compact` |
+| `ref-flaggems` | PyTorch-style operator replacement, elementwise, reduction, normalization, portable wrappers | `python3 scripts/query.py "flaggems triton <operator>" --type source-reference --compact` |
+| `ref-liger-kernel` | LLM training kernels, RMSNorm, RoPE, SwiGLU, cross entropy, fused loss | `python3 scripts/query.py "liger triton <operator>" --type source-reference --compact` |
+| `ref-huggingface-kernels` | Small packageable kernels from kernels-community, MoE, scaled MM, rotary, RMSNorm | `python3 scripts/query.py "kernels-community triton <operator>" --type source-reference --compact` |
+| `ref-triton-distributed` | Distributed Triton, GEMM/all-reduce, all-gather/GEMM, MoE communication overlap | `python3 scripts/query.py "triton distributed <operator>" --type source-reference --compact` |
+Treat these as reference implementations or discovery routes. Before copying or
+adapting code, inspect the source revision, tests, license/notice, and whether
+the repo has actual ROCm/DCU evidence for the target kernel. CUDA-only tuning
+constants remain hypotheses until validated on DCU.
+## Local Source Patterns
+Direct-file call surface:
+```text
+@triton.jit
+@triton.autotune
+triton.Config
+[grid]
+do_bench
+pytest
+torch.cuda.synchronize
+TRITON_CACHE_DIR
+MLIR_ENABLE_DUMP
+AMDGCN_ENABLE_DUMP
+```
+Attention:
+```text
+decode_attention
+extend_attention
+triton_mla
+merge_attn_states
+compressed_metadata
+fp8 kv cache
+page table
+sliding window
+target verify
+```
+MoE:
+```text
+fused_moe
+moe_align_block_size
+moe_runner
+triton_utils
+topk
+expert parallel
+tuning_fused_moe
+```
+Quantization:
+```text
+w8a8
+scaled_mm
+block_scaled
+per_token_group
+fp8_utils
+mxfp8
+fp4
+online quant
+```
+Framework routing:
+```text
+attention backend
+backend registry
+AITER
+ROCm
+current_platform.is_rocm
+gcnArchName
+envs.py
+server args
+```
+## Evidence Quality
+Strong evidence:
+- Local framework code path and profiler kernel name agree.
+- Official docs match the installed version or the installed source.
+- Benchmark reproduces with stable warmup/repeats on the selected DCU.
+- IR/ISA/profile artifacts support the mechanism.
+Weak evidence:
+- NVIDIA-only PR with no ROCm compile proof.
+- Old docs that predate the installed Triton/ROCm/DTK.
+- Benchmark without backend proof.
+- Speedup that falls within the benchmark noise band.
+Record weak evidence as inspiration, not proof.
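The "speedup within noise band" criterion above can be made concrete. The following is a minimal Python sketch and is not part of the commit: the name `speedup_verdict`, the use of the sample standard deviation as the noise estimate, and the two-sigma threshold are all illustrative assumptions.

```python
import statistics


def speedup_verdict(baseline_ms, candidate_ms, sigmas=2.0):
    """Classify a measured speedup as 'strong' or 'weak' evidence.

    Hypothetical helper: treats the run-to-run standard deviation of each
    timing series as the noise band. `sigmas` is an assumed threshold.
    """
    if len(baseline_ms) < 2 or len(candidate_ms) < 2:
        raise ValueError("need at least two timed repeats per variant")
    base_mean = statistics.mean(baseline_ms)
    cand_mean = statistics.mean(candidate_ms)
    # Combined noise: root-sum-square of both sample standard deviations.
    noise = (statistics.stdev(baseline_ms) ** 2
             + statistics.stdev(candidate_ms) ** 2) ** 0.5
    gain = base_mean - cand_mean  # positive means the candidate is faster
    return {
        "speedup": base_mean / cand_mean,
        "verdict": "strong" if gain > sigmas * noise else "weak",
    }
```

Feeding it the per-repeat timings from a stable benchmark run gives a quick first filter; a "weak" verdict means the result should be recorded as inspiration rather than proof.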
--- a/install-lightop-skills-manual.sh
+++ b/install-lightop-skills-manual.sh
 #!/usr/bin/env bash
 #
-# Manual install script for LightOp KernelPilot skills.
+# Manual install script for KernelPilot skills.
-# Installs the three LightOp/DCU skills directly into Claude and/or Codex
+# Installs LightOp/DCU and Triton/DCU skills directly into Claude and/or Codex
 # skill directories without requiring the Claude or Codex CLI tools.
 #
 # Usage:
@@ -11,7 +11,7 @@
 #   ./install-lightop-skills-manual.sh --target both
 #
 # Optional:
-#   KERNELPILOT_ROOT=/path/to/lightop-skills ./install-lightop-skills-manual.sh
+#   KERNELPILOT_ROOT=/path/to/kernel-pilot ./install-lightop-skills-manual.sh
 #   CLAUDE_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target claude
 #   CODEX_SKILLS_DIR=/path/to/skills ./install-lightop-skills-manual.sh --target codex
 #
@@ -31,20 +31,26 @@ CODEX_SKILLS_DIR="${CODEX_SKILLS_DIR:-${CODEX_HOME:-${HOME}/.codex}/skills}"
 KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
 NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
 AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
+TRITON_AGENT_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-agent-loop"
+TRITON_KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-knowledge"
+TRITON_PROFILER_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-dcu-profiler-report"
 HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
 # ---- Skill definitions ----
 # Each entry: "name|source_type|source_path|extra"
-# source_type: "symlink" or "hydrate"
+# source_type: "symlink", "hydrate", or "hydrate-dir"
 SKILLS=(
  "lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
  "dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
  "lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
+  "triton-kernel-agent-loop|hydrate-dir|${TRITON_AGENT_SRC}|${HUMANIZE_RUNTIME}"
+  "triton-kernel-knowledge|hydrate-dir|${TRITON_KNOWLEDGE_SRC}|${HUMANIZE_RUNTIME}"
+  "triton-dcu-profiler-report|hydrate-dir|${TRITON_PROFILER_SRC}|${HUMANIZE_RUNTIME}"
 )
 usage() {
  cat <<'EOF'
-Install LightOp KernelPilot skills manually.
+Install KernelPilot skills manually.
 Usage:
  install-lightop-skills-manual.sh [options]
@@ -60,8 +66,8 @@ Options:
 EOF
 }
-log() { printf '[install-lightop-skills] %s\n' "$*"; }
+log() { printf '[install-kernelpilot-skills] %s\n' "$*"; }
-die() { printf '[install-lightop-skills] Error: %s\n' "$*" >&2; exit 1; }
+die() { printf '[install-kernelpilot-skills] Error: %s\n' "$*" >&2; exit 1; }
 resolve_kernelpilot_root() {
  KERNELPILOT_ROOT="$(cd "$KERNELPILOT_ROOT" 2>/dev/null && pwd || true)"
@@ -70,22 +76,40 @@ resolve_kernelpilot_root() {
  KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/knowledge"
  NCUREPORT_SRC="${KERNELPILOT_ROOT}/humanize/skills/ncu-report"
  AGENTLOOP_SRC="${KERNELPILOT_ROOT}/humanize/skills/humanize-kernel-agent-loop"
+  TRITON_AGENT_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-agent-loop"
+  TRITON_KNOWLEDGE_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-kernel-knowledge"
+  TRITON_PROFILER_SRC="${KERNELPILOT_ROOT}/humanize/skills/triton-dcu-profiler-report"
  HUMANIZE_RUNTIME="${KERNELPILOT_ROOT}/humanize"
  SKILLS=(
    "lightop-kernel-knowledge|symlink|${KNOWLEDGE_SRC}|"
    "dcu-profiler-report|symlink|${NCUREPORT_SRC}|"
    "lightop-kernel-agent-loop|hydrate|${AGENTLOOP_SRC}/SKILL.md|${HUMANIZE_RUNTIME}"
+    "triton-kernel-agent-loop|hydrate-dir|${TRITON_AGENT_SRC}|${HUMANIZE_RUNTIME}"
+    "triton-kernel-knowledge|hydrate-dir|${TRITON_KNOWLEDGE_SRC}|${HUMANIZE_RUNTIME}"
+    "triton-dcu-profiler-report|hydrate-dir|${TRITON_PROFILER_SRC}|${HUMANIZE_RUNTIME}"
  )
 }
 preflight() {
  local path
-  for path in "$KNOWLEDGE_SRC/SKILL.md" "$NCUREPORT_SRC/SKILL.md" "$AGENTLOOP_SRC/SKILL.md"; do
+  for path in "$KNOWLEDGE_SRC/SKILL.md" "$NCUREPORT_SRC/SKILL.md" "$AGENTLOOP_SRC/SKILL.md" "$TRITON_AGENT_SRC/SKILL.md" "$TRITON_KNOWLEDGE_SRC/SKILL.md" "$TRITON_PROFILER_SRC/SKILL.md"; do
    [[ -e "$path" ]] || die "not found: $path"
  done
 }
+hydrate_file() {
+  local file="$1"
+  local runtime="$2"
+  local tmp
+  tmp="${file}.tmp"
+  sed \
+    "s|{{HUMANIZE_RUNTIME_ROOT}}|${runtime}|g; s|{{KERNELPILOT_ROOT}}|${KERNELPILOT_ROOT}|g" \
+    "$file" > "$tmp"
+  mv "$tmp" "$file"
+}
 install_skill_dir() {
  local skills_dir="$1"
  local label="$2"
@@ -131,6 +155,16 @@ install_skill_dir() {
            "$src" > "${target}/SKILL.md"
        fi
        ;;
+      hydrate-dir)
+        if [[ "$DRY_RUN" == "true" ]]; then
+          log "DRY-RUN copy ${name} with hydrated paths"
+        else
+          log "copying ${name} with hydrated paths"
+          mkdir -p "$target"
+          cp -a "$src/." "$target/"
+          hydrate_file "${target}/SKILL.md" "$runtime"
+        fi
+        ;;
      *)
        die "unknown install kind: ${kind}"
        ;;
@@ -221,7 +255,7 @@ install_python_deps
 cat <<EOF
-Done. Installed LightOp skills:
+Done. Installed KernelPilot skills:
 $(for entry in "${SKILLS[@]}"; do
  IFS='|' read -r name kind _ <<< "$entry"
@@ -232,6 +266,9 @@ $(for entry in "${SKILLS[@]}"; do
    hydrate)
      printf "  - %-30s  (hydrated SKILL.md)\n" "$name"
      ;;
+    hydrate-dir)
+      printf "  - %-30s  (hydrated directory)\n" "$name"
+      ;;
  esac
 done)
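The `hydrate-dir` branch copies a skill directory and then substitutes the `{{HUMANIZE_RUNTIME_ROOT}}` and `{{KERNELPILOT_ROOT}}` placeholders in `SKILL.md`. The sketch below restates that step in Python purely for illustration; it is not part of the commit, and `hydrate_skill_dir` is a hypothetical name. One difference worth noting: `str.replace` treats the paths literally, whereas the `sed` expression in the script would misparse a path that contains `|` or `&`.

```python
import shutil
from pathlib import Path


def hydrate_skill_dir(src, target, runtime_root, kernelpilot_root):
    """Copy a skill directory, then fill the placeholders in its SKILL.md.

    Hypothetical Python restatement of the script's hydrate-dir branch.
    """
    src, target = Path(src), Path(target)
    # Rough analogue of `mkdir -p "$target"; cp -a "$src/." "$target/"`.
    shutil.copytree(src, target, symlinks=True, dirs_exist_ok=True)
    skill_md = target / "SKILL.md"
    text = skill_md.read_text(encoding="utf-8")
    text = text.replace("{{HUMANIZE_RUNTIME_ROOT}}", str(runtime_root))
    text = text.replace("{{KERNELPILOT_ROOT}}", str(kernelpilot_root))
    # Write to a sibling temp file and rename, as the shell helper does.
    tmp = skill_md.with_name(skill_md.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(skill_md)
    return skill_md
```

As in the shell version, only `SKILL.md` is hydrated; any placeholders in other copied files, such as those under `references/`, are left untouched.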

--- a/knowledge/README.md
+++ b/knowledge/README.md
-# LightOp Kernel Knowledge
+# KernelPilot Kernel Knowledge
-This directory backs the `lightop-kernel-knowledge` skill. For LightOp/DCU
+This directory backs the KernelPilot knowledge skills, including
-work, use evidence in this order:
+`lightop-kernel-knowledge` and `triton-kernel-knowledge`. For DCU kernel work,
+use evidence in this order:
-1. Local LightOp source, wrappers, bindings, config tables, tests, and
+1. Local target source, wrappers, bindings, config tables, tests, and
-   benchmarks.
+   benchmarks. For Triton work this includes vLLM/SGLang source or a
+   user-specified direct Triton file and its harness.
 2. ROCm/DCU official docs and upstream source: SourceFind DCU/DTK docs,
   ROCm/HIP, MIOpen, rocBLAS, hipBLASLt, Composable Kernel, Triton AMD,
-   PyTorch ROCm, SGLang/vLLM AMD paths, the Hygon HIP optimizer reference,
+   PyTorch ROCm, SGLang/vLLM AMD paths, AITER, AOTriton, Conch, FlagGems,
-   protected DCU Toolkit AMD knowledge-base pointer, plus bundled MR evidence
+   Liger Kernel, Hugging Face kernels, Triton-distributed, the Hygon HIP
-   from SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
+   optimizer reference, protected DCU Toolkit AMD knowledge-base pointer, plus
+   bundled MR evidence from SourceFind LightOp and DCU Toolkit
+   flash-attention-cutlass.
 3. The bundled CUDA-oriented PR corpus, only as cross-platform inspiration
   after translating and validating the idea on DCU.
@@ -30,6 +34,11 @@ python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --co
 python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
 python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
 python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
+python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact --limit 20
+python3 scripts/query.py "conch triton paged attention" --type source-reference --compact --limit 20
+python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact --limit 20
+python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact --limit 20
+python3 scripts/query.py "triton distributed allreduce gemm" --type source-reference --compact --limit 20
 python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
 python3 scripts/get_page.py <page-id>
 ```

--- a/knowledge/data/aliases.yaml
+++ b/knowledge/data/aliases.yaml
@@ -98,6 +98,37 @@ ck-tile:
  - "CK Tile"
  - Composable Kernel
+aiter:
+  - AITER
+  - "AI Tensor Engine"
+aotriton:
+  - AOTriton
+  - "AOT Triton"
+  - "ahead of time triton"
+conch:
+  - Conch
+  - conch-triton-kernels
+flaggems:
+  - FlagGems
+  - flag_gems
+liger-kernel:
+  - Liger
+  - "Liger Kernel"
+huggingface-kernels:
+  - "Hugging Face kernels"
+  - kernels-community
+  - hf-kernels
+triton-distributed:
+  - Triton-distributed
+  - triton_dist
+  - "triton distributed"
 # Kernel types
 moe:
  - MoE
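How `scripts/query.py` consumes `aliases.yaml` is not shown in this diff. As a minimal sketch, assuming the file maps a canonical key to a list of alternative spellings, a case-insensitive reverse lookup could look like the following. The `ALIASES` dict copies three entries from the hunk above, and `build_reverse_index` and `canonical_term` are hypothetical helpers.

```python
# Three entries copied from the hunk above; the real file has many more.
ALIASES = {
    "aiter": ["AITER", "AI Tensor Engine"],
    "flaggems": ["FlagGems", "flag_gems"],
    "triton-distributed": ["Triton-distributed", "triton_dist", "triton distributed"],
}


def build_reverse_index(aliases):
    """Map every lower-cased alias (and the key itself) to its canonical key."""
    index = {}
    for key, names in aliases.items():
        index[key.lower()] = key
        for name in names:
            index[name.lower()] = key
    return index


def canonical_term(term, index):
    """Return the canonical key for a term, or the term unchanged if unknown."""
    return index.get(term.strip().lower(), term)
```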

--- a/knowledge/index.json
+++ b/knowledge/index.json
 {
  "schema_version": 1,
-  "description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
+  "description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references, and open Triton kernel libraries such as AITER, AOTriton, Conch, FlagGems, Liger Kernel, Hugging Face kernels, and Triton-distributed); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
  "frameworks": [
    {
      "id": "nvidia-code-samples",
@@ -150,6 +150,196 @@
        "dccobjdump",
        "sqtt"
      ]
+    },
+    {
+      "id": "rocm-aiter",
+      "name": "AITER AI Tensor Engine for ROCm",
+      "repo": "ROCm/aiter",
+      "url": "https://github.com/ROCm/aiter",
+      "kernel_paths": [
+        "aiter",
+        "op_tests",
+        "docs",
+        "gradlib",
+        "csrc",
+        "requirements-triton-comms.txt",
+        ".github/scripts/install_triton.sh",
+        "README.md"
+      ],
+      "tags": [
+        "triton",
+        "rocm",
+        "aiter",
+        "attention",
+        "mla",
+        "paged-attention",
+        "fused-moe",
+        "gemm",
+        "rmsnorm",
+        "quantization",
+        "communication"
+      ]
+    },
+    {
+      "id": "rocm-aotriton",
+      "name": "AOTriton Ahead-of-Time Triton Math Library",
+      "repo": "ROCm/aotriton",
+      "url": "https://github.com/ROCm/aotriton",
+      "kernel_paths": [
+        "v2python",
+        "v2src",
+        "v3python",
+        "v3src",
+        "tritonsrc",
+        "include/aotriton",
+        "test",
+        "docs",
+        "README.md"
+      ],
+      "tags": [
+        "triton",
+        "rocm",
+        "aot",
+        "aotriton",
+        "flash-attention",
+        "sdpa",
+        "attention",
+        "compiler",
+        "codegen"
+      ]
+    },
+    {
+      "id": "stackav-conch",
+      "name": "Conch Triton Kernel Standard Library",
+      "repo": "stackav-oss/conch",
+      "url": "https://github.com/stackav-oss/conch",
+      "kernel_paths": [
+        "conch",
+        "tests",
+        "benchmarks",
+        "README.md",
+        "pyproject.toml"
+      ],
+      "tags": [
+        "triton",
+        "rocm",
+        "standard-library",
+        "paged-attention",
+        "varlen-attention",
+        "rmsnorm",
+        "rotary",
+        "kv-cache",
+        "fp8",
+        "int8",
+        "quantization",
+        "vllm"
+      ]
+    },
+    {
+      "id": "flaggems",
+      "name": "FlagGems Triton Operator Library",
+      "repo": "flagos-ai/FlagGems",
+      "url": "https://github.com/flagos-ai/FlagGems",
+      "kernel_paths": [
+        "src/flag_gems",
+        "benchmark",
+        "tests",
+        "modules_tests",
+        "experimental_tests",
+        "triton_src",
+        "docs",
+        "README.md"
+      ],
+      "tags": [
+        "triton",
+        "pytorch",
+        "operator-library",
+        "llm",
+        "backend-neutral",
+        "multi-backend",
+        "aten",
+        "normalization",
+        "reduction",
+        "elementwise",
+        "quantization"
+      ]
+    },
+    {
+      "id": "liger-kernel",
+      "name": "Liger Kernel Triton Kernels for LLM Training",
+      "repo": "linkedin/Liger-Kernel",
+      "url": "https://github.com/linkedin/Liger-Kernel",
+      "kernel_paths": [
+        "src/liger_kernel",
+        "test",
+        "benchmark",
+        "examples",
+        "docs",
+        "README.md"
+      ],
+      "tags": [
+        "triton",
+        "llm-training",
+        "rmsnorm",
+        "rope",
+        "swiglu",
+        "cross-entropy",
+        "fused-linear-cross-entropy",
+        "loss",
+        "amd"
+      ]
+    },
+    {
+      "id": "huggingface-kernels",
+      "name": "Hugging Face Kernels and kernels-community Hub",
+      "repo": "huggingface/kernels",
+      "url": "https://github.com/huggingface/kernels",
+      "kernel_paths": [
+        "src",
+        "kernel-builder",
+        "tests",
+        "README.md",
+        "https://huggingface.co/kernels",
+        "https://huggingface.co/kernels-community"
+      ],
+      "tags": [
+        "triton",
+        "kernel-hub",
+        "kernels-community",
+        "paged-attention",
+        "triton-moe",
+        "triton-scaled-mm",
+        "rmsnorm",
+        "rotary",
+        "quantization"
+      ]
+    },
+    {
+      "id": "triton-distributed",
+      "name": "Triton-distributed",
+      "repo": "ByteDance-Seed/Triton-distributed",
+      "url": "https://github.com/ByteDance-Seed/Triton-distributed",
+      "kernel_paths": [
+        "python",
+        "lib",
+        "include",
+        "csrc",
+        "docs",
+        "tests",
+        "README.md"
+      ],
+      "tags": [
+        "triton",
+        "distributed",
+        "communication-overlap",
+        "gemm-allreduce",
+        "allgather-gemm",
+        "reduce-scatter",
+        "moe",
+        "flash-decode",
+        "amd",
+        "nvidia"
+      ]
    }
  ],
  "topics": [
@@ -161,7 +351,12 @@
        "simveit-load-and-store",
        "colfax-article-src",
        "colfax-cutlass-kernels",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "rocm-aotriton",
+        "stackav-conch",
+        "huggingface-kernels",
+        "triton-distributed"
      ],
      "tags": [
        "attention",
@@ -181,7 +376,12 @@
        "simveit-load-and-store",
        "colfax-article-src",
        "colfax-cutlass-kernels",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "stackav-conch",
+        "flaggems",
+        "huggingface-kernels",
+        "triton-distributed"
      ],
      "tags": [
        "gemm",
@@ -200,7 +400,10 @@
        "simveit-load-and-store",
        "colfax-article-src",
        "colfax-cutlass-kernels",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "huggingface-kernels",
+        "triton-distributed"
      ],
      "tags": [
        "moe",
@@ -217,7 +420,12 @@
      "applies_to": [
        "nvidia-code-samples",
        "simveit-effective-transpose",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "stackav-conch",
+        "flaggems",
+        "liger-kernel",
+        "huggingface-kernels"
      ],
      "tags": [
        "rmsnorm",
@@ -232,7 +440,12 @@
      "name": "Activation / element-wise fusion",
      "applies_to": [
        "simveit-effective-transpose",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "stackav-conch",
+        "flaggems",
+        "liger-kernel",
+        "huggingface-kernels"
      ],
      "tags": [
        "silu",
@@ -250,7 +463,11 @@
        "simveit-load-and-store",
        "colfax-article-src",
        "colfax-cutlass-kernels",
-        "hygon-hip-kernel-optimizer"
+        "hygon-hip-kernel-optimizer",
+        "rocm-aiter",
+        "stackav-conch",
+        "flaggems",
+        "huggingface-kernels"
      ],
      "tags": [
        "fp8",
@@ -262,6 +479,48 @@
        "per-tensor",
        "per-channel"
      ]
+    },
+    {
+      "id": "triton-open-kernel-libraries",
+      "name": "Open Triton kernel libraries",
+      "applies_to": [
+        "rocm-aiter",
+        "rocm-aotriton",
+        "stackav-conch",
+        "flaggems",
+        "liger-kernel",
+        "huggingface-kernels",
+        "triton-distributed"
+      ],
+      "tags": [
+        "triton",
+        "open-source",
+        "llm",
+        "dcu",
+        "rocm",
+        "benchmark",
+        "unit-test",
+        "reference-implementation"
+      ]
+    },
+    {
+      "id": "distributed-triton",
+      "name": "Distributed Triton and compute-communication overlap",
+      "applies_to": [
+        "rocm-aiter",
+        "triton-distributed"
+      ],
+      "tags": [
+        "triton",
+        "distributed",
+        "communication",
+        "allreduce",
+        "allgather",
+        "reduce-scatter",
+        "moe",
+        "tensor-parallel",
+        "expert-parallel"
+      ]
    }
  ]
 }
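The `applies_to` lists under `topics` refer to `id` values under `frameworks`, so a topic can be resolved to upstream repositories with a simple join. A minimal sketch, assuming only fields visible in this diff (`id`, `url`, `applies_to`); `repos_for_topic` is a hypothetical helper and the inline `INDEX` is a trimmed stand-in for `knowledge/index.json`:

```python
import json

# Trimmed stand-in for knowledge/index.json with fields seen in the diff.
INDEX = json.loads("""
{
  "frameworks": [
    {"id": "rocm-aiter", "url": "https://github.com/ROCm/aiter"},
    {"id": "triton-distributed",
     "url": "https://github.com/ByteDance-Seed/Triton-distributed"}
  ],
  "topics": [
    {"id": "distributed-triton",
     "applies_to": ["rocm-aiter", "triton-distributed"]}
  ]
}
""")


def repos_for_topic(index, topic_id):
    """Return (framework id, url) pairs for a topic, in applies_to order."""
    by_id = {fw["id"]: fw for fw in index["frameworks"]}
    topic = next((t for t in index["topics"] if t["id"] == topic_id), None)
    if topic is None:
        raise KeyError(f"unknown topic: {topic_id}")
    missing = [fid for fid in topic["applies_to"] if fid not in by_id]
    if missing:
        # A dangling id means applies_to and frameworks are out of sync.
        raise ValueError(f"topic {topic_id} references unknown frameworks: {missing}")
    return [(fid, by_id[fid]["url"]) for fid in topic["applies_to"]]
```

The dangling-id check is a design choice for this sketch: with seven framework entries and several `applies_to` lists edited in one commit, a typo in either place would otherwise go unnoticed.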
--- a/knowledge/sources/refs/flaggems.md
+++ b/knowledge/sources/refs/flaggems.md
+---
+id: ref-flaggems
+repo: flagos-ai/FlagGems
+title: FlagGems Triton Operator Library
+url: https://github.com/flagos-ai/FlagGems
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- nvidia
+- rocm
+- dcu
+tags:
+- triton
+- flaggems
+- pytorch
+- operator-library
+- backend-neutral
+- multi-backend
+- aten
+- normalization
+- reduction
+- elementwise
+- quantization
+techniques:
+- pytorch-operator-replacement
+- backend-neutral-triton
+- test-matrix
+- benchmark-matrix
+- operator-coverage
+hardware_features:
+- wavefront
+- lds
+- vectorization
+- cache
+kernel_types:
+- normalization
+- reduction
+- elementwise
+- activation
+- quantization
+- gemm
+languages:
+- python
+- triton
+captured_at: '2026-05-26'
+license: not-captured
+source_paths:
+- src/flag_gems
+- benchmark
+- tests
+- modules_tests
+- experimental_tests
+- triton_src
+- docs
+- README.md
+---
+# FlagGems Triton Operator Library
+- Repository: `flagos-ai/FlagGems`
+- Source: [flagos-ai/FlagGems](https://github.com/flagos-ai/FlagGems)
+## Route Fit
+Use FlagGems when the Triton task is a PyTorch-style operator, normalization,
+reduction, activation, elementwise fusion, or backend-neutral replacement. It is
+less LLM-serving-specific than AITER or Conch, but it is valuable for portable
+Triton operator structure, tests, and benchmark organization.
+## What To Inspect
+- `src/flag_gems` and `triton_src` for operator implementations.
+- `tests`, `modules_tests`, and `experimental_tests` for dtype/shape coverage.
+- `benchmark` for performance harness layout and comparison policy.
+## DCU Use Notes
+Treat FlagGems constants as hypotheses. Its portability makes it useful for
+syntax and wrapper design, but final tuning still needs DCU profiler, IR/ISA,
+and target-version proof.
+## Query Hooks
+```bash
+python3 scripts/query.py "flaggems triton rmsnorm reduction" --type source-reference --compact
+python3 scripts/query.py "flaggems triton pytorch operator" --type source-reference --compact
+python3 scripts/get_page.py ref-flaggems
+```
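Each reference page carries YAML front matter, and several pages in this commit record `license: not-captured`. Because the skill asks for a license/notice check before copying or adapting code, a pre-flight check might look like the sketch below. It is an illustration only: `front_matter_scalars` and `needs_license_review` are hypothetical names, and the parser handles only top-level `key: value` scalars rather than full YAML.

```python
def front_matter_scalars(text):
    """Extract top-level `key: value` scalars from a `---` delimited header."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Skip list items and nested lines; keep only top-level scalars.
        if line.startswith(("-", " ")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip("'\"")
        if value:
            fields[key.strip()] = value
    return fields


def needs_license_review(text):
    """True when the page has no license recorded or marks it not-captured."""
    return front_matter_scalars(text).get("license", "not-captured") == "not-captured"
```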
--- a/knowledge/sources/refs/huggingface-kernels.md
+++ b/knowledge/sources/refs/huggingface-kernels.md
+---
+id: ref-huggingface-kernels
+repo: huggingface/kernels
+title: Hugging Face Kernels and kernels-community Hub
+url: https://github.com/huggingface/kernels
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- nvidia
+- rocm
+- dcu
+tags:
+- triton
+- huggingface
+- kernels-community
+- kernel-hub
+- paged-attention
+- triton-moe
+- triton-scaled-mm
+- rmsnorm
+- rotary
+- quantization
+techniques:
+- kernel-package
+- direct-file-harness
+- reference-discovery
+- hub-source-search
+hardware_features:
+- wavefront
+- lds
+- mfma
+- cache
+kernel_types:
+- attention
+- moe
+- gemm
+- normalization
+- rotary
+- quantization
+languages:
+- python
+- triton
+captured_at: '2026-05-26'
+license: not-captured
+source_paths:
+- src
+- kernel-builder
+- tests
+- README.md
+- https://huggingface.co/kernels
+- https://huggingface.co/kernels-community
+---
+# Hugging Face Kernels And kernels-community Hub
+- Repository: `huggingface/kernels`
+- Source: [huggingface/kernels](https://github.com/huggingface/kernels)
+- Hub: [huggingface.co/kernels](https://huggingface.co/kernels)
+## Route Fit
+Use Hugging Face kernels as a discovery route for small, packageable Triton
+kernels and direct-file references. It is useful for finding paged attention,
+Triton MoE, scaled MM, RMSNorm, rotary, and quantization examples that are
+easier to inspect than a full serving framework.
+## What To Inspect
+- Kernel package metadata and source links on the Hub.
+- `tests` and package examples for minimal wrappers.
+- Whether the package declares ROCm/AMD support or only works on CUDA.
+## DCU Use Notes
+Treat Hub kernels as candidates, not proof. Before adapting one, capture the
+source URL, package revision, and license/notice, then run direct correctness
+checks, benchmarks, and Triton cache/profiler proof on DCU.
+## Query Hooks
+```bash
+python3 scripts/query.py "huggingface kernels triton moe" --type source-reference --compact
+python3 scripts/query.py "kernels-community paged attention triton" --type source-reference --compact
+python3 scripts/query.py "triton scaled mm kernels community" --type source-reference --compact
+python3 scripts/get_page.py ref-huggingface-kernels
+```
--- a/knowledge/sources/refs/liger-kernel.md
+++ b/knowledge/sources/refs/liger-kernel.md
+---
+id: ref-liger-kernel
+repo: linkedin/Liger-Kernel
+title: Liger Kernel Triton Kernels for LLM Training
+url: https://github.com/linkedin/Liger-Kernel
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- nvidia
+- rocm
+- dcu
+tags:
+- triton
+- liger-kernel
+- llm-training
+- rmsnorm
+- rope
+- swiglu
+- cross-entropy
+- fused-linear-cross-entropy
+- loss
+- amd
+techniques:
+- llm-training-kernel
+- autograd-wrapper
+- fused-epilogue
+- memory-reduction
+- benchmark
+hardware_features:
+- wavefront
+- vectorization
+- cache
+kernel_types:
+- normalization
+- rotary
+- activation
+- loss
+- fused-linear
+languages:
+- python
+- triton
+captured_at: '2026-05-26'
+license: not-captured
+source_paths:
+- src/liger_kernel
+- test
+- benchmark
+- examples
+- docs
+- README.md
+---
+# Liger Kernel Triton Kernels For LLM Training
+- Repository: `linkedin/Liger-Kernel`
+- Source: [linkedin/Liger-Kernel](https://github.com/linkedin/Liger-Kernel)
+## Route Fit
+Use Liger Kernel when the Triton task is training-side LLM work: RMSNorm, RoPE,
+SwiGLU, cross entropy, fused linear + loss, or memory-saving fused epilogues.
+It is most useful for direct-file Triton structure, PyTorch autograd wrappers,
+and correctness/benchmark patterns around training kernels.
+## What To Inspect
+- `src/liger_kernel` for wrappers and Triton kernels.
+- `test` for numerical tolerance and training-shape coverage.
+- `benchmark` and `examples` for measuring memory and runtime tradeoffs.
+## DCU Use Notes
+Use Liger as an algorithmic and harness reference. When porting to DCU, verify
+dtype support, generated ISA, and resource pressure on the target card before
+claiming a performance win.
+## Query Hooks
+```bash
+python3 scripts/query.py "liger triton rmsnorm swiglu" --type source-reference --compact
+python3 scripts/query.py "liger fused linear cross entropy" --type source-reference --compact
+python3 scripts/get_page.py ref-liger-kernel
+```
--- a/knowledge/sources/refs/rocm-aiter.md
+++ b/knowledge/sources/refs/rocm-aiter.md
+---
+id: ref-rocm-aiter
+repo: ROCm/aiter
+title: AITER AI Tensor Engine for ROCm
+url: https://github.com/ROCm/aiter
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- rocm
+- dcu
+tags:
+- triton
+- rocm
+- aiter
+- vllm
+- sglang
+- attention
+- mla
+- paged-attention
+- fused-moe
+- gemm
+- rmsnorm
+- quantization
+- communication
+techniques:
+- backend-dispatch
+- triton-kernel-reference
+- triton-comms
+- aiter-fallback-map
+- rocm-first-validation
+hardware_features:
+- wavefront
+- lds
+- mfma
+- mmac
+- gfx
+kernel_types:
+- attention
+- mla
+- moe
+- gemm
+- normalization
+- quantization
+- communication
+languages:
+- python
+- triton
+- cpp
+- hip
+captured_at: '2026-05-26'
+license: MIT
+source_paths:
+- aiter
+- op_tests
+- docs
+- gradlib
+- csrc
+- requirements-triton-comms.txt
+- .github/scripts/install_triton.sh
+- README.md
+---
+# AITER AI Tensor Engine For ROCm
+- Repository: `ROCm/aiter`
+- Source: [ROCm/aiter](https://github.com/ROCm/aiter)
+- License: `MIT`
+## Route Fit
+Use AITER as the first-choice upstream evidence when optimizing Triton kernels that
+compete with, wrap, or fall back to AITER paths in vLLM or SGLang. It is
+especially useful for ROCm/DCU-facing dispatch, attention/MLA, fused MoE,
+RMSNorm, quantized GEMM, and communication-related Triton kernels.
+## What To Inspect
+- Backend selection and fallback logic around AITER versus Triton.
+- `op_tests` for shape coverage, tolerances, and reproducible correctness.
+- Triton communication docs and install scripts when a task touches distributed
+  overlap, all-reduce, or tensor/expert parallel serving.
+- Kernel names and wrapper APIs that vLLM or SGLang may already recognize.
+## DCU Use Notes
+Treat AITER as ROCm-strong evidence, but still prove the exact local framework
+path and the generated Triton kernel on the target DCU. Do not assume an AITER
+kernel is selected unless backend logs, env flags, profiler names, or a direct
+wrapper benchmark prove it.
+## Query Hooks
+```bash
+python3 scripts/query.py "aiter triton mla rocm" --type source-reference --compact
+python3 scripts/query.py "aiter fused moe triton" --type source-reference --compact
+python3 scripts/query.py "aiter triton comms allreduce" --type source-reference --compact
+python3 scripts/get_page.py ref-rocm-aiter
+```
--- a/knowledge/sources/refs/rocm-aotriton.md
+++ b/knowledge/sources/refs/rocm-aotriton.md
+---
+id: ref-rocm-aotriton
+repo: ROCm/aotriton
+title: AOTriton Ahead-of-Time Triton Math Library
+url: https://github.com/ROCm/aotriton
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- rocm
+- dcu
+tags:
+- triton
+- rocm
+- aotriton
+- aot
+- flash-attention
+- sdpa
+- attention
+- compiler
+- codegen
+- pytorch
+techniques:
+- ahead-of-time-triton
+- flash-attention-reference
+- sdpa-reference
+- compiler-artifact-inspection
+- rocm-codegen
+hardware_features:
+- wavefront
+- lds
+- mfma
+- gfx
+kernel_types:
+- attention
+- flash-attention
+- sdpa
+languages:
+- python
+- triton
+- cpp
+- cmake
+captured_at: '2026-05-26'
+license: not-captured
+source_paths:
+- v2python
+- v2src
+- v3python
+- v3src
+- tritonsrc
+- include/aotriton
+- test
+- docs
+- README.md
+---
+# AOTriton Ahead-Of-Time Triton Math Library
+- Repository: `ROCm/aotriton`
+- Source: [ROCm/aotriton](https://github.com/ROCm/aotriton)
+## Route Fit
+Use AOTriton when the task is about ROCm Triton codegen, FlashAttention,
+PyTorch SDPA integration, generated code objects, or version/toolchain-sensitive
+attention behavior. It is not a drop-in vLLM/SGLang optimization recipe, but it
+is strong upstream evidence for how AMD packages and validates Triton attention
+kernels.
+## What To Inspect
+- `tritonsrc`, `v2python`, and `v3python` for Triton source patterns.
+- `v2src`, `v3src`, and `include/aotriton` for generated/AOT integration.
+- Tests and docs for shape constraints, dtype support, and ROCm build knobs.
+## DCU Use Notes
+Borrow codegen and attention structure only after checking the installed
+Triton/ROCm/DTK version. For runtime Triton JIT work, prefer evidence from the
+target's Triton cache, IR/ISA, and profiler over assuming that AOTriton-generated
+artifacts match the local environment.
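A minimal sketch of inspecting the local Triton JIT cache as evidence. It assumes the cache lives under `TRITON_CACHE_DIR` when that variable is set and under `~/.triton/cache` otherwise, so check this against the installed Triton version. Both helper names are hypothetical.

```python
import os
from collections import Counter
from pathlib import Path


def triton_cache_dir(env=os.environ):
    """Return the assumed Triton JIT cache directory for this environment."""
    default = Path.home() / ".triton" / "cache"
    return Path(env.get("TRITON_CACHE_DIR", default))


def artifact_counts(cache_dir):
    """Count cached files by suffix (IR dumps, code objects, metadata)."""
    if not cache_dir.is_dir():
        return Counter()
    return Counter(p.suffix for p in cache_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    print(artifact_counts(triton_cache_dir()))
```

An empty result after a run usually means the kernel under test was not JIT-compiled in this environment, or that the cache lives elsewhere.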
+## Query Hooks
+```bash
+python3 scripts/query.py "aotriton flash attention rocm" --type source-reference --compact
+python3 scripts/query.py "aotriton triton codegen sdpa" --type source-reference --compact
+python3 scripts/get_page.py ref-rocm-aotriton
+```
--- a/knowledge/sources/refs/stackav-conch.md
+++ b/knowledge/sources/refs/stackav-conch.md
+---
+id: ref-stackav-conch
+repo: stackav-oss/conch
+title: Conch Triton Kernel Standard Library
+url: https://github.com/stackav-oss/conch
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- rocm
+- nvidia
+- dcu
+tags:
+- triton
+- conch
+- standard-library
+- rocm
+- paged-attention
+- varlen-attention
+- rmsnorm
+- rotary
+- kv-cache
+- fp8
+- int8
+- quantization
+- vllm
+techniques:
+- pytorch-reference
+- microbenchmark
+- unit-test
+- direct-file-harness
+- kernel-wrapper-pattern
+hardware_features:
+- wavefront
+- lds
+- mfma
+- cache
+kernel_types:
+- attention
+- paged-attention
+- normalization
+- rotary
+- quantization
+- kv-cache
+languages:
+- python
+- triton
+captured_at: '2026-05-26'
+license: Apache-2.0
+source_paths:
+- conch
+- tests
+- benchmarks
+- README.md
+- pyproject.toml
+---
+# Conch Triton Kernel Standard Library
+- Repository: `stackav-oss/conch`
+- Source: [stackav-oss/conch](https://github.com/stackav-oss/conch)
+- Package: [conch-triton-kernels](https://pypi.org/project/conch-triton-kernels/)
+- License: `Apache-2.0`
+## Route Fit
+Use Conch as a high-quality open Triton kernel reference for direct-file mode
+and vLLM-adjacent serving kernels. It is useful when the task needs a PyTorch
+reference, unit test, microbenchmark, launch wrapper, or standalone Triton file
+that can be adapted into `.humanize/triton-agent/` harnesses.
+## What To Inspect
+- Paged attention, varlen attention, rotary, RMSNorm, KV-cache, and quantized
+  utility kernels.
+- `tests` for correctness tolerances and edge cases.
+- `benchmarks` for warmed timing and direct wrapper invocation patterns.
+## DCU Use Notes
+Conch is useful because it is closer to direct-file Triton work than framework
+backend code. Still verify ROCm/DCU behavior with local profiler names, cache
+entries, and target `gcnArchName`; do not copy tuning constants blindly.
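A minimal sketch of reading the target `gcnArchName` before reusing any tuning constants. It assumes a ROCm build of PyTorch is installed and returns `None` otherwise. `target_gcn_arch` and `base_arch` are hypothetical helper names.

```python
def base_arch(gcn_arch_name):
    """Strip feature suffixes such as ':sramecc+:xnack-' from an arch string."""
    return gcn_arch_name.split(":")[0]


def target_gcn_arch(device_index=0):
    """Return the arch string of a visible device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    # gcnArchName is only present on ROCm builds of PyTorch.
    return getattr(props, "gcnArchName", None)


if __name__ == "__main__":
    arch = target_gcn_arch()
    print(base_arch(arch) if arch else "no ROCm device visible")
```

Record the reported arch next to each benchmark result so that tuning constants are never compared across different targets.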
+## Query Hooks
+```bash
+python3 scripts/query.py "conch triton paged attention" --type source-reference --compact
+python3 scripts/query.py "conch rmsnorm rotary fp8" --type source-reference --compact
+python3 scripts/query.py "conch direct file harness triton" --type source-reference --compact
+python3 scripts/get_page.py ref-stackav-conch
+```
--- a/knowledge/sources/refs/triton-distributed.md
+++ b/knowledge/sources/refs/triton-distributed.md
+---
+id: ref-triton-distributed
+repo: ByteDance-Seed/Triton-distributed
+title: Triton-distributed
+url: https://github.com/ByteDance-Seed/Triton-distributed
+source_type: source-reference
+source_category: open-triton-kernel-library
+architectures:
+- amd
+- nvidia
+- rocm
+- dcu
+tags:
+- triton
+- distributed
+- communication-overlap
+- allreduce
+- allgather
+- reduce-scatter
+- gemm
+- moe
+- flash-decode
+- amd
+- nvidia
+techniques:
+- compute-communication-overlap
+- gemm-allreduce
+- allgather-gemm
+- reduce-scatter-overlap
+- distributed-kernel
+hardware_features:
+- wavefront
+- lds
+- mfma
+- interconnect
+kernel_types:
+- gemm
+- attention
+- moe
+- communication
+languages:
+- python
+- triton
+- cpp
+captured_at: '2026-05-26'
+license: not-captured
+source_paths:
+- python
+- lib
+- include
+- csrc
+- docs
+- tests
+- README.md
+---
+# Triton-distributed
+- Repository: `ByteDance-Seed/Triton-distributed`
+- Source: [ByteDance-Seed/Triton-distributed](https://github.com/ByteDance-Seed/Triton-distributed)
+- Docs: [Triton-distributed kernels](https://triton-distributed.readthedocs.io/en/latest/kernels/index.html)
+## Route Fit
+Use Triton-distributed when the optimization touches tensor parallelism, expert
+parallelism, MoE communication, GEMM + all-reduce, all-gather + GEMM,
+reduce-scatter overlap, or distributed flash decode. It is not the first source
+for single-kernel tuning, but it is a strong reference for overlap-aware Triton
+design.
+## What To Inspect
+- Distributed kernel examples and docs for communication overlap patterns.
+- Tests for shape and process-group assumptions.
+- Backend support notes; separate AMD-compatible ideas from NVIDIA-only paths.
+## DCU Use Notes
+For DCU, prove the communication backend, process topology, and profiler kernel
+presence before reusing overlap patterns. Treat NVIDIA-specific launch or
+interconnect assumptions as cross-platform inspiration only.
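A minimal sketch of recording the process topology each rank sees. It assumes a torchrun-style launcher that exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `LOCAL_WORLD_SIZE`; other launchers may use different variables. `launch_topology` is a hypothetical helper.

```python
import os

# Variables exported by torchrun-style launchers (assumed launcher).
TOPOLOGY_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE")


def launch_topology(env=os.environ):
    """Return launcher-provided topology values as ints, or None if unset."""
    def as_int(name):
        value = env.get(name)
        return int(value) if value is not None else None

    return {name.lower(): as_int(name) for name in TOPOLOGY_VARS}


if __name__ == "__main__":
    print(launch_topology())
```

Log this once per rank alongside the communication backend reported by the framework, so overlap results can be tied to a specific topology.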
+## Query Hooks
+```bash
+python3 scripts/query.py "triton distributed allreduce gemm" --type source-reference --compact
+python3 scripts/query.py "triton distributed moe reduce scatter" --type source-reference --compact
+python3 scripts/get_page.py ref-triton-distributed
+```