"description":"Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
- Access status on 2026-05-20: anonymous page access redirected to GitLab sign-in; anonymous API tree access returned project-not-found.
## Route Fit
Use this as a Route B pointer for AMD/DCU upstream knowledge once authenticated access or a local export is available. The content is not mirrored here because the upstream project requires sign-in, so do not cite it as direct evidence until the actual files have been synced or pasted into the workspace.
## How To Use
- Search this entry when a DCU optimization question needs AMD/ROCm/HIP knowledge beyond local LightOp and bundled MR evidence.
- If credentials or an exported copy become available, import the actual markdown/source files into `sources/refs/` or an evidence bundle, and record the commit or version plus license and notice details.
- Until the files are imported, cite this page only as a pending protected source, not as implementation proof.
## Query Hooks
```bash
python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact
```
Use this as Route B DCU/Hygon optimization knowledge before falling back to CUDA-only PR evidence. It is a workflow and method catalog, not LightOp source code, so any borrowed idea still needs LightOp integration, target compilation, benchmark correctness, `hipprof` evidence, and `dccobjdump` ISA verification.
## Core Operating Rules
- Treat DCU wavefront size as 64 and rewrite CUDA warp32 assumptions before reuse.
- Prefer CK Tile or HCU examples for GEMM, convolution, attention, MoE, and norm template work; do not port CUTLASS paths by name.
- Use benchmark timing, `hipprof` PMC/read/write passes, optional SQTT, code-object resource analysis, and final `dccobjdump` together. Final ISA is the proof for Hygon behavior.
- Keep a branch-and-select loop: profile the current best, choose methods from measured bottlenecks, generate several variants, benchmark all correct branches, then keep only changes with positive attribution.
- Treat source-backed HCU or AMD-named builtins as candidates only when the exact call signature and target guard are visible in DCU source, an existing project, or a minimal compile probe.
- Do not invent builtin names from AMD docs or mnemonic guesses. Compile the exact candidate and inspect generated Hygon ISA.
- Avoid FP4 strategies for current Hygon DCU targets unless newer hardware/toolchain evidence appears.
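
The branch-and-select rule above can be sketched as a small host-side driver. This is a minimal illustration, not part of LightOp or any referenced toolchain: `Variant`, `propose`, `branch_and_select`, and the `min_speedup` threshold are all hypothetical names, and `propose` stands in for the profiling and method-selection step that a real workflow would drive from `hipprof` data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Variant:
    """One kernel branch. All fields are hypothetical placeholders."""
    name: str
    check: Callable[[], bool]    # True when output matches the reference
    bench: Callable[[], float]   # measured latency in milliseconds (> 0)


def branch_and_select(
    baseline: Variant,
    propose: Callable[[Variant], List[Variant]],
    max_rounds: int = 3,
    min_speedup: float = 1.02,
) -> Tuple[Variant, float, List[str]]:
    """Keep a variant only when it is correct and measurably faster."""
    best, best_ms = baseline, baseline.bench()
    log: List[str] = []
    for _ in range(max_rounds):
        # propose() stands in for "profile the current best and choose
        # methods from the measured bottleneck"; it returns new branches.
        # Incorrect branches are dropped before any timing is compared.
        timed = [(v, v.bench()) for v in propose(best) if v.check()]
        if not timed:
            break
        winner, winner_ms = min(timed, key=lambda pair: pair[1])
        if best_ms / winner_ms < min_speedup:
            break  # no positive attribution: keep the current best
        log.append(f"{best.name} -> {winner.name}: {best_ms / winner_ms:.2f}x")
        best, best_ms = winner, winner_ms
    return best, best_ms, log
```

The `min_speedup` guard is one way to express "positive attribution": a branch that wins by less than the assumed noise margin is discarded rather than stacked onto the current best. The selected branch would still need the `hipprof` and `dccobjdump` checks listed above before it counts as evidence.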
## Method Families
- Compute: HCU MMOP/MMAC matrix core, FP8/BF8/TF32 low precision on supported targets, wave64 launch geometry, thread coarsening, register-pressure control, fast math, and inline asm/builtin escape hatches.
- Memory: coalesced and vectorized access, layout transforms, LDS tiling, global-to-LDS paths, CK Tile MLS/TLS/WASP loaders, DS matrix-read layouts, LDS bank-conflict reduction, cache policy, CK Tile named pipelines, and epilogue fusion.