Commit 91f343be authored by whlwhlwhl

add dcu knowledge

parent 21b6d8ec
......@@ -19,7 +19,9 @@ Useful local commands:
```bash
python3 scripts/query.py "<keywords>" --compact
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact
python3 scripts/search-pr-diffs.py <term1> <term2> [--any]
python3 scripts/get_page.py <pr-page-id>
python3 scripts/get_page.py <page-id>
python3 scripts/validate.py
```
......@@ -7,14 +7,16 @@ work, use evidence in this order:
benchmarks.
2. ROCm/DCU official docs and upstream source: SourceFind DCU/DTK docs,
ROCm/HIP, MIOpen, rocBLAS, hipBLASLt, Composable Kernel, Triton AMD,
PyTorch ROCm, SGLang/vLLM AMD paths, plus bundled MR evidence from
SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
PyTorch ROCm, SGLang/vLLM AMD paths, the Hygon HIP optimizer reference,
the protected DCU Toolkit AMD knowledge-base pointer, plus bundled MR evidence
from SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
3. The bundled CUDA-oriented PR corpus, only as cross-platform inspiration
after translating and validating the idea on DCU.
The existing corpus is still available:
- PR pages under `sources/prs/`
- Source reference pages under `sources/refs/`
- PR evidence bundles under `evidence/pull-bundles/`
- Candidate ledgers under `candidates/`
- Metadata under `data/`
......@@ -26,14 +28,21 @@ Run local PR queries from this directory:
python3 scripts/query.py "<keywords>" --compact --limit 30
python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --compact --limit 20
python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
python3 scripts/get_page.py <pr-page-id>
python3 scripts/get_page.py <page-id>
```
When CUDA evidence shapes a LightOp implementation, record that it is
inspiration, cite the source path/URL and commit/version, and explain the DCU
translation: HIP API, CU/wavefront/LDS/MFMA/resource/config differences.
Protected source pointers, such as the DCU Toolkit AMD knowledge base, are
index entries only until authenticated content is synced. Do not cite them as
direct implementation evidence without the imported file path, commit/version,
and license/notice details.
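
The citation rule above can be sketched as a small check on a page's frontmatter. This is only an illustration: the field names `source_paths`, `commit`, and `license` are taken from the `ref-hygon-hip-kernel-optimizer` page, while the helper `citation_blockers` is hypothetical and not part of `scripts/`.

```python
# Fields a source-reference page is assumed to need before it can be cited
# as direct implementation evidence (imported path, commit/version, license).
REQUIRED_FIELDS = ("source_paths", "commit", "license")


def citation_blockers(meta: dict) -> list[str]:
    """Return the reasons a page is still only an index entry.

    An empty list means every required field is present and non-empty.
    """
    return [f"missing {field}" for field in REQUIRED_FIELDS if not meta.get(field)]


# Frontmatter excerpts as plain dicts (hand-copied, not parsed from disk).
pending = {
    "id": "ref-dcu-toolkit-amd-knowledge-base",
    "access": "protected-login-required",
}
synced = {
    "id": "ref-hygon-hip-kernel-optimizer",
    "commit": "c069290452aee67baa709f55d767358ab4171e69",
    "license": "MIT",
    "source_paths": ["skills/hygon-hip-kernel-optimizer/SKILL.md"],
}

for meta in (pending, synced):
    print(meta["id"], citation_blockers(meta) or "citable")
```

A check like this only confirms that the metadata exists. It does not verify that the imported files match the recorded commit.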
Validate the bundled corpus with:
```bash
......
......@@ -65,6 +65,13 @@ Useful source families:
- Bundled MR evidence from the DCU Toolkit flash-attention-cutlass repository:
`sources/prs/flash-attention-cutlass/` with artifacts under
`evidence/pull-bundles/flash-attention-cutlass/`.
- Source reference pages under `sources/refs/`, especially:
- `ref-hygon-hip-kernel-optimizer` for Hygon/DCU HIP optimization workflow,
method taxonomy, profiling signals, wave64 rules, CK Tile-first guidance,
source-backed builtin discipline, and `dccobjdump` ISA verification.
- `ref-dcu-toolkit-amd-knowledge-base` for the protected DCU Toolkit
`amd-knowledge-base` pointer. Use it as a pending source until the
authenticated files are synced locally.
- SourceFind DCU/DTK docs, especially the performance analysis tool guide:
`https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html`
- ROCm/HIP docs and profiler docs.
......@@ -92,8 +99,10 @@ Run from the knowledge root:
python3 scripts/query.py "<operator> <dtype> <symptom>" --compact --limit 30
python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --compact --limit 20
python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
python3 scripts/get_page.py <pr-page-id>
python3 scripts/get_page.py <page-id>
```
Before borrowing an idea, open the bundle named by `artifact_dir`:
......@@ -116,6 +125,14 @@ Translation rules:
- Map CUDA concepts to ROCm concepts explicitly: SM to CU, warp to wavefront
where appropriate, shared memory to LDS, tensor-core paths to MFMA or the
selected ROCm library/backend.
- For Hygon/DCU-specific claims, prefer target-compiled proof: `hipprof`,
code-object resource analysis, optional SQTT, and final `dccobjdump` output.
Treat AMD MFMA and CUDA tensor-core names as analogies until the Hygon ISA
shows `v_mmac`, MMOP, matrix-load, DS matrix-read, direct-to-LDS, or other
expected DCU patterns.
- Use source-backed HCU or AMD-named builtins only when the exact signature is
present in a DCU source page, existing project source, or a compile probe.
Do not invent builtin names from mnemonics or generic AMD documents.
- Do not cite CUDA profiler metrics as evidence for a DCU bottleneck.
- Record the CUDA source as inspiration, not as a DCU implementation proof.
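
As a rough aid for the warp-to-wavefront mapping above, the sketch below flags lines in CUDA-derived source that commonly encode a 32-lane assumption. The patterns are heuristics chosen for illustration and `find_warp32_assumptions` is a hypothetical helper. A hit means the line deserves review, not that it is wrong on DCU.

```python
import re

# Heuristic patterns only; each one marks a line for manual review.
WARP32_PATTERNS = {
    "full 32-lane mask literal": re.compile(r"0x[fF]{8}(?![0-9a-fA-F])"),
    "CUDA *_sync warp intrinsic": re.compile(
        r"__(?:shfl(?:_up|_down|_xor)?|ballot|any|all)_sync\b"
    ),
    "index arithmetic assuming 32 lanes": re.compile(
        r"[%/]\s*32\b|&\s*31\b|>>\s*5\b"
    ),
}


def find_warp32_assumptions(source: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in WARP32_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


# Hand-written sample, not taken from any bundled PR.
sample = """\
int lane = threadIdx.x % 32;
float v = __shfl_down_sync(0xffffffff, x, 16);
int tile = blockIdx.x;
"""

for lineno, label in find_warp32_assumptions(sample):
    print(f"line {lineno}: {label}")
```

Text matching cannot tell whether a constant `32` really refers to the warp size, so treat the output as a reading list rather than a port plan.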
......@@ -143,6 +160,8 @@ moe w8a8 gfx936 config benchmark
fuse silu quant op binding csrc
sourcefind-lightop gfx938 fused op test
flash-attention-cutlass dcu attention benchmark
hygon dcu mmac wave64 lds waitcnt
amd knowledge base dcu protected source
```
For DCU profiling:
......
......@@ -77,6 +77,27 @@ sm90:
- H800
- "SM90"
dcu:
- Hygon
- hygon
- DCU
- "Hygon DCU"
wave64:
- wavefront64
- "wavefront size 64"
- wavefront
mmac:
- MMOP
- "matrix core"
- "v_mmac"
- HCU
ck-tile:
- "CK Tile"
- Composable Kernel
# Kernel types
moe:
- MoE
......
......@@ -506,3 +506,46 @@ frameworks:
start_date: '2024-01-01'
cutoff_date: '2026-05-19'
scan_mode: git-merge-request-refs
- id: hygon-hip-kernel-optimizer
name: Hygon HIP Kernel Optimizer Skill
repo: yuguo-Jack/cuda-optimized-skill
url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
kernel_paths:
- skills/hygon-hip-kernel-optimizer/SKILL.md
- skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
- skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
- skills/hygon-hip-kernel-optimizer/references/method_registry.json
- skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
- skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
tags:
- dcu
- hygon
- hip
- rocm
- ck-tile
- mmac
- wave64
- hipprof
- dccobjdump
- sqtt
start_date: '2026-05-20'
cutoff_date: '2026-05-20'
scan_mode: source-reference
- id: dcu-toolkit-amd-knowledge-base
name: DCU Toolkit AMD Knowledge Base
repo: dcutoolkit/deeplearing/geak_dcu
url: http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
kernel_paths:
- mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
tags:
- dcu
- amd
- rocm
- hip
- geak-dcu
- amd-knowledge-base
- protected-source
start_date: '2026-05-20'
cutoff_date: '2026-05-20'
scan_mode: protected-source-reference
access: protected-login-required
{
"schema_version": 1,
"description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
"description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
"frameworks": [
{
"id": "nvidia-code-samples",
......@@ -124,6 +124,32 @@
"blackwell",
"hopper"
]
},
{
"id": "hygon-hip-kernel-optimizer",
"name": "Hygon HIP Kernel Optimizer Skill",
"repo": "yuguo-Jack/cuda-optimized-skill",
"url": "https://github.com/yuguo-Jack/cuda-optimized-skill",
"kernel_paths": [
"skills/hygon-hip-kernel-optimizer/SKILL.md",
"skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md",
"skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md",
"skills/hygon-hip-kernel-optimizer/references/method_registry.json",
"skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json",
"skills/hygon-hip-kernel-optimizer/examples/walkthrough.md"
],
"tags": [
"dcu",
"hygon",
"hip",
"rocm",
"ck-tile",
"mmac",
"wave64",
"hipprof",
"dccobjdump",
"sqtt"
]
}
],
"topics": [
......@@ -134,7 +160,8 @@
"simveit-effective-transpose",
"simveit-load-and-store",
"colfax-article-src",
"colfax-cutlass-kernels"
"colfax-cutlass-kernels",
"hygon-hip-kernel-optimizer"
],
"tags": [
"attention",
......@@ -153,7 +180,8 @@
"simveit-effective-transpose",
"simveit-load-and-store",
"colfax-article-src",
"colfax-cutlass-kernels"
"colfax-cutlass-kernels",
"hygon-hip-kernel-optimizer"
],
"tags": [
"gemm",
......@@ -171,7 +199,8 @@
"simveit-effective-transpose",
"simveit-load-and-store",
"colfax-article-src",
"colfax-cutlass-kernels"
"colfax-cutlass-kernels",
"hygon-hip-kernel-optimizer"
],
"tags": [
"moe",
......@@ -187,7 +216,8 @@
"name": "RMSNorm / LayerNorm fused norms",
"applies_to": [
"nvidia-code-samples",
"simveit-effective-transpose"
"simveit-effective-transpose",
"hygon-hip-kernel-optimizer"
],
"tags": [
"rmsnorm",
......@@ -201,7 +231,8 @@
"id": "activation-fusion",
"name": "Activation / element-wise fusion",
"applies_to": [
"simveit-effective-transpose"
"simveit-effective-transpose",
"hygon-hip-kernel-optimizer"
],
"tags": [
"silu",
......@@ -218,7 +249,8 @@
"simveit-effective-transpose",
"simveit-load-and-store",
"colfax-article-src",
"colfax-cutlass-kernels"
"colfax-cutlass-kernels",
"hygon-hip-kernel-optimizer"
],
"tags": [
"fp8",
......
......@@ -47,7 +47,7 @@ def parse_markdown(path: Path) -> Page:
def iter_pages() -> list[Page]:
root = knowledge_root()
bases = [root / "sources" / "prs"]
bases = [root / "sources" / "prs", root / "sources" / "refs"]
pages: list[Page] = []
for base in bases:
if not base.exists():
......
......@@ -20,7 +20,7 @@ def render_page(path: Path, *, frontmatter_only: bool, body_only: bool) -> str:
def main() -> int:
parser = argparse.ArgumentParser(description="Fetch a KernelPilot PR evidence page by id or path")
parser = argparse.ArgumentParser(description="Fetch a KernelPilot evidence page by id or path")
parser.add_argument("page")
parser.add_argument("--follow-sources", action="store_true", help="Deprecated no-op kept for compatibility; PR pages are the source.")
parser.add_argument("--frontmatter-only", action="store_true")
......
......@@ -53,7 +53,7 @@ def score_page(text: str, terms: list[str]) -> int:
def main() -> int:
parser = argparse.ArgumentParser(description="Search KernelPilot PR diff evidence pages")
parser = argparse.ArgumentParser(description="Search KernelPilot evidence and source-reference pages")
parser.add_argument("query", nargs="*", help="keyword query")
parser.add_argument("--type", dest="type_filter")
parser.add_argument("--tag")
......@@ -105,7 +105,8 @@ def main() -> int:
for score, page in rows:
tags = ",".join(as_list(page.meta.get("tags"))[:5])
artifact_dir = page.meta.get("artifact_dir", "-")
print(f"{page.id or '-'} | PR | score={score} | {page.relpath} | {artifact_dir} | {tags} | {page.title}")
kind = "PR" if page.relpath.startswith("sources/prs/") else page.kind
print(f"{page.id or '-'} | {kind} | score={score} | {page.relpath} | {artifact_dir} | {tags} | {page.title}")
return 0
print("| ID | Score | Path | Evidence Bundle | Title | Tags |")
print("| --- | ---: | --- | --- | --- | --- |")
......
......@@ -61,9 +61,14 @@ def main() -> int:
else:
errors.append(f"{page.relpath}: missing id")
source_refs = 0
for page in pages:
if not page.relpath.startswith("sources/prs/"):
errors.append(f"{page.relpath}: non-PR page indexed")
if page.relpath.startswith("sources/prs/"):
continue
if page.relpath.startswith("sources/refs/"):
source_refs += 1
continue
errors.append(f"{page.relpath}: unsupported indexed page location")
source_prs = 0
complete_source_pr_bundles = 0
......@@ -119,6 +124,7 @@ def main() -> int:
"pull_bundles": len(list((root / PULL_BUNDLE_ROOT).glob("*/*"))),
"source_prs": source_prs,
"complete_source_pr_bundles": complete_source_pr_bundles,
"source_refs": source_refs,
"candidate_prs": candidate_prs,
"candidate_ledgers": len(ledgers),
"index_repos": index_repos,
......
---
id: ref-dcu-toolkit-amd-knowledge-base
repo: dcutoolkit/deeplearing/geak_dcu
title: DCU Toolkit AMD Knowledge Base
url: http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
source_type: source-reference
source_category: upstream-knowledge
access: protected-login-required
architectures:
- dcu
- amd
- rocm
tags:
- dcu
- amd
- rocm
- hip
- geak-dcu
- amd-knowledge-base
- protected-source
techniques:
- dcu-reference-search
- source-backed-porting
- protected-upstream-sync
hardware_features:
- wavefront
- lds
- mfma
- mmac
- profiler
kernel_types:
- gemm
- attention
- moe
- normalization
- reduction
languages:
- markdown
- hip
- cpp
- python
captured_at: '2026-05-20'
access_probe:
tree_page: anonymous request returned HTTP 302 to /users/sign_in
api_tree: anonymous request returned HTTP 404 Project Not Found
---
# DCU Toolkit AMD Knowledge Base
- Repository: `dcutoolkit/deeplearing/geak_dcu`
- Knowledge path: `mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base`
- Source URL: [amd-knowledge-base](http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base)
- Access status on 2026-05-20: anonymous page access redirected to GitLab sign-in; anonymous API tree access returned project-not-found.
## Route Fit
Use this as a Route B pointer for AMD/DCU upstream knowledge once authenticated access or a local export is available. The content is not mirrored here because anonymous access is protected, so do not cite it as direct evidence until the actual files have been synced or pasted into the workspace.
## How To Use
- Search this entry when a DCU optimization question needs AMD/ROCm/HIP knowledge beyond local LightOp and bundled MR evidence.
- If credentials or an exported copy become available, import the actual markdown/source files under `sources/refs/` or an evidence bundle and record commit/version plus license/notice details.
- Until the files are imported, cite this page only as a pending protected source, not as implementation proof.
## Query Hooks
```bash
python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact
python3 scripts/get_page.py ref-dcu-toolkit-amd-knowledge-base
```
---
id: ref-hygon-hip-kernel-optimizer
repo: yuguo-Jack/cuda-optimized-skill
title: Hygon HIP Kernel Iterative Optimizer
url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
source_type: source-reference
source_category: upstream-knowledge
architectures:
- dcu
- gfx936
- gfx938
tags:
- dcu
- hygon
- hip
- rocm
- ck-tile
- hipprof
- dccobjdump
- sqtt
techniques:
- roofline-axis-budget
- branch-and-select
- ablation-attribution
- dcu-isa-verification
- source-backed-builtin-probes
- wave64-porting
- ck-tile-first
hardware_features:
- wave64
- lds
- mmac
- matrix-load
- ds-read-matrix
- waitcnt
- vgpr
- sgpr
kernel_types:
- gemm
- attention
- moe
- normalization
- convolution
- reduction
- activation
languages:
- markdown
- python
- hip
captured_at: '2026-05-20'
commit: c069290452aee67baa709f55d767358ab4171e69
license: MIT
source_paths:
- skills/hygon-hip-kernel-optimizer/SKILL.md
- skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
- skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
- skills/hygon-hip-kernel-optimizer/references/method_registry.json
- skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
- skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
---
# Hygon HIP Kernel Iterative Optimizer
- Repository: `yuguo-Jack/cuda-optimized-skill`
- Source: [skills/hygon-hip-kernel-optimizer](https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer)
- Captured commit: `c069290452aee67baa709f55d767358ab4171e69`
- License: `MIT`
## Route Fit
Use this as Route B DCU/Hygon optimization knowledge before falling back to CUDA-only PR evidence. It is a workflow and method catalog, not LightOp source code, so any borrowed idea still needs LightOp integration, target compilation, benchmark correctness, `hipprof` evidence, and `dccobjdump` ISA verification.
## Core Operating Rules
- Treat DCU wavefront size as 64 and rewrite CUDA warp32 assumptions before reuse.
- Prefer CK Tile or HCU examples for GEMM, convolution, attention, MoE, and norm template work; do not port CUTLASS paths by name.
- Use benchmark timing, `hipprof` PMC/read/write passes, optional SQTT, code-object resource analysis, and final `dccobjdump` together. Final ISA is the proof for Hygon behavior.
- Keep a branch-and-select loop: profile the current best, choose methods from measured bottlenecks, generate several variants, benchmark all correct branches, then keep only changes with positive attribution.
- Treat source-backed HCU or AMD-named builtins as candidates only when the exact call signature and target guard are visible in DCU source, an existing project, or a minimal compile probe.
- Do not invent builtin names from AMD docs or mnemonic guesses. Compile the exact candidate and inspect generated Hygon ISA.
- Avoid FP4 strategies for current Hygon DCU targets unless newer hardware/toolchain evidence appears.
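
The branch-and-select rule above can be pictured with a minimal selection step. This sketch assumes each variant has already been compiled, checked for correctness, and timed. The `Variant` type and `select_best` function are hypothetical, and "positive attribution" is simplified here to "faster than the current best", which is weaker than a real ablation.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    correct: bool   # passed the correctness check against a reference
    time_ms: float  # measured benchmark time, lower is better


def select_best(current_best: Variant, candidates: list[Variant]) -> Variant:
    """Keep the fastest correct candidate only if it beats the current best."""
    correct = [c for c in candidates if c.correct]
    if not correct:
        return current_best
    fastest = min(correct, key=lambda c: c.time_ms)
    return fastest if fastest.time_ms < current_best.time_ms else current_best


# Illustrative numbers only; they are not measurements.
baseline = Variant("baseline", correct=True, time_ms=1.00)
branches = [
    Variant("lds-tiling", correct=True, time_ms=0.90),
    Variant("unsafe-fast-math", correct=False, time_ms=0.50),
    Variant("unroll-x8", correct=True, time_ms=1.10),
]
print(select_best(baseline, branches).name)
```

Incorrect branches are discarded before timing is compared, so a fast but wrong variant can never become the new baseline.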
## Method Families
- Compute: HCU MMOP/MMAC matrix core, FP8/BF8/TF32 low precision on supported targets, wave64 launch geometry, thread coarsening, register-pressure control, fast math, and inline asm/builtin escape hatches.
- Memory: coalesced and vectorized access, layout transforms, LDS tiling, global-to-LDS paths, CK Tile MLS/TLS/WASP loaders, DS matrix-read layouts, LDS bank-conflict reduction, cache policy, CK Tile named pipelines, and epilogue fusion.
- Latency: waitcnt-aware pipelining, barrier reduction, wavefront shuffle or DS permute exchange, ILP/unroll, persistent scheduling, split-K/stream-K, SALU/VALU phase balance, and SQTT stall triage.
- Operator shortcuts: map GEMM/MoE/attention/conv/norm/reduction questions to CK Tile/HCU examples first, then tune geometry, layout, pipeline, split, persistent, quant, and epilogue choices.
## Profiling Signals
- Low matrix-core instruction evidence on GEMM/conv/attention/MoE work points to MMAC or mixed-precision routes.
- High global/cache request pressure or scalar contiguous memory ops point to coalescing, vectorized loads/stores, layout work, or LDS staging.
- High VGPR/SGPR/LDS pressure from code-object analysis should feed launch geometry, register control, and occupancy decisions.
- Many close-to-load `s_waitcnt` instructions or SQTT wait stalls point to wait placement, software pipelining, and ILP.
- PMC gaps or contradictory results should trigger SQTT/stat-stall or source/ISA inspection instead of blind tuning.
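
The signal-to-route mapping above can also be kept as a lookup table. The sketch below restates the list as data. The signal keys are hypothetical labels invented for this example, and deciding when a signal counts as "high" or "low" is left to the profiling step.

```python
# Hypothetical signal labels mapped to the routes named in the list above.
SIGNAL_TO_ROUTES = {
    "low_matrix_core_evidence": ["MMAC", "mixed precision"],
    "high_global_request_pressure": [
        "coalescing",
        "vectorized loads/stores",
        "layout work",
        "LDS staging",
    ],
    "high_register_or_lds_pressure": [
        "launch geometry",
        "register control",
        "occupancy decisions",
    ],
    "waitcnt_stalls": ["wait placement", "software pipelining", "ILP"],
    "contradictory_pmc": ["SQTT/stat-stall inspection", "source/ISA inspection"],
}


def suggest_routes(signals: list[str]) -> list[str]:
    """Collect routes for the observed signals, in order, without duplicates."""
    routes: list[str] = []
    for signal in signals:
        for route in SIGNAL_TO_ROUTES.get(signal, []):
            if route not in routes:
                routes.append(route)
    return routes


print(suggest_routes(["waitcnt_stalls", "low_matrix_core_evidence"]))
```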
## ISA Evidence To Search
Useful Hygon/DCU evidence strings include `v_mmac`, `MMOP`, `fp8`, `bf8`, `tf32`, `global_load_dwordx2`, `global_load_dwordx4`, `buffer_load`, `lds`, `matrix_load`, `ds_read_m32x16_b16`, `ds_read_m32x16_b16_alt`, `ds_read_m32x32_b8`, `ds_read_m32x64_b4`, `ds_read_m32x8_b32`, `ds_bpermute`, `ds_permute`, `s_waitcnt vmcnt(0)`, `s_waitcnt lgkmcnt(0)`, and `s_barrier`.
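
One way to use these strings is to count how many disassembly lines mention each of them. The sketch below assumes the disassembly has already been saved as plain text. `count_isa_evidence` is a hypothetical helper, and the sample listing is hand-written placeholder text, not real `dccobjdump` output.

```python
from collections import Counter

# A subset of the evidence strings listed above.
EVIDENCE_STRINGS = [
    "v_mmac",
    "global_load_dwordx4",
    "ds_read_m32x16_b16",
    "s_waitcnt",
    "s_barrier",
]


def count_isa_evidence(disassembly: str, needles=EVIDENCE_STRINGS) -> Counter:
    """Count the lines that contain each evidence string (substring match)."""
    counts: Counter = Counter()
    for line in disassembly.splitlines():
        for needle in needles:
            if needle in line:
                counts[needle] += 1
    return counts


# Placeholder listing written for this example; operands are elided on purpose.
listing = """\
global_load_dwordx4 <operands elided>
s_waitcnt vmcnt(0)
v_mmac<suffix elided> <operands elided>
s_waitcnt lgkmcnt(0)
s_barrier
"""

for needle, hits in count_isa_evidence(listing).items():
    print(f"{needle}: {hits}")
```

Substring matching means `ds_read_m32x16_b16` also counts `ds_read_m32x16_b16_alt` lines. A zero count for `v_mmac` on a GEMM kernel is a prompt to inspect the ISA by hand, not a verdict.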
## Query Hooks
```bash
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
python3 scripts/query.py "hipprof sqtt waitcnt" --type source-reference --compact
python3 scripts/get_page.py ref-hygon-hip-kernel-optimizer
```