add dcu knowledge

Commit 91f343be authored May 20, 2026 by whlwhlwhl
12 changed files
--- a/knowledge/CLAUDE.md
+++ b/knowledge/CLAUDE.md
@@ -19,7 +19,9 @@ Useful local commands:

 ```bash
 python3 scripts/query.py "<keywords>" --compact
+python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
+python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact
 python3 scripts/search-pr-diffs.py <term1> <term2> [--any]
-python3 scripts/get_page.py <pr-page-id>
+python3 scripts/get_page.py <page-id>
 python3 scripts/validate.py
 ```
--- a/knowledge/README.md
+++ b/knowledge/README.md
@@ -7,14 +7,16 @@ work, use evidence in this order:
   benchmarks.
 2. ROCm/DCU official docs and upstream source: SourceFind DCU/DTK docs,
   ROCm/HIP, MIOpen, rocBLAS, hipBLASLt, Composable Kernel, Triton AMD,
-   PyTorch ROCm, SGLang/vLLM AMD paths, plus bundled MR evidence from
-   SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
+   PyTorch ROCm, SGLang/vLLM AMD paths, the Hygon HIP optimizer reference,
+   protected DCU Toolkit AMD knowledge-base pointer, plus bundled MR evidence
+   from SourceFind LightOp and DCU Toolkit flash-attention-cutlass.
 3. The bundled CUDA-oriented PR corpus, only as cross-platform inspiration
   after translating and validating the idea on DCU.

 The existing corpus is still available:

 - PR pages under `sources/prs/`
+- Source reference pages under `sources/refs/`
 - PR evidence bundles under `evidence/pull-bundles/`
 - Candidate ledgers under `candidates/`
 - Metadata under `data/`
@@ -26,14 +28,21 @@ Run local PR queries from this directory:
 python3 scripts/query.py "<keywords>" --compact --limit 30
 python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --compact --limit 20
 python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
+python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
+python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
 python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
-python3 scripts/get_page.py <pr-page-id>
+python3 scripts/get_page.py <page-id>
 ```

 When CUDA evidence shapes a LightOp implementation, record that it is
 inspiration, cite the source path/URL and commit/version, and explain the DCU
 translation: HIP API, CU/wavefront/LDS/MFMA/resource/config differences.

+Protected source pointers, such as the DCU Toolkit AMD knowledge base, are
+index entries only until authenticated content is synced. Do not cite them as
+direct implementation evidence without the imported file path, commit/version,
+and license/notice details.
+
 Validate the bundled corpus with:

 ```bash

--- a/knowledge/SKILL.md
+++ b/knowledge/SKILL.md
@@ -65,6 +65,13 @@ Useful source families:
 - Bundled MR evidence from the DCU Toolkit flash-attention-cutlass repository:
  `sources/prs/flash-attention-cutlass/` with artifacts under
  `evidence/pull-bundles/flash-attention-cutlass/`.
+- Source reference pages under `sources/refs/`, especially:
+  - `ref-hygon-hip-kernel-optimizer` for Hygon/DCU HIP optimization workflow,
+    method taxonomy, profiling signals, wave64 rules, CK Tile-first guidance,
+    source-backed builtin discipline, and `dccobjdump` ISA verification.
+  - `ref-dcu-toolkit-amd-knowledge-base` for the protected DCU Toolkit
+    `amd-knowledge-base` pointer. Use it as a pending source until the
+    authenticated files are synced locally.
 - SourceFind DCU/DTK docs, especially the performance analysis tool guide:
  `https://developer.sourcefind.cn/gitbook//dcu_developer/DeveloperGuide/dcu_programming/DCU_programming_chapter3_7.html`
 - ROCm/HIP docs and profiler docs.
@@ -92,8 +99,10 @@ Run from the knowledge root:
 python3 scripts/query.py "<operator> <dtype> <symptom>" --compact --limit 30
 python3 scripts/query.py "lightop dcu <operator>" --repo sourcefind-lightop --compact --limit 20
 python3 scripts/query.py "flash attention dcu" --repo flash-attention-cutlass --compact --limit 20
+python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact --limit 20
+python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact --limit 20
 python3 scripts/search-pr-diffs.py <term1> <term2> [--any] [--limit 100]
-python3 scripts/get_page.py <pr-page-id>
+python3 scripts/get_page.py <page-id>
 ```

 Before borrowing an idea, open the bundle named by `artifact_dir`:
@@ -116,6 +125,14 @@ Translation rules:
 - Map CUDA concepts to ROCm concepts explicitly: SM to CU, warp to wavefront
  where appropriate, shared memory to LDS, tensor-core paths to MFMA or the
  selected ROCm library/backend.
+- For Hygon/DCU-specific claims, prefer target-compiled proof: `hipprof`,
+  code-object resource analysis, optional SQTT, and final `dccobjdump` output.
+  Treat AMD MFMA and CUDA tensor-core names as analogies until the Hygon ISA
+  shows `v_mmac`, MMOP, matrix-load, DS matrix-read, direct-to-LDS, or other
+  expected DCU patterns.
+- Use source-backed HCU or AMD-named builtins only when the exact signature is
+  present in a DCU source page, existing project source, or a compile probe.
+  Do not invent builtin names from mnemonics or generic AMD documents.
 - Do not cite CUDA profiler metrics as evidence for a DCU bottleneck.
 - Record the CUDA source as inspiration, not as a DCU implementation proof.

@@ -143,6 +160,8 @@ moe w8a8 gfx936 config benchmark
 fuse silu quant op binding csrc
 sourcefind-lightop gfx938 fused op test
 flash-attention-cutlass dcu attention benchmark
+hygon dcu mmac wave64 lds waitcnt
+amd knowledge base dcu protected source
 ```

 For DCU profiling:

--- a/knowledge/data/aliases.yaml
+++ b/knowledge/data/aliases.yaml
@@ -77,6 +77,27 @@ sm90:
  - H800
  - "SM90"

+dcu:
+  - Hygon
+  - hygon
+  - DCU
+  - "Hygon DCU"
+
+wave64:
+  - wavefront64
+  - "wavefront size 64"
+  - wavefront
+
+mmac:
+  - MMOP
+  - "matrix core"
+  - "v_mmac"
+  - HCU
+
+ck-tile:
+  - "CK Tile"
+  - Composable Kernel
+
 # Kernel types
 moe:
  - MoE

--- a/knowledge/data/repo-catalog.yaml
+++ b/knowledge/data/repo-catalog.yaml
@@ -506,3 +506,46 @@ frameworks:
  start_date: '2024-01-01'
  cutoff_date: '2026-05-19'
  scan_mode: git-merge-request-refs
+- id: hygon-hip-kernel-optimizer
+  name: Hygon HIP Kernel Optimizer Skill
+  repo: yuguo-Jack/cuda-optimized-skill
+  url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
+  kernel_paths:
+  - skills/hygon-hip-kernel-optimizer/SKILL.md
+  - skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
+  - skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
+  - skills/hygon-hip-kernel-optimizer/references/method_registry.json
+  - skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
+  - skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
+  tags:
+  - dcu
+  - hygon
+  - hip
+  - rocm
+  - ck-tile
+  - mmac
+  - wave64
+  - hipprof
+  - dccobjdump
+  - sqtt
+  start_date: '2026-05-20'
+  cutoff_date: '2026-05-20'
+  scan_mode: source-reference
+- id: dcu-toolkit-amd-knowledge-base
+  name: DCU Toolkit AMD Knowledge Base
+  repo: dcutoolkit/deeplearing/geak_dcu
+  url: http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
+  kernel_paths:
+  - mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
+  tags:
+  - dcu
+  - amd
+  - rocm
+  - hip
+  - geak-dcu
+  - amd-knowledge-base
+  - protected-source
+  start_date: '2026-05-20'
+  cutoff_date: '2026-05-20'
+  scan_mode: protected-source-reference
+  access: protected-login-required
--- a/knowledge/index.json
+++ b/knowledge/index.json
 {
  "schema_version": 1,
-  "description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
+  "description": "Complementary kernel knowledge map for Humanize-driven GPU kernel optimization. Lists code and knowledge repositories that have no curated PR diffs in the local Route A corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials, Hygon/DCU optimization references); each framework entry points to upstream repos, kernel directories, and source guides. Topic entries map kernel topics to per-framework references for live clone/grep workflows. Frameworks already covered by Route A PR bundles (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS/CuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels) are intentionally excluded.",
  "frameworks": [
    {
      "id": "nvidia-code-samples",
@@ -124,6 +124,32 @@
        "blackwell",
        "hopper"
      ]
+    },
+    {
+      "id": "hygon-hip-kernel-optimizer",
+      "name": "Hygon HIP Kernel Optimizer Skill",
+      "repo": "yuguo-Jack/cuda-optimized-skill",
+      "url": "https://github.com/yuguo-Jack/cuda-optimized-skill",
+      "kernel_paths": [
+        "skills/hygon-hip-kernel-optimizer/SKILL.md",
+        "skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md",
+        "skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md",
+        "skills/hygon-hip-kernel-optimizer/references/method_registry.json",
+        "skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json",
+        "skills/hygon-hip-kernel-optimizer/examples/walkthrough.md"
+      ],
+      "tags": [
+        "dcu",
+        "hygon",
+        "hip",
+        "rocm",
+        "ck-tile",
+        "mmac",
+        "wave64",
+        "hipprof",
+        "dccobjdump",
+        "sqtt"
+      ]
    }
  ],
  "topics": [
@@ -134,7 +160,8 @@
        "simveit-effective-transpose",
        "simveit-load-and-store",
        "colfax-article-src",
-        "colfax-cutlass-kernels"
+        "colfax-cutlass-kernels",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "attention",
@@ -153,7 +180,8 @@
        "simveit-effective-transpose",
        "simveit-load-and-store",
        "colfax-article-src",
-        "colfax-cutlass-kernels"
+        "colfax-cutlass-kernels",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "gemm",
@@ -171,7 +199,8 @@
        "simveit-effective-transpose",
        "simveit-load-and-store",
        "colfax-article-src",
-        "colfax-cutlass-kernels"
+        "colfax-cutlass-kernels",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "moe",
@@ -187,7 +216,8 @@
      "name": "RMSNorm / LayerNorm fused norms",
      "applies_to": [
        "nvidia-code-samples",
-        "simveit-effective-transpose"
+        "simveit-effective-transpose",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "rmsnorm",
@@ -201,7 +231,8 @@
      "id": "activation-fusion",
      "name": "Activation / element-wise fusion",
      "applies_to": [
-        "simveit-effective-transpose"
+        "simveit-effective-transpose",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "silu",
@@ -218,7 +249,8 @@
        "simveit-effective-transpose",
        "simveit-load-and-store",
        "colfax-article-src",
-        "colfax-cutlass-kernels"
+        "colfax-cutlass-kernels",
+        "hygon-hip-kernel-optimizer"
      ],
      "tags": [
        "fp8",

--- a/knowledge/scripts/_kb.py
+++ b/knowledge/scripts/_kb.py
@@ -47,7 +47,7 @@ def parse_markdown(path: Path) -> Page:

 def iter_pages() -> list[Page]:
    root = knowledge_root()
-    bases = [root / "sources" / "prs"]
+    bases = [root / "sources" / "prs", root / "sources" / "refs"]
    pages: list[Page] = []
    for base in bases:
        if not base.exists():
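
Review note: the hunk above widens the page scan from one base directory to two. A minimal sketch of that behaviour follows, assuming a missing base is skipped and pages are `*.md` files found recursively. Both are assumptions, since the rest of `iter_pages` is not shown in this diff, and `iter_markdown_paths` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def iter_markdown_paths(root: Path) -> list[Path]:
    """Collect markdown files under both indexed bases (hypothetical helper)."""
    bases = [root / "sources" / "prs", root / "sources" / "refs"]
    paths: list[Path] = []
    for base in bases:
        if not base.exists():
            continue  # assumption: a missing base directory is skipped
        paths.extend(sorted(base.rglob("*.md")))
    return paths


# Build a throwaway tree that has only sources/refs/ to show the skip path.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    refs = root / "sources" / "refs"
    refs.mkdir(parents=True)
    (refs / "hygon-hip-kernel-optimizer.md").write_text("# placeholder\n")
    found = [p.relative_to(root).as_posix() for p in iter_markdown_paths(root)]

print(found)  # ['sources/refs/hygon-hip-kernel-optimizer.md']
```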

--- a/knowledge/scripts/get_page.py
+++ b/knowledge/scripts/get_page.py
@@ -20,7 +20,7 @@ def render_page(path: Path, *, frontmatter_only: bool, body_only: bool) -> str:


 def main() -> int:
-    parser = argparse.ArgumentParser(description="Fetch a KernelPilot PR evidence page by id or path")
+    parser = argparse.ArgumentParser(description="Fetch a KernelPilot evidence page by id or path")
    parser.add_argument("page")
    parser.add_argument("--follow-sources", action="store_true", help="Deprecated no-op kept for compatibility; PR pages are the source.")
    parser.add_argument("--frontmatter-only", action="store_true")

--- a/knowledge/scripts/query.py
+++ b/knowledge/scripts/query.py
@@ -53,7 +53,7 @@ def score_page(text: str, terms: list[str]) -> int:


 def main() -> int:
-    parser = argparse.ArgumentParser(description="Search KernelPilot PR diff evidence pages")
+    parser = argparse.ArgumentParser(description="Search KernelPilot evidence and source-reference pages")
    parser.add_argument("query", nargs="*", help="keyword query")
    parser.add_argument("--type", dest="type_filter")
    parser.add_argument("--tag")
@@ -105,7 +105,8 @@ def main() -> int:
        for score, page in rows:
            tags = ",".join(as_list(page.meta.get("tags"))[:5])
            artifact_dir = page.meta.get("artifact_dir", "-")
-            print(f"{page.id or '-'} | PR | score={score} | {page.relpath} | {artifact_dir} | {tags} | {page.title}")
+            kind = "PR" if page.relpath.startswith("sources/prs/") else page.kind
+            print(f"{page.id or '-'} | {kind} | score={score} | {page.relpath} | {artifact_dir} | {tags} | {page.title}")
        return 0
    print("| ID | Score | Path | Evidence Bundle | Title | Tags |")
    print("| --- | ---: | --- | --- | --- | --- |")
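
Review note: the compact output no longer hard-codes `PR` in the second column. The sketch below isolates that rule. The `Page` dataclass is a hypothetical stand-in for the object in `_kb.py`, the example paths under `sources/prs/` are invented, and the `kind` values are assumed.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical stand-in; the real Page in scripts/_kb.py has more fields.
    relpath: str
    kind: str


def display_kind(page: Page) -> str:
    """Label pages under sources/prs/ as PR; otherwise use the page's own kind."""
    return "PR" if page.relpath.startswith("sources/prs/") else page.kind


pr_page = Page("sources/prs/example-repo/example-pr.md", "pr")  # invented path
ref_page = Page("sources/refs/hygon-hip-kernel-optimizer.md", "source-reference")

print(display_kind(pr_page))   # PR
print(display_kind(ref_page))  # source-reference
```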

--- a/knowledge/scripts/validate.py
+++ b/knowledge/scripts/validate.py
@@ -61,9 +61,14 @@ def main() -> int:
        else:
            errors.append(f"{page.relpath}: missing id")

+    source_refs = 0
    for page in pages:
-        if not page.relpath.startswith("sources/prs/"):
-            errors.append(f"{page.relpath}: non-PR page indexed")
+        if page.relpath.startswith("sources/prs/"):
+            continue
+        if page.relpath.startswith("sources/refs/"):
+            source_refs += 1
+            continue
+        errors.append(f"{page.relpath}: unsupported indexed page location")

    source_prs = 0
    complete_source_pr_bundles = 0
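
Review note: the loop above now accepts two page locations and reports anything else. A self-contained sketch of the same classification follows, assuming pages are represented by their relative paths only. `classify_pages` is a hypothetical name, and the PR path and `notes/scratch.md` are invented for illustration.

```python
def classify_pages(relpaths: list[str]) -> tuple[int, list[str]]:
    """Count source-reference pages and collect errors for unsupported locations."""
    source_refs = 0
    errors: list[str] = []
    for relpath in relpaths:
        if relpath.startswith("sources/prs/"):
            continue  # PR pages are accepted here and handled elsewhere
        if relpath.startswith("sources/refs/"):
            source_refs += 1
            continue
        errors.append(f"{relpath}: unsupported indexed page location")
    return source_refs, errors


source_refs, errors = classify_pages([
    "sources/prs/example-repo/example-pr.md",  # invented PR page path
    "sources/refs/hygon-hip-kernel-optimizer.md",
    "sources/refs/dcu-toolkit-amd-knowledge-base.md",
    "notes/scratch.md",  # invented path outside both bases
])

print(source_refs)  # 2
print(errors)       # ['notes/scratch.md: unsupported indexed page location']
```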
@@ -119,6 +124,7 @@ def main() -> int:
        "pull_bundles": len(list((root / PULL_BUNDLE_ROOT).glob("*/*"))),
        "source_prs": source_prs,
        "complete_source_pr_bundles": complete_source_pr_bundles,
+        "source_refs": source_refs,
        "candidate_prs": candidate_prs,
        "candidate_ledgers": len(ledgers),
        "index_repos": index_repos,

--- a/knowledge/sources/refs/dcu-toolkit-amd-knowledge-base.md
+++ b/knowledge/sources/refs/dcu-toolkit-amd-knowledge-base.md
+---
+id: ref-dcu-toolkit-amd-knowledge-base
+repo: dcutoolkit/deeplearing/geak_dcu
+title: DCU Toolkit AMD Knowledge Base
+url: http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base
+source_type: source-reference
+source_category: upstream-knowledge
+access: protected-login-required
+architectures:
+- dcu
+- amd
+- rocm
+tags:
+- dcu
+- amd
+- rocm
+- hip
+- geak-dcu
+- amd-knowledge-base
+- protected-source
+techniques:
+- dcu-reference-search
+- source-backed-porting
+- protected-upstream-sync
+hardware_features:
+- wavefront
+- lds
+- mfma
+- mmac
+- profiler
+kernel_types:
+- gemm
+- attention
+- moe
+- normalization
+- reduction
+languages:
+- markdown
+- hip
+- cpp
+- python
+captured_at: '2026-05-20'
+access_probe:
+  tree_page: anonymous request returned HTTP 302 to /users/sign_in
+  api_tree: anonymous request returned HTTP 404 Project Not Found
+---
+# DCU Toolkit AMD Knowledge Base
+
+- Repository: `dcutoolkit/deeplearing/geak_dcu`
+- Knowledge path: `mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base`
+- Source URL: [amd-knowledge-base](http://42.228.13.241:10068/dcutoolkit/deeplearing/geak_dcu/-/tree/main/mcp_tools/rag-mcp/knowledge-base/amd-knowledge-base)
+- Access status on 2026-05-20: anonymous page access redirected to GitLab sign-in; anonymous API tree access returned project-not-found.
+
+## Route Fit
+
+Use this as a Route B pointer for AMD/DCU upstream knowledge once authenticated access or a local export is available. The content is not mirrored here because the repository requires sign-in, so do not cite it as direct evidence until the actual files have been synced or pasted into the workspace.
+
+## How To Use
+
+- Search this entry when a DCU optimization question needs AMD/ROCm/HIP knowledge beyond local LightOp and bundled MR evidence.
+- If credentials or an exported copy become available, import the actual markdown/source files under `sources/refs/` or an evidence bundle and record commit/version plus license/notice details.
+- Until the files are imported, cite this page only as a pending protected source, not as implementation proof.
+
+## Query Hooks
+
+```bash
+python3 scripts/query.py "amd knowledge base dcu" --type source-reference --compact
+python3 scripts/get_page.py ref-dcu-toolkit-amd-knowledge-base
+```
--- a/knowledge/sources/refs/hygon-hip-kernel-optimizer.md
+++ b/knowledge/sources/refs/hygon-hip-kernel-optimizer.md
+---
+id: ref-hygon-hip-kernel-optimizer
+repo: yuguo-Jack/cuda-optimized-skill
+title: Hygon HIP Kernel Iterative Optimizer
+url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
+source_type: source-reference
+source_category: upstream-knowledge
+architectures:
+- dcu
+- gfx936
+- gfx938
+tags:
+- dcu
+- hygon
+- hip
+- rocm
+- ck-tile
+- hipprof
+- dccobjdump
+- sqtt
+techniques:
+- roofline-axis-budget
+- branch-and-select
+- ablation-attribution
+- dcu-isa-verification
+- source-backed-builtin-probes
+- wave64-porting
+- ck-tile-first
+hardware_features:
+- wave64
+- lds
+- mmac
+- matrix-load
+- ds-read-matrix
+- waitcnt
+- vgpr
+- sgpr
+kernel_types:
+- gemm
+- attention
+- moe
+- normalization
+- convolution
+- reduction
+- activation
+languages:
+- markdown
+- python
+- hip
+captured_at: '2026-05-20'
+commit: c069290452aee67baa709f55d767358ab4171e69
+license: MIT
+source_paths:
+- skills/hygon-hip-kernel-optimizer/SKILL.md
+- skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
+- skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
+- skills/hygon-hip-kernel-optimizer/references/method_registry.json
+- skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
+- skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
+---
+# Hygon HIP Kernel Iterative Optimizer
+
+- Repository: `yuguo-Jack/cuda-optimized-skill`
+- Source: [skills/hygon-hip-kernel-optimizer](https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer)
+- Captured commit: `c069290452aee67baa709f55d767358ab4171e69`
+- License: `MIT`
+
+## Route Fit
+
+Use this as Route B DCU/Hygon optimization knowledge before falling back to CUDA-only PR evidence. It is a workflow and method catalog, not LightOp source code, so any borrowed idea still needs LightOp integration, target compilation, benchmark correctness, `hipprof` evidence, and `dccobjdump` ISA verification.
+
+## Core Operating Rules
+
+- Treat DCU wavefront size as 64 and rewrite CUDA warp32 assumptions before reuse.
+- Prefer CK Tile or HCU examples for GEMM, convolution, attention, MoE, and norm template work; do not port CUTLASS paths by name.
+- Use benchmark timing, `hipprof` PMC/read/write passes, optional SQTT, code-object resource analysis, and final `dccobjdump` together. Final ISA is the proof for Hygon behavior.
+- Keep a branch-and-select loop: profile the current best, choose methods from measured bottlenecks, generate several variants, benchmark all correct branches, then keep only changes with positive attribution.
+- Treat source-backed HCU or AMD-named builtins as candidates only when the exact call signature and target guard are visible in DCU source, an existing project, or a minimal compile probe.
+- Do not invent builtin names from AMD docs or mnemonic guesses. Compile the exact candidate and inspect generated Hygon ISA.
+- Avoid FP4 strategies for current Hygon DCU targets unless newer hardware/toolchain evidence appears.
+
+## Method Families
+
+- Compute: HCU MMOP/MMAC matrix core, FP8/BF8/TF32 low precision on supported targets, wave64 launch geometry, thread coarsening, register-pressure control, fast math, and inline asm/builtin escape hatches.
+- Memory: coalesced and vectorized access, layout transforms, LDS tiling, global-to-LDS paths, CK Tile MLS/TLS/WASP loaders, DS matrix-read layouts, LDS bank-conflict reduction, cache policy, CK Tile named pipelines, and epilogue fusion.
+- Latency: waitcnt-aware pipelining, barrier reduction, wavefront shuffle or DS permute exchange, ILP/unroll, persistent scheduling, split-K/stream-K, SALU/VALU phase balance, and SQTT stall triage.
+- Operator shortcuts: map GEMM/MoE/attention/conv/norm/reduction questions to CK Tile/HCU examples first, then tune geometry, layout, pipeline, split, persistent, quant, and epilogue choices.
+
+## Profiling Signals
+
+- Low matrix-core instruction evidence on GEMM/conv/attention/MoE work points to MMAC or mixed-precision routes.
+- High global/cache request pressure or scalar contiguous memory ops point to coalescing, vectorized loads/stores, layout work, or LDS staging.
+- High VGPR/SGPR/LDS pressure from code-object analysis should feed launch geometry, register control, and occupancy decisions.
+- Many close-to-load `s_waitcnt` instructions or SQTT wait stalls point to wait placement, software pipelining, and ILP.
+- PMC gaps or contradictory results should trigger SQTT/stat-stall or source/ISA inspection instead of blind tuning.
+
+## ISA Evidence To Search
+
+Useful Hygon/DCU evidence strings include `v_mmac`, `MMOP`, `fp8`, `bf8`, `tf32`, `global_load_dwordx2`, `global_load_dwordx4`, `buffer_load`, `lds`, `matrix_load`, `ds_read_m32x16_b16`, `ds_read_m32x16_b16_alt`, `ds_read_m32x32_b8`, `ds_read_m32x64_b4`, `ds_read_m32x8_b32`, `ds_bpermute`, `ds_permute`, `s_waitcnt vmcnt(0)`, `s_waitcnt lgkmcnt(0)`, and `s_barrier`.
+
+## Query Hooks
+
+```bash
+python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
+python3 scripts/query.py "hipprof sqtt waitcnt" --type source-reference --compact
+python3 scripts/get_page.py ref-hygon-hip-kernel-optimizer
+```
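
Review note: the "ISA Evidence To Search" list in the new reference page lends itself to a simple text scan. The sketch below counts a subset of those strings in disassembly text. The regular expressions, category names, and `count_isa_evidence` are assumptions for illustration, and the sample text is a placeholder with operands elided, not real `dccobjdump` output.

```python
import re

# Subset of the evidence strings from the reference page, grouped under
# hypothetical category names.
EVIDENCE_PATTERNS = {
    "matrix_core": r"\bv_mmac\w*",
    "ds_matrix_read": r"\bds_read_m\d+x\d+_b\d+\w*",
    "wide_global_load": r"\bglobal_load_dwordx[24]\b",
    "waitcnt": r"\bs_waitcnt\b",
    "barrier": r"\bs_barrier\b",
}


def count_isa_evidence(disassembly: str) -> dict[str, int]:
    """Return how many times each evidence pattern appears in the text."""
    return {
        name: len(re.findall(pattern, disassembly))
        for name, pattern in EVIDENCE_PATTERNS.items()
    }


# Placeholder text only; operands are elided and the layout is invented.
sample = """
global_load_dwordx4 ...
s_waitcnt vmcnt(0)
ds_read_m32x16_b16 ...
v_mmac ...
s_barrier
"""

counts = count_isa_evidence(sample)
print(counts)
```

A zero count for `matrix_core` on a GEMM kernel would be a prompt to inspect the build further, in line with the page's rule that final ISA is the proof for Hygon behavior.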