---
id: ref-hygon-hip-kernel-optimizer
repo: yuguo-Jack/cuda-optimized-skill
title: Hygon HIP Kernel Iterative Optimizer
url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
source_type: source-reference
source_category: upstream-knowledge
architectures:
- dcu
- gfx936
- gfx938
tags:
- dcu
- hygon
- hip
- rocm
- ck-tile
- hipprof
- dccobjdump
- sqtt
techniques:
- roofline-axis-budget
- branch-and-select
- ablation-attribution
- dcu-isa-verification
- source-backed-builtin-probes
- wave64-porting
- ck-tile-first
hardware_features:
- wave64
- lds
- mmac
- matrix-load
- ds-read-matrix
- waitcnt
- vgpr
- sgpr
kernel_types:
- gemm
- attention
- moe
- normalization
- convolution
- reduction
- activation
languages:
- markdown
- python
- hip
captured_at: '2026-05-20'
commit: c069290452aee67baa709f55d767358ab4171e69
license: MIT
source_paths:
- skills/hygon-hip-kernel-optimizer/SKILL.md
- skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
- skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
- skills/hygon-hip-kernel-optimizer/references/method_registry.json
- skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
- skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
---
# Hygon HIP Kernel Iterative Optimizer

- Repository: `yuguo-Jack/cuda-optimized-skill`
- Source: [skills/hygon-hip-kernel-optimizer](https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer)
- Captured commit: `c069290452aee67baa709f55d767358ab4171e69`
- License: `MIT`

## Route Fit

Use this as Route B DCU/Hygon optimization knowledge before falling back to CUDA-only PR evidence. It is a workflow and method catalog, not LightOp source code, so any borrowed idea still needs LightOp integration, target compilation, benchmark correctness, `hipprof` evidence, and `dccobjdump` ISA verification.

## Core Operating Rules

- Treat DCU wavefront size as 64 and rewrite CUDA warp32 assumptions before reuse.
- Prefer CK Tile or HCU examples for GEMM, convolution, attention, MoE, and norm template work; do not port CUTLASS paths by name.
- Use benchmark timing, `hipprof` PMC/read/write passes, optional SQTT, code-object resource analysis, and final `dccobjdump` together. Treat the final ISA as the decisive evidence of Hygon behavior.
- Keep a branch-and-select loop: profile the current best, choose methods from measured bottlenecks, generate several variants, benchmark all correct branches, then keep only changes with positive attribution.
- Treat source-backed HCU or AMD-named builtins as candidates only when the exact call signature and target guard are visible in DCU source, an existing project, or a minimal compile probe.
- Do not invent builtin names from AMD docs or mnemonic guesses. Compile the exact candidate and inspect generated Hygon ISA.
- Avoid FP4 strategies for current Hygon DCU targets unless newer hardware/toolchain evidence appears.
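
The selection step of the branch-and-select rule can be sketched in a few lines of Python. This is a minimal illustration, not the upstream skill's implementation: the `Variant` record, the `select_best` helper, and the 2% `min_gain` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One generated branch (hypothetical record, not from the upstream skill)."""
    name: str
    methods: tuple    # method labels applied in this branch
    correct: bool     # did the branch pass the correctness check?
    time_ms: float    # measured benchmark time


def select_best(baseline_ms, variants, min_gain=0.02):
    """Return the fastest correct variant that beats the baseline.

    A variant is kept only if it is correct and its relative gain over
    the current best (baseline) is at least `min_gain`; the threshold is
    an assumed noise margin, not a value from the source. Returns None
    when no branch qualifies, so the baseline stays the current best.
    """
    best = None
    for v in variants:
        if not v.correct:
            continue  # incorrect branches are never candidates
        gain = (baseline_ms - v.time_ms) / baseline_ms
        if gain < min_gain:
            continue  # no positive attribution beyond the margin
        if best is None or v.time_ms < best.time_ms:
            best = v
    return best


if __name__ == "__main__":
    branches = [
        Variant("lds-tiling", ("lds-tiling",), True, 0.90),
        Variant("mmac-fast-but-wrong", ("mmac",), False, 0.50),
        Variant("unroll-only", ("unroll",), True, 0.995),
    ]
    winner = select_best(1.0, branches)
    print(winner.name if winner else "keep baseline")
```

Returning `None` rather than the least-bad branch keeps the loop honest: a round with no measurable win leaves the current best untouched.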

## Method Families

- Compute: HCU MMOP/MMAC matrix core, FP8/BF8/TF32 low precision on supported targets, wave64 launch geometry, thread coarsening, register-pressure control, fast math, and inline asm/builtin escape hatches.
- Memory: coalesced and vectorized access, layout transforms, LDS tiling, global-to-LDS paths, CK Tile MLS/TLS/WASP loaders, DS matrix-read layouts, LDS bank-conflict reduction, cache policy, CK Tile named pipelines, and epilogue fusion.
- Latency: waitcnt-aware pipelining, barrier reduction, wavefront shuffle or DS permute exchange, ILP/unroll, persistent scheduling, split-K/stream-K, SALU/VALU phase balance, and SQTT stall triage.
- Operator shortcuts: map GEMM/MoE/attention/conv/norm/reduction questions to CK Tile/HCU examples first, then tune geometry, layout, pipeline, split, persistent, quant, and epilogue choices.
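
The wave64 launch geometry item is mostly arithmetic, and a small Python sketch shows where warp32 assumptions go wrong. The helper names below are hypothetical; the only fact taken from this page is the wavefront size of 64.

```python
WAVE_SIZE = 64  # DCU wavefront width; CUDA-derived code often hard-codes 32


def lane_and_wave(thread_idx, wave_size=WAVE_SIZE):
    """Split a flat thread index into (lane within wave, wave index)."""
    return thread_idx % wave_size, thread_idx // wave_size


def waves_per_block(block_threads, wave_size=WAVE_SIZE):
    """Number of wavefronts needed to cover a block (ceiling division)."""
    return (block_threads + wave_size - 1) // wave_size


def idle_lanes(block_threads, wave_size=WAVE_SIZE):
    """Lanes left unused in the last wavefront of a block."""
    return waves_per_block(block_threads, wave_size) * wave_size - block_threads


if __name__ == "__main__":
    # Thread 70 is lane 6 of wave 1 on wave64, but lane 6 of warp 2 on warp32,
    # so any index math that hard-codes 32 selects the wrong lanes.
    print(lane_and_wave(70), lane_and_wave(70, wave_size=32))
    # A 96-thread block, a clean multiple of 32, wastes lanes on wave64.
    print(waves_per_block(96), idle_lanes(96))
```

Block sizes that are multiples of 64 leave no idle lanes in this model, which is one reason to revisit launch geometry rather than copy CUDA block shapes.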

## Profiling Signals

- Low matrix-core instruction evidence on GEMM/conv/attention/MoE work points to MMAC or mixed-precision routes.
- High global/cache request pressure or scalar contiguous memory ops point to coalescing, vectorized loads/stores, layout work, or LDS staging.
- High VGPR/SGPR/LDS pressure from code-object analysis should feed launch geometry, register control, and occupancy decisions.
- Many close-to-load `s_waitcnt` instructions or SQTT wait stalls point to wait placement, software pipelining, and ILP.
- PMC gaps or contradictory results should trigger SQTT/stat-stall or source/ISA inspection instead of blind tuning.
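
The signal-to-method mapping above can be kept as a small lookup table so that method choice follows measured bottlenecks. The following Python sketch is illustrative only: the signal keys and the `candidate_methods` helper are hypothetical names, and the method labels simply restate the bullets.

```python
# Hypothetical signal keys; the method lists restate the bullets above.
SIGNAL_TO_METHODS = {
    "low_matrix_core_evidence": ["MMAC", "mixed precision"],
    "high_global_cache_pressure": [
        "coalescing", "vectorized loads/stores", "layout work", "LDS staging",
    ],
    "scalar_contiguous_memory_ops": [
        "coalescing", "vectorized loads/stores", "layout work", "LDS staging",
    ],
    "high_register_or_lds_pressure": [
        "launch geometry", "register control", "occupancy",
    ],
    "waitcnt_or_sqtt_wait_stalls": [
        "wait placement", "software pipelining", "ILP",
    ],
    "pmc_gaps_or_contradictions": [
        "SQTT/stat-stall inspection", "source/ISA inspection",
    ],
}


def candidate_methods(observed_signals):
    """Return de-duplicated candidate methods in first-seen order.

    Unknown signal names are ignored rather than guessed at.
    """
    methods = []
    for signal in observed_signals:
        for method in SIGNAL_TO_METHODS.get(signal, []):
            if method not in methods:
                methods.append(method)
    return methods


if __name__ == "__main__":
    print(candidate_methods([
        "high_global_cache_pressure",
        "waitcnt_or_sqtt_wait_stalls",
    ]))
```

The output is a candidate list only; each method still has to go through the benchmark and ISA checks described earlier.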

## ISA Evidence To Search

Useful Hygon/DCU evidence strings include `v_mmac`, `MMOP`, `fp8`, `bf8`, `tf32`, `global_load_dwordx2`, `global_load_dwordx4`, `buffer_load`, `lds`, `matrix_load`, `ds_read_m32x16_b16`, `ds_read_m32x16_b16_alt`, `ds_read_m32x32_b8`, `ds_read_m32x64_b4`, `ds_read_m32x8_b32`, `ds_bpermute`, `ds_permute`, `s_waitcnt vmcnt(0)`, `s_waitcnt lgkmcnt(0)`, and `s_barrier`.
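
Searching a disassembly dump for these strings is easy to script. The Python sketch below assumes the ISA text has already been saved to a string or file by whatever `dccobjdump` invocation the project uses; the `SAMPLE` text is hand-written for illustration with operands elided, and is not real tool output. The `count_evidence` helper is a hypothetical name.

```python
import re
from collections import Counter

# A subset of the evidence strings listed above.
EVIDENCE = [
    "v_mmac",
    "global_load_dwordx4",
    "ds_read_m32x16_b16",
    "s_waitcnt vmcnt(0)",
    "s_waitcnt lgkmcnt(0)",
    "s_barrier",
]


def count_evidence(disasm_text, patterns=EVIDENCE):
    """Count lines containing each evidence string.

    Matching is prefix-style: the pattern must start at a token boundary
    but may be followed by a suffix, so `v_mmac` also matches longer
    mnemonics and `ds_read_m32x16_b16` also counts its `_alt` form.
    """
    counts = Counter()
    for line in disasm_text.splitlines():
        for pattern in patterns:
            if re.search(r"(?<!\w)" + re.escape(pattern), line):
                counts[pattern] += 1
    return counts


# Hand-written illustrative text, NOT real disassembler output.
SAMPLE = """\
global_load_dwordx4 <operands elided>
s_waitcnt vmcnt(0)
v_mmac <suffix and operands elided>
s_barrier
global_load_dwordx4 <operands elided>
"""

if __name__ == "__main__":
    for name, n in count_evidence(SAMPLE).items():
        print(name, n)
```

Counts like these only show that an instruction family is present; they do not show that it sits on the hot path, so pair them with profiler evidence.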

## Query Hooks

```bash
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
python3 scripts/query.py "hipprof sqtt waitcnt" --type source-reference --compact
python3 scripts/get_page.py ref-hygon-hip-kernel-optimizer
```