---
id: ref-hygon-hip-kernel-optimizer
repo: yuguo-Jack/cuda-optimized-skill
title: Hygon HIP Kernel Iterative Optimizer
url: https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer
source_type: source-reference
source_category: upstream-knowledge
architectures:
- dcu
- gfx936
- gfx938
tags:
- dcu
- hygon
- hip
- rocm
- ck-tile
- hipprof
- dccobjdump
- sqtt
techniques:
- roofline-axis-budget
- branch-and-select
- ablation-attribution
- dcu-isa-verification
- source-backed-builtin-probes
- wave64-porting
- ck-tile-first
hardware_features:
- wave64
- lds
- mmac
- matrix-load
- ds-read-matrix
- waitcnt
- vgpr
- sgpr
kernel_types:
- gemm
- attention
- moe
- normalization
- convolution
- reduction
- activation
languages:
- markdown
- python
- hip
captured_at: '2026-05-20'
commit: c069290452aee67baa709f55d767358ab4171e69
license: MIT
source_paths:
- skills/hygon-hip-kernel-optimizer/SKILL.md
- skills/hygon-hip-kernel-optimizer/references/optimization_catalog.md
- skills/hygon-hip-kernel-optimizer/references/dcu_metrics_guide.md
- skills/hygon-hip-kernel-optimizer/references/method_registry.json
- skills/hygon-hip-kernel-optimizer/references/dcu_isa_signatures.json
- skills/hygon-hip-kernel-optimizer/examples/walkthrough.md
---

# Hygon HIP Kernel Iterative Optimizer

- Repository: `yuguo-Jack/cuda-optimized-skill`
- Source: [skills/hygon-hip-kernel-optimizer](https://github.com/yuguo-Jack/cuda-optimized-skill/tree/main/skills/hygon-hip-kernel-optimizer)
- Captured commit: `c069290452aee67baa709f55d767358ab4171e69`
- License: `MIT`

## Route Fit

Use this as Route B DCU/Hygon optimization knowledge before falling back to CUDA-only PR evidence. It is a workflow and method catalog, not LightOp source code, so any borrowed idea still needs LightOp integration, target compilation, benchmark correctness, `hipprof` evidence, and `dccobjdump` ISA verification.

## Core Operating Rules

- Treat DCU wavefront size as 64 and rewrite CUDA warp32 assumptions before reuse.
- Prefer CK Tile or HCU examples for GEMM, convolution, attention, MoE, and norm template work; do not port CUTLASS paths by name.
- Use benchmark timing, `hipprof` PMC/read/write passes, optional SQTT, code-object resource analysis, and final `dccobjdump` together. Final ISA is the proof for Hygon behavior.
- Keep a branch-and-select loop: profile the current best, choose methods from measured bottlenecks, generate several variants, benchmark all correct branches, then keep only changes with positive attribution.
- Treat source-backed HCU or AMD-named builtins as candidates only when the exact call signature and target guard are visible in DCU source, an existing project, or a minimal compile probe.
- Do not invent builtin names from AMD docs or mnemonic guesses. Compile the exact candidate and inspect generated Hygon ISA.
- Avoid FP4 strategies for current Hygon DCU targets unless newer hardware/toolchain evidence appears.

## Method Families

- Compute: HCU MMOP/MMAC matrix core, FP8/BF8/TF32 low precision on supported targets, wave64 launch geometry, thread coarsening, register-pressure control, fast math, and inline asm/builtin escape hatches.
- Memory: coalesced and vectorized access, layout transforms, LDS tiling, global-to-LDS paths, CK Tile MLS/TLS/WASP loaders, DS matrix-read layouts, LDS bank-conflict reduction, cache policy, CK Tile named pipelines, and epilogue fusion.
- Latency: waitcnt-aware pipelining, barrier reduction, wavefront shuffle or DS permute exchange, ILP/unroll, persistent scheduling, split-K/stream-K, SALU/VALU phase balance, and SQTT stall triage.
- Operator shortcuts: map GEMM/MoE/attention/conv/norm/reduction questions to CK Tile/HCU examples first, then tune geometry, layout, pipeline, split, persistent, quant, and epilogue choices.

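
The selection step of the branch-and-select loop in Core Operating Rules can be sketched in Python as follows. This is a minimal illustration, not the skill's own implementation: `Variant`, `select_best`, and the `min_gain` threshold are hypothetical names, and the `run` and `check` callables stand in for whatever benchmark and correctness harness the project actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Variant:
    """One candidate kernel branch (hypothetical structure for illustration)."""
    name: str
    run: Callable[[], float]    # assumed to return kernel time in milliseconds
    check: Callable[[], bool]   # assumed to compare output against a reference


def select_best(baseline_ms: float,
                variants: List[Variant],
                min_gain: float = 0.02) -> Optional[str]:
    """Benchmark every correct variant and return the fastest one,
    but only if it beats the baseline by at least `min_gain` (a fraction).
    Returns None when no variant earns its place."""
    timings: Dict[str, float] = {}
    for v in variants:
        if not v.check():
            continue  # incorrect branches are never benchmarked or kept
        timings[v.name] = v.run()
    if not timings:
        return None
    best = min(timings, key=timings.get)
    gain = (baseline_ms - timings[best]) / baseline_ms
    return best if gain >= min_gain else None
```

A caller would pass the measured time of the current best kernel as `baseline_ms`, then re-profile the returned variant before choosing the next round of methods. The threshold is there so that run-to-run noise is not mistaken for positive attribution; its value here is an arbitrary placeholder.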
## Profiling Signals

- Low matrix-core instruction evidence on GEMM/conv/attention/MoE work points to MMAC or mixed-precision routes.
- High global/cache request pressure or scalar contiguous memory ops point to coalescing, vectorized loads/stores, layout work, or LDS staging.
- High VGPR/SGPR/LDS pressure from code-object analysis should feed launch geometry, register control, and occupancy decisions.
- Many close-to-load `s_waitcnt` or SQTT wait stalls point to wait placement, software pipelining, and ILP.
- PMC gaps or contradictory results should trigger SQTT/stat-stall or source/ISA inspection instead of blind tuning.

## ISA Evidence To Search

Useful Hygon/DCU evidence strings include `v_mmac`, `MMOP`, `fp8`, `bf8`, `tf32`, `global_load_dwordx2`, `global_load_dwordx4`, `buffer_load`, `lds`, `matrix_load`, `ds_read_m32x16_b16`, `ds_read_m32x16_b16_alt`, `ds_read_m32x32_b8`, `ds_read_m32x64_b4`, `ds_read_m32x8_b32`, `ds_bpermute`, `ds_permute`, `s_waitcnt vmcnt(0)`, `s_waitcnt lgkmcnt(0)`, and `s_barrier`.

## Query Hooks

```bash
python3 scripts/query.py "hygon dcu mmac" --type source-reference --compact
python3 scripts/query.py "hipprof sqtt waitcnt" --type source-reference --compact
python3 scripts/get_page.py ref-hygon-hip-kernel-optimizer
```
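
A search over the ISA evidence strings listed in ISA Evidence To Search could look like the sketch below. It assumes the disassembly has already been saved as text by `dccobjdump`; the tool's invocation is not shown because its flags are not given here. The grouping in `EVIDENCE` and the function name `count_evidence` are hypothetical, and only a subset of the listed strings is included.

```python
from collections import Counter
from typing import Dict

# Hypothetical grouping of evidence strings taken from the list above.
EVIDENCE = {
    "matrix_core": ["v_mmac"],
    "vector_global_load": ["global_load_dwordx2", "global_load_dwordx4"],
    "lds_matrix_read": [
        "ds_read_m32x16_b16",   # also matches the _alt form as a substring
        "ds_read_m32x32_b8",
        "ds_read_m32x64_b4",
        "ds_read_m32x8_b32",
    ],
    "exchange": ["ds_bpermute", "ds_permute"],
    "full_wait": ["s_waitcnt vmcnt(0)", "s_waitcnt lgkmcnt(0)"],
    "barrier": ["s_barrier"],
}


def count_evidence(isa_text: str) -> Dict[str, int]:
    """Count disassembly lines that contain at least one string from each group.
    A line is counted at most once per group."""
    counts: Counter = Counter()
    for line in isa_text.splitlines():
        for group, needles in EVIDENCE.items():
            if any(needle in line for needle in needles):
                counts[group] += 1
    return {group: counts.get(group, 0) for group in EVIDENCE}
```

Read against the profiling signals, a zero `matrix_core` count on a GEMM kernel would suggest the MMAC route was not taken, while a high `full_wait` count is a prompt to inspect wait placement. Plain substring matching is used on purpose so that the sketch does not depend on the exact operand syntax of the disassembler output.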