    Navi3 rel (#1176) · 1837040a
    zjing14 authored
    
    
    * wmma_op + unit test
    
    * add arch limitation to wmma test
    
    * change arch limitation
    
    * Refactor + add all-type unit tests (int4 compile failed)
    
    * Add f32_16x16x16_bf16 unit test
    
    * tempsave
    
    * tempsave
    
    * tempsave
    
    * runtime bug, cannot find symbol
    
    * workaround for incorrect HIP warpSize return value
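
The warpSize workaround above is commonly done by fixing the wave size at compile time instead of trusting the runtime-reported value. A minimal hedged sketch of that idea (names hypothetical, not taken from the repository):

```cpp
#include <cstdint>

// Hedged sketch: pick the wave size from the compiler's target macro rather
// than the HIP runtime's warpSize. RDNA3 (gfx11xx) runs wave32 by default,
// while GCN/CDNA targets run wave64.
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
inline constexpr std::uint32_t wave_size = 32;
#else
inline constexpr std::uint32_t wave_size = 64; // wave64 targets / host fallback
#endif

// Example of a derived compile-time quantity: a wave viewed as two rows of
// wave/2 lanes, as in the 16-lane row groups used by WMMA tiles.
inline constexpr std::uint32_t lanes_per_row(std::uint32_t wave) { return wave / 2; }
```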
    
    * debugging
    
    * tempsave
    
    * Correctness OK, waiting for optimization
    
    * Tidy up + format
    
    * temp save
    
    * temp save, reproduce the v_bfi_b32 issue
    
    * add inline asm for wmmaop test
    
    * tidy up
    
    * clean some debug purpose code
    
    * discard some code
    
    * clang format
    
    * clang format
    
    * compiler issue fixed + increase tile size
    
    * navi3x_multipleD+example
    
    * temp save
    
    * workable
    
    * batchedgemm[OK], groupconv[debug]
    
    * groupconv: Sanity check[OK], Performance[Bad]
    
    * navi3x_groupconv_need_optimization
    
    * create necessary files
    
    * save progress
    
    * Add Inter-Row thread transfer
    
    * save progress
    
    * save debugging progress
    
    * sanity check pass
    
    * fix a host tensor bug and clean up flash-attn code
    
    * format
    
    * cancel unnecessary change
    
    * cancel unnecessary change
    
    * cancel unnecessary change
    
    * temp save, add asm backend flag to amd_wmma
    
    * Mat-A LDS Bypass sanity pass
    
    * temp save
    
    * gemm sanity fix
    
    * Porting new blockwise gemm to flash attention
    
    * Example branch provided to compiler team
    
    * tempsave
    
    * Fix a bug
    
    * batched gemm ported
    
    * conv A-skip lds ported
    
    * Skip B-Lds real gemm
    
    * Skip B Lds Gemm + MulD
    
    * batched gemm, conv, skip b lds
    
    * format
    
    * Attn, skip b lds
    
    * Change GridwiseOp name
    
    * fix a bug caused by a typo
    
    * Skip A_Lds sanity pass, Skip B_Lds scratch occurred
    
    * Bug found, caused by intra-row permute off
    
    * bug found
    
    * a fix
    
    * disable buffer load due to incorrect 3rd dword
    
    * update fmha config, no scratch generated
    
    * update 3rd dword
    
    * fmha config update
    
    * FMHA, add support to gfx1101/gfx1102
    
    * Merge origin dev (#2)
    
    * [Navi3x] Fix Gridwise_multiple_d operation (#649)
    
    * Add CMake Option "USE_OPT_NAVI3X"
    
    * fix bug
    
    * standardize docs (#655)
    
    * Separate bibtex requirement from rocm-docs-core (#656)
    
    * separate bibtex requirement from rocm-docs-core
    
    * point requirements to source rocm-docs-core repo
    
    * Add CMake Option "USE_OPT_NAVI3X" (#647)
    
    * Add CMake Option "USE_OPT_NAVI3X"
    
    * remove navi3x opt compile option from cmake script
    
    * Conv + quantization + tanh (#645)
    
    * Rename file. Prepare to support another activation
    
    * Add comment for quantization
    
    * Extract out_elementop
    
    * Add tanh example
    
    * Add conv + bias + tanh quantization instance
    
    * Add missing parameter
    
    * Refine cmake
    
    * Add external api and client example
    
    * Extract variable in example
    
    * Fix the comment
    
    ---------
    Co-authored-by: zjing14 <zhangjing14@gmail.com>
    
    * Add a denorm test fix (#603)
    
    * Add type_convert implementations for bf16
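
The bf16 `type_convert` work above centers on how fp32 values are rounded down to bf16. A hedged illustration of one standard approach, round-to-nearest-even on the raw bits (not necessarily the exact code added here):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch, not the repository's exact type_convert: fp32 -> bf16
// with round-to-nearest-even done on the raw bits. (NaN handling is omitted
// to keep the sketch short.)
using bhalf_t = std::uint16_t;

inline bhalf_t float_to_bf16_rne(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u)); // type-pun without UB
    const std::uint32_t lsb = (u >> 16) & 1u; // LSB of the kept mantissa
    u += 0x7FFFu + lsb;                       // round to nearest, ties to even
    return static_cast<bhalf_t>(u >> 16);     // keep sign, exponent, top 7 mantissa bits
}
```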
    
    * Add the fix for conv_fwd
    
    * Add the fix for conv_bwd_data
    
    * Add the fix for conv_bwd_weight
    
    * Format
    
    * Format
    
    * Another format
    
    * Add a macro to use workaround on MI200 only
    
    * Format
    
    ---------
    Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
    Co-authored-by: zjing14 <zhangjing14@gmail.com>
    
    * simplify karg in device/grid of split-k op (#644)
    
    * simplify karg in device/grid split-k op
    
    * fix mk_kn_mn instances
    
    * add more instances
    
    * use name from tensor layout
    
    * fix 3rd dword of buffer source descriptor (#659)
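
For context on the "3rd dword" being fixed: an AMD buffer resource descriptor is four 32-bit dwords, with the base address in dwords 0-1, the size in bytes (num_records) in dword 2, and format/configuration bits in dword 3. A hedged host-side sketch of that layout; the config value is target-specific and left as a caller-supplied parameter here, not a value taken from the repository:

```cpp
#include <cstdint>

// Illustrative layout of a 128-bit buffer resource descriptor as consumed by
// buffer_load/buffer_store instructions. The dword-3 bits (data format,
// stride, etc.) differ between GPU generations, which is what the fix above
// adjusts; this sketch just packs whatever config the caller passes in.
struct BufferResource
{
    std::uint32_t dword[4];
};

inline BufferResource make_buffer_resource(const void* base,
                                           std::uint32_t size_bytes,
                                           std::uint32_t config)
{
    const std::uint64_t addr =
        static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(base));
    BufferResource r{};
    r.dword[0] = static_cast<std::uint32_t>(addr);       // base address, low half
    r.dword[1] = static_cast<std::uint32_t>(addr >> 32); // base address, high half
    r.dword[2] = size_bytes;                             // num_records
    r.dword[3] = config;                                 // target-specific flags
    return r;
}
```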
    
    * add fp64 instances (#658)
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    
    * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
    
    This reverts commit bb5530af.
    
    * Groupnorm + swish external api (#668)
    
    * Rename to proper naming
    
    * Add example of groupnorm + swish
    
    * Extract duplicate code in example
    
    * Add groupnorm + swish instances
    
    * Refactor instance generation, split into multiple cpp files
    
    * Add external api and client example
    
    * Refine profiler message
    
    * Use ck math version of exp
    
    * Refine problem size in example
    
    * Add host version of exp
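
For reference on the fusion above: swish is x times sigmoid(x), so the host-side check the exp work supports reduces to a one-liner (illustration only, using std::exp rather than the ck math version):

```cpp
#include <cmath>

// Reference host-side swish used to validate a fused groupnorm + swish
// kernel: swish(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float swish(float x) { return x / (1.0f + std::exp(-x)); }
```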
    
    * add a macro to turn on/off denorm fix (off by default) (#673)
    
    * add a macro to turn off denorm fix by default
    
    * expose the macro
    
    ---------
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    
    * fixed quant example (#672)
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    
    * Add dependabot config and pin rocm-docs-core (#663)
    
    * [gtest] suppress unsafe buffer warn (#670)
    
    ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
    
    
    
    * Add memory index guard in wmma device ops (#667)
    
    * Add more macros to turn on/off denorm fix (#678)
    Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
    
    * Fix a typo (#676)
    
    * Add (#677)
    
    * Allow using ROCm release candidate compilers. (#679)
    
    * enable use of rocm5.5 release candidate 4
    
    * upgrade to ROCM5.5 RC5
    
    * try to fix the PUB_KEY error, remove the cmake-data package
    
    * upgrade to latest cmake version
    
    * use private dockerhub repo for rocm5.5 rc5
    
    * add missing bracket
    
    * add vector load check
    
    * solve conflicts
    
    ---------
    Co-authored-by: Sam Wu <sjwu@ualberta.ca>
    Co-authored-by: Sam Wu <sam.wu2@amd.com>
    Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
    Co-authored-by: zjing14 <zhangjing14@gmail.com>
    Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
    Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
    Co-authored-by: carlushuang <carlus.huang@amd.com>
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
    
    * Disable SkipLDS & Align AIT api (#3)
    
    * fix layernorm, reduction Ops (#4)
    
    * Disable SkipLDS & Align AIT api
    
    * Update dependabot config (#682)
    Co-authored-by: samjwu <samjwu@users.noreply.github.com>
    
    * update attn api
    
    * solve type_convert bug + enable
    
    ---------
    Co-authored-by: Sam Wu <sjwu@ualberta.ca>
    Co-authored-by: Sam Wu <sam.wu2@amd.com>
    Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
    Co-authored-by: zjing14 <zhangjing14@gmail.com>
    Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
    Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
    Co-authored-by: carlushuang <carlus.huang@amd.com>
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
    Co-authored-by: samjwu <samjwu@users.noreply.github.com>
    Co-authored-by: haocwang <Haocong.WANG@amd.com>
    
    * fix typo
    
    * Fix attention with causal mask
    
    * multiple fix, try ait compile
    
    * Add A/B not use LDS pipeline
    
    * Clang format, Add gfx1101, gfx1102 support of FMHA example
    
    * cancel change of format script
    
    * 1. Enable 2-stage global prefetch (may cause VGPR spilling)
    2. Enable FP16 accumulator blockwise_gemm
    
    * clang-format
    
    * 1. change blockwise gemm loop-over direction from kmn to mnk (~1% improvement)
    2. change kernel timing mode to 50 warmup + 50 timed repeats
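
The timing change above can be sketched as a host-side harness: a fixed number of untimed warmup iterations followed by an averaged timed loop (function name and structure are hypothetical, not the repository's profiler code):

```cpp
#include <chrono>
#include <functional>

// Hedged sketch of the "50 warmup + 50 timed repeat" scheme: warmup runs
// stabilize caches and clocks, then the average over the timed runs is
// reported in milliseconds.
inline double time_avg_ms(const std::function<void()>& kernel,
                          int n_warmup = 50, int n_repeat = 50)
{
    for (int i = 0; i < n_warmup; ++i)
        kernel(); // untimed warmup
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_repeat; ++i)
        kernel(); // timed repeats
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / n_repeat;
}
```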
    
    * Update low-level abstraction of blockwise gemm wmma
    
    * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
    
    * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
    
    * (4/5) grouped conv pass
    
    * (5/5) attention pass, todo: debug lds perf bug
    
    * AIT Attention API refactor (#8)
    
    * sanity pass
    
    * sanity pass 2
    
    * confirm significant performance regression.
    
    * turn on all instances
    
    * turn off instance format
    
    * Fix bug & tuning & format
    
    * DML meta, self_attn+cross_attn
    
    * sanity pass
    
    * remove useless flag
    
    * update tile and problem size used in AIT attention
    
    * bug fix in grouped conv support check
    
    * deprecate inline asm wmma
    
    * Bug fix: double lds skip
    
    * clang-format
    
    * Fix errors in
    1. example, fmha
    2. gridwise pipeline
    3. deviceop, fmha, change some containers from vector to array
    
    * part2 of previous commit
    
    * clang format
    
    * API fix of gridwisegemmpipeline
    
    * separate array-based and vector-based attention tensor transformation
    
    * fix gemm
    
    * clang format
    
    * add gemm fp16 instances
    
    * Temp save
    
    * fpAintB kernel compile pass
    
    * Sanity pass.
    
    * Temp save
    
    * debug code enabled
    
    * Fp16AInt8B_GEMM sanity
    
    * MQA implementation
    
    * GQA-4 example
    
    * tempsave
    
    * Compile pass
    
    * New implementation of fp16Aint8B Gemm, achieves similar math throughput to native fp16 Gemm
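
The fp16Aint8B idea above stores B as int8 plus a scale and dequantizes it just before the multiply, so the inner loop still runs at fp16 math rate. A hedged scalar sketch using float in place of fp16 (host-side illustration only, not the kernel's code):

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch of mixed-precision GEMM with quantized weights: B is int8
// with a per-tensor scale, dequantized element-by-element to the compute
// type inside the inner loop. Plain float stands in for fp16 so the sketch
// runs on the host.
inline std::vector<float> gemm_a_fp_b_int8(const std::vector<float>& A,        // M x K, row-major
                                           const std::vector<std::int8_t>& B, // K x N, row-major
                                           float b_scale, int M, int N, int K)
{
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * (static_cast<float>(B[k * N + n]) * b_scale);
            C[m * N + n] = acc;
        }
    return C;
}
```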
    
    * format
    
    * Todo: fix gemm_bilinear_wmma instances compilation bug
    
    * Solve a bug when K1=16
    
    * remove unnecessary changes
    
    * Remove tensor layout limitation to LDS usage in tensor contraction
    
    * update self-attention and cross-attention
    
    * fix a typo of name
    
    * Add arch limiter for fp8 gemm
    
    * enable fp8 gemm_xdl for all gfx9 targets
    
    * temporarily disable gemm_xdl_fp16_fp8 on MI100/200
    
    * fix the cmake logic for gemm_xdl_fp16_fp8
    
    * re-enable the gemm_xdl_fp16_fp8 on MI100/200
    
    ---------
    Co-authored-by: aska-0096 <haocwang@amd.com>
    Co-authored-by: Sam Wu <sjwu@ualberta.ca>
    Co-authored-by: Sam Wu <sam.wu2@amd.com>
    Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
    Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
    Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
    Co-authored-by: carlushuang <carlus.huang@amd.com>
    Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
    Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
    Co-authored-by: samjwu <samjwu@users.noreply.github.com>
    Co-authored-by: haocwang <Haocong.WANG@amd.com>
    Co-authored-by: illsilin <Illia.Silin@amd.com>