1. 07 Feb, 2022 1 commit
    • Chao Liu's avatar
      GEMM+Bias+ReLU+Add (#76) · 823657ed
      Chao Liu authored
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * fix build
      
      * clean up
      
      * added example for gemm+bias+relu+add
      
      * added example for gemm+bias+relu
      
      * add profiler for gemm_s_shuffle; re-org files
      
      * add profiler
      
      * fix build
      
      * clean up
      
      * clean up
      
      * clean up
      
      * fix build
      823657ed
  2. 04 Feb, 2022 1 commit
  3. 03 Feb, 2022 1 commit
  4. 26 Dec, 2021 1 commit
    • Chao Liu's avatar
      Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7
      Chao Liu authored
      * fix relu
      
      * clean up
      
      * clean up
      
      * adding 1x1 conv
      
      * adding 1x1 conv
      
      * added 1x1 conv
      
      * refactor
      
      * refactor
      
      * refactor
      
      * added profiler for conv+bias+relu+add
      
      * clean up
      
      * adding conv+bias+relu
      
      * adding conv+bias+relu
      
      * added conv+bias+relu
      
      * Update README.md
      
      * update cpu verification
      
      * adding c shuffle
      
      * update static_tensor for dealing with invalid element
      
      * adding c shuffle
      
      * debugging
      
      * fix bug
      
      * convert to fp16 before shuffle
      
      * shuffle more than one M/NRepeat
      
      * clean up
      
      * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
      
      * clean up
      
      * remove coordinate step hack from all gridwise gemm xdl
      
      * clean up coordinate step hack
      
      * clean up coordinate step hack
      
      * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
      
      * adding output shuffle in conv+bias+relu+add
      
      * update
      
      * added conv+bias+relu+add with c shuffle
      
      * added conv+bias+relu+add with c shuffle
      
      * fix forward_sweep bugs in threadwise copy
      
      * clean up
      
      * refactor
      
      * clean up
      
      * clean up
      
      * added conv_c_shuffle+bias_relu
      
      * clean up
      
      * added conv+bias+relu+atomic_add
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * misc fixes; add 1x1 specialization
      
      * clean up
      
      * delete unused device op
      
      * clean up
      
      * add support for odd C value
      acbd7bd7
  5. 03 Dec, 2021 1 commit
    • Chao Liu's avatar
      GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md
      41cdd380
  6. 02 Dec, 2021 1 commit
  7. 30 Nov, 2021 1 commit
  8. 18 Nov, 2021 3 commits
    • Chao Liu's avatar
      Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff
      Chao Liu authored
      * reworking vector_type
      
      * use __builtin_memcpy for bit_cast and vector access of scalar pointer
      
      * clean up
      64350aff
    • zjing14's avatar
      v5r1 fusion kernels for inference (#49) · 970fa3e9
      zjing14 authored
      
      
      * init
      
      * refactor for 1x1
      
      * rename e0_e1
      
      * add e1 with bugs
      
      * debug
      
      * fixed
      
      * fixed e1
      
      * add timer
      
      * imprve threadwise gemm with dot2
      
      * add e2
      
      * tuning
      
      * seperate c2
      
      * add nhwc
      
      * restore nchwc
      
      * clean
      
      * opt
      
      * fixed; tuning
      
      * add BGlobalMoveSliceWindowStepHacks{}
      
      * tuning
      
      * repeat running
      
      * adjust
      
      * merge v5r1 nchwc
      
      * add adaptors
      
      * split k0 k1 in c_thread_grid
      
      * split h and w
      
      * remove v5r1 nhwc
      
      * clean for pr
      
      * remove host_conv_add
      
      * clean code
      
      * clean
      
      * add dynamic support
      
      * static mode
      
      * test static
      
      * add conv+add fusion
      
      * fixed validation
      
      * naming fix
      
      * use activ_enum
      
      * make static
      
      * refactor conv_add for InMem::add
      
      * add bias
      
      * add conv_out
      
      * add configurable makeddesc
      
      * add maxpool fusion
      
      * add maxpool host for validation
      
      * enable static desc
      
      * conv-only use v5r1_add
      
      * test
      
      * test
      
      * for binary dumps
      
      * fixed incorrect results due to typo
      
      * clean
      
      * debugging maxpool
      
      * workaround with offset trick
      
      * clean code
      
      * modularize ops of fusion
      
      * add gridwise_gemm_v3
      
      * create seperate fusion fun
      
      * enable dynamic mode of conv and conv+resize_add
      
      * add dynamic mode of maxpool
      
      * add pass by point
      
      * add activ_type as arguments
      
      * merge develop
      
      * clean
      
      * reset config to old default
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      970fa3e9
    • zjing14's avatar
      Fixed bfp16 host_conv_fwd (#52) · a651ea4f
      zjing14 authored
      
      
      * fixed bfloat16 issues
      
      * refactor type_convert
      
      * fixed host_convolution_forward for ushort
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      a651ea4f
  9. 16 Nov, 2021 2 commits
  10. 15 Nov, 2021 2 commits
    • zjing14's avatar
      Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03
      zjing14 authored
      
      
      * init StaticBufferV2
      
      * clean
      
      * adopt old output stage for staticBufferV2
      
      * clean
      
      * remove hack
      
      * clean
      
      * clean
      
      * add parameters
      
      * clean code
      
      * move c_buffer alloc into blockwise gemm
      
      * add adaptors for m/n_thread_data_on_grid
      
      * tweak gemm
      
      * adjust blockwise_gemm_xdlops
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * add bf16_1k
      
      * clang-format
      
      * clean
      
      * add bfp16 for gfx908
      
      * add verification
      
      * clean up
      
      * clean code
      
      * restore bfl16
      
      * clean
      
      * add bfp16 support into gemm_driver
      
      * apply new generator to other drivers
      
      * add int8 support
      
      * cleanb
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      Co-authored-by: default avatarroot <root@hayabusa6111.amd.com>
      3737bb03
    • Chao Liu's avatar
      FP16 data in-register transpose (#41) · b491ebf3
      Chao Liu authored
      * start fixing 16bit data packing
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * add missing constexpr
      
      * adding static tensor
      
      * adding static tensor
      
      * adding transpose
      
      * add inline asm for transpose 2x2 of half_t
      
      * add general transpose_vectors(), but have unnecessary register initialization using v_mov
      
      * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
      
      * add hardcoded logic for NHWC wrw
      
      * improve asm for v_pack
      
      * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
      
      * tweak
      
      * reorganize file
      b491ebf3
  11. 14 Nov, 2021 1 commit
    • Chao Liu's avatar
      ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from top most folder
      
      * Update README.md
      
      * clean up
      e823d518
  12. 27 Oct, 2021 1 commit
  13. 21 Oct, 2021 1 commit
  14. 19 Oct, 2021 1 commit
    • ltqin's avatar
      add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) · fd49ff80
      ltqin authored
      
      
      * add add new algorithm from v4r4r2
      
      * program once issue
      
      * add split k functiion
      
      * redefine code
      
      * add a matrix unmerge
      
      * add b matrix unmerge k0
      
      * trans a and b to gridegemm
      
      * nhwc init
      
      * no hacks and vector load
      
      * add hacks
      
      * modify some parameter
      
      * fix tuning prometer for fp32
      
      * fix tuning prometer for fp16
      
      * start change gridwise k split
      
      * init ok
      
      * revome a b matrix k0mk1 desc in grid
      
      * carewrite lculate gridsize
      
      * add kbatch to CalculateBottomIndex
      
      * remove some unused funtion
      
      * add clear data function before call kernel
      
      * out hacks
      
      * in hacks
      
      * rename device convolution file and function name
      
      * modify kBatch value
      
      * fix some tuning code
      
      * start from v4r4 nhwc
      
      * nhwc atomic is able to run
      
      * just for fp32
      
      * enable nchw atomic
      
      * tweak
      
      * tweak
      
      * re-arrange gridwise gemm hot loop for wrw
      
      * add wrw v4r5
      
      * v4r4r5 fp16
      
      * v4r4r4 fp16
      
      * v4r4r2 fp16
      
      * V4R4R4XDLNHWC fp16
      
      * V4R4R2XDLATOMICNCHW fp16
      
      * adjust for fp16
      
      * input gridsize
      
      * change kbatch to gridsize
      
      * testing wrw
      
      * clean up
      
      * k_batch to gridsize
      
      * fix bug
      
      * wrw v4r4r4 kbatch change to gride size
      
      * wrw v4r4r2 kbatch change to gride size
      
      * after merge , change gridwise gemm v2r4
      
      * change MakeCBlockClusterAdaptor
      
      * other method use new gridwise gemm
      
      * clean up
      
      * chapad method nge to make_right_pad_transform
      
      * kbatch out from transform function
      
      * clean up and fix bug
      
      * fix bug
      
      * using function type reduce template parameters
      
      * using auto replace define fuction type
      
      * clean up
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      fd49ff80
  15. 06 Oct, 2021 1 commit
    • Chao Liu's avatar
      Tweak GEMM kernel (#38) · b3e8d57d
      Chao Liu authored
      * add parameters
      
      * tweak gemm
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * clean up
      b3e8d57d
  16. 05 Sep, 2021 1 commit
    • Chao Liu's avatar
      GEMM driver and kernel (#29) · 19613902
      Chao Liu authored
      * add gemm driver
      
      * tweak
      
      * add gemm kernel: mk_kn_mn and km_kn_mn
      
      * tweak
      
      * add GEMM km_nk_mn
      
      * fix comment
      19613902
  17. 31 Aug, 2021 1 commit
    • ltqin's avatar
      Backward weight v4r4r2 with xdlops (#18) · 627d8ef3
      ltqin authored
      
      
      * start
      
      * modify transformat
      
      * modify device convolutiion
      
      * modify host
      
      * added host conv bwd and wrw
      
      * remove bwd, seperate wrw
      
      * clean
      
      * hacall k to zero
      
      * out log
      
      * fixed
      
      * fixed
      
      * change to (out in wei)
      
      * input hack
      
      * hack to out
      
      * format
      
      * fix by comments
      
      * change wei hacks(wei transform has not merge)
      
      * fix program once issue
      
      * fix review comment
      
      * fix vector load issue
      
      * tweak
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      627d8ef3
  18. 23 Aug, 2021 1 commit
    • zjing14's avatar
      Xdlops refactor fix (#22) · 9d3f634a
      zjing14 authored
      * added constexpr ahead of adptor; clean unused driver; rename M/NPerWave to M/NPerXDL
      
      * fixed bwd
      
      * fixed comment
      9d3f634a
  19. 19 Aug, 2021 3 commits
    • Chao Liu's avatar
      Composable kernel init integration v3 (#1097) · 6fe3627a
      Chao Liu authored
      * Squashed 'src/composable_kernel/' content from commit f6edda61
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: f6edda61
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5
      
      5781adf5 Update develop (#5) (#6)
      97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
      7b1ec41e refactor
      49c33aae refactor
      54b3e73d rename
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: 5781adf5
      
      
      
      * fix
      
      * refactor
      
      * remove online compilation from CK
      
      * refactor
      
      * fix
      
      * add ctest
      
      * add c-style pointer cast
      
      * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
      
      * fix clang warning suppression
      
      * tidy
      
      * suppress cppcheck
      
      * fix enum issue
      
      * revert chagnes to hip build
      
      * fix kernel filename
      
      * update CK build script
      
      * rename
      
      * rename
      
      * make innner product compatiable on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * refactor
      
      * refactor
      
      * change cmakelist
      
      * change ck common utility
      
      * fix
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      6fe3627a
    • zjing14's avatar
      refactor dynamic xdlops iGemm (#13) · a2ad6d35
      zjing14 authored
      * xdlops refactor
      
      * fixed commnt
      
      * clean xdlops_gemm
      
      * add make c into xldops-gemm
      
      * change mfma_info
      
      * refactor xdlops, hide c desc
      
      * clean
      
      * clean
      
      * clean
      
      * apply hacks changes to v4r4r4_nhwc
      
      * rename hacks and use single stage adapter
      
      * enable fp16 mfma
      a2ad6d35
    • zjing14's avatar
      Added host_conv_wrw for verification (#15) · ba6f79a7
      zjing14 authored
      * added host conv wrw
      ba6f79a7
  20. 13 Aug, 2021 2 commits
  21. 11 Aug, 2021 2 commits
  22. 10 Aug, 2021 5 commits
  23. 09 Aug, 2021 6 commits