1. 23 May, 2022 1 commit
  2. 20 May, 2022 10 commits
    • example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244) · ac543313
      Shaojie WANG authored
      
      
      * enable example of conv 1d/3d for bwd weight
      
      * make bf16 kernel not use atomic add
      
      * using new gridwise gemm for bwd weight on convnd bwd weight
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • remove options.hpp.in (#240) · 44943e0e
      Chao Liu authored
    • Refactor block to C tile map (#235) · a054f7d6
      Anthony Chang authored
      * refactor block-to-ctile-map
      
      * gridwise gemm block2ctile generic validity check
      
      * format
      
      * amend split-k gemm block2ctile map refactor
      
      * add test
      
      * format
      
      * amend
      
      * revert to calculating batch index in kernel instead of passing as block_id_z
      
      * move file
      
      * add valid ctile index check to gridwise v2r4
    • [conv bwd-weight]Binding gemm k1 to conv n (#202) · 070619fb
      Shaojie WANG authored
      
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * binding gemm k1 to conv n
      
      * try using half_4 to do ds_read
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * use fastest config
      
      * fix clang format for the other files
      
      * remove gemmk0 pad for output
      
      * add gemmk padding macro
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * add Right padding macro
      
      * add 2 test asm code
      
      * using 256x256x32 tile size
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * using small vec
      
      * 256*128 kernel size for example
      
      * remove asm files
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * reset gridwise gemm v2r4r2
      
      * remove unused code
      
      * revert instance file
      
      * revert example instance
      
      * format file
      
      * remove macros
      
      * resolve compile error
      
      * rename wrw kernel invoker
      
      * use gridwisegemm pipeline struct instead of implementing the run function in the same header
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Fix + test reenabled · 5fd5daab
      myamlak authored
    • Revert "Enabling bf16 test" · 18125c3b
      myamlak authored
      This reverts commit f497e2ba.
    • [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations (#190) · b9b9c3b8
      Shaojie WANG authored
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * fix clang format for the other files
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * rename kernel invoker
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Hotfix eltiwseop (#242) · bb4b82a9
      rocking5566 authored
      
      
      * Use vector constructor instead
      
      * Fix typo
      
      * Move blockSize to the MakeArgumentPointer
      
      * Fix naming
      
      * Fix clang format
      
      * remove blockSize from DeviceBinaryElementwise::Argument()
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Gemm reduce max (#209) · 0ffe956a
      rocking5566 authored
      
      
      * [What] Rename the example
      [Why] Prepare to add unary reduction
      
      * Add global operation to the parameter
      
      * Add atomicmax
      
      * Fix compile error
      
      * Support atomicMax (hip library)
      
      * Rename the reduction example
      
      * Fix target name
      
      * use p_d1_grid as the indicator directly
      
      * Prevent performance issue. Let passthrough handle it.
      
      * Implement the function template and specialize it for float2
      
      * No need to separate into two lines
      
      * Remove empty line
      
      * add comment
      
      * Fix compile error due to merge from develop
      
      * make the implementation of atomic_max / atomic_add explicit for each datatype
      
      * Refine typo
      
      * For future CI test
      
      * Fix compiler error in ckProfiler
      
      * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
      
      * simply use remove_pointer
      
      * Rename type and var
      
      * Refine example
      
      * Modify reducemax example
      
      * Fix bug in reduction
      
      * Change initialize range
      
      * Implement F64 version of atomicMax
      
      * Move reduction code together
      
      * Add buffer atomic_max
      
      * Fix coding style by clang-format
      
      * Integrate new api of DeviceGemmReduce_Xdl_CShuffle
      
      * Integrate Batch gemm reduction
      
      * Fix example
      
      * fix example
      
      * clean up
      
      * Fix batch gemm tensor operation
      
      * Fix coding style
      
      * Fix template argument
      
      * Fix clang format
      
      * Keep the flexibility of a different stride for each D tensor
      
      * Fix compile error for ckProfiler
      
      * Fix typo
      
      * [What] Fix naming
      [Why] Prepare to add out elementop
      
      * Add DoutElementOp
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: rocking <chunylai@amd.com>
  3. 19 May, 2022 4 commits
    • Enabling bf16 test · f497e2ba
      myamlak authored
    • Format · f63ca8e8
      myamlak authored
    • elementwise op (#238) · aafc3ac2
      rocking5566 authored
      
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim. Prepare to support multiple dimensions
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be a multiple of the number of CUs
      
      * Move thread per block to the parameter of constructor
      
      * rename threadPerBlock with blockSize
      
      * Support double
      
      * rename kernel function name
      
      * remove redundant include header
      
      * Refine type
      
      * Need to the final dimension
      
      * Refine variable name
      
      * Refine type
      
      * Use index_t instead of int in API
      Co-authored-by: rocking <chunylai@amd.com>
  4. 18 May, 2022 5 commits
  5. 17 May, 2022 13 commits
  6. 16 May, 2022 4 commits
  7. 13 May, 2022 3 commits