1. 19 May, 2022 1 commit
    • rocking5566's avatar
      elementwise op (#238) · aafc3ac2
      rocking5566 authored
      
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim . Prepare to support multiple dimension
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be the multiple number of CU
      
      * Move thread per block to the parameter of constructor
      
      * rename threadPerBlock with blockSize
      
      * Support double
      
      * rename kernel function name
      
      * remove redundant include header
      
      * Refine type
      
      * Need to the final dimension
      
      * Refine variable name
      
      * Refine type
      
      * Use index_t instead of int in API
      Co-authored-by: default avatarrocking <chunylai@amd.com>
      aafc3ac2
  2. 12 May, 2022 1 commit
    • JD's avatar
      Add host API (#220) · cec69bc3
      JD authored
      
      
      * Add host API
      
      * manually rebase on develop
      
      * clean
      
      * manually rebase on develop
      
      * exclude tests from all target
      
      * address review comments
      
      * update client app name
      
      * fix missing lib name
      
      * clang-format update
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix test issue
      
      * refactor
      
      * refactor
      
      * refactor
      
      * upate cmake and readme
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      cec69bc3
  3. 11 May, 2022 1 commit
    • Anthony Chang's avatar
      Manual control of MAC cluster for improved interwave performance (#184) · 76764d8c
      Anthony Chang authored
      * manual control of MAC cluster for improved 2-wave performance
      
      ensure setprio's order; ensure inner loop size >= local read size
      
      synchronize when single mac cluster
      
      * format
      
      * use value field from ck::integral_constant
      
      * roll out inter-wave loop scheduler to c-shuffle gemm variants
      
      will gradually roll out to other applicable device ops when occasional reg spill is resolved
      
      * additional comments
      
      * format
      
      * fix mismatch between inter-wave pipeline and interwave blockwise gemm
      
      * address review feedback
      
      * amend
      76764d8c
  4. 10 May, 2022 1 commit
  5. 09 May, 2022 2 commits
    • myamlak's avatar
      Resolution of issue #153: Add compiler warning on comparing int and size_t (#212) · f03a1738
      myamlak authored
      
      
      * Turning compare warnings on
      
      * Cleaning part I
      
      * Cleaning part II
      
      * Explicit static_cast to ck::type_convert
      
      * Resolving large tensor size issue.
      
      * format
      
      * revert change to tensor descriptor; promote lementSpaceSize to 64bit
      
      * use integer value for GEMM test
      
      * Review remarks
      
      * Review remarks + issues with (un)signed arithmetic
      
      * Format fix
      
      * Format
      
      * Clang-format.
      
      * fix 2gb limit issue
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      f03a1738
    • Chao Liu's avatar
      Code refactor (#175) · ec7c2e91
      Chao Liu authored
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * clean up
      
      * clean up
      
      * clean up
      ec7c2e91
  6. 29 Apr, 2022 2 commits
    • Qianfeng's avatar
      Update to gemm_reduce and batched_gemm_reduce (#213) · c77ae65d
      Qianfeng authored
      * [Experimental] Change to gemm+reduce and batched-gemm+reduce
      
      * Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel
      
      * Tiny fix in device_batched_gemm_xdl.hpp
      
      * clang-format library/src/utility/conv_fwd_util.cpp
      c77ae65d
    • JD's avatar
      Add gfx90a CI stage for tests (#208) · 97d8c504
      JD authored
      * Add gfx90a CI stage
      
      * upgrade to ROCm 5.1 and fix formatting
      97d8c504
  7. 25 Apr, 2022 1 commit
  8. 22 Apr, 2022 1 commit
  9. 21 Apr, 2022 2 commits
  10. 15 Apr, 2022 1 commit
    • Illia Silin's avatar
      Compile CK for all targets (#188) · 4221505d
      Illia Silin authored
      
      
      * compile ck for all targets
      
      * update the target criteria
      
      * change the target condition
      
      * fixed some typos
      
      * fixed missed file
      
      * revert changes in README
      
      * revert device_conv3d_fwd_xdl_...
      
      * update device_conv3d_fwd_xdl_...
      
      * update device_batched_gemm_reduce...
      
      * test the unused arguments fix
      
      * test the warning suppression
      
      * try suppress warnings in device_batched_gemm_reduce_xdl...
      
      * fix the last warnings
      
      * replace UNUSED with std::ignore
      
      * fix a typo
      
      * replaced std::ignore with ignore
      
      * add igonre header to common_header
      
      * refactor atomicAdd
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      4221505d
  11. 05 Apr, 2022 4 commits
    • Adam Osewski's avatar
      Common forward convolution utility refactor. (#141) · abf4bdb9
      Adam Osewski authored
      
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensio.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example director to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      
      * 3D cases for convolution utility functions.
      
      * 3D reference convolution.
      
      * Add support for 3D convolution.
      
      * Check for inputs bigger than  2GB.
      
      * Formatting
      
      * Support for bf16/f16/f32/i8 - conv instances + UT.
      
      * Use check_err from test_util.hpp.
      
      * Split convnd test into separate files for each dim.
      
      * Fix data generation and use proper instances.
      
      * Formatting
      
      * Skip tensor initialization if not necessary.
      
      * Fix CMakefiles.
      
      * Remove redundant conv2d_fwd test.
      
      * Lower problem size for conv3D UT.
      
      * 3D case for convnd example.
      
      * Remove leftovers after merge.
      
      * Add Conv Specialization string to GetTypeString
      
      * Skip instance causing numerical errors.
      
      * Small fixes.
      
      * Remove redundant includes.
      
      * Fix namespace name error.
      
      * Script for automatic testing and logging convolution fwd UTs
      
      * Comment out numactl cmd.
      
      * Refine weights initalization and relax rtol for fp16
      
      * Move test_util.hpp to check_err.hpp
      
      * Refine weights initalization and relax rtol for fp16
      
      * Refactor common part of test conv utils.
      
      * Move utility function to single common place.
      
      * Add additional common functions to utility.
      
      * Refactor convnd_fwd_xdl examples.
      
      * Remove redundant files.
      * Unify structure.
      
      * Add constructor to ConvParams.
      
      * And add input parameters validation.
      
      * Modify conv examples to use single utility file.
      
      * Remove check_error from host_tensor.hpp
      
      * Get rid of check_indices function.
      
      * Remove bf16_to_f32 function overload for scalars.
      
      * Fix namespace.
      
      * Add half_float::half for check_err.
      
      * Fix conv params size in UT.
      
      * Fix weights initialization for int8.
      
      * Fix weights initialization for int8.
      
      * Add type_convert when store output in ref conv 1D.
      
      * Get back old conv2d_fwd_xdl operation.
      
      * Silence conv debug print.
      
      * format
      
      * clean
      
      * clean
      
      * Fix merge.
      
      * Fix namespace for check_err
      
      * Formatting.
      
      * Fix merge artifacts.
      
      * Remove deleted header.
      
      * Fix some includes and use ck::utils::check_err.
      
      * Remove unused check_indices restored by previous merge.
      
      * Fix namespaces after merge.
      
      * Fix compilation error.
      
      * Small fixes.
      
      * Use common functions.
      * Fix filename
      * Fix namespaces.
      
      * Fix merge artifact - retrieve removed by accident fun.
      
      * Fix ConvForwardSpecialization.
      
      * Adhere to coding style rules.
      
      * Fix merge artifacts.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      abf4bdb9
    • ltqin's avatar
      Patch for bwd data comments (#174) · 6717168c
      ltqin authored
      * change function name and way to set input zero
      
      * change enable if
      6717168c
    • ltqin's avatar
      NHWC Conv2d Bwd weight fp16 ckprofiler and test (#166) · 781cacd2
      ltqin authored
      * change backward weight name
      
      * start add bwd weight lib and profiler
      
      * change tuning paramter
      
      * change output info
      
      * add bwd weight test
      
      * change test info
      
      * using conv_util
      
      * change wgt to weight
      
      * add }
      
      * add fp32
      781cacd2
    • Qianfeng's avatar
      Improve Reduction kernel api (#152) · 82c8b9f8
      Qianfeng authored
      * Add ThreadwiseReduction functor as per-thread reduction api
      
      * Using ThreadwiseReduce api and some change in using PartitionedBlockwiseReduction api to simply the kernels
      
      * Add comments and remove useless declarations in the kernels
      
      * Tiny updates
      82c8b9f8
  12. 31 Mar, 2022 4 commits
    • Anthony Chang's avatar
      Tune & add conflict-free LDS gemm kernels (#159) · 7db48f90
      Anthony Chang authored
      * retune & add conflict-free bf16/fp16 c-shuffle gemm instances
      
      amend wrong K1 value in some fp16/bf16 kernel instances
      
      * make gemm cshuffle's timing behavior consistent with all other functions
      
      * clang-format
      
      * retune & add conflict-free fp32 c-shuffle gemm instances
      
      * retune & add conflict-free int8 c-shuffle gemm instances
      
      * update the underlying gridwise gemm of all c-shuffle gemm kernels
      
      * typo
      7db48f90
    • Chao Liu's avatar
      Compile for gfx908 and gfx90a (#130) · cd167e49
      Chao Liu authored
      * adding compilation for multiple targets
      
      * fix build
      
      * clean
      
      * update Jekinsfile
      
      * update readme
      
      * update Jenkins
      
      * use ck::half_t instead of ushort for bf16
      
      * rename enum classes
      
      * clean
      
      * rename
      
      * clean
      cd167e49
    • Jianfeng Yan's avatar
      fixed issue164 (#165) · ecf337ba
      Jianfeng Yan authored
      * fixed issue164
      
      * removed prints
      ecf337ba
    • Jianfeng Yan's avatar
      batched_gemm: use profiler in ctest (#163) · c8f3acf9
      Jianfeng Yan authored
      c8f3acf9
  13. 30 Mar, 2022 1 commit
    • Jianfeng Yan's avatar
      Batched gemm and reduction (#156) · 34c661e7
      Jianfeng Yan authored
      * adding batched_gemm_and_reduction
      
      * batched_gemm_reduce works with bactch_count=1
      
      * fix a bug in grid_size; batched_gemm_reduce works for batch_count > 1
      
      * adding profiler for batched_gemm_fp16
      
      * fixed a bug in declaration of d1 and d0; both example and profiler work
      
      * clang-format
      
      * cleanup
      
      * batched_gemm_reduce: add test
      
      * minor change
      
      * fixed some typo in function names
      34c661e7
  14. 29 Mar, 2022 1 commit
    • ltqin's avatar
      Unified implementation of 1d/2d/3d conv bwd-data. fp32/fp16/bfp16/int8 (#134) · 0536f2b3
      ltqin authored
      
      
      * start convnd bwd data
      
      * add 3d laoyout name
      
      * add conv1d reference
      
      * add con3d reference
      
      * finished example client code
      
      * conv1d kernel finished
      
      * fix input error
      
      * add conv3d
      
      * add 3d layout in conv_utils.hpp
      
      * fix sepecial check
      
      * addconvnd lib
      
      * add test for bwd data
      
      * finished test
      
      * add check slice length
      
      * convnd bwd data start
      
      * profiler can be compiled
      
      * fix some bug
      
      * set input to zero
      
      * modify readme for example
      
      * fix test_convnd_bwd_data bug
      
      * test_convnd_bwd_data parameter desc
      
      * workaround for 1d
      
      * workaroud for 2d
      
      * change init value
      
      * workaround for 3d int8
      
      * fix init value bug
      
      * remove workaround
      
      * fix acc data type
      
      * add int32
      
      * change select function to template
      
      * tilda to tilde
      
      * remove int32 instance
      
      * fix commit for device hpp
      
      * fix comments for profiler
      
      * using profile imp to test
      
      * add pass verification
      
      * fix conv2d reference
      
      * fix conflict
      
      * remove double batched_gemm
      
      * fix exampel conv2d data and test convnd
      
      * format
      
      * change conv2d_bwd_data return value
      
      * remove repeat = 1
      
      * remove conv bwd data
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      0536f2b3
  15. 24 Mar, 2022 1 commit
    • Chao Liu's avatar
      Gemm+Reduce Fusion (#128) · f95267f1
      Chao Liu authored
      * add gridwise gemm v4r1
      
      * rename
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * use sfc in shuffling
      
      * remove hardcode
      
      * remove hardcode
      
      * refactor
      
      * fix build
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * format
      
      * clean
      
      * adding gemm+reduce
      
      * adding profiler for gemm+reduce
      
      * adding gemm+reduce profiler
      
      * fix build
      
      * clean up
      
      * gemm+reduce
      
      * fix build
      
      * update DeviceGemm_Xdl_CShuffle; update enum to enum class
      
      * clean up
      
      * add test for gemm+reduce
      
      * clean up
      
      * refactor
      
      * fix build
      
      * fix build
      f95267f1
  16. 23 Mar, 2022 2 commits
    • Adam Osewski's avatar
      Unified conv3D API + support for all data types. (#133) · f91579aa
      Adam Osewski authored
      
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensio.
      
      Needed to support various conv dimensions
      
      * Rename conv nd example director to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      
      * 3D cases for convolution utility functions.
      
      * 3D reference convolution.
      
      * Add support for 3D convolution.
      
      * Check for inputs bigger than  2GB.
      
      * Formatting
      
      * Support for bf16/f16/f32/i8 - conv instances + UT.
      
      * Use check_err from test_util.hpp.
      
      * Split convnd test into separate files for each dim.
      
      * Fix data generation and use proper instances.
      
      * Formatting
      
      * Skip tensor initialization if not necessary.
      
      * Fix CMakefiles.
      
      * Remove redundant conv2d_fwd test.
      
      * Lower problem size for conv3D UT.
      
      * 3D case for convnd example.
      
      * Remove leftovers after merge.
      
      * Add Conv Specialization string to GetTypeString
      
      * Skip instance causing numerical errors.
      
      * Small fixes.
      
      * Remove redundant includes.
      
      * Fix namespace name error.
      
      * Script for automatic testing and logging convolution fwd UTs
      
      * Comment out numactl cmd.
      
      * Refine weights initalization and relax rtol for fp16
      
      * Fix weights initialization for int8.
      
      * Add type_convert when store output in ref conv 1D.
      
      * Get back old conv2d_fwd_xdl operation.
      
      * Silence conv debug print.
      
      * format
      
      * clean
      
      * clean
      
      * Fix merge.
      
      * Fix namespace for check_err
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      f91579aa
    • Chao Liu's avatar
      clean (#143) · 22061366
      Chao Liu authored
      22061366
  17. 22 Mar, 2022 2 commits
    • zjing14's avatar
      Grouped GEMM for fp16 (#126) · 716f1c7f
      zjing14 authored
      * init of grouped_gemm
      
      * 2 gemm test
      
      * perf test
      
      * clean
      
      * wrap desc into a struct
      
      * test cast static_arr to pointer
      
      * add ptr to GemmDesc
      
      * add grouped gemm profiler
      
      * fixed mem issue with unique_ptr
      
      * clean
      
      * clean
      
      * finished ckprofiler
      
      * Update README.md
      
      * readme
      
      * fixed readme
      
      * add example
      
      * improve code
      
      * fixed comments: reserve, seperate ptr and gemm_shapes
      
      * merge group and non-group
      
      * fixed comments: replace push_back with emplace_back to avoid copy constructor
      
      * fixed comments: unified blk2ctile; add test
      
      * ci fix
      
      * fixed ci
      
      * fixed ci
      
      * fixed ci
      716f1c7f
    • Qianfeng's avatar
      Reduction for int8 and bfloat16 (#125) · 9a8ee8a3
      Qianfeng authored
      
      
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Add support for bfp16 reduction (using bhalf_t = ushort)
      
      * Tiny fix in amd_buffer_addressing.hpp
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Use AccDataType for Beta value and use element_wise::PassThrough
      
      * Use type_convert for type converting in host layer reduction
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      
      * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
      
      * Update to testing scripts to add bf16 support
      
      * added more static_assert
      
      * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
      
      * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
      
      * minor change
      
      * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
      
      * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
      
      * Tiny fix in script/profile_reduce_no_index.sh
      
      * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
      
      * Generic renaming in host reduction and DeviceReduce layer
      
      * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
      
      * Use multi-thread and simplification for host Reduction implementation
      
      * Add ctest for reduction
      
      * Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/
      
      * Update to the reduce CTest executables to enable default testing behavior when no command argument
      
      * Renaming
      Co-authored-by: default avatarJianfeng yan <jfyan008@gmail.com>
      9a8ee8a3
  18. 21 Mar, 2022 2 commits
  19. 11 Mar, 2022 1 commit
  20. 10 Mar, 2022 1 commit
    • Qianfeng's avatar
      Pr82 followup (#115) · 827301d9
      Qianfeng authored
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      827301d9
  21. 09 Mar, 2022 1 commit
    • Chao Liu's avatar
      Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obselete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
      5d37d7bf