1. 09 Sep, 2022 1 commit
  2. 30 Aug, 2022 2 commits
    • Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * GEMM + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      - All ReductionAccData and R<N>DataType have to be F32. The INT32 data
      type is giving wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
    • Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
  3. 25 Aug, 2022 1 commit
  4. 24 Aug, 2022 1 commit
  5. 23 Aug, 2022 1 commit
  6. 18 Aug, 2022 1 commit
  7. 13 Aug, 2022 4 commits
    • Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix bug when ThreadSliceK > 1
      
      * Fix bug when merging two empty sets
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two-dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Skip lds of b matrix (#326) · 10b3278b
      ltqin authored
      * start
      
      * read for gridwise gemm
      
      * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1
      
      * add thread copy desc and register buffer
      
      * add K0PerBlock dim
      
      * add read global data
      
      * finish gridwise gemm
      
      * finish blockwise gemm
      
      * add print data
      
      * add smallest config
      
      * add compare code for gridwise gemm
      
      * fix NXdlPerWave
      
      * fix k0perthread and gridwise gemm main loop
      
      * remove b matrix lds alloc
      
      * fix name
      
      * add test code
      
      * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter
      
      * add double register
      
      * modify b_thread_desc_
      
      * add float
      
      * fp16 tag
      
      * add tail for pipeline
      
      * finish main loop
      
      * optimize main loop
      
      * start clear gridwise gemm
      
      * clear code
      
      * clear redundant code
      
      * change file name
      
      * change file name
      
      * fix bug after merge develop
      
      * fix input parameters
      
      * using MultiK0 control b load data loop
      
      * fix some config
      
      * 4 buffer
      
      * fix bug
      
      * one can use
      
      * change read order
      
      * change buffer array to tuple
      
      * change to 8 buffer
      
      * interleave buffer load
      
      * change to 16
      
      * read 8 buffer
      
      * add data buffer to template
      
      * fix after merge develop (header file)
      
      * format
      
      * change to 4 buffer
      
      * remove unnecessary lambda fun
    • Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two-dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwise softmax v1
      
      * iteratively update softmax+gemm
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  8. 11 Aug, 2022 1 commit
    • Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504
      Po Yen Chen authored
      * Add always_false<> util to delay symbol resolution
      
      * Use always_false<> to avoid instantiating unwanted methods
      
      * Add new specializations of AddAddFastGelu::operator() method
      
      * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
      
      * Use floating point literal to simplify code
      
      * Remove unnecessary capture in lambda expressions
      
      * Extract fast GeLU calculation as standalone method
      
      * Mark methods as 'constexpr'
      
      * Add constraint for HostTensorDescriptor templated ctors
      
      * Simplify HostTensorDescriptor ctor calls
      
      * Add C++23 std::size_t literal suffix
      
      * Use _uz suffix to shorten example code
      
      * Remove unnecessary conversion to std::array<>
      
      * Re-order include directives
      
      * Remove C-style casting by literal suffix
      
      * Remove unnecessary statements in main()
      
      * Remove unused type parameter of always_false<>
      
      * Remove unused include directive
      
      * Exit main() by returning meaningful value
      
      * Use 'if constexpr' to switch example flow
      
      * Use std::is_same_v<> to shorten example code
      
      * Add 'inline' specifier to literal functions
      
      * Unify output methods in example
      
      * Move common codes into .inc file
      
      * Add type check in type_convert<>()
      
      * Add type_convert<float>() before computation
      
      * Merge AddAddFastGelu method specializations
      
      * Remove always_false<>
      
      * Add constraint to AddAddFastGelu::operator() parameter types
  9. 02 Aug, 2022 1 commit
    • CGEMM examples bf16, fp32, int8 (#332) · fb0dc358
      Adam Osewski authored
      
      
      * Add int8 specialization for elementwise Add and Subtract.
      
      * CGEMM examples bf16, fp32, int8
      
      * Add convert reference output to CDataType.
      
      * Skip BF16 data type during testing.
      
      * Lower K value to get rid of accumulation error.
      
      * Fix merge artifact.
      
      * Fix changed function name: GetElementSpaceSize()
      
      * Fix merge artifact.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
  10. 29 Jul, 2022 1 commit
    • Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
  11. 07 Jul, 2022 1 commit
    • N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393
      Chao Liu authored
      * adding contraction
      
      * add contraction example
      
      * update example
      
      * update example
      
      * format
      
      * update readme
      
      * clean header
      
      * clean header
      
      * contraction with multiple D
      
      * rename
      
      * fix naming issue; add instances for contraction+bilinear
      
      * change assumed virtual layout of contraction; add client example
      
      * update example
      
      * update
      
      * contraction+scale
      
      * use type_convert
      
      * rename
  12. 02 Jul, 2022 1 commit
    • Gemm+Bilinear (#316) · 9e4429f9
      Chao Liu authored
      * refactor
      
      * update example
      
      * update example
      
      * gemm bilinear
      
      * clean
      
      * update
  13. 01 Jul, 2022 1 commit
    • Single-kernel GEMM + layernorm (#263) · 63fd5da6
      Anthony Chang authored
      
      
      * dump lds content in appropriate precision type
      
      * add squared add reduction op; allows sq sum
      
      * initial stub from regular gemm impl
      
      * layernorm example code & host verification
      
      * initial layernorm implementation
      
      * tidy up
      
      * make C0 precision type consistent with C
      
      * clang-tidy and additional comments
      
      * tighten up example code
      
      * account for extra flops/bytes from normalization
      
      * clang-format
      
      * c0 bias/beta/gamma now have their own precision type
      
      * AccElemOp for gemm outputs prior to feeding to layernorm
      
      * update workgroup mapping
      
      * rename kernel template param to reflect its dual use
      
      * use LDS mem pool for reduction workspace
      
      * change cshuffle precision type to f16; clean up
      
      * clang-format
      
      * correct naming
      
      * explicit cast
      
      * fully implemented gemm + bias + activation + add + norm
      
      * activation in correct order
      
      * reflect reduction API's recent change
      
      * amend
      
      * clean up; add comment
      
      * keep up with recent changes in reduction API
      
      * format
      
      * resolve merge conflicts
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  14. 30 Jun, 2022 1 commit
    • Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d
      Anthony Chang authored
      * use 'sweep once' softmax kernel where applicable
      
      * threadwise copy's dst buffer can specify invalid element value
      
      * add int8 in/out float compute softmax support
      
      give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
      
      * format
      
      * softmax inherits DeviceNormalization
      
      * softmax profiler stub
      
      * tighten up reference softmax interface
      
      * example prints tensor dimension
      
      * add fp32 to softmax profiler
      
      * rename header
      
      * hook with ckProfiler
      
      * format
      
      * resolve merge conflict
      
      * resolve merge conflicts
      
      * update normalization profiler help string
      
      * resolve conflict
      
      * typo
      
      * remove residual
      
      * softmax profiler: address feedback
      
      * test for mixed precision input/output
      
      * fully qualify ck::math::isnan
      
      * add comment for device normalization interface
      
      * revise wording
      
      * constness for alpha/beta scaler pointer
  15. 25 Jun, 2022 2 commits
    • add license in file (#303) · d3051d75
      Chao Liu authored
    • Absolute include path (#281) · d1db6a0c
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add client app example
      
      * update readme
      
      * delete obsolete files
      
      * remove old client app
      
      * delete old file
      
      * cleaning
      
      * clean
      
      * remove half
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path for all examples
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * revert client app example
      
      * clean build
      
      * fix build
      
      * temporarily disable client test on Jenkins
      
      * clean
      
      * clean
      
      * clean
  16. 21 Jun, 2022 1 commit
    • Standalone softmax kernel (#284) · 15c89e81
      Anthony Chang authored
      * initial stub for standalone softmax
      
      * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
      
      * host softmax validates
      
      * compiles; to implement beta scaling
      
      * use NaN trick to efficiently ignore OOB values during sum of exponentials
      
      * freeload device_reduce's utility functions
      
      * clean up interface
      
      * adding prior value (beta scaling)
      
      * remove restriction related to perf considerations
      
      * apply clang-format
      
      * clean; disable diagnostics
      
      * resolve conflicts
      
      * add exp wrapper
      
      * honor HostTensorDesc interface; allow implicit cast from different vector<T> type
      
      * test softmax for fp16/fp32
      
      * update readme
      
      * amend commit NaN trick
      
      * remove redundant param added during development
      
      * format
      
      * replace ScalarDataType with AccDataType
      
      * separate out test programs by precision type
      
      * move softmax sample code to its own folder
      
      * format
      
      * keep up with recent changes in reduction API
      
      * remove extra header
  17. 19 Jun, 2022 1 commit
    • GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add comment
      
      * use type_convert
      
      * clean
      
      * clean element wise op
  18. 17 Jun, 2022 1 commit
    • Regulate reduction accumulator operations and Element-wise operations (#274) · 1f543bfa
      Qianfeng authored
      * Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces
      
      * Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
      
      * Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
      
      * Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
      
      * Use struct-scope operator template instantiation for binary and unary element-wise operations
      
      * Change a few more elementwise operations to use template for operator()
      
      * Tiny correction in Normalize operator
      
      * Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations
      
      * Correction in some examples with regard to using ReduceAccDataType
      
      * Use static_assert for UnaryDivide
      
      * Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
      
      * Tiny fix with regard to SetWorkSpacePointer()
  19. 02 Jun, 2022 1 commit
    • Unify the naming of the math functions used by the host and kernel (#262) · 86185bd7
      Qianfeng authored
      * Use the unified naming for math functions on host and HIP kernel
      
      * Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
      
      * Renaming GetReductionZeroVal() to GetIdentityValue()
      
      * Tiny renaming in profile_reduce_impl.hpp
      
      * More renaming in profile_reduce_impl.hpp
      
      * Replace zeroVal by identityVal
      
      * Remove ck_ prefix in the naming of ck::math provided functions
  20. 26 May, 2022 1 commit
    • Add FP64 XDL GEMM built-in function (#199) · 3e6c2610
      ltqin authored
      
      
      * add intrin_mfma_f64_16x16x4f64
      
      * add example
      
      * gemm reference add double data type
      
      * change init data
      
      * fix M N PerXdlops
      
      * fix ifdef
      
      * add comparison config
      
      * add conv fwd example
      
      * format log out
      
      * change rc matrix register layout
      
      * reorganize example
      
      * reorganize example 2
      
      * format, because of merging develop
      
      * fix call impl adding acc data type
      
      * lost ;
      
      * add compiler warning
      
      * change example tuning parameters
      
      * add test for fp64
      
      * add instance
      
      * add test/gemm/gemm_fp64.cpp
      
      * fix get name issue
      
      * remove some tuning parameter
      
      * fix conflict
      
      * format
      
      * use integer value for GEMM test
      
      * add acc data type
      
      * remove typeid because fp16
      
      * fix streamconfig etc bug from merging develop
      
      * format
      
      * remove test_gemm_xdl_fp64
      
      * add AccDataType
      
      * AccDataType problem
      Co-authored-by: qinletao <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  21. 25 May, 2022 1 commit
  22. 24 May, 2022 2 commits
    • Navi21 gemm (#197) · 40b59a63
      Jianfeng Yan authored
      
      
      * start adding navi21 GEMM
      
      * navi_gemm_km_kn_mn_fp32 compiles and passes one test.
      
      * rename variables and functions in gridwise_gemm_dlops_v1r3
      
      * add other 3 layouts; format instance
      
      * adding more tuning parameters
      
      add tuning parameters for other 3 layouts
      
      * add gemm_dlops_f16
      
      * tmp
      
      * add dependence of DeviceGemm::IsSupportedArg() on arch
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * push gemm_dlops into profiler
      
      * minor changes
      
      * if using xdl or dlops is moved into profiler_gemm_impl
      
      * minor changes
      
      * minor changes
      
      * remove is_xdl from profile_gemm_impl
      
      * make IsSupportedArg dependent on arch for other device_gemm
      
      * minor changes
      
      * minor changes
      
      * fix a bug in f_generate_tensor_value
      
      * add 64x64x64 for gemm_dlops_int8
      
      * add 64x64x64 for gemm_dlops_int8
      
      * comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
      
      * fix
      
      * start fixing tuning parameters
      
      * minor
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * fixing
      
      * adding example
      
      * adding example
      
      * adding example
      
      * add gemm fp32 example
      
      * clean up
      
      * use 128x128x16 as MNK tile in navi21 gemm example
      
      * bug fix
      
      * fix test
      
      * use new block c tile
      
      * clean
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: shaojiewang <wsjmessi@163.com>
    • Overhaul to Reduction and its dependants (#237) · 63eee2d9
      Qianfeng authored
      * Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
      
      * Update to host layer and host reduction
      
      * Merge and remove reduction kernels
      
      * Merge and remove reduction device interfaces and update pooling device interface
      
      * Merge and remove useless reduction device instances
      
      * Update to reduction profiler and reduction ctests
      
      * Update to reduction and pooling examples and add one reduction example
      
      * Change reduction examples to make them testable by ctest
      
      * Add explicit pass checking for reduction and pooling examples
      
      * Explicit assignment of tensor shapes in example reduce_blockwise_two_call
      
      * Use atomic_add to replace atomicAdd and add atomic_add for double type
      
      * Add reduce ctest support for double data type
      
      * Replace to_int_vector() by using c++ std::vector::assign()
      
      * Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
      
      * Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
      
      * Add GetAtomicOperationZeroValue() support for AtomicMax
      
      * Tiny change to reduce example README.md
      
      * Fix some tiny issues due to branch merging
      
      * Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
      
      * Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
      
      * Renaming
      
      * Clean the header includes in device_reduce instances header files
  23. 20 May, 2022 2 commits
    • Refactor block to C tile map (#235) · a054f7d6
      Anthony Chang authored
      * refactor block-to-ctile-map
      
      * gridwise gemm block2ctile generic validity check
      
      * format
      
      * amend split-k gemm block2ctile map refactor
      
      * add test
      
      * format
      
      * amend
      
      * revert to calculating batch index in kernel instead of passing as block_id_z
      
      * move file
      
      * add valid ctile index check to gridwise v2r4
    • Gemm reduce max (#209) · 0ffe956a
      rocking5566 authored
      
      
      * [What] Rename the example
      [Why] Prepare to add unary reduction
      
      * Add global operation to the parameter
      
      * Add atomicmax
      
      * Fix compile error
      
      * Support atomicMax (hip library)
      
      * Rename the reduction example
      
      * Fix target name
      
      * use p_d1_grid as the indicator directly
      
      * Prevent performance issue. Let passthrough handle it.
      
      * Implement the function template and specialize it for float2
      
      * No need to separate into two lines
      
      * Remove empty line
      
      * add comment
      
      * Fix compile error due to merge from develop
      
      * make the implementation of atomic_max / atomic_add explicit for each datatype
      
      * Refine typo
      
      * For future CI test
      
      * Fix compiler error in ckProfiler
      
      * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
      
      * simply use remove_pointer
      
      * Rename type and var
      
      * Refine example
      
      * Modify reducemax example
      
      * Fix bug in reduction
      
      * Change initialize range
      
      * Implement F64 version of atomicMax
      
      * Move reduction  code together
      
      * Add buffer atomic_max
      
      * Fix coding style by clang-format
      
      * Integrate new api of DeviceGemmReduce_Xdl_CShuffle
      
      * Integrate Batch gemm reduction
      
      * Fix example
      
      * fix example
      
      * clean up
      
      * Fix batch gemm tensor operation
      
      * Fix coding style
      
      * Fix template argument
      
      * Fix clang format
      
      * Keep the flexibility of a different stride for each D tensor
      
      * Fix compile error for ckProfiler
      
      * Fix typo
      
      * [What] Fix naming
      [Why] Prepare to add out elementop
      
      * Add DoutElementOp
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: rocking <chunylai@amd.com>
      0ffe956a
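The atomic max bullets in the commit above (explicit `atomic_max` per data type, an F64 version) describe a pattern that is commonly built from a compare-and-swap retry loop when no native instruction exists for a type. The following is a minimal host-side sketch of that general technique using `std::atomic`; it is not the CK or HIP implementation, and the name `atomic_max_sketch` is hypothetical.

```cpp
#include <atomic>

// Hypothetical helper, not part of CK or HIP: atomically replace the stored
// value with max(stored, value) using a compare-and-swap retry loop.
// Returns the value last observed in `target` before the call completed.
template <typename T>
T atomic_max_sketch(std::atomic<T>& target, T value)
{
    T observed = target.load(std::memory_order_relaxed);

    // Stop as soon as the stored value is already >= value. Otherwise try to
    // swap in `value`; on failure compare_exchange_weak refreshes `observed`
    // with the current stored value and the condition is re-checked.
    while(observed < value &&
          !target.compare_exchange_weak(observed, value, std::memory_order_relaxed))
    {
    }

    return observed;
}
```

Calling `atomic_max_sketch(x, 3.5)` on a `std::atomic<double>` holding `1.0` leaves `3.5` stored, and a later call with `2.0` leaves it unchanged. The same loop shape works for any type that `std::atomic` supports and that has `operator<`.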
  24. 19 May, 2022 1 commit
    • elementwise op (#238) · aafc3ac2
      rocking5566 authored
      
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim. Prepare to support multiple dimensions
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be a multiple of the number of CUs
      
      * Move thread per block to the parameter of constructor
      
      * rename threadPerBlock with blockSize
      
      * Support double
      
      * rename kernel function name
      
      * remove redundant include header
      
      * Refine type
      
      * Need to the final dimension
      
      * Refine variable name
      
      * Refine type
      
      * Use index_t instead of int in API
      Co-authored-by: rocking <chunylai@amd.com>
      aafc3ac2
  25. 09 May, 2022 2 commits
    • Resolution of issue #153: Add compiler warning on comparing int and size_t (#212) · f03a1738
      myamlak authored
      
      
      * Turning compare warnings on
      
      * Cleaning part I
      
      * Cleaning part II
      
      * Explicit static_cast to ck::type_convert
      
      * Resolving large tensor size issue.
      
      * format
      
      * revert change to tensor descriptor; promote ElementSpaceSize to 64bit
      
      * use integer value for GEMM test
      
      * Review remarks
      
      * Review remarks + issues with (un)signed arithmetic
      
      * Format fix
      
      * Format
      
      * Clang-format.
      
      * fix 2gb limit issue
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      f03a1738
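The commit above concerns two related hazards: comparing a signed `int` against an unsigned `size_t`, and element counts that overflow 32 bits (the "2gb limit"). The sketch below illustrates both in plain standard C++, assuming nothing about CK's actual code; `index_in_range` and `element_count` are hypothetical names, and the commit itself uses `ck::type_convert` rather than the `static_cast` shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: compare a signed index against an unsigned size
// without relying on implicit conversion. In a mixed comparison such as
// `-1 < size`, the -1 is converted to a huge unsigned value, so the
// comparison is false; this is what the compiler warning flags.
inline bool index_in_range(std::int64_t index, std::size_t size)
{
    // Reject negative indices first, which makes the cast below lossless.
    return index >= 0 && static_cast<std::size_t>(index) < size;
}

// Hypothetical helper: accumulate the element count of a tensor in 64 bits
// so that shapes with more than 2^31 - 1 elements do not overflow.
inline std::int64_t element_count(const std::vector<std::int32_t>& lengths)
{
    std::int64_t count = 1;
    for(std::int32_t len : lengths)
    {
        // Widen each length before multiplying; the product stays 64-bit.
        count *= static_cast<std::int64_t>(len);
    }
    return count;
}
```

For a shape of `{65536, 65536}`, `element_count` yields 4294967296, which does not fit in a 32-bit integer.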
    • Code refactor (#175) · ec7c2e91
      Chao Liu authored
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * clean up
      
      * clean up
      
      * clean up
      ec7c2e91
  26. 22 Apr, 2022 1 commit
  27. 21 Apr, 2022 1 commit
    • Use ck::half_t for Host Reduction (#195) · c1ef7319
      Qianfeng authored
      * Add math functions for host
      
      * Change host reduction to use ck::math
      
      * Remove the use of half_float::half and half.hpp from reduction example/profiler/ctest
      c1ef7319
  28. 15 Apr, 2022 1 commit
    • Compile CK for all targets (#188) · 4221505d
      Illia Silin authored
      
      
      * compile ck for all targets
      
      * update the target criteria
      
      * change the target condition
      
      * fixed some typos
      
      * fixed missed file
      
      * revert changes in README
      
      * revert device_conv3d_fwd_xdl_...
      
      * update device_conv3d_fwd_xdl_...
      
      * update device_batched_gemm_reduce...
      
      * test the unused arguments fix
      
      * test the warning suppression
      
      * try suppress warnings in device_batched_gemm_reduce_xdl...
      
      * fix the last warnings
      
      * replace UNUSED with std::ignore
      
      * fix a typo
      
      * replaced std::ignore with ignore
      
      * add ignore header to common_header
      
      * refactor atomicAdd
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      4221505d
  29. 31 Mar, 2022 1 commit
    • Compile for gfx908 and gfx90a (#130) · cd167e49
      Chao Liu authored
      * adding compilation for multiple targets
      
      * fix build
      
      * clean
      
      * update Jenkinsfile
      
      * update readme
      
      * update Jenkins
      
      * use ck::half_t instead of ushort for bf16
      
      * rename enum classes
      
      * clean
      
      * rename
      
      * clean
      cd167e49
  30. 24 Mar, 2022 1 commit
    • Gemm+Reduce Fusion (#128) · f95267f1
      Chao Liu authored
      * add gridwise gemm v4r1
      
      * rename
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * use sfc in shuffling
      
      * remove hardcode
      
      * remove hardcode
      
      * refactor
      
      * fix build
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * format
      
      * clean
      
      * adding gemm+reduce
      
      * adding profiler for gemm+reduce
      
      * adding gemm+reduce profiler
      
      * fix build
      
      * clean up
      
      * gemm+reduce
      
      * fix build
      
      * update DeviceGemm_Xdl_CShuffle; update enum to enum class
      
      * clean up
      
      * add test for gemm+reduce
      
      * clean up
      
      * refactor
      
      * fix build
      
      * fix build
      f95267f1
  31. 22 Mar, 2022 1 commit
    • Reduction for int8 and bfloat16 (#125) · 9a8ee8a3
      Qianfeng authored
      
      
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction
      
      * Replace ReduceDims with NumReduceDims as the Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Add support for bfp16 reduction (using bhalf_t = ushort)
      
      * Tiny fix in amd_buffer_addressing.hpp
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Use AccDataType for Beta value and use element_wise::PassThrough
      
      * Use type_convert for type converting in host layer reduction
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      
      * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
      
      * Update to testing scripts to add bf16 support
      
      * added more static_assert
      
      * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
      
      * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
      
      * minor change
      
      * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
      
      * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
      
      * Tiny fix in script/profile_reduce_no_index.sh
      
      * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
      
      * Generic renaming in host reduction and DeviceReduce layer
      
      * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
      
      * Use multi-threading and simplify the host Reduction implementation
      
      * Add ctest for reduction
      
      * Update to clarify the use of the data init method in produce_reduce/example_reduce/test_reduce/
      
      * Update the reduce CTest executables to enable default testing behavior when no command-line argument is given
      
      * Renaming
      Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
      9a8ee8a3
  32. 09 Mar, 2022 1 commit
    • Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obsolete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
      5d37d7bf