1. 21 Jun, 2022 1 commit
  2. 19 Jun, 2022 1 commit
    • GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add comment
      
      * use type_convert
      
      * clean
      
      * clean element wise op
  3. 17 Jun, 2022 5 commits
    • Don't look up the /sys/module/amdgpu/version file. (#287) · e4584d91
      Illia Silin authored
      
      
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
      
      * dos2unix for run_perf_tests.sh
      
      * try the new git credentials
      
      * use env var for git credentials
      
      * don't look up /sys/module/amdgpu/version
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Regulate reduction accumulator operations and Element-wise operations (#274) · 1f543bfa
      Qianfeng authored
      * Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces
      
      * Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
      
      * Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
      
      * Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
      
      * Use struct-scope operator template instantiation for binary and unary element-wise operations
      
      * Change a few more elementwise operations to use template for operator()
      
      * Tiny correction in Normalize operator
      
      * Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations
      
      * Correction in some examples with regard to using ReduceAccDataType
      
      * Use static_assert for UnaryDivide
      
      * Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
      
      * Tiny fix with regard to SetWorkSpacePointer()
    • Shaojie WANG's avatar
      63cdd923
    • add p_workspace to baseargument (#275) · c7a96ed5
      ltqin authored
    • Gemm + bias + relu + add + layernorm (#272) · 6eb55499
      rocking5566 authored
      * Copy "gemm reduce" to "gemm bias add reduce"
      
      * Implement gemm bias add reduction
      
      * Fix compiler error due to merge from develop
      
      * Add tensor operation for gemm + bias + add + reduce
      
      * Add gemm_bias_add_reduce to ckProfiler
      
      * Add c1 functor
      
      * Refine type
      
      * Use reduceAccDataType instead of explicitly float
      
      * Change to use check_err()
      
      * Do relu in float32 instead of bhalf_t, because bhalf_t is unsigned
      
      * Refactor relu, using type_trait instead of overloading
      
      * Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
      
      * Fix denominator
      
      * Refine naming
      
      * Fix denominator in host
      
      * Remove useless include header
      
      * Use AccDataType
      
      * Fix static_cast order
      
      * Refine type
      
      * [What] Remove tuple type in the base class
      [Why] The external API depends on the base class. If the base class is tied to a type, we will need many classes for different types
  4. 16 Jun, 2022 2 commits
    • example for convnd bwd weight bf16 splitk (#265) · 561ec12f
      Shaojie WANG authored
      * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
      
      * add bwd weight for bf16: init
      
      * remove redundant compute
      
      * use datatype and split k to check whether a workspace is used
      
      * remove unused computation for work space size
      
      * add some code for bfp16
      
      * add device/grid unary op
      
      * add unary type convert to bwd-weight example
      
      * support bf16 splitk kernel for convnd bwd weight
      
      * 1. remove comments. 2. add checkvalidity. 3. add gridsize computation
      
      * add workspace size check
      
      * fix format
      
      * change function name
    • Use new github credentials (#278) · fb9b6b1e
      Illia Silin authored
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
      
      * dos2unix for run_perf_tests.sh
      
      * try the new git credentials
      
      * use env var for git credentials
  5. 10 Jun, 2022 1 commit
    • Add performance tests on MI200 in CI, reporting number of CUs, add stand-alone perf test. (#277) · 1ced00a5
      Illia Silin authored
      * use pre-built docker instead of building a new one
      
      * try docker.image.pull
      
      * change syntax in docker.image()
      
      * add 30 min timeout
      
      * increase timeout to 3 hours
      
      * move performance tests to first stage for testing
      
      * set image variable to the new container name
      
      * update image name
      
      * check available images
      
      * check available images in both places
      
      * try different image name
      
      * use image ID to refer to image
      
      * run performance on gfx90a
      
      * fix the gpu_arch labeling, add parameter
      
      * move env vars out of stages
      
      * add stand-alone performance script, MI200 tests, CU numbers
  6. 02 Jun, 2022 3 commits
  7. 31 May, 2022 3 commits
    • Pass gemm_descs for grouped gemm via __constant__ buff (#232) · b6eaf3eb
      zjing14 authored
      * moved gemm_descs_args into const buff
      
      * use CK_CONSTANT_ADDRESS_SPACE instead of global constant
      
      * clean
      
      * moved hipMemAlloc outside of deviceOp
      
      * add SetWorkSpacePointer
      
      * fix ignore
    • Multi-kernel CGEMM (#230) · 7b1e2c37
      myamlak authored
      * Reference CGEMM + test stub
      
      * Format.
      
      * Incomplete simple implementation
      
      * Library instances
      
      * Sketch of tests
      
      * Test fixes.
      
      * Example added
      
      * Cosmetics
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim. Prepare to support multiple dimensions
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Second auxiliary buffer added
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be the multiple number of CU
      
      * Move thread per block to the parameter of constructor
      
      * Consuming binary ops to do A+B / A-B
      
      * Fix + cosmetics + bf16 test commented out temporarily
      
      * Format
      
      * Enabling bf16 test
      
      * Revert "Enabling bf16 test"
      
      This reverts commit f497e2ba.
      
      * Fix + test reenabled
      
      * fix build
      
      * Revert "fix build"
      
      This reverts commit d7310238.
      
      * post PR #235 merge fix
      
      * amend
      
      * Single workspace for cgemm + helper
      
      * Perf calc fix
      
      * Review remarks: static_cast
      
      * Review remarks: binary ops templated
      
      * Cleaning
      
      * Removal of instances and their tests
      
      * Review remarks from aosew addressed
      
      * Review remark: unnecessary attribute
      
      * Post-merge fixes
      
      * Restrict 4gemm to PassThrough + bug fix
      
      * Review remarks
      
      * update licence
      
      * change cgemm example to fp16
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
    • Minor fix for recent PR (#260) · 85fc91c3
      Chao Liu authored
      * fix example
      
      * update IsSupportedArgument
      
      * fix
      
      * disable fp64 conv example as test
  8. 30 May, 2022 1 commit
    • gemm + layernorm (#261) · d32a67a9
      rocking5566 authored
      * Implement reduction mean and reduction square mean
      
      * Refine file name
      
      * Add reduce mean and square mean
      
      * Fix parameter name
      
      * Add normalize device op (not implement invoker::run())
      
      * Remove epsilon
      
      * Refine deviceop
      
      * Add 5ary elementwise for normalization
      
      * Add layernorm example
      
      * layerNorm verification
      
      * Fix compiler error due to merge from develop
      
      * Fix typo
      
      * Fix compile error
      
      * Refine naming
      
      * [What] Support non pointer for invoker and argument
      [Why] Sync coding style with gemm
      
      * Refine folder name
      
      * Refine class name
      
      * Evaluate perf of the kernel
      
      * Fix compile error
      
      * [What] Refine perf evaluation in example of gemm + reduction
      [Why] Evaluation of gemm + reduction may cause verification to fail, because evaluation does not initialize global memory
      
      * clang-format
  9. 27 May, 2022 1 commit
    • Fixing conv bug (#258) · 91d8b7d6
      Chao Liu authored
      
      
      * debugging conv
      
      * fix oversight where ctile map is constructed before initializing c desc
      
      * example program should return error code
      
      * clean up
      
      * changed Block2CTileMap in conv2d and convnd
      
      * clean up
      
      * clean up
      
      * cleanup
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
  10. 26 May, 2022 2 commits
    • Add FP64 XDL GEMM built-in function (#199) · 3e6c2610
      ltqin authored
      
      
      * add intrin_mfma_f64_16x16x4f64
      
      * add example
      
      * gemm reference add double data type
      
      * change init data
      
      * fix M N PerXdlops
      
      * fix ifdef
      
      * add comparison config
      
      * add conv fwd example
      
      * format log out
      
      * change rc matrix register layout
      
      * reorganize example
      
      * reorganize example 2
      
      * format,because merge develop
      
      * fix call impl adding acc data type
      
      * lost ;
      
      * add compiler warning
      
      * change example tuning parameters
      
      * add test for fp64
      
      * add instance
      
      * add test/gemm/gemm_fp64.cpp
      
      * fix get name issue
      
      * remove some tuning parameters
      
      * fix conflict
      
      * format
      
      * use integer value for GEMM test
      
      * add acc data type
      
      * remove typeid because fp16
      
      * fix streamconfig etc bug from merging develop
      
      * format
      
      * remove test_gemm_xdl_fp64
      
      * add AccDataType
      
      * AccDataType problem
      Co-authored-by: qinletao <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Add pooling example (#257) · 97c4d486
      Qianfeng authored
      * Add example for computing LayerNorm mean and meansquare
      
      * Refactor the pool2d_fwd example and add example for float type testing
      
      * Revert "Add example for computing LayerNorm mean and meansquare"
      
      This reverts commit df52e6f9d897b00c981baa48f291450bcd60925d.
      
      * Tiny fix in pool2d_fwd_common.hpp
  11. 25 May, 2022 3 commits
    • Hotfix binary elementwise (for broadcast on fastest axis) (#254) · 82d7d993
      rocking5566 authored
      
      
      * Support different length of ScalarPerVector
      
      * Add example of broadcast on fastest axis
      
      * Typo
      
      * Refine fastest example
      
      * Add dimension check
      
      * Modify fastest broadcast example to 3d
      
      * Enforce that users give scalarPerVector explicitly
      
      * 1. Add CScalarPerVector
      2. Broadcast on the fastest axis is not the only case that needs scalarPerVector set to 1
      
      * Rename var
      
      * Move IsScalarPerVectorValid() inside IsSupportedArgument()
      
      * Separate GridDesc_M0 into A, B and C
      
      * rename var
      
      * Rename var of length
      Co-authored-by: rocking <chunylai@amd.com>
    • Tensile-style block to C tile map (#239) · e579c9e5
      Anthony Chang authored
      * fix build
      
      * Revert "fix build"
      
      This reverts commit d7310238.
      
      * post PR #235 merge fix
      
      * amend
      
      * adds tensile-style c-tile map
      
      * make it dynamic version
      
      * add k-split flavor tile map
      
      * apply tensile-style tile map to all xdl gridwise gemms
      
      * remove dead code
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • minor fix for recent PR (#255) · 61851ae2
      Chao Liu authored
      * minor fix
      
      * clean
  12. 24 May, 2022 4 commits
    • Navi21 gemm (#197) · 40b59a63
      Jianfeng Yan authored
      
      
      * start adding navi21 GEMM
      
      * navi_gemm_km_kn_mn_fp32 compiles and passes one test.
      
      * rename variables and functions in gridwise_gemm_dlops_v1r3
      
      * add other 3 layouts; format instance
      
      * adding more tuning parameters
      
      add tuning parameters for other 3 layouts
      
      * add gemm_dlops_f16
      
      * tmp
      
      * add dependence of DeviceGemm::IsSupportedArg() on arch
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * push gemm_dlops into profiler
      
      * minor changes
      
      * if using xdl or dlops is moved into profiler_gemm_impl
      
      * minor changes
      
      * minor changes
      
      * remove is_xdl from profile_gemm_impl
      
      * make IsSupportedArg dependent on arch for other device_gemm
      
      * minor changes
      
      * minor changes
      
      * fix a bug in f_generate_tensor_value
      
      * add 64x64x64 for gemm_dlops_int8
      
      * add 64x64x64 for gemm_dlops_int8
      
      * comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
      
      * fix
      
      * start fixing tuning parameters
      
      * minor
      
      * minor changes
      
      * minor changes
      
      * minor changes
      
      * fixing
      
      * adding example
      
      * adding example
      
      * adding example
      
      * add gemm fp32 example
      
      * clean up
      
      * use 128x128x16 as MNK tile in navi21 gemm example
      
      * bug fix
      
      * fix test
      
      * use new block c tile
      
      * clean
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: shaojiewang <wsjmessi@163.com>
    • Overhaul to Reduction and its dependants (#237) · 63eee2d9
      Qianfeng authored
      * Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
      
      * Update to host layer and host reduction
      
      * Merge and remove reduction kernels
      
      * Merge and remove reduction device interfaces and update pooling device interface
      
      * Merge and remove useless reduction device instances
      
      * Update to reduction profiler and reduction ctests
      
      * Update to reduction and pooling examples and add one reduction example
      
      * Change to reduction examples to let them testable by ctest
      
      * Add explicit pass checking for reduction and pooling examples
      
      * Explicit assignment of tensor shapes in example reduce_blockwise_two_call
      
      * Use atomic_add to replace atomicAdd and add atomic_add for double type
      
      * Add reduce ctest support for double data type
      
      * Replace to_int_vector() by using c++ std::vector::assign()
      
      * Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
      
      * Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
      
      * Add GetAtomicOperationZeroValue() support for AtomicMax
      
      * Tiny change to reduce example README.md
      
      * Fix some tiny issues due to branch merging
      
      * Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
      
      * Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
      
      * Renaming
      
      * Clean up the header includes in device_reduce instance header files
    • Add performance tests as a stage of CI. (#247) · 1085794d
      Illia Silin authored
      * modify ckProfiler_gemm output
      
      * fix syntax
      
      * change ckProfiler output and return 0
      
      * fix syntax
      
      * output datatype
      
      * fix syntax
      
      * output datatype in another way
      
      * fix syntax
      
      * fix syntax
      
      * test return values of ckProfiler
      
      * add layout info and tests, make sure ckprofiler returns 0
      
      * fix syntax
      
      * change layout output
      
      * fix syntax
      
      * fix syntax again
      
      * update script to process perf results
      
      * rearrange jenkins stages
      
      * fix typo
      
      * add python packages to Docker file
      
      * adding setuptools-rust package
      
      * modify parsing for new test parameters
      
      * test db credentials on jenkins
      
      * fix syntax
      
      * update python script to handle incomplete lines
      
      * upgrade python to 3.8 and write the gemm_params table
      
      * add sqlalchemy package to docker
      
      * move perf data processing to master node
      
      * move the master node inside a steps region
      
      * add new stage for result processing
      
      * move results processing to separate stage
      
      * reduce number of tests to speedup debugging
      
      * pass config to processPerfResults stage
      
      * run script on master in a docker container
      
      * replace show_node_info
      
      * try loading docker on master node again
      
      * use ansible node instead of master
      
      * get rid of pymysql package
      
      * try ssh connection using paramiko
      
      * put back pymysql
      
      * put the perf data processing back on the gpu node
      
      * put back artifact definition
      
      * archive the perf_log before parsing
      
      * clean up jenkinsfile, fix parsing
      
      * fix typo
      
      * enable all perf tests
      
      * put all stages in original order, finalize script
      
      * fix gpu_arch version
      
      * update parsing script
      
      * remove obsolete file causing merge conflict
    • add GetWorkSpaceSize to base arg (#253) · 0d08cf18
      Shaojie WANG authored
      * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
      
      * remove redundant compute
      
      * use datatype and split k to check whether a workspace is used
      
      * remove unused computation for work space size
  13. 23 May, 2022 1 commit
  14. 20 May, 2022 8 commits
    • example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244) · ac543313
      Shaojie WANG authored
      
      
      * enable example of conv 1d/3d for bwd weight
      
      * make bf16 kernel do not use atomic add
      
      * using new gridwise gemm for bwd weight on convnd bwd weight
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • remove options.hpp.in (#240) · 44943e0e
      Chao Liu authored
    • Refactor block to C tile map (#235) · a054f7d6
      Anthony Chang authored
      * refactor block-to-ctile-map
      
      * gridwise gemm block2ctile generic validity check
      
      * format
      
      * amend split-k gemm block2ctile map refactor
      
      * add test
      
      * format
      
      * amend
      
      * revert to calculating batch index in kernel instead of passing as block_id_z
      
      * move file
      
      * add valid ctile index check to gridwise v2r4
    • [conv bwd-weight]Binding gemm k1 to conv n (#202) · 070619fb
      Shaojie WANG authored
      
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * binding gemm k1 to conv n
      
      * try using half_4 to do ds_read
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * use fastest config
      
      * fix clang format for the other files
      
      * remove gemmk0 pad for output
      
      * add gemmk padding macro
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * add Right padding macro
      
      * add 2 test asm code
      
      * using 256x256x32 tile size
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * using small vec
      
      * 256*128 kernel size for example
      
      * remove asm files
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * reset gridwise gemm v2r4r2
      
      * remove unused code
      
      * revert instance file
      
      * revert example instance
      
      * format file
      
      * remove macros
      
      * resolve compile error
      
      * rename wrw kernel invoker
      
      * use gridwisegemm pipeline struct instead of implementing run function in the same header
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Shaojie WANG's avatar
    • [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and... · b9b9c3b8
      Shaojie WANG authored
      
      [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations (#190)
      
      * add some instance to develop
      
      * avoid bank conflicts for wrw for all instance
      
      * add small K1 test
      
      * delete some unused instance
      
      * reset buffer load oob and ds memcpy to default option
      
      * remove useless instances
      
      * remove redundant space
      
      * remove printf code
      
      * clang-format-10 change
      
      * fix clang format for the other files
      
      * add bank length computation
      
      * add template to distinguish the instance that need lds padding for wrw
      
      * use rocm5.1 as docker
      
      * use integer value for GEMM test
      
      * 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code
      
      * use a new gridwise gemm header for bwd-weight
      
      * revert gridwise gemm v2r4r2
      
      * change format
      
      * rename kernel invoker
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Hotfix eltiwseop (#242) · bb4b82a9
      rocking5566 authored
      
      
      * Use vector constructor instead
      
      * Fix typo
      
      * Move blockSize to the MakeArgumentPointer
      
      * Fix naming
      
      * Fix clang format
      
      * remove blockSize from DeviceBinaryElementwise::Argument()
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Gemm reduce max (#209) · 0ffe956a
      rocking5566 authored
      
      
      * [What] Rename the example
      [Why] Prepare to add unary reduction
      
      * Add global operation to the parameter
      
      * Add atomicmax
      
      * Fix compile error
      
      * Support atomicMax (hip library)
      
      * Rename the reduction example
      
      * Fix target name
      
      * use p_d1_grid as the indicator directly
      
      * Prevent performance issue. Let passthrough handle it.
      
      * Implement the function template and specialize it for float2
      
      * No need to separate into two lines
      
      * Remove empty line
      
      * add comment
      
      * Fix compile error due to merge from develop
      
      * make the implementation of atomic_max / atomic_add explicit for each datatype
      
      * Refine typo
      
      * For future CI test
      
      * Fix compiler error in ckProfiler
      
      * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
      
      * simply use remove_pointer
      
      * Rename type and var
      
      * Refine example
      
      * Modify reducemax example
      
      * Fix bug in reduction
      
      * Change initialize range
      
      * Implement F64 version of atomicMax
      
      * Move reduction code together
      
      * Add buffer atomic_max
      
      * Fix coding style by clang-format
      
      * Integrate new api of DeviceGemmReduce_Xdl_CShuffle
      
      * Integrate Batch gemm reduction
      
      * Fix example
      
      * fix example
      
      * clean up
      
      * Fix batch gemm tensor operation
      
      * Fix coding style
      
      * Fix template argument
      
      * Fix clang format
      
      * Keep flexible of different stride for each D tensor
      
      * Fix compile error for ckProfiler
      
      * Fix typo
      
      * [What] Fix naming
      [Why] Prepare to add out elementop
      
      * Add DoutElementOp
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: rocking <chunylai@amd.com>
  15. 19 May, 2022 1 commit
    • elementwise op (#238) · aafc3ac2
      rocking5566 authored
      
      
      * Add elementwise operation kernel and example
      
      * Add comment
      
      * Add template argument of dim. Prepare to support multiple dimensions
      
      * Rename example
      
      * Support 1 dimension
      
      * Add static assert
      
      * Add comment
      
      * Extract pad
      
      * Remove redundant argument
      
      * Support any dimension for elementwise operation
      
      * Remove line
      
      * Let it be the multiple number of CU
      
      * Move thread per block to the parameter of constructor
      
      * rename threadPerBlock with blockSize
      
      * Support double
      
      * rename kernel function name
      
      * remove redundant include header
      
      * Refine type
      
      * Need to the final dimension
      
      * Refine variable name
      
      * Refine type
      
      * Use index_t instead of int in API
      Co-authored-by: rocking <chunylai@amd.com>
  16. 13 May, 2022 1 commit
  17. 12 May, 2022 2 commits
    • Add host API (#220) · cec69bc3
      JD authored
      
      
      * Add host API
      
      * manually rebase on develop
      
      * clean
      
      * manually rebase on develop
      
      * exclude tests from all target
      
      * address review comments
      
      * update client app name
      
      * fix missing lib name
      
      * clang-format update
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix test issue
      
      * refactor
      
      * refactor
      
      * refactor
      
      * update cmake and readme
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • enable convnd bwd data test (#234) · 0f912e20
      ltqin authored