1. 05 Mar, 2022 1 commit
    • Reduction in Composable Kernel (#82) · e17c0d80
      Qianfeng authored
      
      
      * Initial adding of generic reduction
      
      * Initial adding of generic reduction ...
      
      * Updates to make compiling done
      
      * clang-format all files
      
      * clang-format some files again
      
      * Renaming in profiler/include/profile_reduce.hpp
      
      * Updates and make BlockWise cases passed
      
      * Updates and make ThreadWise and MultiBlockTwoCall cases passed
      
      * Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
      
      * Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
      
      * format
      
      * adding pooling
      
      * added max and average pooling
      
      * comment out cout and kernel timing
      
      * Tiny simplification in profiler/reduce_profiler.cpp
      
      * Add example for reduce_blockwise
      
      * Tiny updates
      
      * Change to pass the ElementWiseOp from device layer to kernel
      
      * Fix the vectorDim and vectorSize in Device layer
      
      * Enable vector load on both dim0 and dim1 for Threadwise method
      
      * Tiny updates
      
      * Change to let the user to pass the preUnaryOp and posUnaryOp
      
      * Make pooling example work
      
      * split device_reduce_instance into two libraries
      
      * Tiny update
      
      * Replace nanPropaOpt enum by boolean propagate_nan
      
      * Simplification in DeviceReduce layer codes
      
      * update build
      
      * Change to clarify the difference between ck::half_t and half_float::half
      
      * Renaming in all the reduction codes
      
      * Add VectorSize as template parameter for device layer
      
      * Add BetaIsZero as kernel template and as AccDataType for alpha
      
      * print
      
      * Small updates for pooling
      
      * Updates for host_generic_reduction for reference
      
      * Update to make AVG pooling pass
      
      * Update to make MAX pooling with indices output pass
      
      * fix
      
      * add OutDst vector store to threadwise reduction and pooling
      
      * tweak
      
      * turn off check_indices that caused build issue
      
      * refactor pooling
      
      * clean up
      
      * turn off check_indices for building issue for php-compiler
      
      * add more tile size for odd C
      
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op
      
      * Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations
      
      * Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp
      
      * Change to the unary operators
      
      * Move the definitions of unary operations to element_wise_operation.hpp
      
      * re-org files
      
      * Refine in device interfaces and multiblock kernels
      
      * Split the reduction configurations into instances for specific methods
      
      * Update in getTypeString() of device pool2d
      
      * Renaming in host and kernel
      
      * Tiny update in profiler/src/profiler.cpp
      
      * Uncomment in device_operation/CMakeLists.txt to enable the building of all operations
      
      * Make check_indices a templated function to remove some linking issue
      
      * Renaming in the profiler reduce module
      
      * Add support for double Reduction (but disable MultiblockAtomicAdd for double)
      
      * Tiny correction of literal string
      
      * Rename DevicePoolFwd to DevicePool2dFwd
      
      * Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling
      
      * Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx
      
      * Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp
      
      * Renaming and refining in the Reduction codes
      
      * Tiny change in the unary operators
      
      * Renaming symbols and files
      
      * Renaming symbols in the kernels
      
      * Move kernel kernel_set_buffer_value to separate file
      
      * Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer
      
      * Tiny update in the kernels
      
      * Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue
      
      * Simplify a helper function in device layer
      
      * Tiny adjustment in testing data initialization
      
      * Renaming in kernel/device/host
      
      * Add two testing scripts for reduction
      
      * Refine the Unary operators in element_wise_operation.hpp
      
      * Update in the reduce profiler module
      
      * Update to the reduction testing scripts
      
      * reduce compile parallelism
      
      * change CI docker to rocm5.0
      
      * remove unused variables
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  2. 04 Mar, 2022 1 commit
    • [Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32
      rocking5566 authored
      
      
      * Add int8 of mk_nk_mn to the ckProfiler
      
      * Add example of int8 gemm
      
      * Fix typo, use ushort instead of half_t for bfloat16
      
      * replace ushortXXX_t to bhalfXXX_t
      
      * rename ushort to bhalf_t
      
      * Add bf16 example
      
      * Add bf16 gemm to ckProfiler
      
      * Fix alignment
      
      * Fix typo
      
      * Add unit test for gemm_xdl int8
      
      * Add gemm_xdl fp32 unit test
      
      * Add gemm_xdl bf16 unit test
      
      * fix build
      
      * fix build issue due to merge conflict
      
      * Fix build
      
      * Fix build error
      Co-authored-by: rocking <chunylai@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  3. 28 Feb, 2022 1 commit
    • Allow distinct K0/K1 values for A/B block descriptor (#98) · 6d4450ef
      Anthony Chang authored
      
      
      * add gitignore
      
      * host tensor: allow generating sequentially increasing value in a given dimension
      
      * gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
      
      - remove dangling header include
      - modify example gemm_xdl accordingly
      - infer KPack value from M/NPerXdl
      - device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
      (API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
      
      * add LDS data dump utility
      
      * profiler: reflect API change for distinct K0/K1 for A/B matrices
      
      * profiler: add conflict-free LDS write FP16 kernel instances
      
      * fix accidental perf regression
      
      * address feedback; cosmetic changes
      
      * clang-format for new files
      
      * format
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  4. 23 Feb, 2022 1 commit
    • Conv3d new (#94) · 6dfb92bb
      Jianfeng Yan authored
      
      
      * conv3d compiles but has memory error
      
      * conv3d works
      
      * fix performance issue by using __builtin_amdgcn_readfirstlane
      
      * change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
      
      * clang-format
      
      * remove CK_EXPERIMENTAL_PASS_TENSOR_DESCRIPTOR_BY_*; moved wrapper into DeviceConv3d
      
      * format
      
      * remove useless macro
      
      * add comment
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  5. 12 Feb, 2022 1 commit
    • NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9
      ltqin authored
      
      
      * add fwd bf16 conv
      
      * change tuning parameter
      
      * add int8 for conv fwd
      
      * remove comments
      
      * change tuning parameter for int8
      
      * change init int8 example
      
      * add test for conv2d fwd
      
      * change device operation file pos because merge develop
      
      * fwd int8 use reference
      
      * test_conv_fwd use reference
      
      * add bracket for if statement
      
      * rename fwd example name
      
      * remove StaticBufferOfVectorTypeV2
      
      * tweak example
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  6. 07 Feb, 2022 1 commit
    • GEMM+Bias+ReLU+Add (#76) · 823657ed
      Chao Liu authored
      * tweak conv for odd C
      
      * update script
      
      * clean up elementwise op
      
      * fix build
      
      * clean up
      
      * added example for gemm+bias+relu+add
      
      * added example for gemm+bias+relu
      
      * add profiler for gemm_s_shuffle; re-org files
      
      * add profiler
      
      * fix build
      
      * clean up
      
      * clean up
      
      * clean up
      
      * fix build
  7. 03 Dec, 2021 1 commit
    • GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md
  8. 18 Nov, 2021 1 commit
    • v5r1 fusion kernels for inference (#49) · 970fa3e9
      zjing14 authored
      
      
      * init
      
      * refactor for 1x1
      
      * rename e0_e1
      
      * add e1 with bugs
      
      * debug
      
      * fixed
      
      * fixed e1
      
      * add timer
      
      * improve threadwise gemm with dot2
      
      * add e2
      
      * tuning
      
      * separate c2
      
      * add nhwc
      
      * restore nchwc
      
      * clean
      
      * opt
      
      * fixed; tuning
      
      * add BGlobalMoveSliceWindowStepHacks{}
      
      * tuning
      
      * repeat running
      
      * adjust
      
      * merge v5r1 nchwc
      
      * add adaptors
      
      * split k0 k1 in c_thread_grid
      
      * split h and w
      
      * remove v5r1 nhwc
      
      * clean for pr
      
      * remove host_conv_add
      
      * clean code
      
      * clean
      
      * add dynamic support
      
      * static mode
      
      * test static
      
      * add conv+add fusion
      
      * fixed validation
      
      * naming fix
      
      * use activ_enum
      
      * make static
      
      * refactor conv_add for InMem::add
      
      * add bias
      
      * add conv_out
      
      * add configurable makeddesc
      
      * add maxpool fusion
      
      * add maxpool host for validation
      
      * enable static desc
      
      * conv-only use v5r1_add
      
      * test
      
      * test
      
      * for binary dumps
      
      * fixed incorrect results due to typo
      
      * clean
      
      * debugging maxpool
      
      * workaround with offset trick
      
      * clean code
      
      * modularize ops of fusion
      
      * add gridwise_gemm_v3
      
      * create separate fusion fun
      
      * enable dynamic mode of conv and conv+resize_add
      
      * add dynamic mode of maxpool
      
      * add pass by point
      
      * add activ_type as arguments
      
      * merge develop
      
      * clean
      
      * reset config to old default
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  9. 16 Nov, 2021 2 commits
  10. 15 Nov, 2021 2 commits
    • Add bfp16/int8 support into XDL GEMM operator (#50) · 3737bb03
      zjing14 authored
      
      
      * init StaticBufferV2
      
      * clean
      
      * adopt old output stage for staticBufferV2
      
      * clean
      
      * remove hack
      
      * clean
      
      * clean
      
      * add parameters
      
      * clean code
      
      * move c_buffer alloc into blockwise gemm
      
      * add adaptors for m/n_thread_data_on_grid
      
      * tweak gemm
      
      * adjust blockwise_gemm_xdlops
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * add bf16_1k
      
      * clang-format
      
      * clean
      
      * add bfp16 for gfx908
      
      * add verification
      
      * clean up
      
      * clean code
      
      * restore bfl16
      
      * clean
      
      * add bfp16 support into gemm_driver
      
      * apply new generator to other drivers
      
      * add int8 support
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      Co-authored-by: root <root@hayabusa6111.amd.com>
    • FP16 data in-register transpose (#41) · b491ebf3
      Chao Liu authored
      * start fixing 16bit data packing
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * add missing constexpr
      
      * adding static tensor
      
      * adding static tensor
      
      * adding transpose
      
      * add inline asm for transpose 2x2 of half_t
      
      * add general transpose_vectors(), but have unnecessary register initialization using v_mov
      
      * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
      
      * add hardcoded logic for NHWC wrw
      
      * improve asm for v_pack
      
      * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
      
      * tweak
      
      * reorganize file
  11. 14 Nov, 2021 1 commit
    • ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from top most folder
      
      * Update README.md
      
      * clean up
  12. 19 Oct, 2021 1 commit
    • add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) · fd49ff80
      ltqin authored
      
      
      * add new algorithm from v4r4r2
      
      * program once issue
      
      * add split k function
      
      * redefine code
      
      * add a matrix unmerge
      
      * add b matrix unmerge k0
      
      * trans a and b to gridegemm
      
      * nhwc init
      
      * no hacks and vector load
      
      * add hacks
      
      * modify some parameter
      
      * fix tuning parameter for fp32
      
      * fix tuning parameter for fp16
      
      * start change gridwise k split
      
      * init ok
      
      * remove a b matrix k0mk1 desc in grid
      
      * rewrite calculate gridsize
      
      * add kbatch to CalculateBottomIndex
      
      * remove some unused function
      
      * add clear data function before call kernel
      
      * out hacks
      
      * in hacks
      
      * rename device convolution file and function name
      
      * modify kBatch value
      
      * fix some tuning code
      
      * start from v4r4 nhwc
      
      * nhwc atomic is able to run
      
      * just for fp32
      
      * enable nchw atomic
      
      * tweak
      
      * tweak
      
      * re-arrange gridwise gemm hot loop for wrw
      
      * add wrw v4r5
      
      * v4r4r5 fp16
      
      * v4r4r4 fp16
      
      * v4r4r2 fp16
      
      * V4R4R4XDLNHWC fp16
      
      * V4R4R2XDLATOMICNCHW fp16
      
      * adjust for fp16
      
      * input gridsize
      
      * change kbatch to gridsize
      
      * testing wrw
      
      * clean up
      
      * k_batch to gridsize
      
      * fix bug
      
      * wrw v4r4r4 kbatch change to grid size
      
      * wrw v4r4r2 kbatch change to grid size
      
      * after merge , change gridwise gemm v2r4
      
      * change MakeCBlockClusterAdaptor
      
      * other method use new gridwise gemm
      
      * clean up
      
      * change pad method to make_right_pad_transform
      
      * kbatch out from transform function
      
      * clean up and fix bug
      
      * fix bug
      
      * using function type reduce template parameters
      
      * using auto replace define function type
      
      * clean up
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
  13. 06 Oct, 2021 1 commit
    • Tweak GEMM kernel (#38) · b3e8d57d
      Chao Liu authored
      * add parameters
      
      * tweak gemm
      
      * tweak
      
      * update conv
      
      * update script
      
      * adding bwd 1x1
      
      * update script
      
      * adding 1x1 bwd
      
      * debugging bwd 1x1 failure
      
      * update script
      
      * update script
      
      * test
      
      * test v100
      
      * clean up
  14. 05 Sep, 2021 1 commit
    • GEMM driver and kernel (#29) · 19613902
      Chao Liu authored
      * add gemm driver
      
      * tweak
      
      * add gemm kernel: mk_kn_mn and km_kn_mn
      
      * tweak
      
      * add GEMM km_nk_mn
      
      * fix comment
  15. 31 Aug, 2021 1 commit
    • Backward weight v4r4r2 with xdlops (#18) · 627d8ef3
      ltqin authored
      
      
      * start
      
      * modify transformat
      
      * modify device convolution
      
      * modify host
      
      * added host conv bwd and wrw
      
      * remove bwd, separate wrw
      
      * clean
      
      * hacall k to zero
      
      * out log
      
      * fixed
      
      * fixed
      
      * change to (out in wei)
      
      * input hack
      
      * hack to out
      
      * format
      
      * fix by comments
      
      * change wei hacks(wei transform has not merge)
      
      * fix program once issue
      
      * fix review comment
      
      * fix vector load issue
      
      * tweak
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Jing Zhang <jizhan@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  16. 19 Aug, 2021 2 commits
    • Composable kernel init integration v3 (#1097) · 6fe3627a
      Chao Liu authored
      * Squashed 'src/composable_kernel/' content from commit f6edda61
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: f6edda61
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5
      
      5781adf5 Update develop (#5) (#6)
      97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
      7b1ec41e refactor
      49c33aae refactor
      54b3e73d rename
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: 5781adf5
      
      
      
      * fix
      
      * refactor
      
      * remove online compilation from CK
      
      * refactor
      
      * fix
      
      * add ctest
      
      * add c-style pointer cast
      
      * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
      
      * fix clang warning suppression
      
      * tidy
      
      * suppress cppcheck
      
      * fix enum issue
      
      * revert changes to hip build
      
      * fix kernel filename
      
      * update CK build script
      
      * rename
      
      * rename
      
      * make inner product compatible on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      Co-authored-by: JD <Jehandad.Khan@amd.com>
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * refactor
      
      * refactor
      
      * change cmakelist
      
      * change ck common utility
      
      * fix
      Co-authored-by: JD <Jehandad.Khan@amd.com>
    • Added host_conv_wrw for verification (#15) · ba6f79a7
      zjing14 authored
      * added host conv wrw
  17. 10 Aug, 2021 2 commits
  18. 09 Aug, 2021 3 commits
  19. 30 Jul, 2021 1 commit
  20. 18 Jul, 2021 1 commit