1. 15 Apr, 2022 1 commit
    • Illia Silin's avatar
      Compile CK for all targets (#188) · 4221505d
      Illia Silin authored
      
      
      * compile ck for all targets
      
      * update the target criteria
      
      * change the target condition
      
      * fixed some typos
      
      * fixed missed file
      
      * revert changes in README
      
      * revert device_conv3d_fwd_xdl_...
      
      * update device_conv3d_fwd_xdl_...
      
      * update device_batched_gemm_reduce...
      
      * test the unused arguments fix
      
      * test the warning suppression
      
      * try suppress warnings in device_batched_gemm_reduce_xdl...
      
      * fix the last warnings
      
      * replace UNUSED with std::ignore
      
      * fix a typo
      
      * replaced std::ignore with ignore
      
      * add igonre header to common_header
      
      * refactor atomicAdd
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      4221505d
  2. 31 Mar, 2022 1 commit
    • Chao Liu's avatar
      Compile for gfx908 and gfx90a (#130) · cd167e49
      Chao Liu authored
      * adding compilation for multiple targets
      
      * fix build
      
      * clean
      
      * update Jekinsfile
      
      * update readme
      
      * update Jenkins
      
      * use ck::half_t instead of ushort for bf16
      
      * rename enum classes
      
      * clean
      
      * rename
      
      * clean
      cd167e49
  3. 22 Mar, 2022 1 commit
    • Qianfeng's avatar
      Reduction for int8 and bfloat16 (#125) · 9a8ee8a3
      Qianfeng authored
      
      
      * Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
      
      * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
      
      * Rename the folder name for the pool2d and reduce examples
      
      * Update to reduction test scripts
      
      * Add Readme for pool2d_fwd and reduce_blockwise examples
      
      * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
      
      * Tiny fix in reduce profiler and tiny update in reduce testing scripts
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Tiny fix in testing script profile_reduce_no_index.sh
      
      * Add support for bfp16 reduction (using bhalf_t = ushort)
      
      * Tiny fix in amd_buffer_addressing.hpp
      
      * Tiny change in script/profile_reduce_with_index.sh
      
      * Use AccDataType for Beta value and use element_wise::PassThrough
      
      * Use type_convert for type converting in host layer reduction
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming and refining in Reduction profiler/device layer/examples
      
      * Renaming all NumReduceDims to NumReduceDim
      
      * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
      
      * Update to testing scripts to add bf16 support
      
      * added more static_assert
      
      * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
      
      * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
      
      * minor change
      
      * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
      
      * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
      
      * Tiny fix in script/profile_reduce_no_index.sh
      
      * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
      
      * Generic renaming in host reduction and DeviceReduce layer
      
      * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
      
      * Use multi-thread and simplification for host Reduction implementation
      
      * Add ctest for reduction
      
      * Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/
      
      * Update to the reduce CTest executables to enable default testing behavior when no command argument
      
      * Renaming
      Co-authored-by: default avatarJianfeng yan <jfyan008@gmail.com>
      9a8ee8a3
  4. 14 Nov, 2021 1 commit
    • Chao Liu's avatar
      ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from top most folder
      
      * Update README.md
      
      * clean up
      e823d518
  5. 19 Aug, 2021 1 commit
    • Chao Liu's avatar
      Composable kernel init integration v3 (#1097) · 6fe3627a
      Chao Liu authored
      * Squashed 'src/composable_kernel/' content from commit f6edda61
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: f6edda61
      
      * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
      
      * Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5
      
      5781adf5 Update develop (#5) (#6)
      97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
      7b1ec41e refactor
      49c33aae refactor
      54b3e73d rename
      
      git-subtree-dir: src/composable_kernel
      git-subtree-split: 5781adf5
      
      
      
      * fix
      
      * refactor
      
      * remove online compilation from CK
      
      * refactor
      
      * fix
      
      * add ctest
      
      * add c-style pointer cast
      
      * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
      
      * fix clang warning suppression
      
      * tidy
      
      * suppress cppcheck
      
      * fix enum issue
      
      * revert chagnes to hip build
      
      * fix kernel filename
      
      * update CK build script
      
      * rename
      
      * rename
      
      * make innner product compatiable on gfx900
      
      * Update src/include/miopen/solver/ck_utility_common.hpp
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      
      * compiler parameter use stream
      
      * use int instead of index_t in kernel wrapper
      
      * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
      
      * refactor
      
      * refactor
      
      * change cmakelist
      
      * change ck common utility
      
      * fix
      Co-authored-by: default avatarJD <Jehandad.Khan@amd.com>
      6fe3627a
  6. 10 Aug, 2021 1 commit
  7. 27 Jul, 2021 1 commit
    • Chao Liu's avatar
      [MIOpen Downstream] Initial MIOpen integration (#52) · f63a23ac
      Chao Liu authored
      * update online kernel wrapper bundle all descriptors in a tuple
      
      * change __CONSTANT__ to CONSTANT
      
      * rename
      
      * adding tuning
      
      * added IsValidCompileParameter
      
      * reorginze
      
      * adding tunable for fp16 and int8
      
      * fix kernel compile warning and bug fixes
      
      * suppress warning about cast CONSTANT (address space 4) pointer
      
      * fix building issue
      f63a23ac
  8. 18 Jul, 2021 1 commit
  9. 01 Jul, 2021 1 commit
    • zjing14's avatar
      xdlops_v4r4_fwd fp32/fp16 (#34) · 3835318c
      zjing14 authored
      
      
      * create files for xdlops
      
      * working on blockwise_gemm_xdlops
      
      * add KReduction
      
      * add m/n repeats
      
      * add 2x2 pipeline
      
      * added 128x128 wavegemm
      
      * use StaticBuffer of vector_type
      
      * break vector type to blk_size
      
      * add kpack into xldops_gemm and blockwise_gemm
      
      * abroadcast only
      
      * add fp32 mfma instructions
      
      * adding fp16 mfma
      
      * pack half4_t
      
      * rename kperwave to kpack
      
      * add 32x32x8fp16
      
      * add fp16 mfma
      
      * clean code
      
      * clean code
      
      * V4r4 xdlops kpack (#35)
      
      * add kpack with incorrect results
      
      * bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2
      
      * add 1x1 kernel
      
      * add gridwise_gemm_v2 - single_buffer
      
      * enabled dwordx4 for fp16
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      
      * refactor fwd-v4r4-xdlops
      
      * add v4r4-nhwc-xdlop
      
      * improve some perf of nhwc and nchw by tuning parameters, and change scheuduling in gridwise-gemm loop
      
      * tweak scheduling in gridwise gemm
      
      * add v4r3 with a single output copy
      
      * init commit: output with slice win
      
      * adding sliceWin
      
      * add multiple repeats pattern
      
      * starting adding bwd-v4r1-xdlops
      
      * use tuple as SrcBuffer
      
      * adding bwd-data v4r1 nhwc xdlops
      
      * fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()
      
      * fix bug in host bwd-data conv
      
      * initial implementation of bwd-data v4r1 nhwc xdlops
      
      * add launch bound flags
      
      * enable launch bound
      
      * add m/nrepeat=4
      
      * tweak bwd-data v4r1 nhwc xdlops
      
      * added bwd-data v4r1 nhwc xlops with output A and weight B
      
      * add fwd-v4r4 nhwc xdlops, A input, B weight, C output
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      3835318c
  10. 12 May, 2021 1 commit
  11. 25 Mar, 2021 1 commit
  12. 24 Jun, 2020 1 commit