1. 21 May, 2024 1 commit
  2. 10 May, 2024 1 commit
  3. 08 May, 2024 1 commit
  4. 26 Apr, 2024 2 commits
    • ggemm tile_loop multD bf16 int8 (#1258) · 5ae893c0
      zjing14 authored
      
      
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find group.
      
      * Fix gemm desc initialization.
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * Validate on host only NK dims if M is not known.
      
      * add
      
      * clean
      
      * refactor
      
      * clean
      
      * add examples
      
      * add fuse
      
      * add fusion and client example
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      
      * clean
      
      ---------
      Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
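      Several bullets in this commit add output-stream overloads (for LoopScheduler, PipelineVersion, and Sequence) so instance configurations print readably in logs and type strings. Below is a minimal sketch of the Sequence case, using a simplified stand-in type rather than CK's actual ck::Sequence:

      ```cpp
      #include <iostream>

      // Simplified stand-in for a compile-time integer sequence such as ck::Sequence.
      template <int... Is>
      struct Sequence
      {
      };

      // Print as "Sequence<a, b, c>" so thread-cluster lengths and similar
      // configuration packs are readable when an instance describes itself.
      template <int... Is>
      std::ostream& operator<<(std::ostream& os, Sequence<Is...>)
      {
          os << "Sequence<";
          const char* sep = "";
          ((os << sep << Is, sep = ", "), ...); // C++17 fold over the pack
          return os << ">";
      }

      int main()
      {
          std::cout << Sequence<4, 64, 1>{} << '\n'; // prints: Sequence<4, 64, 1>
      }
      ```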
    • bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
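      The feature in this commit is a GEMM with bf16 A, int8 B, and a fused bias + FastGelu epilogue. The host reference below sketches the math under stated assumptions: float stands in for bf16, B is dequantized with a per-column scale (an assumed quantization scheme), and fast_gelu uses the common tanh approximation, which may differ from CK's exact polynomial.

      ```cpp
      #include <cmath>
      #include <cstdint>
      #include <cstdio>
      #include <vector>

      // Tanh-based fast GELU approximation (one common formulation).
      float fast_gelu(float x)
      {
          return 0.5f * x * (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
      }

      // Host reference for E = FastGelu(A*B + bias) with bf16-like A (float here)
      // and int8 B dequantized by a per-column scale.
      void gemm_bf16_i8_bias_fastgelu(int M, int N, int K,
                                      const std::vector<float>& a,       // MxK, row-major
                                      const std::vector<int8_t>& b,      // KxN, row-major
                                      const std::vector<float>& b_scale, // N
                                      const std::vector<float>& bias,    // N
                                      std::vector<float>& e)             // MxN
      {
          for(int m = 0; m < M; ++m)
              for(int n = 0; n < N; ++n)
              {
                  float acc = 0.f;
                  for(int k = 0; k < K; ++k)
                      acc += a[m * K + k] * b_scale[n] * float(b[k * N + n]);
                  e[m * N + n] = fast_gelu(acc + bias[n]);
              }
      }

      int main()
      {
          const int M = 2, N = 3, K = 4;
          std::vector<float> a(M * K, 0.5f), b_scale(N, 0.1f), bias(N, 1.f), e(M * N);
          std::vector<int8_t> b(K * N, 2);
          gemm_bf16_i8_bias_fastgelu(M, N, K, a, b, b_scale, bias, e);
          std::printf("e[0] = %f\n", e[0]); // fast_gelu(0.5*0.1*2*4 + 1) = fast_gelu(1.4)
      }
      ```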
  5. 19 Apr, 2024 1 commit
    • Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
      * Restore elementwise scale for hiptensor backward compatibility
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
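      The refactor replaces per-operation elementwise kernels with one shared, templated implementation. The sketch below shows the shape of that pattern on the host (it is not the CK code): a single generic apply routine, with each operation reduced to a small functor.

      ```cpp
      #include <cstdio>
      #include <vector>

      // One generic "kernel" templated on the elementwise functor, instead of a
      // separate hand-written loop per operation -- the shape of the refactor.
      template <typename Op>
      void elementwise_apply(const std::vector<float>& in, std::vector<float>& out, Op op)
      {
          for(std::size_t i = 0; i < in.size(); ++i)
              out[i] = op(in[i]);
      }

      struct Scale
      {
          float s;
          float operator()(float x) const { return s * x; }
      };

      struct Relu
      {
          float operator()(float x) const { return x > 0.f ? x : 0.f; }
      };

      int main()
      {
          std::vector<float> x{-1.f, 2.f}, y(2);
          elementwise_apply(x, y, Scale{3.f}); // y = {-3, 6}
          elementwise_apply(y, y, Relu{});     // y = {0, 6}
          std::printf("%g %g\n", y[0], y[1]);
      }
      ```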
  6. 18 Apr, 2024 1 commit
  7. 16 Apr, 2024 1 commit
    • Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
      * add MPadding
      
      * disable reduce for kbatch = 1
      
      * separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
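      "Multi_ABD" means the A, B, and D operands are each tuples of tensors combined by elementwise ops: the A tensors are fused on load, the B tensors on load, and the D tensors in the epilogue. The host reference below uses illustrative choices (two A tensors added, one scaled B, a bias D); the tensor counts and ops are examples, not a fixed CK signature.

      ```cpp
      #include <cstdio>
      #include <vector>

      // Host reference for the multi-ABD pattern: a = AOp(a0, a1), b = BOp(b0),
      // e = CDEOp(acc, d0). Here AOp = add, BOp = scale, CDEOp = add-bias; the
      // concrete ops and tensor counts are illustrative.
      void gemm_multi_abd(int M, int N, int K,
                          const std::vector<float>& a0, const std::vector<float>& a1, // MxK
                          const std::vector<float>& b0, float b_scale,                // KxN
                          const std::vector<float>& d0,                               // N (bias)
                          std::vector<float>& e)                                      // MxN
      {
          for(int m = 0; m < M; ++m)
              for(int n = 0; n < N; ++n)
              {
                  float acc = 0.f;
                  for(int k = 0; k < K; ++k)
                  {
                      const float a = a0[m * K + k] + a1[m * K + k]; // AElementwiseOp
                      const float b = b_scale * b0[k * N + n];       // BElementwiseOp
                      acc += a * b;
                  }
                  e[m * N + n] = acc + d0[n]; // CDEElementwiseOp
              }
      }

      int main()
      {
          const int M = 1, N = 2, K = 3;
          std::vector<float> a0(M * K, 1.f), a1(M * K, 1.f), b0(K * N, 1.f), d0(N, 0.5f), e(M * N);
          gemm_multi_abd(M, N, K, a0, a1, b0, 2.f, d0, e);
          std::printf("%g %g\n", e[0], e[1]); // (1+1)*2 summed over K=3, plus 0.5 -> 12.5
      }
      ```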
  8. 15 Apr, 2024 1 commit
  9. 11 Apr, 2024 1 commit
  10. 03 Apr, 2024 1 commit
  11. 02 Apr, 2024 1 commit
    • Split the instances by architecture. (#1223) · ae57e593
      Illia Silin authored
      * parse examples inside the add_example_executable function
      
      * fix the example 64 cmake file
      
      * add xdl flag to the gemm_bias_softmax_gemm_permute example
      
      * add filtering of tests based on architecture type
      
      * enable test_grouped_gemm for gfx9 only
      
      * enable test_transpose only for gfx9
      
      * only link test_transpose if it gets built
      
      * split the gemm instances by architectures
      
      * split gemm_bilinear,grouped_conv_bwd_weight instances by targets
      
      * split instances by architecture
      
      * split grouped_conv instances by architecture
      
      * fix clang format
      
      * fix the if-else logic in group_conv headers
      
      * small fix for grouped convolution instances
      
      * fix the grouped conv bwd weight dl instances
      
      * fix client examples
      
      * only enable client examples 3 and 4 on gfx9
      
      * set the gfx9 macro
      
      * make sure the architecture macros are set by cmake
      
      * use separate set of xdl/wmma flags for host code
      
      * simplify the main cmake file
      
      * add conv_fwd_bf8 instance declaration
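      The split works by compiling each instance library only for the architectures that can run it, with CMake defining a per-target macro that gates instance registration. The sketch below shows the guard pattern; the macro names CK_USE_XDL/CK_USE_WMMA are assumptions for illustration, not necessarily the exact flags the build defines.

      ```cpp
      #include <cstdio>
      #include <string>
      #include <vector>

      // Hypothetical per-architecture guard; in the real build, CMake would define
      // one macro per target family (e.g. XDL for gfx9, WMMA for gfx11).
      #define CK_USE_XDL 1
      // #define CK_USE_WMMA 1

      void add_gemm_instances(std::vector<std::string>& registry)
      {
      #if defined(CK_USE_XDL)
          registry.push_back("gemm_xdl_fp16_instance"); // compiled for gfx9-class targets
      #endif
      #if defined(CK_USE_WMMA)
          registry.push_back("gemm_wmma_fp16_instance"); // compiled for gfx11-class targets
      #endif
      }

      int main()
      {
          std::vector<std::string> registry;
          add_gemm_instances(registry);
          for(const auto& name : registry)
              std::printf("%s\n", name.c_str());
      }
      ```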
  12. 21 Mar, 2024 1 commit
  13. 15 Mar, 2024 1 commit
  14. 13 Mar, 2024 1 commit
  15. 29 Feb, 2024 1 commit
  16. 26 Feb, 2024 1 commit
  17. 21 Feb, 2024 1 commit
  18. 13 Feb, 2024 2 commits
  19. 25 Jan, 2024 1 commit
    • layernorm & groupnorm bwd gamma beta (#1133) · 28f68a5a
      rocking authored
      * Add layernorm bwd gamma beta external api
      
      * Add groupnorm external api
      
      * Add layernorm bwd gamma beta profiler
      
      * Add groupnorm bwd gamma beta ckProfiler
      
      * Add layernorm & groupnorm bwd gamma beta test
      
      * Fix groupnorm bwd gamma beta profiler bug
      
      * Layernorm bwd weight client example
      
      * Groupnorm bwd weight client example
      
      * clang format
      
      * Remove useless header
      
      * Let inv_std be positive
      
      * Rename to num_bytes and move this calculation outside the loop
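      For an MxN input normalized along N, the gamma/beta gradients reduce over the M dimension, reusing the mean and inverse std saved by the forward pass: dgamma[n] = sum_m dy*xhat and dbeta[n] = sum_m dy. A host reference sketch:

      ```cpp
      #include <cstdio>
      #include <vector>

      // Host reference for layernorm backward gamma/beta over an MxN input
      // normalized along N, using the mean/inv_std saved by the forward pass.
      void layernorm_bwd_gamma_beta(int M, int N,
                                    const std::vector<float>& dy,      // MxN
                                    const std::vector<float>& x,       // MxN
                                    const std::vector<float>& mean,    // M
                                    const std::vector<float>& inv_std, // M
                                    std::vector<float>& dgamma,        // N
                                    std::vector<float>& dbeta)         // N
      {
          for(int n = 0; n < N; ++n)
          {
              float dg = 0.f, db = 0.f;
              for(int m = 0; m < M; ++m)
              {
                  const float xhat = (x[m * N + n] - mean[m]) * inv_std[m];
                  dg += dy[m * N + n] * xhat;
                  db += dy[m * N + n];
              }
              dgamma[n] = dg;
              dbeta[n]  = db;
          }
      }

      int main()
      {
          const int M = 2, N = 3;
          std::vector<float> dy(M * N, 1.f), x{1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
          std::vector<float> mean{2.f, 5.f}, inv_std{1.2247f, 1.2247f}, dgamma(N), dbeta(N);
          layernorm_bwd_gamma_beta(M, N, dy, x, mean, inv_std, dgamma, dbeta);
          std::printf("dgamma[0]=%g dbeta[0]=%g\n", dgamma[0], dbeta[0]);
      }
      ```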
  20. 19 Jan, 2024 1 commit
  21. 19 Dec, 2023 1 commit
  22. 18 Dec, 2023 1 commit
    • layernorm and groupnorm backward data (#1083) · a69aa2a1
      rocking authored
      * rename folder
      
      * Add type string
      
      * Remove typo
      
      * Add deviceOp to backward x
      
      * Add comment to describe the behavior of backward normalization
      
      * Add kernel function, prepare to implement
      
      * implement generic kernel
      
      * Check vector size
      
      * Add sweep once pipeline for small reduce size
      
      * Fix bug of KRaw_ error
      
      * Fix bug of dx stride
      
      * sanity check for mean and rstd
      
      * backward x for groupnorm
      
      * Add bwd x instance
      
      * add layernorm 2d bwd gamma beta instances
      
      * Change save mean var type from f32 to f16 in f16 mode
      
      * Change the example to f16
      
      * Add groupnorm bwd gamma beta instance
      
      * Add groupnorm bwd x instance
      
      * Fix naming
      
      * Add layernorm bwd x ckprofiler
      
      * Add groupnorm bwd x profiler
      
      * clang format
      
      * Rename bwd x to bwd data
      
      * Fix bug of verification in profiler
      
      * Add test of layernorm and groupnorm bwd data
      
      * Add missing cmake
      
      * Add layernorm2d bwd data
      
      * rename fwd example
      
      * Add groupnorm client example
      
      * Fix typo. replace Invarient with Invariant
      
      * Add checking before running the best instance
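      Backward data needs two row reductions before the pointwise step. With g = dy * gamma and xhat = (x - mean) * inv_std, the gradient is dx = inv_std * (g - mean_N(g) - xhat * mean_N(g * xhat)), where mean_N averages over the normalized dimension. A host reference sketch:

      ```cpp
      #include <cstdio>
      #include <vector>

      // Host reference for layernorm backward data over an MxN input normalized
      // along N: dx = inv_std * (g - mean(g) - xhat * mean(g * xhat)) with
      // g = dy * gamma, means taken over N.
      void layernorm_bwd_data(int M, int N,
                              const std::vector<float>& dy, const std::vector<float>& x, // MxN
                              const std::vector<float>& gamma,                           // N
                              const std::vector<float>& mean,                            // M
                              const std::vector<float>& inv_std,                         // M
                              std::vector<float>& dx)                                    // MxN
      {
          for(int m = 0; m < M; ++m)
          {
              float sum_g = 0.f, sum_gx = 0.f;
              for(int n = 0; n < N; ++n)
              {
                  const float xhat = (x[m * N + n] - mean[m]) * inv_std[m];
                  const float g    = dy[m * N + n] * gamma[n];
                  sum_g += g;
                  sum_gx += g * xhat;
              }
              for(int n = 0; n < N; ++n)
              {
                  const float xhat = (x[m * N + n] - mean[m]) * inv_std[m];
                  const float g    = dy[m * N + n] * gamma[n];
                  dx[m * N + n]    = inv_std[m] * (g - sum_g / N - xhat * sum_gx / N);
              }
          }
      }

      int main()
      {
          const int M = 1, N = 2;
          std::vector<float> dy{1.f, -1.f}, x{0.f, 2.f}, gamma(N, 1.f);
          std::vector<float> mean{1.f}, inv_std{1.f}, dx(M * N);
          layernorm_bwd_data(M, N, dy, x, gamma, mean, inv_std, dx);
          std::printf("dx = %g %g\n", dx[0], dx[1]);
      }
      ```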
  23. 08 Dec, 2023 1 commit
  24. 06 Dec, 2023 1 commit
    • Introduce wrapper library (#1071) · 836b7e55
      Bartłomiej Kocot authored
      * Introduce wrapper library
      
      * Update cmake files
      
      * Revert "Update cmake files"
      
      This reverts commit c27f88b56590c11a88e26d5d0df7aca51a08133d.
      
      * Fix comments
  25. 28 Nov, 2023 1 commit
    • Split the static library into several files. (#1044) · 7965d66a
      Illia Silin authored
      * split the static library into several
      
      * update lib paths and fix client example
      
      * do not use device_mha_operations for client examples
      
      * use appropriate libs to link to client examples
      
      * remove the gpu/transpose path from the list
      
      * try fixing client examples 3,4,9
      
      * add necessary libs for client examples
      
      * fix the layernorm client example
      
      * fix the client examples 23 and 24
      
      * fix typo
      
      * add interface library and refresh clang format
  26. 14 Nov, 2023 1 commit
  27. 13 Nov, 2023 1 commit
  28. 10 Nov, 2023 1 commit
    • Support multi AB for grouped conv fwd xdl (#1027) · 49e52bb3
      Bartłomiej Kocot authored
      * Support multi AB for grouped conv fwd xdl
      
      * Add instances
      
      * Add client example
      
      * Add example
      
      * Add interface test
      
      * Minor fixes
      
      * Comment fixes
      
      * Fixes
      
      * Reference fix
      
      * Test xdl fixes
      
      * Improve multi_ab interface test
  29. 09 Nov, 2023 2 commits
    • Transpose 3d (#984) · 3af8c81a
      arai713 authored
      
      
      * added working example for 5D input using 1D kernel
      
      * example with 5D input tensor and 2d kernel - not working: issues with arguments
      
      * added updated version of 3d device op - changed descriptors/dims
      
      * added example file to check kernel
      
      * fixed descriptor and isSupportedArgument stride problem
      
      * added and modified kernel for 3d - updated tids/loop
      
      * adding some more 5d example files
      
      * fixed some issues
      
      * changes made for testing
      
      * working version: fixed error in stride for A, still a bit inefficient
      
      * cleaned up formatting/comments
      
      * updating formatting
      
      * more formatting fixes
      
      * fixing cmake, adding back gpu targets in cmake script
      
      * adding client example
      
      * added instances for client example
      
      * fixed errors in client example
      
      * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
      
      * removed extra files
      
      * minor formatting and naming fixes
      
      * adding test files and profiler
      
      * fixing minor error
      
      * minor fix
      
      * removed unnecessary comments, renamed files
      
      * updated instance list for client example, added different layout example
      
      * removing instances
      
      * fixed error in instance generation
      
      * remove comments
      
      * update profiler and client example tensor layouts
      
      * fixed errors in test/profiler
      
      * updated vector dim access to enable vector load
      
      * updated test/profiler files
      
      * updated example with 1d kernel
      
      * updating profiler
      
      * renamed files
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
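      The transpose here is implemented as a strided elementwise copy: the output is walked linearly while the input is addressed through permuted strides. A host reference for the 5-D case (the shape, layout, and permutation below are illustrative):

      ```cpp
      #include <array>
      #include <cstdio>
      #include <vector>

      // Host reference for a 5-D permute done as a strided elementwise copy: walk
      // the output linearly, address the input through permuted strides.
      void permute5d(const std::array<int, 5>& shape, // input shape
                     const std::array<int, 5>& perm,  // output dim d comes from input dim perm[d]
                     const std::vector<float>& in, std::vector<float>& out)
      {
          std::array<int, 5> in_stride;
          in_stride[4] = 1;
          for(int d = 3; d >= 0; --d)
              in_stride[d] = in_stride[d + 1] * shape[d + 1];

          std::array<int, 5> out_shape;
          for(int d = 0; d < 5; ++d)
              out_shape[d] = shape[perm[d]];

          std::array<int, 5> idx{}; // multi-index into the output, iterated row-major
          for(std::size_t o = 0; o < in.size(); ++o)
          {
              std::size_t i = 0;
              for(int d = 0; d < 5; ++d)
                  i += std::size_t(idx[d]) * in_stride[perm[d]];
              out[o] = in[i];
              for(int d = 4; d >= 0; --d) // advance the output multi-index
              {
                  if(++idx[d] < out_shape[d])
                      break;
                  idx[d] = 0;
              }
          }
      }

      int main()
      {
          const std::array<int, 5> shape{2, 3, 4, 5, 6}, perm{0, 2, 3, 4, 1}; // NCDHW -> NDHWC
          std::vector<float> in(2 * 3 * 4 * 5 * 6), out(in.size());
          for(std::size_t i = 0; i < in.size(); ++i)
              in[i] = float(i);
          permute5d(shape, perm, in, out);
          std::printf("out[1] = %g\n", out[1]); // 120, i.e. input element (0,1,0,0,0)
      }
      ```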
    • Layernorm4d (#1022) · a3d9a2cd
      rocking authored
      
      
      * Rename folder
      
      * Add layernorm 4d fwd example
      
      * Rename original layernorm example
      
      * Add layernorm 4d f16 test
      
      * Add layernorm4d_fwd client example
      
      * Support layernorm4D in ckProfiler
      
      * Rename groupnorm to groupnorm fwd in example
      
      * Rename layernorm and group fwd in test
      
      * Rename normalization to normalization_fwd (instances)
      
      * Add fwd to DeviceNormalization
      
      * Rename external api header
      
      * Rename folder, because we can also add bwd in this folder
      
      * Add fwd in layernorm and groupnorm (profiler)
      
      * Fix compile error
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  30. 01 Nov, 2023 1 commit
  31. 31 Oct, 2023 1 commit
  32. 18 Oct, 2023 1 commit
    • Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c
      rocking authored
      * save mean and inverse std in normalization
      
      * Save mean and inverse std in splitK
      
      * Vector save mean and inv std
      
      * Modify instance for save mean and std
      
      * simplify the layernorm example
      
      * Save mean and std in groupnorm example
      
      * Save mean and inv std in ckProfiler and test
      
      * Remove compute data type from base class
      
      * Save mean and inv std in client example
      
      * Add changelog
      
      * clang format
      
      * Fix compile error
      
      * Refine naming
      
      * Avoid error in bf16
      
      * revert changelog
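      Saving the per-row mean and inverse standard deviation during the forward pass lets the backward kernels reuse the statistics instead of recomputing them. A host reference sketch of forward layernorm with the two extra outputs (a simple sum-of-squares version; the actual kernels use more careful blockwise/splitK reductions):

      ```cpp
      #include <cmath>
      #include <cstdio>
      #include <vector>

      // Host reference for layernorm forward over an MxN input that also saves the
      // per-row mean and inverse std for the backward pass.
      void layernorm_fwd_save_stats(int M, int N, float eps,
                                    const std::vector<float>& x,     // MxN
                                    const std::vector<float>& gamma, // N
                                    const std::vector<float>& beta,  // N
                                    std::vector<float>& y,           // MxN
                                    std::vector<float>& mean,        // M
                                    std::vector<float>& inv_std)     // M
      {
          for(int m = 0; m < M; ++m)
          {
              float sum = 0.f, sq = 0.f;
              for(int n = 0; n < N; ++n)
              {
                  sum += x[m * N + n];
                  sq += x[m * N + n] * x[m * N + n];
              }
              mean[m]         = sum / N;
              const float var = sq / N - mean[m] * mean[m];
              inv_std[m]      = 1.f / std::sqrt(var + eps); // kept positive, cf. "Let inv_std be positive"
              for(int n = 0; n < N; ++n)
                  y[m * N + n] = gamma[n] * (x[m * N + n] - mean[m]) * inv_std[m] + beta[n];
          }
      }

      int main()
      {
          const int M = 1, N = 4;
          std::vector<float> x{1.f, 2.f, 3.f, 4.f}, gamma(N, 1.f), beta(N, 0.f);
          std::vector<float> y(M * N), mean(M), inv_std(M);
          layernorm_fwd_save_stats(M, N, 1e-5f, x, gamma, beta, y, mean, inv_std);
          std::printf("mean=%g inv_std=%g y[0]=%g\n", mean[0], inv_std[0], y[0]);
      }
      ```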
  33. 11 Oct, 2023 2 commits
    • Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
      zjing14 authored
      This reverts commit a4f72a31.
    • Grouped Gemm with looping over the tiles. (#788) · a4f72a31
      Adam Osewski authored
      
      
      * Introduce LocalBlockToCTileMap.
      
      * Change the signature of the CalculateBottomIndex() function, which no longer
      accepts any argument. The B2C map, already passed as an argument to the
      kernel's Run function, now computes the block's local id outside the kernel
      body, at the __global__ entry point. The LocalB2C map stores the local block
      ID as a member (see the sketch at the end of this entry).
      
      * Use LocalBlockToCTile map in device ops.
      
      * First draft of tile loop work distribution.
      
      * Fix typo.
      
      * Simplify kernel arguments.
      
      Calculate descriptors & B2C maps on the device.
      
      * Use looping kernel.
      
      * Fix B2C constructor.
      
      * Fix Navi21 errors.
      
      * Calculate tile start/end in device kernel.
      
      * Change Run API to accept user provided workspace buffer.
      
      * Add new line at EOF.
      
      * Move Gemm KernelArguments to device op interface.
      
      * Remove unused code.
      
      * Update API.
      
      * Launch with a grid size that is the minimum of occupancy and tile count
      
      * Get back to use constant memory for gemm descriptors.
      
      * Remove unused code.
      
      * Add default virtual method implementation.
      
      * Update comments to conform with doxygen style.
      
      * Fix doc style and unused parameters.
      
      * Add thread cluster lengths to kernel name.
      
      * Remove old splitk impl and replace it with tile looping one.
      
      * Modify instances.
      
      * set KPerBlock to 64
      * maximize the vector load size wherever possible.
      
      * Fix instances cluster lengths.
      
      * Change comment style.
      
      * Use 128b store where possible in instances.
      
      * Update test cases, since KPerBlock has doubled.
      
      * Update output stream operator for Sequence.
      
      * Add pipeline version to GroupedGEMM device op type string.
      
      * Fix pipeline version type logging.
      
      * Fix input tensors type after merge.
      
      * Fix compiler error.
      
      * Fix output stream operator for Pipeline version.
      
      * Store using 128b.
      
      * Set of instances with kpb 32/64
      
      * Limit number of instances
      
      * Remove commented out instances.
      
      * Fix function name.
      
      * Limit the number of instances.
      
      Add pipeline version to the regular instances
      
      * Change thr cluster layout for reading B tensor.
      
      * disabled failed instances
      
      ---------
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
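      The core idea of the tile loop, referenced from the B2C bullets above: instead of launching one workgroup per output tile, launch a grid sized at min(occupancy limit, total tile count) and let each block stride through the tiles of all groups, with a LocalB2C-style map translating a global tile id into a (group, local tile) pair. A host simulation with made-up sizes:

      ```cpp
      #include <algorithm>
      #include <cstdio>
      #include <vector>

      // Host simulation of tile-loop work distribution: grid_size blocks stride
      // through every tile of every group instead of one block per tile.
      int main()
      {
          const std::vector<int> tiles_per_group{7, 5, 12}; // tile counts per GEMM group (made up)
          int total_tiles = 0;
          for(int t : tiles_per_group)
              total_tiles += t;

          const int max_occupancy_grid = 8; // e.g. CUs * blocks-per-CU
          const int grid_size          = std::min(max_occupancy_grid, total_tiles);

          for(int block_id = 0; block_id < grid_size; ++block_id)
              for(int tile = block_id; tile < total_tiles; tile += grid_size)
              {
                  // LocalB2C-style lookup: map the global tile id to (group, local tile).
                  int group = 0, first = 0;
                  while(tile >= first + tiles_per_group[group])
                      first += tiles_per_group[group++];
                  std::printf("block %d -> group %d, tile %d\n", block_id, group, tile - first);
              }
      }
      ```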
  34. 04 Oct, 2023 2 commits
    • Grouped conv bwd data with fp16 input and bf8fp8 comp (#962) · 04f93aad
      zjing14 authored
      
      
      * Add f8 bf8 gemm example
      
      * Add element-wise ops
      
      * Add intrinsics
      
      * Update reference calculation
      
      * Add an additional type option for xdlops gemm
      
      * Fix build process
      
      * Add bf8 to buffer addressing
      
      * Update blockwise op, split typeA and typeB
      
      * Update for compatibility
      
      * Update naming to f8->fp8
      
      * Update naming
      
      * Format
      
      * Update naming (#937)
      
      * Add a client example
      
      * Add computetypes to device and gridwise ops
      
      * Add instances, update instance factory
      
      * Format
      
      * Fix a flag
      
      * Add ckProfiler mode
      
      * Fix typos
      
      * Add an example
      
      * Add bf8 generator
      
      * add bf8 mfma; fixed type_convert for bf8
      
      * move verification ahead of timing
      
      * Update reference calculation
      
      * Fix reference
      
      * Narrow down float init range
      
      * Fix bf8 bf8 mfma
      
      * Add bf8 @ fp8 mfma
      
      * Update example
      
      * Update instances
      
      * Update profiler api
      
      * Update for compatibility
      
      * Format
      
      * Remove extra example
      
      * Clean up
      
      * workaround convert
      
      * added instance of f16_bf8f8, and client example
      
      * fixed mfma selector
      
      * format
      
      ---------
      Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
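      This entry and the next both hinge on a compute type separate from the storage type: fp16 tensors are converted to fp8/bf8 before the multiply and the products are accumulated in float. The sketch below emulates the precision effect by rounding mantissa bits on the host; it is a crude stand-in for a real fp8 conversion, not CK's type_convert.

      ```cpp
      #include <cmath>
      #include <cstdio>
      #include <vector>

      // Quantize x to a float with only keep_bits mantissa bits -- a crude
      // emulation of computing in a narrower type (fp16-like or fp8-like).
      float round_mantissa(float x, int keep_bits)
      {
          if(x == 0.f)
              return 0.f;
          int e;
          const float m = std::frexp(x, &e); // x = m * 2^e with 0.5 <= |m| < 1
          const float q = std::ldexp(std::round(std::ldexp(m, keep_bits)), -keep_bits);
          return std::ldexp(q, e);
      }

      // Dot product where both operands are converted to the compute precision
      // before the multiply, while the accumulator stays float.
      float dot(const std::vector<float>& a, const std::vector<float>& b, int compute_bits)
      {
          float acc = 0.f;
          for(std::size_t k = 0; k < a.size(); ++k)
              acc += round_mantissa(a[k], compute_bits) * round_mantissa(b[k], compute_bits);
          return acc;
      }

      int main()
      {
          const std::vector<float> a{0.3f, 1.1f, -0.4f}, b{0.7f, 0.9f, 0.2f};
          std::printf("fp16-like compute: %f\n", dot(a, b, 11)); // ~fp16 mantissa width
          std::printf("fp8-like compute:  %f\n", dot(a, b, 3));  // ~e4m3 mantissa width
      }
      ```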
    • 3d grouped conv fwd with input/output fp16 and comp fp8 (#931) · e921e1f0
      zjing14 authored
      
      
      * add f8 comp instance
      
      * fixed
      
      * fixed comments
      
      * rename
      
      * fixed dtype
      
      * format
      
      * fixed CI
      
      * fixed ci
      
      * add missing ComputeType
      
      * fixed ci
      
      * fixed
      
      * Update cmake-ck-dev.sh
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
  35. 03 Oct, 2023 1 commit