1. 20 Oct, 2023 5 commits
  2. 19 Oct, 2023 5 commits
  3. 18 Oct, 2023 4 commits
    • Layernorm and groupnorm support to save mean and inverse std in forward (#929) · 3696fe1c
      rocking authored
      * save mean and inverse std in normalization
      
      * Save mean and inverse std in splitK
      
      * Vector save mean and inv std
      
      * Modify instance for save mean and std
      
      * simplify the layernorm example
      
      * Save mean and std in groupnorm example
      
      * Save mean and inv std in ckProfiler and test
      
      * Remove compute data type from base class
      
      * Save mean and inv std in client example
      
      * Add changelog
      
      * clang format
      
      * Fix compile error
      
      * Refine naming
      
      * Avoid error in bf16
      
      * revert changelog
    • fixed math-ci error; suspend a warning (#996) · 58338bb2
      zjing14 authored
      
      Co-authored-by: Jing Zhang <jizha@amd.com>
    • Clean DTYPES conditions in CMake (#974) · bf435140
      zjing14 authored
      
      
      * Add a condition to build fp8 instances
      
      * simplified buffer_load/store
      
      * add bfp8/fp8
      
      * fixed
      
      * remove all f8/bf8 condition include folder
      
      * fixed cmake conditions
      
      * fixed DTYPES=fp16/bfp16
      
      * fix
      
      * fixed buffer_load
      
      * fixed buffer_store
      
      * fix
      
      * clean example cmake files
      
      * fixed ci
      
      * fixed ci
      
      ---------
      Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
    • Add contraction_multi_abd (#972) · 1cc36ba5
      zjing14 authored
      
      
      * add gridwise_multi_abd
      
      * move element_op into RunRead
      
      * merge element_wise op with data read
      
      * add multiABD example
      
      * allow packed elementwise_op
      
      * changed example
      
      * clean
      
      * clean
      
      * add is_detected
      
      * fix
      
      * minor fix
      
      * add scaleAdd_vec4 example
      
      * init commit for contraction_multi_ABD
      
      * add examples
      
      * add examples of multiA and broadcast
      
      * update example
      
      * fixed comments
      
      * Update cmake-ck-dev.sh
      
      * Update cmake-ck-dev.sh
      
      * Add comments into the example
      
      * Update CMakeLists.txt
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
  4. 17 Oct, 2023 9 commits
  5. 16 Oct, 2023 8 commits
  6. 13 Oct, 2023 2 commits
  7. 12 Oct, 2023 2 commits
  8. 11 Oct, 2023 2 commits
    • Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
      zjing14 authored
      This reverts commit a4f72a31.
    • Grouped Gemm with looping over the tiles. (#788) · a4f72a31
      Adam Osewski authored
      
      
      * Introduce LocalBlockToCTileMap.
      
      * Change the signature of CalculateBottomIndex(), which now takes no
      arguments. The B2C map, which is already passed as an argument to the
      kernel Run function, computes the block's local ID outside, at the
      kernel's __global__ entry point.
      The LocalB2C map stores the local block ID as a member.
      
      * Use LocalBlockToCTile map in device ops.
      
      * First draft of tile loop work distribution.
      
      * Fix typo.
      
      * Simplify kernel arguments.
      
      Calculate descriptors & B2C maps on the device.
      
      * Use looping kernel.
      
      * Fix B2C constructor.
      
      * Fix Navi21 errors.
      
      * Calculate tile start/end in device kernel.
      
      * Change Run API to accept user provided workspace buffer.
      
      * Add new line at EOF.
      
      * Move Gemm KernelArguments to device op interface.
      
      * Remove unused code.
      
      * Update API.
      
      * Launch a grid whose size is the minimum of occupancy and tile count.
      
      * Go back to using constant memory for gemm descriptors.
      
      * Remove unused code.
      
      * Add default virtual method implementation.
      
      * Update comments to conform with doxygen style.
      
      * Fix doc style and unused parameters.
      
      * Add thread cluster lengths to kernel name.
      
      * Remove old splitk impl and replace it with tile looping one.
      
      * Modify instances.
      
      * Set KPerBlock to 64.
      * Maximize vector load size wherever possible.
      
      * Fix instances cluster lengths.
      
      * Change comment style.
      
      * Use 128b store where possible in instances.
      
      * Update test cases, since KPerBlock has doubled.
      
      * Update output stream operator for Sequence.
      
      * Add pipeline version to GroupedGEMM device op type string.
      
      * Fix pipeline version type logging.
      
      * Fix input tensors type after merge.
      
      * Fix compiler error.
      
      * Fix output stream operator for Pipeline version.
      
      * Store using 128b.
      
      * Set of instances with kpb 32/64
      
      * Limit number of instances
      
      * Remove commented out instances.
      
      * Fix function name.
      
      * Limit the number of instances.
      
      Add pipeline version to the regular instances.
      
      * Change thread cluster layout for reading B tensor.
      
      * disabled failed instances
      
      ---------
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
  9. 10 Oct, 2023 2 commits
  10. 05 Oct, 2023 1 commit