1. 21 Aug, 2024 1 commit
    • Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473) · c3515f27
      Andriy Roshchenko authored
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Add CMakePresets.json
      
      * Guard off FP8 instances when the data type is not available.
      
      * Add example for Scaled FP8 Convolution with AMAX reduction.
      
      * Refactor CombConvScaleRelu instances.
      
      * Add CombConvScale instances.
      
      * Add client example for Scaled FP8 Convolution with AMAX reduction (the AMAX-to-scale step is sketched below).
      
      * Cleanup.
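      The AMAX reduction added here is the usual ingredient for FP8 scaling: reduce max(|x|) over the convolution output, then derive a scale so the values fit the FP8 range. A minimal host-side sketch of that idea (editor's illustration; the names and the scale = fp8_max / amax recipe are assumptions, not code from this PR; 448 is the OCP e4m3 maximum normal value):
      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      int main() {
          // Stand-in for the convolution output tensor (float reference values).
          std::vector<float> out = {-0.5f, 3.25f, -7.75f, 1.0f};

          // AMAX reduction: maximum absolute value over the whole tensor.
          float amax = 0.f;
          for (float v : out) amax = std::max(amax, std::fabs(v));

          // Derive an FP8 scale so the scaled tensor fits the e4m3 range.
          const float fp8_max = 448.f;
          const float scale   = amax > 0.f ? fp8_max / amax : 1.f;

          std::printf("amax = %f, scale = %f\n", amax, scale);
      }
      ```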
  2. 20 Aug, 2024 1 commit
    • Adding Instances and Examples for FP8-based Scaled Convolution with ReLU Activation and AMAX Reduction. (#1469) · a94113a9
      Andriy Roshchenko authored
      
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction.
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Guard off FP8 instances when the data type is not available.
      
      * Improve output readability.
      
      * Addressing reviewer's comments.
  3. 16 Aug, 2024 1 commit
    • Re-enable fp8 types for all architectures. (#1470) · c8b6b642
      Illia Silin authored
      * re-enable fp8 and bf8 for all targets
      
      * restore the fp8 gemm instances
      
      * re-enable conv_3d fp8 on all architectures
      
      * disable several fp8 gemm instances on all architectures except gfx94
      
      * clang format fix
  4. 14 Aug, 2024 1 commit
    • [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add (a generic CAS-loop emulation is sketched below)
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 examples if the arch is not supported
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
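      Native bf16 atomic adds are not available on every target, which is why their use is guarded and limited in the commits above. Below is a generic CAS-loop emulation of a bf16 atomic add, written as plain host C++ with std::atomic purely to illustrate the technique; it is not CK's device code, and NaN handling is omitted.
      ```cpp
      #include <atomic>
      #include <cstdint>
      #include <cstdio>
      #include <cstring>

      // bf16 is the upper 16 bits of an IEEE binary32 float.
      static float bf16_to_float(std::uint16_t h) {
          std::uint32_t bits = std::uint32_t(h) << 16;
          float f;
          std::memcpy(&f, &bits, sizeof(f));
          return f;
      }
      static std::uint16_t float_to_bf16(float f) { // round-to-nearest-even, NaN ignored
          std::uint32_t bits;
          std::memcpy(&bits, &f, sizeof(bits));
          bits += 0x7FFFu + ((bits >> 16) & 1u);
          return std::uint16_t(bits >> 16);
      }

      // Emulated bf16 atomic add: CAS loop on the 32-bit word holding two bf16 lanes.
      void atomic_add_bf16(std::atomic<std::uint32_t>* word, int lane, float v) {
          std::uint32_t old_w = word->load(std::memory_order_relaxed);
          for (;;) {
              std::uint16_t old_h = std::uint16_t(lane ? (old_w >> 16) : (old_w & 0xFFFFu));
              std::uint16_t new_h = float_to_bf16(bf16_to_float(old_h) + v);
              std::uint32_t new_w = lane ? ((old_w & 0x0000FFFFu) | (std::uint32_t(new_h) << 16))
                                         : ((old_w & 0xFFFF0000u) | new_h);
              // On failure, old_w is refreshed with the current value and we retry.
              if (word->compare_exchange_weak(old_w, new_w, std::memory_order_relaxed)) return;
          }
      }

      int main() {
          std::atomic<std::uint32_t> word{0}; // two bf16 lanes, both 0.0
          atomic_add_bf16(&word, 0, 1.5f);
          atomic_add_bf16(&word, 0, 2.5f);
          std::printf("lane0 = %f\n", bf16_to_float(std::uint16_t(word.load() & 0xFFFFu))); // 4.0
      }
      ```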
  5. 12 Aug, 2024 1 commit
  6. 09 Aug, 2024 1 commit
  7. 06 Aug, 2024 2 commits
    • adding mha as static lib (#1366) · 840c5397
      bibek authored
      
      
      * adding mha as static lib
      
      * add fmha fwd compile options
      
      * typo
      
      * fix python version
      
      * python version to 3
      
      * increase path length
      
      * add max path flag in mha cmake
      
      * fix long path issue
      
      * mha currently only runs in gfx94x
      
      * only build mha in mi300
      
      * populate gpu_list
      
      * add mha compile flags
      
      * avoid building mha on GPUs other than gfx94x
      
      * some comments and include ck_tile in rocm
      
      * use rocm_install
      
      * place ck_tile in include
      
      * correct ck_tile path
      
      ---------
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
    • Add Grouped Conv Fwd Large Tensor kernel (#1432) · 4ec5c52a
      Bartłomiej Kocot authored
      * Support 64 bit indexing
      
      * Add new grouped conv fwd kernel for large tensors
      
      * Add instances large tensor
      
      * Fixes for transform conv to gemm
      
      * Fixes
      
      * fixes
      
      * Remove unneeded instances
      
      * example fixes
      
      * Remove unneeded Ds arrays
      
      * Fix tests
      
      * Add 2GB check in gridwise dl (see the 2GB-check sketch below)
      
      * Fixes
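      The 2GB check relates to 32-bit offset addressing: a tensor larger than 2^31 bytes cannot be addressed with 32-bit offsets and has to take the 64-bit-indexing / large-tensor path. A minimal sketch of such a check (the helper name and the example shape are made up for illustration, not taken from this PR):
      ```cpp
      #include <cstdint>
      #include <cstdio>

      // A tensor addressed with 32-bit offsets must stay below 2 GiB.
      constexpr std::int64_t k2GiB = std::int64_t{1} << 31;

      bool needs_large_tensor_kernel(std::int64_t num_elements, std::int64_t bytes_per_element) {
          return num_elements * bytes_per_element >= k2GiB;
      }

      int main() {
          // Hypothetical NDHWGC fp16 tensor: 2 x 64 x 128 x 128 x 4 x 192 elements (~3 GiB).
          const std::int64_t elems = 2LL * 64 * 128 * 128 * 4 * 192;
          std::printf("large-tensor path needed: %s\n",
                      needs_large_tensor_kernel(elems, /*fp16*/ 2) ? "yes" : "no");
      }
      ```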
  8. 05 Aug, 2024 1 commit
  9. 30 Jul, 2024 1 commit
  10. 24 Jul, 2024 1 commit
    • Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412) · 4a8a1bef
      Andriy Roshchenko authored
      
      * Add CMakePresets configurations.
      
      * Add binary elementwise ConvScaleAdd and an example.
      
      * Numerical verification of results.
      
      Observed significant irregularities in F8 to F32 type conversions (see the FP8 rounding sketch below):
      ```log
      ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
      ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
      ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
      ```
      
      * Implemented ConvScaleAdd + Example.
      
      * Add ConvScale+Bias Instances
      
      * Add Client Example for ConvScale+Bias
      
      * Fix number of bytes in an example.
      
      * Cleanup.
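      The gaps in the log above are on the order of one FP8 ULP: e4m3 has 3 mantissa bits, so representable values in [128, 256) are 16 apart and in [64, 128) are 8 apart. Hence 145 can only land on 144 or 160, 97 on 96 or 104, and 65 on 64 or 72 — exactly the pairs seen in the log. A small sketch that prints those neighbours (illustrative quantizer math, not the library's conversion code):
      ```cpp
      #include <cmath>
      #include <cstdio>

      // Spacing (ULP) of e4m3 around a positive normal value x: 2^(floor(log2 x) - 3),
      // because e4m3 carries 3 mantissa bits.
      float e4m3_ulp(float x) {
          const int e = static_cast<int>(std::floor(std::log2(x)));
          return std::ldexp(1.0f, e - 3);
      }

      int main() {
          const float xs[] = {145.f, 97.f, 65.f};
          for (float x : xs) {
              const float ulp = e4m3_ulp(x);
              const float lo  = std::floor(x / ulp) * ulp; // nearest representable below
              const float hi  = lo + ulp;                  // nearest representable above
              std::printf("x = %6.1f  e4m3 step = %4.1f  neighbours: %.1f / %.1f\n", x, ulp, lo, hi);
          }
      }
      ```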
  11. 23 Jul, 2024 1 commit
  12. 22 Jul, 2024 1 commit
  13. 19 Jul, 2024 2 commits
    • [GEMM] F8 GEMM, performance optimized. (#1384) · 8c90f25b
      Haocong WANG authored
      
      
      * add ab_scale init support
      
      * enabled interwave
      
      * add scale type; update isSupport
      
      * adjust example
      
      * clean
      
      * enable f8 pure gemm rcr ckprofiler
      
      * Add gemm_multiply_multiply instances
      
      * clang format
      
      * Optimize for ScaleBlockMNK=128 (block-wise AB scaling is sketched below)
      
      * enable abscale f8 gemm ck profiler
      
      * Add pure f8 gemm test suite
      
      * Reverting to the state of the project at f60fd77
      
      * update copyright
      
      * clang format
      
      * update copyright
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
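      A host-side reference of the AB-scaled GEMM idea: A and B carry one scale per block of elements (with ScaleBlockMNK as the block granularity), and each product is multiplied by the corresponding block scales. The block size, scale layout and values below are assumptions chosen for a tiny illustration, not the instances' actual configuration:
      ```cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int M = 4, N = 4, K = 8, ScaleBlock = 4; // tiny stand-in for 128
          std::vector<float> A(M * K, 2.0f), B(K * N, 3.0f);
          // One scale per (row-block, k-block) of A and per (k-block, col-block) of B.
          std::vector<float> a_scale((M / ScaleBlock) * (K / ScaleBlock), 0.5f);
          std::vector<float> b_scale((K / ScaleBlock) * (N / ScaleBlock), 0.25f);

          std::vector<float> C(M * N, 0.0f);
          for (int m = 0; m < M; ++m)
              for (int n = 0; n < N; ++n)
                  for (int k = 0; k < K; ++k) {
                      const float sa = a_scale[(m / ScaleBlock) * (K / ScaleBlock) + k / ScaleBlock];
                      const float sb = b_scale[(k / ScaleBlock) * (N / ScaleBlock) + n / ScaleBlock];
                      C[m * N + n] += (A[m * K + k] * sa) * (B[k * N + n] * sb);
                  }
          std::printf("C[0][0] = %f (expected %f)\n", C[0], 2.0f * 0.5f * 3.0f * 0.25f * K);
      }
      ```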
    • Universal gemm splitk using reduce (with multi-d) (#1341) · c544eb4d
      ltqin authored
      
      
      * init for reduce_threadwise multi_d
      
      * add reduce_threadwise_multi_d
      
      * add reduce_multi_d
      
      * clean
      
      * start adding another splitk device op (the split-K-plus-reduction scheme is sketched below)
      
      * add reduce template parameter to SplitKBatchOffset
      
      * add reduce c matrix
      
      * clean up code
      
      * change example data type to bf16
      
      * add bf16Ai8B example
      
      * remove reduce template parameter
      
      * add splitk atomic status to v4
      
      * example add multi d parameters
      
      * device op add multi-d parameters
      
      * add multi-d to reduce
      
      * fix kbatch=1 bug
      
      * change B layout to col in bf16Ai8B example
      
      * remove float adding struct
      
      * change  multi-d interface
      
      * change file and class name
      
      * remove multi-d of bf16Ai8B example
      
      * change IsReduce function to IsReduceAdd
      
      * change example layout to RRR from RCR
      
      * set Ds stride according to layout
      
      * reset parameter layout
      
      * add gemm universal reduce instance
      
      * add reduce factory
      
      * add profile_gemm_universal_reduce
      
      * add reduce to profiler
      
      * fix reduce instance
      
      * fix profiler reduce compiling bug
      
      * format
      
      * format library instance code
      
      * add mem instance for reduce library
      
      * fix call instance names
      
      * add workspace for reduce in ckProfiler
      
      * format
      
      * add mnpadding to reduce library instance
      
      * add fp16 instance to reduce of profiler
      
      * change copyright time
      
      * restore profiler cmake file
      
      * add reduce text to instances
      
      * add DsLayout and DsDataType to instances template parameter
      
      * fixed gemm_reduce_multi_d
      
      * add an example without multi_d
      
      * Update common.hpp
      
      * Update gtest.cmake
      
      * Update gemm_xdl_splitk_reduce_bf16.cpp
      
      * clean
      
      * Update gtest.cmake
      
      * format
      
      * fix api
      
      * format
      
      * default parameter change to RRR
      
      * add vector_len for multi_d
      
      * format
      
      * Update gtest.cmake
      
      * fix bf16Ai8B elementwise op
      
      * add ReduceDataType
      
      * move ReduceDataType to end position
      
      * format
      
      * remove googletest git method address
      
      * fix copyright time
      
      * update init data
      
      ---------
      Co-authored-by: root <jizhan@amd.com>
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
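      A host-side reference of the split-K-with-reduction scheme this PR builds: each k-batch writes its partial C tile into its own workspace slice, and a second pass reduces (adds) the partials, which is also where the multi-D elementwise fusion would go. Sizes and the workspace layout are illustrative only:
      ```cpp
      #include <cstdio>
      #include <vector>

      int main() {
          const int M = 2, N = 2, K = 8, KBatch = 4, KPerBatch = K / KBatch;
          std::vector<float> A(M * K, 1.0f), B(K * N, 2.0f);

          // Stage 1: each k-batch writes its partial C tile into a workspace slice.
          std::vector<float> workspace(KBatch * M * N, 0.0f);
          for (int kb = 0; kb < KBatch; ++kb)
              for (int m = 0; m < M; ++m)
                  for (int n = 0; n < N; ++n) {
                      float acc = 0.f;
                      for (int k = kb * KPerBatch; k < (kb + 1) * KPerBatch; ++k)
                          acc += A[m * K + k] * B[k * N + n];
                      workspace[(kb * M + m) * N + n] = acc;
                  }

          // Stage 2: reduce (add) the partials; multi-D tensors / elementwise ops
          // would be fused into this pass.
          std::vector<float> C(M * N, 0.0f);
          for (int kb = 0; kb < KBatch; ++kb)
              for (int i = 0; i < M * N; ++i)
                  C[i] += workspace[kb * M * N + i];

          std::printf("C[0][0] = %f (expected %f)\n", C[0], 2.0f * K);
      }
      ```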
  14. 16 Jul, 2024 2 commits
  15. 12 Jul, 2024 1 commit
  16. 11 Jul, 2024 1 commit
  17. 09 Jul, 2024 1 commit
  18. 06 Jul, 2024 1 commit
    • Universal streamk with atomics (#1360) · 75e622f0
      Harisankar Sadasivan authored
      * Universal streamk with atomics, with ckProfiler support. grid_size and the streamk strategy are tunable: a grid_size of -1 sets #WGs = maximum occupancy x num_CUs (see the grid-size sketch below). The implementation supports several streamk policies (1-tile, 2-tile, 3-tile and 4-tile); a streamk strategy of -1 selects the default policy (4-tile).
      
      * Update README.md
      
      * fixing clang-format issues
      
      * removed conflicts in struct members between streamk and universal streamk
      
      * corrected arg parsing for streamk and universal streamk
      
      * added stream-k policies for 3 tile and 4 tile
      
      * fixed argument type issue with parsing cmd args
      
      * apply changes suggested in PR review: remove comments and correct copyright
      
      * file permissions updated
      
      * added default value support for grid_size and streamk-policy selection set to -1
      
      * print messages for arguments
      
      * print messages for arguments
      
      * print messages for arguments1
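      A minimal sketch of the grid-size rule described above, with placeholder values where a real implementation would query the runtime for the CU count and the achievable occupancy (the 304-CU / 2-workgroup numbers below are hypothetical):
      ```cpp
      #include <cstdio>

      // grid_size == -1 picks (max occupancy per CU) x (number of CUs);
      // any other value is used as requested.
      int resolve_grid_size(int requested_grid_size, int num_cu, int max_occupancy_per_cu) {
          return requested_grid_size == -1 ? num_cu * max_occupancy_per_cu : requested_grid_size;
      }

      int main() {
          // Hypothetical device: 304 CUs, 2 workgroups resident per CU.
          std::printf("grid_size = %d\n", resolve_grid_size(-1, 304, 2));
      }
      ```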
  19. 27 Jun, 2024 1 commit
  20. 22 Jun, 2024 1 commit
  21. 18 Jun, 2024 1 commit
  22. 12 Jun, 2024 1 commit
  23. 10 Jun, 2024 1 commit
  24. 05 Jun, 2024 2 commits
    • Integrate universal gemm with conv forward (#1320) · ac58cc5d
      Bartłomiej Kocot authored
      * Integrate universal gemm with conv fwd
      
      * Fix conv fwd wmma test
      
      * Fix instances
      
      * Remove direct load check
    • Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers (see the functor sketch below)
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
      Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
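      A hedged sketch of what an elementwise scale functor along these lines can look like once the scale is carried by value rather than by pointer; the struct name and signature are made up for illustration and are not this PR's exact definition:
      ```cpp
      #include <cstdio>

      // Illustrative elementwise scale op: the scale is stored by value and
      // applied in operator(), converting through float.
      struct Scale {
          float scale_;
          explicit Scale(float s) : scale_(s) {}

          template <typename Y, typename X>
          void operator()(Y& y, const X& x) const {
              y = static_cast<Y>(static_cast<float>(x) * scale_);
          }
      };

      int main() {
          Scale op{0.25f};
          float y = 0.f;
          op(y, 8.0f);
          std::printf("y = %f\n", y); // 2.0
      }
      ```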
  25. 23 May, 2024 1 commit
  26. 22 May, 2024 2 commits
  27. 08 May, 2024 1 commit
  28. 01 May, 2024 1 commit
  29. 29 Apr, 2024 1 commit
  30. 26 Apr, 2024 3 commits
    • [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
      * workaround for a specific mixed gemm instance (bring it back when the compiler fix is available)
      
      * fix formatting condition statement issue
      
      * solve conflict
      
      ---------
      Co-authored-by: Jun Liu <Liu.Jun@amd.com>
    • ggemm tile_loop multD bf16 int8 (#1258) · 5ae893c0
      zjing14 authored
      
      
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * On host, validate only the NK dims if M is not known.
      
      * add
      
      * clean
      
      * refactor
      
      * clean
      
      * add examples
      
      * add fuse
      
      * add fusion and client example
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
      
      * clean
      
      ---------
      Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
    • bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
      * add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
  31. 25 Apr, 2024 1 commit
    • Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
      * Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation (the tile-loop scheduling is sketched below).
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
      * Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
      * Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
      * Validate tensor transfer parameters.
      
      * On host, validate only the NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
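      A conceptual sketch of the tile-loop scheduling: instead of dedicating a grid per group, a fixed set of workgroups strides over the flattened list of output tiles of all groups and maps each flat index back to (group, m-tile, n-tile). Group shapes and the grid size below are illustrative, not taken from the instances:
      ```cpp
      #include <cstdio>
      #include <vector>

      struct GemmDesc { int MTiles, NTiles; }; // per-group output tile grid

      int main() {
          const std::vector<GemmDesc> groups = {{4, 2}, {1, 8}, {3, 3}};
          int total_tiles = 0;
          for (const auto& g : groups) total_tiles += g.MTiles * g.NTiles;

          const int grid_size = 4; // number of persistent workgroups
          for (int wg = 0; wg < grid_size; ++wg) {
              for (int tile = wg; tile < total_tiles; tile += grid_size) {
                  // Map the flat tile index back to (group, m_tile, n_tile).
                  int t = tile, group = 0;
                  while (t >= groups[group].MTiles * groups[group].NTiles) {
                      t -= groups[group].MTiles * groups[group].NTiles;
                      ++group;
                  }
                  const int m = t / groups[group].NTiles;
                  const int n = t % groups[group].NTiles;
                  std::printf("wg %d -> group %d tile (%d, %d)\n", wg, group, m, n);
              }
          }
      }
      ```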
  32. 19 Apr, 2024 2 commits