1. 08 Oct, 2024 1 commit
    • Rostyslav Geyyer's avatar
      Add a gpu gemm reference kernel (#1528) · aa932445
      Rostyslav Geyyer authored
      
      
      * Add a gpu gemm reference kernel
      
      * Switch to gpu reference in gemm examples
      
      * Remove redundant arguments
      
      * Update all related examples
      
      * Update more examples
      
      * Try less threads per block
      
      * Try even less threads per block
      
      * Add support for all matrix layouts
      
      * Increase block size
      
      * Clean up
      
      * Remove hardcoded strides
      
      * Clean up
      
      * Try a column-major case
      
      * Revert back to row-major
      
      * Run both CPU and GPU veriffication
      
      ---------
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      aa932445
  2. 06 Jul, 2024 1 commit
    • Harisankar Sadasivan's avatar
      Universal streamk with atomics (#1360) · 75e622f0
      Harisankar Sadasivan authored
      * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). 
      
      * Update README.md
      
      * fixing clang-format issues
      
      * removed conflicts in struct members between streamk and universal streamk
      
      * corrected arg parsing for streamk and universal streamk
      
      * added stream-k policies for 3 tile and 4 tile
      
      * fixed argument type issue with parsing cmd args
      
      * changes suggested in PR review are made- removing comments and correcting copyright
      
      * file permissions updated
      
      * added default value support for grid_size and streamk-policy selection set to -1
      
      * print messages for arguments
      
      * print messages for arguments
      
      * print messages for arguments1
      75e622f0
  3. 14 Apr, 2024 1 commit
    • Haocong WANG's avatar
      [GEMM] Gemm universal device operation (#1154) · f83e9701
      Haocong WANG authored
      
      
      * Optimize GEMM on MI200/300:
      1. Add new blockwise gemm pipeline
      2. Add irregular splitk intances
      
      * clang format + typo fix
      
      * Fix a bug
      
      * initial commit
      
      * Add more instances to irregular splitk
      
      * blkgemm pipeline v1~4 prototype
      
      * Sanity Checked. Known issue:
      1. Poor performance of splitk
      2. Register spill on blkgemmpipeline v3
      
      * Sanity and Performance fix:
      1. fix a bug related to sanity in grouped b2c mapping
      2. fix a bug related to sanity and performance in splitk offset
      
      * Sanity and API update:
      1. Remove prefetch stage
      2. Fix valid check bug
      3, Add first gemm_universal instance into ckProfiler
      
      * Add NN instances for gemm universal
      
      * 1. Add NT instances for gemm_universal
      2. Fix a bug about Kpadding in gemm_universal
      
      * Fix a bug regarding padding Odd K number
      
      * remove kernel print
      
      * Fix KPadding bug...
      
      * Update safety check
      
      * another try to fix kpadding..
      
      * Sanity checked
      
      * new instances..
      
      * clang format+typo fix
      
      * remove clang format script's change
      
      * Add non-hotloop compile option
      
      * 1. Add fp16xfp8 example
      2. pull packed convert f8 from pr1150
      
      * Some miscs.. opt and fix
      
      * Add pipeline description docs
      
      * Split universal gemm instance library to cut profiler compiling time
      
      * uncomment cmakefile
      
      * Fix a bug caused by blockwise_gemm_pipe_v2
      
      * reduce default splitk to 1
      
      * Add 224x256x64 tile size
      
      * update, including:
      1. Experiment pipeline 5~7
      2. Optimization for pipeline 4
      3. Organized instance library
      
      * temp save
      
      * temp save
      
      * Permuted lds layout, sanity and function checked
      
      * clang format
      
      * Move OOB check from RunRead to RunWrite, for better software pipeline.
      TODO: agpr spill when NN layout
      
      * clangformat
      
      * A/B splitpipe scheduler for v3
      
      * Fix two bugs
      
      * bug fix
      
      * fix a bug in oob check
      
      * Example for mixed fp16_fp8 gemm
      
      * Clean experimental code blocks
      
      * Add mixed precision gemm into profiler
      
      * tempsave
      
      * optimize m/n major lds layout
      
      * Add RRR GEMM  mixed precision instances
      
      * Optimize f8 matrix transpose
      
      * Add test_gemm_universal
      
      * A/B spilt schedule for blkpip v5
      
      * Take ds_read2 into iglp scheduling scheme
      
      * format
      
      * fixed cmake
      
      * Add llvm-option into CI cmake flag
      
      ---------
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      f83e9701
  4. 27 Feb, 2024 1 commit
    • Illia Silin's avatar
      Clip fp8 to +/-240 on all targets. (#1172) · d0c7b451
      Illia Silin authored
      * clip fp8 to +/-240 on all targets
      
      * if inputs to fp8 conversion are +/-inf, they remain unaltered
      
      * increase tolerance for test_elementwise_layernorm to prevent false errors
      
      * change the input values for gemm examples to floats
      
      * reduce gemm example float input values to prevent errors
      
      * increase the tolerance for gemm examples
      d0c7b451
  5. 26 Jul, 2023 1 commit
    • carlushuang's avatar
      initial stream-k implementation with example (#699) · e7dca79d
      carlushuang authored
      
      
      * initial stream-k implementation with example
      
      * fix unexpected change in err
      
      * improve a little bit performance by reorganize pipeline.
      
      * improve perf a little bit by swizzle block idx
      
      * add profiler
      
      * update example
      
      * fix spelling
      
      * shrink karg for streamk
      
      * support dynamic buffer using memory coherence glc_slc bit from template
      
      * control memory coherence while construct dynamic buffer
      
      * update reduction for streamk(not ready yet)
      
      * Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
      
      * fix build issue
      
      * fix several bug
      
      * now result is correct, everything works (but has scratch)
      
      * remove scratch by manually reset coordinate
      
      * update device code
      
      * fix a bug in final reduce
      
      * fix something in example
      
      * update async memset
      
      * fix enum as camel case
      
      * modify coherence enum name
      
      * clean code and use atomic streamk by default
      
      * remove unused var
      
      * throw exception if have empty pointer
      
      * fix format
      
      * fix CI warning
      
      * fix type in init
      
      * modify CI error
      
      * filter out on gfx10+
      
      * restore changed example code
      
      ---------
      Co-authored-by: default avatarQianfeng Zhang <Qianfeng.Zhang@amd.com>
      e7dca79d
  6. 31 May, 2023 1 commit
  7. 23 Aug, 2022 2 commits
    • Po Yen Chen's avatar
      Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
      * Re-use pre-defined alias in int4 exmples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compile sources with wrong setting
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid too much generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common codes
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common codes together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
      * Handle the case while user specify all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
      * Convert Tensor<> values after transfer data to host
      fa2d894b
    • Po Yen Chen's avatar
      Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compile with wrong setting
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Fix wrongly use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af71e1f982a4ca7406401e2fc9b1f086b32.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
      2327f1a6