1. 10 Mar, 2024 1 commit
  2. 01 Mar, 2024 1 commit
  3. 27 Feb, 2024 1 commit
    • Clip fp8 to +/-240 on all targets. (#1172) · d0c7b451
      Illia Silin authored
      * clip fp8 to +/-240 on all targets
      
      * if inputs to fp8 conversion are +/-inf, they remain unaltered
      
      * increase tolerance for test_elementwise_layernorm to prevent false errors
      
      * change the input values for gemm examples to floats
      
      * reduce gemm example float input values to prevent errors
      
      * increase the tolerance for gemm examples
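The clipping rule in this commit can be sketched as a small host-side helper. This is only an illustration under stated assumptions: the name `clip_for_fp8` is hypothetical, the bound 240 and the pass-through of +/-inf come from the commit message, and passing NaN through is an extra assumption of this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper (not the library's API): saturate a float to the
// finite fp8 range before it is converted.
inline float clip_for_fp8(float x)
{
    constexpr float fp8_max = 240.0f; // clip bound named in the commit

    // +/-inf stay unaltered, as the commit states. Letting NaN through
    // as well is an assumption of this sketch.
    if(std::isinf(x) || std::isnan(x))
        return x;

    return std::clamp(x, -fp8_max, fp8_max);
}
```

Calling `clip_for_fp8(1000.0f)` yields `240.0f`, while an infinite input is returned unchanged.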
  4. 26 Feb, 2024 1 commit
  5. 24 Feb, 2024 1 commit
  6. 17 Feb, 2024 1 commit
  7. 15 Feb, 2024 1 commit
  8. 14 Feb, 2024 1 commit
  9. 07 Feb, 2024 1 commit
  10. 02 Feb, 2024 1 commit
  11. 24 Jan, 2024 1 commit
    • Fixing most of the cppcheck errors. (#1142) · 180e5720
      Illia Silin authored
      * fix cppcheck errors, first pass
      
      * fix format
      
      * fix returned value in examples
      
      * add macro definitions for cppcheck
      
      * fix the profile_gemm logic
      
      * update the gemm profiler logic
      
      * add more definitions to cppcheck, fix a couple more errors
      
      * replace runtime error with message in device function
      
      * fix a couple of int4 issues
      
      * no return for fill function
      
      * fix errors in data_types.hpp
      
      * fix format
      
      * fix a few remaining errors
      
      * fix errors in data_types.hpp
      
      * fix last couple of errors in data_types.hpp
  12. 19 Jan, 2024 1 commit
  13. 03 Jan, 2024 1 commit
  14. 13 Dec, 2023 1 commit
  15. 06 Dec, 2023 1 commit
    • Introduce wrapper library (#1071) · 836b7e55
      Bartłomiej Kocot authored
      * Introduce wrapper library
      
      * Update cmake files
      
      * Revert "Update cmake files"
      
      This reverts commit c27f88b56590c11a88e26d5d0df7aca51a08133d.
      
      * Fix comments
  16. 03 Dec, 2023 1 commit
    • Add support for double buffering in direct load GEMM kernel (#1052) · bc4bf9bd
      Bartlomiej Wroblewski authored
      This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions.
      
      Direct loads now use inline asm instead of intrinsics. With intrinsics, the compiler adds extra waitcnt instructions, which breaks the possible load/compute overlap when double buffering.
      
      Using inline asm in turn requires sched_barrier, so that the compiler cannot incorrectly reschedule instructions: it does not know the data dependencies between global->LDS and LDS->registers.
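The double-buffering schedule this PR refers to can be illustrated with a conceptual host-side sketch. It is not the kernel code: `staging` merely stands in for LDS and `src` for global memory, the function name is hypothetical, and on a CPU the "prefetch" runs sequentially. On the GPU, the load of the next tile would be issued asynchronously so that it overlaps with compute on the current one.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Conceptual ping-pong loop: while tile i is consumed from one buffer,
// tile i+1 is loaded into the other. Names are illustrative only.
template <std::size_t TileSize>
float sum_tiles_double_buffered(const std::vector<float>& src)
{
    std::array<std::array<float, TileSize>, 2> staging{}; // two staging buffers
    const std::size_t num_tiles = src.size() / TileSize;  // tail elements are ignored

    auto load = [&](std::size_t tile, std::size_t buf) {
        for(std::size_t i = 0; i < TileSize; ++i)
            staging[buf][i] = src[tile * TileSize + i];
    };

    float acc = 0.0f;
    if(num_tiles == 0)
        return acc;

    load(0, 0); // prologue: fill the first buffer
    for(std::size_t tile = 0; tile < num_tiles; ++tile)
    {
        const std::size_t cur = tile % 2;
        if(tile + 1 < num_tiles)
            load(tile + 1, 1 - cur); // prefetch the next tile into the other buffer

        for(float v : staging[cur]) // compute on the current buffer
            acc += v;
    }
    return acc;
}
```

The key property is that the buffer being written is never the one being read. On real hardware that ordering has to be enforced explicitly, which is the role the commit message assigns to sched_barrier.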
  17. 30 Nov, 2023 1 commit
  18. 28 Nov, 2023 1 commit
  19. 25 Nov, 2023 1 commit
    • Add basic support for direct loads from global to LDS (#999) · 627054b9
      Bartlomiej Wroblewski authored
      * Add basic support for direct loads from global to LDS
      
      * Clean the code and comments
      
      * Add support for fp16
      
      * Add comments
      
      * Add check for thread cluster lengths
      
      * Align non-direct-load fp16 example
      
      * Small fixes
      
      * Extend IsSupported to check for supported GPU gens
      
      * Build examples only on the supported HW
      
      * Do not throw when instance not supported in 04 example
      
      * Review: Apply review suggestions
      
      * Review: small fix
      
      * Review: small fix
  20. 07 Nov, 2023 1 commit
  21. 28 Oct, 2023 1 commit
  22. 20 Oct, 2023 1 commit
  23. 19 Oct, 2023 2 commits
  24. 18 Oct, 2023 1 commit
  25. 16 Oct, 2023 1 commit
  26. 13 Oct, 2023 1 commit
  27. 12 Oct, 2023 1 commit
  28. 11 Oct, 2023 2 commits
    • Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
      zjing14 authored
      This reverts commit a4f72a31.
    • Grouped Gemm with looping over the tiles. (#788) · a4f72a31
      Adam Osewski authored
      
      
      * Introduce LocalBlockToCTileMap.
      
      * Change the signature of CalculateBottomIndex() so that it no longer
      takes any arguments. The B2C map, which is already passed as an
      argument to the kernel Run function, now computes the block's local ID
      outside, at the __global__ kernel entry point.
      The LocalB2C map stores the local block ID as a member.
      
      * Use LocalBlockToCTile map in device ops.
      
      * First draft of tile loop work distribution.
      
      * Fix typo.
      
      * Simplify kernel arguments.
      
      Calculate descriptors & B2C maps on the device.
      
      * Use looping kernel.
      
      * Fix B2C constructor.
      
      * Fix Navi21 errors.
      
      * Calculate tile start/end in device kernel.
      
      * Change Run API to accept user provided workspace buffer.
      
      * Add new line at EOF.
      
      * Move Gemm KernelArguments to device op interface.
      
      * Remove unused code.
      
      * Update API.
      
      * Launch a grid whose size is the minimum of occupancy and tile count
      
      * Get back to use constant memory for gemm descriptors.
      
      * Remove unused code.
      
      * Add default virtual method implementation.
      
      * Update comments to conform with doxygen style.
      
      * Fix doc style and unused parameters.
      
      * Add thread cluster lengths to kernel name.
      
      * Remove old splitk impl and replace it with tile looping one.
      
      * Modify instances.
      
      * set KPerBlock to 64
      * maximize vector load size wherever possible.
      
      * Fix instances cluster lengths.
      
      * Change comment style.
      
      * Use 128b store where possible in instances.
      
      * Update test cases, since KPerBlock has doubled.
      
      * Update output stream operator for Sequence.
      
      * Add pipeline version to GroupedGEMM device op type string.
      
      * Fix pipeline version type logging.
      
      * Fix input tensors type after merge.
      
      * Fix compiler error.
      
      * Fix output stream operator for Pipeline version.
      
      * Store using 128b.
      
      * Set of instances with kpb 32/64
      
      * Limit number of instances
      
      * Remove commented out instances.
      
      * Fix function name.
      
      * Limit the number of instances.
      
      Add pipeline version to the regular instances
      
      * Change thread cluster layout for reading B tensor.
      
      * disabled failed instances
      
      ---------
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
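The tile-loop work distribution named in this commit ("launch grid size which is min of occupancy vs tile count", "calculate tile start/end in device kernel") can be sketched on the host as follows. All names here are hypothetical, and the contiguous split of tiles across blocks is an assumption of this sketch; the commit message does not say how tiles are assigned.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical: launch no more blocks than the device can keep resident,
// and no more than there are tiles to process.
inline int choose_grid_size(int max_resident_blocks, int tile_count)
{
    return std::min(max_resident_blocks, tile_count);
}

// Half-open range [begin, end) of tile indices handled by one block.
struct TileRange
{
    int begin;
    int end;
};

// Assumed policy: each block loops over one contiguous chunk of tiles.
inline TileRange tile_range_for_block(int block_id, int grid_size, int tile_count)
{
    assert(grid_size > 0 && block_id >= 0 && block_id < grid_size);
    const int per_block = (tile_count + grid_size - 1) / grid_size; // ceiling division
    const int begin     = std::min(block_id * per_block, tile_count);
    const int end       = std::min(begin + per_block, tile_count);
    return {begin, end};
}
```

With 10 tiles and room for 4 resident blocks, the grid has 4 blocks; blocks 0 to 2 each loop over 3 tiles and block 3 handles the remaining one.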
  29. 10 Oct, 2023 1 commit
  30. 04 Oct, 2023 1 commit
    • Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945) · 42facfc6
      Rostyslav Geyyer authored
      
      
      * Add f8 bf8 gemm example
      
      * Add element-wise ops
      
      * Add intrinsics
      
      * Update reference calculation
      
      * Add an additional type option for xdlops gemm
      
      * Fix build process
      
      * Add bf8 to buffer addressing
      
      * Update blockwise op, split typeA and typeB
      
      * Update for compatibility
      
      * Update naming to f8->fp8
      
      * Update naming
      
      * Format
      
      * Update naming (#937)
      
      * Add a client example
      
      * Add computetypes to device and gridwise ops
      
      * Add instances, update instance factory
      
      * Format
      
      * Fix a flag
      
      * Add ckProfiler mode
      
      * Fix typos
      
      * Add an example
      
      * Add bf8 generator
      
      * add bf8 mfma; fixed type_convert for bf8
      
      * move verification ahead of timing
      
      * Update reference calculation
      
      * Fix reference
      
      * Narrow down float init range
      
      * Fix bf8 bf8 mfma
      
      * Add bf8 @ fp8 mfma
      
      * Update example
      
      * Update instances
      
      * Update profiler api
      
      * Update for compatibility
      
      * Format
      
      * Remove extra example
      
      * Clean up
      
      * workaround convert
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
  31. 02 Oct, 2023 1 commit
    • Add fp8 @ bf8 gemm support and example (#933) · bd09b5c5
      Rostyslav Geyyer authored
      * Add f8 bf8 gemm example
      
      * Add element-wise ops
      
      * Add intrinsics
      
      * Update reference calculation
      
      * Add an additional type option for xdlops gemm
      
      * Fix build process
      
      * Add bf8 to buffer addressing
      
      * Update blockwise op, split typeA and typeB
      
      * Update for compatibility
      
      * Update naming to f8->fp8
      
      * Update naming
      
      * Format
  32. 27 Sep, 2023 3 commits
    • Handle type conversions to a const datatype (#944) · f4af5aed
      Bartlomiej Wroblewski authored
      * Handle type conversions to a const datatype
      
      * Review: Handle X being const data type as well
      
      * Review: Remove typo
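One common way to handle conversions where the source or destination type is const-qualified is to strip the qualifiers first and dispatch to rules written only for the unqualified types. The sketch below shows that idea in isolation. The `sketch` namespace, the `converter` struct and the rounding rule are all hypothetical and are not taken from the library.

```cpp
#include <type_traits>

namespace sketch {

// Conversion rules are written once, for non-const types only.
template <typename Y, typename X>
struct converter
{
    static constexpr Y apply(X x) { return static_cast<Y>(x); }
};

// Example custom rule: float -> int rounds to nearest instead of truncating.
template <>
struct converter<int, float>
{
    static constexpr int apply(float x)
    {
        return static_cast<int>(x < 0 ? x - 0.5f : x + 0.5f);
    }
};

// Entry point: remove const/volatile from both Y and X, then dispatch,
// so that type_convert<const int>(...) reuses the <int, float> rule.
template <typename Y, typename X>
constexpr Y type_convert(X x)
{
    using YBase = std::remove_cv_t<Y>;
    using XBase = std::remove_cv_t<X>;
    return converter<YBase, XBase>::apply(x);
}

} // namespace sketch
```

Without the `std::remove_cv_t` step, a request for `const int` from `float` would fall through to the generic rule and truncate instead of using the specialized one.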
    • Add column to image kernel (#930) · e2243a4d
      Bartłomiej Kocot authored
      * Add column to image kernel
      
      * Minor fixes for dtypes and client examples
      
      * Disable tests for disabled dtypes
      
      * Disable add instances functions for disabled data types
      
      * Minor stylistic fixes
      
      * Revert "Disable add instances functions for disabled data types"
      
      This reverts commit 728b86956378dcd9415fd0f2557833a068fe1c10.
      
      * Instances reduction
      
      * Add comments in device_column_to_image_impl
      
      * Update changelog and Copyrights
      
      * Improve changelog
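For readers unfamiliar with the operation, column-to-image is the inverse rearrangement of image-to-column: each row of the column matrix is one flattened patch, and overlapping patches are summed back into the image. The following is a minimal single-channel CPU reference under simplifying assumptions (no padding, no dilation, row-major layouts). The function name and signature are hypothetical and unrelated to the device op added in this commit.

```cpp
#include <cstddef>
#include <vector>

// columns: (out_h * out_w) rows by (k_h * k_w) columns, row-major.
// Returns an img_h x img_w image, row-major; overlapping patch
// contributions are accumulated.
inline std::vector<float> column_to_image(const std::vector<float>& columns,
                                          int img_h, int img_w,
                                          int k_h, int k_w, int stride)
{
    const int out_h = (img_h - k_h) / stride + 1;
    const int out_w = (img_w - k_w) / stride + 1;

    std::vector<float> image(static_cast<std::size_t>(img_h) * img_w, 0.0f);

    for(int oh = 0; oh < out_h; ++oh)
        for(int ow = 0; ow < out_w; ++ow)
            for(int kh = 0; kh < k_h; ++kh)
                for(int kw = 0; kw < k_w; ++kw)
                {
                    const std::size_t row = static_cast<std::size_t>(oh) * out_w + ow;
                    const std::size_t col = static_cast<std::size_t>(kh) * k_w + kw;
                    const std::size_t ih  = static_cast<std::size_t>(oh) * stride + kh;
                    const std::size_t iw  = static_cast<std::size_t>(ow) * stride + kw;
                    image[ih * img_w + iw] += columns[row * (k_h * k_w) + col];
                }
    return image;
}
```

For a 3x3 image with a 2x2 window and stride 1, a column matrix of all ones produces an image whose entries count how many patches cover each pixel: 1 at the corners, 2 on the edges and 4 in the centre.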
    • Add multiple A/B support (#906) · 11676c7e
      zjing14 authored
      
      
      * add gridwise_multi_abd
      
      * move element_op into RunRead
      
      * merge element_wise op with data read
      
      * add multiABD example
      
      * allow packed elementwise_op
      
      * changed example
      
      * clean
      
      * clean
      
      * add is_detected
      
      * fix
      
      * minor fix
      
      * add scaleAdd_vec4 example
      
      ---------
      Co-authored-by: Jing Zhang <jizha@amd.com>
  33. 18 Sep, 2023 1 commit
  34. 13 Sep, 2023 1 commit
  35. 12 Sep, 2023 1 commit
    • Refactor f8_t, add bf8_t (#792) · 62d4af74
      Rostyslav Geyyer authored
      * Refactor f8_t to add bf8_t
      
      * Add check_err impl for f8_t
      
      * Update fp8 test
      
      * Format
      
      * Revert the fix
      
      * Update vector_type implementation
      
      * Add bf8 test
      
      * Add bf8, use BitInt types
      
      * Add bf8 conversion methods
      
      * Update type_convert for fp8/bf8
      
      * Add check_err fp8/bf8 support
      
      * Add subnorm fp8 tests
      
      * Add subnorm bf8 tests
      
      * Fix conversion
      
      * Add bf8 cmake bindings
      
      * Add macros to enable build with disabled fp8/bf8
      
      * Remove is_native method
      
      * Update flag combination for mixed precision instances
      
      * Add more flag checks
      
      * Add another flag to a client example
      
      * Add type traits, decouple f8/bf8 casting
      
      * Clean up
      
      * Decouple fp8 and bf8 flags
      
      * Remove more redundant flags
      
      * Remove leftover comments
  36. 06 Sep, 2023 1 commit
    • Redesign the DPP8 GEMM kernel to use warp-wise component (#863) · 37a8c1f7
      Bartlomiej Wroblewski authored
      * Redesign the DPP8 GEMM kernel to use warp-wise component
      
      * Review: Improve error messages
      
      * Review: Remove unnecessary empty lines
      
      * Review: Fix M, N per thread names
      
      * Review: Rename mfma_input_type to dpp_input_type
      
      * Review: Fix tensor adaptor; remove unnecessary element
      
      * Review: Remove calls to dpp_gemm's MakeCDescriptor
      
      * Review: Add blockwise doc, change function names to include dimension names
      
      * Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
      
      * Review: Add __restrict__ keywords
      
      * Review: Use MatrixPadder for padding A, B, C matrices
      
      * Review: Remove hardcoded datatypes
      
      * Review: Change names from FloatX to XDataType
      
      * Review: Introduce AK0 and BK0 instead of a single K0
      
      * Review: Remove construction of dpp_datatypes object
      
      * Review: Rename DppInstrRunner to DppLanegroupGemm