1. 11 Oct, 2023 2 commits
    • Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) · c99323be
      zjing14 authored
      This reverts commit a4f72a31.
    • Grouped Gemm with looping over the tiles. (#788) · a4f72a31
      Adam Osewski authored
      
      
      * Introduce LocalBlockToCTileMap.
      
      * Change the signature of the CalculateBottomIndex() function, which no
      longer accepts any argument. The B2C map, which is already passed as an
      argument to the kernel Run function, now computes the block's local ID
      outside Run, at the kernel entry point (the __global__ function).
      The LocalB2C map stores the local block ID as a member.
      
      * Use LocalBlockToCTile map in device ops.
      
      * First draft of tile loop work distribution.
      
      * Fix typo.
      
      * Simplify kernel arguments.
      
      Calculate descriptors & B2C maps on the device.
      
      * Use looping kernel.
      
      * Fix B2C constructor.
      
      * Fix Navi21 errors.
      
      * Calculate tile start/end in device kernel.
      
      * Change Run API to accept user provided workspace buffer.
      
      * Add new line at EOF.
      
      * Move Gemm KernelArguments to device op interface.
      
      * Remove unused code.
      
      * Update API.
      
      * Launch grid size which is min of occupancy vs tile count
      
      * Get back to use constant memory for gemm descriptors.
      
      * Remove unused code.
      
      * Add default virtual method implementation.
      
      * Update comments to conform with doxygen style.
      
      * Fix doc style and unused parameters.
      
      * Add thread cluster lengths to kernel name.
      
      * Remove old splitk impl and replace it with tile looping one.
      
      * Modify instances.
      
      * set KPerBlock to 64
      * maximize wherever possible vector load size.
      
      * Fix instances cluster lengths.
      
      * Change comment style.
      
      * Use 128b store where possible in instances.
      
      * Update test cases, since KPerBlock has doubled.
      
      * Update output stream operator for Sequence.
      
      * Add pipeline version to GroupedGEMM device op type string.
      
      * Fix pipeline version type logging.
      
      * Fix input tensors type after merge.
      
      * Fix compiler error.
      
      * Fix output stream operator for Pipeline version.
      
      * Store using 128b.
      
      * Set of instances with kpb 32/64
      
      * Limit number of instances
      
      * Remove commented out instances.
      
      * Fix function name.
      
      * Limit the number of instances.
      
      Add pipeline version to the regular instances
      
      * Change thread cluster layout for reading B tensor.
      
      * Disable failing instances
      
      ---------
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Jing Zhang <jizha@amd.com>
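The tile-looping idea named in the commit above (#788), "launch grid size which is min of occupancy vs tile count" and "calculate tile start/end in device kernel", can be illustrated with a small host-side sketch. This is not the library's code: `TileRange`, `compute_grid_size`, and `block_tile_range` are hypothetical names, and the even contiguous split is an assumption about one reasonable way to distribute tiles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not part of any library API): the half-open
// range of output tiles [begin, end) that one workgroup loops over.
struct TileRange
{
    std::int64_t begin;
    std::int64_t end;
};

// Grid size = min(blocks the device can keep resident, number of tiles).
// At least one block is launched so the kernel launch itself stays valid.
inline std::int64_t compute_grid_size(std::int64_t max_resident_blocks,
                                      std::int64_t tile_count)
{
    assert(max_resident_blocks > 0 && tile_count >= 0);
    return std::max<std::int64_t>(1, std::min(max_resident_blocks, tile_count));
}

// Assumed distribution: a contiguous, near-even split. The first
// (tile_count % grid_size) blocks receive one extra tile.
inline TileRange block_tile_range(std::int64_t block_id,
                                  std::int64_t grid_size,
                                  std::int64_t tile_count)
{
    assert(grid_size > 0 && block_id >= 0 && block_id < grid_size);
    const std::int64_t base  = tile_count / grid_size;
    const std::int64_t extra = tile_count % grid_size;
    const std::int64_t begin = block_id * base + std::min(block_id, extra);
    const std::int64_t end   = begin + base + (block_id < extra ? 1 : 0);
    return {begin, end};
}
```

Each block would then iterate `for(tile = range.begin; tile < range.end; ++tile)` and map the tile index to its C-tile coordinates, instead of relying on one block per tile.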
  2. 10 Oct, 2023 1 commit
  3. 05 Oct, 2023 2 commits
  4. 29 Sep, 2023 1 commit
    • Add support for mixed precision in contraction scale and bilinear (#936) · f0748506
      Bartlomiej Wroblewski authored
      * Extract common functionality to separate files
      
      * Reference contraction: Remove incorrect consts from type_converts
      
      * Reference contraction: Add missing type_convert for dst value
      
      * Reference contraction: Fix incorrect order of B matrix dimensions
      
      * Add support for mixed precision in contraction scale and bilinear
      
      * Move using statements from instances to a common file
      
      * Move using statements from examples to a common file
      
      * Fix the order of B matrix dimensions across examples and profiler
      
      * Fix the computation of error threshold
      
      * Make ComputeDataType an optional argument
      
      * Include possible DataType -> ComputeDataType casting error in the threshold
      
      * Remove commented code
  5. 28 Sep, 2023 1 commit
  6. 27 Sep, 2023 4 commits
  7. 26 Sep, 2023 1 commit
  8. 23 Sep, 2023 1 commit
  9. 21 Sep, 2023 1 commit
    • Refactoring cmake files to build data types separately. (#932) · bba085d2
      Illia Silin authored
      * refactor cmake files for the tests
      
      * refactor cmake files for examples
      
      * fix cmake for gemm example
      
      * fix the cmake file for all examples
      
      * add splitting by data types in gemm_splitk instance header
      
      * rename test to reflect only dl instances are used
      
      * clean up CI workspace, update cmake for instances
      
      * change the jenkinsfile syntax
      
      * build all instances except DL on gfx11
      
      * move workspace cleanup after stages
      
      * clean up workspace after every stage
      
      * isolate data types in grouped_conv_fwd header
      
      * isolate dl instances for grouped_conv2d_fwd
      
      * fix syntax
      
      * fix cmake and batchnorm instances
      
      * fix typo
      
      * fix reduction instances
      
      * fix grouped_conv headers
      
      * fix syntax
      
      * replace parsing logic for instances, replace bfp16 with bf16
      
      * fix the client examples build
      
      * clean up DTYPES from instances cmake files
      
      * update the parsing logic in cmake files
      
      * make an exception for reduction kernels
      
      * update the few remaining cmake files to handle DTYPES
      
      * fix syntax
      
      * fix cmake conflicts
      
      * replace f8 with fp8 test name
      
      * resolve conflicts for dpp instances
  10. 15 Sep, 2023 1 commit
  11. 13 Sep, 2023 1 commit
  12. 12 Sep, 2023 1 commit
    • Refactor f8_t, add bf8_t (#792) · 62d4af74
      Rostyslav Geyyer authored
      * Refactor f8_t to add bf8_t
      
      * Add check_err impl for f8_t
      
      * Update fp8 test
      
      * Format
      
      * Revert the fix
      
      * Update vector_type implementation
      
      * Add bf8 test
      
      * Add bf8, use BitInt types
      
      * Add bf8 conversion methods
      
      * Update type_convert for fp8/bf8
      
      * Add check_err fp8/bf8 support
      
      * Add subnorm fp8 tests
      
      * Add subnorm bf8 tests
      
      * Fix conversion
      
      * Add bf8 cmake bindings
      
      * Add macros to enable build with disabled fp8/bf8
      
      * Remove is_native method
      
      * Update flag combination for mixed precision instances
      
      * Add more flag checks
      
      * Add another flag to a client example
      
      * Add type traits, decouple f8/bf8 casting
      
      * Clean up
      
      * Decouple fp8 and bf8 flags
      
      * Remove more redundant flags
      
      * Remove leftover comments
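To make the f8/bf8 distinction in the commit above (#792) concrete, here is a minimal decoding sketch for a generic 8-bit float with one sign bit. It assumes f8 uses 4 exponent and 3 mantissa bits and bf8 uses 5 exponent and 2 mantissa bits; the exponent bias is left as a parameter because it differs between 8-bit float conventions, and the values used below are assumptions, not taken from the commit. `MiniFloatFormat` and `decode_minifloat` are hypothetical names, and NaN/Inf encodings are deliberately not handled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical layout description for an 8-bit float: 1 sign bit plus
// exponent_bits + mantissa_bits == 7. The bias is an assumption supplied
// by the caller.
struct MiniFloatFormat
{
    int exponent_bits;
    int mantissa_bits;
    int bias;
};

// Decode normal and subnormal values only. Special encodings (NaN, Inf,
// negative zero conventions) are NOT modeled in this sketch.
inline float decode_minifloat(std::uint8_t bits, MiniFloatFormat fmt)
{
    assert(fmt.exponent_bits + fmt.mantissa_bits == 7);
    const int sign     = bits >> 7;
    const int exponent = (bits >> fmt.mantissa_bits) & ((1 << fmt.exponent_bits) - 1);
    const int mantissa = bits & ((1 << fmt.mantissa_bits) - 1);
    const float frac =
        static_cast<float>(mantissa) / static_cast<float>(1 << fmt.mantissa_bits);

    float magnitude;
    if(exponent == 0)
        magnitude = std::ldexp(frac, 1 - fmt.bias); // subnormal: no implicit leading 1
    else
        magnitude = std::ldexp(1.0f + frac, exponent - fmt.bias); // normal value
    return sign ? -magnitude : magnitude;
}
```

With an assumed bias of 7, `decode_minifloat(0x38, {4, 3, 7})` yields 1.0f; the same byte decodes to a different value under a different bias, which is why the bias is kept explicit.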
  13. 05 Sep, 2023 1 commit
    • Add image to column kernel (#867) · 0077eeb3
      Bartłomiej Kocot authored
      * Add image to column kernel
      
      * Add instances, tests, profiler, example
      
      * Add client example
      
      * Several fixes of image to column
      
      * Fix variable name in device_image_to_column_impl
      
      * Several fixes of image to column profiler
      
      * Fix num_btype calculation
      
      * Make new measurements for correct bytes calculation
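For readers unfamiliar with the operation added in the commit above (#867), the following is a plain CPU sketch of a 2D image-to-column transform. It is a reference-style illustration only: the single-image H x W x C layout, symmetric zero padding, and the names `Im2ColShape` and `image_to_column` are assumptions chosen for brevity, not the kernel's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor: one image stored as [H][W][C], a Y x X
// window, a single stride and a single symmetric zero padding.
struct Im2ColShape
{
    int H, W, C; // input height, width, channels
    int Y, X;    // window height, width
    int stride;
    int pad;
};

// Returns a row-major matrix with Ho*Wo rows and Y*X*C columns. Each row
// holds the flattened window seen by one output position; padded
// positions stay zero.
inline std::vector<float> image_to_column(const std::vector<float>& image,
                                          const Im2ColShape& s)
{
    assert(image.size() == static_cast<std::size_t>(s.H) * s.W * s.C);
    const int Ho = (s.H + 2 * s.pad - s.Y) / s.stride + 1;
    const int Wo = (s.W + 2 * s.pad - s.X) / s.stride + 1;
    const std::size_t row_len = static_cast<std::size_t>(s.Y) * s.X * s.C;
    std::vector<float> columns(static_cast<std::size_t>(Ho) * Wo * row_len, 0.0f);

    for(int ho = 0; ho < Ho; ++ho)
        for(int wo = 0; wo < Wo; ++wo)
            for(int y = 0; y < s.Y; ++y)
                for(int x = 0; x < s.X; ++x)
                {
                    const int hi = ho * s.stride + y - s.pad;
                    const int wi = wo * s.stride + x - s.pad;
                    if(hi < 0 || hi >= s.H || wi < 0 || wi >= s.W)
                        continue; // falls in the padding: leave zeros
                    for(int c = 0; c < s.C; ++c)
                    {
                        const std::size_t dst =
                            (static_cast<std::size_t>(ho) * Wo + wo) * row_len +
                            (static_cast<std::size_t>(y) * s.X + x) * s.C + c;
                        const std::size_t src =
                            (static_cast<std::size_t>(hi) * s.W + wi) * s.C + c;
                        columns[dst] = image[src];
                    }
                }
    return columns;
}
```

Once the input is laid out this way, a convolution reduces to a matrix multiplication between the column matrix and the flattened filter weights.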
  14. 31 Aug, 2023 1 commit
    • MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861) · 866377de
      rocking authored
      * Add maxpool instances
      
      * Rename index pool to max pool.
      
      * Add maxpool bwd bf16 instances
      
      * Add avg pool bwd instances
      
      * Rename avgpool and maxpool to avg_pool3d and max_pool
      
      * Add bf16 pool fwd instances
      
      * Add max pool bwd to ckProfiler
      
      * Add avg pool3d bwd to ckProfiler
      
      * Add avg pool bwd test
      
      * Fix bug of reference pool fwd (dilation)
      
      * Fix bug of max pool bwd (dilation and initZero)
      
      * Support bf16 compute data type
      
      * Force compute type to be f32, because atomicAdd only supports f32
      
      * Add max pool bwd test
      
      * Rename folder
      
      * Rename pool
      
      * Add max pool bwd client example
      
      * Add avg pool bwd client example
      
      * Add missing workspace
      
      * clang format
      
      * Rename macro
      
      * remove useless header
      
      * remove useless layout
  15. 23 Aug, 2023 1 commit
    • [HotFix] add config and version files to pass on build info (#856) · c8a8385f
      Jun Liu authored
      * experiment with config file
      
      * experiment with version.h config
      
      * add more info to version.h
      
      * minor updates
      
      * minor updates
      
      * fix case where DTYPE is not used
      
      * large amount of files but minor changes
      
      * remove white space
      
      * minor changes to add more MACROs
      
      * fix cmakedefine01
      
      * fix issue with CK internal conflict
      
      * fix define and define value
      
      * fix clang-format
      
      * fix formatting issue
      
      * experiment with cmake
      
      * clang format v12 to be consistent with miopen
      
      * avoid clang-format for config file
  16. 22 Aug, 2023 3 commits
  17. 14 Aug, 2023 1 commit
    • Refactor pool fwd (#815) · f60f0a5e
      rocking authored
      * Do not hardcode stride
      
      * devicePool2DFwd inherits from devicePool3DFwd
      
      * Move instance declaration out of common
      
      * Add dilation
      
      * use the pool3d rank, because pool2d inherits from pool3d
      
      * calculate Do Ho Wo for the dilation
      
      * Fix header name
      
      * Modify ckProfiler
      
      * Remove pool2d instance
      
      * Remove pool2d in profiler
      
      * Remove pool2d and add dilation
      
      * In the client example, this commit revises the following:
      1. Add dilation.
      2. Use pool3d to implement pool2d
      
      * Refine naming and IsSupportedArgument()
      
      * Add dilation to maxpool bwd example
      
      * clang format
      
      * 1. Remove useless header
      2. Fix copyright
      3. Refine naming
      
      * Add layout parameter to pool fwd
      
      * clang format
      
      * Fix merge error
      
      * Fix compile error
      
      * Remove layout parameter in derived class
      
      * Refine changelog
      
      * Fix compile error
      
      * Fix compiler error
      
      * Add layout to external api and profiler
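The step "calculate Do Ho Wo for the dilation" in the commit above (#815) refers to the output extent of a dilated pooling window. A minimal sketch of the standard formula, applied per spatial dimension, is shown below; `pool_output_length` is a hypothetical helper name, and floor division (no ceil mode) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Output length of one spatial dimension (D, H or W) for a pooling window
// with stride, dilation and left/right padding. Hypothetical helper; the
// floor-division rounding is an assumption of this sketch.
inline std::int64_t pool_output_length(std::int64_t in_len,
                                       std::int64_t window,
                                       std::int64_t stride,
                                       std::int64_t dilation,
                                       std::int64_t pad_left,
                                       std::int64_t pad_right)
{
    assert(in_len > 0 && window > 0 && stride > 0 && dilation > 0);
    assert(pad_left >= 0 && pad_right >= 0);
    // A dilated window of `window` taps spans dilation * (window - 1) + 1 inputs.
    const std::int64_t effective_window = dilation * (window - 1) + 1;
    const std::int64_t padded_len       = in_len + pad_left + pad_right;
    assert(padded_len >= effective_window);
    return (padded_len - effective_window) / stride + 1;
}
```

For example, an input of length 7 with a 3-tap window and stride 2 gives 3 outputs without dilation and 2 outputs with dilation 2, because the dilated window covers 5 inputs instead of 3.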
  18. 09 Aug, 2023 1 commit
  19. 07 Aug, 2023 2 commits
    • Allow building CK for specific data types and split off last remaining DL instances. (#830) · 08eb1769
      Illia Silin authored
      * properly split conv_nd_bwd_data instances
      
      * split conv2d_fwd instance data types
      
      * split the gemm, conv2d_fwd and batched_gemm_softmax_gemm
      
      * split the tests by data types where possible
      
      * filter examples by DTYPES
      
      * split few remaining examples by DTYPES
      
      * filter most instances by DTYPES
      
      * add new lines at end of headers, fix grouped_gemm profiler
      
      * fix syntax
      
      * split the ckprofiler instances by DTYPES
      
      * split the conv2d and quantization DL and XDL instances
      
      * fix the splitting of conv2d DL instances
      
      * split softmax and pool_fwd tests for fp16 and fp32 types
      
      * fix syntax
      
      * fix the dl_int8 quantization instances isolation
    • Add wei_strides to grouped conv3d wei to keep consistency (#817) · 22443f7a
      Bartłomiej Kocot authored
      
      
      * Add wei_strides to grouped conv3d wei to keep consistency
      
      * Fix strides in client examples
      
      * Unify backward weight api with forward
      
      * Fix for example
      
      * Fixes for examples
      
      ---------
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
  20. 27 Jul, 2023 1 commit
  21. 26 Jul, 2023 2 commits
    • initial stream-k implementation with example (#699) · e7dca79d
      carlushuang authored
      
      
      * initial stream-k implementation with example
      
      * fix unexpected change in err
      
      * improve performance slightly by reorganizing the pipeline.

      * improve perf slightly by swizzling block idx
      
      * add profiler
      
      * update example
      
      * fix spelling
      
      * shrink karg for streamk
      
      * support dynamic buffer using memory coherence glc_slc bit from template
      
      * control memory coherence while constructing dynamic buffer
      
      * update reduction for streamk(not ready yet)
      
      * Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
      
      * fix build issue
      
      * fix several bugs
      
      * now result is correct, everything works (but has scratch)
      
      * remove scratch by manually reset coordinate
      
      * update device code
      
      * fix a bug in final reduce
      
      * fix something in example
      
      * update async memset
      
      * fix enum as camel case
      
      * modify coherence enum name
      
      * clean code and use atomic streamk by default
      
      * remove unused var
      
      * throw exception if a pointer is empty
      
      * fix format
      
      * fix CI warning
      
      * fix type in init
      
      * modify CI error
      
      * filter out on gfx10+
      
      * restore changed example code
      
      ---------
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
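As background for the stream-k commit above (#699): the general stream-k idea is to split the total number of K-loop iterations, rather than whole output tiles, evenly across workgroups, so a block's share may start or end in the middle of a tile and partial results must be combined (the commit mentions an atomic variant and a final reduce). The sketch below illustrates only that partitioning step. The names `IterRange`, `TileSegment`, `streamk_block_range`, and `split_into_tile_segments` are hypothetical, and the actual implementation in the commit may distribute work differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types: a half-open range of global K-iterations owned by a
// block, and the piece of that range that falls inside one output tile.
struct IterRange
{
    std::int64_t begin;
    std::int64_t end;
};

struct TileSegment
{
    std::int64_t tile;    // output tile index
    std::int64_t k_begin; // first K-iteration within the tile (inclusive)
    std::int64_t k_end;   // last K-iteration within the tile (exclusive)
};

// Assumed policy: near-even contiguous split of all iterations over blocks.
inline IterRange streamk_block_range(std::int64_t block_id,
                                     std::int64_t num_blocks,
                                     std::int64_t total_iters)
{
    assert(num_blocks > 0 && block_id >= 0 && block_id < num_blocks);
    const std::int64_t base  = total_iters / num_blocks;
    const std::int64_t extra = total_iters % num_blocks;
    const std::int64_t begin = block_id * base + std::min(block_id, extra);
    return {begin, begin + base + (block_id < extra ? 1 : 0)};
}

// Cut a block's iteration range at tile boundaries. Segments that do not
// cover a whole tile produce partial sums that need a reduction step.
inline std::vector<TileSegment> split_into_tile_segments(IterRange range,
                                                         std::int64_t iters_per_tile)
{
    assert(iters_per_tile > 0);
    std::vector<TileSegment> segments;
    std::int64_t it = range.begin;
    while(it < range.end)
    {
        const std::int64_t tile       = it / iters_per_tile;
        const std::int64_t tile_start = tile * iters_per_tile;
        const std::int64_t seg_end    = std::min(range.end, tile_start + iters_per_tile);
        segments.push_back({tile, it - tile_start, seg_end - tile_start});
        it = seg_end;
    }
    return segments;
}
```

With 3 tiles of 4 iterations each and 5 blocks, block 1 owns iterations 3 to 5, which is the last iteration of tile 0 plus the first two of tile 1, so both tiles receive contributions from more than one block.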
    • Disable DL kernels by default. (#816) · 9195435c
      Illia Silin authored
  22. 21 Jul, 2023 1 commit
  23. 18 Jul, 2023 2 commits
    • Grouped 3d conv backward data support (#799) · 49180fd6
      Bartłomiej Kocot authored
      * Grouped 3d conv backward data support
      
      * Fix comments
    • Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9
      Illia Silin authored
      * allow building CK for specific data types
      
      * add CI build and test stage on Navi3x without some int8 instances
      
      * add missing gemm fp16 instances
      
      * add the changes to the missed cmake file
      
      * add empty lines at end of source files
      
      * Do not build quantization client example on navi3 in CI
      
      * disable batched_gemm_multi_d_int8 instances with DTYPES
      
      * disable device_conv2d_bwd_data_instance with DTYPES
      
      * fix ckprofiler for conv_bwd_data for int8
      
      * properly isolate the conv_bwd_data int8 instances
      
      * remove empty line
  24. 12 Jul, 2023 1 commit
  25. 06 Jul, 2023 2 commits
  26. 21 Jun, 2023 1 commit
  27. 19 Jun, 2023 1 commit
    • FP8 enablement - add a pseudorandom number generator, add conversion methods (#708) · f0c620c4
      Rostyslav Geyyer authored
      * Add basic fp8 definitions and prn-generator
      
      * Format
      
      * Add fp8<->fp32 type_convert
      
      * Format
      
      * Split type_convert and cast_to/from_f8
      
      * Format
      
      * Minor fix
      
      * Minor fix
      
      * Move fp8 utils to a separate header
      
      * Add elementwise ops
      
      * Add fp8_convert_sr
      
      * Format
      
      * Add element op
      
      * Eliminate magic numbers
      
      * Split f8_convert_sr in host and device
      
      * Format
      
      * Add some constexpr
      
      * Add a datatype test
      
      * Format
      
      * Another format
      
      * Add fp8<->fp16 tests
      
      * Update type_converts
      
      * Format
      
      * Add fp16 casting functions
      
      * Format
      
      * Use seed as a runtime arg
      
      * Use element location for PRNG
      
      * Format
      
      * Add fp8<->fp16 to PassThrough element op
      
      * Clean up
      
      * Merge host and device implementations
      
      * Add comments on rounding modes
      
      * Remove leftover code
      
      * Put type_converts into a separate header
      
      * Put random number gen to a separate header
      
      * Rearrange f8_utils' namespaces
      
      * Refactor type_convert.hpp
      
      * Move f8_t definition
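The commit above (#708) pairs a pseudorandom number generator with stochastic-rounding conversions ("Add fp8_convert_sr", "Use seed as a runtime arg", "Use element location for PRNG"). The sketch below shows the general stochastic-rounding technique on the mantissa of a 32-bit float. It is not the commit's implementation: `hash_prng` is a hypothetical stand-in hash, `stochastic_round_mantissa` is a hypothetical name, and exponent-range handling (overflow, subnormals of the target type, NaN/Inf) is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a PRNG: mixes a runtime seed with the element
// index so every element gets its own reproducible random bits.
inline std::uint32_t hash_prng(std::uint32_t seed, std::uint32_t element_index)
{
    std::uint32_t x = seed ^ (element_index * 0x9E3779B9u);
    x ^= x >> 16;
    x *= 0x7FEB352Du;
    x ^= x >> 15;
    x *= 0x846CA68Bu;
    x ^= x >> 16;
    return x;
}

// Stochastically round a finite float so only `keep_bits` mantissa bits
// remain. Adding random bits below the cut makes the value round up with
// probability equal to the discarded fraction. Assumption: the caller
// handles exponent clamping and special values separately.
inline float stochastic_round_mantissa(float value, int keep_bits, std::uint32_t random)
{
    assert(keep_bits >= 0 && keep_bits < 23);
    const int drop           = 23 - keep_bits;
    const std::uint32_t mask = (1u << drop) - 1u;

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits += random & mask; // may carry into the kept bits (round up)
    bits &= ~mask;         // truncate the discarded bits

    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

A typical call would be `stochastic_round_mantissa(x, 3, hash_prng(seed, i))` for element `i`; values that are already representable with the kept bits are returned unchanged regardless of the random input.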
  28. 17 Jun, 2023 1 commit
    • Padded Generic Kernel Instance (#730) · 0d911822
      Qianfeng authored
      
      
      * Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
      
      * Move the generic kernel instance to be the first of the instance list for elementwise op of normalization
      
      * Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
      
      * Add testing of GetGenericInstance() in client_example of Softmax
      
      * Revert "Add testing of GetGenericInstance() in client_example of Softmax"
      
      This reverts commit f629cd9a93ce38dfed4886d849f3c38d2e5379c8.
      
      * Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
      
      This reverts commit a9f0d000eb9fd240404112a526ef125429a351df.
      
      * Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm
      
      * Move generic kernel instance to separate tuple for elementwise op of normalization
      
      * Remove unused files for softmax instance
      
      * Store generic kernel instance to separate tuple for softmax
      
      * Add IsSupported checking for generic instance to client example of softmax
      
      * Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization
      
      * clang-format fix
      
      * Remove int8 from softmax instances
      
      ---------
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
  29. 15 Jun, 2023 1 commit
    • Enable gfx941 and gfx942 architectures. (#752) · 027e46ee
      Illia Silin authored
      * enable gfx941/942 targets
      
      * fix clang format
      
      * fix the cmake logic for multiple targets
      
      * fix cmake syntax for looping over targets
      
      * add gfx941/942 support for gemm_xdl instances