1. 21 Jun, 2023 1 commit
  2. 19 Jun, 2023 2 commits
    • Rostyslav Geyyer's avatar
      FP8 enablement - add a pseudorandom number generator, add conversion methods (#708) · f0c620c4
      Rostyslav Geyyer authored
      * Add basic fp8 definitions and prn-generator
      
      * Format
      
      * Add fp8<->fp32 type_convert
      
      * Format
      
      * Split type_convert and cast_to/from_f8
      
      * Format
      
      * Minor fix
      
      * Minor fix
      
      * Move fp8 utils to a separate header
      
      * Add elementwise ops
      
      * Add fp8_convert_sr
      
      * Format
      
      * Add element op
      
      * Eliminate magic numbers
      
      * Split f8_convert_sr in host and device
      
      * Format
      
      * Add some constexpr
      
      * Add a datatype test
      
      * Format
      
      * Another format
      
      * Add fp8<->fp16 tests
      
      * Update type_converts
      
      * Format
      
      * Add fp16 casting functions
      
      * Format
      
      * Use seed as a runtime arg
      
      * Use element location for PRNG
      
      * Format
      
      * Add fp8<->fp16 to PassThrough element op
      
      * Clean up
      
      * Merge host and device implementations
      
      * Add comments on rounding modes
      
      * Remove leftover code
      
      * Put type_converts into a separate header
      
      * Put random number gen to a separate header
      
      * Rearrange f8_utils' namespaces
      
      * Refactor type_convert.hpp
      
      * Move f8_t definition
      f0c620c4
    • rocking's avatar
      Maxpool bwd (#750) · 341ad956
      rocking authored
      * Add maxpool f32 kernel and example
      
      * Revise copyright
      
      * Add device pool bwd device op
      
      * Support f16 and bf16
      
      * Add compute datatype for reference code.
      Prevent error in bf16
      
      * Fix type error
      
      * Remove layout
      
      * Fix bf16 error
      
      * Add f16 and bf16 example
      
      * Add more operations
      
      * Implement IsSupportedArgument
      
      * Add changelog
      
      * Add comment
      
      * Add comment
      
      * Remove useless header
      
      * Move initialize of workspace to the run
      
      * Move set din zero to the device operator
      
      * Save din_length_raw
      
      * Remove useless header
      
      * Calculate gridsize according to the number of CU
      
      * Calculate gridSize according to the number of CU.
      Remove useless header
      
      * Add put example
      
      * Remove useless header
      
      * Fix CI fail
      341ad956
  3. 17 Jun, 2023 1 commit
    • Qianfeng's avatar
      Padded Generic Kernel Instance (#730) · 0d911822
      Qianfeng authored
      
      
      * Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
      
      * Move the generic kernel instance to be the first of the instance list for elementwise op of normalization
      
      * Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
      
      * Add testing of GetGenericInstance() in client_example of Softmax
      
      * Revert "Add testing of GetGenericInstance() in client_example of Softmax"
      
      This reverts commit f629cd9a93ce38dfed4886d849f3c38d2e5379c8.
      
      * Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
      
      This reverts commit a9f0d000eb9fd240404112a526ef125429a351df.
      
      * Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm
      
      * Move generic kernel instance to separate tuple for elementwise op of normalization
      
      * Remove un-used files for softmax instance
      
      * Store generic kernel instance to separate tuple for softmax
      
      * Add IsSupported checking for generic instance to client example of softmax
      
      * Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization
      
      * clang-format fix
      
      * Remove int8 from softmax instances
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      0d911822
  4. 15 Jun, 2023 1 commit
  5. 14 Jun, 2023 1 commit
    • Rostyslav Geyyer's avatar
      Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741) · 54b68eb3
      Rostyslav Geyyer authored
      * Add generic instance gemm_add_add_fastgelu
      
      * Add a client example for generic gemm_add_add_fastgelu
      
      * Update CMakeLists
      
      * Format
      
      * Format
      
      * Add generic instance gemm_add_fastgelu
      
      * Format
      
      * Add a gemm_add_fastgelu client example
      
      * Format
      
      * Add generic instance gemm_fastgelu
      
      * Format
      
      * Fix argument order
      
      * Add gemm_fastgelu client example
      
      * Add exceptions if argument is not supported
      54b68eb3
  6. 12 Jun, 2023 2 commits
  7. 31 May, 2023 1 commit
  8. 30 May, 2023 2 commits
    • Adam Osewski's avatar
      Multiple fixes to GroupedGemm+SplitK (#707) · 70e4eb56
      Adam Osewski authored
      
      
      * Add license header.
      
      * Reduce number of logged output. Add constant initialization.
      
      * Add functional tests for grouped_gemm with different kbatch value.
      
      * Add debug log informations + remove unused code.
      
      * Don't pass kbatch to CalculateKPadded.
      
      * Turn on logging in grouped gemm and gemm splitk profiler
      
      * Debug: limit number of test cases to run;
      
      * Log more information and initialize with constant value.
      
      * Turn on DEBUG_LOG
      
      * Add more debug log informations.
      
      * Limit the number of instances to compile.
      
      * Use GridwiseGemmPipeline
      
      * Use KBatch to calculate K0
      
      * Multiple DebugLog messages.
      
      * Unit tests for multiple KBatch values.
      
      * Refactoring
      
      * Disable logging
      * extract out of if statement KBatch update.
      
      * Uncomment instances.
      
      * Disable DebugLog.
      
      * Use Kbatch when calculate KPadded.
      
      * Fix CGridDesc padding.
      
      * Use available helper functions.
      
      * Uncomment code commented for debuggin.
      
      * Remove unnecessary debug log messages.
      
      * Uncomment previously commented code for debug purposes.
      
      * Add KBatch info to profiler output summary log.
      
      * Add gtests for gemm splitk using ckProfiler API.
      
      * Add more test-cases for different data layout.
      
      * Add more test cases for gemm splitk
      
      * Remove old test.
      
      * Unit tests for MKNK ggemm interface.
      
      * Fix and add more unit-tests.
      
      * Constepxr everything!
      
      * Increase error threshold for fp16 and splitk.
      
      Since we're using fp16 atomic add for splitk there's a
      known precision loss.
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      70e4eb56
    • Bartłomiej Kocot's avatar
      Add instances for fp16/int8 Gemm kernels (Navi21) (#717) · c2d7a29d
      Bartłomiej Kocot authored
      * Add instances for fp16/int8 Gemm kernels (Navi21)
      
      * Extend instances with smaller tiles
      
      * Fix SrcVectorTensor for km_kn_mn int8
      c2d7a29d
  9. 24 May, 2023 2 commits
    • Illia Silin's avatar
      Clean-up the headers (#713) · ac9e01e2
      Illia Silin authored
      
      
      * fix headers for gpu instances
      
      * remove unused headers
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      ac9e01e2
    • rocking's avatar
      Pool3d fwd (#697) · 76ec0089
      rocking authored
      * Expand the base class of pool2d, prepare to share base class with pool3d
      
      * Add pool3d device op
      
      * Add pool3d f16 example
      
      * Refactor the base class. implement generic pooling in the future
      
      * clang format
      
      * get original index in max pooling
      
      * Add outputindex to base class
      
      * Fix dimension
      
      * Add pooling instance
      
      * Use indexType instead
      
      * Remove useless header
      
      * Extract IndexDataType to template
      
      * Extract pooling reference code
      
      * clang format
      
      * clang format
      
      * Fix typo
      
      * Add tensor stride
      
      * Add missing header
      
      * Add index stride and output stride
      
      * Refine naming
      
      * Add type to base class
      
      * Rename file
      
      * Use proper size
      
      * Fix typo
      
      * Refine naming
      
      * Modify the argument into vector.
      
      * Add max pool profiler
      
      * Refine naming
      
      * Support f32 pool
      
      * Fix typo
      
      * Add avg pool2d fwd in profiler
      
      * clang format
      
      * Rename AccDatatype to ComputeDatatype
      
      * Fix init
      
      * test pool
      
      * Extract variable
      
      * Add client example
      
      * Check the pooling dim
      
      * clang format
      
      * Connect argv and arg_parser
      
      * Add found check
      
      * Remove useless header
      
      * Refine naming
      
      * Adjust the order of device_pool_fwd
      76ec0089
  10. 15 May, 2023 1 commit
    • Bartłomiej Kocot's avatar
      Add contraction profiler and tests (#701) · 642d5e91
      Bartłomiej Kocot authored
      * Add contraction profiler and tests
      
      * Build and style fixes
      
      * Allow to use any elementwise operator for ref_contraction
      
      * Introduce profile_contraction_scale and profile_contraction_bilinear
      
      * Make ref_contraction generic and extend interface tests
      
      * Stylistic minor fixes
      
      * Extend test_contraction_interface
      642d5e91
  11. 12 May, 2023 1 commit
  12. 11 May, 2023 1 commit
  13. 10 May, 2023 1 commit
  14. 04 May, 2023 2 commits
    • Rostyslav Geyyer's avatar
      Optimize bf16 conversion (#664) · b076a02a
      Rostyslav Geyyer authored
      * Add TypeConvert class and start refactoring
      
      * Refactor TypeConvert as a struct
      
      * Get back to template functions type_convert
      
      * Add a type_convert_bf16_rtn, set rtz as default
      
      * Clean up
      
      * Add UnaryConvertPrecision struct for high-precision workloads
      
      * Format
      
      * Update type_convert to UnaryConvert on threadwise level
      
      * Update UnaryConvertPrecision
      
      * Format
      
      * Fix chmod
      
      * Add a flag to pick converion method
      
      * Format
      
      * Remove the added flag
      
      * Merge elementwise op with type conversion
      
      * Move type_convert to elemwise op, update the op
      
      * Update type_convert_precision -> bf16_convert_rtn
      
      * Clean up
      
      * Update comments
      
      * Update the CK_WORKAROUND_DENORM_FIX flag handling
      
      * Update the unneeded op to work but warn user
      
      * Remove the message
      
      * Use a PassThrough instead of ConvertBF16RTN to calcaulate reference
      
      * Format
      
      * Add missing include
      b076a02a
    • Adam Osewski's avatar
      Limit the number of instances to compile. · 3948d09b
      Adam Osewski authored
      3948d09b
  15. 24 Apr, 2023 2 commits
    • Adam Osewski's avatar
      Grouped Gemm + SplitK + simplified Kernel Args (#669) · 8bb2bb4a
      Adam Osewski authored
      
      
      * simplify karg in device/grid split-k op
      
      * fix mk_kn_mn instances
      
      * add more instances
      
      * B2C with 3D grid for KSplit
      
      * Remove unused code.
      
      * Use default B2C (3D grid) in grid gemm v2r4r2.
      
      * Device gemm splitk use B2C map.
      
      * Device GroupedGemmXdlSplitKCShuffle
      
      * Example for GroupedGemm Xdl SplitK
      
      * Introduce Device GroupedGemmSplitK
      
      * Fix updating kbatch size.
      
      * Add instance mk-nk-mn
      
      * Enable set kbatch in profiler.
      
      * Add GGemmSplitK mk-kn-mn instances
      
      * Add more instances & split into multiple files.
      
      * minor fix
      
      * tuning
      
      * clean
      
      * disabled failed instances
      
      * use pipe v2
      
      * Ignore arg on not supported arch.
      
      * fix warning
      
      ---------
      Co-authored-by: default avatarcarlushuang <carlus.huang@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      Co-authored-by: default avatarJing Zhang <jizhan@amd.com>
      Co-authored-by: default avatarroot <root@ctr-ubbsmc15.amd.com>
      8bb2bb4a
    • rocking's avatar
      Revise layout of group convolution (#675) · 3eecbfb6
      rocking authored
      * [What] Remove pure conv int8 instance
      [Why] We will never use pure int8 conv in AI, use int8 quantization instead
      
      * Change layout
      
      * Share the kernel parameter
      
      * Support more type of NHWGC for group conv
      
      * Revise client example of conv 2d, use NHWGC layout
      
      * Add instance to cmake
      
      * Revise layout of group conv quantization instance
      
      * Revise layout of external api of group conv quantization
      
      * Revise layout of group conv quantization client example
      
      * Fix clang format
      
      * Add comment to describe meaning of each parameter
      3eecbfb6
  16. 22 Apr, 2023 1 commit
  17. 17 Apr, 2023 1 commit
  18. 10 Apr, 2023 1 commit
    • rocking5566's avatar
      Groupnorm + swish external api (#668) · ed3a2e52
      rocking5566 authored
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Ractor instance generation, split into multiple cpp file
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
      ed3a2e52
  19. 07 Apr, 2023 1 commit
  20. 30 Mar, 2023 2 commits
  21. 29 Mar, 2023 1 commit
    • rocking5566's avatar
      Conv + quantization + tanh (#645) · 389e84a8
      rocking5566 authored
      
      
      * Rename file. Prepare to support another activation
      
      * Add comment for quantization
      
      * Extract out_elementop
      
      * Add tanh example
      
      * Add conv + bias + tanh quantization instance
      
      * Add missing parameter
      
      * Refine cmake
      
      * Add external api and client example
      
      * Extract variable in example
      
      * Fix the comment
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      389e84a8
  22. 20 Mar, 2023 1 commit
  23. 15 Mar, 2023 1 commit
    • rocking5566's avatar
      gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0
      rocking5566 authored
      * Add conv perlayer quantization
      
      * Add gemm_dlops quantization
      
      * Support int8 for innerproduct
      
      * Refine gemm dlops int8 kernel parameter
      
      * Support gfx908(MI100) and gfx90a(MI200)
      
      * clang-format
      
      * Rename example number
      
      * Support different layout for d tensor
      
      * Add conv dlops perchannel quantization example
      
      * Move to example 40
      
      * Extract the common code for different platform (dlops and xdlops)
      
      * Move ot subfolder. Prepare to add other op of quantization
      
      * Refine the quantization instance library
      
      * Add conv dl instances and client example
      
      * Remove unnecessary type
      
      * Add gemm quantization instance
      
      * Add external api and client example
      
      * Refine num_bytes
      
      * Separete different layout to different cpp
      
      * Add more xdl instances
      
      * Revert "Remove unnecessary type"
      
      This reverts commit 82086918
      
      .
      
      * Remove CShuffleDataType in dlops
      Let acc and CShuffleDataType be the same in xdlops
      
      ---------
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      16dc18e0
  24. 08 Mar, 2023 1 commit
    • Adam Osewski's avatar
      GroupedGEMM + Gelu client example/instances/profiler (#614) · 9096b1c7
      Adam Osewski authored
      
      
      * Grouped gemm + Gelu instances.
      
      * Device Instance Factory for GroupedGemm+Gelu
      
      * Client example
      
      * Rangify fill helper functions.
      
      * Fix name clash.
      
      * Profiler for grouped_gemm+gelu
      
      * No need to use full namespace name.
      
      * Add check for MRaw divisible by vector load.
      
      * Ugly fix for big errors.
      
      * Add grouped_gemm+gelu to profiler CMakelists.
      
      * Store in argument additional info.
      
      * Information about Mraw, Nraw, Kraw values.
      
      * Use FastGelu instead of Gelu.
      
      * Change client ex to use FastGelu
      
      * Remove relaxed error precision.
      
      * Remove duplicate output elementwise-op
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      9096b1c7
  25. 06 Mar, 2023 1 commit
  26. 15 Feb, 2023 2 commits
    • rocking5566's avatar
      Improve normalization (#580) · 6a6163a3
      rocking5566 authored
      * Sync the order of type string with template parameter
      
      * Add more instances
      
      * Check the vector size and remove redundant var
      
      * Extract var to static, prepare to separate sweep once kernel
      
      * Separate sweeponce flow and optimize the flow
      
      * 1. Rename AccDatatype in normalization to computeData
      2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
      
      * Remove useless code
      
      * Update naive variance kernel
      
      * Refine string
      
      * Fix typo
      
      * Support naive variance for device_normalization
      
      * Check the blocksize
      
      * Share the VGPR of x and y
      
      * Share the VGPR of gamma and beta
      
      * Add more instances
      
      * Support fp16 sqrt for experiment
      
      * Add CHANGELOG
      
      * Fix typo
      
      * clang-format
      6a6163a3
    • Adam Osewski's avatar
      Conv3D FWD BWD WRW fp16 fp32 client examples (#559) · e9fd1228
      Adam Osewski authored
      
      
      * Conv3d bwd weight client example.
      
      * Update year in license
      
      * Convolution bwd data 3D fp16/fp32 client example.
      
      * Client example for convnd fwd fp16 fp32
      
      * clang-format
      
      * Review remarks.
      
      * Fix compiler err.
      
      * Update data layout to standard one.
      
      * Add conv 3d fwd NDHWGC instances
      
      * clang-format
      
      * Conv3d fwd NDHWGC instances.
      
      ---------
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarzjing14 <zhangjing14@gmail.com>
      e9fd1228
  27. 13 Feb, 2023 1 commit
  28. 09 Feb, 2023 2 commits
    • rocking5566's avatar
      Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e
      rocking5566 authored
      * Add gemm + layernorm instance
      
      * Add ckProfiler
      
      * Add test
      
      * Add client example
      
      * Detect if user forger to set the workrspace
      
      * Use literal in the example
      
      * [What] use builtin function for sqrt
      [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
      
      * check gemm vaildity in IsSupportedArgument
      
      * Add more testcases
      
      * Merge duplicated folder in client example
      
      * Print more infomation
      
      * Use better kernel parameter for MS problem size
      
      * clang format
      
      * Add constexpr for if condition and remove redundant include
      
      * Remove cstdlib and add constexpr
      f7d28f3e
    • guangzlu's avatar
      Add instance for elementwise normlization (#573) · 76d144fa
      guangzlu authored
      * added instances for large N
      
      * add instance for elementwise normlization
      
      * added supported restrict in device_elementwise_normalization_impl.hpp
      76d144fa
  29. 08 Feb, 2023 1 commit
    • ltqin's avatar
      Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576) · 332ccc33
      ltqin authored
      * add instance for gemm bias softmax gemm
      
      * add client example
      
      * change CGridDesc_G_M_N to CGridDesc_G_M_O
      
      * add gridwise
      
      * change c grid name
      
      * device add d0s data
      
      * fix 08 client_example
      
      * add example 47_fused_attention
      
      * example output correct
      
      * add d0 to example
      
      * add d0 element op
      
      * rechange instance code
      
      * change Acc0ElementwiseOperation to C0DEElementwiseOperation
      
      * change example name
      
      * update instance for cdeelementwiseop
      
      * add bhalf_t ScaleAdd
      
      * add test
      
      * not surport geem1 bias
      
      * remove some ignore
      
      * fix test bug
      332ccc33
  30. 26 Jan, 2023 1 commit
  31. 25 Jan, 2023 1 commit
    • Qianfeng's avatar
      Batchnorm inference instances, external API, client examples and gtests (#531) · a1b2441f
      Qianfeng authored
      * File renaming and class renaming for device element-wise operation
      
      * Add batchnorm-infer instances, external API and client example
      
      * Add batchnorm-infer profiler module and gtests
      
      * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
      
      * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
      
      * Rename class and file due to conflict from device_elementwise_2d.hpp
      
      * Fix namespace in batcnnorm_infer_nhwc client example
      a1b2441f