1. 08 Jun, 2023 1 commit
  2. 31 May, 2023 2 commits
    • update copyright headers (#726) · b94fd0b2
      Illia Silin authored
    • Add class type support for __builtin_amdgcn_readfirstlane() (#711) · 582e31e8
      Po Yen Chen authored
      * Add overloaded version of __builtin_amdgcn_readfirstlane()
      
      * Remove 'static' specifiers
      
      * Remove more 'static' specifiers
      
      * Replace unsigned char with std::byte
      
      * Add 'const' specifier to never changing variable
      
      * Add 'inline' specifier to function definition
      
      * Fix wrong boundary calculation logic
      
      * Rename type trait
      
      * Remove std:: qualifier from standard types
      
      * Replace 'size_t' by 'unsigned'
      
      * Use type alias to hint usage
      
      * Replace static_for<> by ordinary 'for' loop
      
      * Rename readfirstlane() to amd_wave_read_first_lane()
      
      * Rename file readfirstlance.hpp to amd_wave_read_first_lane.hpp
      
      * Reorder statements
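      For illustration, a minimal sketch of what the resulting wrapper could look like, assuming the object is broadcast 32 bits at a time (helper shape is paraphrased from the bullets above, not the repo's exact code):

          #include <hip/hip_runtime.h>

          // Broadcast a trivially-copyable object from the first active lane
          // to the whole wave, one 32-bit word at a time.
          template <typename Object>
          __device__ Object amd_wave_read_first_lane(const Object& obj)
          {
              constexpr unsigned num_words = (sizeof(Object) + 3) / 4; // round up
              unsigned words[num_words]    = {};
              __builtin_memcpy(words, &obj, sizeof(Object));
              for(unsigned i = 0; i < num_words; ++i)
                  words[i] = __builtin_amdgcn_readfirstlane(words[i]);
              Object result;
              __builtin_memcpy(&result, words, sizeof(Object));
              return result;
          }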
  3. 24 May, 2023 1 commit
  4. 04 May, 2023 1 commit
    • Optimize bf16 conversion (#664) · b076a02a
      Rostyslav Geyyer authored
      * Add TypeConvert class and start refactoring
      
      * Refactor TypeConvert as a struct
      
      * Get back to template functions type_convert
      
      * Add a type_convert_bf16_rtn, set rtz as default
      
      * Clean up
      
      * Add UnaryConvertPrecision struct for high-precision workloads
      
      * Format
      
      * Update type_convert to UnaryConvert on threadwise level
      
      * Update UnaryConvertPrecision
      
      * Format
      
      * Fix chmod
      
      * Add a flag to pick conversion method
      
      * Format
      
      * Remove the added flag
      
      * Merge elementwise op with type conversion
      
      * Move type_convert to elemwise op, update the op
      
      * Update type_convert_precision -> bf16_convert_rtn
      
      * Clean up
      
      * Update comments
      
      * Update the CK_WORKAROUND_DENORM_FIX flag handling
      
      * Update the unneeded op to work, but warn the user
      
      * Remove the message
      
      * Use a PassThrough instead of ConvertBF16RTN to calculate reference
      
      * Format
      
      * Add missing include
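      For reference, a minimal sketch of the two rounding modes involved, assuming the usual bf16 bit manipulation (illustrative helpers, not CK's type_convert API; NaN handling omitted):

          #include <cstdint>
          #include <cstring>

          using bhalf_t = uint16_t; // bf16 storage type

          // RTZ: truncate the low 16 mantissa bits.
          inline bhalf_t f32_to_bf16_rtz(float x)
          {
              uint32_t u;
              std::memcpy(&u, &x, sizeof(u));
              return static_cast<bhalf_t>(u >> 16);
          }

          // RTN: round to nearest even before truncating.
          inline bhalf_t f32_to_bf16_rtn(float x)
          {
              uint32_t u;
              std::memcpy(&u, &x, sizeof(u));
              u += 0x7FFFu + ((u >> 16) & 1); // ties go to the even mantissa
              return static_cast<bhalf_t>(u >> 16);
          }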
  5. 28 Apr, 2023 1 commit
  6. 10 Apr, 2023 1 commit
    • Groupnorm + swish external api (#668) · ed3a2e52
      rocking5566 authored
      * Rename to proper naming
      
      * Add example of groupnorm + swish
      
      * Extract duplicate code in example
      
      * Add groupnorm + swish instances
      
      * Refactor instance generation, splitting it into multiple cpp files
      
      * Add external api and client example
      
      * Refine profiler message
      
      * Use ck math version of exp
      
      * Refine problem size in example
      
      * Add host version of exp
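      As a reference for what the fused epilogue computes, a hedged sketch (the functor shape is illustrative; per the bullets above, device code uses ck's math exp rather than std::exp):

          #include <cmath>

          // swish(x) = x * sigmoid(x) = x / (1 + exp(-x))
          struct Swish
          {
              void operator()(float& y, const float& x) const
              {
                  y = x / (1.f + std::exp(-x));
              }
          };

          // Groupnorm + swish then chains the two:
          //   y = swish((x - mean_g) / sqrt(var_g + eps) * gamma + beta)
          // with mean_g / var_g computed per group.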
  7. 29 Mar, 2023 2 commits
  8. 20 Mar, 2023 1 commit
  9. 15 Mar, 2023 2 commits
    • gemm/Conv xdlops + dlops quantization (#625) · 16dc18e0
      rocking5566 authored
      
      
      * Add conv perlayer quantization
      
      * Add gemm_dlops quantization
      
      * Support int8 for innerproduct
      
      * Refine gemm dlops int8 kernel parameter
      
      * Support gfx908 (MI100) and gfx90a (MI200)
      
      * clang-format
      
      * Rename example number
      
      * Support different layout for d tensor
      
      * Add conv dlops perchannel quantization example
      
      * Move to example 40
      
      * Extract the common code for different platform (dlops and xdlops)
      
      * Move to subfolder, preparing to add other quantization ops
      
      * Refine the quantization instance library
      
      * Add conv dl instances and client example
      
      * Remove unnecessary type
      
      * Add gemm quantization instance
      
      * Add external api and client example
      
      * Refine num_bytes
      
      * Separate different layouts into different cpp files
      
      * Add more xdl instances
      
      * Revert "Remove unnecessary type"
      
      This reverts commit 820869182f6a8f62b2c9004101ba6bf76b96be14.
      
      * Remove CShuffleDataType in dlops
      Let acc and CShuffleDataType be the same in xdlops
      
      ---------
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
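      For orientation, a minimal sketch of the per-layer / per-channel requantization described above, assuming the usual int32-accumulator scheme (illustrative, not the CK operator interface):

          #include <algorithm>
          #include <cmath>
          #include <cstdint>

          // Scale an int32 GEMM/conv accumulator back into int8.
          inline int8_t requantize(int32_t acc, float scale)
          {
              float y = std::round(static_cast<float>(acc) * scale);
              return static_cast<int8_t>(std::min(127.f, std::max(-128.f, y)));
          }

          // Per-layer quantization uses one 'scale' for the whole tensor;
          // per-channel quantization indexes 'scale' by the output channel.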
    • Fix arch limitation bug (#639) · ea028ac6
      Haocong WANG authored
  10. 10 Mar, 2023 1 commit
  11. 09 Mar, 2023 1 commit
  12. 15 Feb, 2023 1 commit
    • Improve normalization (#580) · 6a6163a3
      rocking5566 authored
      * Sync the order of type string with template parameter
      
      * Add more instances
      
      * Check the vector size and remove redundant var
      
      * Extract var to static, preparing to separate the sweep-once kernel
      
      * Separate the sweep-once flow and optimize it
      
      * 1. Rename AccDatatype in normalization to computeData
      2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
      
      * Remove useless code
      
      * Update naive variance kernel
      
      * Refine string
      
      * Fix typo
      
      * Support naive variance for device_normalization
      
      * Check the blocksize
      
      * Share the VGPR of x and y
      
      * Share the VGPR of gamma and beta
      
      * Add more instances
      
      * Support fp16 sqrt for experiment
      
      * Add CHANGELOG
      
      * Fix typo
      
      * clang-format
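      For contrast with the Welford path used elsewhere in CK, a hedged sketch of the "naive variance" mentioned above (host-side illustration, not the device_normalization kernel):

          #include <cstddef>

          // One pass accumulating sum(x) and sum(x^2); var = E[x^2] - E[x]^2.
          inline void naive_mean_var(const float* x, std::size_t n, float& mean, float& var)
          {
              float s = 0.f, ss = 0.f;
              for(std::size_t i = 0; i < n; ++i)
              {
                  s += x[i];
                  ss += x[i] * x[i];
              }
              mean = s / n;
              var  = ss / n - mean * mean; // less numerically stable than Welford
          }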
  13. 09 Feb, 2023 1 commit
    • Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e
      rocking5566 authored
      * Add gemm + layernorm instance
      
      * Add ckProfiler
      
      * Add test
      
      * Add client example
      
      * Detect if the user forgot to set the workspace
      
      * Use literal in the example
      
      * [What] Use builtin function for sqrt.
      [Why] The compiler will not emit v_sqrt_f64_e64 if we use ::sqrt().
      
      * Check gemm validity in IsSupportedArgument
      
      * Add more testcases
      
      * Merge duplicated folder in client example
      
      * Print more information
      
      * Use better kernel parameter for MS problem size
      
      * clang format
      
      * Add constexpr for if condition and remove redundant include
      
      * Remove cstdlib and add constexpr
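      A minimal sketch of the sqrt change, assuming the builtin in question behaves like clang's __builtin_sqrt (an assumption; the commit does not name the exact builtin used):

          #include <hip/hip_runtime.h>

          __device__ double device_sqrt(double x)
          {
              // ::sqrt() reportedly kept the compiler from emitting
              // v_sqrt_f64_e64; the builtin can lower to the hardware op.
              return __builtin_sqrt(x);
          }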
  14. 18 Jan, 2023 1 commit
    • Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760
      Raman R jana authored
      
      
      * wavelet gemm programming model support for CK
      
      * GEMM pipeline update for wavelet programming model
      
      * Updated wavelet programming pipeline
      
      * fixes for global-write for math-wave
      
      * fixed bug in global writes
      
      * Updated comments for better readability
      
      * fixed clang format errors
      
      * added block_lds without barrier sync
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * refactor
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Add comment depicting B2C mapping pattern.
      
      * Formatting & comments.
      
      * Comparison with custom B2C mapping pattern.
      
      * Example for wavelet gemm.
      
      * Add wavelet to Gemm standalone test.
      
      * Remove debug code.
      
      * Remove dangling #endif directive.
      
      Co-authored-by: root <Raman Jana>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
  15. 17 Jan, 2023 2 commits
    • Reduction external API and client examples (#493) · 80e05267
      Qianfeng authored
      
      
      * Change to the DeviceReduce base class template to include all problem description information
      
      * Add external api for reduction
      
      * Add client example to test the reduction external api
      
      * Spelling correction
      
      * Re-implement the host_reduction to follow the DeviceReduce base API format
      
      * Change the reduce profiler to call the external API for collecting device instances
      
      * Rename reduce client example directory from 08_reduce to 12_reduce
      
      * Remove (void) before the function call
      
      * Tiny update in reduce client example
      
      * Tiny update in profile_reduce_impl.hpp
      
      * Rename the reduce client example directory
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
    • [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f
      Haocong WANG authored
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + add unit tests for all types (int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
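      On the warpSize workaround above: a hedged sketch of pinning the wave size at compile time instead of trusting the runtime value, assuming the compiler-defined __AMDGCN_WAVEFRONT_SIZE macro is available (the commit does not say which fix was used):

          #if defined(__AMDGCN_WAVEFRONT_SIZE)
          constexpr unsigned wave_size = __AMDGCN_WAVEFRONT_SIZE; // 32 on Navi3x
          #else
          constexpr unsigned wave_size = 64; // fallback assumption for host passes
          #endif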
  16. 12 Jan, 2023 1 commit
    • Remove inclusion of cmath (#551) · a17b0414
      Qianfeng authored
      * Include cmath only when compiling host code in math_v2.hpp
      
      * Remove the cmath include from device_base.hpp and device_permute.hpp
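      A minimal sketch of the resulting include pattern in math_v2.hpp, assuming the standard HIP compilation-pass macro (illustrative, not the exact diff):

          // <cmath> is only needed by the host-side implementations;
          // __HIP_DEVICE_COMPILE__ is defined during device compilation.
          #if !defined(__HIP_DEVICE_COMPILE__)
          #include <cmath>
          #endif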
  17. 07 Dec, 2022 1 commit
  18. 02 Dec, 2022 1 commit
  19. 15 Nov, 2022 1 commit
    • Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328
      guangzlu authored
      
      
      * fixed bug in softmax reference & added bf16 examples for batched_gemm_scale_softmax_gemm
      
      * added bf16 tests for batched_gemm_softmax_gemm_permute
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * aligned annotations
      
      * modified CMakeLists for examples
      
      * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
      
      * use macro to control the instances
      
      * added macro control into instances
      
      * clang-format some files
      
      * changed error tolerance for bf16
      
      * changed index for 10_elementwise_normalization
      
      * fixed xdlops code bug in amd_xdlops.hpp
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  20. 20 Sep, 2022 1 commit
    • Add 'Permute' device op & example (#408) · f584ab0c
      Po Yen Chen authored
      * Add example folder for 'DeviceElementwise'
      
      * Re-structure example files
      
      * Move common parts into common.hpp
      
      * Use more strict input
      
      * Add more helper methods in 'DeviceElementwise'
      
      * Use more specific method to write example
      
      * Allow specifying the problem through a command line argument
      
      * Allow specifying the problem 'axes' through a command line argument
      
      * Add check to template type argument
      
      * Add transpose_shape() to generalize shape permute
      
      * Generalize transpose utility functions
      
      * Use better name for tensor indices
      
      * Add checks in helper functions
      
      * Remove debug messages
      
      * Refine error message for check_err()
      
      * Generalize variable naming in example code
      
      * Add device op 'DevicePermute'
      
      This device op is a clone of 'DeviceElementwise'
      
      * Use 'DevicePermute' device op in example
      
      * Remove 'elementwise' from identifiers
      
      * Remove 'elementwise' from file paths
      
      * Remove base class of 'DevicePermute'
      
      * Let 'DevicePermute' inherit from 'BaseOperator'
      
      * Add simple type traits to validate device op type
      
      * Add static_assert() to check type constraints
      
      * Create 'DevicePermuteBase' to generate methods
      
      * Use indirect base type to generate methods
      
      * Remove 'is_device_op<>' type traits
      
      * Only accept single-input-single-output for 'DevicePermute'
      
      * Simplify 'DevicePermute' interface
      
      * Re-format 'DeviceElementwise'
      
      * Use CRTP to generate overridden virtual method
      
      * Remove unnecessary include directives
      
      * Distinguish input & output shape in 'DevicePermute'
      
      * Passing 'axes' to 'DevicePermute'
      
      * Use more reasonable return value for Invoker::Run()
      
      * Add 'GridwisePermute' kernel
      
      This kernel is a clone of 'GridwiseElementwise_1D'
      
      * Remove no-longer used type argument
      
      * Check if input/output shape meet the requirement
      
      * Remove no-longer used method
      
      * Remove never-entered-if-clause
      
      * Change problem description for 'DevicePermute'
      
      * Transform descriptor into 3 dimensions
      
      * Add debug code to verify result
      
      * Add comment to indicate template argument location
      
      * Add N/H/WPerBlock template parameter to 'DevicePermute'
      
      * Rename 'GridwisePermute' to 'GridwiseCopy'
      
      * Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
      
      * Add missing include directive
      
      * Add 'BlockSize' parameter to 'DevicePermute'
      
      * Remove no-longer used method
      
      * Add 'BlockToTileMap' for 'GridwiseCopy'
      
      * Use the normal Block2TileMap convention
      
      * Rename 'BlockToTileMap' as 'Block2TileMap'
      
      * Fix most of compilation errors
      
      * Let 'Block2TileMap' map block to 2d coordinate
      
      * Allow data transfer in 'GridwiseCopy'
      
      * Fix wrong output descriptor for 2nd blockwise copy
      
      * Rename 'GridwiseCopy' as 'GridwisePermute'
      
      * Remove '1d' in identifiers
      
      * Remove commented-out codes
      
      * Remove 'MPerThread' template parameter
      
      * Separate template parameters
      
      * Unify variable naming convention
      
      * Use more verbose way to create expressions
      
      * Add template parameter 'InBlockLdsExtraW'
      
      * Release the constraint on In/OutGridDesc
      
      * Use data type directly as template argument
      
      * Re-arrange template arguments for blockwise copy
      
      * Remove no-longer used template parameters
      
      * Embed layout in the variable names
      
      * Add GridwisePermute::CheckValidity()
      
      * Extract local types as template parameters
      
      * Rename local type alias
      
      * Add more template parameters (vector width related)
      
      * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
      
      * Fill tensor values start from 1
      
      * Re-format example code
      
      * Avoid too-large block id
      
      * Add comment
      
      * Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
      
      * Add check for the 'VectorDim' & 'ScalarPerVector' template params
      
      * Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc
      
      * Remove no-longer used template parameter 'NPerBlock'
      
      * Fix wrong descriptor creation logics
      
      * Specify problem in each example
      
      * Use better example name
      
      * Add new example 'example_permute_NxHxW_fp32'
      
      * Add example demonstrating bundling multiple elements in a tensor
      
      * Add support to permute multiple elements together
      
      * Change the default problem size
      
      * Add span<> class template
      
      * Use span<> to generalize check_err() interface
      
      * Fix ambiguous ctor call
      
      * Avoid creating unnecessary objects
      
      * Use helper functions to simplify example code
      
      * Add example for 4xfp16 permute
      
      * Disable failed-to-compile example
      
      * Add check for the NUM_ELEMS_IN_BUNDLE
      
      * Remove redundant parameter in helper lambda function
      
      * Add check for the input tensor type's byte-size
      
      * Check scalar-per-vector with padded length
      
      * Use more verbose name to avoid name collision
      
      * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
      
      * Embed shape info in name of descriptor constructor
      
      * Rename example folder '36_permute' into '37_permute'
      
      * Avoid using too-large LDS in kernel code
      
      * Remove redundant example
      
      * Use switch() to group similar code
      
      * Add const to the span<> type argument
      
      * Simply initialize tensor with floating point values
      
      * Use fp16 as data type in all examples
      
      * Enlarge tensor size in example
      
      * Enlarge N-dim in example
      
      * Add check for the bundled type in example
      
      * Use a stricter error threshold
      
      * Remove global load/store loop in kernel code
      
      * Measure execution time by default
      
      * Use faster device op config for example 'NxHxW_fp16'
      
      * Use faster device op config for example '1xHxW_fp16'
      
      * Use faster device op config for example 'HxWx4_fp16'
      
      * Remove cmd arg parsing logics
      
      * Rename functions
      
      * Extract bundle permutation logic out
      
      * Simplify permute bundle example
      
      * Add Tensor<>::GetElementSpaceSizeInBytes()
      
      * Add Tensor<>::data()
      
      * Use new methods to simplify code
      
      * Use type alias to replace duplicated code
      
      * Use existing method to shorten code
      
      * Allow FillUniformDistribution to accept a range argument
      
      * Initialize random values in range
      
      * Add Tensor<>::size()
      
      * Use more meaningful names in permute bundle example
      
      * Use more meaningful names in permute element examples
      
      * Use rangified copy() to copy elements
      
      * Use function return value directly to eliminate variables
      
      * Add to_array() conversion tool to eliminate more variables
      
      * Add Tensor<>::AsSpan<>() to create view of tensor values
      
      * Use AsSpan() to shorten check_err() calls
      
      * Remove no-longer-used 'using' directives
      
      * Move 'using' directive to proper code position
      
      * Remove redundant variables
      
      * Remove useless static_assert()
      
      * Add check for range types
      
      * Declare variable right before first use
      
      * Move long return type to trailing return type
      
      * Add BaseInvokerCRTP<> class template to generate method
      
      * Create new base type for 'DevicePermute' implementations
      
      * Move 'NumDim' template param to the first
      
      * Rename 'DevicePermute' to 'DevicePermuteImpl'
      
      * Add 'noexcept' specifier to CRTP generated method
      
      * Move 'Block2TileMap' definition into 'GridwisePermute'
      
      * Use type alias to reduce code
      
      * Unify naming style in 'DevicePermute'
      
      * Add comments in 'GridwisePermute'
      
      * Rename permute example folder
      
      * Use std::cerr to report error
      
      * Use larger shape in examples
      
      * Rename '38_permute' to '39_permute'
      
      * Make sure we use unsigned type for shape & indices
      
      * Remove opt-ed out assertion
      
      * Remove template BaseInvokerCRTP<>
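      To summarize what the examples exercise, a hedged host reference for the device op (illustrative code; the CK kernel tiles this through LDS with vectorized transfers):

          #include <array>
          #include <cstddef>
          #include <vector>

          // out = permute(in, axes): the output coordinate along dim d walks
          // the input's axis axes[d], e.g. axes = {0, 2, 1} swaps the last two dims.
          std::vector<float> permute3d(const std::vector<float>& in,
                                       std::array<std::size_t, 3> shape,
                                       std::array<std::size_t, 3> axes)
          {
              const std::array<std::size_t, 3> out_shape{
                  shape[axes[0]], shape[axes[1]], shape[axes[2]]};
              std::vector<float> out(in.size());
              for(std::size_t i = 0; i < out_shape[0]; ++i)
                  for(std::size_t j = 0; j < out_shape[1]; ++j)
                      for(std::size_t k = 0; k < out_shape[2]; ++k)
                      {
                          std::array<std::size_t, 3> src{};
                          src[axes[0]] = i;
                          src[axes[1]] = j;
                          src[axes[2]] = k;
                          out[(i * out_shape[1] + j) * out_shape[2] + k] =
                              in[(src[0] * shape[1] + src[1]) * shape[2] + src[2]];
                      }
              return out;
          }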
  21. 19 Sep, 2022 2 commits
    • Conv bwd data multiple d (#404) · 27858374
      Shaojie WANG authored
      
      
      * init commit of convnd bwd data
      
      * begin compiling example
      
      * have a first version that produces correct results
      
      * refine device level launch kernel code
      
      * add more instances in example and get right results
      
      * clang-format
      
      * format example file
      
      * add more instances
      
      * fix instances
      
      * adding conv_bwd_data multiple_d
      
      * adding conv_bwd_data multiple_d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * adding conv_bwd multiple d
      
      * refactor
      
      * refactor
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * adding conv bwd data multiple d
      
      * refactor
      
      * update conv fwd's bias impl
      
      * refactor
      
      * reorg file
      
      * clean up cmake
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
  22. 09 Sep, 2022 1 commit
  23. 30 Aug, 2022 2 commits
    • Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * Gemm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      - All ReductionAccData and R<N>DataType have to be F32; the INT32 data type gives wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to division rounding error.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
    • Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
  24. 25 Aug, 2022 1 commit
  25. 24 Aug, 2022 1 commit
  26. 23 Aug, 2022 1 commit
  27. 18 Aug, 2022 1 commit
  28. 13 Aug, 2022 4 commits
    • Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix bug when ThreadSliceK > 1
      
      * Fix bug when merging two empty sets
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] Replace __syncthreads() with block_sync_lds().
      [Why] __syncthreads() might wait on both lgkmcnt(0) and vmcnt(0).
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
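      For reference, a hedged sketch of the threadwise update and pairwise merge a Welford layernorm builds on (illustrative, not CK's implementation; note the merge must tolerate empty partitions, per the bug fix above):

          struct Welford
          {
              int   count = 0;
              float mean  = 0.f;
              float m2    = 0.f; // running sum of squared deviations

              void update(float x)
              {
                  ++count;
                  const float delta = x - mean;
                  mean += delta / count;
                  m2   += delta * (x - mean);
              }
          };

          // Combine two partial results (e.g. across threads of a block).
          inline Welford merge(const Welford& a, const Welford& b)
          {
              Welford r;
              r.count = a.count + b.count;
              if(r.count == 0) return r; // merging two empty sets
              const float delta = b.mean - a.mean;
              r.mean = a.mean + delta * b.count / r.count;
              r.m2   = a.m2 + b.m2 + delta * delta * a.count * b.count / r.count;
              return r;
          }
          // variance = m2 / count once all partials are merged.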
    • Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
    • Skip LDS of B matrix (#326) · 10b3278b
      ltqin authored
      * start
      
      * read for gridwise gemm
      
      * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1
      
      * add thread copy desc and register buffer
      
      * add K0PerBlock dim
      
      * add read global data
      
      * finish gridwise gemm
      
      * finish blockwise gemm
      
      * add print data
      
      * add smallest config
      
      * add compare code for gridwise gemm
      
      * fix NXdlPerWave
      
      * fix K0PerThread and gridwise gemm main loop
      
      * remove b matrix lds alloc
      
      * fix name
      
      * add test code
      
      * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter
      
      * add double register
      
      * modify b_thread_desc_
      
      * add float
      
      * fp16 tag
      
      * add tail for pipeline
      
      * finish main loop
      
      * optimize main loop
      
      * start clear gridwise gemm
      
      * clear code
      
      * clear redundant code
      
      * change file name
      
      * change file name
      
      * fix bug after merge develop
      
      * fix input parameters
      
      * using MultiK0 control b load data loop
      
      * fix some config
      
      * 4 buffer
      
      * fix bug
      
      * one can use
      
      * change read order
      
      * change buffer array to tuple
      
      * change to 8 buffer
      
      * interleave buffer load
      
      * change to 16
      
      * read 8 buffer
      
      * add data buffer to template
      
      * fix after merge develop(head file)
      
      * format
      
      * change to 4 buffer
      
      * remove unnecessary lambda fun
    • Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwise softmax v1
      
      * iteratively update softmax+gemm
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
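      For context, a hedged sketch of the streaming row softmax that the blockwise version parallelizes: a running max and a rescaled running sum let score tiles be consumed iteratively (host illustration, not CK's kernel):

          #include <algorithm>
          #include <cmath>
          #include <cstddef>
          #include <vector>

          void softmax_row_streaming(const std::vector<float>& row,
                                     std::vector<float>& out)
          {
              float running_max = -INFINITY;
              float running_sum = 0.f;
              for(float s : row)
              {
                  const float new_max = std::max(running_max, s);
                  // rescale the old sum whenever the max grows
                  running_sum = running_sum * std::exp(running_max - new_max) +
                                std::exp(s - new_max);
                  running_max = new_max;
              }
              out.resize(row.size());
              for(std::size_t i = 0; i < row.size(); ++i)
                  out[i] = std::exp(row[i] - running_max) / running_sum;
          }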
  29. 11 Aug, 2022 1 commit
    • Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504
      Po Yen Chen authored
      * Add always_false<> util to delay symbol resolution
      
      * Use always_false<> to prevent trying instantiate unwanted method
      
      * Add new specializations of AddAddFastGelu::operator() method
      
      * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
      
      * Use floating point literal to simplify code
      
      * Remove unnecessary capture in lambda expressions
      
      * Extract fast GeLU calculation as standalone method
      
      * Mark methods as 'constexpr'
      
      * Add constraint for HostTensorDescriptor templated ctors
      
      * Simplify HostTensorDescriptor ctor calls
      
      * Add C++23 std::size_t literal suffix
      
      * Use _uz suffix to shorten example code
      
      * Remove unnecessary conversion to std::array<>
      
      * Re-order include directives
      
      * Remove C-style casting by literal suffix
      
      * Remove unnecessary statements in main()
      
      * Remove unused type parameter of always_false<>
      
      * Remove unused include directive
      
      * Exit main() by returning meaningful value
      
      * Use 'if constexpr' to switch example flow
      
      * Use std::is_same_v<> to shorten example code
      
      * Add 'inline' specifier to literal functions
      
      * Unify output methods in example
      
      * Move common codes into .inc file
      
      * Add type check in type_convert<>()
      
      * Add type_convert<float>() before computation
      
      * Merge AddAddFastGelu method specializations
      
      * Remove always_false<>
      
      * Add constraint to AddAddFastGelu::operator() parameter types
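      For reference, a hedged sketch of an AddAddFastGelu-style functor: two adds feeding a fast GeLU. The tanh approximation shown is the common one; CK's exact formula may differ:

          #include <cmath>

          struct AddAddFastGelu
          {
              // e = fast_gelu(c + d0 + d1)
              void operator()(float& e, const float& c,
                              const float& d0, const float& d1) const
              {
                  const float x = c + d0 + d1;
                  // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
                  e = 0.5f * x *
                      (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
              }
          };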
  30. 02 Aug, 2022 1 commit
    • CGEMM examples bf16, fp32, int8 (#332) · fb0dc358
      Adam Osewski authored
      
      
      * Add int8 specialization for elementwise Add and Subtract.
      
      * CGEMM examples bf16, fp32, int8
      
      * Add convert reference output to CDataType.
      
      * Skip BF16 data type during testing.
      
      * Lower K value to get rid of accumulation error.
      
      * Fix merge artifact.
      
      * Fix changed function name: GetElementSpaceSize()
      
      * Fix merge artifact.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
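      As background for the elementwise Add/Subtract specializations, a hedged scalar reference for CGEMM: a complex GEMM decomposes into real GEMMs combined elementwise (illustrative, not the CK pipeline):

          // C_real = A_real*B_real - A_imag*B_imag
          // C_imag = A_real*B_imag + A_imag*B_real
          void cgemm_ref(int M, int N, int K,
                         const float* Ar, const float* Ai,
                         const float* Br, const float* Bi,
                         float* Cr, float* Ci)
          {
              for(int m = 0; m < M; ++m)
                  for(int n = 0; n < N; ++n)
                  {
                      float re = 0.f, im = 0.f;
                      for(int k = 0; k < K; ++k)
                      {
                          re += Ar[m * K + k] * Br[k * N + n] - Ai[m * K + k] * Bi[k * N + n];
                          im += Ar[m * K + k] * Bi[k * N + n] + Ai[m * K + k] * Br[k * N + n];
                      }
                      Cr[m * N + n] = re;
                      Ci[m * N + n] = im;
                  }
          }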
  31. 29 Jul, 2022 1 commit
    • Clean up conv example, instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test