1. 20 Sep, 2022 1 commit
    • Add 'Permute' device op & example (#408) · f584ab0c
      Po Yen Chen authored
      * Add example folder for 'DeviceElementwise'
      
      * Re-structure example files
      
      * Move common parts into common.hpp
      
      * Use stricter input
      
      * Add more helper methods in 'DeviceElementwise'
      
      * Use more specific method to write example
      
      * Allow specifying problem through command line argument
      
      * Allow specifying problem 'axes' through command line argument
      
      * Add check to template type argument
      
      * Add transpose_shape() to generalize shape permute
      
      * Generalize transpose utility functions
      
      * Use better name for tensor indices
      
      * Add checks in helper functions
      
      * Remove debug messages
      
      * Refine error message for check_err()
      
      * Generalize variable naming in example code
      
      * Add device op 'DevicePermute'
      
      This device op is a clone of 'DeviceElementwise'
      
      * Use 'DevicePermute' device op in example
      
      * Remove 'elementwise' from identifiers
      
      * Remove 'elementwise' from file paths
      
      * Remove base class of 'DevicePermute'
      
      * Let 'DevicePermute' inherit from 'BaseOperator'
      
      * Add simple type traits to validate device op type
      
      * Add static_assert() to check type constraints
      
      * Create 'DevicePermuteBase' to generate methods
      
      * Use indirect base type to generate methods
      
      * Remove 'is_device_op<>' type traits
      
      * Only accept single-input-single-output for 'DevicePermute'
      
      * Simplify 'DevicePermute' interface
      
      * Re-format 'DeviceElementwise'
      
      * Use CRTP to generate overridden virtual method (see the sketch after this list)
      
      * Remove unnecessary include directives
      
      * Distinguish input & output shape in 'DevicePermute'
      
      * Passing 'axes' to 'DevicePermute'
      
      * Use more reasonable return value for Invoker::Run()
      
      * Add 'GridwisePermute' kernel
      
      This kernel is a clone of 'GridwiseElementwise_1D'
      
      * Remove no-longer used type argument
      
      * Check if input/output shape meet the requirement
      
      * Remove no-longer used method
      
      * Remove never-entered if-clause
      
      * Change problem description for 'DevicePermute'
      
      * Transform descriptor into 3 dimensions
      
      * Add debug code to verify result
      
      * Add comment to indicate template argument location
      
      * Add N/H/WPerBlock template parameter to 'DevicePermute'
      
      * Rename 'GridwisePermute' to 'GridwiseCopy'
      
      * Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
      
      * Add missing include directive
      
      * Add 'BlockSize' parameter to 'DevicePermute'
      
      * Remove no-longer used method
      
      * Add 'BlockToTileMap' for 'GridwiseCopy'
      
      * Use the normal Block2TileMap convention
      
      * Rename 'BlockToTileMap' as 'Block2TileMap'
      
      * Fix most of compilation errors
      
      * Let 'Block2TileMap' map block to 2d coordinate
      
      * Allow data transfer in 'GridwiseCopy'
      
      * Fix wrong output descriptor for 2nd blockwise copy
      
      * Rename 'GridwiseCopy' as 'GridwisePermute'
      
      * Remove '1d' in identifiers
      
      * Remove commented-out code
      
      * Remove 'MPerThread' template parameter
      
      * Separate template parameters
      
      * Unify variable naming convention
      
      * Use more verbose way to create expressions
      
      * Add template parameter 'InBlockLdsExtraW'
      
      * Relax the constraint on In/OutGridDesc
      
      * Use data type directly as template argument
      
      * Re-arrange template arguments for blockwise copy
      
      * Remove no-longer used template parameters
      
      * Embed layout in the variable names
      
      * Add GridwisePermute::CheckValidity()
      
      * Extract local types as template parameters
      
      * Rename local type alias
      
      * Add more template parameters (vector width related)
      
      * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
      
      * Fill tensor values starting from 1
      
      * Re-format example code
      
      * Avoid too-large block id
      
      * Add comment
      
      * Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
      
      * Add check for the 'VectorDim' & 'ScalarPerVector' template params
      
      * Let 'DstVectorDim' equal 'SrcVectorDim' after transposing out grid desc
      
      * Remove no-longer used template parameter 'NPerBlock'
      
      * Fix wrong descriptor creation logics
      
      * Specify problem in each example
      
      * Use better example name
      
      * Add new example 'example_permute_NxHxW_fp32'
      
      * Add example demonstrating bundling multiple elems in tensor
      
      * Add support to permute multiple elements together
      
      * Change the default problem size
      
      * Add span<> class template
      
      * Use span<> to generalize check_err() interface
      
      * Fix ambiguous ctor call
      
      * Avoid creating unnecessary objects
      
      * Use helper functions to simplify example code
      
      * Add example for 4xfp16 permute
      
      * Disable failed-to-compile example
      
      * Add check for the NUM_ELEMS_IN_BUNDLE
      
      * Remove redundant parameter in helper lambda function
      
      * Add check for the input tensor type's byte-size
      
      * Check scalar-per-vector with padded length
      
      * Use more verbose name to avoid name collision
      
      * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
      
      * Embed shape info in name of descriptor constructor
      
      * Rename example folder '36_permute' into '37_permute'
      
      * Avoid using too-large LDS in kernel code
      
      * Remove redundant example
      
      * Use switch() to group similar code
      
      * Add const to the span<> type argument
      
      * Simply initialize tensor with floating point values
      
      * Use fp16 as data type in all examples
      
      * Enlarge tensor size in example
      
      * Enlarge N-dim in example
      
      * Add check for the bundled type in example
      
      * Use stricter error threshold
      
      * Remove global load/store loop in kernel code
      
      * Measure execution time by default
      
      * Use faster device op config for example 'NxHxW_fp16'
      
      * Use faster device op config for example '1xHxW_fp16'
      
      * Use faster device op config for example 'HxWx4_fp16'
      
      * Remove cmd arg parsing logics
      
      * Rename functions
      
      * Extract bundle permutation logic out
      
      * Simplify permute bundle example
      
      * Add Tensor<>::GetElementSpaceSizeInBytes()
      
      * Add Tensor<>::data()
      
      * Use new methods to simplify code
      
      * Use type alias to replace duplicated code
      
      * Use existing method to shorten code
      
      * Allow FillUniformDistribution to accept range argument
      
      * Initialize random values in range
      
      * Add Tensor<>::size()
      
      * Use more meaningful names in permute bundle example
      
      * Use more meaningful names in permute element examples
      
      * Use rangified copy() to copy elements
      
      * Use function return value directly to eliminate variables
      
      * Add to_array() conversion tool to eliminate more variables
      
      * Add Tensor<>::AsSpan<>() to create view of tensor values
      
      * Use AsSpan() to shorten check_err() calls
      
      * Remove no-longer-used 'using' directives
      
      * Move 'using' directive to proper code position
      
      * Remove redundant variables
      
      * Remove useless static_assert()
      
      * Add check for range types
      
      * Declare variable right before first use
      
      * Move long return type to trailing return type
      
      * Add BaseInvokerCRTP<> class template to generate method
      
      * Create new base type for 'DevicePermute' implementations
      
      * Move 'NumDim' template param to the first
      
      * Rename 'DevicePermute' to 'DevicePermuteImpl'
      
      * Add 'noexcept' specifier to CRTP generated method
      
      * Move 'Block2TileMap' definition into 'GridwisePermute'
      
      * Use type alias to reduce code
      
      * Unify naming style in 'DevicePermute'
      
      * Add comments in 'GridwisePermute'
      
      * Rename permute example folder
      
      * Use std::cerr to report error
      
      * Use larger shape in examples
      
      * Rename '38_permute' to '39_permute'
      
      * Make sure we use unsigned type for shape & indices
      
      * Remove opted-out assertion
      
      * Remove template BaseInvokerCRTP<>
      f584ab0c
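      A rough sketch of the CRTP step above: a small base template generates the overridden virtual method once for every implementation class. The class and method names below are hypothetical stand-ins, not ck's actual interfaces.

```cpp
#include <memory>
#include <string>

// Base class standing in for ck's BaseOperator; only the pieces needed to
// show the pattern are modeled here.
struct BaseOperator
{
    virtual ~BaseOperator()                   = default;
    virtual std::string GetTypeString() const = 0;
};

// CRTP helper: writes the virtual-method override once, for every derived
// device op that supplies a static Name().
template <typename Derived>
struct DeviceOpCRTP : BaseOperator
{
    std::string GetTypeString() const override { return Derived::Name(); }
};

// A concrete device op only provides what the CRTP base consumes.
struct DevicePermuteImpl : DeviceOpCRTP<DevicePermuteImpl>
{
    static std::string Name() { return "DevicePermuteImpl"; }
};

int main()
{
    std::unique_ptr<BaseOperator> op = std::make_unique<DevicePermuteImpl>();
    return op->GetTypeString() == "DevicePermuteImpl" ? 0 : 1;
}
```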
  2. 19 Sep, 2022 1 commit
  3. 15 Sep, 2022 2 commits
  4. 14 Sep, 2022 1 commit
    • batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c
      ltqin authored
      
      
      * refactor
      
      * start
      
      * add device gemm file
      
      * add BatchStrideD0
      
      * add StrideD0
      
      * add gridwise file
      
      * add d0 parameters to gridwise gemm
      
      * add c layout transformer
      
      * add d0 threadwise copy
      
      * init kernel
      
      * init kernel
      
      * regular code
      
      * put NM desc to output
      
      * kernel parameters cannot use references
      
      * host add bias+gelu
      
      * runs correctly for bias+gelu
      
      * change AddFastGelu into another file
      
      * add d1 bias parameters to interface
      
      * add d1 parameter to argument
      
      * add d1 parameter to gridwise
      
      * first complete code, not yet verified
      
      * change gelu to relu; fix GetElementSpaceSize bug
      
      * add instance
      
      * start add to ckprofiler
      
      * ckprofiler finish code
      
      * change input parameter for ckProfiler
      
      * fix host bias+gelu bug
      
      * show help for ckProfiler
      
      * fix bug where launched kernel ignores parameters
      
      * add padding and fix related bug
      
      * multiple d0
      
      * add dynamic d0_element_op (see the sketch after this list)
      
      * change profiler and instance to multiple d0
      
      * example has 2 d0
      
      * remove some unused comments
      
      * give the 2 d0 their own parameters
      
      * change d element_op name
      
      * change class name(multiple_d)
      
      * fix bug
      
      * fix file-not-found bug
      
      * update profiler
      
      * refactor
      
      * update profiler
      
      * clean
      
      * revert example change
      
      * add gon layout
      
      * optimize parameter for gno
      
      * add gon to gemm+gemm
      
      * change help message for input parameters
      
      * change to GemmPadder_v2
      
      * using ForEach
      
      * fix gb_per_sec
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      Co-authored-by: ltqin <letaoqin@amd.com>
      370efa6c
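      A minimal host-side sketch of the "multiple D" idea this PR builds on: after the GEMM produces C, extra tensors D0/D1 are fused in by a small element-wise functor. AddBiasRelu and run_epilogue are illustrative names under assumed semantics, not the PR's actual ops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative epilogue functor: e = relu(c + d0 + d1); the PR's d0/d1 ops
// are configurable in the same spirit.
struct AddBiasRelu
{
    void operator()(float& e, float c, float d0, float d1) const
    {
        e = std::max(c + d0 + d1, 0.0f);
    }
};

// Host reference of the fused epilogue over flat tensors: E = op(C, D0, D1).
template <typename ElementOp>
void run_epilogue(std::vector<float>& e, const std::vector<float>& c,
                  const std::vector<float>& d0, const std::vector<float>& d1,
                  ElementOp op)
{
    assert(c.size() == e.size() && d0.size() == e.size() && d1.size() == e.size());
    for(std::size_t i = 0; i < e.size(); ++i)
        op(e[i], c[i], d0[i], d1[i]);
}

int main()
{
    std::vector<float> c{1.0f, -3.0f}, d0{0.5f, 0.5f}, d1{1.0f, 1.0f}, e(2);
    run_epilogue(e, c, d0, d1, AddBiasRelu{});
    return (e[0] == 2.5f && e[1] == 0.0f) ? 0 : 1;
}
```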
  5. 09 Sep, 2022 1 commit
  6. 06 Sep, 2022 2 commits
  7. 31 Aug, 2022 2 commits
    • Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa
      Po Yen Chen authored
      
      
      * Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
      
      * Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
      
      * Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
      
      * Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
      
      * Add 'TransformConvFwdToGemm<>' utility class (from Chao)
      
      * Use 'TransformConvFwdToGemm<>' to shorten code (see the sketch after this list)
      
      * Fix ill-formed method declaration
      
      * Re-implement MakeRGridDescriptor_M() function
      
      * Change problem description
      
      * Use macro to define layout types
      
      * Define K-reduced output tensor layout types
      
      * Let user decide R output tensor layout
      
      * Rename variables
      
      * Add padding to the reduced output tensor if necessary
      
      * Extract common code as helper method
      
      * Remove debug message
      
      * Add missing include directive
      
      * Add partial fp16 Conv + Reduction example
      
      * Add example verification code for 2D Conv problem
      
      * Use type alias to simplify code
      
      * Share code across different-dimension Conv problems
      
      * Rename file/functions from run_conv_fwd* to run_convnd_fwd*
      
      * Make example code more verbose
      
      * Add code to support 1D & 3D Conv + Reduction on host
      
      * Add more examples for data type: bf16, fp32
      
      * Add example for int8
      
      * Add custom target to group examples
      
      * Use more general custom target name
      
      * Change the description in error message
      
      * Disable testing for examples other than fp32
      
      * Add example for int4 (just copy from int8)
      
      * Fix wrong data type
      
      * Use larger data type for intermediate tensors
      
      * Finish int4 example
      
      * Undefine macro PP_DEFINE_LAYOUT_TYPE() after use
      
      * Use named variables to replace magic numbers
      
      * Remove debug messages
      
      * Use same A/B data type for host Conv in int4 example
      
      * Add check for the 'RLayout' type argument
      
      * Group same-dim-layouts together in 'LayoutSetting<>'
      
      * Add 'final' specifier to utility classes
      
      * Use different initialization method for examples
      
      * Remove macro PP_DEFINE_LAYOUT_TYPE()
      
      * Fix code-comment mismatch
      
      * Use more reasonable initialization value for all data types
      
      * Default use init_method=1 for all examples
      
      * Remove never-used code
      
      * Remove confusing out-of-date comments
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Chao Liu <lc.roy86@gmail.com>
      46a675aa
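      For reference, the dimension bookkeeping behind a TransformConvFwdToGemm-style utility: a 2D NHWC forward convolution viewed as an implicit GEMM. The struct and function names below are illustrative; only the M/N/K mapping is the point.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical problem descriptor for a 2D NHWC forward convolution.
struct ConvFwdProblem
{
    int64_t N, K, C;   // batch, output channels, input channels
    int64_t Ho, Wo;    // output spatial sizes
    int64_t Y, X;      // filter spatial sizes
};

struct GemmShape
{
    int64_t M, N, K;
};

// out[N, Ho, Wo, K] = sum over (Y, X, C) of in * wei
// => GEMM with M = N*Ho*Wo rows, N = K columns, K = Y*X*C reduction length.
constexpr GemmShape conv_fwd_as_gemm(const ConvFwdProblem& p)
{
    return {p.N * p.Ho * p.Wo, p.K, p.Y * p.X * p.C};
}

int main()
{
    const GemmShape g = conv_fwd_as_gemm({/*N=*/4, /*K=*/64, /*C=*/32,
                                          /*Ho=*/28, /*Wo=*/28, /*Y=*/3, /*X=*/3});
    std::cout << "GemmM=" << g.M << " GemmN=" << g.N << " GemmK=" << g.K << '\n';
}
```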
    • conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
      Chao Liu authored
      * refactor conv
      
      * add conv+conv example, 1x1 only
      4df6d93f
  8. 30 Aug, 2022 1 commit
    • Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * Gemm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them because of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      - All ReductionAccData and R<N>DataType have to be F32. The INT32 data
      type is giving wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      d00e6115
  9. 29 Aug, 2022 1 commit
  10. 24 Aug, 2022 2 commits
  11. 23 Aug, 2022 2 commits
    • Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
      * Re-use pre-defined alias in int4 examples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations (see the sketch after this list)
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compile sources with wrong setting
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid over-generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common code
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common code together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
      * Handle the case where user specifies all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
      * Convert Tensor<> values after transfer data to host
      fa2d894b
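      A sketch of the conversion idea between Tensor<> specializations: a templated constructor converts element-by-element, so for example an int4 result tensor can be converted and compared against an int8 golden tensor. This minimal Tensor is a stand-in for ck's Tensor<>, not its actual definition.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal stand-in for ck's Tensor<>: just enough to show the conversion
// constructor copying another specialization element-by-element.
template <typename T>
struct Tensor
{
    std::vector<T> mData;

    Tensor() = default;

    // Conversion ctor: element-wise convert from a tensor of another type,
    // e.g. Tensor<int4_t> -> Tensor<int8_t> for comparison against a golden
    // int8 tensor.
    template <typename U>
    explicit Tensor(const Tensor<U>& other)
    {
        mData.resize(other.mData.size());
        std::transform(other.mData.begin(), other.mData.end(), mData.begin(),
                       [](U v) { return static_cast<T>(v); });
    }
};

int main()
{
    Tensor<float> a;
    a.mData = {1.5f, -2.25f, 3.0f};

    Tensor<int8_t> b{a}; // element-wise float -> int8 conversion
    return b.mData == std::vector<int8_t>{1, -2, 3} ? 0 : 1;
}
```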
    • Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compile with wrong setting
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af71e1f982a4ca7406401e2fc9b1f086b32.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
      2327f1a6
  12. 15 Aug, 2022 1 commit
    • Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320) · 53ea4713
      Qianfeng authored
      * Implement multiple-reduction in one kernel (kernels, device ops, examples)
      
      * Add generic elementwise kernel and device interface
      
      * Add generator for normal-distributed data initialization
      
      * Add host reference implementation of batchnorm-forward and batchnorm-infer (see the sketch after this list)
      
      * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
      
      * Remove un-needed include in batchnorm example
      
      * Renaming generic_elementwise to elementwise in kernel and device classes/functions
      
      * Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
      
      * Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
      
      * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
      
      * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
      
      * Removing and renaming files
      
      * Update to synchronize gemm_layernorm client example with the generic element-wise device op API
      
      * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
      
      * Merge two static member functions in device_elementwise.hpp
      
      * Remove unary_elementwise_1d kernel and device
      53ea4713
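      A hedged host-side sketch of what batchnorm-infer computes: a per-channel affine transform using precomputed mean/variance. Function and parameter names are illustrative, not ck's exact API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// y = gamma * (x - mean) / sqrt(var + eps) + beta, per channel, over a
// flattened (rows, channels) view of an NHWC-like tensor.
void host_batchnorm_infer(std::vector<float>& y, const std::vector<float>& x,
                          const std::vector<float>& mean, const std::vector<float>& variance,
                          const std::vector<float>& gamma, const std::vector<float>& beta,
                          std::size_t num_rows, std::size_t num_channels,
                          float epsilon = 1e-5f)
{
    for(std::size_t r = 0; r < num_rows; ++r)
        for(std::size_t c = 0; c < num_channels; ++c)
        {
            const float inv_std = 1.0f / std::sqrt(variance[c] + epsilon);
            y[r * num_channels + c] =
                gamma[c] * (x[r * num_channels + c] - mean[c]) * inv_std + beta[c];
        }
}

int main()
{
    std::vector<float> x(4, 2.0f), y(4);
    host_batchnorm_infer(y, x, {1.0f, 1.0f}, {1.0f, 1.0f}, {1.0f, 1.0f},
                         {0.0f, 0.0f}, /*num_rows=*/2, /*num_channels=*/2);
}
```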
  13. 13 Aug, 2022 3 commits
    • Change all device operations to use add_instance_library (#338) · fb1cbf02
      cloudhan authored
      
      
      * Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
      
      * update DeviceMem
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      fb1cbf02
    • Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      c20a75b0
    • Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two dimensions only
      
      * make C descriptors contain only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring a unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwise softmax v1
      
      * iteratively update softmax+gemm (see the sketch after this list)
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: ltqin <letao.qin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      cac014f1
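      The "iteratively update softmax+gemm" step amounts to an online softmax. A hedged host sketch of the rescaling idea follows; it is illustrative of the math, not of the kernel's data layout. In the fused kernel only the running GEMM accumulator is rescaled; rescaling the whole prefix here keeps the host sketch simple.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Walk a row tile-by-tile, keeping a running max m and running sum l of
// exp(x - m); rescale earlier partials whenever a later tile raises the max.
void online_softmax_row(const std::vector<float>& row, std::size_t tile,
                        std::vector<float>& out)
{
    float m = -std::numeric_limits<float>::infinity(); // running row max
    float l = 0.0f;                                    // running sum of exp(x - m)
    out.assign(row.size(), 0.0f);

    for(std::size_t start = 0; start < row.size(); start += tile)
    {
        const std::size_t end = std::min(start + tile, row.size());
        const float m_tile = *std::max_element(row.begin() + start, row.begin() + end);
        const float m_new  = std::max(m, m_tile);

        const float scale = std::exp(m - m_new); // rescale previous partials
        l *= scale;
        for(std::size_t i = 0; i < start; ++i)
            out[i] *= scale;

        for(std::size_t i = start; i < end; ++i)
        {
            out[i] = std::exp(row[i] - m_new);
            l += out[i];
        }
        m = m_new;
    }

    for(float& v : out)
        v /= l; // final normalization
}

int main()
{
    std::vector<float> out;
    online_softmax_row({1.0f, 2.0f, 3.0f, 4.0f}, /*tile=*/2, out);
    float sum = 0.0f;
    for(float v : out)
        sum += v;
    return std::fabs(sum - 1.0f) < 1e-6f ? 0 : 1;
}
```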
  14. 12 Aug, 2022 1 commit
  15. 11 Aug, 2022 1 commit
    • Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340) · 68b61504
      Po Yen Chen authored
      * Add always_false<> util to delay symbol resolution
      
      * Use always_false<> to prevent trying instantiate unwanted method
      
      * Add new specializations of AddAddFastGelu::operator() method
      
      * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
      
      * Use floating point literal to simplify code
      
      * Remove unnecessary capture in lambda expressions
      
      * Extract fast GeLU calculation as standalone method
      
      * Mark methods as 'constexpr'
      
      * Add constraint for HostTensorDescriptor templated ctors
      
      * Simplify HostTensorDescriptor ctor calls
      
      * Add C++23 std::size_t literal suffix
      
      * Use _uz suffix to shorten example code (see the sketch after this list)
      
      * Remove unnecessary conversion to std::array<>
      
      * Re-order include directives
      
      * Remove C-style casting by literal suffix
      
      * Remove unnecessary statements in main()
      
      * Remove unused type parameter of always_false<>
      
      * Remove unused include directive
      
      * Exit main() by returning meaningful value
      
      * Use 'if constexpr' to switch example flow
      
      * Use std::is_same_v<> to shorten example code
      
      * Add 'inline' specifier to literal functions
      
      * Unify output methods in example
      
      * Move common code into .inc file
      
      * Add type check in type_convert<>()
      
      * Add type_convert<float>() before computation
      
      * Merge AddAddFastGelu method specializations
      
      * Remove always_false<>
      
      * Add constraint to AddAddFastGelu::operator() parameter types
      68b61504
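      The _uz suffix mentioned above mirrors C++23's built-in std::size_t literal. A minimal sketch of how such a literal can be defined pre-C++23 (the repo's exact definition may differ):

```cpp
#include <cstddef>
#include <vector>

// User-defined literal letting example code write 0_uz instead of
// static_cast<std::size_t>(0) in loops over container sizes.
constexpr std::size_t operator""_uz(unsigned long long v) noexcept
{
    return static_cast<std::size_t>(v);
}

int main()
{
    std::vector<int> v{1, 2, 3};
    int sum = 0;
    for(auto i = 0_uz; i < v.size(); ++i) // no signed/unsigned comparison warning
        sum += v[i];
    return sum == 6 ? 0 : 1;
}
```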
  16. 07 Aug, 2022 1 commit
  17. 02 Aug, 2022 1 commit
    • CGEMM examples bf16, fp32, int8 (#332) · fb0dc358
      Adam Osewski authored
      
      
      * Add int8 specialization for elementwise Add and Subtract.
      
      * CGEMM examples bf16, fp32, int8
      
      * Add conversion of reference output to CDataType.
      
      * Skip BF16 data type during testing.
      
      * Lower K value to get rid of accumulation error.
      
      * Fix merge artifact.
      
      * Fix changed function name: GetElementSpaceSize()
      
      * Fix merge artifact.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      fb0dc358
  18. 29 Jul, 2022 1 commit
    • Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
      500fa995
  19. 21 Jul, 2022 2 commits
    • Add full QA with verification option, few other changes. (#331) · d8415a96
      Illia Silin authored
      * add verify flag and update scripts
      
      * replace old check_error function with the new check_err
      
      * fix syntax
      
      * remove blank spaces
      
      * remove empty line
      
      * add check_err for tensors
      
      * fix syntax
      
      * replace tensors with vectors in check_err calls
      
      * fix syntax
      
      * remove blank spaces
      
      * fix syntax
      
      * add new line at end of file
      
      * disable conv2d_bwd_weight test, add gpu check
      
      * set check_gpu using export
      
      * check GPU using runShell
      
      * add definition of runShell
      
      * fix script syntax
      
      * reduce the number of threads, add full qa option
      
      * run processing scripts in bash
      
      * fix the branch and host names in performance scripts, add chronos
      
      * replace parameterizedCron with cron
      
      * archive the perf log files
      
      * try to fix git call
      
      * pass branch and host names as arguments into scripts
      
      * fix script arguments
      
      * fix script arguments
      
      * process results on master
      
      * fix pipeline
      
      * add definition of gpu_arch
      
      * run processing scripts in docker
      
      * fix the brackets
      
      * add agent master for the processing stage
      
      * get rid of show_node_info call on master
      
      * try using mici label instead of master, disable MI100 tests for now
      
      * fix syntax
      
      * simplify container for results processing
      
      * remove node(master) from the process_results stage
      
      * put all stages in original order
      
      * change the agent label from master to mici for gfx908
      d8415a96
    • Grouped Gemm device with multiD grid (#319) · 7959dad5
      zjing14 authored
      
      
      * replace gridwise_v2r3 with multiD
      
      * adjust parameters
      
      * add instances
      
      * fixed test_grouped_gemm
      
      * fix standalone softmax race condition around blockwise reduction
      
      * fixed ci
      
      * fixed comment: remove redundant workspace
      
      * use instanceFactory
      
      * add test layout
      
      * add empty Ds
      
      * add bias example
      
      * use array
      
      * separate examples
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      7959dad5
  20. 13 Jul, 2022 1 commit
    • Standalone layernorm (#315) · 7f216620
      rocking5566 authored
      
      
      * Implement layernorm kernel and deviceOp
      
      * verify gpu kernel with host code
      
      * 1. Separate gamma and beta from affine
      2. Check if argument is valid
      
      * clean
      
      * Sync the naming
      
      * Support sweep once mode if we can put k dimension data inside one block
      
      * [What] Get length from upper length.
      [Why] if we get length directly, we may get length after padding.
      
      * We only use one block in K dimension.
      Hence, we can simplify the indexing of global R/W.
      
      * Use 1d descriptor for gamma and beta
      
      * Add accElementwiseOp
      
      * Extract layernorm host code (see the sketch after this list)
      
      * Support different YVectorDim in GridwiseLayernorm
      
      * Rename XSrcVectorDim to XYSrcVectorDim, because we use the same parameter in deviceOp
      
      * Gamma and beta can share the VGPR.
      
      * Add test for fp32 and fp16
      
      * Fix bug of concurrency and add test case which may fail originally
      
      * Propagate NaN for layernorm
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      7f216620
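      A hedged host-side sketch of the layernorm being verified here, with gamma and beta kept as separate 1-D tensors (matching the "separate gamma and beta" step). Names are illustrative, not ck's exact API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// y = gamma * (x - mean) / sqrt(var + eps) + beta, row-wise over an (M, K)
// tensor, with gamma/beta as 1-D tensors of length K.
void host_layernorm(std::vector<float>& y, const std::vector<float>& x,
                    const std::vector<float>& gamma, const std::vector<float>& beta,
                    std::size_t M, std::size_t K, float epsilon = 1e-4f)
{
    for(std::size_t m = 0; m < M; ++m)
    {
        float mean = 0.0f;
        for(std::size_t k = 0; k < K; ++k)
            mean += x[m * K + k];
        mean /= static_cast<float>(K);

        float var = 0.0f;
        for(std::size_t k = 0; k < K; ++k)
        {
            const float d = x[m * K + k] - mean;
            var += d * d;
        }
        var /= static_cast<float>(K);

        const float inv_std = 1.0f / std::sqrt(var + epsilon);
        for(std::size_t k = 0; k < K; ++k)
            y[m * K + k] = gamma[k] * (x[m * K + k] - mean) * inv_std + beta[k];
    }
}

int main()
{
    std::vector<float> x{1.0f, 2.0f, 3.0f, 4.0f}, y(4);
    host_layernorm(y, x, {1.0f, 1.0f}, {0.0f, 0.0f}, /*M=*/2, /*K=*/2);
}
```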
  21. 07 Jul, 2022 1 commit
    • N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393
      Chao Liu authored
      * adding contraction
      
      * add contraction example
      
      * update example
      
      * update example
      
      * format
      
      * update readme
      
      * clean header
      
      * clean header
      
      * contraction with multiple D
      
      * rename
      
      * fix naming issue; add instances for contraction+bilinear
      
      * change assumed virtual layout of contraction; add client example
      
      * update example
      
      * update
      
      * contraction+scale
      
      * use type_convert
      
      * rename
      4fe9c393
  22. 02 Jul, 2022 1 commit
    • Gemm+Bilinear (#316) · 9e4429f9
      Chao Liu authored
      * refactor
      
      * update example
      
      * update example
      
      * gemm bilinear
      
      * clean
      
      * update
      9e4429f9
  23. 01 Jul, 2022 2 commits
    • Single-kernel GEMM + layernorm (#263) · 63fd5da6
      Anthony Chang authored
      
      
      * dump lds content in appropriate precision type
      
      * add squared add reduction op; allows sq sum (see the sketch after this list)
      
      * initial stub from regular gemm impl
      
      * layernorm example code & host verification
      
      * initial layernorm implementation
      
      * tidy up
      
      * make C0 precision type consistent with C
      
      * clang-tidy and additional comments
      
      * tighten up example code
      
      * account for extra flops/bytes from normalization
      
      * clang-format
      
      * c0 bias/beta/gamma now have their own precision type
      
      * AccElemOp for gemm outputs prior to feeding to layernorm
      
      * update workgroup mapping
      
      * rename kernel template param to reflect its dual use
      
      * use LDS mem pool for reduction workspace
      
      * change cshuffle precision type to f16; clean up
      
      * clang-format
      
      * correct naming
      
      * explicit cast
      
      * fully implemented gemm + bias + activation + add + norm
      
      * activation in correct order
      
      * reflect reduction API's recent change
      
      * amend
      
      * clean up; add comment
      
      * keep up with recent changes in reduction API
      
      * format
      
      * resolve merge conflicts
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      63fd5da6
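      A rough sketch of the "squared add" reduction op referenced above and why it is useful: together with a plain sum it yields mean and E[x^2] in one pass, from which variance = E[x^2] - E[x]^2 follows for the fused layernorm. Names are illustrative, not the PR's exact functors.

```cpp
#include <cstddef>
#include <vector>

struct Add
{
    void operator()(float& acc, float v) const { acc += v; }
};

struct SquaredAdd
{
    void operator()(float& acc, float v) const { acc += v * v; }
};

template <typename Op>
float reduce(const std::vector<float>& x, Op op)
{
    float acc = 0.0f;
    for(float v : x)
        op(acc, v);
    return acc;
}

int main()
{
    std::vector<float> x{1.0f, 2.0f, 3.0f};
    const float n       = static_cast<float>(x.size());
    const float mean    = reduce(x, Add{}) / n;
    const float mean_sq = reduce(x, SquaredAdd{}) / n;
    const float var     = mean_sq - mean * mean; // what layernorm needs
    return var > 0.0f ? 0 : 1;
}
```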
    • Improve external interface for GEMM and GEMM+add+add+fastgelu (#311) · 0dcb3496
      Chao Liu authored
      * interface for GEMM and GEMM+add+add+fastgelu
      
      * rename namespace
      
      * instance factory
      
      * fix build
      
      * fix build; add GEMM client example
      
      * clean
      0dcb3496
  24. 30 Jun, 2022 1 commit
    • Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d
      Anthony Chang authored
      * use 'sweep once' softmax kernel where applicable
      
      * threadwise copy's dst buffer can specify invalid element value
      
      * add int8 in/out float compute softmax support
      
      give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
      
      * format
      
      * softmax inherits DeviceNormalization
      
      * softmax profiler stub
      
      * tighten up reference softmax interface
      
      * example prints tensor dimension
      
      * add fp32 to softmax profiler
      
      * rename header
      
      * hook with ckProfiler
      
      * format
      
      * resolve merge conflict
      
      * resolve merge conflicts
      
      * update normalization profiler help string
      
      * resolve conflict
      
      * typo
      
      * remove residual
      
      * softmax profiler: address feedback
      
      * test for mixed precision input/output
      
      * fully qualify ck::math::isnan
      
      * add comment for device normalization interface
      
      * revise wording
      
      * constness for alpha/beta scaler pointer
      93c99f3d
  25. 27 Jun, 2022 2 commits
    • external api for gemm + layernorm (#285) · 12235112
      rocking5566 authored
      * Extract base class for elementwise
      
      * Refactor interface of DeviceGemmReduce. Do not use tuple in interface
      
      * [What] Rename d into reduce in gemm + reduction related code
      [Why] Prepare to add d term for add
      
      * Unify base class of gemm + reduce and gemm + bias + add + reduce
      
      * 1. Rename gemm_bias_add_reduce for external api
       2. Refine cmake
      
      * Add normalize device operation
      
      * [What] Reorder the argument
      [Why] Because d0 is also the input of c.
      
      * Add type string
      
      * Add example of gemm_bias_add_layernorm  via external api
      
      * Refactor example code
      
      * clang-format
      
      * Fix compile error
      
      * clang-format
      
      * Add external api for gemm_add_add_layernorm and normalize
      
      * Add client example
      
      * clang-format
      12235112
    • External Interface (#304) · aebd211c
      Chao Liu authored
      * add client example
      
      * clean
      
      * clean
      
      * reorg
      
      * clean up profiler
      
      * reorg
      
      * clean
      
      * fix profiler
      
      * function for getinstances
      
      * update client example
      
      * update client example
      
      * update client example
      
      * update
      
      * update example
      
      * update Jenkins file
      
      * update cmake
      
      * update Jenkins
      aebd211c
  26. 25 Jun, 2022 2 commits
    • add license in file (#303) · d3051d75
      Chao Liu authored
      d3051d75
    • Absolute include path (#281) · d1db6a0c
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add client app example
      
      * update readme
      
      * delete obsolete files
      
      * remove old client app
      
      * delete old file
      
      * cleaning
      
      * clean
      
      * remove half
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path for all examples
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * revert client app example
      
      * clean build
      
      * fix build
      
      * temporary disable client test on Jenkins
      
      * clean
      
      * clean
      
      * clean
      d1db6a0c
  27. 23 Jun, 2022 1 commit
    • Testing all fwd convolution specializations. (#259) · a2edd7d8
      Adam Osewski authored
      
      
      * UniformFill with integer values (see the sketch after this list).
      
      * Log tested instance type string.
      
      * Add UT for all convolution specializations.
      
      * debugging conv
      
      * Fix dangling reference bug.
      
      * Small refinements.
      
      * Fix call to error checking function.
      
      * Small refinements to tests.
      
      * Configure error tolerance
      * Change problem size.
      * Remove OddC case from types that do not support it.
      
      * Add helper traits for AccumulatorDataType.
      
      * Print first 5 errs in check_err for integral types.
      
      * Rename FillUniform to FillUniformDistribution
      
      * Refactor
      
      * Do not use typed tests.
      * Instead use plain fixture class with templatized member functions.
      * Initialize tensors with integer values.
      
      * Refine test instances.
      
      * Properly set accumulator data type.
      * Add another "big" instance.
      
      * Refactor convolution tests.
      
      * Revert "debugging conv"
      
      This reverts commit b109516455631ff8fd6dce99cf7c14bf8e323ebb.
      
      * Add pragma once + format + small refinement.
      
      * Fix some unwanted changes.
      
      * Clang-format
      
      * Fix profile_convnd to use renamed tensor initializer.
      
      * Add instances for ConvFWDND kernel case 2D
      
      * Helpers to get ConvNDFwd 2D instances.
      
      * Refactoring.
      
      * Remove "small block" instance as it was generating compiler errors.
      * Remove default template parameters values.
      
      * Refine and fix test.
      
      * Fix problem with default template parameter types.
      * Adjust error thresholds for floating point values test.
      * Use integer values initialization for instances test.
      * Add tests for ConvNDFwd 2D case.
      
      * Remove AccumulatorDataType type trait.
      
      * Update unit-tests.
      
      * Remove operator<< overload.
      
      * Unlock conv1d/3d nd fwd instances.
      
      * Enable skipping calculating reference using flag.
      
      * Fix number of channels for first ResNet50 layer.
      
      * Clang-format.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      a2edd7d8
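      A hedged sketch of a FillUniformDistribution-style initializer with a caller-chosen range, and of why integer-valued initialization matters: small integer-valued numbers accumulate exactly in floating point, which keeps error thresholds tight. Member names and the fixed seed are illustrative, not the repo's exact code.

```cpp
#include <random>
#include <vector>

template <typename T>
struct FillUniformDistribution
{
    float a_ = -5.0f;
    float b_ = 5.0f;

    template <typename ForwardIter>
    void operator()(ForwardIter first, ForwardIter last) const
    {
        std::mt19937 gen(11939); // fixed seed for reproducible tests
        std::uniform_int_distribution<int> dis(static_cast<int>(a_),
                                               static_cast<int>(b_));
        for(; first != last; ++first)
            *first = static_cast<T>(dis(gen)); // integer-valued floats
    }
};

int main()
{
    std::vector<float> data(64);
    FillUniformDistribution<float>{0.0f, 4.0f}(data.begin(), data.end());
}
```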
  28. 21 Jun, 2022 1 commit
    • Standalone softmax kernel (#284) · 15c89e81
      Anthony Chang authored
      * initial stub for standalone softmax
      
      * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
      
      * host softmax validates
      
      * compiles; to implement beta scaling
      
      * use NaN trick to efficiently ignore OOB values during sum of exponentials (see the sketch after this list)
      
      * freeload device_reduce's utility functions
      
      * clean up interface
      
      * adding prior value (beta scaling)
      
      * remove restriction related to perf considerations
      
      * apply clang-format
      
      * clean; disable diagnostics
      
      * resolve conflicts
      
      * add exp wrapper
      
      * honor HostTensorDesc interface; allow implicit cast from different vector<T> type
      
      * test softmax for fp16/fp32
      
      * update readme
      
      * amend commit NaN trick
      
      * remove redundant param added during development
      
      * format
      
      * replace ScalarDataType with AccDataType
      
      * separate out test programs by precision type
      
      * move softmax sample code to its own folder
      
      * format
      
      * keep up with recent changes in reduction API
      
      * remove extra header
      15c89e81
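      A hedged host sketch of the computation, including the prior-value (beta) scaling and the padding idea behind the OOB trick above: the kernel gives out-of-bounds reads an invalid element value; filling padding lanes with -infinity achieves the same effect on the host, since they contribute exp(-inf) == 0 to the sum of exponentials. Illustrative only, not ck's exact interface.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// y = alpha * softmax(x) + beta * y over one row; lanes past valid_len are
// padding and must not affect the result.
void host_softmax_row(std::vector<float>& y, std::vector<float> x,
                      std::size_t valid_len, float alpha, float beta)
{
    for(std::size_t i = valid_len; i < x.size(); ++i)
        x[i] = -std::numeric_limits<float>::infinity();

    float max_val = -std::numeric_limits<float>::infinity();
    for(float v : x)
        max_val = std::fmax(max_val, v);

    float sum = 0.0f;
    for(float v : x)
        sum += std::exp(v - max_val); // padded lanes add exp(-inf) == 0

    for(std::size_t i = 0; i < valid_len; ++i)
        y[i] = alpha * std::exp(x[i] - max_val) / sum + beta * y[i];
}

int main()
{
    std::vector<float> y(4, 0.0f);
    host_softmax_row(y, {1.0f, 2.0f, 3.0f, 0.0f}, /*valid_len=*/3,
                     /*alpha=*/1.0f, /*beta=*/0.0f);
}
```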
  29. 19 Jun, 2022 1 commit
    • GEMM with Multiple Source, GEMM+Bias+Add+FastGeLU example and ckProfiler (#241) · 56adf7e9
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU (see the sketch after this list)
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add comment
      
      * use type_convert
      
      * clean
      
      * clean element wise op
      56adf7e9
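      For reference, a hedged sketch of the tanh-based fast-GeLU approximation these fused ops are built around, and how an AddAddFastGelu-style epilogue composes it; the repo's exact formulation and functor signatures may differ slightly.

```cpp
#include <cmath>

// Fast GeLU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
struct FastGelu
{
    float operator()(float x) const
    {
        const float c = 0.7978845608f; // sqrt(2/pi)
        const float u = c * (x + 0.044715f * x * x * x);
        return 0.5f * x * (1.0f + std::tanh(u));
    }
};

// The fused GEMM+Bias+Add+FastGeLU epilogue feeds c + d0 + d1 into it.
struct AddAddFastGelu
{
    void operator()(float& e, float c, float d0, float d1) const
    {
        e = FastGelu{}(c + d0 + d1);
    }
};

int main()
{
    float e = 0.0f;
    AddAddFastGelu{}(e, 1.0f, 0.25f, 0.25f); // e = FastGelu(1.5f)
    return e > 0.0f ? 0 : 1;
}
```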