1. 08 Sep, 2022 1 commit
  2. 07 Sep, 2022 1 commit
  3. 06 Sep, 2022 3 commits
    • Anthony Chang's avatar
      Fused attention instances & padding tests (#395) · 868e5c55
      Anthony Chang authored
      * modify comment
      
      * trim unnecessary check
      
      * add gemm spec in kernel name
      
      * add TNTT gemm_gemm + atten kernel instances
      
      * refactor attention padding to better fit in unit tests
      
      This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
      Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
      
      * add adhoc padding test for atten
      
      * shrink input value range for attention kernel validation to avoid occasional error by 1e-3
      
      Still unsure whether this kind of deterministic floating point accurary issue is expected
      or not. May want to try exact same approach as the GPU kernel in the host reference
      GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
      shrink the input value range as it is less likely to produce errors of around ~1e-3.
      
      * attention kernel proper granular padding for all 4 dims
      
      * IsSupportedArgument checks
      
      * test more padded cases
      
      * block PadK specialization in attention kernels
      
      * workaround clang crash for gfx908
      
      (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
      error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
      VGPR_32: Cannot scavenge register without an emergency spill slot!"
      this fall back to less ideal way of handle NPadding in fused attention kernel
      
      * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
      868e5c55
    • Anthony Chang's avatar
      GemmGemm TNNT instances (#399) · fe52c94c
      Anthony Chang authored
      * add gemm_gemm TNNT instance
      
      * sanitize Gemm1KPack
      
      * disable instances that failed validation on mi100
      fe52c94c
    • Adam Osewski's avatar
      Softmax client example (#396) · 3da5c19e
      Adam Osewski authored
      
      
      * Update Softmax device operation interface.
      
      * Update ckProfiler.
      
      * Update Softmax UT.
      
      * Update example.
      
      * Client example.
      
      * Clang format
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      3da5c19e
  4. 02 Sep, 2022 1 commit
    • zjing14's avatar
      [Hotfix] SplitK Gemm fp32 (#401) · 75891161
      zjing14 authored
      * add scripts
      
      * fixed splitK_gemm_fp32
      
      * clean
      
      * clean
      
      * use gemm_xdl_splitK_c_shuffle into profiler
      
      * remove device_gemm_xdl_splitk.hpp
      75891161
  5. 01 Sep, 2022 1 commit
  6. 31 Aug, 2022 2 commits
    • Po Yen Chen's avatar
      Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa
      Po Yen Chen authored
      
      
      * Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
      
      * Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
      
      * Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
      
      * Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
      
      * Add 'TransformConvFwdToGemm<>' utility class (from Chao)
      
      * Use 'TransformConvFwdToGemm<>' to shorten code
      
      * Fix ill-formed method declaration
      
      * Re-implement MakeRGridDescriptor_M() function
      
      * Change problem description
      
      * Use macro to define layout types
      
      * Define K-reduced output tensor layout types
      
      * Let user to decide R output tensor layout
      
      * Rename variables
      
      * Add padding to the reduced output tensor if necessary
      
      * Extract common code as helper method
      
      * Remove debug message
      
      * Add missing include directive
      
      * Add partial fp16 Conv + Reduction example
      
      * Add example verification code for 2D Conv problem
      
      * Use type alias to simplify code
      
      * Share code across different-dimension Conv problems
      
      * Rename file/functions from run_conv_fwd* to run_convnd_fwd*
      
      * Make example code more verbose
      
      * Add code to support 1D & 3D Conv + Reduction on host
      
      * Add more examples for data type: bf16, fp32
      
      * Add example for int8
      
      * Add custom target to group examples
      
      * Use more general custom target name
      
      * Change the description in error message
      
      * Disable testing for example other than fp32
      
      * Add examplel for int4 (just copy from int8)
      
      * Fix wrong data type
      
      * Use larger data type for intermediate tensors
      
      * Finish int4 example
      
      * Undefine macro PP_DEFINE_LAYOUT_TYPE() after use
      
      * Use named variables to replace magic numbers
      
      * Remove debug messages
      
      * Use same A/B data type for host Conv in int4 example
      
      * Add check for the 'RLayout' type argument
      
      * Group same-dim-layouts together in 'LayoutSetting<>'
      
      * Add 'final' specifier to utility classes
      
      * Use different initialization method for examples
      
      * Remove macro PP_DEFINE_LAYOUT_TYPE()
      
      * Fix code-comment mismatch
      
      * Use more reasonable initialization value for all data types
      
      * Default use init_method=1 for all examples
      
      * Remove never-used code
      
      * Remove confusing out-of-date comments
      
      * clean
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      46a675aa
    • Chao Liu's avatar
      conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
      Chao Liu authored
      * refactor conv
      
      * add conv+conv example, 1x1 only
      4df6d93f
  7. 30 Aug, 2022 2 commits
    • Adam Osewski's avatar
      Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * GEmm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them cause of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      -All ReductionAccData and R<N>DataType have to be F32. The INT32 data
      type is giving wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      d00e6115
    • Shaojie WANG's avatar
      Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      45adb736
  8. 29 Aug, 2022 2 commits
  9. 26 Aug, 2022 2 commits
    • Illia Silin's avatar
      Add an option to build CK with clang directly (#387) · 1e5b59df
      Illia Silin authored
      * replace hipcc compiler with clang++
      
      * build client app with hipcc
      
      * build client app with clang
      
      * add an option to build with hipcc ro clang
      
      * fix the environment for client app
      
      * fix setting up compiler in cmake_build
      
      * change the way the compiler is set
      1e5b59df
    • zjing14's avatar
      Fixed splitk gemm fp32 (#384) · 9881625b
      zjing14 authored
      * add scripts
      
      * fixed splitK_gemm_fp32
      
      * clean
      
      * clean
      9881625b
  10. 25 Aug, 2022 5 commits
  11. 24 Aug, 2022 2 commits
  12. 23 Aug, 2022 5 commits
    • Po Yen Chen's avatar
      Add examples of Gemm (data type: int4) (#367) · fa2d894b
      Po Yen Chen authored
      * Add GEMM examples for int4
      
      Currently the source files are just copied from int8 examples
      
      * Re-use pre-defined alias in int4 exmples
      
      * Distinguish user-side type from kernel-side type
      
      * Add int4_t support for check_err()
      
      * Allow conversion between Tensor<> specializations
      
      * Re-format source files
      
      * Use different type for host tensors
      
      * Re-use CopyAsType<>() to implement copy ctor
      
      * Re-use element-wise operation type alias
      
      * Fix typo in alias names
      
      * Complete the int4 examples
      
      * Add constraint to Tensor<> templated methods
      
      * Add type traits 'is_signed_integral<>'
      
      * Add type constraints for integer version check_err<>()
      
      * Allow comparing different-sized integral types in check_err()
      
      * Check converted Tensor<int4_t> with golden Tensor<int8_t>
      
      * Remove constraint of Tensor<>::CopyAsType()
      
      * Avoid compilation error while disabling ck::int4_t support
      
      * Remove debug messages
      
      * Add #error directive to prevent compile sources with wrong setting
      
      * Simplify tensor usages in examples
      
      * Add constraint to check_err() input reference type
      
      * Align design with other PR
      
      * Use ""_uz to simplify example code
      
      * Avoid too much generalizing check_err()
      
      * Re-format GEMM instance template arguments
      
      * Extract int4 example common codes
      
      * Sort include directives
      
      * Move #include directives into new header
      
      * Move common codes together
      
      * Re-format template argument in example code
      
      * Reuse same implementation code for most of GEMM examples
      
      * Re-format common.hpp
      
      * Unify structured comment in examples
      
      * Use reinterpret_cast<>() for cross-type pointer conversion
      
      * Revert "Add type traits 'is_signed_integral<>'"
      
      This reverts commit f2c148efaedf42c8ee66032dac6d13a1003b0f3a.
      
      * Allow unsigned integer arguments for check_err()
      
      * Fix compilation error in check_err()
      
      * Remove unnecessary copy ctor for Tensor<>
      
      * Mark Tensor<> special member functions as 'default'
      
      * Use more strict condition to add code in examples
      
      * Fix wrong program return value of GEMM examples
      
      * Handle the case while user specify all the strides
      
      * Fix never-ran examples
      
      * Exit successfully if GEMM instance does not support given problem
      
      * Add missing 'else' keyword
      
      * Re-format CMakeLists.txt
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit 3f190b0779ceedf7aaf0b380712fda0518de72c1.
      
      * Add conversion ctor for Tensor<>
      
      * Write Tensor<> conversion logics explicitly in example code
      
      * Convert Tensor<> values after transfer data to host
      fa2d894b
    • Anthony Chang's avatar
      Attention with output permutation (#370) · e0d8806c
      Anthony Chang authored
      * comment on specialization for TensorSpecialization::Packed
      
      * gemm_softmax_gemm with output permutation
      
      * scaling
      
      * refactor MatrixPadder; rename to GemmPadder
      
      * remove old sanity check
      
      * restore original gemm_softmax_gemm
      
      * revise comment in gemm_softmax_gemm example
      
      * use GetElementSpaceSize()
      
      * remove extra header
      
      * typo
      
      * remove archaic DeviceOpPtr
      e0d8806c
    • zjing14's avatar
      Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361) · 60914583
      zjing14 authored
      
      
      * add examples into grouped/batched_gemm
      
      * adding splitK examples
      
      * fixed splitK
      
      * add bfp16 int8 example into splitK
      
      * formatting
      
      * use static_cast
      
      * added common for batched_gemm
      
      * add commons for examples of splitK/batched/grouped_gemm
      
      * return true
      
      * adjust splitK check tol
      
      * update example
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      60914583
    • Po Yen Chen's avatar
      Add example of Gemm + AddAddFastGelu (data type: int4) (#369) · 2327f1a6
      Po Yen Chen authored
      * Add custom target to bundle examples together
      
      * Add int4 example conditionally (just copy from int8 example)
      
      * Extract common code into common.hpp
      
      * Move ref gemm type alias into data-type-specific sources
      
      * Add #error directive to prevent compile with wrong setting
      
      * Let AddAddFastGelu support int4 parameter type
      
      * Let check_err() support int4 parameter type
      
      * Add wrapper function to hide value conversion while copying memory
      
      * Finish int4 example for GEMM + AddAddFastGelu
      
      * Add new DeviceMem API to copy memory
      
      * Use new DeviceMem API to implement examples
      
      * Fix wrongly use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
      
      * Revert "Add new DeviceMem API to copy memory"
      
      This reverts commit e26e7af71e1f982a4ca7406401e2fc9b1f086b32.
      
      * Add conversion ctor for Tensor<>
      
      * Add 'const' specifier to Tensor<>::CopyAsType()
      
      * Convert Tensor<> values before/after transfer between host & device
      2327f1a6
    • Anthony Chang's avatar
      Implement padding and sanity checks for fused GEMM+GEMM (#376) · f4047c94
      Anthony Chang authored
      * GemmPadder and GemmGemmPadder
      
      * proper padding using GemmGemmPadder
      
      * test gemm_gemm padding
      
      * properly check size K in IsSupportedArgument()
      
      * properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
      
      * comment
      
      * format
      f4047c94
  13. 22 Aug, 2022 1 commit
  14. 18 Aug, 2022 2 commits
  15. 17 Aug, 2022 1 commit
  16. 15 Aug, 2022 2 commits
    • Anthony Chang's avatar
      Hotfix LDS data hazard in fused attention (#360) · c961ce92
      Anthony Chang authored
      * avoid LDS data hazard in gemm_softmax_gemm pipeline
      
      * trivial refactors
      
      * comments
      
      * shrink blockwise gemm v2 thread buffer size
      
      * reclaim A block lds space when during 2nd gemm
      
      * amend
      
      * amend
      c961ce92
    • Qianfeng's avatar
      Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320) · 53ea4713
      Qianfeng authored
      * Implement multiple-reduction in one kernel (kernels, device ops, examples)
      
      * Add generic elementwise kernel and device interface
      
      * Add generator for normal-distributed data initialization
      
      * Add host refer implementation of batchnorm-forward and batchnorm-infer
      
      * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
      
      * Remove un-needed including in batchnorm example
      
      * Renaming generic_elementwise to elementiwise in kernel and device classes/functions
      
      * Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
      
      * Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
      
      * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
      
      * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
      
      * Removing and renaming files
      
      * Update to synchronize gemm_layernorm client example to the generic element-wise device op API
      
      * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
      
      * Merge two static member functions in device_elementwise.hpp
      
      * Remove unary_elementwise_1d kernel and device
      53ea4713
  17. 13 Aug, 2022 7 commits
    • Chao Liu's avatar
      fix build issue (#357) · 5ee30459
      Chao Liu authored
      * fix build
      
      * excludeexample_gemm_max_xdl_fp16 from testing due to random failure on gfx908
      5ee30459
    • cloudhan's avatar
      Change all device operations to use add_instance_library (#338) · fb1cbf02
      cloudhan authored
      
      
      * Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
      
      * update DeviceMem
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      fb1cbf02
    • rocking5566's avatar
      Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix buf when ThreadSliceK > 1
      
      * Fix bug of merging of two empty set
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: default avatarPo-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      0bd6b842
    • Anthony Chang's avatar
      Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two-dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring an unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      c20a75b0
    • ltqin's avatar
      Skip lds of b matrix (#326) · 10b3278b
      ltqin authored
      * start
      
      * read for gridwise gemm
      
      * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1
      
      * add thread  copy desc and register buffer
      
      * add K0PerBlock dim
      
      * add read global data
      
      * finish gridwise gemm
      
      * finish blockwise gemm
      
      * add print data
      
      * add smallest config
      
      * add compare code for gridwis gemm
      
      * fix NXdlPerWave
      
      * fix k0perthread and gridewis gemm main loop
      
      * remove b matrix lds alloc
      
      * fix name
      
      * add test code
      
      * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter
      
      * add double register
      
      * modify b_thread_desc_
      
      * add float
      
      * fp16 tag
      
      * add tail for pipeline
      
      * finish main loop
      
      * optimize main loop
      
      * start clear gridwise gemm
      
      * clear code
      
      * clear redundant code
      
      * change file name
      
      * change file name
      
      * fix bug after merge develop
      
      * fix input parameters
      
      * using MultiK0 control b load data loop
      
      * fix some config
      
      * 4 buffer
      
      * fix bug
      
      * one can use
      
      * change read order
      
      * change buffer array to tuple
      
      * change to 8 buffer
      
      * interleave buffer load
      
      * change to 16
      
      * read 8 buffer
      
      * add data buffer to template
      
      * fix after merge develop(head file)
      
      * format
      
      * change to 4 buffer
      
      * remove unnecessary lambda fun
      10b3278b
    • Qianfeng's avatar
      Add examples for reduction fp16/fp32/bp16/int8/fp64 for 3d/4d/5d (#342) · 14932e8d
      Qianfeng authored
      * Update the reduce_blockwise example to support user specified data type and input+reducing dimensions
      
      * Add examples for using reduce_multiblock_atomic_add
      
      * Add more running examples to the default command-line
      
      * Remove un-necessary header including
      
      * Update to the example README.md
      14932e8d
    • rocking5566's avatar
      Gemm multiple d multiple r (#335) · 6c3c06bf
      rocking5566 authored
      * Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction
      
      * Implement run of kernel
      
      * Add example
      
      * Fix parameter of typo
      
      * Rewrite the reduceMax example
      
      * Rewrite the reduceMean + reduceMeanSquare example
      
      * Refine naming
      
      * Refine folder name
      
      * refine naming
      
      * Rewrite the gemm + bias + relu + add + layernorm example
      
      * Rewrite the gemm + layernorm example
      
      * clang-format
      
      * Fix bug if sync lds
      
      * Fix compile error
      6c3c06bf