1. 19 Sep, 2022 4 commits
  2. 16 Sep, 2022 4 commits
  3. 15 Sep, 2022 5 commits
  4. 14 Sep, 2022 3 commits
    • ltqin's avatar
      batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c
      ltqin authored
      
      
      * refactor
      
      * start
      
      * add device gemm file
      
      * add BatchStrideD0
      
      * add stridd0
      
      * add gridwise file
      
      * add d0 parameters to gridwise gemm
      
      * add c layout transformer
      
      * add d0 threadwise copy
      
      * init kernel
      
      * init kernel
      
      * regular code
      
      * nm desc put to out
      
      * kernel parameter can not use reference
      
      * host add bias+gelu
      
      * run right for bias+gelu
      
      * change AddFastGelu into another file
      
      * interface add d1 bias parameters
      
      * add d1 parameter to argument
      
      * add d1 parameter to gridwise
      
      * first all code,not verify
      
      * gelu change to relu and GetElementSpaceSize bug
      
      * add instance
      
      * start add to ckprofiler
      
      * ckprofiler finish code
      
      * change input parameter for ckProfiler
      
      * fix host bias+gelu bug
      
      * show help for ckProfiler
      
      * fix bug for lunch kernel ignore parametes
      
      * add pad and fix about bug
      
      * mutiple d0
      
      * add dynamic d0_element_op
      
      * change profiler and  instance to mutiple d0
      
      * example have 2 d0
      
      * remove some comments not using
      
      * change 2 d0 have self  parameters
      
      * change d element_op name
      
      * change class name(multiple_d)
      
      * fix bug
      
      * fix bug that don't find file
      
      * update profiler
      
      * refactor
      
      * update profiler
      
      * clean
      
      * revert example change
      
      * add gon layout
      
      * optimize parameter for gno
      
      * add gon to gemm+gemm
      
      * change helping input parameters
      
      * change to GemmPadder_v2
      
      * using ForEach
      
      * fix gb_per_sec
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      370efa6c
    • rocking's avatar
      Let shape of gamma and beta can be same as x · 12673f3f
      rocking authored
      12673f3f
    • rocking's avatar
      Add groupnorm example by layernorm · 45220e05
      rocking authored
      1.  Reference is not ready
      2. shape of gamma and beta need to be fix
      45220e05
  5. 13 Sep, 2022 1 commit
    • Illia Silin's avatar
      Upgrade the OS and ROCM versions. (#411) · b22ebd44
      Illia Silin authored
      * upgrade the OS and ROCM versions in CK docker
      
      * add cxx flags to link code with rocm5.2 and ck-9110 compiler
      
      * rename the docker image
      
      * run ONNX gemms using init=1
      b22ebd44
  6. 09 Sep, 2022 1 commit
  7. 08 Sep, 2022 1 commit
  8. 07 Sep, 2022 1 commit
  9. 06 Sep, 2022 3 commits
    • Anthony Chang's avatar
      Fused attention instances & padding tests (#395) · 868e5c55
      Anthony Chang authored
      * modify comment
      
      * trim unnecessary check
      
      * add gemm spec in kernel name
      
      * add TNTT gemm_gemm + atten kernel instances
      
      * refactor attention padding to better fit in unit tests
      
      This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
      Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
      
      * add adhoc padding test for atten
      
      * shrink input value range for attention kernel validation to avoid occasional error by 1e-3
      
      Still unsure whether this kind of deterministic floating point accurary issue is expected
      or not. May want to try exact same approach as the GPU kernel in the host reference
      GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
      shrink the input value range as it is less likely to produce errors of around ~1e-3.
      
      * attention kernel proper granular padding for all 4 dims
      
      * IsSupportedArgument checks
      
      * test more padded cases
      
      * block PadK specialization in attention kernels
      
      * workaround clang crash for gfx908
      
      (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
      error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
      VGPR_32: Cannot scavenge register without an emergency spill slot!"
      this fall back to less ideal way of handle NPadding in fused attention kernel
      
      * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
      868e5c55
    • Anthony Chang's avatar
      GemmGemm TNNT instances (#399) · fe52c94c
      Anthony Chang authored
      * add gemm_gemm TNNT instance
      
      * sanitize Gemm1KPack
      
      * disable instances that failed validation on mi100
      fe52c94c
    • Adam Osewski's avatar
      Softmax client example (#396) · 3da5c19e
      Adam Osewski authored
      
      
      * Update Softmax device operation interface.
      
      * Update ckProfiler.
      
      * Update Softmax UT.
      
      * Update example.
      
      * Client example.
      
      * Clang format
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      3da5c19e
  10. 02 Sep, 2022 1 commit
    • zjing14's avatar
      [Hotfix] SplitK Gemm fp32 (#401) · 75891161
      zjing14 authored
      * add scripts
      
      * fixed splitK_gemm_fp32
      
      * clean
      
      * clean
      
      * use gemm_xdl_splitK_c_shuffle into profiler
      
      * remove device_gemm_xdl_splitk.hpp
      75891161
  11. 01 Sep, 2022 1 commit
  12. 31 Aug, 2022 2 commits
    • Po Yen Chen's avatar
      Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380) · 46a675aa
      Po Yen Chen authored
      
      
      * Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
      
      * Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
      
      * Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
      
      * Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
      
      * Add 'TransformConvFwdToGemm<>' utility class (from Chao)
      
      * Use 'TransformConvFwdToGemm<>' to shorten code
      
      * Fix ill-formed method declaration
      
      * Re-implement MakeRGridDescriptor_M() function
      
      * Change problem description
      
      * Use macro to define layout types
      
      * Define K-reduced output tensor layout types
      
      * Let user to decide R output tensor layout
      
      * Rename variables
      
      * Add padding to the reduced output tensor if necessary
      
      * Extract common code as helper method
      
      * Remove debug message
      
      * Add missing include directive
      
      * Add partial fp16 Conv + Reduction example
      
      * Add example verification code for 2D Conv problem
      
      * Use type alias to simplify code
      
      * Share code across different-dimension Conv problems
      
      * Rename file/functions from run_conv_fwd* to run_convnd_fwd*
      
      * Make example code more verbose
      
      * Add code to support 1D & 3D Conv + Reduction on host
      
      * Add more examples for data type: bf16, fp32
      
      * Add example for int8
      
      * Add custom target to group examples
      
      * Use more general custom target name
      
      * Change the description in error message
      
      * Disable testing for example other than fp32
      
      * Add examplel for int4 (just copy from int8)
      
      * Fix wrong data type
      
      * Use larger data type for intermediate tensors
      
      * Finish int4 example
      
      * Undefine macro PP_DEFINE_LAYOUT_TYPE() after use
      
      * Use named variables to replace magic numbers
      
      * Remove debug messages
      
      * Use same A/B data type for host Conv in int4 example
      
      * Add check for the 'RLayout' type argument
      
      * Group same-dim-layouts together in 'LayoutSetting<>'
      
      * Add 'final' specifier to utility classes
      
      * Use different initialization method for examples
      
      * Remove macro PP_DEFINE_LAYOUT_TYPE()
      
      * Fix code-comment mismatch
      
      * Use more reasonable initialization value for all data types
      
      * Default use init_method=1 for all examples
      
      * Remove never-used code
      
      * Remove confusing out-of-date comments
      
      * clean
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      46a675aa
    • Chao Liu's avatar
      conv+conv (1x1 only) example using gemm+gemm (#393) · 4df6d93f
      Chao Liu authored
      * refactor conv
      
      * add conv+conv example, 1x1 only
      4df6d93f
  13. 30 Aug, 2022 2 commits
    • Adam Osewski's avatar
      Gemm reduce examples int4/int8/fp32/bf16 (#368) · d00e6115
      Adam Osewski authored
      
      
      * GEMM + Reduce max fp16+fp32
      
      * GEmm + Max bf16 + int8
      
      * Refactor common definitions.
      
      * Refactor common func of mean meansquare example.
      
      * More examples for mean meansquare.
      
      * Update int8 examples and skip them cause of random errors.
      
      * Int4 examples.
      
      * Fix examples for max int4/8
      
      * Tensor conversion for int4 input data for mean meansquare example.
      
      * Remove int4 mean_meansquare example
      
      * Fix int8 mean_meansquare example.
      
      -All ReductionAccData and R<N>DataType have to be F32. The INT32 data
      type is giving wrong results.
      
      * Guard int4 with ifdef
      
      * Change int8 example to add_addsquare due to div rounding err.
      
      * Clang format
      
      * Change the return type of common function.
      
      * Get back int8 example with division.
      
      * Remove int8 mean meansquare.
      
      * Use proper cast for BF16 data type.
      
      * Use ck::literals.
      
      * Use proper data type for host tensors & reference.
      
      - Use ReduceAccDataType for reference gemm output data type.
      - Cast host reference output tensor to EDataType
      - Fix ifdefs for int4.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      d00e6115
    • Shaojie WANG's avatar
      Padding for attention: bmm+scale+softmax+bmm kernel (#385) · 45adb736
      Shaojie WANG authored
      
      
      * add padding algo for bmm+scale+softmax+bmm. Version for verification
      
      * remove verification code
      
      * remove comments
      
      * add padded bmm scale softmax bmm example
      
      * format
      
      * refactor
      
      * add comments for usages of padding bmm+scale+softmax+bmm
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      45adb736
  14. 29 Aug, 2022 2 commits
  15. 26 Aug, 2022 2 commits
    • Illia Silin's avatar
      Add an option to build CK with clang directly (#387) · 1e5b59df
      Illia Silin authored
      * replace hipcc compiler with clang++
      
      * build client app with hipcc
      
      * build client app with clang
      
      * add an option to build with hipcc ro clang
      
      * fix the environment for client app
      
      * fix setting up compiler in cmake_build
      
      * change the way the compiler is set
      1e5b59df
    • zjing14's avatar
      Fixed splitk gemm fp32 (#384) · 9881625b
      zjing14 authored
      * add scripts
      
      * fixed splitK_gemm_fp32
      
      * clean
      
      * clean
      9881625b
  16. 25 Aug, 2022 5 commits
  17. 24 Aug, 2022 2 commits