1. 04 May, 2023 1 commit
    • Optimize bf16 conversion (#664) · b076a02a
      Rostyslav Geyyer authored
      * Add TypeConvert class and start refactoring
      
      * Refactor TypeConvert as a struct
      
      * Get back to template functions type_convert
      
      * Add a type_convert_bf16_rtn, set rtz as default
      
      * Clean up
      
      * Add UnaryConvertPrecision struct for high-precision workloads
      
      * Format
      
      * Update type_convert to UnaryConvert on threadwise level
      
      * Update UnaryConvertPrecision
      
      * Format
      
      * Fix chmod
      
      * Add a flag to pick conversion method
      
      * Format
      
      * Remove the added flag
      
      * Merge elementwise op with type conversion
      
      * Move type_convert to elemwise op, update the op
      
      * Update type_convert_precision -> bf16_convert_rtn
      
      * Clean up
      
      * Update comments
      
      * Update the CK_WORKAROUND_DENORM_FIX flag handling
      
      * Update the unneeded op to work but warn user
      
      * Remove the message
      
      * Use a PassThrough instead of ConvertBF16RTN to calculate reference
      
      * Format
      
      * Add missing include
      b076a02a
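The commit above distinguishes a round-to-nearest conversion (`type_convert_bf16_rtn`) from a round-toward-zero default. As a rough illustration of the difference, here is a minimal host-side sketch in standard C++. It is not the Composable Kernel implementation; all function names below are hypothetical, and bf16 is assumed to be the upper 16 bits of an IEEE-754 binary32 value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit unsigned integer.
inline std::uint32_t float_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// Build a float from a raw 32-bit pattern (handy for constructing inputs).
inline float float_from_bits(std::uint32_t u) {
    float x;
    std::memcpy(&x, &u, sizeof(x));
    return x;
}

// Round toward zero (hypothetical helper): drop the low 16 mantissa bits.
inline std::uint16_t bf16_rtz_sketch(float x) {
    return static_cast<std::uint16_t>(float_bits(x) >> 16);
}

// Round to nearest, ties to even (hypothetical helper).
inline std::uint16_t bf16_rtn_sketch(float x) {
    std::uint32_t u = float_bits(x);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        // NaN: force a mantissa bit that survives truncation so it stays NaN.
        return static_cast<std::uint16_t>((u >> 16) | 0x0040u);
    }
    // Add just under half an output ULP, plus one if the kept LSB is odd.
    u += 0x7fffu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>(u >> 16);
}

// Usage: the two modes differ only when the discarded bits are non-zero.
inline void bf16_rounding_demo() {
    const float x = float_from_bits(0x3F80C000u); // above the midpoint
    assert(bf16_rtz_sketch(x) == 0x3F80);
    assert(bf16_rtn_sketch(x) == 0x3F81);
}
```

Truncation is cheaper (a single shift), which is presumably why it is attractive as a default; the nearest-even variant costs an extra add and a NaN check but halves the worst-case rounding error.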
  2. 28 Apr, 2023 1 commit
  3. 11 Apr, 2023 1 commit
  4. 30 Mar, 2023 1 commit
  5. 20 Mar, 2023 1 commit
  6. 09 Mar, 2023 1 commit
  7. 27 Feb, 2023 1 commit
  8. 15 Feb, 2023 1 commit
  9. 18 Jan, 2023 1 commit
    • Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760
      Raman R jana authored
      
      
      * wavelet gemm programming model support for CK
      
      * GEMM pipeline update for wavelet programming model
      
      * Updated wavelet programming pipeline
      
      * fixes for global-write for math-wave
      
      * fixed bug in global writes
      
      * Updated comments for better readability
      
      * fixed clang format errors
      
      * added block_lds without barrier sync
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * refactor
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Add comment depicting B2C mapping pattern.
      
      * Formatting & comments.
      
      * Comparison with custom B2C mapping pattern.
      
      * Example for wavelet gemm.
      
      * Add wavelet to Gemm standalone test.
      
      * Remove debug code.
      
      * Remove dangling #endif directive.
      
      Co-authored-by: root <Raman Jana>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
      1cfa8760
  10. 12 Jan, 2023 1 commit
  11. 02 Dec, 2022 1 commit
  12. 17 Nov, 2022 1 commit
  13. 02 Nov, 2022 2 commits
    • Anthony Chang's avatar
    • Add pipeline v1/v2 selector, add more instances (#381) · 1a0b0e7b
      Rostyslav Geyyer authored
      
      
      * Add gridwise gemm pipeline v1/v2 selector
      
      * Pipeline selector working, test-wise add pipeline options to one instance
      
      * Add gemm instances
      
      * Add debug info to DeviceGemmXdl
      
      * Add debug info to DeviceGemmXdl_CShuffle
      
      * Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu
      
      * Minor fix
      
      * Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm
      
      * set up inter-wave configuration
      
      * use default loop scheduling for supported gemm ops
      
      To blanket-apply inter-wave scheduling to all supported gemm ops, define the macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. This is discouraged, though, as it is not covered by CI.
      
      * Add enum PipelineVersion
      
      * Update instances
      
      * Format
      
      * Fix the merge conflict
      
      * Add flags to disable added instances
      
      * Test disable flag check
      
      * Disable flag check
      
      * Enable the instances
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      1a0b0e7b
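The commit above mentions a macro, `CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING`, that switches the default loop scheduling for supported gemm ops. One way such a compile-time default could be wired up is sketched below. Only the macro name comes from the commit message; the enum, the function, and the struct are hypothetical stand-ins, and the real library may structure this differently.

```cpp
// Default the macro to "off" when the build does not define it.
#ifndef CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING
#define CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING 0
#endif

// Hypothetical enum; the real type and enumerator names may differ.
enum class LoopSchedulerSketch { Default, Interwave };

// Pick the library-wide default at compile time from the macro.
constexpr LoopSchedulerSketch default_loop_scheduler_sketch() {
#if CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING
    return LoopSchedulerSketch::Interwave;
#else
    return LoopSchedulerSketch::Default;
#endif
}

// Hypothetical device op: callers may still override the default explicitly.
template <LoopSchedulerSketch Sched = default_loop_scheduler_sketch()>
struct GemmOpSketch {
    static constexpr LoopSchedulerSketch scheduler = Sched;
};
```

With this shape, an instance such as `GemmOpSketch<LoopSchedulerSketch::Interwave>` opts in explicitly, while `GemmOpSketch<>` follows whatever the build-wide macro selects.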
  14. 27 Oct, 2022 1 commit
    • Input/output permutation for fused attention (#460) · de37550f
      Anthony Chang authored
      
      
      * reopen masking attention instance now that CI is upgraded
      
      * re-enable instances that previously failed on 9110
      
      * enable ksize-kpadding pair validity test
      
      * add non-masked attention+permute test; expose masking boolean to attention kernel handles
      
      * disable bench
      
      * fix test
      
      * move files
      
      * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
      
      * format
      
      * amend rename
      
      * disable bench in test
      
      * add mask/no-mask test for non-permute attention kernels
      
      * disable broken kernel instance
      
      * example working
      
      add non-permuted problem statement
      
      evaluating whether overhead comes from permutation or the extra kernel arg
      
      * interface for bias addition without implementing it
      
      * test and profiler running
      
      * tidy
      
      * mask type determined by enum class
      
      * unify example code
      
      * move masking specialization to its own header
      
      * align formats
      
      * extract helper functions
      
      * experiment merging dims for attn w/ permute; shows perf parity with attn w/o permute
      
      * add tensor specialization to template args
      
      since tensor spec packed shows perf parity when permutation isn't needed
      
      remove redundant template args
      
      comment on 'packed' tensor specialization
      
      * grouped attention with input/output permute example
      
      * format
      
      * clean up
      
      * refactor acc0 tile visitor
      Co-authored-by: shaojiewang <wsjmessi@163.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      de37550f
  15. 06 Sep, 2022 1 commit
    • Fused attention instances & padding tests (#395) · 868e5c55
      Anthony Chang authored
      * modify comment
      
      * trim unnecessary check
      
      * add gemm spec in kernel name
      
      * add TNTT gemm_gemm + atten kernel instances
      
      * refactor attention padding to better fit in unit tests
      
      This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
      Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
      
      * add adhoc padding test for atten
      
      * shrink input value range for attention kernel validation to avoid occasional error by 1e-3
      
      Still unsure whether this kind of deterministic floating point accuracy issue is expected
      or not. May want to try exact same approach as the GPU kernel in the host reference
      GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
      shrink the input value range as it is less likely to produce errors of around 1e-3.
      
      * attention kernel proper granular padding for all 4 dims
      
      * IsSupportedArgument checks
      
      * test more padded cases
      
      * block PadK specialization in attention kernels
      
      * workaround clang crash for gfx908
      
      (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
      error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
      VGPR_32: Cannot scavenge register without an emergency spill slot!"
      this falls back to a less ideal way of handling NPadding in the fused attention kernel
      
      * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
      868e5c55
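The padding notes in the commit above describe loading out-of-bounds values as NaN and resetting them to minus infinity ("ResetNaNToMinusInf") so that padded positions do not contribute to the softmax. The idea can be sketched on the host as follows. This is an illustration under assumptions, not the kernel's code: the function name is hypothetical and a single row is processed for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Softmax over one row in which padded (out-of-bounds) entries arrive as NaN.
inline std::vector<float> masked_softmax_sketch(std::vector<float> row) {
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Reset NaN to -inf so that exp() maps padded entries to exactly 0.
    for (float& v : row) {
        if (std::isnan(v)) v = neg_inf;
    }

    float row_max = neg_inf;
    for (float v : row) row_max = std::max(row_max, v);

    // A row made only of padding has no valid entries; return zeros.
    if (row_max == neg_inf) return std::vector<float>(row.size(), 0.0f);

    float sum = 0.0f;
    for (float& v : row) {
        v = std::exp(v - row_max); // subtract the max for numerical stability
        sum += v;
    }
    for (float& v : row) v /= sum;
    return row;
}

// Usage: two valid entries share the probability mass; the padded one gets 0.
inline void masked_softmax_demo() {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    const std::vector<float> out = masked_softmax_sketch({0.0f, 0.0f, nan});
    assert(out[0] == 0.5f && out[1] == 0.5f && out[2] == 0.0f);
}
```

Using NaN as the out-of-bounds marker keeps it distinguishable from legitimate data until the reset step; leaving NaN in place would instead poison the row maximum and the sum.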
  16. 29 Jul, 2022 1 commit
    • Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating reference conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * update contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
      500fa995
  17. 08 Jul, 2022 1 commit
    • GEMM pipeline v2 (#317) · 63914743
      Po Yen Chen authored
      
      
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * use pipeline_v2 for gemm kernel
      
      * Remove inconsistent indent
      
      * Fix compilation errors due to incomplete merge process
      
      * Add missing include directives
      
      * Fix compilation errors in currently unused files
      
      * Add license in newly added files
      
      * Re-format touched files by clang-format-10
      
      * Fix wrong template argument count of DeviceGemm<>
      
      * Use language construct to choose between types
      
      * Use language construct to choose GEMM example instance
      
      * Fix compilation error due to interface change
      
      * Re-use type alias to avoid duplication
      
      * Unify type alias usage in source file
      
      * Only use v2 pipeline in one gridwise GEMM type
      
      * Remove no-longer used include directives
      
      * Add static_assert() to check pipeline type requirements
      
      * Revert "Add static_assert() to check pipeline type requirements"
      
      This reverts commit f0985f0a132671a1caaea92810c9f30dcf062bde.
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: shaojiewang <wsjmessi@163.com>
      63914743
  18. 07 Jul, 2022 1 commit
    • N-D Tensor Contraction example, instance, and client example (#270) · 4fe9c393
      Chao Liu authored
      * adding contraction
      
      * add contraction example
      
      * update example
      
      * update example
      
      * format
      
      * update readme
      
      * clean header
      
      * clean header
      
      * contraction with multiple D
      
      * rename
      
      * fix naming issue; add instances for contraction+bilinear
      
      * change assumed virtual layout of contraction; add client example
      
      * update example
      
      * update
      
      * contraction+scale
      
      * use type_convert
      
      * rename
      4fe9c393
  19. 25 Jun, 2022 1 commit
    • Absolute include path (#281) · d1db6a0c
      Chao Liu authored
      * add gelu and fast_gelu
      
      * added GeLU and fast GeLU
      
      * clean up
      
      * add gemm+fastgelu example
      
      * add gemm+gelu instances
      
      * update profiler
      
      * clean up
      
      * clean up
      
      * adding gemm+bias+activation
      
      * clean
      
      * adding bias
      
      * clean
      
      * adding gemm multiple d
      
      * debugging
      
      * add gemm bias add fastgelu
      
      * rename, clean
      
      * refactoring; add readme
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * refactor
      
      * fix
      
      * fix
      
      * update example
      
      * update example
      
      * rename
      
      * update example
      
      * add ckProfiler
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * add client app example
      
      * update readme
      
      * delete obsolete files
      
      * remove old client app
      
      * delete old file
      
      * cleaning
      
      * clean
      
      * remove half
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path for all examples
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * fix header path
      
      * revert client app example
      
      * clean build
      
      * fix build
      
      * temporarily disable client test on Jenkins
      
      * clean
      
      * clean
      
      * clean
      d1db6a0c
  20. 23 Jun, 2022 2 commits
    • update license (#297) · a49115b9
      Chao Liu authored
      * update license
      
      * update license
      
      * update license
      
      * update license
      a49115b9
    • Testing all fwd convolution specializations. (#259) · a2edd7d8
      Adam Osewski authored
      
      
      * UniformFill with integer values.
      
      * Log tested instance type string.
      
      * Add UT for all convolution specializations.
      
      * debugging conv
      
      * Fix dangling reference bug.
      
      * Small refinements.
      
      * Fix call to error checking function.
      
      * Small refinements to tests.
      
      * Configure error tolerance
      * Change problem size.
      * Remove OddC case from types that do not support it.
      
      * Add helper traits for AccumulatorDataType.
      
      * Print first 5 errs in check_err for integral types.
      
      * Rename FillUniform to FillUniformDistribution
      
      * Refactor
      
      * Do not use typed tests.
      * Instead use plain fixture class with templatized member functions.
      * Initialize tensors with integer values.
      
      * Refine test instances.
      
      * Properly set accumulator data type.
      * Add another "big" instance.
      
      * Refactor convolution tests.
      
      * Revert "debugging conv"
      
      This reverts commit b109516455631ff8fd6dce99cf7c14bf8e323ebb.
      
      * Add pragma once + format + small refinement.
      
      * Fix some unwanted changes.
      
      * Clang-format
      
      * Fix profile_convnd to use renamed tensor initializer.
      
      * Add instances for ConvFWDND kernel case 2D
      
      * Helpers to get ConvNDFwd 2D instances.
      
      * Refactoring.
      
      * Remove "small block" instance as it was generating compiler errors.
      * Remove default template parameters values.
      
      * Refine and fix test.
      
      * Fix problem with default template parameter types.
      * Adjust error thresholds for floating point values test.
      * Use integer values initialization for instances test.
      * Add tests for ConvNDFwd 2D case.
      
      * Remove AccumulatorDataType type trait.
      
      * Update unit-tests.
      
      * Remove operator<< overload.
      
      * Unlock conv1d/3d nd fwd instances.
      
      * Enable skipping calculating reference using flag.
      
      * Fix number of channels for first ResNet50 layer.
      
      * Clang-format.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      a2edd7d8
  21. 21 Jun, 2022 1 commit
  22. 20 May, 2022 1 commit
    • Gemm reduce max (#209) · 0ffe956a
      rocking5566 authored
      
      
      * [What] Rename the example
      [Why] Prepare to add unary reduction
      
      * Add global operation to the parameter
      
      * Add atomicmax
      
      * Fix compile error
      
      * Support atomicMax (hip library)
      
      * Rename the reduction example
      
      * Fix target name
      
      * use p_d1_grid as the indicator directly
      
      * Prevent performance issue. Let passthrough handle it.
      
      * Implement the function template and specialize it for float2
      
      * No need to separate into two lines
      
      * Remove empty line
      
      * add comment
      
      * Fix compile error due to merge from develop
      
      * make the implementation of atomic_max / atomic_add explicit for each datatype
      
      * Refine typo
      
      * For future CI test
      
      * Fix compiler error in ckProfiler
      
      * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
      
      * simply use remove_pointer
      
      * Rename type and var
      
      * Refine example
      
      * Modify reducemax example
      
      * Fix bug in reduction
      
      * Change initialize range
      
      * Implement F64 version of atomicMax
      
      * Move reduction code together
      
      * Add buffer atomic_max
      
      * Fix coding style by clang-format
      
      * Integrate new api of DeviceGemmReduce_Xdl_CShuffle
      
      * Integrate Batch gemm reduction
      
      * Fix example
      
      * fix example
      
      * clean up
      
      * Fix batch gemm tensor operation
      
      * Fix coding style
      
      * Fix template argument
      
      * Fix clang format
      
      * Keep the flexibility of a different stride for each D tensor
      
      * Fix compile error for ckProfiler
      
      * Fix typo
      
      * [What] Fix naming
      [Why] Prepare to add out elementop
      
      * Add DoutElementOp
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      Co-authored-by: rocking <chunylai@amd.com>
      0ffe956a
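Several items in the commit above add atomic-max support, including an F64 version. Where no native floating-point atomic max is available, a common technique is a compare-and-swap loop. The sketch below shows that technique in standard host-side C++ with `std::atomic<double>`; it is an assumption-laden illustration rather than the commit's HIP device code, and the function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Raise `target` to at least `value`; returns the value seen before any update.
inline double atomic_max_sketch(std::atomic<double>& target, double value) {
    double observed = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `observed`, so the loop
    // re-checks whether another thread already stored something larger.
    while (observed < value &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
    }
    return observed;
}

// Usage: a smaller candidate never lowers the running maximum.
inline void atomic_max_demo() {
    std::atomic<double> running_max{1.0};
    atomic_max_sketch(running_max, 3.0);
    atomic_max_sketch(running_max, 2.0);
    assert(running_max.load() == 3.0);
}
```

The loop exits without writing as soon as the stored value is already at least `value`, so contended updates with small candidates stay cheap.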
  23. 11 May, 2022 1 commit
    • Manual control of MAC cluster for improved interwave performance (#184) · 76764d8c
      Anthony Chang authored
      * manual control of MAC cluster for improved 2-wave performance
      
      ensure setprio's order; ensure inner loop size >= local read size
      
      synchronize when single mac cluster
      
      * format
      
      * use value field from ck::integral_constant
      
      * roll out inter-wave loop scheduler to c-shuffle gemm variants
      
      will gradually roll out to other applicable device ops when occasional reg spill is resolved
      
      * additional comments
      
      * format
      
      * fix mismatch between inter-wave pipeline and interwave blockwise gemm
      
      * address review feedback
      
      * amend
      76764d8c
  24. 09 May, 2022 1 commit
    • Code refactor (#175) · ec7c2e91
      Chao Liu authored
      * format
      
      * improving pipeline
      
      * fix typo
      
      * format
      
      * adding thread group
      
      * adding thread group
      
      * adding thread group
      
      * adding gemm pipeline
      
      * tweak
      
      * refactor
      
      * refactor
      
      * add missing type convert
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * fix build
      
      * refactor
      
      * format
      
      * clean up
      
      * use remove_cvref_t
      
      * clean
      
      * clean up
      
      * clean up
      
      * clean up
      ec7c2e91
  25. 31 Mar, 2022 1 commit
    • Compile for gfx908 and gfx90a (#130) · cd167e49
      Chao Liu authored
      * adding compilation for multiple targets
      
      * fix build
      
      * clean
      
      * update Jenkinsfile
      
      * update readme
      
      * update Jenkins
      
      * use ck::half_t instead of ushort for bf16
      
      * rename enum classes
      
      * clean
      
      * rename
      
      * clean
      cd167e49
  26. 24 Mar, 2022 1 commit
    • Gemm+Reduce Fusion (#128) · f95267f1
      Chao Liu authored
      * add gridwise gemm v4r1
      
      * rename
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * use sfc in shuffling
      
      * remove hardcode
      
      * remove hardcode
      
      * refactor
      
      * fix build
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * adding gemm+reduce
      
      * format
      
      * clean
      
      * adding gemm+reduce
      
      * adding profiler for gemm+reduce
      
      * adding gemm+reduce profiler
      
      * fix build
      
      * clean up
      
      * gemm+reduce
      
      * fix build
      
      * update DeviceGemm_Xdl_CShuffle; update enum to enum class
      
      * clean up
      
      * add test for gemm+reduce
      
      * clean up
      
      * refactor
      
      * fix build
      
      * fix build
      f95267f1
  27. 23 Mar, 2022 1 commit
    • Unified conv3D API + support for all data types. (#133) · f91579aa
      Adam Osewski authored
      
      
      * Convolution ND
      
      * Code unification across dimensions for generating tensor descriptors.
      * Example
      * Instances
      
      * Move convnd f32 instance file to comply with repo structure.
      
      * Conv 1D tensor layouts.
      
      * Formatting and use ReferenceConv
      
      * Reference ConvFwd supporting 1D and 2D convolution.
      
      * Debug printing TensorLayout name.
      
      * Conv fwd 1D instance f32
      
      * Refactor conv ND example.
      
      Needed to support various conv dimensions.
      
      * Rename conv nd example directory to prevent conflicts.
      
      * Refactor some common utility to single file.
      
      Plus some tests.
      
      * Refactor GetHostTensorDescriptor + UT.
      
      * Add 1D test case.
      
      * Test reference convolution 1d/2d
      
      * Remove some leftovers.
      
      * Fix convolution example error for 1D
      
      * Refactor test check errors utility function.
      
      * Test Conv2D Fwd XDL
      
      * More UT for 1D case.
      
      * Parameterize input & weight initializers.
      
      * Rename example to prevent conflicts.
      
      * Split convnd instance into separate files for 1d/2d
      
      * Address review comments.
      
      * Fix data type for flops/gbytes calculations.
      
      * Assign example number 11.
      
      * 3D cases for convolution utility functions.
      
      * 3D reference convolution.
      
      * Add support for 3D convolution.
      
      * Check for inputs bigger than 2GB.
      
      * Formatting
      
      * Support for bf16/f16/f32/i8 - conv instances + UT.
      
      * Use check_err from test_util.hpp.
      
      * Split convnd test into separate files for each dim.
      
      * Fix data generation and use proper instances.
      
      * Formatting
      
      * Skip tensor initialization if not necessary.
      
      * Fix CMakefiles.
      
      * Remove redundant conv2d_fwd test.
      
      * Lower problem size for conv3D UT.
      
      * 3D case for convnd example.
      
      * Remove leftovers after merge.
      
      * Add Conv Specialization string to GetTypeString
      
      * Skip instance causing numerical errors.
      
      * Small fixes.
      
      * Remove redundant includes.
      
      * Fix namespace name error.
      
      * Script for automatic testing and logging convolution fwd UTs
      
      * Comment out numactl cmd.
      
      * Refine weights initialization and relax rtol for fp16
      
      * Fix weights initialization for int8.
      
      * Add type_convert when store output in ref conv 1D.
      
      * Get back old conv2d_fwd_xdl operation.
      
      * Silence conv debug print.
      
      * format
      
      * clean
      
      * clean
      
      * Fix merge.
      
      * Fix namespace for check_err
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      f91579aa
  28. 09 Mar, 2022 1 commit
    • Reorganize files, Part 1 (#119) · 5d37d7bf
      Chao Liu authored
      * delete obsolete files
      
      * move files
      
      * build
      
      * update cmake
      
      * update cmake
      
      * fix build
      
      * reorg examples
      
      * update cmake for example and test
      5d37d7bf
  29. 04 Mar, 2022 1 commit
    • NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab
      ltqin authored
      
      
      * start conv2d bwd api
      
      * kernel running
      
      * add bwd reference
      
      * change to no shuffle
      
      * fix bwd reference
      
      * pass verification
      
      * add Filter1x1Stride1Pad0 and start testing
      
      * change some tuning parameter
      
      * fix test error
      
      * add fp16 tuning parameter
      
      * add bf16 tuning parameter
      
      * add int8 tuning parameters
      
      * change fp32 tuning parameter
      
      * add bwd to profiler
      
      * fix bug for bwd profiler
      
      * fix ckProfiler bug
      
      * change conv2d_bwd_xdl to fp16
      
      * fix bug in comments
      
      * fix precompile id
      
      * fix enum conv name
      
      * change _bwd_ to _bwd_data_
      
      * change conv2d_bwd example id
      
      * bwd to bwd data
      
      * fix prehead
      
      * fix MakeDefaultBlock2CTileMap, import from merge develop
      
      * format bwd instance
      
      * bwd to bwd data
      
      * change name bwd to bwd data
      
      * change name bwd to bwd data in example
      
      * format code
      
      * change conv2d bwd data id in example
      
      * rewrite readme for example
      
      * fix CalculateMagicNumbers about div zero
      
      * add workaround CK_WORKAROUND_SWDEV_325164
      
      * change test_conf2d_bwd_data show info
      
      * format
      
      * fix bug for workaround:CK_WORKAROUND_SWDEV_325164
      
      * format tuning parameters
      
      * format tuning parameters again
      
      * format tuning parameters 3
      
      * format tuning parameters 4
      
      * remove add function template
      
      * format
      
      * update comment
      Co-authored-by: ltqin <letaoqin@amd.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      c254e5ab
  30. 23 Feb, 2022 1 commit
    • Conv3d new (#94) · 6dfb92bb
      Jianfeng Yan authored
      
      
      * conv3d compiles but has memory error
      
      * conv3d works
      
      * fix performance issue by using __builtin_amdgcn_readfirstlane
      
      * change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
      
      * clang-format
      
      * remove CK_EXPERIMENTAL_PASS_TENSOR_DESCRIPTOR_BY_*; moved wrapper into DeviceConv3d
      
      * format
      
      * remove useless macro
      
      * add comment
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      6dfb92bb
  31. 19 Feb, 2022 1 commit
    • Initial Setup for CI (#86) · 2778e997
      JD authored
      
      
      * add docker file and make default target buildable
      
      * add Jenkinsfile
      
      * remove empty env block
      
      * fix package stage
      
      * remove render group from docker run
      
      * clean up Jenkins file
      
      * add cppcheck as dev dependency
      
      * update cmake file
      
      * Add profiler build stage
      
      * add hip_version config file for reduction operator
      
      * correct jenkins var name
      
      * Build release instead of debug
      
      * clean up
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      2778e997
  32. 26 Dec, 2021 1 commit
    • Fusion Conv+Bias+ReLU(+Add) (#62) · acbd7bd7
      Chao Liu authored
      * fix relu
      
      * clean up
      
      * clean up
      
      * adding 1x1 conv
      
      * adding 1x1 conv
      
      * added 1x1 conv
      
      * refactor
      
      * refactor
      
      * refactor
      
      * added profiler for conv+bias+relu+add
      
      * clean up
      
      * adding conv+bias+relu
      
      * adding conv+bias+relu
      
      * added conv+bias+relu
      
      * Update README.md
      
      * update cpu verification
      
      * adding c shuffle
      
      * update static_tensor for dealing with invalid element
      
      * adding c shuffle
      
      * debugging
      
      * fix bug
      
      * convert to fp16 before shuffle
      
      * shuffle more than one M/NRepeat
      
      * clean up
      
      * remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
      
      * clean up
      
      * remove coordinate step hack from all gridwise gemm xdl
      
      * clean up coordinate step hack
      
      * clean up coordinate step hack
      
      * ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
      
      * adding output shuffle in conv+bias+relu+add
      
      * update
      
      * added conv+bias+relu+add with c shuffle
      
      * added conv+bias+relu+add with c shuffle
      
      * fix forward_sweep bugs in threadwise copy
      
      * clean up
      
      * refactor
      
      * clean up
      
      * clean up
      
      * added conv_c_shuffle+bias_relu
      
      * clean up
      
      * added conv+bias+relu+atomic_add
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * clean up
      
      * misc fixes; add 1x1 specialization
      
      * clean up
      
      * delete unused device op
      
      * clean up
      
      * add support for odd C value
      acbd7bd7
  33. 03 Dec, 2021 1 commit
    • Chao Liu's avatar
      GEMM/Conv+BiasAdd+ReLU+Add (#55) · 41cdd380
      Chao Liu authored
      * gemm+activation
      
      * move C pointwise operation into threadwise copy
      
      * add pointwise operation to A/B matrix
      
      * update ckProfiler
      
      * adding bias add
      
      * adding bias add
      
      * adding bias add
      
      * added bias add; worked around compiler issues
      
      * clean up
      
      * clean up
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * clean up
      
      * add conv_xdl example
      
      * adding conv_xdl_bias_relu_add example
      
      * add conv+bias+relu+add, but has register spill issue
      
      * tweak
      
      * tweak
      
      * refactor
      
      * Update README.md
      
      update readme for example/2_gemm_xdl_bias_relu_add
      
      * clean up
      
      * Update README.md
      
      update readme for example/3_conv_xdl
      
      * Update README.md
      41cdd380
  34. 18 Nov, 2021 2 commits
    • Chao Liu's avatar
      Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53) · 64350aff
      Chao Liu authored
      * reworking vector_type
      
      * use __builtin_memcpy for bit_cast and vector access of scalar pointer
      
      * clean up
      64350aff
    • zjing14's avatar
      v5r1 fusion kernels for inference (#49) · 970fa3e9
      zjing14 authored
      
      
      * init
      
      * refactor for 1x1
      
      * rename e0_e1
      
      * add e1 with bugs
      
      * debug
      
      * fixed
      
      * fixed e1
      
      * add timer
      
      * improve threadwise gemm with dot2
      
      * add e2
      
      * tuning
      
      * separate c2
      
      * add nhwc
      
      * restore nchwc
      
      * clean
      
      * opt
      
      * fixed; tuning
      
      * add BGlobalMoveSliceWindowStepHacks{}
      
      * tuning
      
      * repeat running
      
      * adjust
      
      * merge v5r1 nchwc
      
      * add adaptors
      
      * split k0 k1 in c_thread_grid
      
      * split h and w
      
      * remove v5r1 nhwc
      
      * clean for pr
      
      * remove host_conv_add
      
      * clean code
      
      * clean
      
      * add dynamic support
      
      * static mode
      
      * test static
      
      * add conv+add fusion
      
      * fixed validation
      
      * naming fix
      
      * use activ_enum
      
      * make static
      
      * refactor conv_add for InMem::add
      
      * add bias
      
      * add conv_out
      
      * add configurable makeddesc
      
      * add maxpool fusion
      
      * add maxpool host for validation
      
      * enable static desc
      
      * conv-only use v5r1_add
      
      * test
      
      * test
      
      * for binary dumps
      
      * fixed incorrect results due to typo
      
      * clean
      
      * debugging maxpool
      
      * workaround with offset trick
      
      * clean code
      
      * modularize ops of fusion
      
      * add gridwise_gemm_v3
      
      * create separate fusion function
      
      * enable dynamic mode of conv and conv+resize_add
      
      * add dynamic mode of maxpool
      
      * add pass by point
      
      * add activ_type as arguments
      
      * merge develop
      
      * clean
      
      * reset config to old default
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      970fa3e9
  35. 16 Nov, 2021 1 commit
  36. 15 Nov, 2021 1 commit
    • Chao Liu's avatar
      FP16 data in-register transpose (#41) · b491ebf3
      Chao Liu authored
      * start fixing 16bit data packing
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * adding StaticTensor
      
      * add missing constexpr
      
      * adding static tensor
      
      * adding static tensor
      
      * adding transpose
      
      * add inline asm for transpose 2x2 of half_t
      
      * add general transpose_vectors(), but it has unnecessary register initialization using v_mov
      
      * fix unnecessary register initialization in transpose_vector by using more pass-by-reference
      
      * add hardcoded logic for NHWC wrw
      
      * improve asm for v_pack
      
      * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
      
      * tweak
      
      * reorganize file
      b491ebf3
  37. 14 Nov, 2021 1 commit
    • Chao Liu's avatar
      ckProfiler and device-level XDL GEMM operator (#48) · e823d518
      Chao Liu authored
      * add DeviceGemmXdl
      
      * update script
      
      * fix naming issue
      
      * fix comment
      
      * output HostTensorDescriptor
      
      * rename
      
      * padded GEMM for fwd v4r4r4 nhwc
      
      * refactor
      
      * refactor
      
      * refactor
      
      * adding ckProfiler
      
      * adding ckProfiler
      
      * refactor
      
      * fix tuning parameter bug
      
      * add more gemm instances
      
      * add more fp16 GEMM instances
      
      * fix profiler driver
      
      * fix bug in tuning parameter
      
      * add fp32 gemm instances
      
      * small fix
      
      * refactor
      
      * rename
      
      * refactor gemm profiler; adding DeviceConv and conv profiler
      
      * refactor
      
      * fix
      
      * add conv profiler
      
      * refactor
      
      * adding more GEMM and Conv instance
      
      * Create README.md
      
      Add build instruction for ckProfiler
      
      * Create README.md
      
      Add Readme for gemm_xdl example
      
      * Update README.md
      
      Remove build instruction from topmost folder
      
      * Update README.md
      
      * clean up
      e823d518