1. 11 Nov, 2022 1 commit
    • Po Yen Chen's avatar
      Rangify constructor of HostTensorDescriptor & Tensor<> (#445) · 4a2a56c2
      Po Yen Chen authored
      * Rangify STL algorithms
      
      This commit adapts rangified std::copy(), std::fill() & std::transform()
      
      * Rangify check_err()
      
      By rangifying check_err(), we can not only compare values between
      std::vector<>s, but also compare any ranges which have same value
      type.
      
      * Allow constructing Tensor<> like a HostTensorDescriptor
      
      * Simplify Tensor<> object construction logics
      
      * Remove more unnecessary 'HostTensorDescriptor' objects
      
      * Re-format example code
      
      * Re-write more HostTensorDescriptor ctor call
      4a2a56c2
  2. 03 Nov, 2022 1 commit
    • guangzlu's avatar
      Fused elementwise normalization (#492) · 8a4253ba
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      8a4253ba
  3. 02 Nov, 2022 1 commit
    • rocking5566's avatar
      Refine layernorm naming and test code (#497) · d4d1147f
      rocking5566 authored
      * Sync the naming
      
      * Sync the test of layernorm with groupnorm
      
      * Sync the naming
      
      * Minor change for comment and log
      
      * [What] Add saveMean and SaveInvVariance in the interface.
      [Why] These can optimize the backward
      d4d1147f
  4. 25 Oct, 2022 2 commits
    • guangzlu's avatar
      Revert "Fused elementwise layernorm (#468)" (#491) · 6ea9257e
      guangzlu authored
      This reverts commit efbcc6ed.
      6ea9257e
    • guangzlu's avatar
      Fused elementwise layernorm (#468) · efbcc6ed
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      efbcc6ed
  5. 13 Oct, 2022 2 commits
    • Adam Osewski's avatar
      Refactor device op implementations into `impl` subdirectory. (#420) · 30480288
      Adam Osewski authored
      
      
      * Move kernel implementation files under impl directory.
      
      * Update examples paths.
      
      * Update device kernel impl include paths.
      
      * Update tensor operation instances include paths.
      
      * Update profiler and tests include paths.
      
      * Clang-format
      
      * Update include paths for batched gemm reduce
      
      * Refactor UnitTest ConvNDBwdWeight.
      
      * Refactor fwd and bwd data convND UT.
      
      * Fix used test macro.
      
      * Fix include path.
      
      * Fix include paths.
      
      * Fix include paths in profiler and tests.
      
      * Fix include paths.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      30480288
    • rocking5566's avatar
      Fix bug of layernorm ckProfiler and refine code (#448) · 1b62bfaa
      rocking5566 authored
      * Fix bug of profiler for layernorm
      
      * 1. Rename layernorm into normalization
      2. Decouple softmax from normalization
      
      * clang-format
      1b62bfaa
  6. 20 Sep, 2022 1 commit
    • rocking5566's avatar
      Group norm (#417) · 4eba345f
      rocking5566 authored
      
      
      * Add groupnorm example by layernorm
      1.  Reference is not ready
      2. shape of gamma and beta need to be fix
      
      * Let shape of gamma and beta can be same as x
      
      * Modify test, instance and client example
      
      * [What] Fix bug of layernorm for greater than 2 dimension.
      [Why] We need to get upper length from merge transform instead of embed transform.
      
      * Add reference for groupnorm
      
      * Fuse sigmoid after groupnorm
      
      * [What] Rename original layernorm into layernorm2d
      [Why] Prepare to add groupnorm using layernorm5d
      
      * clang-format
      
      * Add groupnorm test
      
      * Refine error message
      
      * Add groupnorm ckProfiler
      
      * Test groupnorm kernel from device_instance
      
      * update example
      
      * upadte profiler
      
      * Fix test naming
      
      * Fix argc number
      
      * Move descriptor and sweeponce to argument for quick debugging
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      4eba345f
  7. 13 Aug, 2022 1 commit
    • rocking5566's avatar
      Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix buf when ThreadSliceK > 1
      
      * Fix bug of merging of two empty set
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: default avatarPo-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      0bd6b842
  8. 11 Aug, 2022 1 commit
    • rocking5566's avatar
      ckProfiler for layernorm (#330) · fdfd7eb5
      rocking5566 authored
      * Refine parameter
      
      * Add base class for layernorm
      
      * Add layernorm instance
      
      * Add layernorm to ckProfiler
      
      * Remove redundant
      
      * Add verification
      
      * Fix compile error due to merge
      fdfd7eb5
  9. 29 Jul, 2022 1 commit
    • Chao Liu's avatar
      Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating refernce conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshhold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * upate contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
      500fa995
  10. 13 Jul, 2022 1 commit
    • rocking5566's avatar
      Standalone layernorm (#315) · 7f216620
      rocking5566 authored
      
      
      * Implement layernorm kernel and deviceOp
      
      * verify gpu kernel with host code
      
      * 1. Separate gamma aand beta from affine
      2. Check if argument is valid
      
      * clean
      
      * Sync the naming
      
      * Support sweep once mode if we can put k dimension data inside one block
      
      * [What] Get length from upper length.
      [Why] if we get length directly, we may get length after padding.
      
      * We only use one block in K dimension.
      Hence, we can simplify the indexing of global R/W.
      
      * Use 1d descriptor for gamma and beta
      
      * Add accElementwiseOp
      
      * Extract layernorm host code
      
      * Support different YVectorDim in GridwiseLayernorm
      
      * Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
      
      * Gamma and beta can share the VGPR.
      
      * Add test for fp32 and fp16
      
      * Fix bug of concurrency and add test case which may fail orignally
      
      * Propagate NaN for layernorm
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      7f216620