1. 12 Jan, 2023 2 commits
    • Add a flag to enable/disable debug output in many kernels. (#549) · 715e8dd2
      Illia Silin authored
      * add DEBUG_LOG macro to enable/disable debug output
      
      * fix syntax
      
      * fix syntax again
      
      * fix syntax one more time
      
      * remove blank spaces
      
      * use ifdefs
      
      * add the Print argument
      
      * move the definition of DEBUG_LOG to ck.hpp
      
      * add the missing argument to Print()
      715e8dd2
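The switch described by this commit can be sketched as a compile-time guard. `DEBUG_LOG` is the macro name from the commit; the default value and the surrounding function are illustrative assumptions, not the actual contents of ck.hpp.

```cpp
#include <cstdio>

// DEBUG_LOG is the commit's macro name; defaulting it to 0 here is an
// assumption for illustration. Build with -DDEBUG_LOG=1 to turn output on.
#ifndef DEBUG_LOG
#define DEBUG_LOG 0
#endif

constexpr bool debug_log_enabled() { return DEBUG_LOG != 0; }

// Hypothetical kernel entry point showing how the printouts compile away.
inline void kernel_entry(int block_id)
{
#if DEBUG_LOG
    std::printf("entering kernel, block_id = %d\n", block_id);
#else
    (void)block_id; // debug output compiled out
#endif
}
```

Building with `-DDEBUG_LOG=1` keeps the printouts; leaving it at 0 compiles them out entirely, so release kernels pay no cost.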
    • Remove inclusion of cmath (#551) · a17b0414
      Qianfeng authored
      * Include cmath only when compiling host code in math_v2.hpp
      
      * Remove inclusion of cmath from device_base.hpp and device_permute.hpp
      a17b0414
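A common way to realize "cmath only for host code" is to guard the include with HIP's device-compilation macro; the helper below is a hypothetical stand-in, not the actual math_v2.hpp code.

```cpp
// Sketch: pull in <cmath> only for host compilation. __HIP_DEVICE_COMPILE__
// is defined by the HIP compiler during device-code passes; on a plain host
// compiler it is absent, so <cmath> is included.
#ifndef __HIP_DEVICE_COMPILE__
#include <cmath>
#endif

namespace ck_sketch {
// Hypothetical helper: the host path uses std::sqrt from <cmath>, while the
// device path would use a builtin so device code never needs the header.
inline float sqrt_host_or_device(float x)
{
#ifndef __HIP_DEVICE_COMPILE__
    return std::sqrt(x);
#else
    return __builtin_sqrtf(x);
#endif
}
} // namespace ck_sketch
```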
  2. 15 Dec, 2022 2 commits
  3. 12 Dec, 2022 1 commit
    • Gridwise elementwise 2d (#466) · 0e5c264c
      arai713 authored
      
      
      * added 2d gridwise elementwise
      
      * added 2d version of device elementwise
      
      * added example file with updated device elementwise call
      
      * added CMake file
      
      * changed NumDim into 2D
      
      * fixed compiler issues
      
      * fixed indexing for loop step
      
      * fixed NumDim dimension error
      
      * changed blockID to 2D
      
      * updated Grid Desc
      
      * updated kernel call
      
      * fixed 2d thread indexing
      
      * added dimensions for example file
      
      * commented out unused code
      
      * changed vector load
      
      * removed extra code
      
      * temporarily removing vector load on 2nd dim
      
      * changed vector load back, still causing errors
      
      * altered indexing
      
      * changed isSupportedArgument for 2D
      
      * changed indexing + do/while
      
      * fixed isSupportedArgument
      
      * changed dimension for debugging
      
      * fixed
      
      * added testing printouts
      
      * testing change
      
      * added variables to distribute threads through both dimensions
      
      * testing changes
      
      * integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise
      
      * removed most of the extraneous code, testing with different dimensions
      
      * testing
      
      * removed debugging print statements
      
      * moved 2d elementwise permute into elementwise permute directory
      
      * fixed formatting
      
      * removed debugging comments from threadwise transfer
      Co-authored-by: Jing Zhang <jizhan@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      0e5c264c
  4. 08 Dec, 2022 1 commit
  5. 07 Dec, 2022 1 commit
  6. 02 Dec, 2022 3 commits
  7. 30 Nov, 2022 2 commits
    • gemm, conv perchannel quantization (#503) · ad541ad6
      rocking5566 authored
      * Use gemm_multiple_D instead
      
      * Add gemm bias relu quantization example
      
      * Add pure gemm quantization example
      
      * Add quantization of perchannel conv + bias + relu example
      
      * Refine the code
      
      * Rename multiplier to requant_scale
      
      * Rename the folder
      
      * Remove redundant comment
      
      * Rename the file. Prepare to add perchannel
      
      * Add conv perchannel instance
      
      * Move to quantization folder
      
      * Add conv perchannel client example
      
      * Apply Rangify constructor of HostTensorDescriptor & Tensor<>
      
      * Fix merge error
      ad541ad6
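Per-channel quantization, as distinct from the per-layer variant, gives every output channel its own scale. A numeric sketch (the `requant_scale` name follows the commit's renaming; the function itself is illustrative, not CK's implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative per-channel requantization: each output channel ch uses its own
// requant_scale[ch], versus one scale for the whole tensor in per-layer
// quantization. The fused relu runs before rounding and int8 saturation.
std::vector<std::int8_t> requantize_perchannel(const std::vector<std::int32_t>& acc,
                                               const std::vector<float>& requant_scale,
                                               bool relu = true)
{
    std::vector<std::int8_t> out(acc.size());
    for(std::size_t ch = 0; ch < acc.size(); ++ch)
    {
        float v = static_cast<float>(acc[ch]) * requant_scale[ch]; // rescale
        if(relu)
            v = std::max(v, 0.0f);
        v = std::round(v);
        v = std::min(std::max(v, -128.0f), 127.0f); // saturate to int8 range
        out[ch] = static_cast<std::int8_t>(v);
    }
    return out;
}
```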
    • BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c
      Qianfeng authored
      * Refine the device batchnorm-backward base API templates and data type assignments
      
      * Remove duplicated kernel file
      
      * Add batchnorm backward instances and external API
      
      * Add batchnorm-backward profiler and tests
      
      * Add client example which uses batchnorm backward external API
      
      * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
      
      * Loosen the threshold for batchnorm-backward check_err()
      63af525c
  8. 29 Nov, 2022 2 commits
    • fix GetTypeString · 0e9c88ce
      fsx950223 authored
      0e9c88ce
    • BatchNorm backward implementation (#461) · 44789d99
      Qianfeng authored
      * Implemented batchnorm-backward Blockwise and Multiblock kernels
      
      * Add batchnorm-backward device op
      
      * Add batchnorm-backward host-reference op
      
      * Add batchnorm-backward example
      
      * Parameters renaming in batchnorm backward kernels and device op
      
      * Change in the example to loosen the threshold for ScaleDiff checking
      
      * Add comments to explain the implementation of batchnorm-backward
      
      * Parameters renaming again in batchnorm backward kernels
      
      * Improve the expression calculation for performance
      
      * Add batchnorm backward to README
      
      * Add comments to explain inv-variance in batchnorm forward and backward
      
      * Renaming the batchnorm forward training and inferring examples
      
      * Add/update the comments for batchnorm-backward kernels
      
      * Renaming again
      
      * Add block_sync_lds between two consecutive blockwise reductions
      
      * Move common expression 1/N out of the static_for loops
      
      * Add dy_elementwise_op
      
      * Renaming in backward example again
      
      * Add checking for reduceDims in reference_batchnorm_backward
      
      * Update to comments and codes format
      
      * Rename in the comments
      
      * Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c
      
      * Add block_sync_lds() between blockwise reduction again
      
      * Fix comments again
      
      * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
      44789d99
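The inv-variance mentioned in several of these commits is the 1/sqrt(var + eps) factor that batchnorm forward computes and backward reuses. A minimal numeric sketch, not CK code:

```cpp
#include <cmath>
#include <vector>

// The mean / inv-variance pair that batchnorm forward computes and backward
// reuses: inv_var = 1 / sqrt(var + eps). Illustrative, not CK's implementation.
struct BatchNormStats { double mean; double inv_var; };

BatchNormStats batchnorm_stats(const std::vector<double>& x, double eps = 1e-5)
{
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for(double xi : x) mean += xi;
    mean /= n;
    double var = 0.0; // biased variance, as batchnorm uses
    for(double xi : x) var += (xi - mean) * (xi - mean);
    var /= n;
    return {mean, 1.0 / std::sqrt(var + eps)};
}

// normalize, then scale and shift: y = (x - mean) * inv_var * scale + bias
std::vector<double> batchnorm_forward(const std::vector<double>& x,
                                      double scale, double bias, double eps = 1e-5)
{
    const BatchNormStats s = batchnorm_stats(x, eps);
    std::vector<double> y;
    for(double xi : x) y.push_back((xi - s.mean) * s.inv_var * scale + bias);
    return y;
}
```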
  9. 25 Nov, 2022 1 commit
    • BatchNorm forward instance/external API/profiler/tests/client example (#511) · 4e6a5575
      Qianfeng authored
      
      
      * Update to device_batchnorm_forward base class to include all template parameters for problem description
      
      * Add batchnorm forward instances and external api
      
      * Add batchnorm forward profiler module which uses the external api
      
      * Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
      
      * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
      
      * Improvement to the batchnorm infer base API
      
      * Add batchnorm forward client example which shows using the batchnorm forward external API
      
      * Add test for batchnorm forward
      
      * Tuning the batchnorm profiler initialized values and error threshold
      
      * Add support for bhalf_t in instances/external api/tests
      
      * Add support for int8_t in instances/external api/tests
      
      * Add support for double in instances/external api/tests
      
      * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
      
      * Checking before running best instance in batchnorm_fwd_nhwc client example
      
      * Add checking for YElementwiseOp in batchnorm_forward external API
      
      * Add more types in batchnorm forward profiler
      
      * Add more test lengths
      Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
      4e6a5575
  10. 20 Nov, 2022 1 commit
  11. 17 Nov, 2022 1 commit
  12. 15 Nov, 2022 4 commits
    • Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328
      guangzlu authored
      
      
      * fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
      
      * added bf16 tests for batched_gemm_softmax_gemm_permute
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * aligned annotations
      
      * modified CMakeLists for examples
      
      * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
      
      * use macro to control the instances
      
      * added macro control into instances
      
      * clang-format some files
      
      * changed error tolerance for bf16
      
      * changed index for 10_elementwise_normalization
      
      * fixed xdlops code bug in amd_xdlops.hpp
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      4c4c7328
    • Add Conv Backward Data on Navi21 for ResNet50 (#499) · db0eb1ea
      ltqin authored
      
      
      * start add example
      
      * add device dl
      
      * change launch kernel
      
      * change init data method
      
      * change example config
      
      * add config valid check
      
      * add instance for dl bwd
      
      * add instance to ckProfiler
      
      * reserver to profiler and cmakelist
      
      * add instance to ckProfiler2
      
      * change instance f32 config
      
      * fix example return value
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      db0eb1ea
    • Po Yen Chen · 7038723a
    • Introduce ck::accumulate_n() (#439) · 730204ee
      Po Yen Chen authored
      We can use this template to eliminate duplicated iterator computing
      logics. By providing return type to ck::accumulate_n(), we can avoid
      type conversion operations.
      730204ee
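A minimal sketch consistent with that description (count-based accumulation with the accumulator type fixed up front, so no implicit conversions occur); the actual `ck::accumulate_n()` signature may differ:

```cpp
// Sketch of an accumulate_n-style helper: sums `count` elements starting at
// `first`, with the accumulator type fixed by `init`'s type so each addition
// happens in T rather than in the element type.
template <typename InputIterator, typename Size, typename T>
constexpr T accumulate_n(InputIterator first, Size count, T init)
{
    for(Size i = 0; i < count; ++i, ++first)
    {
        init = init + static_cast<T>(*first);
    }
    return init;
}
```

For example, summing `int` lengths into a `long` total without narrowing at each step.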
  13. 11 Nov, 2022 1 commit
  14. 10 Nov, 2022 4 commits
    • add client example for elementwise_normalization (#501) · 70456328
      guangzlu authored
      * add client example for elementwise_normalization
      
      * clang format elementwise_layernorm2d.cpp
      
      * changed some naming to make it more understandable
      
      * changed naming of input into ab_input
      
      * fixed bug for threadwise_x_store
      
      * add elementwise operation to reference
      70456328
    • Add client example of grouped conv2d forward (data type: fp16) (#488) · f4980310
      Po Yen Chen authored
      * Rename example folder for GroupedConvFwdMultipleD
      
      * Unify example codes
      
      * Change target names
      
      * Add fp16 example for multiple d instance
      
      * Re-format common.hpp
      
      * Add interface 'DeviceGroupedConvFwd'
      
      * Use simpler interface
      
      * Move common conv params out
      
      * Rename conv fwd client example folder
      
      * Add missing include directive
      
      * Update grouped conv instance implementations
      
      * Simplify ckProfiler (grouped conv forward)
      
      * Use GroupedConvFwd to implement client example
      
      * Use greater group count in example
      
      * Add custom target to group examples
      
      * Add extra tag param to instance factory function
      
      * Use tag to differentiate factory functions
      
      * Add missing tag argument for factory function
      
      * Remove inheritance relationship
      
      * Remove no-longer used include directive
      
      * Add license in front of file
      f4980310
    • Add client example of grouped conv2d backward weight (data type: fp16) (#498) · 38470e04
      Po Yen Chen authored
      * Remove redundant CMake setting
      
      * Extract common code from files
      
      * Rename folder 'convnd' to 'conv'
      
      * Use std::array<> to accept compile-time known # of arguments
      
      * Fix compilation error of tuning parameter
      
      * In example, use same setting as unit-test
      
      * Remove no-longer used include directive
      
      * Add interface for grouped conv bwd weight
      
      * Add group support for conv bwd weight
      
      * Add grouped conv bwd weight example
      
      * Use group parameter in example
      
      * Rename example folder
      
      * Remove non-grouped version example source files
      
      * Rename device op template
      
      * Add group support to convolution backward weight
      
      * Remove debug messages
      
      * Use smaller group size in example
      
      * Use named variable as loop terminate condition
      
      * Prettify example output message
      
      * Enlarge used grid size
      
      * Allow real grid size to exceed expected grid size
      
      * Rename interface file
      
      * Add client example for group...
      38470e04
    • Remove interface 'DeviceGroupedConvBwdData' (#500) · 67423a22
      Po Yen Chen authored
      * Remove interface 'DeviceGroupedConvBwdData'
      
      * Remove no-longer needed include directive
      
      * Rename client example folder
      67423a22
  15. 03 Nov, 2022 1 commit
    • Fused elementwise normalization (#492) · 8a4253ba
      guangzlu authored
      * add fused addition layernorm
      
      * add fused addition layernorm
      
      * changed CMakelist
      
      * removed annotations
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete an annotation line
      
      * fixed some bugs to meet the requirements of CK
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved intermediate results into LDS; it becomes faster in non-sweep-once cases
      
      * changed naming of C into X to make the definition clearer
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confusing comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      8a4253ba
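The semantics of the fused op above amount to an elementwise combination followed by layernorm over the row; a sketch under that reading (names illustrative, not CK's API):

```cpp
#include <cmath>
#include <vector>

// Semantics sketch of the fused op: x = a + b elementwise, then layernorm over
// the row with per-element gamma/beta. Illustrative, not CK's implementation.
std::vector<double> fused_add_layernorm(const std::vector<double>& a,
                                        const std::vector<double>& b,
                                        const std::vector<double>& gamma,
                                        const std::vector<double>& beta,
                                        double eps = 1e-5)
{
    const std::size_t n = a.size();
    std::vector<double> x(n);
    for(std::size_t i = 0; i < n; ++i) x[i] = a[i] + b[i]; // elementwise part
    double mean = 0.0;
    for(double xi : x) mean += xi;
    mean /= static_cast<double>(n);
    double var = 0.0;
    for(double xi : x) var += (xi - mean) * (xi - mean);
    var /= static_cast<double>(n);
    const double inv_std = 1.0 / std::sqrt(var + eps);
    std::vector<double> y(n); // normalization part
    for(std::size_t i = 0; i < n; ++i)
        y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
    return y;
}
```

Fusing avoids writing the elementwise result x back to global memory before the normalization pass reads it, which is what the sweep-once commits above optimize.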
  16. 02 Nov, 2022 6 commits
    • Refine layernorm naming and test code (#497) · d4d1147f
      rocking5566 authored
      * Sync the naming
      
      * Sync the test of layernorm with groupnorm
      
      * Sync the naming
      
      * Minor change for comment and log
      
      * [What] Add saveMean and SaveInvVariance in the interface.
      [Why] These can optimize the backward
      d4d1147f
    • Anthony Chang
    • Add client example of grouped conv2d backward data (data type: fp16) (#481) · 9e57a290
      Po Yen Chen authored
      * Improve example reusability
      
      * Remove no-longer used file
      
      * Rename folder of grouped_conv_bwd_data example
      
      * Add normal grouped conv bwd example
      
      * Add interface 'DeviceGroupedConvBwdData'
      
      * Prettify comment of device op type arguments
      
      * Add grouped conv2d/conv3d backward data fp16 instances
      
      * Fix wrong template argument
      
      * Add grouped_conv2d_bwd_data client example
      
      * Use simpler expression to calculate memory size
      
      * Fix formatting
      
      * Remove grouped_conv3d_bw_data instances
      
      Underlying device operator is not ready to handle 3D input
      
      * Remove no-longer necessary include directive
      
      * Add missing include directive
      
      * Use more realistic conv param in example
      9e57a290
    • Add pipeline v1/v2 selector, add more instances (#381) · 1a0b0e7b
      Rostyslav Geyyer authored
      
      
      * Add gridwise gemm pipeline v1/v2 selector
      
      * Pipeline selector working, test-wise add pipeline options to one instance
      
      * Add gemm instances
      
      * Add debug info to DeviceGemmXdl
      
      * Add debug info to DeviceGemmXdl_CShuffle
      
      * Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu
      
      * Minor fix
      
      * Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm
      
      * set up inter-wave configuration
      
      * use default loop scheduling for supported gemm ops
      
      For blanket-applying interwave scheduling to all supported gemm ops, define the macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. This should be discouraged, though, as it is not covered by CI.
      
      * Add enum PipelineVersion
      
      * Update instances
      
      * Format
      
      * Fix the merge conflict
      
      * Add flags to disable added instances
      
      * Test disable flag check
      
      * Disable flag check
      
      * Enable the instances
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      1a0b0e7b
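The v1/v2 selection can be pictured as compile-time dispatch on an enum. `PipelineVersion` is named in the commit; the pipeline types and selector below are only schematic:

```cpp
// Schematic of an enum-based pipeline selector; PipelineVersion is named in
// the commit, everything else here is illustrative.
enum class PipelineVersion { v1, v2 };

struct GridwiseGemmPipeline_v1 { static constexpr int prefetch_stages = 1; };
struct GridwiseGemmPipeline_v2 { static constexpr int prefetch_stages = 2; };

// The selector resolves entirely at compile time, so the chosen pipeline can
// be plugged into a gridwise gemm as a template parameter.
template <PipelineVersion PV>
constexpr auto select_pipeline()
{
    if constexpr(PV == PipelineVersion::v1)
        return GridwiseGemmPipeline_v1{};
    else
        return GridwiseGemmPipeline_v2{};
}
```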
    • Softmax unit-test reduction across all and non-innermost dims cases. (#406) · 6d8614ee
      Adam Osewski authored
      
      
      * Add reduction across all dims cases.
      
      * host softmax: handle all reduce
      
      * Test cases when reduced dim is not innermost axis.
      
      * Fix syntax.
      
      * Test non innermost dim for fp32 and int8
      
      * Group test suites wrt NumReduceDim.
      
      * Additionally test failing cases.
      
      * Throw error when Rank or NumReduceDims doesn't match arguments.
      
      * Check reducedDims has correct values
      
      * Don't reuse DeviceReduceMultiblock's IsSupportedArgument method;
      instead implement our own (in fact just get rid of one check to enable
      reduction across inner dimensions).
      
      * Reorganize unit tests to better cover use scenarios.
      
      * Test input validation
      * Test reduction of inner dimensions with custom op instances.
      
      * Refactor fp32 and int8 unit tests.
      
      * Fix FP32 instance template parameters.
      
      * Add more instances.
      
      * Instances with InSrcVectorDim=0.
      
      * Do not initialize and copy data when arg not supported.
      
      * ckProfiler Softmax use instance factory.
      
      * Refactor device softmax IsSupported.
      
      * Additionally add non-polymorphic api functions
      
      * Split softmax instances into multiple files.
      
      * Fix profiler.
      
      * Reorganize tests to reuse profiler and cover edge cases.
      
      * Clang-format
      
      * I8 Softmax instances along with UT.
      
      * Reuse type alias definitions from instance factory header.
      
      * Clean included headers
      
      * Fix variable names.
      
      * Add missing checks in Argument constructor.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      Co-authored-by: Anthony Chang <ac.chang@outlook.com>
      6d8614ee
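The cases these tests cover reduce to applying softmax along an axis that need not be the innermost one; a plain host-side reference for a 2D input (illustrative, not the CK host reference):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

std::vector<double> softmax_1d(std::vector<double> v)
{
    const double m = *std::max_element(v.begin(), v.end()); // stability shift
    double sum = 0.0;
    for(double& vi : v) { vi = std::exp(vi - m); sum += vi; }
    for(double& vi : v) vi /= sum;
    return v;
}

// Reference softmax over either axis of a 2D input; axis 1 is the innermost
// dimension, axis 0 is not.
Matrix softmax_2d(const Matrix& x, int axis)
{
    const std::size_t rows = x.size(), cols = x[0].size();
    Matrix y(rows, std::vector<double>(cols));
    if(axis == 1)
    {
        for(std::size_t r = 0; r < rows; ++r) y[r] = softmax_1d(x[r]);
    }
    else // axis == 0: gather each column, reduce it, scatter back
    {
        for(std::size_t c = 0; c < cols; ++c)
        {
            std::vector<double> col(rows);
            for(std::size_t r = 0; r < rows; ++r) col[r] = x[r][c];
            col = softmax_1d(col);
            for(std::size_t r = 0; r < rows; ++r) y[r][c] = col[r];
        }
    }
    return y;
}
```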
    • Conv perlayer int8 quantization (#471) · 226bc02b
      rocking5566 authored
      * Add conv2d requant example
      
      * Fix bash error
      
      * Rename example
      
      * 1. Rename gemm quantization
      2. shares the requantization lambda function with conv
      
      * Refine declare type
      
      * Add conv bias relu quantization example
      
      * clang format
      
      * Fix compile error due to merge develop
      
      * Fix CI error
      
      * Extract quantization post operation into another file
      
      * Support quantization for non piecewise linear function
      
      * Add instance for conv quantization
      
      * Add convolution quantization factory
      
      * Add convolution quantization client example
      
      * Add more instances with different template parameters
      
      * clang format
      
      * Sync the naming with the develop
      226bc02b
  17. 31 Oct, 2022 1 commit
    • Add Conv Forward on Navi21 for ResNet50 (#490) · 8ee36118
      ltqin authored
      
      
      * add device of dl
      
      * fix k1 of GridwiseGemmDl_km_kn_mn_v1r3
      
      * init version for dl conv
      
      * add example(init)
      
      * result right
      
      * disable elementwise operation
      
      * check parameters
      
      * add fp32,int8 example and change check code
      
      * change device file and class name
      
      * add check vector access of C
      
      * add instance
      
      * add to ckProfiler
      
      * add Filter1x1Pad0 instances
      
      * fix ignore error
      
      * fix for CI
      Co-authored-by: letaoqin <letaoqin@amd.com>
      8ee36118
  18. 28 Oct, 2022 1 commit
    • Batchnorm-forward implemented using Welford method to calculate variance (#403) · 7fa892e6
      Qianfeng authored
      
      
      * Update to the batchnorm-forward API and base class
      
      * Fix leaked header inclusion in gridwise_set_buffer_value.hpp
      
      * Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction
      
      * Update to the batchnorm-forward example to use the new batchnorm-forward device interface
      
      * Change the batchnorm-forward reference to use sequential welford method
      
      * Change to assign the workspace into four buffers in the host layer
      
      * Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford
      
      * Tiny correction and remove un-used file under example/34_batchnorm
      
      * Renaming in the kernel arguments
      
      * Explicitly use ck::math::sqrt in batchnorm-forward kernels
      
      * Add some comments to some kernels
      
      * Tiny fix
      
      * Generalize the data types in reference_batchnorm_forward_nhwc_c
      
      * Use ck::ignore to mark un-used parameters
      
      * Move GetReduceCountPerThread functor codes from kernel to device
      
      * Remove some un-used codes in device_batchnorm_forward_impl.hpp
      
      * Tiny fix in batchnorm_forward example
      
      * Move GetReduceCountPerThread() to welford_helper.hpp
      
      * Use seperate data type for Scale and Bias
      
      * Renaming in device Op
      
      * Tiny fix in forward example
      
      * Update to batchnorm-infer (type splitting, renaming)
      
      * Add time and bandwidth measurement to the batchnorm-forward example
      
      * Add support of elementwise operation for batchnorm forward output
      
      * Reduce object copying by passing object as reference type
      
      * Tiny change for performance
      
      * Updates for performance again
      
      * Some Renamings
      
      * Add GetActualVariance template parameter for ThreadwiseWelfordMerge
      
      * Tiny update in reference batchnorm forward nhwc/c
      
      * Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory
      
      * Fuse mean and bias in the normalization calculation
      Co-authored-by: root <root@dc-smc-18.amd.com>
      Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
      7fa892e6
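Welford's method, named in the title, produces mean and variance in one pass with a running update; the kernels above additionally merge per-thread and per-block partial results, which this sequential sketch omits:

```cpp
#include <utility>

// One-pass mean/variance via Welford's update:
//   mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
//   M2_k   = M2_{k-1} + (x_k - mean_{k-1}) * (x_k - mean_k)
// Sequential sketch only; blockwise/multiblock merging of partial results,
// which the device kernels perform, is omitted here.
std::pair<double, double> welford_mean_var(const double* xs, int n)
{
    double mean = 0.0, m2 = 0.0;
    for(int k = 1; k <= n; ++k)
    {
        const double delta = xs[k - 1] - mean;
        mean += delta / k;
        m2 += delta * (xs[k - 1] - mean);
    }
    const double variance = (n > 0) ? m2 / n : 0.0; // biased, as batchnorm uses
    return {mean, variance};
}
```

Compared with the naive sum-of-squares formula, this update avoids the catastrophic cancellation that occurs when the mean is large relative to the variance.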
  19. 27 Oct, 2022 1 commit
    • Input/output permutation for fused attention (#460) · de37550f
      Anthony Chang authored
      
      
      * reopen masking attention instance since CI is upgraded
      
      * re-enable instances previously failed on 9110
      
      * enable ksize-kpadding pair validity test
      
      * add non-masked attention+permute test; expose masking boolean to attention kernel handles
      
      * disable bench
      
      * fix test
      
      * move files
      
      * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
      
      * format
      
      * amend rename
      
      * disable bench in test
      
      * add mask/no-mask test for non-permute attention kernels
      
      * disable broken kernel instance
      
      * example working
      
      add non-permuted problem statement
      
      evaluating whether overhead comes from permutation or the extra kernel arg
      
      * interface for bias addition without implementing it
      
      * test and profiler running
      
      * tidy
      
      * mask type determined by enum class
      
      * unify example code
      
      * move masking specialization to its own header
      
      * align formats
      
      * extract helper functions
      
      * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
      
      * add tensor specialization to template args
      
      since tensor spec packed shows perf parity when permutation isn't needed
      
      remove redundant template args
      
      comment on 'packed' tensor specialization
      
      * grouped attention with input/output permute example
      
      * format
      
      * clean up
      
      * refactor acc0 tile visitor
      Co-authored-by: shaojiewang <wsjmessi@163.com>
      Co-authored-by: Chao Liu <chao.liu2@amd.com>
      de37550f
  20. 25 Oct, 2022 3 commits
    • Update to the Reduction API and instances (#476) · dda3a0a1
      Qianfeng authored
      * Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances
      
      * Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces
      
      * Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock
      
      * Split the cpp and hpp files for reduction instances to enable more parallel compiling
      
      * Remove the using of macros for declaring reduction instances and instance references
      
      * Update to add_device_reduce_instance_xxxx templated functions
      
      * Use ReduceOperation+InElementwiseOp+AccElementwiseOp to replace the ReduceOpId in defining add_reduce_instance_xxxx() templates
      
      * Change return format
      dda3a0a1
    • Revert "Fused elementwise layernorm (#468)" (#491) · 6ea9257e
      guangzlu authored
      This reverts commit efbcc6ed.
      6ea9257e
    • Fused elementwise layernorm (#468) · efbcc6ed
      guangzlu authored
      * add fused addition layernorm
      
      * add fused addition layernorm
      
      * changed CMakelist
      
      * removed annotations
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete an annotation line
      
      * fixed some bugs to meet the requirements of CK
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved intermediate results into LDS; it becomes faster in non-sweep-once cases
      
      * changed naming of C into X to make the definition clearer
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confusing comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      efbcc6ed
  21. 13 Oct, 2022 1 commit
    • Refactor device op implementations into `impl` subdirectory. (#420) · 30480288
      Adam Osewski authored
      
      
      * Move kernel implementation files under impl directory.
      
      * Update examples paths.
      
      * Update device kernel impl include paths.
      
      * Update tensor operation instances include paths.
      
      * Update profiler and tests include paths.
      
      * Clang-format
      
      * Update include paths for batched gemm reduce
      
      * Refactor UnitTest ConvNDBwdWeight.
      
      * Refactor fwd and bwd data convND UT.
      
      * Fix used test macro.
      
      * Fix include path.
      
      * Fix include paths.
      
      * Fix include paths in profiler and tests.
      
      * Fix include paths.
      Co-authored-by: Adam Osewski <aosewski@amd.com>
      30480288