"...resnet50_tensorflow.git" did not exist on "28e531fc0dc3d73f36627cf894bda3b17381168a"
  1. 15 May, 2023 1 commit
    • Bartłomiej Kocot's avatar
      Add contraction profiler and tests (#701) · 642d5e91
      Bartłomiej Kocot authored
      * Add contraction profiler and tests
      
      * Build and style fixes
      
      * Allow to use any elementwise operator for ref_contraction
      
      * Introduce profile_contraction_scale and profile_contraction_bilinear
      
      * Make ref_contraction generic and extend interface tests
      
      * Stylistic minor fixes
      
      * Extend test_contraction_interface
      642d5e91
  2. 22 Mar, 2023 1 commit
  3. 15 Feb, 2023 1 commit
    • rocking5566's avatar
      Improve normalization (#580) · 6a6163a3
      rocking5566 authored
      * Sync the order of type string with template parameter
      
      * Add more instances
      
      * Check the vector size and remove redundant var
      
      * Extract var to static, prepare to separate sweep once kernel
      
      * Separate sweeponce flow and optimize the flow
      
      * 1. Rename AccDatatype in normalization to computeData
      2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
      
      * Remove useless code
      
      * Update naive variance kernel
      
      * Refine string
      
      * Fix typo
      
      * Support naive variance for device_normalization
      
      * Check the blocksize
      
      * Share the VGPR of x and y
      
      * Share the VGPR of gamma and beta
      
      * Add more instances
      
      * Support fp16 sqrt for experiment
      
      * Add CHANGELOG
      
      * Fix typo
      
      * clang-format
      6a6163a3
  4. 10 Feb, 2023 1 commit
  5. 09 Feb, 2023 2 commits
    • rocking5566's avatar
      Gemm+layernorm instance, ckProfiler, client example (#568) · f7d28f3e
      rocking5566 authored
      * Add gemm + layernorm instance
      
      * Add ckProfiler
      
      * Add test
      
      * Add client example
      
      * Detect if user forger to set the workrspace
      
      * Use literal in the example
      
      * [What] use builtin function for sqrt
      [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
      
      * check gemm vaildity in IsSupportedArgument
      
      * Add more testcases
      
      * Merge duplicated folder in client example
      
      * Print more infomation
      
      * Use better kernel parameter for MS problem size
      
      * clang format
      
      * Add constexpr for if condition and remove redundant include
      
      * Remove cstdlib and add constexpr
      f7d28f3e
    • guangzlu's avatar
      Add instance for elementwise normlization (#573) · 76d144fa
      guangzlu authored
      * added instances for large N
      
      * add instance for elementwise normlization
      
      * added supported restrict in device_elementwise_normalization_impl.hpp
      76d144fa
  6. 08 Feb, 2023 1 commit
    • ltqin's avatar
      Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576) · 332ccc33
      ltqin authored
      * add instance for gemm bias softmax gemm
      
      * add client example
      
      * change CGridDesc_G_M_N to CGridDesc_G_M_O
      
      * add gridwise
      
      * change c grid name
      
      * device add d0s data
      
      * fix 08 client_example
      
      * add example 47_fused_attention
      
      * example output correct
      
      * add d0 to example
      
      * add d0 element op
      
      * rechange instance code
      
      * change Acc0ElementwiseOperation to C0DEElementwiseOperation
      
      * change example name
      
      * update instance for cdeelementwiseop
      
      * add bhalf_t ScaleAdd
      
      * add test
      
      * not surport geem1 bias
      
      * remove some ignore
      
      * fix test bug
      332ccc33
  7. 30 Jan, 2023 1 commit
  8. 25 Jan, 2023 1 commit
    • Qianfeng's avatar
      Batchnorm inference instances, external API, client examples and gtests (#531) · a1b2441f
      Qianfeng authored
      * File renaming and class renaming for device element-wise operation
      
      * Add batchnorm-infer instances, external API and client example
      
      * Add batchnorm-infer profiler module and gtests
      
      * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
      
      * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
      
      * Rename class and file due to conflict from device_elementwise_2d.hpp
      
      * Fix namespace in batcnnorm_infer_nhwc client example
      a1b2441f
  9. 18 Jan, 2023 1 commit
    • Raman R jana's avatar
      Wavelet (inter-wave consumer-producer) GEMM (#310) · 1cfa8760
      Raman R jana authored
      
      
      * wavelet gemm programming model support for CK
      
      * GEMM pipeline update for wavelet progrmmaing model
      
      * Updated wavelet programming pipeline
      
      * fixes for global-write for math-wave
      
      * fixed bug in global writes
      
      * Updated comments for better readability
      
      * fixed clang format errors
      
      * added block_lds without barrier sync
      
      * clean
      
      * clean
      
      * clean
      
      * clean
      
      * refactor
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Add comment depicting B2C mapping pattern.
      
      * Formatting & comments.
      
      * Comparison with custom B2C mapping pattern.
      
      * Example for wavelet gemm.
      
      * Add wavelet to Gemm standalone test.
      
      * Remove debug code.
      
      * Remove dangling #endif directive.
      
      Co-authored-by: root <Raman Jana>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      Co-authored-by: default avatarAdam Osewski <19374865+aosewski@users.noreply.github.com>
      1cfa8760
  10. 17 Jan, 2023 1 commit
    • Haocong WANG's avatar
      [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f
      Haocong WANG authored
      * wmma_op + unit test
      
      * add arch limitation to wmma test
      
      * change arch limitation
      
      * Refactor + Add all type unit test(int4 compile failed)
      
      * Add f32_16x16x16_bf16 unit test
      
      * tempsave
      
      * tempsave
      
      * tempsave
      
      * runtime bug, cannot find symbol
      
      * workaround for incorrect HIP warpSize return value
      
      * debugging
      
      * tempsave
      
      * Correctness OK, waiting for optimization
      
      * Tidy up + format
      
      * temp save
      
      * temp save, reproduce the v_bfi_b32 issue
      
      * add inline asm for wmmaop test
      
      * tidy up
      
      * clean some debug purpose code
      
      * discard some codes
      
      * clang format
      
      * clang format
      
      * compiler issue fixed + increase tile size
      919aeb1f
  11. 15 Dec, 2022 1 commit
  12. 07 Dec, 2022 1 commit
  13. 02 Dec, 2022 1 commit
  14. 01 Dec, 2022 1 commit
    • Po Yen Chen's avatar
      Modularize ckProfiler operations (#514) · 8784a72e
      Po Yen Chen authored
      
      
      * Re-structure ckProfiler source files
      
      * Rename profiler.cpp to main.cpp
      
      * Modularize ckProfiler operations
      
      * Add description for profiler operations
      
      * Use longer name to avoid name collision
      
      * Use macro to delay expansion
      
      * Use std::move() to avoid object copying
      
      * Prohibit users from calling dtor
      
      * Use macro to eliminate redundant code
      
      * Make friend function hidden
      
      * Add missing include directive <iostream>
      
      * Fix wrong include directives
      
      * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
      Co-authored-by: default avatarQianfeng Zhang <Qianfeng.Zhang@amd.com>
      8784a72e
  15. 30 Nov, 2022 1 commit
    • Qianfeng's avatar
      BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c
      Qianfeng authored
      * Refine the device batchnorm-backward base API templates and data type assignments
      
      * Remove duplicated kernel file
      
      * Add batchnorm backward instances and external API
      
      * Add batchnorm-backward profiler and tests
      
      * Add client example which uses batchnorm backward external API
      
      * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
      
      * Loose the threshold for batchnorm-backward check_err()
      63af525c
  16. 29 Nov, 2022 1 commit
  17. 28 Nov, 2022 1 commit
  18. 25 Nov, 2022 1 commit
    • Qianfeng's avatar
      BatchNorm forward instance/external api/profiler/tests/client example (#511) · 4e6a5575
      Qianfeng authored
      
      
      * Update to device_batchnorm_forward base class to include all template parameters for problem description
      
      * Add batchnorm forward instances and external api
      
      * Add batchnorm forward profiler module which uses the external api
      
      * Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
      
      * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
      
      * Improvement to the batchnorm infer base API
      
      * Add batchnorm forward client example which shows using the batchnorm forward external API
      
      * Add test for batchnorm forward
      
      * Tuning the batchnorm profiler initialized values and error threshold
      
      * Add support for bhalf_t in instances/external api/tests
      
      * Add support for int8_t in instances/external api/tests
      
      * Add support for double in instances/external api/tests
      
      * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
      
      * Checking before running best instance in batchnorm_fwd_nhwc client example
      
      * Add checking for YElementwiseOp in batchnorm_forward external API
      
      * Add more types in batchnorm forward profiler
      
      * Add more test lengths
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      4e6a5575
  19. 17 Nov, 2022 1 commit
  20. 15 Nov, 2022 1 commit
    • guangzlu's avatar
      Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328
      guangzlu authored
      
      
      * fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
      
      * added bf16 tests for batched_gemm_softmax_gemm_permute
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * aligned annotations
      
      * modified CMakeLists for examples
      
      * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
      
      * use macro to control the instances
      
      * added macro control into instances
      
      * clang-format some files
      
      * changed error tolerance for bf16
      
      * changed index for 10_elementwise_normalization
      
      * fixed xdlops code bug in amd_xdlops.hpp
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      4c4c7328
  21. 11 Nov, 2022 1 commit
    • Po Yen Chen's avatar
      Rangify constructor of HostTensorDescriptor & Tensor<> (#445) · 4a2a56c2
      Po Yen Chen authored
      * Rangify STL algorithms
      
      This commit adapts rangified std::copy(), std::fill() & std::transform()
      
      * Rangify check_err()
      
      By rangifying check_err(), we can not only compare values between
      std::vector<>s, but also compare any ranges which have same value
      type.
      
      * Allow constructing Tensor<> like a HostTensorDescriptor
      
      * Simplify Tensor<> object construction logics
      
      * Remove more unnecessary 'HostTensorDescriptor' objects
      
      * Re-format example code
      
      * Re-write more HostTensorDescriptor ctor call
      4a2a56c2
  22. 10 Nov, 2022 1 commit
    • Po Yen Chen's avatar
      Add client example of grouped conv2d backward weight (data type: fp16) (#498) · 38470e04
      Po Yen Chen authored
      * Remove redundant CMake setting
      
      * Extract common code from files
      
      * Rename folder 'convnd' to 'conv'
      
      * Use std::array<> to accept compile-time kwnown # of arguments
      
      * Fix compilation error of tuning parameter
      
      * In example, use same setting as unit-test
      
      * Remove no-longer used include directive
      
      * Add interface for grouped conv bwd weight
      
      * Add group support for conv bwd weight
      
      * Add grouped conv bwd weight example
      
      * Use group parameter in example
      
      * Rename example folder
      
      * Remove non-grouped version example source files
      
      * Rename device op template
      
      * Add group support to convolution backward weight
      
      * Remove debug messages
      
      * Use smaller group size in example
      
      * Use named variable as loop terminate condition
      
      * Prettify example output message
      
      * Enlarge used grid size
      
      * Allow real grid size exceeds expected grid size
      
      * Rename interface file
      
      * Add client example for group...
      38470e04
  23. 03 Nov, 2022 1 commit
    • guangzlu's avatar
      Fused elementwise normalization (#492) · 8a4253ba
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      8a4253ba
  24. 02 Nov, 2022 3 commits
    • Anthony Chang's avatar
      Disable gtest discovery to run tests per-program not per-case (#432) · 79aa3fb1
      Anthony Chang authored
      * disable gtest discovery to run tests per-program not per-case
      
      * register cmake target to ctest
      79aa3fb1
    • rocking5566's avatar
      Refine layernorm naming and test code (#497) · d4d1147f
      rocking5566 authored
      * Sync the naming
      
      * Sync the test of layernorm with groupnorm
      
      * Sync the naming
      
      * Minor change for comment and log
      
      * [What] Add saveMean and SaveInvVariance in the interface.
      [Why] These can optimize the backward
      d4d1147f
    • Adam Osewski's avatar
      Softmax unit-test reduction across all and non innermost dims cases. (#406) · 6d8614ee
      Adam Osewski authored
      
      
      * Add reduction across all dims cases.
      
      * host softmax: handle all reduce
      
      * Test cases when reduced dim is not innermost axis.
      
      * Fix syntax.
      
      * Test non innermost dim for fp32 and int8
      
      * Group test suites wrt NumReduceDim.
      
      * Additionally test failing cases.
      
      * Throw error when Rank or NumReduceDims doesn't match arguments.
      
      * Check reducedDims has correct values
      
      * Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
      Instead implement own. (in fact just get rid of one check to enable
      reduction across inner dimensions).
      
      * Reorganize unit tests to better cover use scenarios.
      
      * Test input validation
      * Test reduction of inner dimensions with custom op instances.
      
      * Refactor fp32 and int8 unit tests.
      
      * Fix FP32 instance template parameters.
      
      * Add more instances.
      
      * Instances with InSrcVectorDim=0.
      
      * Do not initialize and copy data when arg not supported.
      
      * ckProfiler Softmax use instance factory.
      
      * Refactor device softmax IsSupported.
      
      * Additionally add non-polymorphic api functions
      
      * Split softmax instances into multiple files.
      
      * Fix profiler.
      
      * Reorganize tests to reuse profiler and cover edge cases.
      
      * Clang-format
      
      * I8 Softmax instances along with UT.
      
      * Reuse type alias definitions from instance factory header.
      
      * Clean included headers
      
      * Fix variable names.
      
      * Add missing checks in Argument constructor.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      6d8614ee
  25. 28 Oct, 2022 2 commits
  26. 27 Oct, 2022 3 commits
    • Anthony Chang's avatar
      Input/output permutation for fused attention (#460) · de37550f
      Anthony Chang authored
      
      
      * reopen masking att instance due to CI is upgraded
      
      * re-enable instances previously failed on 9110
      
      * enable ksize-kpadding pair validity test
      
      * add non-masked attention+permute test; expose masking boolean to attention kernel handles
      
      * disable bench
      
      * fix test
      
      * move files
      
      * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
      
      * format
      
      * amend rename
      
      * disable bench in test
      
      * add mask/no-mask test for non-permute attention kernels
      
      * disable broken kernel instance
      
      * example working
      
      add non-permuted problem statement
      
      evaluating whether overhead comes from permutation or the extra kernel arg
      
      * interface for bias addition without implementing it
      
      * test and profiler running
      
      * tidy
      
      * mask type determined by enum class
      
      * unify example code
      
      * move masking specialization to its own header
      
      * align formats
      
      * extract helper functions
      
      * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
      
      * add tensor specialization to template args
      
      since tensor spec packed shows perf parity when permutation isn't needed
      
      remove redundant template args
      
      comment on 'packed' tensor specialization
      
      * grouped attention with input/output permute example
      
      * format
      
      * clean up
      
      * refactor acc0 tile visitor
      Co-authored-by: wangshaojie6's avatarshaojiewang <wsjmessi@163.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      de37550f
    • Rostyslav Geyyer's avatar
      Fix Batched Gemm op for int8 data (#482) · cd517326
      Rostyslav Geyyer authored
      * Fix for lwpck-425, update BlockTransferSrcVectorDim
      
      * Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"
      
      This reverts commit fd24e280e28ff238b452cfdde58a988affd46461.
      
      * Add Batched Gemm int8 test, expect it to fail
      
      * Format
      
      * Re-add the fix
      cd517326
    • Anthony Chang's avatar
      Gemm standalone bench executable (#480) · 57106048
      Anthony Chang authored
      
      
      * prototype
      
      4 layouts
      
      fix default stride
      
      all problem sizes
      
      tidy
      
      move file
      
      update build script
      
      restore old file
      
      fix build
      
      * refactor standalone test to use gemm test harness
      
      * simplify gemm test
      
      * update build script
      
      * remove redundant
      
      * early return when cmd arg doesn't match
      
      * tidy
      
      * report failure when result not validated
      
      * tidy
      
      * Apply suggestions from code review
      Co-authored-by: default avatarAdam Osewski <19374865+aosewski@users.noreply.github.com>
      Co-authored-by: default avatarAdam Osewski <19374865+aosewski@users.noreply.github.com>
      57106048
  27. 25 Oct, 2022 2 commits
    • guangzlu's avatar
      Revert "Fused elementwise layernorm (#468)" (#491) · 6ea9257e
      guangzlu authored
      This reverts commit efbcc6ed.
      6ea9257e
    • guangzlu's avatar
      Fused elementwise layernorm (#468) · efbcc6ed
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      efbcc6ed
  28. 13 Oct, 2022 2 commits
    • Adam Osewski's avatar
      Refactor device op implementations into `impl` subdirectory. (#420) · 30480288
      Adam Osewski authored
      
      
      * Move kernel implementation files under impl directory.
      
      * Update examples paths.
      
      * Update device kernel impl include paths.
      
      * Update tensor operation instances include paths.
      
      * Update profiler and tests include paths.
      
      * Clang-format
      
      * Update include paths for batched gemm reduce
      
      * Refactor UnitTest ConvNDBwdWeight.
      
      * Refactor fwd and bwd data convND UT.
      
      * Fix used test macro.
      
      * Fix include path.
      
      * Fix include paths.
      
      * Fix include paths in profiler and tests.
      
      * Fix include paths.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      30480288
    • rocking5566's avatar
      Fix bug of layernorm ckProfiler and refine code (#448) · 1b62bfaa
      rocking5566 authored
      * Fix bug of profiler for layernorm
      
      * 1. Rename layernorm into normalization
      2. Decouple softmax from normalization
      
      * clang-format
      1b62bfaa
  29. 07 Oct, 2022 1 commit
    • Shaojie WANG's avatar
      Optimization for gridwise group norm (#453) · 40942b90
      Shaojie WANG authored
      
      
      * use another instance to check the efficiency
      
      * optimize group layer norm
      
      * 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and divison into a outer static loop
      
      * add more instances to layernorm
      
      * add 2 more test cases
      
      * remove ignore in generating tuple of vector
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      40942b90
  30. 20 Sep, 2022 3 commits
    • Shaojie WANG's avatar
      MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425) · ebab84b6
      Shaojie WANG authored
      
      
      * add lower triangle bmm
      
      * init code for tile skipping
      
      * functionality right with lower triangle mask
      
      * add decoder lower triangular mask calculation
      
      * use 7*13 group
      
      * fix n2 compute error
      
      * attention with lower triangle mask with tile skipping
      
      * add template to distinguish masking kernel
      
      * rename template and remove default template value
      
      * remove lower triangle gemm reference struct
      
      * add some comments on example
      
      * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels
      
      * add test
      
      * add test file
      
      * add gtest for bmm masking scale softmax bmm permute
      
      * clang-format
      
      * fix compile error
      
      * check lef bottom corner for tile skipping
      
      * fix error: check left bottom corner for tile skipping
      
      * add k padding
      
      * add test and instance for MNK padding
      
      * passing a mask struct
      
      * fix instances
      
      * delete used comments
      
      * format
      Co-authored-by: default avatardanyao12 <yaodan@dc-smc-13.amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      ebab84b6
    • rocking5566's avatar
      Group norm (#417) · 4eba345f
      rocking5566 authored
      
      
      * Add groupnorm example by layernorm
      1.  Reference is not ready
      2. shape of gamma and beta need to be fix
      
      * Let shape of gamma and beta can be same as x
      
      * Modify test, instance and client example
      
      * [What] Fix bug of layernorm for greater than 2 dimension.
      [Why] We need to get upper length from merge transform instead of embed transform.
      
      * Add reference for groupnorm
      
      * Fuse sigmoid after groupnorm
      
      * [What] Rename original layernorm into layernorm2d
      [Why] Prepare to add groupnorm using layernorm5d
      
      * clang-format
      
      * Add groupnorm test
      
      * Refine error message
      
      * Add groupnorm ckProfiler
      
      * Test groupnorm kernel from device_instance
      
      * update example
      
      * upadte profiler
      
      * Fix test naming
      
      * Fix argc number
      
      * Move descriptor and sweeponce to argument for quick debugging
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      4eba345f
    • Anthony Chang's avatar
      Add batched attention special kernel instances (#424) · 7c788e10
      Anthony Chang authored
      * sanity check
      
      * add attribution
      
      * add irrgular k tile size for batched attention
      
      * format
      7c788e10