1. 02 Dec, 2022 2 commits
    • Anthony Chang's avatar
      Fix bug where scaling may not be applied in some code path (#526) · d1567094
      Anthony Chang authored
      * fix bug where scaling may not be applied in some code path
      
      * more test
      
      * revert accidental example code changes
      d1567094
    • ltqin's avatar
      Add multiple d gridwise gemm on Navi21 for ResNet50 (#517) · 23ecf0fa
      ltqin authored
      
      
      * start add example
      
      * add multiple d fp16 example
      
      * device transfer elementwiseop to gridwise
      
      * gridwise add multiple d
      
      * change example for multiple d
      
      * fix spill registers
      
      * fix for passthrough element op
      
      * fix int8 overflow
      
      * change example file name
      
      * add instance for dl multiple d
      
      * example add DsDataType
      
      * remove grouped_convolution_forward_dl.hpp
      
      * add head file(was deleted before)
      
      * fix not support device issue
      
      * format
      
      * remove passthrough check
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      23ecf0fa
  2. 01 Dec, 2022 1 commit
    • Po Yen Chen's avatar
      Modularize ckProfiler operations (#514) · 8784a72e
      Po Yen Chen authored
      
      
      * Re-structure ckProfiler source files
      
      * Rename profiler.cpp to main.cpp
      
      * Modularize ckProfiler operations
      
      * Add description for profiler operations
      
      * Use longer name to avoid name collision
      
      * Use macro to delay expansion
      
      * Use std::move() to avoid object copying
      
      * Prohibit users from calling dtor
      
      * Use macro to eliminate redundant code
      
      * Make friend function hidden
      
      * Add missing include directive <iostream>
      
      * Fix wrong include directives
      
      * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
      Co-authored-by: default avatarQianfeng Zhang <Qianfeng.Zhang@amd.com>
      8784a72e
  3. 30 Nov, 2022 1 commit
    • Qianfeng's avatar
      BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c
      Qianfeng authored
      * Refine the device batchnorm-backward base API templates and data type assignments
      
      * Remove duplicated kernel file
      
      * Add batchnorm backward instances and external API
      
      * Add batchnorm-backward profiler and tests
      
      * Add client example which uses batchnorm backward external API
      
      * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
      
      * Loose the threshold for batchnorm-backward check_err()
      63af525c
  4. 28 Nov, 2022 1 commit
  5. 25 Nov, 2022 1 commit
    • Qianfeng's avatar
      BatchNorm forward instance/external api/profiler/tests/client example (#511) · 4e6a5575
      Qianfeng authored
      
      
      * Update to device_batchnorm_forward base class to include all template parameters for problem description
      
      * Add batchnorm forward instances and external api
      
      * Add batchnorm forward profiler module which uses the external api
      
      * Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
      
      * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
      
      * Improvement to the batchnorm infer base API
      
      * Add batchnorm forward client example which shows using the batchnorm forward external API
      
      * Add test for batchnorm forward
      
      * Tuning the batchnorm profiler initialized values and error threshold
      
      * Add support for bhalf_t in instances/external api/tests
      
      * Add support for int8_t in instances/external api/tests
      
      * Add support for double in instances/external api/tests
      
      * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
      
      * Checking before running best instance in batchnorm_fwd_nhwc client example
      
      * Add checking for YElementwiseOp in batchnorm_forward external API
      
      * Add more types in batchnorm forward profiler
      
      * Add more test lengths
      Co-authored-by: default avatarrocking5566 <ChunYu.Lai@amd.com>
      4e6a5575
  6. 15 Nov, 2022 1 commit
    • guangzlu's avatar
      Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328
      guangzlu authored
      
      
      * fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
      
      * added bf16 tests for batched_gemm_softmax_gemm_permute
      
      * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
      
      * aligned annotations
      
      * modified CMakeLists for examples
      
      * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
      
      * use macro to control the instances
      
      * added macro control into instances
      
      * clang-format some files
      
      * changed error tolerance for bf16
      
      * changed index for 10_elementwise_normalization
      
      * fixed xdlops code bug in amd_xdlops.hpp
      Co-authored-by: default avatarPo Yen Chen <PoYen.Chen@amd.com>
      4c4c7328
  7. 14 Nov, 2022 1 commit
    • Po Yen Chen's avatar
      Rangify STL algorithms (#438) · dc663fae
      Po Yen Chen authored
      * Rangify STL algorithms
      
      This commit adapts rangified std::copy(), std::fill() & std::transform()
      
      * Re-write more std::copy() calls
      
      * Re-write std::copy() calls in profiler
      dc663fae
  8. 11 Nov, 2022 1 commit
    • Po Yen Chen's avatar
      Rangify constructor of HostTensorDescriptor & Tensor<> (#445) · 4a2a56c2
      Po Yen Chen authored
      * Rangify STL algorithms
      
      This commit adapts rangified std::copy(), std::fill() & std::transform()
      
      * Rangify check_err()
      
      By rangifying check_err(), we can not only compare values between
      std::vector<>s, but also compare any ranges which have same value
      type.
      
      * Allow constructing Tensor<> like a HostTensorDescriptor
      
      * Simplify Tensor<> object construction logics
      
      * Remove more unnecessary 'HostTensorDescriptor' objects
      
      * Re-format example code
      
      * Re-write more HostTensorDescriptor ctor call
      4a2a56c2
  9. 10 Nov, 2022 3 commits
    • Lauren Wrubleski's avatar
      Add packages for examples and profiler (#502) · 37f2e918
      Lauren Wrubleski authored
      * Add packages for example and profiler
      
      * correct TEST_NAME -> EXAMPLE_NAME
      37f2e918
    • Po Yen Chen's avatar
      Add client example of grouped conv2d forward (data type: fp16) (#488) · f4980310
      Po Yen Chen authored
      * Rename example folder for GroupedConvFwdMultipleD
      
      * Unify example codes
      
      * Change target names
      
      * Add fp16 example for multiple d instance
      
      * Re-format common.hpp
      
      * Add interface 'DeviceGroupedConvFwd'
      
      * Use simpler interface
      
      * Move common conv params out
      
      * Rename conv fwd client example folder
      
      * Add missing include directive
      
      * Update grouped conv instance implementations
      
      * Simplify ckProfiler (grouped conv forward)
      
      * Use GroupedConvFwd to implement client example
      
      * Use greater groupe count in example
      
      * Add custom target to group examples
      
      * Add extra tag param to instance factory function
      
      * Use tag to differentiate factory functions
      
      * Add missing tag argument for factory function
      
      * Remove inheritance relationship
      
      * Remove no-longer used include directive
      
      * Add license in front of file
      f4980310
    • Po Yen Chen's avatar
      Add client example of grouped conv2d backward weight (data type: fp16) (#498) · 38470e04
      Po Yen Chen authored
      * Remove redundant CMake setting
      
      * Extract common code from files
      
      * Rename folder 'convnd' to 'conv'
      
      * Use std::array<> to accept compile-time kwnown # of arguments
      
      * Fix compilation error of tuning parameter
      
      * In example, use same setting as unit-test
      
      * Remove no-longer used include directive
      
      * Add interface for grouped conv bwd weight
      
      * Add group support for conv bwd weight
      
      * Add grouped conv bwd weight example
      
      * Use group parameter in example
      
      * Rename example folder
      
      * Remove non-grouped version example source files
      
      * Rename device op template
      
      * Add group support to convolution backward weight
      
      * Remove debug messages
      
      * Use smaller group size in example
      
      * Use named variable as loop terminate condition
      
      * Prettify example output message
      
      * Enlarge used grid size
      
      * Allow real grid size exceeds expected grid size
      
      * Rename interface file
      
      * Add client example for grouped conv2d bwd weight
      
      * Fix wrong include directive
      
      * Rename client example folder
      38470e04
  10. 03 Nov, 2022 1 commit
    • guangzlu's avatar
      Fused elementwise normalization (#492) · 8a4253ba
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      8a4253ba
  11. 02 Nov, 2022 2 commits
    • rocking5566's avatar
      Refine layernorm naming and test code (#497) · d4d1147f
      rocking5566 authored
      * Sync the naming
      
      * Sync the test of layernorm with groupnorm
      
      * Sync the naming
      
      * Minor change for comment and log
      
      * [What] Add saveMean and SaveInvVariance in the interface.
      [Why] These can optimize the backward
      d4d1147f
    • Adam Osewski's avatar
      Softmax unit-test reduction across all and non innermost dims cases. (#406) · 6d8614ee
      Adam Osewski authored
      
      
      * Add reduction across all dims cases.
      
      * host softmax: handle all reduce
      
      * Test cases when reduced dim is not innermost axis.
      
      * Fix syntax.
      
      * Test non innermost dim for fp32 and int8
      
      * Group test suites wrt NumReduceDim.
      
      * Additionally test failing cases.
      
      * Throw error when Rank or NumReduceDims doesn't match arguments.
      
      * Check reducedDims has correct values
      
      * Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
      Instead implement own. (in fact just get rid of one check to enable
      reduction across inner dimensions).
      
      * Reorganize unit tests to better cover use scenarios.
      
      * Test input validation
      * Test reduction of inner dimensions with custom op instances.
      
      * Refactor fp32 and int8 unit tests.
      
      * Fix FP32 instance template parameters.
      
      * Add more instances.
      
      * Instances with InSrcVectorDim=0.
      
      * Do not initialize and copy data when arg not supported.
      
      * ckProfiler Softmax use instance factory.
      
      * Refactor device softmax IsSupported.
      
      * Additionally add non-polymorphic api functions
      
      * Split softmax instances into multiple files.
      
      * Fix profiler.
      
      * Reorganize tests to reuse profiler and cover edge cases.
      
      * Clang-format
      
      * I8 Softmax instances along with UT.
      
      * Reuse type alias definitions from instance factory header.
      
      * Clean included headers
      
      * Fix variable names.
      
      * Add missing checks in Argument constructor.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      6d8614ee
  12. 31 Oct, 2022 1 commit
    • ltqin's avatar
      Add Conv Forward on Navi21 for ResNet50 (#490) · 8ee36118
      ltqin authored
      
      
      * add device of dl
      
      * fix k1 of GridwiseGemmDl_km_kn_mn_v1r3
      
      * init version for dl conv
      
      * add example(init)
      
      * result right
      
      * disable elementwise operation
      
      * check parameters
      
      * add fp32,int8 example and change check code
      
      * change deive file and class name
      
      * add check vector access of C
      
      * add instance
      
      * add to ckProfiler
      
      * add Filter1x1Pad0 instances
      
      * fix ignore error
      
      * fix for CI
      Co-authored-by: default avatarletaoqin <letaoqin@amd.com>
      8ee36118
  13. 27 Oct, 2022 1 commit
    • Anthony Chang's avatar
      Input/output permutation for fused attention (#460) · de37550f
      Anthony Chang authored
      
      
      * reopen masking att instance due to CI is upgraded
      
      * re-enable instances previously failed on 9110
      
      * enable ksize-kpadding pair validity test
      
      * add non-masked attention+permute test; expose masking boolean to attention kernel handles
      
      * disable bench
      
      * fix test
      
      * move files
      
      * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
      
      * format
      
      * amend rename
      
      * disable bench in test
      
      * add mask/no-mask test for non-permute attention kernels
      
      * disable broken kernel instance
      
      * example working
      
      add non-permuted problem statement
      
      evaluating whether overhead comes from permutation or the extra kernel arg
      
      * interface for bias addition without implementing it
      
      * test and profiler running
      
      * tidy
      
      * mask type determined by enum class
      
      * unify example code
      
      * move masking specialization to its own header
      
      * align formats
      
      * extract helper functions
      
      * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
      
      * add tensor specialization to template args
      
      since tensor spec packed shows perf parity when permutation isn't needed
      
      remove redundant template args
      
      comment on 'packed' tensor specialization
      
      * grouped attention with input/output permute example
      
      * format
      
      * clean up
      
      * refactor acc0 tile visitor
      Co-authored-by: wangshaojie6's avatarshaojiewang <wsjmessi@163.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      de37550f
  14. 25 Oct, 2022 3 commits
    • Qianfeng's avatar
      Update to the Reduction API and instances (#476) · dda3a0a1
      Qianfeng authored
      * Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances
      
      * Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces
      
      * Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock
      
      * Split the cpp and hpp files for reduction instances to enable more parallel compiling
      
      * Remove the using of macros for declaring reduction instances and instance references
      
      * Update to add_device_reduce_instance_xxxx templated functions
      
      * Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates
      
      * Change return format
      dda3a0a1
    • guangzlu's avatar
      Revert "Fused elementwise layernorm (#468)" (#491) · 6ea9257e
      guangzlu authored
      This reverts commit efbcc6ed.
      6ea9257e
    • guangzlu's avatar
      Fused elementwise layernorm (#468) · efbcc6ed
      guangzlu authored
      * add fused addition lyernorm
      
      * add fused addition lyernorm
      
      * changed CMakelist
      
      * removed annotates
      
      * modified descriptor of C
      
      * fixed bug in gridwise add layernorm
      
      * format the files
      
      * modified name from add&layernorm into elementwise&layernorm
      
      * created fused elementwise layernorm branch
      
      * change input into tuple type
      
      * add sweep once to reduce load & read of C from global memory
      
      * modified Argument api
      
      * modified way to malloc c in global memory
      
      * changed gamma and beta to m_k_desc
      
      * fixed bug when sweep once and move CDataType when define device level struct
      
      * add src dim for gamma and beta
      
      * implement optimization for coalesced
      
      * delete a annotation line
      
      * fixed some bug to meet the requirements of ck
      
      * add bandwidth computing in example, and fixed the time unit
      
      * move device_elementwise_layernorm_impl.hpp into device/impl
      
      * fixed bug in device_elementwise_layernorm_impl.hpp
      
      * changed name from layernorm into normalization
      
      * clang-format the changed files
      
      * changed the names
      
      * moved immidiate results into lds, it become faster in non-sweeponce cases
      
      * changed naming of C into X to make the defination more clear
      
      * changed naming in example
      
      * add tests for elementwise normalization
      
      * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
      
      * move test_elementwise_layernorm_fp16 into new folder
      
      * move elementwise_normalization_instances into a new folder
      
      * add more tests in test_elementwise_layernorm_fp16.cpp
      
      * added some corner cases in test
      
      * fixed method to compute lds size for matrix X
      
      * changed name of 44_elementwise_normalization into 45_elementwise_normalization
      
      * modified some comments
      
      * modified some other confused comments
      
      * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
      efbcc6ed
  15. 13 Oct, 2022 2 commits
    • Adam Osewski's avatar
      Refactor device op implementations into `impl` subdirectory. (#420) · 30480288
      Adam Osewski authored
      
      
      * Move kernel implementation files under impl directory.
      
      * Update examples paths.
      
      * Update device kernel impl include paths.
      
      * Update tensor operation instances include paths.
      
      * Update profiler and tests include paths.
      
      * Clang-format
      
      * Update include paths for batched gemm reduce
      
      * Refactor UnitTest ConvNDBwdWeight.
      
      * Refactor fwd and bwd data convND UT.
      
      * Fix used test macro.
      
      * Fix include path.
      
      * Fix include paths.
      
      * Fix include paths in profiler and tests.
      
      * Fix include paths.
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      30480288
    • rocking5566's avatar
      Fix bug of layernorm ckProfiler and refine code (#448) · 1b62bfaa
      rocking5566 authored
      * Fix bug of profiler for layernorm
      
      * 1. Rename layernorm into normalization
      2. Decouple softmax from normalization
      
      * clang-format
      1b62bfaa
  16. 22 Sep, 2022 1 commit
  17. 20 Sep, 2022 2 commits
    • Shaojie WANG's avatar
      MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425) · ebab84b6
      Shaojie WANG authored
      
      
      * add lower triangle bmm
      
      * init code for tile skipping
      
      * functionality right with lower triangle mask
      
      * add decoder lower triangular mask calculation
      
      * use 7*13 group
      
      * fix n2 compute error
      
      * attention with lower triangle mask with tile skipping
      
      * add template to distinguish masking kernel
      
      * rename template and remove default template value
      
      * remove lower triangle gemm reference struct
      
      * add some comments on example
      
      * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels
      
      * add test
      
      * add test file
      
      * add gtest for bmm masking scale softmax bmm permute
      
      * clang-format
      
      * fix compile error
      
      * check lef bottom corner for tile skipping
      
      * fix error: check left bottom corner for tile skipping
      
      * add k padding
      
      * add test and instance for MNK padding
      
      * passing a mask struct
      
      * fix instances
      
      * delete used comments
      
      * format
      Co-authored-by: default avatardanyao12 <yaodan@dc-smc-13.amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      ebab84b6
    • rocking5566's avatar
      Group norm (#417) · 4eba345f
      rocking5566 authored
      
      
      * Add groupnorm example by layernorm
      1.  Reference is not ready
      2. shape of gamma and beta need to be fix
      
      * Let shape of gamma and beta can be same as x
      
      * Modify test, instance and client example
      
      * [What] Fix bug of layernorm for greater than 2 dimension.
      [Why] We need to get upper length from merge transform instead of embed transform.
      
      * Add reference for groupnorm
      
      * Fuse sigmoid after groupnorm
      
      * [What] Rename original layernorm into layernorm2d
      [Why] Prepare to add groupnorm using layernorm5d
      
      * clang-format
      
      * Add groupnorm test
      
      * Refine error message
      
      * Add groupnorm ckProfiler
      
      * Test groupnorm kernel from device_instance
      
      * update example
      
      * upadte profiler
      
      * Fix test naming
      
      * Fix argc number
      
      * Move descriptor and sweeponce to argument for quick debugging
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      4eba345f
  18. 14 Sep, 2022 1 commit
    • ltqin's avatar
      batched_gemm + multiple_d + gemm + multiple_d (#394) · 370efa6c
      ltqin authored
      
      
      * refactor
      
      * start
      
      * add device gemm file
      
      * add BatchStrideD0
      
      * add stridd0
      
      * add gridwise file
      
      * add d0 parameters to gridwise gemm
      
      * add c layout transformer
      
      * add d0 threadwise copy
      
      * init kernel
      
      * init kernel
      
      * regular code
      
      * nm desc put to out
      
      * kernel parameter can not use reference
      
      * host add bias+gelu
      
      * run right for bias+gelu
      
      * change AddFastGelu into another file
      
      * interface add d1 bias parameters
      
      * add d1 parameter to argument
      
      * add d1 parameter to gridwise
      
      * first all code,not verify
      
      * gelu change to relu and GetElementSpaceSize bug
      
      * add instance
      
      * start add to ckprofiler
      
      * ckprofiler finish code
      
      * change input parameter for ckProfiler
      
      * fix host bias+gelu bug
      
      * show help for ckProfiler
      
      * fix bug for lunch kernel ignore parametes
      
      * add pad and fix about bug
      
      * mutiple d0
      
      * add dynamic d0_element_op
      
      * change profiler and  instance to mutiple d0
      
      * example have 2 d0
      
      * remove some comments not using
      
      * change 2 d0 have self  parameters
      
      * change d element_op name
      
      * change class name(multiple_d)
      
      * fix bug
      
      * fix bug that don't find file
      
      * update profiler
      
      * refactor
      
      * update profiler
      
      * clean
      
      * revert example change
      
      * add gon layout
      
      * optimize parameter for gno
      
      * add gon to gemm+gemm
      
      * change helping input parameters
      
      * change to GemmPadder_v2
      
      * using ForEach
      
      * fix gb_per_sec
      Co-authored-by: default avatarChao Liu <lc.roy86@gmail.com>
      Co-authored-by: default avatarltqin <letaoqin@amd.com>
      370efa6c
  19. 06 Sep, 2022 2 commits
    • Anthony Chang's avatar
      Fused attention instances & padding tests (#395) · 868e5c55
      Anthony Chang authored
      * modify comment
      
      * trim unnecessary check
      
      * add gemm spec in kernel name
      
      * add TNTT gemm_gemm + atten kernel instances
      
      * refactor attention padding to better fit in unit tests
      
      This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
      Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
      
      * add adhoc padding test for atten
      
      * shrink input value range for attention kernel validation to avoid occasional error by 1e-3
      
      Still unsure whether this kind of deterministic floating point accurary issue is expected
      or not. May want to try exact same approach as the GPU kernel in the host reference
      GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
      shrink the input value range as it is less likely to produce errors of around ~1e-3.
      
      * attention kernel proper granular padding for all 4 dims
      
      * IsSupportedArgument checks
      
      * test more padded cases
      
      * block PadK specialization in attention kernels
      
      * workaround clang crash for gfx908
      
      (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
      error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
      VGPR_32: Cannot scavenge register without an emergency spill slot!"
      this fall back to less ideal way of handle NPadding in fused attention kernel
      
      * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
      868e5c55
    • Adam Osewski's avatar
      Softmax client example (#396) · 3da5c19e
      Adam Osewski authored
      
      
      * Update Softmax device operation interface.
      
      * Update ckProfiler.
      
      * Update Softmax UT.
      
      * Update example.
      
      * Client example.
      
      * Clang format
      Co-authored-by: default avatarAdam Osewski <aosewski@amd.com>
      3da5c19e
  20. 29 Aug, 2022 1 commit
  21. 23 Aug, 2022 1 commit
  22. 13 Aug, 2022 3 commits
    • rocking5566's avatar
      Layernorm welford (#346) · 0bd6b842
      rocking5566 authored
      
      
      * Add threadwise and blockwise welford
      
      * Rename gridwise op, prepare to add welford version
      
      * implement welford and integrate welford into layernorm
      
      * Take care of tail loop
      
      * Fix buf when ThreadSliceK > 1
      
      * Fix bug of merging of two empty set
      
      * Rename clip to clamp
      
      * 1. Fix type of count
      2. Remove useless static_assert
      
      * Do not inherit Reduction::Argument
      
      * [What] replace __syncthreads() with block_sync_lds()
      [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
      
      * Add y stride
      
      * Rename.
      DeviceLayernorm -> DeviceLayernormImpl
      DeviceNormalization2 -> DeviceLayernorm
      
      * Move literal ""_uz & ""_zu into namespace 'literals'
      
      * Move namespace 'literals' as 'ck::literals'
      Co-authored-by: default avatarPo-Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      0bd6b842
    • Anthony Chang's avatar
      Fused GEMM+GEMM (#351) · c20a75b0
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two-dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring an unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * add gemm_gemm instances and tests
      
      * avoid LDS data hazard
      
      * fix build
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      c20a75b0
    • Anthony Chang's avatar
      Fused attention (#345) · cac014f1
      Anthony Chang authored
      
      
      * initial stub for gemm_gemm_xdl_cshuffle
      
      * set up example code
      
      * compiles
      
      * prevent integer overflow
      
      * harmonize interface between ref_gemm and ref_batched_gemm
      
      * batched_gemm_gemm
      
      * fix example
      
      * host tensor gen: diagonal pattern in lowest two-dimensions only
      
      * make c descriptors containing only integral constants
      
      * clean up
      
      * add BlockwiseGemmXdlops_v2 while exploring an unified approach
      
      * implement proper interface
      
      * tidy up example
      
      * fix compilation warnings
      
      * coarsely controlled 2nd gemm padding
      
      * remove rocm-cmake's hard requirement for certain revision
      
      * clang-format
      
      * resolve merge conflict
      
      * fix compilation error on gfx10
      
      * adds acc0 elementwise op to interface
      
      * attention host validation
      
      * add blockwsie softmax v1
      
      * iteratively update softmax+gemm
      
      * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
      
      * add init method for easier debugging
      
      * do away with manual thread cluster calculation
      
      * generalize blockwise softmax interface
      
      * row-wise softmax sum & max
      
      * format
      
      * rename to DeviceBatchedGemmSoftmaxGemm
      
      * add gemm_softmax_gemm instances and tests
      
      * comment
      Co-authored-by: default avatarltqin <letao.qin@amd.com>
      Co-authored-by: default avatarChao Liu <chao.liu2@amd.com>
      cac014f1
  23. 11 Aug, 2022 1 commit
    • rocking5566's avatar
      ckProfiler for layernorm (#330) · fdfd7eb5
      rocking5566 authored
      * Refine parameter
      
      * Add base class for layernorm
      
      * Add layernorm instance
      
      * Add layernorm to ckProfiler
      
      * Remove redundant
      
      * Add verification
      
      * Fix compile error due to merge
      fdfd7eb5
  24. 07 Aug, 2022 1 commit
  25. 29 Jul, 2022 1 commit
    • Chao Liu's avatar
      Clean up conv example, Instances, profiler and test (#324) · 500fa995
      Chao Liu authored
      * convnd_fwd fp16 example
      
      * update example
      
      * update example
      
      * update instance
      
      * updating refernce conv
      
      * update reference conv
      
      * update conv fwd profiler
      
      * update conv 1d and 3d instance
      
      * update include path
      
      * clean
      
      * update profiler for conv bwd data and weight
      
      * update conv bwd weight
      
      * clean
      
      * update conv example
      
      * update profiler for conv bwd weight
      
      * update ckprofiler for conv bwd data
      
      * fix reference conv bwd data bug; update conv bwd data test
      
      * update examples
      
      * fix initialization issue
      
      * update test for conv fwd
      
      * clean
      
      * clean
      
      * remove test case too sensitive to error threshhold
      
      * fix test
      
      * clean
      
      * fix build
      
      * adding conv multiple d
      
      * adding conv multiple D
      
      * add matrix padder
      
      * add gemm padding to convnd
      
      * adding group conv
      
      * update gemm multi-d
      
      * refactor
      
      * refactor
      
      * refactor
      
      * clean
      
      * clean
      
      * refactor
      
      * refactor
      
      * reorg
      
      * add ds
      
      * add bias
      
      * clean
      
      * add G
      
      * adding group
      
      * adding group
      
      * adding group
      
      * update Tensor
      
      * clean
      
      * update example
      
      * update DeviceGemmMultipleD_Xdl_CShuffle
      
      * update conv bwd-data and bwd-weight
      
      * upate contraction example
      
      * update gemm and batch gemm with e permute
      
      * fix example build
      
      * instance for grouped conv1d
      
      * update example
      
      * adding group conv instance
      
      * update gemm bilinear instance
      
      * update gemm+add+add+fastgelu instance
      
      * update profiler
      
      * update profiler
      
      * update test
      
      * update test and client example
      
      * clean
      
      * add grouped conv into profiler
      
      * update profiler
      
      * clean
      
      * add test grouped conv, update all conv test to gtest
      
      * update test
      500fa995
  26. 21 Jul, 2022 2 commits
    • Illia Silin's avatar
      Add full QA with verification option, few other changes. (#331) · d8415a96
      Illia Silin authored
      * add verify flag and update scripts
      
      * replace old check_error function with the new check_err
      
      * fix syntax
      
      * remove blank spaces
      
      * remove empty line
      
      * add check_err for tensors
      
      * fix syntax
      
      * replace tensors with vectors in check_err calls
      
      * fix syntax
      
      * remove blank spaces
      
      * fix syntax
      
      * add new line at end of file
      
      * disable conv2d_bwd_weight test, add gpu check
      
      * set check_gpu using export
      
      * check GPU using runShell
      
      * add definition of runShell
      
      * fix script syntax
      
      * reduce the number of threads, add full qa option
      
      * run processing scripts in bash
      
      * fix the branch and host names in performance scripts, add chronos
      
      * replace parameterizedCron with cron
      
      * archive the perf log files
      
      * try to fix git call
      
      * pass branch and host names as arguments into scripts
      
      * fix script arguments
      
      * fix script arguments
      
      * process results on master
      
      * fix pipeline
      
      * add definition of gpu_arch
      
      * run processing scripts in docker
      
      * fix the brackets
      
      * add agent master for the processing stage
      
      * get rid of show_node_info call on master
      
      * try using mici label instead of master, disable MI100 tests for now
      
      * fix syntax
      
      * simplify container for results processing
      
      * remove node(master) from the process_results stage
      
      * put all stages in original order
      
      * change the agent label from master to mici for gfx908
      d8415a96
    • zjing14's avatar
      Grouped Gemm device with multiD grid (#319) · 7959dad5
      zjing14 authored
      
      
      * replace gridwise_v2r3 with multiD
      
      * adjust parameters
      
      * add instances
      
      * fixed test_grouped_gemm
      
      * fix standalone softmax race condition around blockwise reduction
      
      * fixed ci
      
      * fixed comment: remove redundant workspace
      
      * use instanceFactory
      
      * add test layout
      
      * add empty Ds
      
      * add bias example
      
      * use array
      
      * sperate examples
      Co-authored-by: default avatarAnthony Chang <ac.chang@outlook.com>
      7959dad5
  27. 08 Jul, 2022 1 commit
  28. 02 Jul, 2022 1 commit
    • Chao Liu's avatar
      Gemm+Bilinear (#316) · 9e4429f9
      Chao Liu authored
      * refactor
      
      * update example
      
      * update example
      
      * gemm bilinear
      
      * clean
      
      * update
      9e4429f9