- 27 Oct, 2022 4 commits
-
-
Anthony Chang authored
* reopen masking att instance due to CI is upgraded * re-enable instances previously failed on 9110 * enable ksize-kpadding pair validity test * add non-masked attention+permute test; expose masking boolean to attention kernel handles * disable bench * fix test * move files * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute * format * amend rename * disable bench in test * add mask/no-mask test for non-permute attention kernels * disable broken kernel instance * example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg * interface for bias addition without implementing it * test and profiler running * tidy * mask type determined by enum class * unify example code * move masking specialization to its own header * align formats * extract helper functions * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute * add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization * grouped attention with input/output permute example * format * clean up * refactor acc0 tile visitor * fused attention client example * format Co-authored-by:
shaojiewang <wsjmessi@163.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Anthony Chang authored
* reopen masking att instance due to CI is upgraded * re-enable instances previously failed on 9110 * enable ksize-kpadding pair validity test * add non-masked attention+permute test; expose masking boolean to attention kernel handles * disable bench * fix test * move files * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute * format * amend rename * disable bench in test * add mask/no-mask test for non-permute attention kernels * disable broken kernel instance * example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg * interface for bias addition without implementing it * test and profiler running * tidy * mask type determined by enum class * unify example code * move masking specialization to its own header * align formats * extract helper functions * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute * add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization * grouped attention with input/output permute example * format * clean up * refactor acc0 tile visitor Co-authored-by:
shaojiewang <wsjmessi@163.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Rostyslav Geyyer authored
* Fix for lwpck-425, update BlockTransferSrcVectorDim * Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim" This reverts commit fd24e280e28ff238b452cfdde58a988affd46461. * Add Batched Gemm int8 test, expect it to fail * Format * Re-add the fix
-
Anthony Chang authored
* prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build * refactor standalone test to use gemm test harness * simplify gemm test * update build script * remove redundant * early return when cmd arg doesn't match * tidy * report failure when result not validated * tidy * Apply suggestions from code review Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 26 Oct, 2022 1 commit
-
-
Illia Silin authored
-
- 25 Oct, 2022 3 commits
-
-
Qianfeng authored
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances * Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces * Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock * Split the cpp and hpp files for reduction instances to enable more parallel compiling * Remove the using of macros for declaring reduction instances and instance references * Update to add_device_reduce_instance_xxxx templated functions * Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates * Change return format
-
guangzlu authored
* add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
-
- 19 Oct, 2022 1 commit
-
-
arai713 authored
-
- 17 Oct, 2022 1 commit
-
-
arai713 authored
* adding tensor_permutation example folder * fixed formatting * adding tensor_permutation example folder * fixed formatting * changed deviceelementwise parameters for outscalar * removed .swo file * updated folder/file name * changed function call in verification for better consistency with hostelementwist parameters * formatted again * fixed shape in verification function call * changed verification function call, added definition for nhwc * added elementwise permute example * updated CMakeLists file in folder * Delete CmakeLists.txt * Delete tensor_permute.cpp * first version of 2d gridwise_elementwise kernel * temporary fix for stride problem * formatting * format * changed directory name * Delete gridwise_elementwise_2d.hpp * Delete CMakeLists.txt * Delete extra file * delete extra file * got rid of extraneous code * added 2d device elementwise file * deleted accidently added file * update * stride values generalized with equations * updated stride for output matrix * Update CMakeLists.txt * removed extraneous commented code * removed shape_nchw vector, replaced with GetLength for each dimension * changed vector load in kernel call * removed extra space in CMake
-
- 13 Oct, 2022 3 commits
-
-
Adam Osewski authored
* Move kernel implementation files under impl directory. * Update examples paths. * Update device kernel impl include paths. * Update tensor operation instances include paths. * Update profiler and tests include paths. * Clang-format * Update include paths for batched gemm reduce * Refactor UnitTest ConvNDBwdWeight. * Refactor fwd and bwd data convND UT. * Fix used test macro. * Fix include path. * Fix include paths. * Fix include paths in profiler and tests. * Fix include paths. Co-authored-by:Adam Osewski <aosewski@amd.com>
-
rocking5566 authored
* Fix bug of profiler for layernorm * 1. Rename layernorm into normalization 2. Decouple softmax from normalization * clang-format
-
Adam Osewski authored
Co-authored-by:Adam Osewski <aosewski@amd.com>
-
- 11 Oct, 2022 2 commits
-
-
ltqin authored
* start split k * add base device class * add example after merge develop * add gridwise gemm * add b matrix split k * split=1 * change name for kb * not bias result right * bias only add once * fix register spill * regular code * add fp32 example * fix for 64bit index * fix CheckValidity of gridwise
-
Illia Silin authored
* run branch once a day, with release and staging compilers * add GetDockerImage in Clang stage * apply the new triggers to the develop branch
-
- 07 Oct, 2022 1 commit
-
-
Shaojie WANG authored
* use another instance to check the efficiency * optimize group layer norm * 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and divison into a outer static loop * add more instances to layernorm * add 2 more test cases * remove ignore in generating tuple of vector Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
- 03 Oct, 2022 3 commits
-
-
Chao Liu authored
* update cmake script * update readme * Update README.md * add citation * add images * Update README.md * update * Update README.md * Update CONTRIBUTORS.md * Update README.md * Update CITATION.cff * Update README.md * Update CITATION.cff * update doc * Update CONTRIBUTORS.md * Update LICENSE * update
-
Chao Liu authored
* update cmake script * update readme * Update README.md * add citation * add images * Update README.md * update * Update README.md * Update CONTRIBUTORS.md * Update README.md * Update CITATION.cff * Update README.md * Update CITATION.cff * update doc * Update CONTRIBUTORS.md * Update LICENSE
-
Chao Liu authored
* update cmake script * update readme * Update README.md * add citation * add images * Update README.md * update * Update README.md * Update CONTRIBUTORS.md * Update README.md * Update CITATION.cff * Update README.md * Update CITATION.cff
-
- 01 Oct, 2022 1 commit
-
-
Illia Silin authored
* enable ccache and decouple it from MIOpen ccache use * fix the ccache check script * use another method to get server name * fix syntax * add quotes around the server name variable * use check_host as function * change syntax * fix syntax * test if server name is parsed correctly * try different syntax * check the env var value * test new check node function * add ROCMVERSION parameter and fix script syntax * fix script syntax * add missing instances of rocm version * install ccache in the docker image * do not check GPU in clang format stage, clean up old code * update defaults and clean up
-
- 27 Sep, 2022 1 commit
-
-
Illia Silin authored
* add an option to select specific compiler commit * change the logic of forcing building a docker * add check for compiler commit in dockerfile * compiler check syntax fix * change compiler selection logic * fix the new compiler build issue * set new compiler as default, update dev-requirements * fix jenkins syntax * fix docker syntax * get rid of hipcc.pl editing in jenkinsfile * fix the hipcc.pl in both places * try to fix the 10738 compiler linking bug * fix syntax * use dockerhub to store images * use newer amd-stg-open commit as default
-
- 23 Sep, 2022 1 commit
-
-
JD authored
* fix device instance library to add all instances * remove cppcheck from requirements.txt Co-authored-by:
Jun Liu <Liu.Jun@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 22 Sep, 2022 2 commits
-
-
Chao Liu authored
* fix * fix * add instance
-
Illia Silin authored
* replace obsolete offload-arch flags with GPU_TARGETS * fix a build error for client app * replace commma with semicolon in GPU_TARGETS
-
- 21 Sep, 2022 3 commits
-
-
Lixun Zhang authored
-
Illia Silin authored
* build CK only once, use deb package in all subsequent stages * update jenkins file * change prefix for build_CK stage * update writing deb metadata to control file * update ubuntu source for docker, script syntax for deb package metadata * try different way to create deb metadata * clean up DEBIAN before creating one * fix the CI folder names, fix splitK qa * use correct docker in all stages, separate tests for splitK verification and performance * clean old comments, change dir before packaging * use different package syntax * change packaging syntax * package with cmake * remove unnecessary build prefix * get rid of unnecessary paths * change paths during unpacking * change script syntax while unpacking * get rid of unneccesary steps * get rid of comments in the scripts * use double quotes for scripts * add ccache during build, try dpkg -x * pull and install each package separately * use full package names * try to use stashing for packages * change stash/unstash syntax * move unstash out of shell, run tests on any gpu node * unpack each package separately * try re-using existing workspace * merge the build and test stages, only stash ckProfiler * merge the build and test stages, only stash zipped ckProfiler * fix syntax * add GPU check before build and test, rename docker to usual name
-
zjing14 authored
-
- 20 Sep, 2022 6 commits
-
-
Chao Liu authored
* fix build * fix build
-
Shaojie WANG authored
* add lower triangle bmm * init code for tile skipping * functionality right with lower triangle mask * add decoder lower triangular mask calculation * use 7*13 group * fix n2 compute error * attention with lower triangle mask with tile skipping * add template to distinguish masking kernel * rename template and remove default template value * remove lower triangle gemm reference struct * add some comments on example * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels * add test * add test file * add gtest for bmm masking scale softmax bmm permute * clang-format * fix compile error * check lef bottom corner for tile skipping * fix error: check left bottom corner for tile skipping * add k padding * add test and instance for MNK padding * passing a mask struct * fix instances * delete used comments * format Co-authored-by:
danyao12 <yaodan@dc-smc-13.amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Illia Silin authored
-
rocking5566 authored
* Add groupnorm example by layernorm 1. Reference is not ready 2. shape of gamma and beta need to be fix * Let shape of gamma and beta can be same as x * Modify test, instance and client example * [What] Fix bug of layernorm for greater than 2 dimension. [Why] We need to get upper length from merge transform instead of embed transform. * Add reference for groupnorm * Fuse sigmoid after groupnorm * [What] Rename original layernorm into layernorm2d [Why] Prepare to add groupnorm using layernorm5d * clang-format * Add groupnorm test * Refine error message * Add groupnorm ckProfiler * Test groupnorm kernel from device_instance * update example * upadte profiler * Fix test naming * Fix argc number * Move descriptor and sweeponce to argument for quick debugging Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
Po Yen Chen authored
* Add example folder for 'DeviceElementwise' * Re-structure example files * Move common parts into common.hpp * Use more strict input * Add more helper methods in 'DeviceElementwise' * Use more specific method to write example * Allow specify problem through command line argument * Allow specify problem 'axes' through command line argument * Add check to template type argument * Add transpose_shape() to generalize shape permute * Generalize transpose utility functions * Use better name for tensor indices * Add checks in helper functions * Remove debug messages * Refine error message for check_err() * Generalize variable naming in example code * Add device op 'DevicePermute' This device op is clone of 'DeviceElementwise' * Use 'DevicePermute' device op in example * Remove 'elementwise' from identifiers * Remove 'elementwise' from file paths * Remove base class of 'DevicePermute' * Let 'DevicePermute' inherit from 'BaseOperator' * Add simple type traits to validate device op type * Add static_assert() to check type constraints * Create 'DevicePermuteBase' to generate methods * Use indirect base type to generate methods * Remove 'is_device_op<>' type traits * Only accept single-input-single-output for 'DervicePermute' * Simplify 'DevicePermute' interface * Re-format 'DeviceElementwise' * Use CRTP to generate overridden virtual method * Remove unnecessary include directives * Distinguish input & output shape in 'DevicePermute' * Passing 'axes' to 'DevicePermute' * Use more reasonable return value for Invoker::Run() * Add 'GridwisePermute' kernel This kernel is a clone of 'GridwiseElementwise_1D' * Remove no-longer used type argument * Check if input/output shape meet the requirement * Remove no-longer used method * Remove never-entered-if-clause * Change problem description for 'DevicePermute' * Transform descriptor into 3 dimensions * Add debug code the verify result * Add comment to indicate template argument location * Add N/H/WPerBlock template parameter to 'DevicePermute' * Rename 'GridwisePermute' to 'GridwiseCopy' * Check tensor descriptor dimensions in 'GridwiseElementwise_1D' * Add missing include directive * Add 'BlockSize' parameter to 'DevicePermute' * Remove no-longer used method * Add 'BlockToTileMap' for 'GridwiseCopy' * Use the normal Block2TileMap convention * Rename 'BlockToTileMap' as 'Block2TileMap' * Fix most of compilation errors * Let 'Block2TileMap' map block to 2d coordinate * Allow data transfer in 'GridwiseCopy' * Fix wrong output descriptor for 2nd blockwise copy * Rename 'GridwiseCopy' as 'GridwisePermute' * Remove '1d' in identifiers * Remove commented-out codes * Remove 'MPerThread' template parameter * Seperate template parameters * Unify variable namming convention * Use more verbose way to create expressions * Add template parameter 'InBlockLdsExtraW' * Release the constraint on In/OutGridDesc * Use date type directly as template argument * Re-arrange template arguments for blockwise copy * Remove no-longer used template parameters * Embed layout in the variable names * Add GridwisePermute::CheckValidity() * Extract local types as template parameters * Rename local type alias * Add more template parameters (vector width related) * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions * Fill tensor values start from 1 * Re-formate example code * Avoid too-large block id * Add comment * Make sure 'SrcVectorDim' is not same as 'DstVectorDim' * Add check for the 'VectorDim' & 'ScalarPerVector' template params * Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc * Remove no-longer used template parameter 'NPerBlock' * Fix wrong descriptor creation logics * Specify problem in each examples * Use better example name * Add new example 'example_permute_NxHxW_fp32' * Add example for demonstrating bundle multiple elems in tensor * Add support to permute multiple elements together * Change the default problem size * Add span<> class template * Use span<> to generalize check_err() interface * Fix ambiguous ctor call * Avoid create necessary objects * Use helper functions to simplify example code * Add example for 4xfp16 permute * Disable failed-to-compile example * Add check for the NUM_ELEMS_IN_BUNDLE * Remove redundant parameter in helper lambda function * Add check for the input tensor type's byte-size * Check scalar-per-vector with padded length * Use more verbose name to avoid name collision * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS * Embed shape info in name of descriptor constructor * Rename example folder '36_permute' into '37_permute' * Avoid using too-large LDS in kernel code * Remove redundant example * Usw switch() to group similar codes * Add const to the span<> type arguement * Simply initialize tensor with floating point values * Use fp16 as data type in all examples * Enlarge tensor size in example * Enalrge N-dim in example * Add check for the bundled type in example * Use more stricter error threshold * Remove global load/store loop in kernel code * Measure execution time by default * Use faster device op config for example 'NxHxW_fp16' * Use faster device op config for example '1xHxW_fp16' * Use faster device op config for example 'HxWx4_fp16' * Remove cmd arg parsing logics * Rename functions * Extract bundle permutation logic out * Simplify permute bundle example * Add Tensor<>::GetElementSpaceSizeInBytes() * Add Tensor<>::data() * Use new methods to simplify code * Use type alias to replace duplicated code * Use existing method to shorten code * Allow FillUniformDistribution accept range arugment * Intialize random values in range * Add Tensor<>::size() * Use more meaningful names in permute bundle example * Use more meaningful names in permute element examples * Use rangified copy() to copy elements * Use function return value directly to eliminate variables * Add to_array() conversion tool to eliminate more variables * Add Tensor<>::AsSpan<>() to create view of tensor values * Use AsSpan() to shorten check_err() calls * Remove no-longer-used 'using' directives * Move 'using' directive to proper code position * Remove redudant variables * Remove useless static_assert() * Add check for range types * Declare variable right before first use * Move long return type as tailing return type * Add BaseInvokerCRTP<> class template to generate method * Create new base type for 'DervicePermute' implementations * Move 'NumDim' template param to the first * Rename 'DevicePermute' to 'DevicePermuteImpl' * Add 'noexcept' specifier to CRTP generated method * Move 'Block2TileMap' definition into 'GridwisePermute' * Use type alias to reduce code * Unify naming style in 'DevicePermute' * Add comments in 'GridwisePermute' * Rename permute example folder * Use std::cerr to report error * Use larger shape in examples * Rename '38_permute' to '39_permute' * Make sure we use unsigned type for shape & indices * Remove opt-ed out assertion * Remove template BaseInvokerCRTP<>
-
Anthony Chang authored
* sanity check * add attribution * add irrgular k tile size for batched attention * format
-
- 19 Sep, 2022 3 commits
-
-
Anthony Chang authored
-
Anthony Chang authored
* grouped attn without batch validates; now move toward grouped batched attn * grouped batched attention * working * remove debug logging clean up clean up * reintroduce g_ prefix back to host tensor variables * format * rename file * restore old file * rename * consolidate padded/non-padded attention example * harmonize padding specialization in attn examples
-
Shaojie WANG authored
* init commit of convnd bwd data * begin compiling example * have a first version that produce a right result * refine device level launch kernel code * add more instances in example and get right results * clang-format * format example file * add more instances * fix instances * adding conv_bwd_data multile_d * adding conv_bwd_data multile_d * adding conv_bwd multiple d * adding conv_bwd multiple d * adding conv_bwd multiple d * refactor * refactor * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * adding conv bwd data multiple d * refactor * update conv fwd's bias impl * refactor * reorg file * clean up cmake * clean * clean * clean Co-authored-by:
Chao Liu <lc.roy86@gmail.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 16 Sep, 2022 1 commit
-
-
Chao Liu authored
-
- 14 Sep, 2022 1 commit
-
-
ltqin authored
* refactor * start * add device gemm file * add BatchStrideD0 * add stridd0 * add gridwise file * add d0 parameters to gridwise gemm * add c layout transformer * add d0 threadwise copy * init kernel * init kernel * regular code * nm desc put to out * kernel parameter can not use reference * host add bias+gelu * run right for bias+gelu * change AddFastGelu into another file * interface add d1 bias parameters * add d1 parameter to argument * add d1 parameter to gridwise * first all code,not verify * gelu change to relu and GetElementSpaceSize bug * add instance * start add to ckprofiler * ckprofiler finish code * change input parameter for ckProfiler * fix host bias+gelu bug * show help for ckProfiler * fix bug for lunch kernel ignore parametes * add pad and fix about bug * mutiple d0 * add dynamic d0_element_op * change profiler and instance to mutiple d0 * example have 2 d0 * remove some comments not using * change 2 d0 have self parameters * change d element_op name * change class name(multiple_d) * fix bug * fix bug that don't find file * update profiler * refactor * update profiler * clean * revert example change * add gon layout * optimize parameter for gno * add gon to gemm+gemm * change helping input parameters * change to GemmPadder_v2 * using ForEach * fix gb_per_sec Co-authored-by:
Chao Liu <lc.roy86@gmail.com> Co-authored-by:
ltqin <letaoqin@amd.com>
-
- 13 Sep, 2022 1 commit
-
-
Illia Silin authored
* upgrade the OS and ROCM versions in CK docker * add cxx flags to link code with rocm5.2 and ck-9110 compiler * rename the docker image * run ONNX gemms using init=1
-
- 09 Sep, 2022 1 commit
-
-
carlushuang authored
* add gridwise/device sparse embedding * update code * update code * remove useless makefile * code fix * workable * work properly * emb add * add more instance * format * remove useless code * fix format * fix clang-tidy * clean * fix a compile error Co-authored-by:
Chao Liu <chao.liu2@amd.com> Co-authored-by:
Chao Liu <lc.roy86@gmail.com>
-