- 30 Jan, 2023 1 commit
-
-
Adam Osewski authored
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 25 Jan, 2023 1 commit
-
-
Qianfeng authored
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the use of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batchnorm_infer_nhwc client example
-
- 18 Jan, 2023 1 commit
-
-
Raman R jana authored
* wavelet gemm programming model support for CK
* GEMM pipeline update for wavelet programming model
* Updated wavelet programming pipeline
* fixes for global-write for math-wave
* fixed bug in global writes
* Updated comments for better readability
* fixed clang format errors
* added block_lds without barrier sync
* clean
* clean
* clean
* clean
* refactor
* prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build
* refactor standalone test to use gemm test harness
* simplify gemm test
* update build script
* remove redundant
* early return when cmd arg doesn't match
* tidy
* report failure when result not validated
* tidy
* Add comment depicting B2C mapping pattern.
* Formatting & comments.
* Comparison with custom B2C mapping pattern.
* Example for wavelet gemm.
* Add wavelet to Gemm standalone test.
* Remove debug code.
* Remove dangling #endif directive.
Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 17 Jan, 2023 1 commit
-
-
Haocong WANG authored
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
-
- 15 Dec, 2022 1 commit
-
-
Illia Silin authored
-
- 07 Dec, 2022 1 commit
-
-
Po Yen Chen authored
* Use smaller tensor size in test
* Use an even smaller tensor size
* Touch only failing test case inputs
-
- 02 Dec, 2022 1 commit
-
-
Haocong WANG authored
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* Remote int4 related
* delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 01 Dec, 2022 1 commit
-
-
Po Yen Chen authored
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
-
- 30 Nov, 2022 1 commit
-
-
Qianfeng authored
* Refine the device batchnorm-backward base API templates and data type assignments
* Remove duplicated kernel file
* Add batchnorm backward instances and external API
* Add batchnorm-backward profiler and tests
* Add client example which uses batchnorm backward external API
* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
* Loosen the threshold for batchnorm-backward check_err()
-
- 29 Nov, 2022 1 commit
-
-
Anthony Chang authored
* properly return error flag; reveals bug in split-k gemm
* fix bug in split k
* update split-k test case
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 28 Nov, 2022 1 commit
-
-
Qianfeng authored
Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516)
-
- 25 Nov, 2022 1 commit
-
-
Qianfeng authored
* Update to device_batchnorm_forward base class to include all template parameters for problem description
* Add batchnorm forward instances and external api
* Add batchnorm forward profiler module which uses the external api
* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
* Improvement to the batchnorm infer base API
* Add batchnorm forward client example which shows using the batchnorm forward external API
* Add test for batchnorm forward
* Tuning the batchnorm profiler initialized values and error threshold
* Add support for bhalf_t in instances/external api/tests
* Add support for int8_t in instances/external api/tests
* Add support for double in instances/external api/tests
* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
* Checking before running best instance in batchnorm_fwd_nhwc client example
* Add checking for YElementwiseOp in batchnorm_forward external API
* Add more types in batchnorm forward profiler
* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
-
- 17 Nov, 2022 1 commit
-
-
Anthony Chang authored
* workaround bf16 atten fwd issue on gfx908
* typo
-
- 15 Nov, 2022 1 commit
-
-
guangzlu authored
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
* added bf16 tests for batched_gemm_softmax_gemm_permute
* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* aligned annotations
* modified CMakeLists for examples
* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
* use macro to control the instances
* added macro control into instances
* clang-format some files
* changed error tolerance for bf16
* changed index for 10_elementwise_normalization
* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
- 11 Nov, 2022 1 commit
-
-
Po Yen Chen authored
* Rangify STL algorithms
  This commit adopts rangified std::copy(), std::fill() & std::transform()
* Rangify check_err()
  By rangifying check_err(), we can compare not only values between std::vector<>s, but also any ranges that have the same value type.
* Allow constructing Tensor<> like a HostTensorDescriptor
* Simplify Tensor<> object construction logic
* Remove more unnecessary 'HostTensorDescriptor' objects
* Re-format example code
* Re-write more HostTensorDescriptor ctor call
-
- 10 Nov, 2022 1 commit
-
-
Po Yen Chen authored
* Remove redundant CMake setting
* Extract common code from files
* Rename folder 'convnd' to 'conv'
* Use std::array<> to accept compile-time known # of arguments
* Fix compilation error of tuning parameter
* In example, use same setting as unit-test
* Remove no-longer used include directive
* Add interface for grouped conv bwd weight
* Add group support for conv bwd weight
* Add grouped conv bwd weight example
* Use group parameter in example
* Rename example folder
* Remove non-grouped version example source files
* Rename device op template
* Add group support to convolution backward weight
* Remove debug messages
* Use smaller group size in example
* Use named variable as loop terminate condition
* Prettify example output message
* Enlarge used grid size
* Allow real grid size exceeds expected grid size
* Rename interface file
* Add client example for group...
-
- 03 Nov, 2022 1 commit
-
-
guangzlu authored
* add fused addition layernorm
* add fused addition layernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete an annotation line
* fixed some bug to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved immediate results into lds, it becomes faster in non-sweeponce cases
* changed naming of C into X to make the definition more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confusing comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
-
- 02 Nov, 2022 3 commits
-
-
Anthony Chang authored
* disable gtest discovery to run tests per-program not per-case
* register cmake target to ctest
-
rocking5566 authored
* Sync the naming
* Sync the test of layernorm with groupnorm
* Sync the naming
* Minor change for comment and log
* [What] Add saveMean and SaveInvVariance in the interface.
  [Why] These can optimize the backward
-
Adam Osewski authored
* Add reduction across all dims cases.
* host softmax: handle all reduce
* Test cases when reduced dim is not innermost axis.
* Fix syntax.
* Test non innermost dim for fp32 and int8
* Group test suites wrt NumReduceDim.
* Additionally test failing cases.
* Throw error when Rank or NumReduceDims doesn't match arguments.
* Check reducedDims has correct values
* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method. Instead implement own. (in fact just get rid of one check to enable reduction across inner dimensions).
* Reorganize unit tests to better cover use scenarios.
* Test input validation
* Test reduction of inner dimensions with custom op instances.
* Refactor fp32 and int8 unit tests.
* Fix FP32 instance template parameters.
* Add more instances.
* Instances with InSrcVectorDim=0.
* Do not initialize and copy data when arg not supported.
* ckProfiler Softmax use instance factory.
* Refactor device softmax IsSupported.
* Additionally add non-polymorphic api functions
* Split softmax instances into multiple files.
* Fix profiler.
* Reorganize tests to reuse profiler and cover edge cases.
* Clang-format
* I8 Softmax instances along with UT.
* Reuse type alias definitions from instance factory header.
* Clean included headers
* Fix variable names.
* Add missing checks in Argument constructor.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
-
- 28 Oct, 2022 2 commits
-
-
rocking5566 authored
-
Rostyslav Geyyer authored
-
- 27 Oct, 2022 3 commits
-
-
Anthony Chang authored
* reopen masking att instance due to CI is upgraded
* re-enable instances previously failed on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
* add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
Rostyslav Geyyer authored
* Fix for lwpck-425, update BlockTransferSrcVectorDim
* Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"
  This reverts commit fd24e280e28ff238b452cfdde58a988affd46461.
* Add Batched Gemm int8 test, expect it to fail
* Format
* Re-add the fix
-
Anthony Chang authored
* prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build
* refactor standalone test to use gemm test harness
* simplify gemm test
* update build script
* remove redundant
* early return when cmd arg doesn't match
* tidy
* report failure when result not validated
* tidy
* Apply suggestions from code review
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 25 Oct, 2022 2 commits
-
-
guangzlu authored
* add fused addition layernorm
* add fused addition layernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete an annotation line
* fixed some bug to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved immediate results into lds, it becomes faster in non-sweeponce cases
* changed naming of C into X to make the definition more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confusing comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
- 13 Oct, 2022 2 commits
-
-
Adam Osewski authored
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
rocking5566 authored
* Fix bug of profiler for layernorm
* 1. Rename layernorm into normalization
  2. Decouple softmax from normalization
* clang-format
-
- 07 Oct, 2022 1 commit
-
-
Shaojie WANG authored
* use another instance to check the efficiency
* optimize group layer norm
* 1. coalesce load/store data for gridwise layer norm welford.
  2. move a sqrt and division into an outer static loop
* add more instances to layernorm
* add 2 more test cases
* remove ignore in generating tuple of vector
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 20 Sep, 2022 3 commits
-
-
Shaojie WANG authored
* add lower triangle bmm
* init code for tile skipping
* functionality right with lower triangle mask
* add decoder lower triangular mask calculation
* use 7*13 group
* fix n2 compute error
* attention with lower triangle mask with tile skipping
* add template to distinguish masking kernel
* rename template and remove default template value
* remove lower triangle gemm reference struct
* add some comments on example
* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels
* add test
* add test file
* add gtest for bmm masking scale softmax bmm permute
* clang-format
* fix compile error
* check left bottom corner for tile skipping
* fix error: check left bottom corner for tile skipping
* add k padding
* add test and instance for MNK padding
* passing a mask struct
* fix instances
* delete used comments
* format
Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
rocking5566 authored
* Add groupnorm example by layernorm
  1. Reference is not ready
  2. shape of gamma and beta need to be fixed
* Let shape of gamma and beta can be same as x
* Modify test, instance and client example
* [What] Fix bug of layernorm for greater than 2 dimension.
  [Why] We need to get upper length from merge transform instead of embed transform.
* Add reference for groupnorm
* Fuse sigmoid after groupnorm
* [What] Rename original layernorm into layernorm2d
  [Why] Prepare to add groupnorm using layernorm5d
* clang-format
* Add groupnorm test
* Refine error message
* Add groupnorm ckProfiler
* Test groupnorm kernel from device_instance
* update example
* update profiler
* Fix test naming
* Fix argc number
* Move descriptor and sweeponce to argument for quick debugging
Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
Anthony Chang authored
* sanity check
* add attribution
* add irregular k tile size for batched attention
* format
-
- 06 Sep, 2022 3 commits
-
-
Anthony Chang authored
* modify comment
* trim unnecessary check
* add gemm spec in kernel name
* add TNTT gemm_gemm + atten kernel instances
* refactor attention padding to better fit in unit tests
  This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op. Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
* add adhoc padding test for atten
* shrink input value range for attention kernel validation to avoid occasional error by 1e-3
  Still unsure whether this kind of deterministic floating point accuracy issue is expected or not. May want to try exact same approach as the GPU kernel in the host reference GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then, shrink the input value range as it is less likely to produce errors of around ~1e-3.
* attention kernel proper granular padding for all 4 dims
* IsSupportedArgument checks
* test more padded cases
* block PadK specialization in attention kernels
* workaround clang crash for gfx908
  (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
  error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class VGPR_32: Cannot scavenge register without an emergency spill slot!"
  this falls back to a less ideal way of handling NPadding in fused attention kernel
* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
-
Anthony Chang authored
* add gemm_gemm TNNT instance
* sanitize Gemm1KPack
* disable instances that failed validation on mi100
-
Adam Osewski authored
* Update Softmax device operation interface.
* Update ckProfiler.
* Update Softmax UT.
* Update example.
* Client example.
* Clang format
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 02 Sep, 2022 1 commit
-
-
zjing14 authored
* add scripts
* fixed splitK_gemm_fp32
* clean
* clean
* use gemm_xdl_splitK_c_shuffle into profiler
* remove device_gemm_xdl_splitk.hpp
-
- 25 Aug, 2022 1 commit
-
-
Adam Osewski authored
* More int4 UT.
* Disable BitwiseRepresentation UT.
* Add UT with static_cast
* Surround cout statements with #if
Co-authored-by: Adam Osewski <aosewski@amd.com>
-
- 23 Aug, 2022 1 commit
-
-
Anthony Chang authored
* GemmPadder and GemmGemmPadder
* proper padding using GemmGemmPadder
* test gemm_gemm padding
* properly check size K in IsSupportedArgument()
* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
* comment
* format
-
- 18 Aug, 2022 1 commit
-
-
Adam Osewski authored
* Introduce int4 data type.
* Add unit-tests for int4
* Compile int4 UT only when int4 enabled.
* clang-format
Co-authored-by: Adam Osewski <aosewski@amd.com>
-