"vscode:/vscode.git/clone" did not exist on "2560387f0798be68f22c8dc42e77ecbddabee5ae"
- 24 May, 2023 1 commit
-
-
rocking authored
* Expand the base class of pool2d, prepare to share base class with pool3d * Add pool3d device op * Add pool3d f16 example * Refactor the base class. implement generic pooling in the future * clang format * get original index in max pooling * Add outputindex to base class * Fix dimension * Add pooling instance * Use indexType instead * Remove useless header * Extract IndexDataType to template * Extract pooling reference code * clang format * clang format * Fix typo * Add tensor stride * Add missing header * Add index stride and output stride * Refine naming * Add type to base class * Rename file * Use proper size * Fix typo * Refine naming * Modify the argument into vector. * Add max pool profiler * Refine naming * Support f32 pool * Fix typo * Add avg pool2d fwd in profiler * clang format * Rename AccDatatype to ComputeDatatype * Fix init * test pool * Extract variable * Add client example * Check the pooling dim * clang format * Connect argv and arg_parser * Add found check * Remove useless header * Refine naming * Adjust the order of device_pool_fwd
-
- 23 May, 2023 1 commit
-
-
Illia Silin authored
* enable dl kernels on navi3 * do not build xdl tests and examples on Navi * run tests before building everything on jenkins * disable gemm_bilinear on gfx1030 * add gpu targets to installer on Navi * put tests in the same order as before * reduce the number of navi targets in CI * build CI installed for gfx940 as well * only build for MI300 during QA runs
-
- 15 May, 2023 1 commit
-
-
Bartłomiej Kocot authored
* Add contraction profiler and tests * Build and style fixes * Allow to use any elementwise operator for ref_contraction * Introduce profile_contraction_scale and profile_contraction_bilinear * Make ref_contraction generic and extend interface tests * Stylistic minor fixes * Extend test_contraction_interface
-
- 22 Mar, 2023 1 commit
-
-
Po Yen Chen authored
-
- 15 Feb, 2023 1 commit
-
-
rocking5566 authored
* Sync the order of type string with template parameter * Add more instances * Check the vector size and remove redundant var * Extract var to static, prepare to separate sweep once kernel * Separate sweeponce flow and optimize the flow * 1. Rename AccDatatype in normalization to computeData 2. Rename AccElementwiseOperation to YElementwiseOperation in normalization * Remove useless code * Update naive variance kernel * Refine string * Fix typo * Support naive variance for device_normalization * Check the blocksize * Share the VGPR of x and y * Share the VGPR of gamma and beta * Add more instances * Support fp16 sqrt for experiment * Add CHANGELOG * Fix typo * clang-format
-
- 10 Feb, 2023 1 commit
-
-
Illia Silin authored
-
- 09 Feb, 2023 2 commits
-
-
rocking5566 authored
* Add gemm + layernorm instance * Add ckProfiler * Add test * Add client example * Detect if user forger to set the workrspace * Use literal in the example * [What] use builtin function for sqrt [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt() * check gemm vaildity in IsSupportedArgument * Add more testcases * Merge duplicated folder in client example * Print more infomation * Use better kernel parameter for MS problem size * clang format * Add constexpr for if condition and remove redundant include * Remove cstdlib and add constexpr
-
guangzlu authored
* added instances for large N * add instance for elementwise normlization * added supported restrict in device_elementwise_normalization_impl.hpp
-
- 08 Feb, 2023 1 commit
-
-
ltqin authored
* add instance for gemm bias softmax gemm * add client example * change CGridDesc_G_M_N to CGridDesc_G_M_O * add gridwise * change c grid name * device add d0s data * fix 08 client_example * add example 47_fused_attention * example output correct * add d0 to example * add d0 element op * rechange instance code * change Acc0ElementwiseOperation to C0DEElementwiseOperation * change example name * update instance for cdeelementwiseop * add bhalf_t ScaleAdd * add test * not surport geem1 bias * remove some ignore * fix test bug
-
- 30 Jan, 2023 1 commit
-
-
Adam Osewski authored
Co-authored-by:Adam Osewski <aosewski@amd.com>
-
- 25 Jan, 2023 1 commit
-
-
Qianfeng authored
* File renaming and class renaming for device element-wise operation * Add batchnorm-infer instances, external API and client example * Add batchnorm-infer profiler module and gtests * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer * Rename class and file due to conflict from device_elementwise_2d.hpp * Fix namespace in batcnnorm_infer_nhwc client example
-
- 18 Jan, 2023 1 commit
-
-
Raman R jana authored
* wavelet gemm programming model support for CK * GEMM pipeline update for wavelet progrmmaing model * Updated wavelet programming pipeline * fixes for global-write for math-wave * fixed bug in global writes * Updated comments for better readability * fixed clang format errors * added block_lds without barrier sync * clean * clean * clean * clean * refactor * prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build * refactor standalone test to use gemm test harness * simplify gemm test * update build script * remove redundant * early return when cmd arg doesn't match * tidy * report failure when result not validated * tidy * Add comment depicting B2C mapping pattern. * Formatting & comments. * Comparison with custom B2C mapping pattern. * Example for wavelet gemm. * Add wavelet to Gemm standalone test. * Remove debug code. * Remove dangling #endif directive. Co-authored-by: root <Raman Jana> Co-authored-by:
Chao Liu <chao.liu2@amd.com> Co-authored-by:
Adam Osewski <aosewski@amd.com> Co-authored-by:
Anthony Chang <ac.chang@outlook.com> Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 17 Jan, 2023 1 commit
-
-
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size
-
- 15 Dec, 2022 1 commit
-
-
Illia Silin authored
-
- 07 Dec, 2022 1 commit
-
-
Po Yen Chen authored
* Use smaller tensor size in test * Use even more smaller tensor size * Touch only failing test case inputs
-
- 02 Dec, 2022 1 commit
-
-
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * Remote int4 related * delete deprecated test Co-authored-by:
Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
- 01 Dec, 2022 1 commit
-
-
Po Yen Chen authored
* Re-structure ckProfiler source files * Rename profiler.cpp to main.cpp * Modularize ckProfiler operations * Add description for profiler operations * Use longer name to avoid name collision * Use macro to delay expansion * Use std::move() to avoid object copying * Prohibit users from calling dtor * Use macro to eliminate redundant code * Make friend function hidden * Add missing include directive <iostream> * Fix wrong include directives * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test Co-authored-by:Qianfeng Zhang <Qianfeng.Zhang@amd.com>
-
- 30 Nov, 2022 1 commit
-
-
Qianfeng authored
* Refine the device batchnorm-backward base API templates and data type assignments * Remove duplicated kernel file * Add batchnorm backward instances and external API * Add batchnorm-backward profiler and tests * Add client example which uses batchnorm backward external API * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory * Loose the threshold for batchnorm-backward check_err()
-
- 29 Nov, 2022 1 commit
-
-
Anthony Chang authored
* properly return error flag; reveals bug in split-k gemm * fix bug in split k * update split-k test case Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
- 28 Nov, 2022 1 commit
-
-
Qianfeng authored
Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516)
-
- 25 Nov, 2022 1 commit
-
-
Qianfeng authored
* Update to device_batchnorm_forward base class to include all template parameters for problem description * Add batchnorm forward instances and external api * Add batchnorm forward profiler module which uses the external api * Add some comments in batchnorm_forward example to explain the dimensions in lengths[] * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward * Improvement to the batchnorm infer base API * Add batchnorm forward client example which shows using the batchnorm forward external API * Add test for batchnorm forward * Tuning the batchnorm profiler initialized values and error threshold * Add support for bhalf_t in instances/external api/tests * Add support for int8_t in instances/external api/tests * Add support for double in instances/external api/tests * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances * Checking before running best instance in batchnorm_fwd_nhwc client example * Add checking for YElementwiseOp in batchnorm_forward external API * Add more types in batchnorm forward profiler * Add more test lengths Co-authored-by:rocking5566 <ChunYu.Lai@amd.com>
-
- 17 Nov, 2022 1 commit
-
-
Anthony Chang authored
* workaround bf16 atten fwd issue on gfx908 * typo
-
- 15 Nov, 2022 1 commit
-
-
guangzlu authored
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm * added bf16 tests for batched_gemm_softmax_gemm_permute * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * aligned annotations * modified CMakeLists for examples * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl * use macro to control the instances * added macro control into instances * clang-format some files * changed error tolerance for bf16 * changed index for 10_elementwise_normalization * fixed xdlops code bug in amd_xdlops.hpp Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
- 11 Nov, 2022 1 commit
-
-
Po Yen Chen authored
* Rangify STL algorithms This commit adapts rangified std::copy(), std::fill() & std::transform() * Rangify check_err() By rangifying check_err(), we can not only compare values between std::vector<>s, but also compare any ranges which have same value type. * Allow constructing Tensor<> like a HostTensorDescriptor * Simplify Tensor<> object construction logics * Remove more unnecessary 'HostTensorDescriptor' objects * Re-format example code * Re-write more HostTensorDescriptor ctor call
-
- 10 Nov, 2022 1 commit
-
-
Po Yen Chen authored
* Remove redundant CMake setting * Extract common code from files * Rename folder 'convnd' to 'conv' * Use std::array<> to accept compile-time kwnown # of arguments * Fix compilation error of tuning parameter * In example, use same setting as unit-test * Remove no-longer used include directive * Add interface for grouped conv bwd weight * Add group support for conv bwd weight * Add grouped conv bwd weight example * Use group parameter in example * Rename example folder * Remove non-grouped version example source files * Rename device op template * Add group support to convolution backward weight * Remove debug messages * Use smaller group size in example * Use named variable as loop terminate condition * Prettify example output message * Enlarge used grid size * Allow real grid size exceeds expected grid size * Rename interface file * Add client example for group...
-
- 03 Nov, 2022 1 commit
-
-
guangzlu authored
* add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
-
- 02 Nov, 2022 3 commits
-
-
Anthony Chang authored
* disable gtest discovery to run tests per-program not per-case * register cmake target to ctest
-
rocking5566 authored
* Sync the naming * Sync the test of layernorm with groupnorm * Sync the naming * Minor change for comment and log * [What] Add saveMean and SaveInvVariance in the interface. [Why] These can optimize the backward
-
Adam Osewski authored
* Add reduction across all dims cases. * host softmax: handle all reduce * Test cases when reduced dim is not innermost axis. * Fix syntax. * Test non innermost dim for fp32 and int8 * Group test suites wrt NumReduceDim. * Additionally test failing cases. * Throw error when Rank or NumReduceDims doesn't match arguments. * Check reducedDims has correct values * Move don't reuse DeviceReduceMultiblock IsSupportedArgument method. Instead implement own. (in fact just get rid of one check to enable reduction across inner dimensions). * Reorganize unit tests to better cover use scenarios. * Test input validation * Test reduction of inner dimensions with custom op instances. * Refactor fp32 and int8 unit tests. * Fix FP32 instance template parameters. * Add more instances. * Instances with InSrcVectorDim=0. * Do not initialize and copy data when arg not supported. * ckProfiler Softmax use instance factory. * Refactor device softmax IsSupported. * Additionally add non-polymorphic api functions * Split softmax instances into multiple files. * Fix profiler. * Reorganize tests to reuse profiler and cover edge cases. * Clang-format * I8 Softmax instances along with UT. * Reuse type alias definitions from instance factory header. * Clean included headers * Fix variable names. * Add missing checks in Argument constructor. Co-authored-by:
Adam Osewski <aosewski@amd.com> Co-authored-by:
Anthony Chang <ac.chang@outlook.com>
-
- 28 Oct, 2022 2 commits
-
-
rocking5566 authored
-
Rostyslav Geyyer authored
-
- 27 Oct, 2022 3 commits
-
-
Anthony Chang authored
* reopen masking att instance due to CI is upgraded * re-enable instances previously failed on 9110 * enable ksize-kpadding pair validity test * add non-masked attention+permute test; expose masking boolean to attention kernel handles * disable bench * fix test * move files * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute * format * amend rename * disable bench in test * add mask/no-mask test for non-permute attention kernels * disable broken kernel instance * example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg * interface for bias addition without implementing it * test and profiler running * tidy * mask type determined by enum class * unify example code * move masking specialization to its own header * align formats * extract helper functions * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute * add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization * grouped attention with input/output permute example * format * clean up * refactor acc0 tile visitor Co-authored-by:
shaojiewang <wsjmessi@163.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-
Rostyslav Geyyer authored
* Fix for lwpck-425, update BlockTransferSrcVectorDim * Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim" This reverts commit fd24e280e28ff238b452cfdde58a988affd46461. * Add Batched Gemm int8 test, expect it to fail * Format * Re-add the fix
-
Anthony Chang authored
* prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build * refactor standalone test to use gemm test harness * simplify gemm test * update build script * remove redundant * early return when cmd arg doesn't match * tidy * report failure when result not validated * tidy * Apply suggestions from code review Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
- 25 Oct, 2022 2 commits
-
-
guangzlu authored
* add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp
- 13 Oct, 2022 2 commits
-
-
Adam Osewski authored
* Move kernel implementation files under impl directory. * Update examples paths. * Update device kernel impl include paths. * Update tensor operation instances include paths. * Update profiler and tests include paths. * Clang-format * Update include paths for batched gemm reduce * Refactor UnitTest ConvNDBwdWeight. * Refactor fwd and bwd data convND UT. * Fix used test macro. * Fix include path. * Fix include paths. * Fix include paths in profiler and tests. * Fix include paths. Co-authored-by:Adam Osewski <aosewski@amd.com>
-
rocking5566 authored
* Fix bug of profiler for layernorm * 1. Rename layernorm into normalization 2. Decouple softmax from normalization * clang-format
-
- 07 Oct, 2022 1 commit
-
-
Shaojie WANG authored
* use another instance to check the efficiency * optimize group layer norm * 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and divison into a outer static loop * add more instances to layernorm * add 2 more test cases * remove ignore in generating tuple of vector Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
- 20 Sep, 2022 1 commit
-
-
Shaojie WANG authored
* add lower triangle bmm * init code for tile skipping * functionality right with lower triangle mask * add decoder lower triangular mask calculation * use 7*13 group * fix n2 compute error * attention with lower triangle mask with tile skipping * add template to distinguish masking kernel * rename template and remove default template value * remove lower triangle gemm reference struct * add some comments on example * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels * add test * add test file * add gtest for bmm masking scale softmax bmm permute * clang-format * fix compile error * check lef bottom corner for tile skipping * fix error: check left bottom corner for tile skipping * add k padding * add test and instance for MNK padding * passing a mask struct * fix instances * delete used comments * format Co-authored-by:
danyao12 <yaodan@dc-smc-13.amd.com> Co-authored-by:
Chao Liu <chao.liu2@amd.com>
-