- 15 Nov, 2022 3 commits
ltqin authored
* start add example
* add device dl
* change launch kernel
* change init data method
* change example config
* add config valid check
* add instance for dl bwd
* add instance to ckProfiler
* reserver to profiler and cmakelist
* add instance to ckProfiler2
* change instance f32 config
* fix example return value

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

Po Yen Chen authored

Po Yen Chen authored
We can use this template to eliminate duplicated iterator-computing logic. By providing a return type to ck::accumulate_n(), we can avoid type-conversion operations.

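For context, a hedged sketch of the pattern this commit describes (the real ck::accumulate_n() signature may differ): letting the caller supply the accumulator/return type keeps the whole fold in that type, so no conversion is needed afterwards.

```cpp
#include <cstddef>

// Illustrative only: an accumulate-over-n helper whose return type is
// given explicitly, so the fold runs entirely in ReturnT.
template <typename ReturnT, typename InputIt>
ReturnT accumulate_n_sketch(InputIt first, std::size_t n, ReturnT init)
{
    for(std::size_t i = 0; i < n; ++i, ++first)
        init = init + static_cast<ReturnT>(*first);
    return init;
}

// e.g. summing n int lengths into a std::size_t total in one pass:
//   std::size_t total = accumulate_n_sketch(lengths.begin(), n, std::size_t{0});
```
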
- 14 Nov, 2022 1 commit
Po Yen Chen authored
* Rangify STL algorithms
  This commit adapts rangified std::copy(), std::fill() & std::transform()
* Re-write more std::copy() calls
* Re-write std::copy() calls in profiler

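The rangified forms take the whole container instead of an iterator pair; a minimal before/after sketch of the three algorithms named above:

```cpp
#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> src{1, 2, 3, 4}, dst(4);

    // before: iterator-pair forms
    // std::copy(src.begin(), src.end(), dst.begin());
    // std::fill(dst.begin(), dst.end(), 0);

    // after: rangified forms (C++20), passing the range directly
    std::ranges::copy(src, dst.begin());
    std::ranges::fill(dst, 0);
    std::ranges::transform(src, dst.begin(), [](int x) { return 2 * x; });
}
```
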
- 11 Nov, 2022 3 commits
Po Yen Chen authored
* Rangify check_err()
  By rangifying check_err(), we can not only compare values between std::vector<>s, but also compare any ranges which have the same value type.
* Re-format example code

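A hedged sketch of what the rangified helper enables (the actual CK check_err() signature and tolerance handling differ): any two input ranges sharing a value type can be compared, not just two std::vector<>s.

```cpp
#include <cmath>
#include <concepts>
#include <ranges>

template <std::ranges::input_range R1, std::ranges::input_range R2>
    requires std::same_as<std::ranges::range_value_t<R1>,
                          std::ranges::range_value_t<R2>>
bool check_err_sketch(const R1& result, const R2& reference, double tol = 1e-5)
{
    auto it  = std::ranges::begin(result);
    auto ref = std::ranges::begin(reference);
    for(; it != std::ranges::end(result) && ref != std::ranges::end(reference);
        ++it, ++ref)
    {
        // element-wise absolute-tolerance comparison
        if(std::abs(static_cast<double>(*it) - static_cast<double>(*ref)) > tol)
            return false;
    }
    return true;
}
```
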
Po Yen Chen authored
* Add missing ignore expression
* Add missing include directive

Po Yen Chen authored
* Rangify STL algorithms
  This commit adapts rangified std::copy(), std::fill() & std::transform()
* Rangify check_err()
  By rangifying check_err(), we can not only compare values between std::vector<>s, but also compare any ranges which have the same value type.
* Allow constructing Tensor<> like a HostTensorDescriptor
* Simplify Tensor<> object construction logic
* Remove more unnecessary 'HostTensorDescriptor' objects
* Re-format example code
* Re-write more HostTensorDescriptor ctor calls

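The Tensor<> construction change, sketched with stand-in types (not the actual CK definitions): the tensor accepts the descriptor's argument form and builds the descriptor internally, so call sites no longer create a HostTensorDescriptor themselves.

```cpp
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <numeric>
#include <vector>

// Stand-in for HostTensorDescriptor: lengths plus a derived element count.
struct DescriptorSketch
{
    std::vector<std::size_t> lengths;
    DescriptorSketch(std::initializer_list<std::size_t> ls) : lengths(ls) {}
    std::size_t element_count() const
    {
        return std::accumulate(lengths.begin(), lengths.end(), std::size_t{1},
                               std::multiplies<>{});
    }
};

// Stand-in for Tensor<>: takes the same lengths argument as the descriptor
// and constructs it internally.
template <typename T>
struct TensorSketch
{
    DescriptorSketch desc;
    std::vector<T> data;

    TensorSketch(std::initializer_list<std::size_t> lengths)
        : desc(lengths), data(desc.element_count())
    {
    }
};

// after the change, call sites can write:
//   TensorSketch<float> a({64, 128});
// instead of first building a descriptor object.
```
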
- 10 Nov, 2022 6 commits
Lauren Wrubleski authored
* Add packages for example and profiler
* correct TEST_NAME -> EXAMPLE_NAME

Po Yen Chen authored
Allow passing a forward range to its call operator

guangzlu authored
* add client example for elementwise_normalization
* clang-format elementwise_layernorm2d.cpp
* changed some naming to make it more understandable
* changed naming of input into ab_input
* fixed bug for threadwise_x_store
* add elementwise operation to reference

Po Yen Chen authored
* Rename example folder for GroupedConvFwdMultipleD
* Unify example codes
* Change target names
* Add fp16 example for multiple d instance
* Re-format common.hpp
* Add interface 'DeviceGroupedConvFwd'
* Use simpler interface
* Move common conv params out
* Rename conv fwd client example folder
* Add missing include directive
* Update grouped conv instance implementations
* Simplify ckProfiler (grouped conv forward)
* Use GroupedConvFwd to implement client example
* Use greater group count in example
* Add custom target to group examples
* Add extra tag param to instance factory function
* Use tag to differentiate factory functions
* Add missing tag argument for factory function
* Remove inheritance relationship
* Remove no-longer-used include directive
* Add license in front of file

Po Yen Chen authored
* Remove redundant CMake setting
* Extract common code from files
* Rename folder 'convnd' to 'conv'
* Use std::array<> to accept compile-time known # of arguments
* Fix compilation error of tuning parameter
* In example, use same setting as unit-test
* Remove no-longer-used include directive
* Add interface for grouped conv bwd weight
* Add group support for conv bwd weight
* Add grouped conv bwd weight example
* Use group parameter in example
* Rename example folder
* Remove non-grouped version example source files
* Rename device op template
* Add group support to convolution backward weight
* Remove debug messages
* Use smaller group size in example
* Use named variable as loop terminate condition
* Prettify example output message
* Enlarge used grid size
* Allow real grid size to exceed expected grid size
* Rename interface file
* Add client example for group...

Po Yen Chen authored
* Remove interface 'DeviceGroupedConvBwdData'
* Remove no-longer-needed include directive
* Rename client example folder

- 03 Nov, 2022 1 commit
guangzlu authored
* add fused addition layernorm
* add fused addition layernorm
* changed CMakeLists
* removed annotations
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete an annotation line
* fixed some bugs to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved intermediate results into lds; it becomes faster in non-sweep-once cases
* changed naming of C into X to make the definition more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confusing comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp

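The "bandwidth computing" bullet refers to the usual effective-bandwidth formula; a sketch with illustrative names (the example's variable names differ): total bytes moved divided by kernel time, with the time unit made explicit.

```cpp
#include <cstddef>

// bytes / 1e9 gives GB; time_ms / 1e3 gives seconds; combined:
//   GB/s = num_bytes / (1e6 * time_ms)
inline float effective_gb_per_sec(std::size_t num_bytes, float time_ms)
{
    return static_cast<float>(num_bytes) / (1.0e6f * time_ms);
}
```
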
- 02 Nov, 2022 7 commits
Anthony Chang authored
* disable gtest discovery to run tests per-program, not per-case
* register cmake target to ctest

rocking5566 authored
* Sync the naming
* Sync the test of layernorm with groupnorm
* Sync the naming
* Minor change for comment and log
* [What] Add saveMean and SaveInvVariance in the interface. [Why] These can optimize the backward pass.

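The [Why] above is the standard observation that the backward pass reuses the forward statistics; in LaTeX, with \mu the saved mean and invVar the saved inverse variance:

```latex
% statistics computed (and saved) by the forward pass:
\mu = \frac{1}{N}\sum_{i} x_i,
\qquad
\mathrm{invVar} = \frac{1}{\sqrt{\sigma^{2} + \varepsilon}},
\qquad
\hat{x}_i = (x_i - \mu)\,\mathrm{invVar}
```

The gradients with respect to the input and the affine parameters are all functions of \hat{x}_i and invVar, so caching both avoids recomputing the reduction in the backward kernel.
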
Anthony Chang authored

Po Yen Chen authored
* Improve example reusability
* Remove no-longer-used file
* Rename folder of grouped_conv_bwd_data example
* Add normal grouped conv bwd example
* Add interface 'DeviceGroupedConvBwdData'
* Prettify comment of device op type arguments
* Add grouped conv2d/conv3d backward data fp16 instances
* Fix wrong template argument
* Add grouped_conv2d_bwd_data client example
* Use simpler expression to calculate memory size
* Fix formatting
* Remove grouped_conv3d_bw_data instances
  Underlying device operator is not ready to handle 3D input
* Remove no-longer-necessary include directive
* Add missing include directive
* Use more realistic conv param in example

Rostyslav Geyyer authored
* Add gridwise gemm pipeline v1/v2 selector
* Pipeline selector working, test-wise add pipeline options to one instance
* Add gemm instances
* Add debug info to DeviceGemmXdl
* Add debug info to DeviceGemmXdl_CShuffle
* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu
* Minor fix
* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm
* set up inter-wave configuration
* use default loop scheduling for supported gemm ops
  For blanket-applying interwave scheduling to all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. This should be discouraged, though, as it is not covered by CI.
* Add enum PipelineVersion
* Update instances
* Format
* Fix the merge conflict
* Add flags to disable added instances
* Test disable flag check
* Disable flag check
* Enable the instances

Co-authored-by: Anthony Chang <ac.chang@outlook.com>

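How the blanket opt-in mentioned in that bullet would look at a call site (hedged; the supported mechanism may be a build-system flag rather than a source-level define):

```cpp
// Discouraged per the commit (not covered by CI): default all supported
// gemm ops to interwave scheduling by defining the macro before any CK
// header is included, or pass the equivalent -D flag to the compiler:
//   -DCK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1
#define CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING 1
```
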
Adam Osewski authored
* Add reduction across all dims cases.
* host softmax: handle all reduce
* Test cases when reduced dim is not innermost axis.
* Fix syntax.
* Test non-innermost dim for fp32 and int8
* Group test suites wrt NumReduceDim.
* Additionally test failing cases.
* Throw error when Rank or NumReduceDims doesn't match arguments.
* Check reducedDims has correct values
* Don't reuse DeviceReduceMultiblock IsSupportedArgument method; instead implement our own (in fact just get rid of one check to enable reduction across inner dimensions).
* Reorganize unit tests to better cover use scenarios.
* Test input validation
* Test reduction of inner dimensions with custom op instances.
* Refactor fp32 and int8 unit tests.
* Fix FP32 instance template parameters.
* Add more instances.
* Instances with InSrcVectorDim=0.
* Do not initialize and copy data when arg not supported.
* ckProfiler Softmax: use instance factory.
* Refactor device softmax IsSupported.
* Additionally add non-polymorphic api functions
* Split softmax instances into multiple files.
* Fix profiler.
* Reorganize tests to reuse profiler and cover edge cases.
* Clang-format
* I8 Softmax instances along with UT.
* Reuse type alias definitions from instance factory header.
* Clean included headers
* Fix variable names.
* Add missing checks in Argument constructor.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

rocking5566 authored
* Add conv2d requant example
* Fix bash error
* Rename example
* 1. Rename gemm quantization 2. Share the requantization lambda function with conv
* Refine declared type
* Add conv bias relu quantization example
* clang-format
* Fix compile error due to merge develop
* Fix CI error
* Extract quantization post operation into another file
* Support quantization for non-piecewise-linear function
* Add instance for conv quantization
* Add convolution quantization factory
* Add convolution quantization client example
* Add more instances with different template parameters
* clang-format
* Sync the naming with develop

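A hedged sketch of the kind of requantization lambda shared between the GEMM and conv quantization examples (illustrative, not the CK definition): scale the int32 accumulator, round, and saturate into int8.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// int32 accumulator -> float scale -> rounded, saturated int8
auto requant_sketch = [](std::int32_t acc, float scale) -> std::int8_t {
    const float y = std::round(static_cast<float>(acc) * scale);
    return static_cast<std::int8_t>(std::clamp(y, -128.0f, 127.0f));
};
```
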
- 31 Oct, 2022 1 commit
ltqin authored
* add device of dl
* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3
* init version for dl conv
* add example (init)
* result right
* disable elementwise operation
* check parameters
* add fp32, int8 example and change check code
* change device file and class name
* add check vector access of C
* add instance
* add to ckProfiler
* add Filter1x1Pad0 instances
* fix ignore error
* fix for CI

Co-authored-by: letaoqin <letaoqin@amd.com>

- 28 Oct, 2022 3 commits
rocking5566 authored

Rostyslav Geyyer authored

Qianfeng authored
* Update to the batchnorm-forward API and base class
* Fix leaked header include in gridwise_set_buffer_value.hpp
* Add kernels and device file for batchnorm-forward Welford supporting both blockwise and multi-block reduction
* Update the batchnorm-forward example to use the new batchnorm-forward device interface
* Change the batchnorm-forward reference to use the sequential Welford method
* Change to assign the workspace into four buffers in the host layer
* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock Welford
* Tiny correction and remove unused file under example/34_batchnorm
* Renaming in the kernel arguments
* Explicitly use ck::math::sqrt in batchnorm-forward kernels
* Add some comments to some kernels
* Tiny fix
* Generalize the data types in reference_batchnorm_forward_nhwc_c
* Use ck::ignore to mark unused parameters
* Move GetReduceCountPerThread functor code from kernel to device
* Remove some unused code in device_batchnorm_forward_impl.hpp
* Tiny fix in batchnorm_forward example
* Move GetReduceCountPerThread() to welford_helper.hpp
* Use separate data types for Scale and Bias
* Renaming in device Op
* Tiny fix in forward example
* Update batchnorm-infer (type splitting, renaming)
* Add time and bandwidth measurement to the batchnorm-forward example
* Add support of elementwise operation for batchnorm forward output
* Reduce object copying by passing objects as reference type
* Tiny change for performance
* Updates for performance again
* Some renamings
* Add GetActualVariance template parameter for ThreadwiseWelfordMerge
* Tiny update in reference batchnorm forward nhwc/c
* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory
* Fuse mean and bias in the normalization calculation

Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>

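The Welford method this commit builds on computes mean and variance in one numerically stable pass; a minimal sequential sketch (illustrative, not the CK ThreadwiseWelford/BlockwiseWelford API):

```cpp
// One-pass mean/variance accumulator (Welford's online algorithm).
struct WelfordSketch
{
    double mean  = 0.0; // running mean
    double m2    = 0.0; // running sum of squared deviations
    long   count = 0;

    void update(double x)
    {
        ++count;
        const double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // note: uses the already-updated mean
    }

    double variance() const { return count > 0 ? m2 / count : 0.0; }
};
```

The multi-block variant merges such partial states across threads and blocks, which appears to be the role of ThreadwiseWelfordMerge in the bullets above.
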
- 27 Oct, 2022 7 commits
Po Yen Chen authored

Anthony Chang authored

Illia Silin authored
* reduce the number of default targets
* re-write the setting of target flags
* move all options to one place
* add new custom target instances for installing CK

Anthony Chang authored
* reopen masking attention instance now that CI is upgraded
* re-enable instances previously failing on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working; add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
* add tensor specialization to template args, since tensor spec 'packed' shows perf parity when permutation isn't needed; remove redundant template args; comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
* fused attention client example
* format

Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

Anthony Chang authored
* reopen masking attention instance now that CI is upgraded
* re-enable instances previously failing on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working; add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
* add tensor specialization to template args, since tensor spec 'packed' shows perf parity when permutation isn't needed; remove redundant template args; comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor

Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

Rostyslav Geyyer authored
* Fix for lwpck-425, update BlockTransferSrcVectorDim
* Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"
  This reverts commit fd24e280e28ff238b452cfdde58a988affd46461.
* Add Batched Gemm int8 test, expect it to fail
* Format
* Re-add the fix

Anthony Chang authored
* prototype 4 layouts
  fix default stride
  all problem sizes
  tidy
  move file
  update build script
  restore old file
  fix build
* refactor standalone test to use gemm test harness
* simplify gemm test
* update build script
* remove redundant
* early return when cmd arg doesn't match
* tidy
* report failure when result not validated
* tidy
* Apply suggestions from code review

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

- 26 Oct, 2022 1 commit
Illia Silin authored

- 25 Oct, 2022 3 commits
Qianfeng authored
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances
* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces
* Remove DeviceSoftmaxImpl's dependency on DeviceReduceMultiblock
* Split the cpp and hpp files for reduction instances to enable more parallel compilation
* Remove the use of macros for declaring reduction instances and instance references
* Update to add_device_reduce_instance_xxxx templated functions
* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to replace the ReduceOpId in defining add_reduce_instance_xxxx() templates
* Change return format

guangzlu authored
* add fused addition layernorm
* add fused addition layernorm
* changed CMakeLists
* removed annotations
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete an annotation line
* fixed some bugs to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved intermediate results into lds; it becomes faster in non-sweep-once cases
* changed naming of C into X to make the definition more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confusing comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp

- 19 Oct, 2022 1 commit
arai713 authored

- 17 Oct, 2022 1 commit
arai713 authored
* adding tensor_permutation example folder
* fixed formatting
* adding tensor_permutation example folder
* fixed formatting
* changed deviceelementwise parameters for outscalar
* removed .swo file
* updated folder/file name
* changed function call in verification for better consistency with host elementwise parameters
* formatted again
* fixed shape in verification function call
* changed verification function call, added definition for nhwc
* added elementwise permute example
* updated CMakeLists file in folder
* Delete CmakeLists.txt
* Delete tensor_permute.cpp
* first version of 2d gridwise_elementwise kernel
* temporary fix for stride problem
* formatting
* format
* changed directory name
* Delete gridwise_elementwise_2d.hpp
* Delete CMakeLists.txt
* Delete extra file
* delete extra file
* got rid of extraneous code
* added 2d device elementwise file
* deleted accidentally added file
* update
* stride values generalized with equations
* updated stride for output matrix
* Update CMakeLists.txt
* removed extraneous commented code
* removed shape_nchw vector, replaced with GetLength for each dimension
* changed vector load in kernel call
* removed extra space in CMake

- 13 Oct, 2022 2 commits
Adam Osewski authored
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>

rocking5566 authored
* Fix bug of profiler for layernorm
* 1. Rename layernorm into normalization 2. Decouple softmax from normalization
* clang-format