- 12 Jan, 2023 2 commits
-
-
Illia Silin authored
* add DEBUG_LOG macro to enable/disable debug output * fix syntax * fix syntax again * fix syntax one more time * remove blank spaces * use ifdefs * add the Print argument * move the definition of DEBUG_LOG to ck.hpp * add the missing argument to Print()
-
Qianfeng authored
* Include cmath when compiling host code in math_v2.hpp * Remove the cmath include from device_base.hpp and device_permute.hpp
-
- 15 Dec, 2022 4 commits
-
-
zjing14 authored
* add mnk padding, support m=0 * clean code * clean code Co-authored-by:Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
-
Illia Silin authored
-
Qianfeng authored
-
Rostyslav Geyyer authored
Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535) * Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances * Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances * Add gemm_add_fastgelu profiler impl * Add padding device_gemm_fastgelu_xdl_c_shuffle instances * Add gemm_fastgelu profiler impl
-
- 14 Dec, 2022 1 commit
-
-
Rostyslav Geyyer authored
-
- 12 Dec, 2022 1 commit
-
-
arai713 authored
* added 2d gridwise elementwise * added 2d version of device elementwise * added example file with updated device elementwise call * added Cmake file * changed NumDim into 2D * fixed compiler issues * fixed indexing for loop step * fixed NumDim dimension error * changed blockID to 2D * updated Grid Desc * updated kernel call * fixed 2d thread indexing * added dimensions for example file * commented out unused code * changed vector load * removed extra code * temporarily removing vector load on 2nd dim * changed vector load back, still causing errors * altered indexing * changed isSupportedArgument for 2D * changed indexing + do/while * fixed isSupportedArgument * changed dimension for debugging * fixed * added testing printouts * testing change * added variables to distribute threads through both dimensions * testing changes * integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise * removed most of the extraneous code, testing with different dimensions * testing * removed debugging print statements * moved 2d elementwise permute into elementwise permute directory * fixed formatting * removed debugging comments from threadwise transfer Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
- 08 Dec, 2022 1 commit
-
-
Illia Silin authored
* apply new K-dimension check in gemm_xdl_cshuffle * add K-dim check to gemm_xdl and batched_gemm_xdl * fix syntax * fix syntax * clean-up the debug output
-
- 07 Dec, 2022 3 commits
-
-
Po Yen Chen authored
* Use smaller tensor size in test * Use an even smaller tensor size * Touch only failing test case inputs
-
Rostyslav Geyyer authored
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
guangzlu authored
Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
- 06 Dec, 2022 1 commit
-
-
Illia Silin authored
* ignore .git folder when doing clang-format * fix syntax * add backslashes before quotes * add path filter for several extensions
-
- 02 Dec, 2022 3 commits
-
-
Anthony Chang authored
* fix bug where scaling may not be applied in some code path * more test * revert accidental example code changes
-
ltqin authored
* start add example * add multiple d fp16 example * device transfer elementwiseop to gridwise * gridwise add multiple d * change example for multiple d * fix spill registers * fix for passthrough element op * fix int8 overflow * change example file name * add instance for dl multiple d * example add DsDataType * remove grouped_convolution_forward_dl.hpp * add header file (was deleted before) * fix unsupported device issue * format * remove passthrough check Co-authored-by: letaoqin <letaoqin@amd.com>
-
Haocong WANG authored
* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test (int4 compile failed) * Add f32_16x16x16_bf16 unit test * Remove int4 related * delete deprecated test Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>
-
- 01 Dec, 2022 1 commit
-
-
Po Yen Chen authored
* Re-structure ckProfiler source files * Rename profiler.cpp to main.cpp * Modularize ckProfiler operations * Add description for profiler operations * Use longer name to avoid name collision * Use macro to delay expansion * Use std::move() to avoid object copying * Prohibit users from calling dtor * Use macro to eliminate redundant code * Make friend function hidden * Add missing include directive <iostream> * Fix wrong include directives * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test Co-authored-by:Qianfeng Zhang <Qianfeng.Zhang@amd.com>
-
- 30 Nov, 2022 2 commits
-
-
rocking5566 authored
* Use gemm_multiple_D instead * Add gemm bias relu quantization example * Add pure gemm quantization example * Add quantization of perchannel conv + bias + relu example * Refine the code * Rename multiplier to requant_scale * Rename the folder * Remove redundant comment * Rename the file. Prepare to add perchannel * Add conv perchannel instance * Move to quantization folder * Add conv perchannel client example * Apply Rangify constructor of HostTensorDescriptor & Tensor<> * Fix merge error
-
Qianfeng authored
* Refine the device batchnorm-backward base API templates and data type assignments * Remove duplicated kernel file * Add batchnorm backward instances and external API * Add batchnorm-backward profiler and tests * Add client example which uses batchnorm backward external API * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory * Loosen the threshold for batchnorm-backward check_err()
-
- 29 Nov, 2022 3 commits
-
-
Anthony Chang authored
* properly return error flag; reveals bug in split-k gemm * fix bug in split k * update split-k test case Co-authored-by:Chao Liu <chao.liu2@amd.com>
-
fsx950223 authored
-
Qianfeng authored
* Implemented batchnorm-backward Blockwise and Multiblock kernels * Add batchnorm-backward device op * Add batchnorm-backward host-reference op * Add batchnorm-backward example * Parameters renaming in batchnorm backward kernels and device op * Change in the example to loosen the threshold for ScaleDiff checking * Add comments to explain the implementation of batchnorm-backward * Parameters renaming again in batchnorm backward kernels * Improve the expression calculation for performance * Add batchnorm backward to README * Add comments to explain inv-variance in batchnorm forward and backward * Renaming the batchnorm forward training and inferring examples * Add/update the comments for batchnorm-backward kernels * Renaming again * Add block_sync_lds between two consecutive blockwise reductions * Move common expression 1/N out of the static_for loops * Add dy_elementwise_op * Renaming in backward example again * Add checking for reduceDims in reference_batchnorm_backward * Update to comments and code format * Rename in the comments * Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c * Add block_sync_lds() between blockwise reduction again * Fix comments again * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
-
- 28 Nov, 2022 1 commit
-
-
Qianfeng authored
Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516)
-
- 25 Nov, 2022 1 commit
-
-
Qianfeng authored
* Update to device_batchnorm_forward base class to include all template parameters for problem description * Add batchnorm forward instances and external api * Add batchnorm forward profiler module which uses the external api * Add some comments in batchnorm_forward example to explain the dimensions in lengths[] * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward * Improvement to the batchnorm infer base API * Add batchnorm forward client example which shows using the batchnorm forward external API * Add test for batchnorm forward * Tuning the batchnorm profiler initialized values and error threshold * Add support for bhalf_t in instances/external api/tests * Add support for int8_t in instances/external api/tests * Add support for double in instances/external api/tests * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances * Checking before running best instance in batchnorm_fwd_nhwc client example * Add checking for YElementwiseOp in batchnorm_forward external API * Add more types in batchnorm forward profiler * Add more test lengths Co-authored-by:rocking5566 <ChunYu.Lai@amd.com>
-
- 20 Nov, 2022 1 commit
-
-
Adam Osewski authored
* FastGelu support for more data types. * AddFastGelu & FastGelu instances. * Client example. * clang-format * Remove unused stride variable. * Add new line at EOF. Co-authored-by:Adam Osewski <aosewski@amd.com>
-
- 17 Nov, 2022 1 commit
-
-
Anthony Chang authored
* workaround bf16 atten fwd issue on gfx908 * typo
-
- 15 Nov, 2022 4 commits
-
-
guangzlu authored
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm * added bf16 tests for batched_gemm_softmax_gemm_permute * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * aligned annotations * modified CMakeLists for examples * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl * use macro to control the instances * added macro control into instances * clang-format some files * changed error tolerance for bf16 * changed index for 10_elementwise_normalization * fixed xdlops code bug in amd_xdlops.hpp Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
ltqin authored
* start add example * add device dl * change launch kernel * change init data method * change example config * add config valid check * add instance for dl bwd * add instance to ckProfiler * reserver to profiler and cmakelist * add instance to ckProfiler2 * change instance f32 config * fix example return value Co-authored-by: letaoqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
Po Yen Chen authored
-
Po Yen Chen authored
We can use this template to eliminate duplicated iterator computation logic. By providing the return type to ck::accumulate_n(), we can avoid type conversion operations.
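As an illustration of the idea in this commit message, here is a minimal sketch of what such a helper could look like. It is not the actual ck::accumulate_n(); the namespace `sketch`, the parameter order, and the helper `leading_volume` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

namespace sketch {

// Hypothetical stand-in for ck::accumulate_n(); the real signature may differ.
// T comes first so that callers can name the result type explicitly.
template <typename T, typename ForwardIterator, typename Size, typename BinaryOperation>
T accumulate_n(ForwardIterator first, Size count, T init, BinaryOperation op)
{
    // The end iterator is computed once here instead of at every call site.
    return std::accumulate(first, std::next(first, count), init, op);
}

// Product of the first three lengths, accumulated directly as std::size_t.
inline std::size_t leading_volume(const std::vector<int>& lengths)
{
    assert(lengths.size() >= 3); // the sketch does no bounds checking itself
    return accumulate_n<std::size_t>(lengths.begin(), 3, 1, std::multiplies<>{});
}

} // namespace sketch
```

Because the result type is given as the first template argument, the initial value `1` is converted to `std::size_t` at the call and the accumulation runs in that type, so no cast is needed on the result.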
-
- 14 Nov, 2022 1 commit
-
-
Po Yen Chen authored
* Rangify STL algorithms This commit adopts rangified std::copy(), std::fill() & std::transform() * Re-write more std::copy() calls * Re-write std::copy() calls in profiler
-
- 11 Nov, 2022 3 commits
-
-
Po Yen Chen authored
* Rangify check_err() By rangifying check_err(), we can compare not only values between std::vector<>s, but also any ranges which have the same value type. * Re-format example code
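To make the "rangified" comparison concrete, the following is a rough sketch of a range-based checker. It is not the library's check_err(); the namespace `sketch`, the absolute-tolerance parameter `atol`, and its default value are assumptions chosen for illustration.

```cpp
#include <cmath>
#include <iterator>
#include <list>
#include <vector>

namespace sketch {

// Hypothetical range-based comparison: accepts any two ranges that work with
// std::begin()/std::end() and whose elements convert to double.
template <typename Range, typename RefRange>
bool check_err(const Range& out, const RefRange& ref, double atol = 1e-5)
{
    auto o = std::begin(out);
    auto r = std::begin(ref);
    for(; o != std::end(out) && r != std::end(ref); ++o, ++r)
    {
        if(std::abs(static_cast<double>(*o) - static_cast<double>(*r)) > atol)
            return false;
    }
    // Both ranges must be exhausted together; otherwise their sizes differ.
    return o == std::end(out) && r == std::end(ref);
}

} // namespace sketch
```

With this shape, a `std::vector<float>` can be checked against a `std::list<float>` or a plain array without first copying either one into a vector.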
-
Po Yen Chen authored
* Add missing ignore expression * Add missing include directive
-
Po Yen Chen authored
* Rangify STL algorithms This commit adopts rangified std::copy(), std::fill() & std::transform() * Rangify check_err() By rangifying check_err(), we can compare not only values between std::vector<>s, but also any ranges which have the same value type. * Allow constructing Tensor<> like a HostTensorDescriptor * Simplify Tensor<> object construction logic * Remove more unnecessary 'HostTensorDescriptor' objects * Re-format example code * Re-write more HostTensorDescriptor ctor call
-
- 10 Nov, 2022 6 commits
-
-
Lauren Wrubleski authored
* Add packages for example and profiler * correct TEST_NAME -> EXAMPLE_NAME
-
Po Yen Chen authored
Allow passing forward range to its call operator
-
guangzlu authored
* add client example for elementwise_normalization * clang format elementwise_layernorm2d.cpp * changed some naming to make it more understandable * changed naming of input into ab_input * fixed bug for threadwise_x_store * add elementwise operation to reference
-
Po Yen Chen authored
* Rename example folder for GroupedConvFwdMultipleD * Unify example codes * Change target names * Add fp16 example for multiple d instance * Re-format common.hpp * Add interface 'DeviceGroupedConvFwd' * Use simpler interface * Move common conv params out * Rename conv fwd client example folder * Add missing include directive * Update grouped conv instance implementations * Simplify ckProfiler (grouped conv forward) * Use GroupedConvFwd to implement client example * Use greater group count in example * Add custom target to group examples * Add extra tag param to instance factory function * Use tag to differentiate factory functions * Add missing tag argument for factory function * Remove inheritance relationship * Remove no-longer used include directive * Add license in front of file
-
Po Yen Chen authored
* Remove redundant CMake setting * Extract common code from files * Rename folder 'convnd' to 'conv' * Use std::array<> to accept compile-time known # of arguments * Fix compilation error of tuning parameter * In example, use same setting as unit-test * Remove no-longer used include directive * Add interface for grouped conv bwd weight * Add group support for conv bwd weight * Add grouped conv bwd weight example * Use group parameter in example * Rename example folder * Remove non-grouped version example source files * Rename device op template * Add group support to convolution backward weight * Remove debug messages * Use smaller group size in example * Use named variable as loop terminate condition * Prettify example output message * Enlarge used grid size * Allow real grid size exceeds expected grid size * Rename interface file * Add client example for group...
-
Po Yen Chen authored
* Remove interface 'DeviceGroupedConvBwdData' * Remove no-longer needed include directive * Rename client example folder
-