Reduction in Composable Kernel (#82)

* Initial adding of generic reduction
* Initial adding of generic reduction ...
* Updates to make compiling done
* clang-format all files
* clang-format some files again
* Renaming in profiler/include/profile_reduce.hpp
* Updates and make BlockWise cases passed
* Updates and make ThreadWise and MultiBlockTwoCall cases passed
* Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
* Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
* format
* adding pooling
* added max and average pooling
* comment out cout and kernel timing
* Tiny simplification in profiler/reduce_profiler.cpp
* Add example for reduce_blockwise
* Tiny updates
* Change to pass the ElementWiseOp from device layer to kernel
* Fix the vectorDim and vectorSize in Device layer
* Enable vector load on both dim0 and dim1 for Threadwise method
* Tiny updates...

Commit e17c0d80, authored Mar 06, 2022 by Qianfeng, committed by GitHub Mar 05, 2022
20 changed files
--- a/device_operation/include/device_reduce_blockwise.hpp
+++ b/device_operation/include/device_reduce_blockwise.hpp
--- a/device_operation/include/device_reduce_blockwise_second_call.hpp
+++ b/device_operation/include/device_reduce_blockwise_second_call.hpp
--- a/device_operation/include/device_reduce_common.hpp
+++ b/device_operation/include/device_reduce_common.hpp
--- a/device_operation/include/device_reduce_instance.hpp
+++ b/device_operation/include/device_reduce_instance.hpp
--- a/device_operation/include/device_reduce_instance_blockwise.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_f16_f16_f16.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_f16_f16_f16.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_f16_f32_f16.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_f16_f32_f16.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_f32_f32_f32.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_f32_f32_f32.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_f32_f64_f32.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_f32_f64_f32.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_f64_f64_f64.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_f64_f64_f64.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call_f16_f16_f16.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call_f16_f16_f16.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call_f32_f32_f16.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call_f32_f32_f16.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call_f32_f32_f32.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call_f32_f32_f32.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call_f64_f64_f32.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call_f64_f64_f32.hpp
--- a/device_operation/include/device_reduce_instance_blockwise_second_call_f64_f64_f64.hpp
+++ b/device_operation/include/device_reduce_instance_blockwise_second_call_f64_f64_f64.hpp
--- a/device_operation/include/device_reduce_instance_impl_common.hpp
+++ b/device_operation/include/device_reduce_instance_impl_common.hpp
--- a/device_operation/include/device_reduce_instance_multiblock_atomic_add.hpp
+++ b/device_operation/include/device_reduce_instance_multiblock_atomic_add.hpp
--- a/device_operation/include/device_reduce_instance_multiblock_atomic_add_f16_f32_f32.hpp
+++ b/device_operation/include/device_reduce_instance_multiblock_atomic_add_f16_f32_f32.hpp
--- a/device_operation/include/device_reduce_instance_multiblock_atomic_add_f32_f32_f32.hpp
+++ b/device_operation/include/device_reduce_instance_multiblock_atomic_add_f32_f32_f32.hpp