Commits · 80e05267417f948e4f7e63c0fe807106d9a0c0ef · yangql / composable_kernel-1


17 Jan, 2023 1 commit

Reduction external API and client examples (#493) · 80e05267

Qianfeng authored Jan 17, 2023



* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the function call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>


14 Nov, 2022 1 commit

Rangify STL algorithms (#438) · dc663fae

Po Yen Chen authored Nov 15, 2022

* Rangify STL algorithms

This commit adopts rangified std::copy(), std::fill() & std::transform()

* Re-write more std::copy() calls

* Re-write std::copy() calls in profiler
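
As a rough illustration of what "rangifying" an STL algorithm means in the commit above, the sketch below wraps std::copy() so that it accepts a whole range instead of an iterator pair. The namespace `sketch` and the helper `copy_to_vector` are hypothetical names chosen for this example; they are not taken from the composable_kernel sources, whose actual wrappers may differ.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

namespace sketch { // hypothetical namespace, not the library's own

// Range-taking wrapper: forwards to the iterator-pair std::copy().
template <typename InputRange, typename OutputIterator>
OutputIterator copy(const InputRange& range, OutputIterator out)
{
    return std::copy(std::begin(range), std::end(range), out);
}

} // namespace sketch

// Hypothetical helper that exercises the wrapper: copies a vector into a new one.
inline std::vector<int> copy_to_vector(const std::vector<int>& src)
{
    std::vector<int> dst;
    sketch::copy(src, std::back_inserter(dst));
    return dst;
}
```

A call site then shrinks from `std::copy(v.begin(), v.end(), out)` to `sketch::copy(v, out)`, which is presumably the kind of rewrite the "Re-write more std::copy() calls" bullets refer to.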


11 Nov, 2022 1 commit

Rangify constructor of HostTensorDescriptor & Tensor<> (#445) · 4a2a56c2

Po Yen Chen authored Nov 12, 2022

* Rangify STL algorithms

This commit adopts rangified std::copy(), std::fill() & std::transform()

* Rangify check_err()

By rangifying check_err(), we can compare not only std::vector<>s
but also any ranges that have the same value type.

* Allow constructing Tensor<> like a HostTensorDescriptor

* Simplify Tensor<> object construction logic

* Remove more unnecessary 'HostTensorDescriptor' objects

* Re-format example code

* Re-write more HostTensorDescriptor ctor calls


25 Oct, 2022 1 commit

Update to the Reduction API and instances (#476) · dda3a0a1

Qianfeng authored Oct 25, 2022

* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances

* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces

* Remove DeviceSoftmaxImpl's dependency on DeviceReduceMultiblock

* Split the cpp and hpp files for reduction instances to enable more parallel compiling

* Remove the use of macros for declaring reduction instances and instance references

* Update to add_device_reduce_instance_xxxx templated functions

* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to replace the ReduceOpId in defining add_reduce_instance_xxxx() templates

* Change return format
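
To make the std::vector to std::array change for lengths and strides more concrete, here is a small sketch of an interface whose tensor rank is fixed at compile time. The function `packed_strides` and the `sketch` namespace are hypothetical and are not part of the reduction device interface; they only show how a std::array parameter ties the number of dimensions to a template argument.

```cpp
#include <array>
#include <cstddef>

namespace sketch { // hypothetical namespace, not the library's own

// Computes row-major (packed) strides for a tensor of the given lengths.
// Rank is a compile-time constant, so a length/stride count mismatch
// becomes a compile error rather than a run-time check.
template <std::size_t Rank>
std::array<std::size_t, Rank>
packed_strides(const std::array<std::size_t, Rank>& lengths)
{
    std::array<std::size_t, Rank> strides{};
    std::size_t running = 1;
    // Walk from the innermost (last) dimension outwards.
    for(std::size_t i = Rank; i-- > 0;)
    {
        strides[i] = running;
        running *= lengths[i];
    }
    return strides;
}

} // namespace sketch
```

Compared with std::vector, the array form also avoids a heap allocation for each descriptor, which is one plausible motivation for the change, although the commit message does not state it.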


13 Oct, 2022 1 commit

Refactor device op implementations into `impl` subdirectory. (#420) · 30480288

Adam Osewski authored Oct 13, 2022



* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>


25 Aug, 2022 1 commit

Add int4 reduction examples (#372) · d520d0cf

Qianfeng authored Aug 26, 2022

* Add int4 reduction examples

* Keep all uses of int4_t inside the preprocessor condition check


13 Aug, 2022 1 commit

Add examples for reduction fp16/fp32/bf16/int8/fp64 for 3d/4d/5d (#342) · 14932e8d

Qianfeng authored Aug 13, 2022

* Update the reduce_blockwise example to support user specified data type and input+reducing dimensions

* Add examples for using reduce_multiblock_atomic_add

* Add more running examples to the default command-line

* Remove unnecessary header includes

* Update to the example README.md
