Commits · aa7ff911d4e6b2a3539ff29b4a05f1200b4250e0 · gaoqiong / MIGraphX

10 Jun, 2022 1 commit

Add vectorized reduce (#1202) · aa7ff911

Paul Fultz II authored Jun 09, 2022



Consolidate the vectorize and preload
Add vectorization to reduction
Co-authored-by: kahmed10 <15948690+kahmed10@users.noreply.github.com>


27 Apr, 2022 1 commit

Add lane reduction (#1180) · 4c72cc95

Paul Fultz II authored Apr 27, 2022

With reductions such as {2048, 2, 1456} on axis 1, this is roughly 22x faster than using our new block_reduce, and it's over 100x faster than our original reduce_sum:

# lane
gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
# block
gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
# original
gpu::reduce_sum[axes={1}]: 6.73456ms
There is some basic logic to pick between lane and block reduce automatically.
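To make the two strategies concrete, here is a minimal CPU-side sketch in Python. It is not the MIGraphX kernel code: the function names, the packed row-major layout, and the `pick_strategy` rule are all assumptions made for illustration, since the commit message does not spell out the actual selection logic. In the lane version one work-item owns one output element and loops serially over the reduced axis; in the block version a group of threads cooperates on a single output.

```python
from math import prod

def lane_reduce_sum(data, outer, r, inner):
    """Lane style: one work-item per output, serial loop over the reduced axis."""
    out = [0.0] * (outer * inner)
    for tid in range(outer * inner):  # on a GPU these iterations run in parallel
        o, i = divmod(tid, inner)
        acc = 0.0
        for k in range(r):
            acc += data[(o * r + k) * inner + i]
        out[tid] = acc
    return out

def block_reduce_sum(data, outer, r, inner, block_size=64):
    """Block style: one block per output; threads build partial sums, then tree-combine."""
    out = [0.0] * (outer * inner)
    for bid in range(outer * inner):
        o, i = divmod(bid, inner)
        partial = [0.0] * block_size
        for t in range(block_size):
            for k in range(t, r, block_size):  # thread t takes every block_size-th element
                partial[t] += data[(o * r + k) * inner + i]
        step = block_size // 2  # block_size is assumed to be a power of two
        while step > 0:
            for t in range(step):
                partial[t] += partial[t + step]
            step //= 2
        out[bid] = partial[0]
    return out

def lane_work_items(lens, axis):
    """Number of output elements, i.e. work-items a lane reduction would launch."""
    return prod(lens) // lens[axis]

def pick_strategy(lens, axis):
    """Hypothetical rule: use lane when the reduced axis is not the innermost one."""
    stride = prod(lens[axis + 1:])  # stride of the reduced axis in a packed layout
    return "lane" if stride > 1 else "block"

data = [float(v) for v in range(2 * 3 * 4)]
lane_out = lane_reduce_sum(data, outer=2, r=3, inner=4)
block_out = block_reduce_sum(data, outer=2, r=3, inner=4)
```

For the shape above, `lane_work_items([2048, 2, 1456], 1)` gives 2981888, which is consistent with the `global=2981888` reported in the lane timing: with only two elements per reduction, a whole block per output leaves almost every thread idle.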


17 Apr, 2022 1 commit

Reduce with runtime compilation (#1150) · f9a5b81e

Paul Fultz II authored Apr 17, 2022

There is a significant improvement on larger tensors, with half being almost 50% faster:

lens: [1024, 384, 768]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.16685ms
gpu::reduce_sum[axes={2}]: 1.73126ms
Also, for non-trivial layouts, this can sometimes be over 2x faster:

lens: [64, 1024, 768, 4]
gpu::code_object[code_object=13832,symbol_name=kernel,global=39321600,local=256,]: 1.1706ms
gpu::reduce_sum[axes={1}]: 2.63375ms
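As a quick check of the two claims, the ratios can be computed directly from the timings listed above. The helper name `speedup` is only for illustration.

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new kernel is than the baseline."""
    return baseline_ms / new_ms

# lens: [1024, 384, 768], axes={2}
half_case = speedup(1.73126, 1.16685)

# lens: [64, 1024, 768, 4], axes={1}
layout_case = speedup(2.63375, 1.1706)

print(f"half: {half_case:.2f}x, non-trivial layout: {layout_case:.2f}x")
```

The first ratio comes out just under 1.5 and the second a little above 2, matching "almost 50% faster" and "over 2x faster".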
Of course, if the stride becomes larger, this speed improvement diminishes due to poor memory access patterns. A lane_reduce instead of a block_reduce is needed for these kinds of kernels. I plan to address that in a future PR.
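The stride effect can be illustrated by listing which element offsets neighbouring threads touch on their first read. This is a rough sketch that assumes a packed row-major layout and a simplified thread-to-element mapping; the helper names are hypothetical and do not come from MIGraphX.

```python
def packed_strides(lens):
    """Element strides of a packed row-major tensor."""
    strides = [1] * len(lens)
    for d in range(len(lens) - 2, -1, -1):
        strides[d] = strides[d + 1] * lens[d + 1]
    return strides

def block_thread_offsets(lens, axis, n_threads=4):
    """First offsets read by neighbouring threads of one block, which split the reduced axis."""
    stride = packed_strides(lens)[axis]
    return [t * stride for t in range(min(n_threads, lens[axis]))]

def lane_thread_offsets(lens, axis, n_threads=4):
    """First offsets read by neighbouring work-items, each owning one output element."""
    strides = packed_strides(lens)
    out_dims = [d for d in range(len(lens)) if d != axis]
    offsets = []
    for tid in range(n_threads):
        rem, off = tid, 0
        for d in reversed(out_dims):  # unravel tid over the non-reduced dimensions
            rem, idx = divmod(rem, lens[d])
            off += idx * strides[d]
        offsets.append(off)
    return offsets

# Reducing a middle axis: block threads are 1456 elements apart, lane work-items are adjacent.
print(block_thread_offsets([2048, 2, 1456], 1))  # [0, 1456]
print(lane_thread_offsets([2048, 2, 1456], 1))   # [0, 1, 2, 3]

# Reducing the innermost axis: the situation is reversed.
print(block_thread_offsets([1024, 384, 768], 2))  # [0, 1, 2, 3]
print(lane_thread_offsets([1024, 384, 768], 2))   # [0, 768, 1536, 2304]
```

Adjacent offsets let neighbouring threads share memory transactions, so under these assumptions a block reduction suits an innermost (stride 1) axis while a lane reduction suits an axis with a large stride.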

Finally, this also includes a MIGRAPHX_GPU_DUMP_ASM environment variable, which prints the assembly when a kernel is compiled.
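One possible way to turn the dump on for a single run is to set the variable only in the environment of a child process. The sketch below assumes that the value `1` enables it; the driver command line in the comment is likewise an assumption and the model path is a placeholder.

```python
import os

def env_with_asm_dump(base_env=None):
    """Return a copy of the environment with MIGRAPHX_GPU_DUMP_ASM set.

    Assumption: the value "1" enables the dump.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MIGRAPHX_GPU_DUMP_ASM"] = "1"
    return env

# Hypothetical usage (command line and model path are placeholders):
# import subprocess
# subprocess.run(["migraphx-driver", "compile", "model.onnx", "--gpu"],
#                env=env_with_asm_dump(), check=True)
```

Building a copy keeps the setting out of the parent process, so later runs in the same session are not flooded with assembly output.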
