    Add lane reduction (#1180) · 4c72cc95
    Paul Fultz II authored
    For reductions such as {2048, 2, 1456} on axis 1, this is roughly 22x faster than our new block_reduce, and it's just over 100x faster than our original reduce_sum:
    
    # lane
    gpu::code_object[code_object=13736,symbol_name=kernel,global=2981888,local=1024,]: 0.0672928ms
    # block
    gpu::code_object[code_object=13800,symbol_name=kernel,global=39321600,local=64,]: 1.46072ms
    # original
    gpu::reduce_sum[axes={1}]: 6.73456ms
    There is some basic logic to pick between lane and block reduce automatically.
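    As an illustration only, the choice between the two strategies could be sketched as below. This is not the code from this commit: pick_reduce_algo, lane_reduce_sum, and the threshold of 64 are hypothetical names and values chosen for the example. The sketch assumes that a lane reduction gives each thread one output element and sums over the reduced axis serially, which is consistent with the lane kernel's global size above (2048 * 1456 = 2981888).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names; not the actual API or heuristic from this commit.
enum class reduce_algo
{
    lane, // each thread serially reduces one output element
    block // a whole workgroup cooperates on one output element
};

// Assumed heuristic: when few elements are reduced per output, a cooperative
// block reduction leaves most threads idle, so prefer one lane per output.
// The default threshold is an arbitrary placeholder, not a tuned value.
inline reduce_algo pick_reduce_algo(std::size_t reduce_size, std::size_t lane_threshold = 64)
{
    return reduce_size <= lane_threshold ? reduce_algo::lane : reduce_algo::block;
}

// CPU emulation of a lane-style sum over axis 1 of a row-major
// {outer, r, inner} tensor. Each iteration of the outer loop plays the
// role of one GPU thread.
inline std::vector<float> lane_reduce_sum(const std::vector<float>& in,
                                          std::size_t outer,
                                          std::size_t r,
                                          std::size_t inner)
{
    std::vector<float> out(outer * inner, 0.0f);
    for(std::size_t lane = 0; lane < out.size(); ++lane)
    {
        const std::size_t o = lane / inner; // index along axis 0
        const std::size_t i = lane % inner; // index along axis 2
        float acc           = 0.0f;
        for(std::size_t k = 0; k < r; ++k)  // serial walk along the reduced axis
            acc += in[(o * r + k) * inner + i];
        out[lane] = acc;
    }
    return out;
}
```

    Under these assumptions, the {2048, 2, 1456} case reduces only 2 elements per output, so pick_reduce_algo(2) selects the lane path.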
reduce.cpp 5.66 KB