    Softmax perf optimization (#1014) · 2e337c7f
    Shucai Xiao authored
    Changed the number of threads per block from 256 to 128.
    Increased the maximum number of blocks in the kernel from 256 to 1M.
    When the axis is the last dimension, removed the index computation, since it is not required.
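
To illustrate why the index computation can be skipped when the axis is the last dimension, here is a minimal CPU sketch. It is not the code from softmax.cpp: the function name `softmax_last_axis` and the flattened `[batch, n]` row-major layout are assumptions made for illustration. Because each row is contiguous in memory, a base offset plus `j` is enough to address every element, so no multi-index to linear-offset conversion is needed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from softmax.cpp): softmax over the last
// dimension of a row-major tensor flattened to [batch, n].
std::vector<float> softmax_last_axis(const std::vector<float>& in, std::size_t n)
{
    std::vector<float> out(in.size());
    const std::size_t batch = (n == 0) ? 0 : in.size() / n;
    for(std::size_t b = 0; b < batch; ++b)
    {
        // Each row is contiguous, so a base pointer plus j addresses every
        // element; no general multi-index computation is required.
        const float* row = in.data() + b * n;
        float* dst       = out.data() + b * n;

        // Subtract the row maximum before exponentiating for numerical stability.
        const float max_val = *std::max_element(row, row + n);
        float sum           = 0.0f;
        for(std::size_t j = 0; j < n; ++j)
        {
            dst[j] = std::exp(row[j] - max_val);
            sum += dst[j];
        }
        // Normalize so that each row sums to 1.
        for(std::size_t j = 0; j < n; ++j)
            dst[j] /= sum;
    }
    return out;
}
```

In a GPU kernel the same idea would presumably apply per block or per thread, with the row base offset derived directly from the block index; that mapping is an assumption here, not something the commit message states.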
    
    With these changes, the softmax op used in the BertSquad model runs about 2x faster than on the develop branch.
softmax.cpp 3.37 KB