- 18 Mar, 2022 1 commit
-
-
turneram authored
Add exclusive and reverse modes to gpu implementation of prefix_scan_sum, which completes support for ONNX op CumSum
-
- 14 Mar, 2022 1 commit
-
-
Shucai Xiao authored
change max number of groups in a kernel to 1B for greater performance
-
- 10 Mar, 2022 4 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 08 Mar, 2022 5 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 07 Mar, 2022 2 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 04 Mar, 2022 5 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 03 Mar, 2022 2 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 02 Mar, 2022 1 commit
-
-
bpickrel authored
Update the base version of clang-format from 5.0 to 10.0
-
- 01 Mar, 2022 4 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 28 Feb, 2022 8 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 26 Feb, 2022 2 commits
-
-
Shucai Xiao authored
-
Shucai Xiao authored
-
- 08 Feb, 2022 3 commits
-
-
Khalique Ahmed authored
-
Khalique Ahmed authored
-
Khalique Ahmed authored
-
- 31 Jan, 2022 1 commit
-
-
Khalique Ahmed authored
-
- 09 Dec, 2021 1 commit
-
-
Shucai Xiao authored
Changed the number of threads in a block from 256 to 128 Increased the max number of blocks in the kernel from 256 to 1M. For the case that the axis is the last dimension, we removed the computation of index since it is not required. With these change, we can get about 2x speedup compared to the develop branch for the softmax op used in the BertSquad model.
-