1. 01 Sep, 2021 1 commit
  2. 31 Aug, 2021 3 commits
  3. 30 Aug, 2021 1 commit
  4. 20 Aug, 2021 1 commit
  5. 17 Jul, 2021 2 commits
    • Added more fusion and vectorized kernel for transducer (#1125) · 0c2c6eea
      Nan Zheng authored
      * Added support for fused ReLU and dropout into transducer joint
      
      * Reorganized code selection path in transducer joint fwd
      * Added support for fused ReLU+dropout into transducer joint
      
      * Vectorize transducer loss backward with fused softmax (#3)
      
      * Nanz/transducer loss (#4)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Nanz/transducer loss (#5)
      
      * Vectorize transducer loss backward with fused softmax
      
      * Added a predicate to avoid potential IMA
      
      * Added more predicates to avoid IMAs
      
      * Updated documentation for newly added features.
      
      * Fixed an error in transducer.py
    • Adds small-batch kernels (#1126) · ed719967
      yjk21 authored
  6. 16 Jul, 2021 1 commit
  7. 15 Jun, 2021 2 commits
  8. 26 May, 2021 1 commit
  9. 17 May, 2021 1 commit
  10. 19 Apr, 2021 1 commit
  11. 17 Apr, 2021 3 commits
  12. 16 Apr, 2021 1 commit
  13. 15 Apr, 2021 3 commits
  14. 24 Mar, 2021 2 commits
  15. 23 Feb, 2021 1 commit
  16. 10 Feb, 2021 1 commit
  17. 20 Jan, 2021 1 commit
  18. 17 Dec, 2020 2 commits
  19. 04 Dec, 2020 3 commits
  20. 02 Dec, 2020 1 commit
  21. 01 Dec, 2020 1 commit
  22. 19 Oct, 2020 1 commit
    • Optimize the sync batchnorm by batching the communication (#980) · 8a1ed9e8
      lly-zero-one authored
      In this PR, we optimize the performance of SyncBatchNorm and fix one potential issue in the welford_parallel kernel implementation.
      
      For performance, we batch the mean/var/count all_gather communications together and send them once in the forward path.
      We also batch the all_reduce calls in the backward path.
      We add a contiguous call on the input of the welford_parallel kernel.
      If there is a standard perf benchmark, I would be happy to run it.
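
The batching idea in the message above can be illustrated with a rough, pure-Python sketch. It is not the code from this commit: the helper names (`local_stats`, `pack`, `fake_all_gather`, `merge`) are hypothetical, the "all_gather" is simulated with a plain list instead of a real collective, and the merge step uses the standard parallel (Chan/Welford) variance update as an assumption about what a welford_parallel-style reduction computes.

```python
from statistics import fmean, pvariance


def local_stats(values):
    """Per-rank statistics: (mean, biased variance, count)."""
    return fmean(values), pvariance(values), len(values)


def pack(stats):
    """Pack mean, variance and count into one flat buffer,
    so a single collective call can carry all three."""
    mean, var, count = stats
    return [mean, var, float(count)]


def fake_all_gather(buffers):
    """Stand-in for a real all_gather: returns every rank's buffer.
    (Hypothetical helper; no communication actually happens.)"""
    return list(buffers)


def merge(gathered):
    """Combine per-rank (mean, var, count) triples using the
    parallel (Chan/Welford) update for mean and variance."""
    mean, m2, count = 0.0, 0.0, 0.0
    for r_mean, r_var, r_count in gathered:
        if r_count == 0:
            continue  # an empty rank contributes nothing
        delta = r_mean - mean
        total = count + r_count
        mean += delta * r_count / total
        m2 += r_var * r_count + delta * delta * count * r_count / total
        count = total
    return mean, m2 / count, int(count)


# Three simulated ranks, each holding a different number of samples.
shards = [[1.0, 2.0, 3.0], [10.0, 12.0], [5.0, 5.0, 5.0, 9.0]]
gathered = fake_all_gather(pack(local_stats(s)) for s in shards)
global_mean, global_var, global_count = merge(gathered)
```

The point of packing is that one gathered buffer per rank replaces three separate exchanges for mean, variance, and count; the merged result should match statistics computed over all samples at once.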
  23. 29 Sep, 2020 1 commit
  24. 15 Sep, 2020 1 commit
  25. 14 Sep, 2020 2 commits
  26. 15 Aug, 2020 1 commit
  27. 10 Aug, 2020 1 commit