1. 07 May, 2020 2 commits
    • Chaitanya Sri Krishna Lolla's avatar
      [Upstream] IFU 05072020 (#4) · e85a1d4b
      Chaitanya Sri Krishna Lolla authored
      
      
      * fix dropout scaling from p to 1/(1-p) (#816); a sketch of the inverted-dropout scaling follows this entry
      Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
      
      * Improvements to apex.mlp (#804)
      
      * update fused bias relu backward kernel
      
      * add support for not requiring the first layer's dgrad
      
      * fix bug: wrong layer used in the requires-grad check
      
      * add infrastructure for optional bias and activation; currently only supports no bias and no relu
      
      * make bias and relu optional separately
      
      * add sigmoid activation option (a Python-level sketch of the optional bias/activation interface follows this entry)
      
      * enable wider load/store for multi_tensor_apply kernels (#763)
      
      * modify MTA axpby for wider load/store
      
      * Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads
      
      * Changes to make xentropysoftmax load/store vectorized when possible: (#725)
      
      * Changes to make xentropysoftmax load/store vectorized when possible:
      Increase the default ILP so that each thread handles 16 bytes of data per step
      Make each thread load/store the longest vector possible
      Make the unroll case handle adjacent data instead of strided...
      e85a1d4b
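      The first item above refers to inverted dropout: kept activations are scaled by 1/(1-p) at training time rather than by p, so no rescaling is needed at inference. A minimal sketch of that scaling (illustrative only, not apex's fused kernel):

          import torch

          def inverted_dropout(x, p=0.5, training=True):
              """Illustrative inverted dropout: scale surviving activations by 1/(1-p)."""
              if not training or p == 0.0:
                  return x
              # Bernoulli keep-mask with keep probability (1 - p).
              mask = (torch.rand_like(x) >= p).to(x.dtype)
              # Dividing by (1 - p) keeps the expected activation unchanged.
              return x * mask / (1.0 - p)

      The apex.mlp items add optional bias and a selectable activation (none, relu, or sigmoid). A rough Python-level sketch of that interface, with hypothetical names standing in for the fused CUDA extension:

          import torch
          import torch.nn.functional as F

          def mlp_layer(x, weight, bias=None, activation="none"):
              """Hypothetical per-layer forward mirroring the optional bias/activation flags."""
              out = F.linear(x, weight, bias)  # bias=None skips the bias add
              if activation == "relu":
                  out = F.relu(out)
              elif activation == "sigmoid":
                  out = torch.sigmoid(out)
              elif activation != "none":
                  raise ValueError(f"unsupported activation: {activation}")
              return out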
  2. 28 Apr, 2020 1 commit
  3. 23 Apr, 2020 1 commit
  4. 22 Apr, 2020 2 commits
    • Deyu Fu's avatar
    • Vinicius Reis's avatar
      Fix LARC with mixed precision (#793) · 2ec84ebd
      Vinicius Reis authored
      The LARC optimizer wraps an underlying optimizer and then needs to be passed
      to amp.initialize for mixed precision. There were three different crashes in this
      situation; this change fixes all of them and adds a unit test. A usage sketch of
      the pattern follows this entry.
      
      I don't know if the 'LARC' in sys.modules check ever worked. In my setup, the
      entry in sys.modules is 'apex.parallel.LARC'. Checking if the variable is
      defined seems more reliable though.
      2ec84ebd
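      A minimal usage sketch of the pattern this fix targets: LARC wrapping a base optimizer, with the wrapped optimizer then handed to amp.initialize. The model and opt_level here are placeholders, not part of the commit:

          import torch
          from apex import amp
          from apex.parallel.LARC import LARC

          # Placeholder model; the point is the wrapping order, not the network.
          model = torch.nn.Linear(16, 4).cuda()

          # LARC wraps an underlying optimizer ...
          base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
          optimizer = LARC(base_optimizer)

          # ... and the wrapped optimizer is passed to amp.initialize for mixed precision.
          model, optimizer = amp.initialize(model, optimizer, opt_level="O1")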
  5. 20 Apr, 2020 2 commits
  6. 13 Apr, 2020 1 commit
  7. 05 Apr, 2020 2 commits
  8. 03 Apr, 2020 4 commits
  9. 02 Apr, 2020 1 commit
  10. 01 Apr, 2020 2 commits
  11. 31 Mar, 2020 2 commits
  12. 25 Mar, 2020 1 commit
    • msbaines's avatar
      Fix contrib fused_adam to work correctly with multi-GPU (#752) · 8fac3a72
      msbaines authored
      
      
      The CUDA kernel used by fused_adam was launched on the default stream
      of the default device. The kernel needs to use the same device as
      the parameter tensor.
      
      Fixed by using a context manager to set the correct default device (a sketch of
      the idea follows this entry). For the use_mt case, an error is raised; alternatively,
      the use_mt case could launch one kernel per CUDA device.
      
      The non-contrib version will also need to be fixed.
      Co-authored-by: Mandeep Singh Baines <msb@fb.com>
      8fac3a72
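      A sketch of the device-guarding idea described above. The launcher name and argument list are hypothetical stand-ins for the fused CUDA extension call; the real fix lives in the contrib fused_adam optimizer:

          import torch

          def launch_fused_adam_step(param, exp_avg, exp_avg_sq, grad, fused_step_fn):
              """Run the fused kernel on the parameter's own device (illustrative only)."""
              # Without the guard, the extension launches on the current (default)
              # device, which is wrong when param lives on another GPU.
              with torch.cuda.device(param.device):
                  fused_step_fn(param, exp_avg, exp_avg_sq, grad)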
  13. 23 Mar, 2020 2 commits
  14. 21 Mar, 2020 2 commits
  15. 20 Mar, 2020 3 commits
  16. 17 Mar, 2020 2 commits
  17. 11 Mar, 2020 2 commits
  18. 02 Mar, 2020 1 commit
  19. 27 Feb, 2020 1 commit
  20. 25 Feb, 2020 3 commits
  21. 24 Feb, 2020 1 commit
    • Kevin Stephano's avatar
      Change to Multihead Attention to allow Batched GEMMs larger than 64K. (#728) · 1733946a
      Kevin Stephano authored
      * Adding C++ Multihead Attention implementation to contrib.
      
      * Add reference test that at least works for forward.
      
      * Remove CublasLt support from multihead attention.
      
      * Add new Python version of self attention.
      
      * Update python model of MHA with backward pass.
      
      * Fixed Output Linear connection in MHA.
      
      * Clean up compiles and add documentation to PySelfAttention.
      
      * Add Encdec Python version of multihead attention. Clean up files.
      
      * Tests for self and encdec multihead attention.
      
      * Add reference pytorch implementation of attention with norm and add.
      
      * Add cutlass branch definition.
      
      * Add cutlass download to compile.
      
      * Add norm/add tests.
      
      * Add biases to pytorch python versions.
      
      * Add tests and fix issues with python version of attention masking.
      
      * Create README.md
      
      * Update README.md
      
      * Update README.md
      
      * Update perf test parameters.
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Add files via upload
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Fix matmul1 output tensor size. Fix tests that missed the issue.
      
      * Allow for Z dimensions of 64K and greater on batched GEMMs (see the chunking sketch after this entry).
      
      * remove redundant imports
      
      * general cleanup, remove deprecated or unused functions
      1733946a
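      The 64K figure above most likely corresponds to CUDA's gridDim.z limit of 65535, which caps how many batches a kernel can index through blockIdx.z in a single launch. The commit lifts that limit inside the fused attention kernels; the sketch below only illustrates the general chunking idea at the PyTorch level and is not the commit's implementation:

          import torch

          def chunked_bmm(a, b, max_batch=65535):
              """Batched matmul split into chunks that stay under a per-launch batch cap.

              a: (B, M, K), b: (B, K, N) -> (B, M, N)
              """
              outputs = []
              for start in range(0, a.size(0), max_batch):
                  outputs.append(torch.bmm(a[start:start + max_batch],
                                           b[start:start + max_batch]))
              return torch.cat(outputs, dim=0)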
  22. 15 Feb, 2020 1 commit
  23. 10 Feb, 2020 1 commit