- 30 May, 2020 4 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
- 29 May, 2020 3 commits
  - Kevin Stephano authored
  - Burc Eryilmaz authored
    Fuses dropout and softmax in the backward pass, adds bias support to the C++ MHA, adds additive-mask support, and separates the Q/K/V parameters (#854). Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
  - Kexin Yu authored
    Make FusedLAMB async
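The fused dropout-softmax backward commit above combines two chain-rule steps (undoing dropout scaling, then applying the softmax Jacobian) into a single pass over each row. A minimal pure-Python sketch of the underlying math — not the actual CUDA kernel, and all names here are illustrative:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fused_dropout_softmax_backward(grad_out, probs, mask, keep_prob):
    """One pass per row: backward through dropout using the saved mask,
    then the softmax Jacobian dx_i = p_i * (g_i - sum_j g_j * p_j)."""
    # Dropout backward: gradient flows only where the mask kept the value,
    # rescaled by 1/keep_prob to match inverted-dropout forward scaling.
    g = [go * m / keep_prob for go, m in zip(grad_out, mask)]
    # Softmax backward, using the saved probabilities from the forward pass.
    dot = sum(gi * pi for gi, pi in zip(g, probs))
    return [pi * (gi - dot) for pi, gi in zip(probs, g)]
```

Because the softmax Jacobian is applied to a gradient that sums against probabilities summing to one, the resulting row gradient always sums to zero — a quick sanity check for any fused implementation.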
- 28 May, 2020 1 commit
  - Max V. Irgiznov authored
- 27 May, 2020 1 commit
  - Kevin Stephano authored
    Update softmax in multihead attention to use the current CUDA stream instead of the default CUDA stream (#843)
    * Adding C++ Multihead Attention implementation to contrib.
    * Add reference test that at least works for forward.
    * Remove CublasLt support from multihead attention.
    * Add new Python version of self attention.
    * Update Python model of MHA with backward pass.
    * Fixed output linear connection in MHA.
    * Clean up compiles and add documentation to PySelfAttention.
    * Add encdec Python version of multihead attention. Clean up files.
    * Tests for self and encdec multihead attention.
    * Add reference PyTorch implementation of attention with norm and add.
    * Add cutlass branch definition.
    * Add cutlass download to compile.
    * Add norm/add tests.
    * Add biases to PyTorch Python versions.
    * Add tests and fix issues with Python version of attention masking.
    * Create README.md
    * Update README.md (repeated edits)
    * Update perf test parameters.
    * Add files via upload
    * Fix matmul1 output tensor size. Fix tests that missed the issue.
    * Allow for Z dimensions of 64K and greater on batched GEMMs.
    * Remove redundant imports.
    * General cleanup; remove deprecated or unused functions.
    * Update multihead attention's softmax to use the current stream instead of the default stream.
    * Fix setup.py that got messed up in merge with upstream.
    * Update multihead attention strided batched GEMMs to use the current stream instead of the default.
    Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 23 May, 2020 1 commit
  - Kexin Yu authored
- 22 May, 2020 7 commits
  - Kexin Yu authored
  - Kexin Yu authored
  - Thor Johnsen authored
    Bug fix
  - Thor Johnsen authored
  - Kexin Yu authored
  - Kexin Yu authored
  - Kexin Yu authored
- 21 May, 2020 1 commit
  - Kexin Yu authored
- 19 May, 2020 1 commit
  - Kexin Yu authored
    Use global gradient clipping in FusedLAMB and add an option for using NVLAMB
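Global gradient clipping, as referenced in the FusedLAMB commit above, rescales all gradients by a single factor derived from their combined L2 norm, rather than clipping each tensor independently. A minimal pure-Python sketch of the idea (illustrative only — not apex's multi-tensor implementation):

```python
import math

def clip_global_norm(grads, max_norm):
    """Scale every gradient list in-place by min(1, max_norm / global_norm),
    where global_norm is the L2 norm over ALL gradients combined.
    `grads` is a list of flat lists of floats, one per parameter tensor."""
    total_sq = sum(g * g for grad in grads for g in grad)
    global_norm = math.sqrt(total_sq)
    if global_norm > max_norm:
        # One shared scale factor preserves the relative direction of the
        # full gradient vector, unlike per-tensor clipping.
        scale = max_norm / global_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return global_norm
```

Using one shared scale keeps the overall update direction intact, which matters for layer-wise-adaptive optimizers like LAMB, where per-layer trust ratios are computed after clipping.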
- 14 May, 2020 1 commit
  - Andrew Tulloch authored
- 13 May, 2020 1 commit
  - Andrew Sears authored
    Signed-off-by: asears <asears@users.noreply.github.com>
- 12 May, 2020 2 commits
  - Thor Johnsen authored
    Reversible fused Adam with mt support
  - Thor Johnsen authored
- 08 May, 2020 1 commit
  - Thor Johnsen authored
- 07 May, 2020 2 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
- 06 May, 2020 3 commits
  - Thor Johnsen authored
  - Thor Johnsen authored
  - Thor Johnsen authored
- 05 May, 2020 1 commit
  - Thor Johnsen authored
- 04 May, 2020 1 commit
  - Thor Johnsen authored
- 02 May, 2020 3 commits
- 01 May, 2020 4 commits
  - Kexin Yu authored
  - Kexin Yu authored
  - Deyu Fu authored
    Changes to make xentropy softmax load/store vectorized when possible:
    * Increase the default ILP so that each thread handles 16 bytes of data in one step.
    * Make each thread load/store the longest vector possible.
    * Make the unroll case handle adjacent data instead of strided data, so the access order matches the vector case.
    * Add a shift for the unaligned case; remove accesses aligned to fewer than 16 bytes.
  - Kexin Yu authored
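The xentropy commit above changes the unrolled (non-vector) fallback path so that each thread touches adjacent elements, producing the same element order as the 16-byte vectorized path. A pure-Python sketch of the two indexing schemes (illustrative only; `ilp=4` mimics one `float4`-sized load per thread per step):

```python
def adjacent_indices(tid, n_threads, ilp, n):
    """New scheme: each thread reads a contiguous chunk of `ilp` elements
    per step, matching the element order of a vector load."""
    out = []
    step = n_threads * ilp  # elements covered by the whole thread block per step
    for base in range(tid * ilp, n, step):
        out.extend(range(base, min(base + ilp, n)))
    return out

def strided_indices(tid, n_threads, ilp, n):
    """Old scheme: `ilp` accesses per step, each separated by `n_threads`,
    so a thread's elements are scattered rather than contiguous."""
    out = []
    step = n_threads * ilp
    for base in range(tid, n, step):
        out.extend(i for i in (base + k * n_threads for k in range(ilp)) if i < n)
    return out
```

Both schemes partition the same range of elements across threads; only the per-thread access pattern differs, which is what lets the adjacent scheme be swapped for genuine vector loads when the pointer is 16-byte aligned.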
- 30 Apr, 2020 2 commits