- 28 May, 2020 1 commit
Max V. Irgiznov authored
- 27 May, 2020 1 commit
Kevin Stephano authored
Update Softmax in multihead attention to use the current CUDA stream instead of the default CUDA stream. (#843)
* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update Python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention. Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference PyTorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to PyTorch Python versions.
* Add tests and fix issues with Python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix matmul1 output tensor size. Fix tests that missed issue.
* Allow for Z dimensions of 64K and greater on batched GEMMs.
* Remove redundant imports.
* General cleanup, remove deprecated or unused functions.
* Update Multihead Attention's softmax to use the current stream instead of the default stream.
* Fix setup.py that got messed up in merge with upstream.
* Update Multihead Attention strided batched gemms to use the current stream instead of the default.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
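The stream change above matters because an extension kernel hard-wired to the default stream ignores whatever stream the caller has made current. A minimal sketch of that user-visible contract, using plain PyTorch ops as a stand-in for the extension's softmax kernel (not apex's code):

```python
import torch

# Work queued inside a torch.cuda.stream() context runs on that "current"
# stream; a kernel launched on the default stream would run out of order
# with respect to it.
x = torch.randn(8, 16, 512, 512, device="cuda")
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    assert torch.cuda.current_stream().cuda_stream == side_stream.cuda_stream
    y = torch.softmax(x, dim=-1)   # launched on side_stream
side_stream.synchronize()          # wait for the softmax before consuming y
```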
- 22 May, 2020 2 commits
Thor Johnsen authored
Bug fix
Thor Johnsen authored
- 19 May, 2020 1 commit
Kexin Yu authored
Use global gradient clipping in FusedLAMB & add option for using NVLAMB
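A hedged usage sketch of the options this commit describes; the constructor argument names (`max_grad_norm`, `use_nvlamb`) are assumed from the commit description rather than checked against the exact signature:

```python
import torch
from apex.optimizers import FusedLAMB

model = torch.nn.Linear(1024, 1024).cuda()
# max_grad_norm: clip by the global (all-parameter) gradient norm before the update.
# use_nvlamb: switch to the NVLAMB variant of the update rule.
optimizer = FusedLAMB(model.parameters(), lr=2e-3,
                      max_grad_norm=1.0, use_nvlamb=True)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```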
- 14 May, 2020 1 commit
Andrew Tulloch authored
- 13 May, 2020 1 commit
Andrew Sears authored
Signed-off-by: asears <asears@users.noreply.github.com>
- 12 May, 2020 2 commits
Thor Johnsen authored
Reversible fused Adam with multi-tensor (mt) support
Thor Johnsen authored
- 08 May, 2020 1 commit
Thor Johnsen authored
- 07 May, 2020 2 commits
Thor Johnsen authored
Thor Johnsen authored
- 06 May, 2020 3 commits
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
- 05 May, 2020 1 commit
Thor Johnsen authored
- 04 May, 2020 1 commit
Thor Johnsen authored
- 02 May, 2020 3 commits
- 01 May, 2020 4 commits
Kexin Yu authored
https://github.com/NVIDIA/apex
Kexin Yu authored
Deyu Fu authored
* Changes to make xentropy softmax load/store vectorized when possible:
* Increase default ILP so that each thread handles 16 bytes of data in one step.
* Make each thread load/store the longest vector possible.
* Make the unroll case handle adjacent data instead of strided, so the ordering matches the vectorized case.
* Add a shift for the non-aligned case; remove accesses aligned to less than 16 bytes.
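A small illustration of the general pattern (plain PyTorch, an assumption rather than apex's exact check): a 16-byte vectorized path is only usable when the base pointer and the per-row byte count are both 16-byte aligned, which is why a shifted/scalar fallback is needed.

```python
import torch

logits = torch.randn(128, 30522, device="cuda", dtype=torch.half)
row_bytes = logits.size(1) * logits.element_size()   # bytes per row
base_ok = logits.data_ptr() % 16 == 0                # base pointer alignment
rows_ok = row_bytes % 16 == 0                        # every row start stays aligned
print(base_ok, rows_ok)  # 16-byte vector loads need both; otherwise fall back
```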
Kexin Yu authored
- 30 Apr, 2020 6 commits
Kexin Yu authored
Deyu Fu authored
* Modify MTA axpby for wider load/store.
* Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads.
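The Python-facing entry point for these kernels is `multi_tensor_applier`; a hedged usage sketch of that interface (the load-width change itself is internal to the CUDA kernels):

```python
import torch
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier

overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
grads = [torch.randn(n, device="cuda") for n in (1024, 4096, 33)]
scaled = [torch.empty_like(g) for g in grads]

# One fused launch scales every tensor in the list: scaled[i] = grads[i] * 0.5
multi_tensor_applier(amp_C.multi_tensor_scale, overflow_buf, [grads, scaled], 0.5)
```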
Deyu Fu authored
* Update fused bias relu backward kernel.
* Add support for not requiring the first layer dgrad.
* Fix bug: wrong layer in requires_grad check.
* Add infrastructure for optional bias and activation; currently only supports no bias and no relu.
* Make bias and relu optional separately.
* Add sigmoid activation option.
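A reference sketch in plain PyTorch of the per-layer computation these options describe, with bias and activation each optional and sigmoid as an alternative to ReLU; this only illustrates the semantics and is not apex's fused API:

```python
import torch

def mlp_layer(x, weight, bias=None, activation=None):
    # The fused kernel covers GEMM + optional bias + optional activation.
    y = x @ weight.t()
    if bias is not None:
        y = y + bias
    if activation == "relu":
        y = torch.relu(y)
    elif activation == "sigmoid":
        y = torch.sigmoid(y)
    return y

x = torch.randn(32, 512, device="cuda", requires_grad=True)
w = torch.randn(1024, 512, device="cuda", requires_grad=True)
out = mlp_layer(x, w, bias=None, activation="sigmoid")
out.sum().backward()   # autograd provides the reference backward (dgrad/wgrad)
```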
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Thor Johnsen authored
Thor Johnsen authored
- 29 Apr, 2020 5 commits
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
Thor Johnsen authored
- 28 Apr, 2020 1 commit
Kexin Yu authored
- 23 Apr, 2020 1 commit
ptrblck authored
* Add CUDAGenerator guard.
* Fix generator_flag.
* Add guards for the generator pointer/reference issue.
* Change mutex_ to mutex().
* Add check_generator.
Co-authored-by: pbialecki <pbialecki@nvidia.com>
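These guards concern the C++ extension's handling of the CUDA RNG generator; the user-visible contract they protect is reproducibility under torch's generator state. A small stand-in sketch using a stock PyTorch op (an illustration of that contract, not the extension code):

```python
import torch

x = torch.ones(4, 8, device="cuda")
torch.cuda.manual_seed(1234)
a = torch.nn.functional.dropout(x, p=0.5, training=True)
torch.cuda.manual_seed(1234)
b = torch.nn.functional.dropout(x, p=0.5, training=True)
assert torch.equal(a, b)   # same generator state, same random mask
```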
- 22 Apr, 2020 2 commits
Deyu Fu authored
Vinicius Reis authored
The LARC optimizer wraps an underlying optimizer and then needs to be passed to amp.initialize for mixed precision. There were three different crashes in this situation; this fixes all of them and adds a unit test. I don't know if the 'LARC' in sys.modules check ever worked; in my setup the entry in sys.modules is 'apex.parallel.LARC'. Checking whether the variable is defined seems more reliable.
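A minimal sketch of the wrapping pattern this commit fixes (standard apex usage; the opt_level and hyperparameters here are arbitrary):

```python
import torch
from apex import amp
from apex.parallel.LARC import LARC

model = torch.nn.Linear(512, 512).cuda()
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer = LARC(base_optimizer)            # LARC wraps the inner optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(16, 512, device="cuda")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```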
- 20 Apr, 2020 1 commit
Kexin Yu authored
add additional loop for lists of params in FP16_Optimizer's load_state_dict
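A hedged sketch of the restore pattern the commit describes; the group/list names are placeholders for illustration, not apex's exact attribute names:

```python
def restore_master_params(current_groups, saved_groups):
    # Each entry is a list of fp32 master parameter tensors, one list per
    # parameter group; the extra loop walks the lists group by group.
    for current_list, saved_list in zip(current_groups, saved_groups):
        for current_p, saved_p in zip(current_list, saved_list):
            current_p.data.copy_(saved_p.data)
```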