Commits · e1b7997a63babb73c3279ccecdc2bd3f61b8e462 · OpenDAS / apex

12 May, 2020 2 commits
- Merge pull request #753 from NVIDIA/revertable_fused_adam_with_mt_support · e1b7997a
  Thor Johnsen authored May 12, 2020
```
Reversible fused adam with mt support
```
  e1b7997a
- Resolve possible race condition in stride_finite_check kernel · 758826fc
  Thor Johnsen authored May 11, 2020
  
  758826fc
08 May, 2020 1 commit
- Merge · 0bfb8300
  Thor Johnsen authored May 08, 2020
  
  0bfb8300
07 May, 2020 2 commits
- Resolve merge conflict · 2619f1cb
  Thor Johnsen authored May 07, 2020
  
  2619f1cb
- Slight improvements · 91a5a87e
  Thor Johnsen authored May 06, 2020
  
  91a5a87e
06 May, 2020 3 commits
- Re-introduce original non-reversible fused contrib adam cuda kernel · 25c80afe
  Thor Johnsen authored May 06, 2020
  
  25c80afe
- Revert regular contrib fused adam optimizer · 9bb71066
  Thor Johnsen authored May 06, 2020
  
  9bb71066
- Ultra-simple global all-reduce version of distributed optimizer · 7e3536dd
  Thor Johnsen authored May 05, 2020
  
  7e3536dd
05 May, 2020 1 commit
- Try out different partition scheme · a60bbe63
  Thor Johnsen authored May 04, 2020
  
  a60bbe63
04 May, 2020 1 commit
- Bug fix · 7da28fc3
  Thor Johnsen authored May 04, 2020
  
  7da28fc3
01 May, 2020 1 commit
- Changes to make xentropysoftmax load/store vectorized when possible: (#725) · cf50dc7c
  Deyu Fu authored Apr 30, 2020
  
  * Changes to make xentropysoftmax load/store vectorized when possible:
    Increase default ILP so that each thread handle 16 Bytes data in one step
    Make thread load/store longest vector possible
    Make unroll case handle adjacent data instead of strided, so same order compare to vector case
  * Add shift for not aligned case. Remove less than 16 bytes aligned access
  
  cf50dc7c
30 Apr, 2020 5 commits
- enable wider load/store for multi_tensor_apply kernels (#763) · 17ee854e
  Deyu Fu authored Apr 30, 2020
```
* modify MTA axpby for wider load/store

* Make scale/axpby/l2/adam/lamb multi_tensor uses wider load
```
  17ee854e
- Improvements to apex.mlp (#804) · 31aceeaa
  Deyu Fu authored Apr 30, 2020
  
  * update fused bias relu backward kernel
  * adding support for not require first layer dgrad
  * fix bug: wrong layer in requires grad
  * add infrastructure for optional bias and activation, currently only support no bias and no relu
  * make bias and relu optional separately
  * add sigmoid activation option
  
  31aceeaa
- fix dropout scaling from p to 1/(1-p) (#816) · aad9300b
  Burc Eryilmaz authored Apr 30, 2020
```
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  aad9300b
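
The commit title above refers to the scale factor applied to elements that survive dropout. The sketch below shows only the general "inverted dropout" convention, in which survivors are multiplied by 1/(1-p) so the expected value of each element is unchanged. It is not the apex kernel; the function name `inverted_dropout` and its arguments are assumptions made for this example.

```python
import random


def inverted_dropout(values, p, rng):
    """Zero each element with probability p; scale survivors by 1/(1-p).

    Hypothetical helper for illustration only, not part of apex.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)  # the corrected factor named in the commit title
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
out = inverted_dropout([1.0] * 100_000, p=0.25, rng=rng)
mean = sum(out) / len(out)  # stays close to the input mean of 1.0
```

Scaling by p instead would shrink the surviving activations (here by a factor of 0.25) rather than compensating for the dropped ones, so the mean would no longer match the input.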
- Remove implicit memcpy of grad tensor in do_overlapped function · 9c82241d
  Thor Johnsen authored Apr 30, 2020
  
  9c82241d
- Don't pad between consecutive parameters · 5d1993cf
  Thor Johnsen authored Apr 29, 2020
  
  5d1993cf
29 Apr, 2020 5 commits
- Bug fix · e1a4deba
  Thor Johnsen authored Apr 29, 2020
  
  e1a4deba
- Perf improvement (less CPU work) · 415e2646
  Thor Johnsen authored Apr 29, 2020
  
  415e2646
- Make L2 grad norm a CPU variable · 9d6d2e01
  Thor Johnsen authored Apr 29, 2020
  
  9d6d2e01
- Bug fix · bc81b1c1
  Thor Johnsen authored Apr 29, 2020
  
  bc81b1c1
- Reduce CPU overhead, bigger step, all-gather · 44f54712
  Thor Johnsen authored Apr 28, 2020
  
  44f54712
23 Apr, 2020 1 commit
- CUDAGenerator fix for #36026 (#801) · 1f2aa915
  ptrblck authored Apr 22, 2020
  
  * add CUDAGenerator guard
  * fix generator_flag
  * add guards for gen pointer/ref issue
  * change mutex_ to mutex()
  * add check_generator
  
  Co-authored-by: pbialecki <pbialecki@nvidia.com>
  
  1f2aa915
22 Apr, 2020 2 commits
- initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
  Deyu Fu authored Apr 22, 2020
  
  71511faf
- Fix LARC with mixed precision (#793) · 2ec84ebd
  Vinicius Reis authored Apr 22, 2020
  
  The LARC optimizer wraps an underlying optimizer and then needs to be passed
  to amp.initialize for mixed precision. There were 3 different crashes happening
  in this situation, fix all of them and add a unit test.
  
  I don't know if the 'LARC' in sys.modules check ever worked. In my setup, the
  entry in sys.modules is 'apex.parallel.LARC'. Checking if the variable is
  defined seems more reliable though.
  
  2ec84ebd
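
The second paragraph of that commit message turns on how `sys.modules` is keyed. The sketch below shows that a submodule is registered under its fully qualified name, so a lookup by bare name fails, and then shows a name-binding check as one possible alternative. The module name `demo_pkg.parallel.LARC`, the helper `name_is_bound`, and the `namespace` dict are all hypothetical stand-ins; the actual check used in apex is not shown in this log.

```python
import sys
import types

# Hypothetical stand-in for a submodule imported as part of a package.
module_name = "demo_pkg.parallel.LARC"
sys.modules[module_name] = types.ModuleType(module_name)

# sys.modules is keyed by fully qualified name, so the bare name is absent.
bare_name_found = "LARC" in sys.modules
qualified_name_found = module_name in sys.modules


def name_is_bound(name, namespace):
    """Return True if `name` is defined in the given namespace dict."""
    return name in namespace


# In real code the namespace would typically be globals(); an explicit
# dict is used here so the sketch does not depend on where it is run.
namespace = {"LARC": object()}
variable_defined = name_is_bound("LARC", namespace)

# Remove the fake module again so the sketch leaves no residue.
del sys.modules[module_name]
```

With these assumptions, the bare-name lookup is false while the qualified lookup and the name-binding check are both true, which matches the behaviour the commit author describes.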

20 Apr, 2020 3 commits
- Merge pull request #761 from kexinyu/master · 55716d85
  Kexin Yu authored Apr 20, 2020
```
add additional loop for lists of params in FP16_Optimizer's load_state_dict 
```
  55716d85
- Add alternate distributed optimizer implementation · f0448054
  Thor Johnsen authored Apr 20, 2020
  
  f0448054
- install option for contrib.optimizers.FusedLAMB · 04de0f7a
  Kexin Yu authored Apr 20, 2020
  
  04de0f7a
16 Apr, 2020 5 commits
- Partial move towards syncfree optimizer · 4a01ff26
  Thor Johnsen authored Apr 16, 2020
  
  4a01ff26
- Use glob_chunk to index streams and process groups · 2622d7f1
  Thor Johnsen authored Apr 16, 2020
  
  2622d7f1
- Bug fix · 85497632
  Thor Johnsen authored Apr 16, 2020
  
  85497632
- Bug fix · cef660ba
  Thor Johnsen authored Apr 16, 2020
  
  cef660ba
- Pragmatic change, seems like WAR for NCCL crash · 6eca2389
  Thor Johnsen authored Apr 16, 2020
  
  6eca2389
15 Apr, 2020 2 commits
- Bug(?) fix · 2c744ee5
  Thor Johnsen authored Apr 15, 2020
  
  2c744ee5
- internal pipelining more similar to micro-benchmarks · 208c91e0
  Thor Johnsen authored Apr 14, 2020
  
  208c91e0
13 Apr, 2020 1 commit
- Return internal optimizer's param_groups from LARC (#767) · 11faaca7
  Mannat Singh authored Apr 13, 2020
  
  11faaca7
10 Apr, 2020 2 commits
- Add option to skip overflow check in step() method · 7ba6a038
  Thor Johnsen authored Apr 10, 2020
  
  7ba6a038
- Add no-flattening e5m2-allgather option · c7b34549
  Thor Johnsen authored Apr 09, 2020
  
  c7b34549
09 Apr, 2020 1 commit
- Add e5m2 allgather option · cd206434
  Thor Johnsen authored Apr 09, 2020
  
  cd206434
08 Apr, 2020 1 commit
- Add internal pipelining option · aa90d31f
  Thor Johnsen authored Apr 08, 2020
  
  aa90d31f
07 Apr, 2020 1 commit
- Bug fix · be4c41c2
  Thor Johnsen authored Apr 07, 2020
  
  be4c41c2