Commits · 2d0f9cf20f3c998293225c633e3ec42f68edbba4 · OpenDAS / apex

07 May, 2020 3 commits

Enable fusedlayernorm extension (#3) · 2d0f9cf2
Chaitanya Sri Krishna Lolla authored May 07, 2020

enable python only base sparse tensor support for loss scaling (#2) · 3ccdd63d
Chaitanya Sri Krishna Lolla authored May 07, 2020


[Upstream] IFU 05072020 (#4) · e85a1d4b

Chaitanya Sri Krishna Lolla authored May 07, 2020



* fix dropout scaling from p to 1/(1-p) (#816)
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

* Improvements to apex.mlp (#804)

* update fused bias relu backward kernel

* add support for not requiring first-layer dgrad

* fix bug: wrong layer in requires grad

* add infrastructure for optional bias and activation; currently only supports no bias and no relu

* make bias and relu optional separately

* add sigmoid activation option

* enable wider load/store for multi_tensor_apply kernels (#763)

* modify MTA axpby for wider load/store

* Make scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads

* Changes to make xentropysoftmax load/store vectorized when possible: (#725)

* Changes to make xentropysoftmax load/store vectorized when possible:
Increase default ILP so that each thread handles 16 bytes of data in one step
Make each thread load/store the longest vector possible
Make the unrolled case handle adjacent data instead of strided data, so the order matches the vector case

* Add shift for not aligned case. Remove less than 16 bytes aligned access
Co-authored-by: Burc Eryilmaz <sberyilm@gmail.com>
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Co-authored-by: Deyu Fu <deyuf@nvidia.com>
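
The dropout scaling fix listed above (#816) refers to the usual "inverted dropout" convention: elements that survive are multiplied by 1/(1-p), so the expected value of the output matches the input. The following is a minimal pure-Python sketch of that convention, not the kernel changed in the commit. It assumes `p` is the drop probability; the function name `dropout` and the `rng` argument are illustrative only.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each element with probability p and
    scale the surviving elements by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)  # the corrected factor; scaling by p would be wrong
    return [v * scale if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
out = dropout([1.0] * 100_000, p=0.25, rng=rng)
mean = sum(out) / len(out)  # stays close to 1.0 because of the 1/(1-p) scaling
```

With the 1/(1-p) factor, the mean of the output stays close to the mean of the input, which is why no rescaling is needed at inference time.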


28 Apr, 2020 1 commit

Enable Apex on ROCm and support multi tensor support. (#1) · 8124df13

Chaitanya Sri Krishna Lolla authored Apr 28, 2020

* Initial commit to hipify all cuda code

* enable multi_tensor_apply extension

* added generatedFileCleaner to handle nested hip files


23 Apr, 2020 1 commit

CUDAGenerator fix for #36026 (#801) · 1f2aa915

ptrblck authored Apr 22, 2020



* add CUDAGenerator guard

* fix generator_flag

* add guards for gen pointer/ref issue

* change mutex_ to mutex()

* add check_generator
Co-authored-by: pbialecki <pbialecki@nvidia.com>


22 Apr, 2020 2 commits

initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
Deyu Fu authored Apr 22, 2020


Fix LARC with mixed precision (#793) · 2ec84ebd

Vinicius Reis authored Apr 22, 2020

The LARC optimizer wraps an underlying optimizer and then needs to be passed
to amp.initialize for mixed precision. Three different crashes occurred in
this situation; this change fixes all of them and adds a unit test.

I don't know if the 'LARC' in sys.modules check ever worked. In my setup, the
entry in sys.modules is 'apex.parallel.LARC'. Checking if the variable is
defined seems more reliable though.
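
The observation about sys.modules can be reproduced with the standard library alone. The sketch below uses `json.decoder` as a stand-in for a submodule such as `apex.parallel.LARC`. The helper `name_is_defined` is hypothetical and only illustrates the "check whether the variable is defined" idea; it is not the code from the commit.

```python
import sys
import json.decoder  # stand-in submodule; plays the role of apex.parallel.LARC

# sys.modules is keyed by the fully qualified module name,
# so a lookup by the short name misses the submodule.
full_name_present = "json.decoder" in sys.modules
short_name_present = "decoder" in sys.modules


def name_is_defined(probe):
    """Return True if evaluating the probe does not raise NameError."""
    try:
        probe()
        return True
    except NameError:
        return False


json_defined = name_is_defined(lambda: json)  # bound by the import above
larc_defined = name_is_defined(lambda: LARC)  # never imported in this sketch
```

Here the short-name lookup fails even though the submodule is loaded, while the name-based check reflects what is actually usable in the current namespace.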


20 Apr, 2020 2 commits
- Merge pull request #761 from kexinyu/master · 55716d85
  Kexin Yu authored Apr 20, 2020
```
add additional loop for lists of params in FP16_Optimizer's load_state_dict 
```
- install option for contrib.optimizers.FusedLAMB · 04de0f7a
  Kexin Yu authored Apr 20, 2020
  
13 Apr, 2020 1 commit
- Return internal optimizer's param_groups from LARC (#767) · 11faaca7
  Mannat Singh authored Apr 13, 2020
  
05 Apr, 2020 2 commits
- fix typo · f3a960f8
  Kexin Yu authored Apr 05, 2020
  
- .item() · d38e6fe4
  Kexin Yu authored Apr 05, 2020
  
03 Apr, 2020 4 commits
- more debugging · a0bf956a
  Kexin Yu authored Apr 03, 2020
  
- check empty lists · feb93a2a
  Kexin Yu authored Apr 02, 2020
  
- more debugging · 8e5699e4
  Kexin Yu authored Apr 02, 2020
  
- seg fault debugging · 9b96c824
  Kexin Yu authored Apr 02, 2020
  
02 Apr, 2020 1 commit
- import amp_C.multi_tensor_l2norm · 92186863
  Kexin Yu authored Apr 01, 2020
  
01 Apr, 2020 2 commits
- add printing to test · 96b017a8
  Kexin Yu authored Mar 31, 2020
  
- fix parameter type · 90729bc8
  Kexin Yu authored Mar 31, 2020
  
31 Mar, 2020 2 commits
- clip gradients globally, rather than per group · 32d2c4e2
  Kexin Yu authored Mar 31, 2020
  
- Add support for bool datatype (#601) (#603) · ca00adac
  Jeff Bowles authored Mar 31, 2020
  
25 Mar, 2020 1 commit

Fix contrib fused_adam to work correctly with multi-GPU (#752) · 8fac3a72

msbaines authored Mar 24, 2020



The cuda kernel used by fused-adam was using the default stream
on the default device. The kernel needs to use the same device as
the parameter tensor.

Fixed by using a context manager to set the correct default device. For
the use_mt case, an error is raised instead. Alternatively, the use_mt
case could launch one kernel per cuda device.

The non-contrib version will also need to be fixed.
Co-authored-by: Mandeep Singh Baines <msb@fb.com>
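
The context-manager pattern described above can be sketched without a GPU. In the example below, `DeviceState.current` is a stand-in for the process-wide current CUDA device, and `use_device`, `launch_kernel`, and `step` are hypothetical names, not apex or PyTorch APIs. In PyTorch, the analogous context manager is `torch.cuda.device`.

```python
from contextlib import contextmanager


class DeviceState:
    """Stand-in for the process-wide 'current device' setting."""
    current = 0


@contextmanager
def use_device(index):
    """Temporarily switch the current device, restoring it on exit."""
    previous = DeviceState.current
    DeviceState.current = index
    try:
        yield
    finally:
        DeviceState.current = previous


def launch_kernel(param):
    """Pretend kernel launch: report the device it would run on."""
    return DeviceState.current


def step(params):
    """Launch one kernel per parameter on that parameter's own device."""
    launched_on = []
    for p in params:
        with use_device(p["device"]):
            launched_on.append(launch_kernel(p))
    return launched_on


params = [{"name": "w0", "device": 0}, {"name": "w1", "device": 1}]
result = step(params)
```

Without the `with use_device(...)` block, every launch in this sketch would report device 0, which mirrors the bug the commit describes.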


23 Mar, 2020 2 commits
- revert to gradient pre-normalization · 8405d436
  Kexin Yu authored Mar 23, 2020
  
- add l2norm source for FusedLAMB · a3ffb8a7
  Kexin Yu authored Mar 23, 2020
  
21 Mar, 2020 2 commits
- fix typo · 04927b3a
  Kexin Yu authored Mar 21, 2020
  
- import name fix · d8a78acb
  Kexin Yu authored Mar 21, 2020
  
20 Mar, 2020 3 commits
- add FusedLamb in __init__ · 33f21d68
  Kexin Yu authored Mar 20, 2020
  
- extension name fix · b4c32010
  Kexin Yu authored Mar 20, 2020
  
- apex.contrib.optimizers.FuseLamb first commit · b222ed2b
  Kexin Yu authored Mar 19, 2020
  
17 Mar, 2020 2 commits
- add additional loop for lists of params when loading state_dict in... · 35e86d3d
  Kexin Yu authored Mar 17, 2020
```
add additional loop for lists of params when loading state_dict in apex.contrib.optimizers.FP16_Optimizer
```
- Merge remote-tracking branch 'upstream/master' · 93f91cde
  Kexin Yu authored Mar 17, 2020
  
11 Mar, 2020 2 commits

Fix deprecated calls in multihead_attn and ninja build failure (#746) · 80b90b9d

ptrblck authored Mar 11, 2020



* disable ninja for multihead_attn

* fix getCurrentStream in multihead_attn
Co-authored-by: pbialecki <pbialecki@nvidia.com>


Do not unscale the gradients if loss scale equal to 1 (#748) · 20d00ab1

Tomasz Grel authored Mar 11, 2020

* Do not unscale the gradients if loss scale equal to 1

* Skip unscaling when loss scale == 1 only for static scaling
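
The two bullets above amount to a small guard around the unscale step. The following is a minimal sketch of that guard on plain Python lists. The function `unscale_grads` and its `dynamic` flag are hypothetical and do not reflect the actual amp internals.

```python
def unscale_grads(grads, loss_scale, dynamic):
    """Divide gradients by the loss scale.

    Returns (grads, did_unscale). The division is skipped only when
    scaling is static and the scale is exactly 1, since it would be a no-op.
    """
    if not dynamic and loss_scale == 1.0:
        return grads, False
    inv_scale = 1.0 / loss_scale
    return [g * inv_scale for g in grads], True


grads = [2.0, 4.0]
static_one = unscale_grads(grads, 1.0, dynamic=False)   # skipped
static_two = unscale_grads(grads, 2.0, dynamic=False)   # unscaled
dynamic_one = unscale_grads(grads, 1.0, dynamic=True)   # still runs the unscale path
```

In the static case with a scale of 1, the original list is returned untouched, so no extra pass over the gradients is made.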


02 Mar, 2020 1 commit
- Revert "remove gencode from multihead_attn build (#731)" · 5633f6db
  pbialecki authored Mar 01, 2020
```
This reverts commit 92b3b9a9.
```
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
```
* NHWC support for multi tensor apply

* compilation fix for version<=1.4
```
25 Feb, 2020 3 commits
- remove gencode from multihead_attn build (#731) · 92b3b9a9
  ptrblck authored Feb 25, 2020
  
- remove duplicated multihead_attn install (#729) · 5f6b9b0e
  ptrblck authored Feb 24, 2020
  
- Adding 'ctc_loss' to the list of FP32 funcs (#722) · 93cabd5d
  Saransh Karira authored Feb 25, 2020
  
24 Feb, 2020 1 commit

Change to Multihead Attention to allow Batched GEMMs larger than 64K. (#728) · 1733946a

Kevin Stephano authored Feb 24, 2020

* Adding C++ Multihead Attention implementation to contrib.

* Add reference test that at least works for forward.

* Remove CublasLt support from multihead attention.

* Add new Python version of self attention.

* Update python model of MHA with backward pass.

* Fixed Output Linear connection in MHA.

* Clean up compiles and add documentation to PySelfAttention.

* Add Encdec Python version of multihead attention.  Cleanup files.

* Tests for self and encdec multihead attention.

* Add reference pytorch implementation of attention with norm and add.

* Add cutlass branch definition.

* Add cutlass download to compile.

* Add norm/add tests.

* Add biases to pytorch python versions.

* Add tests and fix issues with python version of attention masking.

* Create README.md

* Update README.md

* Update README.md

* Update perf test parameters.

* Update README.md

* Update README.md

* Update README.md

* Add files via upload

* Update README.md

* Update README.md

* Update README.md

* Fix matmul1 output tensor size.  Fix tests that missed issue.

* Allow for Z dimensions of 64K and greater on batched GEMMs.

* remove redundant imports

* general cleanup, remove deprecated or unused functions


15 Feb, 2020 1 commit
- change include_dirs to abs path (#719) · 50338df6
  Deyu Fu authored Feb 14, 2020
  