Commits · e2083df5eb96643c61613b9df48dd4eea6b07690 · OpenDAS / apex

23 Feb, 2021 1 commit
- fast layer norm (#1037) · e2083df5
  yjk21 authored Feb 23, 2021
  
01 Dec, 2020 1 commit

- DistributedFusedAdam Model Parallelism Support (Megatron) (#981) · 6b7e77b0
  Kexin Yu authored Dec 01, 2020

DistributedFusedAdam Model Parallelism Support (Megatron)

Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>

10 Aug, 2020 1 commit
- move sm80 code inside MHA (#937) · 5d9b5cbc
  ptrblck authored Aug 10, 2020
```
Co-authored-by: pbialecki <pbialecki@nvidia.com>
```
01 Aug, 2020 1 commit
- Add sm80 for CUDA >= 11 (#925) · 5b53121a
  ptrblck authored Aug 01, 2020
  
01 Jun, 2020 1 commit
- Add Pyprof removal warnings that point to new repo (#862) · 097238f8
  mcarilli authored Jun 01, 2020
```
Co-authored-by: Michael Carilli <mcarilli@nvidia.com>
```
30 May, 2020 2 commits
- Make separate apex option for distributed lamb · 45388d48
  Thor Johnsen authored May 30, 2020
  
- Distributed LAMB optimizer · 19892f1d
  Thor Johnsen authored May 30, 2020
  
29 May, 2020 1 commit

- Fuses dropout and softmax in backward pass, add bias support to CPP MHA, add... · 6c2babf9
  Burc Eryilmaz authored May 29, 2020

Fuses dropout and softmax in backward pass, add bias support to CPP MHA, add additive mask support, separate Q/K/V parameters (#854)

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

14 May, 2020 1 commit
- Add FusedAdagrad (#822) · 3bae8c83
  Andrew Tulloch authored May 14, 2020
  
23 Apr, 2020 1 commit

- CUDAGenerator fix for #36026 (#801) · 1f2aa915
  ptrblck authored Apr 22, 2020

* add CUDAGenerator guard
* fix generator_flag
* add guards for gen pointer/ref issue
* change mutex_ to mutex()
* add check_generator

Co-authored-by: pbialecki <pbialecki@nvidia.com>

22 Apr, 2020 1 commit
- initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
  Deyu Fu authored Apr 22, 2020
  
23 Mar, 2020 1 commit
- add l2norm source for FusedLAMB · a3ffb8a7
  Kexin Yu authored Mar 23, 2020
  
20 Mar, 2020 2 commits
- extension name fix · b4c32010
  Kexin Yu authored Mar 20, 2020
  
- apex.contrib.optimizers.FuseLamb first commit · b222ed2b
  Kexin Yu authored Mar 19, 2020
  
11 Mar, 2020 1 commit

- Fix deprecated calls in multihead_attn and ninja build failure (#746) · 80b90b9d
  ptrblck authored Mar 11, 2020

* disable ninja for multihead_attn
* fix getCurrentStream in multihead_attn

Co-authored-by: pbialecki <pbialecki@nvidia.com>

02 Mar, 2020 1 commit
- Revert "remove gencode from multihead_attn build (#731)" · 5633f6db
  pbialecki authored Mar 01, 2020
```
This reverts commit 92b3b9a9.
```
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
```
* NHWC support for multi tensor apply

* compilation fix for version<=1.4
```
25 Feb, 2020 2 commits
- remove gencode from multihead_attn build (#731) · 92b3b9a9
  ptrblck authored Feb 25, 2020
  
- remove duplicated multihead_attn install (#729) · 5f6b9b0e
  ptrblck authored Feb 24, 2020
  
24 Feb, 2020 1 commit

- Change to Multihead Attention to allow Batched GEMMs larger than 64K. (#728) · 1733946a
  Kevin Stephano authored Feb 24, 2020

* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention.  Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference pytorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to pytorch python versions.
* Add tests and fix issues with python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add files via upload
* Update README.md
* Update README.md
* Update README.md
* Fix matmul1 output tensor size.  Fix tests that missed issue.
* Allow for Z dimensions of 64K and greater on batched GEMMs.
* remove redundant imports
* general cleanup, remove deprecated or unused functions

15 Feb, 2020 1 commit
- change include_dirs to abs path (#719) · 50338df6
  Deyu Fu authored Feb 14, 2020
  
06 Feb, 2020 1 commit

- Add Fast Multihead Attention to APEX Contrib (#697) · 3f94528e
  Kevin Stephano authored Feb 06, 2020

* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention.  Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference pytorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to pytorch python versions.
* Add tests and fix issues with python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add f...

21 Jan, 2020 1 commit
- removing build target sm_70 from bnp (#683) · b66ffc1d
  jjsjann123 authored Jan 20, 2020
  
08 Jan, 2020 1 commit
- add WAR for pip>=19.3.1 (#652) · b5a7c5f9
  ptrblck authored Jan 08, 2020
```
* add WAR for pip>=19.3.1

* remove pipmain, use extras_require instead
```
04 Oct, 2019 1 commit

- move previous fused_adam and fp16_optimizer to contrib (#517) · 1904e48d
  Deyu Fu authored Oct 04, 2019

* move previous fused_adam and fp16_optimizer to contrib
* make build contrib.fused_adam optional
* change build option name
* remove unnecessary try import

13 Sep, 2019 1 commit
- Seems to work locally (#490) · 3ae89c75
  mcarilli authored Sep 12, 2019
  
06 Sep, 2019 1 commit

- Fix for #456 (#477) · 325f5a0b
  mcarilli authored Sep 05, 2019

* Pushing for build tests
* Contrib files
* Removing deprecated checks

17 Aug, 2019 1 commit
- add back legacy lamb code for backward compatibility now · 2bc766ce
  Deyu Fu authored Aug 16, 2019
  
16 Aug, 2019 1 commit
- add fused lamb, put lamb kernels into one file · c8f9cceb
  Deyu Fu authored Aug 16, 2019
  
13 Aug, 2019 1 commit

- Adding PyProf to Apex (#404) · 880ab925
  Marek Kolodziej authored Aug 13, 2019

Co-authored-by: Aditya Agrawal <aditya.iitb@gmail.com>
Co-authored-by: Marek Kolodziej <mkolod@gmail.com>

08 Aug, 2019 1 commit
- initial commit to make fused optimizers compatible with AMP · 690b1f71
  Deyu Fu authored Aug 08, 2019
  
31 May, 2019 1 commit

- Multi tensor lamb optimizer (#334) · 8be5b6be
  Thor Johnsen authored May 31, 2019

* First draft, for discussion
* Fix mistakes in LAMB equations
* Add loop over chunk
* Bug fix
* Bug fix
* Bug fix
* Undo bug fix
* Bug fix
* Add multi tensor LAMB optimizer to setup.py
* Rename step_size to learning_rate
* Fix compilation errors

23 May, 2019 1 commit
- Changing error message · e6eec3ba
  Michael Carilli authored May 23, 2019
  
22 May, 2019 1 commit
- Hard error on Pytorch Cuda + Cuda toolkit version mismatch (#323) · 50689f6a
  mcarilli authored May 22, 2019
  
09 May, 2019 1 commit

- Add softmax cross entropy loss with label smoothing support. (#295) · 0c74571f
  Wil Kong authored May 10, 2019

* Add softmax cross entropy loss with label smoothing support.
* Fix deprecation of AT_DISPATCH_XXX and several minor issues.
* Fix issues commented by reviewers.
* Add FB license.
* Remove code generation constraints.
* Add a simple unittest for label smoothing.

27 Apr, 2019 1 commit

- Bnp integration pr (#275) · fedfe0d7
  jjsjann123 authored Apr 26, 2019

* Persistent group batchnorm added

Added persistent grouped batch norm for performance run on strong scaling case:
currently only supporting:

  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

Environment variable is used to tune LAUNCH_MARGIN that limits the CTAs usage
by the persistent kernel.

Documentation and examples will follow.

* updating type().scalarType() to scalar_type()
* moving launch margin to be defined at layer creation, adding a knob cap max ctas per sm
* fixing the cta computation
* review comment:

set device_id through cudaGetDevice()
move cudaMemset to cudaMemsetAsync
updated __threadfence() to __threadfence_system() inter device write

18 Apr, 2019 1 commit
- cleanup · 651150cb
  Michael Carilli authored Apr 18, 2019
  
09 Apr, 2019 1 commit
- Simple cut of the kernel in place · e57f5d0e
  Michael Carilli authored Apr 09, 2019
  
23 Mar, 2019 1 commit
- Fix typo in setup.py error message on torch version check (#219) · dc55a996
  Cubbee authored Mar 23, 2019
  
22 Mar, 2019 1 commit

- Check cuda version (#216) · 5b8faa29
  mcarilli authored Mar 21, 2019

* Adding Torch + bare-metal nvcc version check and container build tests
* Putting a canary in the coalmine
* canary proved elusive
* Trying direct setup.py install
* this should work
* Removing canary
* hopefully this works