- 06 Feb, 2020 1 commit

Kevin Stephano authored

* Adding C++ Multihead Attention implementation to contrib.
* Add reference test that at least works for forward.
* Remove CublasLt support from multihead attention.
* Add new Python version of self attention.
* Update python model of MHA with backward pass.
* Fixed Output Linear connection in MHA.
* Clean up compiles and add documentation to PySelfAttention.
* Add Encdec Python version of multihead attention. Cleanup files.
* Tests for self and encdec multihead attention.
* Add reference pytorch implementation of attention with norm and add.
* Add cutlass branch definition.
* Add cutlass download to compile.
* Add norm/add tests.
* Add biases to pytorch python versions.
* Add tests and fix issues with python version of attention masking.
* Create README.md
* Update README.md
* Update README.md
* Update perf test parameters.
* Update README.md
* Update README.md
* Update README.md
* Add f...
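The commits above add a Python reference model of self-attention alongside the C++ implementation. As an illustration only (not the contrib code itself), a minimal single-head scaled dot-product attention forward with optional masking can be sketched in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, w_out, mask=None):
    """Single-head self-attention forward pass (illustrative sketch).
    x: (seq, d_model); each w_*: (d_model, d_model);
    mask: optional boolean (seq, seq), True = position may be attended to."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])      # (seq, seq) attention logits
    if mask is not None:
        # positions where mask is False get a large negative logit,
        # so their softmax weight is effectively zero
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v @ w_out
```

The real contrib kernels add multiple heads, biases, and fused dropout, but the masking and scaling above are the core of what the Python model checks against.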
- 21 Jan, 2020 1 commit

jjsjann123 authored
- 08 Jan, 2020 1 commit

ptrblck authored

* add WAR for pip>=19.3.1
* remove pipmain, use extras_require instead
- 04 Oct, 2019 1 commit

Deyu Fu authored

* move previous fused_adam and fp16_optimizer to contrib
* make build contrib.fused_adam optional
* change build option name
* remove unnecessary try import
- 13 Sep, 2019 1 commit

mcarilli authored
- 06 Sep, 2019 1 commit

mcarilli authored

* Pushing for build tests
* Contrib files
* Removing deprecated checks
- 17 Aug, 2019 1 commit

Deyu Fu authored
- 16 Aug, 2019 1 commit

Deyu Fu authored
- 13 Aug, 2019 1 commit

Marek Kolodziej authored

Co-authored-by: Aditya Agrawal <aditya.iitb@gmail.com>
Co-authored-by: Marek Kolodziej <mkolod@gmail.com>
- 08 Aug, 2019 1 commit

Deyu Fu authored
- 31 May, 2019 1 commit

Thor Johnsen authored

* First draft, for discussion
* Fix mistakes in LAMB equations
* Add loop over chunk
* Bug fix
* Bug fix
* Bug fix
* Undo bug fix
* Bug fix
* Add multi tensor LAMB optimizer to setup.py
* Rename step_size to learning_rate
* Fix compilation errors
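The LAMB equations referenced in these commits layer-wise rescale an Adam-style update by a trust ratio of parameter norm to update norm. A NumPy sketch of one step follows; hyperparameter names and the exact trust-ratio clamping conventions vary between LAMB formulations, so treat this as illustrative rather than the contrib kernel's math:

```python
import numpy as np

def lamb_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-6, wd=0.01):
    """One LAMB update for a single parameter tensor (illustrative sketch).
    p: parameters, g: gradient, m/v: Adam moments, t: 1-based step count."""
    m = b1 * m + (1 - b1) * g                  # first moment estimate
    v = b2 * v + (1 - b2) * g * g              # second moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * p   # Adam step + decoupled weight decay
    w_norm, u_norm = np.linalg.norm(p), np.linalg.norm(update)
    # trust ratio: scale the step to the magnitude of the layer's weights
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return p - lr * trust * update, m, v
```

The contrib version additionally chunks parameters into a multi-tensor apply for kernel-launch efficiency, which this per-tensor sketch omits.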
- 23 May, 2019 1 commit

Michael Carilli authored
- 22 May, 2019 1 commit

mcarilli authored
- 09 May, 2019 1 commit

Wil Kong authored

* Add softmax cross entropy loss with label smoothing support.
* Fix deprecation of AT_DISPATCH_XXX and several minor issues.
* Fix issues commented by reviewers.
* Add FB license.
* Remove code generation constraints.
* Add a simple unittest for label smoothing.
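For reference, the math behind a label-smoothed softmax cross entropy can be sketched in NumPy as below. This is illustrative only: some formulations spread the smoothing mass uniformly over all C classes rather than over the C-1 wrong classes as done here, and the contrib kernel fuses this into a single CUDA pass:

```python
import numpy as np

def log_softmax(x):
    # numerically stable log-softmax over the class axis
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross entropy against a smoothed target distribution.
    logits: (N, C); target: (N,) integer class indices."""
    n, c = logits.shape
    logp = log_softmax(logits)
    # smoothed targets: 1 - s on the true class, s / (C - 1) on each other class
    true_p = np.full((n, c), smoothing / (c - 1))
    true_p[np.arange(n), target] = 1.0 - smoothing
    return -(true_p * logp).sum(axis=1).mean()
```

With `smoothing=0.0` this reduces to the ordinary cross entropy, which is a convenient sanity check for a unit test.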
- 27 Apr, 2019 1 commit

jjsjann123 authored

* Persistent group batchnorm added. Added persistent grouped batch norm for performance on the strong-scaling case; currently only supporting: 1. nhwc layout 2. fp16 3. synchronization only within a node. An environment variable is used to tune LAUNCH_MARGIN, which limits the CTA usage of the persistent kernel. Documentation and examples will follow.
* updating type().scalarType() to scalar_type()
* moving launch margin to be defined at layer creation, adding a knob to cap max ctas per sm
* fixing the cta computation
* review comment: set device_id through cudaGetDevice(); move cudaMemset to cudaMemsetAsync; updated __threadfence() to __threadfence_system() for inter-device write
- 18 Apr, 2019 1 commit

Michael Carilli authored
- 09 Apr, 2019 1 commit

Michael Carilli authored
- 23 Mar, 2019 1 commit

Cubbee authored
- 22 Mar, 2019 1 commit

mcarilli authored

* Adding Torch + bare-metal nvcc version check and container build tests
* Putting a canary in the coalmine
* canary proved elusive
* Trying direct setup.py install
* this should work
* Removing canary
* hopefully this works
- 19 Mar, 2019 1 commit

Michael Carilli authored
- 13 Mar, 2019 1 commit

Wil Kong authored
- 12 Mar, 2019 1 commit

Michael Carilli authored
- 10 Mar, 2019 1 commit

Michael Carilli authored
- 08 Mar, 2019 1 commit

Simon Layton authored

Initial implementation, all fp32. Tested against torch.optim.sgd.
- 05 Mar, 2019 1 commit

Michael Carilli authored
- 04 Mar, 2019 1 commit

Michael Carilli authored
- 19 Feb, 2019 1 commit

Michael Carilli authored
- 11 Feb, 2019 1 commit

Michael Carilli authored
- 04 Feb, 2019 1 commit

Michael Carilli authored
- 12 Dec, 2018 1 commit

Michael Carilli authored
- 31 Oct, 2018 1 commit

Thor Johnsen authored

* Pre-release of fused layer norm apex extension
* Remove half and __half2 specializations
* Code changes from review
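The fused layer norm extension computes the same math as an unfused implementation, just in fewer kernel launches. A NumPy sketch of the forward pass, for orientation only:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last dimension (unfused NumPy sketch).
    x: (..., d); gamma, beta: (d,) learned scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)          # biased variance, as in LN
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

Per row, the output has (approximately) zero mean and unit variance before the affine `gamma`/`beta` transform; `eps` guards against division by zero for near-constant rows.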
- 30 Oct, 2018 1 commit

ngimel authored
- 29 Oct, 2018 1 commit

mcarilli authored

* test passes
* notes
* Using C++-side flatten and unflatten functions
* Adding csrc
* Persistent synchronization event so it doesn't need to be created and destroyed each time
* Interop with parameter flattening in SSD
* Added deterministic option to imagenet main.py
* Adding options to split gradient averaging and allreduce in pure fp32
* Fixing allreduce_maybe_retain call
* Fixing allreduce_fallback
* Also sync active_i_buckets from rank 0
* Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False
* Correcting syntax error, now all seems to work with SSD
* Optional cpp extension build
* Add mixed precision adam optimizer (#59)
* Add FusedAdam Optimizer to Apex that places all the math into a cuda kernel.
* Added fixes to fused_adam to get it to work with network.
* wip work on python interface for adam with options
* fix dispatch for halfs, add python options to handle optional half gradients and params
* cleanup, get rid of grid-stride loop
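The flatten/unflatten functions mentioned above exist so that many small gradient tensors can be allreduced as one contiguous buffer instead of one collective per tensor. The round trip can be sketched (in NumPy, purely for illustration; the contrib code uses the C++-side tensor utilities) as:

```python
import numpy as np

def flatten(tensors):
    """Concatenate gradient tensors into one flat buffer for a single allreduce."""
    return np.concatenate([t.ravel() for t in tensors])

def unflatten(flat, shapes):
    """Split the flat buffer back into views with the original shapes."""
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(flat[i:i + n].reshape(s))
        i += n
    return out
```

One allreduce over the flat buffer amortizes communication latency across all the parameters in a bucket, which is the point of the bucketing logic in these commits.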
- 23 Oct, 2018 1 commit

jjsjann123 authored

* [syncBN] added syncBN in native pure python apex. Added fused cuda kernels used for sync BN, using Welford for mean/var. Optional installation using 'python setup.py install --cuda_ext'. Added unit test with side-by-side comparison between apex sync BN and PyTorch BN. Notice that for the PyTorch BN implementation, because of numerical issues with mean/var, the output will be slightly off.
* [syncBN PR] added fp16 support, addressing review comments on: 1. updating last pow 2 2. looking for import error when importing the syncBN kernel
* [syncBN PR] added convert function to insert SyncBatchNorm; refactored some kernel code
* fixing type issue (fp16/fp32/fp64); added Kahan summation; editing unit test to use pytorch primitive ops with double, passing reasonable tests now
* updating tensor creation calls
* fixing the all_reduce contiguous tensor
* transposed all reduce results
* [syncBN] support fp16 input & fp32 layer for apex fp16; partially fixing launch configs; enabling imagenet example to run with --sync_bn
* [syncBN PR] documentation added
* adjusting README
* adjusting again
* added some doc to imagenet example
* [syncBN] warp-level reduction bug fix: warp reduction logic updated; check for dummy element to avoid nan; improved launch configs for better reduction kernels. A further improvement would be to increase grid size.
* [syncBN] fixing undefined behavior in __shfl_down_sync from divergent threads in warp reduction; changing at::native::empty to at::empty (upstream comments)
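The Welford scheme mentioned above computes mean and variance in one numerically stable pass, and its partial results can be merged across workers, which is what makes it a natural fit for sync BN. A scalar NumPy sketch (illustrative; the contrib kernels do this per channel on the GPU):

```python
import numpy as np

def welford(xs):
    """Welford's streaming mean/variance accumulation, as done per worker.
    Returns (mean, m2, n) where m2 is the sum of squared deviations."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return mean, m2, n

def merge(a, b):
    """Combine two (mean, m2, n) partials, e.g. after gathering across GPUs."""
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    d = mean_b - mean_a
    mean = mean_a + d * n_b / n
    m2 = m2_a + m2_b + d * d * n_a * n_b / n
    return mean, m2, n
```

The biased variance needed by batch norm is then `m2 / n` from the merged triple, so each worker only has to exchange three numbers per channel.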
- 23 Jul, 2018 1 commit

Michael Carilli authored
- 05 Jul, 2018 1 commit

mcarilli authored
- 04 Jul, 2018 1 commit

brett koonce authored
- 24 Jun, 2018 1 commit

Michael Carilli authored
- 21 Jun, 2018 1 commit

cclauss authored
- 14 Jun, 2018 1 commit

Michael Carilli authored