1. 21 May, 2020 1 commit
  2. 20 May, 2020 1 commit
  3. 14 May, 2020 1 commit
  4. 12 May, 2020 2 commits
  5. 07 May, 2020 2 commits
    • 2d0f9cf2
      Chaitanya Sri Krishna Lolla authored
    • [Upstream] IFU 05072020 (#4) · e85a1d4b
      Chaitanya Sri Krishna Lolla authored
      
      
      * fix dropout scaling from p to 1/(1-p) (#816); see the dropout sketch after this commit
      Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
      
      * Improvements to apex.mlp (#804)
      
      * update fused bias relu backward kernel
      
      * add support for not requiring the first layer dgrad
      
      * fix bug: wrong layer used in the requires_grad check
      
      * add infrastructure for optional bias and activation; currently only supports no bias and no relu
      
      * make bias and relu optional separately
      
      * add sigmoid activation option
      
      * enable wider load/store for multi_tensor_apply kernels (#763)
      
      * modify MTA axpby for wider load/store
      
      * Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads
      
      * Changes to make xentropysoftmax load/store vectorized when possible: (#725)
      
      * Changes to make xentropysoftmax load/store vectorized when possible:
      Increase the default ILP so that each thread handles 16 bytes of data per step
      Make each thread load/store the longest vector possible
      Make the unrolled case handle adjacent data instead of strided data, so the element order matches the vectorized case
      
      * Add a shift for the not-aligned case; remove less-than-16-byte-aligned access (see the vectorization sketch after this commit)
      Co-authored-by: Burc Eryilmaz <sberyilm@gmail.com>
      Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
      Co-authored-by: Deyu Fu <deyuf@nvidia.com>
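      For context on the dropout fix above: inverted dropout keeps each element with
      probability 1-p and scales the survivors by 1/(1-p), not by p, so the expected
      activation is unchanged. A minimal CUDA sketch of the corrected scaling, purely
      illustrative and not apex's actual kernel (all names here are made up):
      
        // Inverted dropout: survivors are scaled by 1/(1-p) so E[out] == E[in].
        // `rand` holds pre-generated uniform random numbers in [0, 1).
        __global__ void dropout_scale(const float* in, const float* rand,
                                      float* out, float p, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                bool keep = rand[i] >= p;           // drop with probability p
                float scale = 1.0f / (1.0f - p);    // the fix: scale by 1/(1-p), not p
                out[i] = keep ? in[i] * scale : 0.0f;
            }
        }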
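      The xentropy change above follows a common CUDA pattern: when the pointers are
      16-byte aligned, each thread moves ILP elements per step with a single vector
      load/store; otherwise it falls back to scalar accesses over the same adjacent
      elements, in the same order (a real kernel also peels a leading "shift" of scalar
      elements until alignment is reached). A hedged sketch of the idea, not the actual
      xentropy kernel:
      
        #include <cstdint>
        
        // Each thread handles ILP = 4 floats (16 bytes) per step.
        __global__ void copy_vectorized(const float* src, float* dst, int n) {
            const int ILP = 4;
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            int start = tid * ILP;
            bool aligned = (reinterpret_cast<uintptr_t>(src) % 16 == 0) &&
                           (reinterpret_cast<uintptr_t>(dst) % 16 == 0);
            if (aligned && start + ILP <= n) {
                // Vector path: one 16-byte load and one 16-byte store.
                float4 v = reinterpret_cast<const float4*>(src)[tid];
                reinterpret_cast<float4*>(dst)[tid] = v;
            } else {
                // Unrolled fallback: adjacent elements, same order as the vector path.
                for (int j = 0; j < ILP && start + j < n; ++j)
                    dst[start + j] = src[start + j];
            }
        }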
  6. 30 Apr, 2020 2 commits
    • enable wider load/store for multi_tensor_apply kernels (#763) · 17ee854e
      Deyu Fu authored
      * modify MTA axpby for wider load/store
      
      * Make the scale/axpby/l2/adam/lamb multi_tensor kernels use wider loads (sketch below)
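      How wide a load multi_tensor_apply can use depends on the alignment of every
      pointer in the chunk and on the chunk size. A hedged sketch of the kind of check
      that gates the 16-byte path; the helper name is illustrative, not apex's:
      
        #include <cstdint>
        
        // True if every pointer is 16-byte aligned and the element count is a
        // multiple of 4 floats, so the float4 (16-byte) path can be taken.
        __host__ __device__ inline bool can_use_float4(const float* const* ptrs,
                                                       int num_ptrs, int n) {
            if (n % 4 != 0) return false;
            for (int i = 0; i < num_ptrs; ++i)
                if (reinterpret_cast<uintptr_t>(ptrs[i]) % 16 != 0) return false;
            return true;
        }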
    • Improvements to apex.mlp (#804) · 31aceeaa
      Deyu Fu authored
      * update fused bias relu backward kernel
      
      * add support for not requiring the first layer dgrad
      
      * fix bug: wrong layer used in the requires_grad check
      
      * add infrastructure for optional bias and activation; currently only supports no bias and no relu (sketch below)
      
      * make bias and relu optional separately
      
      * add sigmoid activation option
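      One plausible way to provide "infrastructure for optional bias and activation"
      at the kernel level is to template the epilogue on compile-time flags so unused
      branches cost nothing. This is only an illustration of the technique; apex's
      actual MLP kernels are organized differently:
      
        // Fused epilogue with optional bias and selectable activation.
        // Activation: 0 = none, 1 = ReLU, 2 = sigmoid.
        template <bool HasBias, int Activation>
        __device__ __forceinline__ float epilogue(float acc, float bias) {
            if (HasBias) acc += bias;
            if (Activation == 1) acc = acc > 0.f ? acc : 0.f;       // ReLU
            if (Activation == 2) acc = 1.f / (1.f + expf(-acc));    // sigmoid
            return acc;
        }
        
        // Example instantiation: no bias, no activation (the first combination supported).
        __global__ void apply_epilogue_nobias_noact(float* y, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = epilogue<false, 0>(y[i], 0.f);
        }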
  7. 28 Apr, 2020 1 commit
  8. 22 Apr, 2020 1 commit
  9. 10 Apr, 2020 1 commit
  10. 27 Feb, 2020 1 commit
  11. 04 Oct, 2019 1 commit
  12. 06 Sep, 2019 1 commit
    • Fix for #456 (#477) · 325f5a0b
      mcarilli authored
      * Pushing for build tests
      
      * Contrib files
      
      * Removing deprecated checks
  13. 20 Aug, 2019 1 commit
  14. 17 Aug, 2019 1 commit
  15. 16 Aug, 2019 2 commits
    • clean up variance option support across all fused optimizers · 18062b69
      Deyu Fu authored
      Correctly do not apply bias correction to epsilon (same as a recent upstream change)
      Correctly do not apply bias correction to weight decay (consistent with upstream AdamW)
      Add adam_w_mode to FusedAdam/FusedLAMB to choose L2 regularization or decoupled weight decay (Adam vs. AdamW); see the sketch below
      Correctly document how reg_inside_moment differs from adam_w_mode in FusedNovoGrad
      Remove the legacy eps_mode from FusedAdam
      Make the internal math type float across all fused optimizers
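      The items above separate Adam-style L2 regularization from AdamW-style decoupled
      weight decay, and keep bias correction away from both epsilon and the decay term.
      A single-parameter sketch of the update under the usual Adam notation; illustrative
      only, not the fused multi-tensor kernel:
      
        #include <math.h>
        
        // One Adam/AdamW step for a single parameter p with moments m, v at step t.
        // adam_w = false: L2 mode, decay is folded into the gradient.
        // adam_w = true:  AdamW mode, decoupled decay added to the update.
        // Bias correction touches the moments only, NOT eps and NOT the decay term.
        __host__ __device__ inline void adam_step(float& p, float& m, float& v, float g,
                                                  float lr, float beta1, float beta2,
                                                  float eps, float wd, int t, bool adam_w) {
            if (!adam_w) g += wd * p;                       // L2: regularize the gradient
            m = beta1 * m + (1.f - beta1) * g;
            v = beta2 * v + (1.f - beta2) * g * g;
            float m_hat = m / (1.f - powf(beta1, (float)t));
            float v_hat = v / (1.f - powf(beta2, (float)t));
            float update = m_hat / (sqrtf(v_hat) + eps);    // eps is not bias-corrected
            if (adam_w) update += wd * p;                   // decoupled decay, uncorrected
            p -= lr * update;
        }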
    • add fused lamb, put lamb kernels into one file · c8f9cceb
      Deyu Fu authored
  16. 08 Aug, 2019 1 commit
  17. 06 Aug, 2019 1 commit
    • Clean up layer norm tests (#418) · 3ef01fae
      ngimel authored
      * Bug fix for non-affine layer-norm + add backward unit test
      
      * clean up tests and add tests for a large batch
  18. 01 Aug, 2019 1 commit
  19. 26 Jul, 2019 1 commit
  20. 12 Jul, 2019 1 commit
  21. 03 Jul, 2019 4 commits
  22. 28 Jun, 2019 1 commit
  23. 14 Jun, 2019 1 commit
  24. 11 Jun, 2019 1 commit
  25. 31 May, 2019 2 commits
  26. 27 May, 2019 1 commit
  27. 10 May, 2019 1 commit
  28. 03 May, 2019 1 commit
  29. 27 Apr, 2019 1 commit
    • Bnp integration pr (#275) · fedfe0d7
      jjsjann123 authored
      * Persistent group batchnorm added
      
      Added persistent grouped batch norm for performance runs in the strong-scaling case;
      currently only supporting:
      
        1. NHWC layout
        2. fp16
        3. synchronization only within a node
      
      An environment variable is used to tune LAUNCH_MARGIN, which limits the number of
      CTAs used by the persistent kernel (see the sketch after this commit).
      
      Documentation and examples will follow.
      
      * updating type().scalarType() to scalar_type()
      
      * move the launch margin to be defined at layer creation; add a knob to cap max CTAs per SM
      
      * fix the CTA computation
      
      * review comments:
      
      set device_id through cudaGetDevice()
      move cudaMemset to cudaMemsetAsync
      update __threadfence() to __threadfence_system() for inter-device writes
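      The LAUNCH_MARGIN mentioned above reserves a few CTA slots so the persistent
      kernel does not occupy the whole GPU. A hedged sketch of how such a grid-size
      computation can look; the kernel and the environment-variable name here are
      placeholders, not apex's actual ones:
      
        #include <cstdlib>
        #include <cuda_runtime.h>
        
        __global__ void persistent_bn_kernel() { /* placeholder persistent kernel */ }
        
        // Grid size for the persistent kernel: maximum resident CTAs
        // (occupancy per SM times SM count) minus a reserved launch margin.
        int persistent_grid_size(int block_size) {
            int dev = 0, sm_count = 0, ctas_per_sm = 0;
            cudaGetDevice(&dev);
            cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctas_per_sm, persistent_bn_kernel,
                                                          block_size, /*dynamic smem*/ 0);
            const char* env = std::getenv("BN_LAUNCH_MARGIN");  // placeholder variable name
            int margin = env ? std::atoi(env) : 0;              // CTAs left free for other work
            int max_ctas = sm_count * ctas_per_sm - margin;
            return max_ctas > 0 ? max_ctas : 1;
        }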
  30. 26 Apr, 2019 3 commits
    • whitespace · c978bda5
      Michael Carilli authored
    • Replace type().ScalarType() with scalar_type() (#272) · 855808f3
      ptrblck authored
      * change .type().ScalarType() to .scalar_type() + at::ScalarType::X to at::kX
      
      * revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF
      
      * revert scalar_type() to type() in AT_DISPATCH_FLOATING_TYPES
      
      * revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF in welford.cu
      
      * revert scalar_type() to type() in layer_norm_cuda_kernel.cu
      
      * revert at::kType to at::ScalarType::Type
      
      * use DISPATCH_FLOAT_AND_HALF to get rid of warnings
      
      * add dispatch mechanisms for double+float and double+float+half
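      The replacement above is mostly mechanical: the deprecated .type().scalarType()
      accessor becomes .scalar_type(), and at::ScalarType::X spellings become the at::kX
      shorthands; as the reverts in the commit note, the dispatch macros of that era
      still expected .type(), whereas current ATen takes a ScalarType. A small
      illustrative snippet of the pattern, written against current ATen and not taken
      from apex:
      
        #include <ATen/ATen.h>
        #include <ATen/Dispatch.h>
        
        void example(const at::Tensor& x) {
            // Before: x.type().scalarType() == at::ScalarType::Half
            // After:
            bool is_half = x.scalar_type() == at::kHalf;
            (void)is_half;
            
            // Dispatch over float/double/half; `scalar_t` is the concrete element
            // type inside the lambda.
            AT_DISPATCH_FLOATING_TYPES_AND_HALF(x.scalar_type(), "example", [&] {
                const scalar_t* data = x.data_ptr<scalar_t>();
                (void)data;
            });
        }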