Commits · ddc0803912bd5fe70dd441df36fae2ce37776598 · OpenDAS / apex

26 Feb, 2022 1 commit

[transformer] Fuse grad accumulation with wgrad (#1297) · ddc08039

Masaki Kozuki authored Feb 25, 2022



* fuse grad accumulation w/ weight grad
Co-authored-by: Sangkug Lym <slym@nvidia.com>

* fp32 training path

* not using *args, **kwargs

* backward: moved the tensor dimension conversion
Co-authored-by: Sangkug Lym <slym@nvidia.com>

* move files to csrc/megatron

* fix fp32 path

* fix typo

* add  to  in order to select the correct custom extension

* fix typo

* comment on import guard

* update test: enable gradient_accumulation_fusion

* 86

* remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>

ddc08039
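
As a rough illustration of what fusing gradient accumulation with the weight-gradient (wgrad) computation means, the sketch below adds `grad_output^T @ input` straight into a persistent buffer instead of first materializing a temporary weight gradient and adding it afterwards. The names `wgrad_accumulate` and `main_grad` are illustrative assumptions, not the extension's API, and plain Python lists stand in for GPU tensors.

```python
def wgrad_accumulate(inp, grad_out, main_grad):
    """Accumulate grad_out^T @ inp into main_grad in place.

    inp:       [tokens][in_features]
    grad_out:  [tokens][out_features]
    main_grad: [out_features][in_features], kept across micro-batches
    """
    for x_row, g_row in zip(inp, grad_out):
        for o, g in enumerate(g_row):
            for i, x in enumerate(x_row):
                # Add each product directly; no temporary wgrad is built.
                main_grad[o][i] += g * x
    return main_grad


# Two micro-batches accumulate into the same buffer.
main_grad = [[0.0, 0.0], [0.0, 0.0]]
wgrad_accumulate([[1.0, 2.0]], [[3.0, 4.0]], main_grad)
wgrad_accumulate([[1.0, 0.0]], [[1.0, 1.0]], main_grad)
```

In a real kernel the accumulation would happen inside the GEMM epilogue; the loop above only shows the arithmetic being merged.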

15 Feb, 2022 1 commit
- taking channels last 3d into account (#1284) · 39fc7ccf
  Masaki Kozuki authored Feb 15, 2022
  
  39fc7ccf
12 Feb, 2022 1 commit
- cast for `-Wc++11-narrowing` (#1288) · 1e218749
  Masaki Kozuki authored Feb 11, 2022
  
  1e218749
04 Feb, 2022 1 commit

FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) · 684c4733

eqy authored Feb 03, 2022



* FusedRMSNorm based on FusedLayerNorm

* refactor duplicated kernels

* delete comments

* delete comments

* cleanup

* cleanup

* cleanup, fixed clobbering forward_affine_mixed_dtypes

* fix pybind naming and add MixedFused test

* undo skipping

* check elementwise_affine

* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

684c4733
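
For readers unfamiliar with RMSNorm, the following minimal sketch shows the commonly used definition: unlike LayerNorm it does not subtract the mean, and it scales by the root mean square of the input. The function name `rms_norm`, the default `eps`, and the optional `weight` argument (standing in for the elementwise-affine case) are assumptions for illustration, not the fused kernel's interface.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Normalize a vector by its root mean square, with an optional scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        # No affine parameters: normalization only.
        return [v * inv_rms for v in x]
    # Elementwise affine: per-feature scale, no bias term.
    return [v * inv_rms * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], eps=0.0)
y_scaled = rms_norm([3.0, 4.0], weight=[2.0, 2.0], eps=0.0)
```

With `eps=0.0` the mean of the squared outputs in `y` is 1, up to floating-point rounding.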

09 Dec, 2021 1 commit

Add fused mixed precision lamb optimizer. (#1237) · 3c8f5161

Kevin Stephano authored Dec 08, 2021

* Add fused mixed precision lamb optimizer.

* Fix device usage in constructor.

* Fix sending param_group tensor state to device.

* Remove unneeded device set.

3c8f5161

27 Oct, 2021 1 commit

Pipeline Model Parallel (#1202) · 63d5dd63

Masaki Kozuki authored Oct 27, 2021

* Init apex.ppu (pipeline model parallel utility)

Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1

* removing get_args and replace import - phase 2

* move ppu to apex.transformer.pipeline_parallel

* update two __init__.py

* update READMEs

* mpu -> parallel_state & tensor_parallel

* fix

* remove non-pipeline files

* separate schedules.py - phase 1

* dissect schedules.py

* data_iterators -> batch

* remove optimizer from forward_backward_step funcs

* init test

* Apply 2 suggestion(s...

63d5dd63

08 Oct, 2021 1 commit
- check in (#1187) · 3ad9db2a
  eqy authored Oct 07, 2021
  
  3ad9db2a
07 Oct, 2021 1 commit
- Update layer_norm_cuda_kernel.cu (#1184) · 5adf7bc2
  eqy authored Oct 06, 2021
  
  5adf7bc2
02 Oct, 2021 1 commit

transformer utils (#1181) · 365fdc18

Masaki Kozuki authored Oct 02, 2021


Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

24 Sep, 2021 1 commit
- THCDeviceUtils.cuh -> ATen/cuda/DeviceUtils.cuh (#1173) · 76daa454
  Masaki Kozuki authored Sep 24, 2021
  
  76daa454
04 Sep, 2021 1 commit

fix CUBLAS guards (#1162) · 54b93919

Burc Eryilmaz authored Sep 04, 2021



* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

* safer guard around CUBLAS constants, remove unreferenced variable

* more guard changes

* guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

54b93919

01 Sep, 2021 2 commits

Seryilmaz/fuse norm into scale (#1149) · 4d190db6

Burc Eryilmaz authored Sep 01, 2021



* fuse norm into scale

* add fused norm into dlamb
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

4d190db6

Seryilmaz/more cublas lt (#1147) · 6af09dd9

Burc Eryilmaz authored Aug 31, 2021



* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

6af09dd9

17 May, 2021 1 commit
- compile cublasLt code only for cublas >= 11.0 (#1108) · 00c1e56d
  Burc Eryilmaz authored May 17, 2021
```
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  00c1e56d
19 Apr, 2021 1 commit
- Fix cublasLt context create/destroy overhead in MLP extension (#1083) · 082f999a
  Burc Eryilmaz authored Apr 19, 2021
```
* don't create cublasLt handle, fix zero block size case

* cleanup
```
  082f999a
17 Apr, 2021 1 commit

initial cublaslt support for MLP (#1080) · b8be1bc7

Burc Eryilmaz authored Apr 16, 2021



* initial cublaslt support

* 64 bit input

* add license headers

* cleanup

* remove license
Co-authored-by: pbialecki <pbialecki@nvidia.com>

b8be1bc7

15 Apr, 2021 1 commit

Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac

Sudhakar Singh authored Apr 15, 2021

* Add unit tests for fused-novograd

* Fix: tensors should reside on the same device

* Fix: the CUDA stream should be obtained on the same device on which the tensors reside. Found this while debugging the fused NovoGrad multi-device unit test

* fixed issues mentioned in the comments

59d2f7ac

19 Oct, 2020 1 commit

Optimize the sync batchnorm by batching the communication (#980) · 8a1ed9e8

lly-zero-one authored Oct 19, 2020

In this PR, we mainly optimize the performance of SyncBatchNorm and fix one potential issue in the welford_parallel kernel implementation.

* For performance, we batch the mean/var/count all_gather communication together and send it once in the forward path.
* We also batch the all_reduce in the backward path.
* We add a contiguous call on the input of the welford_parallel kernel.

If there is any standard perf benchmark, I would be happy to run it.

8a1ed9e8
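
To make the batched statistics exchange more concrete, here is a small sketch of how per-rank `(mean, var, count)` triples, once gathered in a single communication step, can be merged into global statistics with a Welford-style parallel update. The function `merge_stats` and its tuple layout are hypothetical; the actual welford_parallel kernel may organize this differently.

```python
def merge_stats(stats):
    """Merge per-rank (mean, biased_var, count) triples into global stats."""
    total = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for r_mean, r_var, r_count in stats:
        if r_count == 0:
            continue  # a rank may contribute no samples
        new_total = total + r_count
        delta = r_mean - mean
        mean += delta * r_count / new_total
        m2 += r_var * r_count + delta * delta * total * r_count / new_total
        total = new_total
    if total == 0:
        raise ValueError("no samples on any rank")
    return mean, m2 / total, total


# Rank 0 saw [1, 2, 3]; rank 1 saw [10]. Counts differ across ranks.
g_mean, g_var, g_count = merge_stats([(2.0, 2.0 / 3.0, 3), (10.0, 0.0, 1)])
```

Because each triple carries its own count, the same merge also covers ranks with different local batch sizes.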

05 Aug, 2020 1 commit

set device guard for multi tensor optimizer implementations (#927) · 274cc063

ngimel authored Aug 05, 2020

* add device guards to the optimizers

* add untracked file

* set deviceGuard in multi_tensor_apply

* address review comments; fix lamb

* indent

* typo

274cc063

06 Jul, 2020 1 commit

[sync BN] (#792) · 1ff54b8f

jjsjann123 authored Jul 06, 2020

* [sync BN]

support non-uniform batch size across process group.

TODO: test should be added once cleaned up.

* updating unit tests

* new unit tests for different inputs

* cleaning

1ff54b8f

23 May, 2020 1 commit
- fix function signature · 2be773d3
  Kexin Yu authored May 23, 2020
  
  2be773d3
22 May, 2020 5 commits
- more fixes on dtypes · cf918ac1
  Kexin Yu authored May 22, 2020
  
  cf918ac1
- use pointer · 06a83ce7
  Kexin Yu authored May 22, 2020
  
  06a83ce7
- .data<...>() · 3a727a01
  Kexin Yu authored May 21, 2020
  
  3a727a01
- at::Tensor::data_ptr() · 2c3f3d9a
  Kexin Yu authored May 21, 2020
  
  2c3f3d9a
- fix dtype · abc991da
  Kexin Yu authored May 21, 2020
  
  abc991da
21 May, 2020 1 commit
- make fused LAMB async · f54cc1c9
  Kexin Yu authored May 21, 2020
  
  f54cc1c9
14 May, 2020 1 commit
- Add FusedAdagrad (#822) · 3bae8c83
  Andrew Tulloch authored May 14, 2020
  
  3bae8c83
30 Apr, 2020 3 commits

fix function signature for LAMBStage2Functor · c8bcfff8
Kexin Yu authored Apr 30, 2020

c8bcfff8
enable wider load/store for multi_tensor_apply kernels (#763) · 17ee854e
Deyu Fu authored Apr 30, 2020
```
* modify MTA axpby for wider load/store

* Make scale/axpby/l2/adam/lamb multi_tensor use wider load
```
17ee854e

Improvements to apex.mlp (#804) · 31aceeaa

Deyu Fu authored Apr 30, 2020

* update fused bias relu backward kernel

* add support for not requiring first-layer dgrad

* fix bug: wrong layer in requires grad

* add infrastructure for optional bias and activation; currently only supports no bias and no relu

* make bias and relu optional separately

* add sigmoid activation option

31aceeaa
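
The options listed in this commit (bias on or off, and an activation of none, relu, or sigmoid) can be pictured with the following sketch of a single dense layer. The function `mlp_layer` and its argument names are assumptions for illustration only and do not reflect the signature of `apex.mlp`.

```python
import math


def mlp_layer(x, weight, bias=None, activation="relu"):
    """One dense layer with independently optional bias and activation.

    x:      [in_features]
    weight: [out_features][in_features]
    bias:   [out_features] or None
    """
    out = []
    for row_idx, w_row in enumerate(weight):
        acc = sum(w * v for w, v in zip(w_row, x))
        if bias is not None:
            acc += bias[row_idx]
        if activation == "relu":
            acc = max(acc, 0.0)
        elif activation == "sigmoid":
            acc = 1.0 / (1.0 + math.exp(-acc))
        elif activation != "none":
            raise ValueError(f"unknown activation: {activation}")
        out.append(acc)
    return out


w = [[1.0, 1.0], [2.0, 0.0]]
plain = mlp_layer([1.0, -2.0], w, activation="none")
with_bias_relu = mlp_layer([1.0, -2.0], w, bias=[0.5, 0.5], activation="relu")
```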

28 Apr, 2020 1 commit
- LAMB: global grad clipping & more flexibility in adaptive lr · 5b300119
  Kexin Yu authored Apr 28, 2020
  
  5b300119
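
Global gradient clipping, as named in this commit title, is usually understood as rescaling all gradients by one shared factor derived from their combined L2 norm. The sketch below shows that standard formulation; whether the fused LAMB kernels compute it in exactly this form is an assumption, and `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by a shared factor so the global norm <= max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm > 0.0:
        scale = min(1.0, max_norm / global_norm)
    else:
        scale = 1.0  # all-zero gradients need no clipping
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, global_norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```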
22 Apr, 2020 1 commit
- initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
  Deyu Fu authored Apr 22, 2020
  
  71511faf
10 Apr, 2020 1 commit
- Add no-flattening e5m2-allgather option · c7b34549
  Thor Johnsen authored Apr 09, 2020
  
  c7b34549
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
```
* NHWC support for multi tensor apply

* compilation fix for version<=1.4
```
  de6378f5
04 Oct, 2019 1 commit

move previous fused_adam and fp16_optimizer to contrib (#517) · 1904e48d

Deyu Fu authored Oct 04, 2019

* move previous fused_adam and fp16_optimizer to contrib

* make build contrib.fused_adam optional

* change build option name

* remove unnecessary try import

1904e48d

06 Sep, 2019 1 commit

Fix for #456 (#477) · 325f5a0b

mcarilli authored Sep 05, 2019

* Pushing for build tests

* Contrib files

* Removing deprecated checks

325f5a0b

20 Aug, 2019 1 commit
- add back lamb stage1/2 to amp_C python · b9f0995b
  Deyu Fu authored Aug 20, 2019
  
  b9f0995b
17 Aug, 2019 1 commit
- add back legacy lamb code for backward compatibility now · 2bc766ce
  Deyu Fu authored Aug 16, 2019
  
  2bc766ce
16 Aug, 2019 1 commit

clean up variance options support by all fused optimizers: · 18062b69

Deyu Fu authored Aug 16, 2019

* correctly do not apply bias correction to epsilon (same as recent upstream change)
* correctly do not apply bias correction to weight decay (consistent with upstream AdamW)
* add adam_w_mode to FusedAdam/LAMB to select L2 regularization or decoupled weight decay (Adam vs AdamW)
* correctly document reg_inside_moment as distinct from adam_w_mode in FusedNovoGrad
* remove legacy eps_mode from FusedAdam
* make the internal math type float across fused optimizers

18062b69
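
The difference between the two modes described in this commit can be sketched with a scalar, textbook-style Adam step. This is not the fused kernel code: the function `adam_step` and its defaults are assumptions, and only the `adam_w_mode` name comes from the commit message. In L2 mode the decay term is folded into the gradient and passes through the moment estimates; in AdamW mode it is added to the update afterwards. In both modes epsilon is added after bias correction and is not itself corrected.

```python
import math


def adam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, adam_w_mode=True):
    """One scalar Adam/AdamW step; returns (new_p, new_m, new_v)."""
    if not adam_w_mode:
        # L2 regularization: decay is part of the gradient.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)  # eps is not bias-corrected
    if adam_w_mode:
        # Decoupled weight decay: applied directly, no bias correction.
        update += weight_decay * p
    return p - lr * update, m, v


# Zero gradient isolates the effect of the decay term in each mode.
p_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1,
                          weight_decay=0.1, adam_w_mode=True)
p_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1,
                       weight_decay=0.1, adam_w_mode=False)
```

With a zero gradient, AdamW mode shrinks the parameter by `lr * weight_decay * p`, while L2 mode produces a normalized update of roughly `lr` because the decay term is rescaled by the second-moment estimate.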