- 25 Jan, 2023 1 commit
aspanday authored
* Update BLOCK_SIZE to 1024. tests/L0/run_optimizers/test_fused_optimizer.py passes except test_bfloat16 for Adam; that test appears to have a bug that still needs to be resolved, so it is skipped for now. All 17 other tests pass. More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization. This commit changes BLOCK_SIZE to 1024 only for the optimizers; the L2-norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512, since allclose fails otherwise.
* Update tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam.
Co-authored-by: aspanday <aspanday@amd.com>
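For context, a minimal sketch of the kind of ROCm skip the second bullet describes; apex's actual skipIfRocm helper may differ in detail, and the test class name here is only illustrative:

```python
# Hypothetical minimal sketch of a ROCm skip decorator; apex's real helper
# lives in its test utilities and may be implemented differently.
import unittest

import torch

IS_ROCM = torch.version.hip is not None  # hip version string on ROCm builds

def skipIfRocm(fn):
    """Skip a test when running on a ROCm (HIP) build of PyTorch."""
    return unittest.skipIf(IS_ROCM, "test is skipped on ROCm")(fn)

class TestFusedAdam(unittest.TestCase):  # illustrative test class
    @skipIfRocm
    def test_bfloat16(self):
        ...  # bfloat16 Adam check, skipped on ROCm for now
```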
- 09 Dec, 2022 1 commit
hubertlu-tw authored
- 05 Aug, 2022 1 commit
Hubert Lu authored
* FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) * FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> * fix and generate docs for FusedRMSNorm (#1285) * [FusedRMSNorm doc] document where epsilon is added (#1295) * [FusedRMSNorm doc] add epsilon to formula * correct * better wording * Fix some bugs * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs * Fix NaN issues in FusedRMSNorm * Update test_fused_layer_norm.py * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize Co-authored-by:
eqy <eddiey@nvidia.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Stas Bekman <stas00@users.noreply.github.com>
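The doc commits above pin down where epsilon enters the formula. A plain-PyTorch reference of the RMSNorm ("T5LayerNorm") computation, as a sketch rather than the fused kernel itself:

```python
import torch

def rms_norm_reference(x, weight, eps=1e-6):
    # RMSNorm / "T5LayerNorm": no mean subtraction, no bias; epsilon is
    # added to the mean of squares *inside* the square root.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return weight * x * torch.rsqrt(variance + eps)
```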
- 29 Jul, 2022 1 commit
hubertlu-tw authored
- 22 Jun, 2022 1 commit
Masaki Kozuki authored
* add temporary dispatch of double, float, half, bfloat16
* fusedadam of bfloat16
* Add bfloat16 path to FusedAdam
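A usage sketch of what the new bfloat16 path enables, assuming an apex build with the fused CUDA extensions (model and sizes are illustrative):

```python
import torch
from apex.optimizers import FusedAdam

# With bfloat16 dispatch added, FusedAdam can step parameters held
# directly in torch.bfloat16.
model = torch.nn.Linear(1024, 1024).cuda().to(torch.bfloat16)
opt = FusedAdam(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()
opt.step()
```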
- 31 May, 2022 1 commit
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming
* Use BACKWARD_PASS_GUARD_CLASS to avoid a lengthy if-statement
- 15 Apr, 2022 5 commits
hubertlu-tw authored
hubertlu-tw authored
hubertlu-tw authored
eqy authored
* FusedRMSNorm based on FusedLayerNorm
* refactor duplicated kernels
* delete comments
* delete comments
* cleanup
* cleanup
* cleanup, fixed clobbering forward_affine_mixed_dtypes
* fix pybind naming and add MixedFused test
* undo skipping
* check elementwise_affine
* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py (Oof, nice catch, thanks)
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Hubert Lu authored
* Add setup_simple.py for debugging the compilation issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve filename collision of *.cpp files containing to-hipify code with *.cu files
- 06 Apr, 2022 1 commit
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs
- 23 Mar, 2022 1 commit
Hubert Lu authored
* Add rocblas_alt_impl flag in MLP
* Refactor rocblas_alt_impl implementation and only use it for backprop
- 26 Feb, 2022 1 commit
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>
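An idea sketch of the fusion in the first bullet, not apex's CUDA kernel: instead of materializing grad_weight and adding it to an accumulator, the weight-grad GEMM accumulates directly into a persistent fp32 buffer. The `main_grad` attribute is an assumed Megatron-style convention here:

```python
import torch

class LinearWithGradAccumulationFusion(torch.autograd.Function):
    # Sketch only: the real apex path does the accumulation inside the
    # weight-gradient GEMM kernel rather than in Python.
    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        return input @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_output @ weight
        # Accumulate the weight gradient in-place into weight.main_grad
        # (assumed pre-allocated fp32 buffer attached to the parameter),
        # so no separate grad_weight tensor is materialized.
        weight.main_grad.addmm_(grad_output.t(), input)
        # Return None for the weight: its grad is accumulated out-of-band.
        return grad_input, None
```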
- 15 Feb, 2022 1 commit
Masaki Kozuki authored
- 12 Feb, 2022 1 commit
Masaki Kozuki authored
- 04 Feb, 2022 1 commit
eqy authored
* FusedRMSNorm based on FusedLayerNorm
* refactor duplicated kernels
* delete comments
* delete comments
* cleanup
* cleanup
* cleanup, fixed clobbering forward_affine_mixed_dtypes
* fix pybind naming and add MixedFused test
* undo skipping
* check elementwise_affine
* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py (Oof, nice catch, thanks)
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 25 Jan, 2022 1 commit
Hubert Lu authored
* Optimize fused layer normalization for MI100
* Optimize cuComputePartGradGammaBeta for AMD GPUs
- 13 Dec, 2021 1 commit
Hubert Lu authored
- 09 Dec, 2021 2 commits
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
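For reference, a per-parameter sketch of the LAMB trust-ratio step that a fused LAMB optimizer implements; this is the textbook algorithm, not apex's kernel, and `m`/`v` are assumed to be the bias-corrected Adam moments:

```python
def lamb_update_reference(p, m, v, lr, wd, eps=1e-8):
    # LAMB: rescale the Adam-style direction by the layer-wise trust
    # ratio ||p|| / ||update|| before applying it.
    update = m / (v.sqrt() + eps) + wd * p
    p_norm = p.norm().item()
    u_norm = update.norm().item()
    trust_ratio = p_norm / u_norm if p_norm > 0 and u_norm > 0 else 1.0
    p.add_(update, alpha=-lr * trust_ratio)
```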
- 17 Nov, 2021 1 commit
Masaki Kozuki authored
- 27 Oct, 2021 1 commit
Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility). Reference commit:
```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while
    multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```
* removing get_args and replace import - phase 1
* removing get_args and replace import - phase 2
* move ppu to apex.transformer.pipeline_parallel
* update two __init__.py
* update READMEs
* mpu -> parallel_state & tensor_parallel
* fix
* remove not pipeline files
* separate schedules.py - phase 1
* dissect schedules.py
* data_iterators -> batch
* remove optimizer from forward_backward_step funcs
* init test
* Apply 2 suggestion(s...
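A usage sketch of the relocated utilities. The Megatron-style argument names (with trailing underscores) reflect the API at the time and may have shifted across apex versions; torch.distributed must already be initialized:

```python
import torch
from apex.transformer import parallel_state

# Assumes a multi-process launch (e.g. torchrun) with NCCL available.
torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size_=2,
    pipeline_model_parallel_size_=2,
)
rank = parallel_state.get_pipeline_model_parallel_rank()
```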
- 19 Oct, 2021 1 commit
Hubert Lu authored
- 08 Oct, 2021 1 commit
eqy authored
- 07 Oct, 2021 1 commit
eqy authored
- 04 Oct, 2021 1 commit
Jeff Daily authored
- 02 Oct, 2021 1 commit
Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
- 24 Sep, 2021 1 commit
Masaki Kozuki authored
- 04 Sep, 2021 1 commit
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop
* fix typo causing syntax error
* add fused GEMM+gelu+GEMM module
* fix typo for workspace size
* update cublas check for 11600
* add tests for fused dense layer
* fix CUDA 10.x path
* safer guard around CUBLAS constants, remove unreferenced variable
* more guard changes
* guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
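An unfused reference for the GEMM+GeLU+GEMM module added here, as a sketch of the math only: the cublasLt path fuses the bias and GeLU epilogues into the GEMMs instead of running them as separate kernels.

```python
import torch
import torch.nn.functional as F

def mlp_reference(x, w1, b1, w2, b2):
    # What the fused module computes, expressed with unfused ops:
    # GEMM + bias, GeLU, then a second GEMM + bias.
    h = F.gelu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)
```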
- 01 Sep, 2021 2 commits
Burc Eryilmaz authored
* fuse norm into scale
* add fused norm into dlamb
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Burc Eryilmaz authored
* support for fused dense layer with cublasLt, fusion in both fprop and bprop
* fix typo causing syntax error
* add fused GEMM+gelu+GEMM module
* fix typo for workspace size
* update cublas check for 11600
* add tests for fused dense layer
* fix CUDA 10.x path
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 17 May, 2021 1 commit
Burc Eryilmaz authored
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
- 19 Apr, 2021 1 commit
Burc Eryilmaz authored
* don't create cublasLt handle, fix zero block size case
* cleanup
- 17 Apr, 2021 1 commit
Burc Eryilmaz authored
* initial cublaslt support
* 64 bit input
* add license headers
* cleanup
* remove license
Co-authored-by: pbialecki <pbialecki@nvidia.com>
- 15 Apr, 2021 1 commit
Sudhakar Singh authored
* Add unit tests for fused Novograd
* Fix: tensors should reside on the same device
* Fix: the CUDA stream should be obtained on the same device on which the tensors reside. Found this while debugging the fused Novograd multi-device unit test
* Fix issues mentioned in the review comments
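The stream fix in spirit, as a small sketch (the helper name is illustrative): query the current stream on the tensor's own device rather than on whichever device happens to be current.

```python
import torch

def stream_for(tensor):
    # torch.cuda.current_stream accepts a device argument, so the stream
    # is resolved on the tensor's device, not the globally current one.
    return torch.cuda.current_stream(tensor.device)

t = torch.empty(4, device="cuda:1")
s = stream_for(t)  # stream on cuda:1 even if cuda:0 is current
```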
- 25 Feb, 2021 1 commit
Jeff Daily authored
This reverts commit bdd481d1.
- 25 Jan, 2021 1 commit
Jeff Daily authored
- incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure
- 21 Jan, 2021 1 commit
Jeff Daily authored
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
- 18 Jan, 2021 1 commit
Jeff Daily authored