1. 11 Aug, 2023 1 commit
  2. 23 Mar, 2023 1 commit
• Add FusedLARS optimizer (#109) · 7a428776
      luise.chen authored
      * Add fused_lars optimizer
      
      * Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
      
* Add a Nesterov momentum path to FusedLARS (sketched below)
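For context on what FusedLARS computes: LARS rescales each layer's update by a trust ratio built from the parameter and gradient norms, optionally with Nesterov momentum. Below is a minimal unfused sketch of one parameter's update; the fused extension applies the same math to all parameters in a single multi-tensor kernel launch, and the names here (lars_step, trust_coefficient) are illustrative, not the extension's API.

```python
import torch

def lars_step(p, grad, momentum_buf, lr, momentum=0.9,
              weight_decay=1e-4, trust_coefficient=0.001,
              nesterov=False):
    """One unfused LARS update for a single parameter tensor (sketch)."""
    w_norm = p.norm()
    g_norm = grad.norm()
    # Layer-wise trust ratio; fall back to 1.0 if either norm is zero.
    if w_norm > 0 and g_norm > 0:
        trust = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)
    else:
        trust = 1.0
    update = trust * (grad + weight_decay * p)
    momentum_buf.mul_(momentum).add_(update)
    # Nesterov looks ahead along the momentum direction.
    step = update + momentum * momentum_buf if nesterov else momentum_buf
    p.add_(step, alpha=-lr)
```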
  3. 15 Feb, 2023 1 commit
• Grid optimization - Chunk_Size optimization. (#104) · b047a1f1
      aspanday authored
* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 Adam case; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped for now.
The 17 other tests all pass.
More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
This commit changes BLOCK_SIZE to 1024 only for the optimizer kernels.
L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise allclose fails.
      
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam.
      
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and depth_to_max_blocks to 2560 (8x the previous 320).
The observed speedup is up to 1.4x for large element counts, up to 5.2x for moderate element counts, and up to 1.44x for small element counts (grid math sketched after this commit).
This change only affects the optimizers, specifically when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.
The full set of performance numbers, along with a comparison against Torch, is captured here (see sheet chunk_opt):
https://amdcloud.sharepoint.com/…/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
      
* Updating all files related to L2norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits.
Changes in chunk_size seem to affect the reduction kernels, so this commit keeps the unoptimized configuration for L2norm while applying the optimization to all other kernels used by the optimizers.
The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh, to be used specifically by the l2norm kernels.
      
      ---------
Co-authored-by: aspanday <aspanday@amd.com>
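Back-of-the-envelope grid math shows why shrinking the chunk size helps: with fewer elements per chunk, the same tensor produces more blocks and keeps more compute units busy. The constants below are the ones quoted in the commit; the helper itself is an illustrative sketch of a multi_tensor_apply-style launch calculation, not the apex source.

```python
BLOCK_SIZE = 1024        # threads per block (optimizer kernels; was 512)
CHUNK_SIZE = 256 * 32    # 8K elements per chunk (was 2048 * 32 = 64K)
MAX_BLOCKS = 2560        # depth_to_max_blocks (was 320)

def blocks_for(numel: int) -> int:
    """Blocks a multi_tensor_apply-style launch would use for one tensor."""
    chunks = (numel + CHUNK_SIZE - 1) // CHUNK_SIZE   # ceil-divide into chunks
    return min(chunks, MAX_BLOCKS)

# A 1M-element tensor: 16 blocks with 64K chunks vs. 128 blocks with 8K chunks.
print(blocks_for(1 << 20))   # 128
```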
  4. 25 Jan, 2023 1 commit
• Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 Adam case; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped for now.
The 17 other tests all pass.
More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
This commit changes BLOCK_SIZE to 1024 only for the optimizer kernels.
L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise allclose fails.
      
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam (see the sketch after this commit).
Co-authored-by: aspanday <aspanday@amd.com>
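The ROCm skip in the second bullet would look roughly like the following. Apex's tests use a skipIfRocm helper for this, but a plain unittest.skipIf on torch.version.hip (which is only set on ROCm builds) expresses the same idea; the test class and body here are placeholders.

```python
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # set only on ROCm builds

class TestFusedAdam(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "bfloat16 Adam test has a known bug; see commit notes")
    def test_bfloat16(self):
        ...  # exercise FusedAdam with bfloat16 tensors here
```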
  5. 09 Dec, 2022 1 commit
  6. 05 Aug, 2022 1 commit
• Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      
      
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
Co-authored-by: eqy <eddiey@nvidia.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
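The doc commits folded in above pin down where epsilon enters the RMSNorm formula: inside the root mean square, before the square root. A reference (unfused) implementation makes the placement explicit; this is a sketch of the math, not the fused kernel.

```python
import torch

def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm: y = x / sqrt(mean(x**2) + eps) * weight.
    eps is added to the mean of squares *before* the square root."""
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight
```

The fused module this commit enables is a drop-in nn.Module alongside FusedLayerNorm (per the commit, FusedRMSNorm is built on FusedLayerNorm and lives next to it under apex.normalization).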
  7. 29 Jul, 2022 1 commit
  8. 22 Jun, 2022 1 commit
  9. 31 May, 2022 1 commit
  10. 15 Apr, 2022 5 commits
  11. 06 Apr, 2022 1 commit
• Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with... · 5ecad142
      Hubert Lu authored
      Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
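The backward-compatibility strategy in this commit is feature detection: probe whether the running PyTorch exposes the rocBLAS alt-impl flag before using it, instead of requiring a minimum version. The C++ side does this with version guards; the Python sketch below uses a hypothetical attribute name purely to illustrate the pattern.

```python
import torch

# Hypothetical attribute name, for illustration only: probe for the
# flag and fall back to a safe default when this build lacks it.
_get_flag = getattr(torch._C, "_rocblas_fp16_alt_impl", None)

def fp16_alt_impl_enabled() -> bool:
    """True if this PyTorch build exposes and enables the alt impl."""
    return bool(_get_flag()) if callable(_get_flag) else False
```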
  12. 23 Mar, 2022 1 commit
  13. 26 Feb, 2022 1 commit
  14. 15 Feb, 2022 1 commit
  15. 12 Feb, 2022 1 commit
  16. 04 Feb, 2022 1 commit
  17. 25 Jan, 2022 1 commit
  18. 13 Dec, 2021 1 commit
  19. 09 Dec, 2021 2 commits
  20. 17 Nov, 2021 1 commit
  21. 27 Oct, 2021 1 commit
• Pipeline Model Parallel (#1202) · 63d5dd63
      Masaki Kozuki authored
      
      
      * Init apex.ppu (pipeline model parallel utility)
      
      Reference commit:
      
      ```
      commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
      Merge: 14f2c684 7b293d9b
      Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
      Date:   Wed Sep 22 22:57:54 2021 -0700
      
          Merge branch 'add_BOS' into 'main'
      
          Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
      
          See merge request ADLR/megatron-lm!328
      ```
      
      * removing get_args and replace import - phase 1
      
      * removing get_args and replace import - phase 2
      
      * move ppu to apex.transformer.pipeline_parallel
      
      * update two __init__.py
      
      * update READMEs
      
      * mpu -> parallel_state & tensor_parallel
      
      * fix
      
      * remove not pipeline files
      
      * separate schedules.py - phase 1
      
      * dissect schedules.py
      
      * data_iterators -> batch
      
      * remove optimizer from forward_backward_step funcs
      
      * init test
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * fix cyclic import
      
      * fix syntax of Callable
      
      * fix - 1
      
      * move directory as testing used for pp test as well
      
      * add some functions for num microbatches calculator
      
      * model is a list in pipeline parallel
      
      * skip build num microbatch calculator
      
      * fix test
      
      * assert -> raise
      
      * skip args printing
      
      * specify tensor shape everywhere even if None - phase 1
      
      * private timers
      
      * passing tensor shape & dtype around
      
      * update dtype handling by introducing helper func
      
      * write helper func to reduce cyclomatic complexity
      
      * remove duplicate
      
      * update
      
      * move split_tensor_into_1d_equal_chunks to avoid cyclic import
      
      * tmp
      
      * cosmetic
      
      * move gather_split_1d_tensor to avoid cyclic imports
      
      * remove debug print
      
      * add outer loop
      
      * early return if possible
      
      * cosmetic
      
      * passing around tensor shape
      
      * refactor test
      
      * add script to learn batch sampler behavior
      
      * update
      
      * minibatch splitter
      
      * add minibatch splitter
      
      * split minibatch into microbatches
      
      * minor changes
      
      * uncomment split batch for test sake
      
      * set as attribute
      
      * study the behavior of no pipelining
      
      * debug 1
      
      * reflect test util namespace change
      
      * update readme
      
      * cosmetic in test
      
* add model build helper func for interleaving sched
      
      * adding model builder from megatron
      
* can be cyclic import
      
      * fix
      
      * enable interleaving test, but failing even if forward only
      
      * fix batch preparation
      
      * add explanation
      
      * print data parallel size
      
      * fix typo
      
      * Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>
      
      * update
      
      * type hint for jit
      
      * fix forward_backward_no_pipelining test
      
      * pipeline forward backward seem to hang if not forward only
      
      * fix typo
      
      * debug
      
      * add p2p test
      
      * simplify
      
      * fix
      
      * tentative
      
      * set both tmp and pmp to 1
      
      * init
      
      * fix typo
      
      * fix
      
      * fix path of divide
      
      * set seed for tmp
      
      * update upon Eddie comment
      
      * fix typo
      
      * adding failing data loader test
      
      * fix
      
      * megatron still failing
      
      * check in
      
      * with the nested loop of new order, interleaving seems fine
      
      * cosmetic change
      
* make `forward_backward_pipelining_with_interleaving` private
      
      * warn users that interleaving sched is unstable
      
      * move noop handler to no pipelining
      
      * comment out rank_print
      
      * make `build_model` more flexible
      
      * skip megatron test tentatively
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * skip appropriately
      
      * remove wip p2p comm test
      
      * update type hint of model_provider_func
      
      * disable tf32 in each test script
      
      * skip interleaving w/ backward
      
      * rename as mpu is the old name
      
      * remove broken case
      
      * expose build_model func
      
      * delete `dist.ring_exchange` func call and `use_ring_exchange` argument
      
      * nit fixes
      
      * check in
      
      * remove unused file
      
      * update the list
      
      * update tensor shape
      
      * remove mixed dtype case
      
      * use torch.distributed.run
      
      * 2020 -> 2021
      
      * another 2020 -> 2021
      
      * docstring & type hint
      
      * fix teardown
      
      * update
      
      * change to experimental
      
      * check if warned
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
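Several commits above (the minibatch splitter and forward_backward_no_pipelining) revolve around one idea: split the global minibatch into microbatches and run forward/backward per microbatch, accumulating gradients. A minimal sketch of the simplest schedule, with illustrative names, a single stage, and no p2p communication:

```python
import torch

def split_into_microbatches(t: torch.Tensor, num_microbatches: int):
    """Split a minibatch into microbatches along the batch dimension."""
    return torch.chunk(t, num_microbatches, dim=0)

def forward_backward_no_pipelining(model, loss_fn, batch, targets,
                                   num_microbatches):
    """Run each microbatch's forward and backward back to back on a
    single stage; gradients accumulate into .grad across microbatches."""
    losses = []
    xs = split_into_microbatches(batch, num_microbatches)
    ys = split_into_microbatches(targets, num_microbatches)
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x), y) / num_microbatches
        loss.backward()
        losses.append(loss.detach())
    return losses
```

The pipelined schedules interleave these forward and backward passes across stages and exchange activations via p2p, which is where the tensor-shape and dtype plumbing in the commits above comes in.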
  22. 19 Oct, 2021 1 commit
  23. 08 Oct, 2021 1 commit
  24. 07 Oct, 2021 1 commit
  25. 04 Oct, 2021 1 commit
  26. 02 Oct, 2021 1 commit
  27. 24 Sep, 2021 1 commit
  28. 04 Sep, 2021 1 commit
• fix CUBLAS guards (#1162) · 54b93919
      Burc Eryilmaz authored
      
      
      * support for fused dense layer with cublasLt, fusion in both fprop and bprop
      
      * fix typo causing syntax error
      
* add fused GEMM+gelu+GEMM module (unfused reference sketched below)
      
      * fix typo for workspace size
      
      * update cublas check for 11600
      
      * add tests for fused dense layer
      
      * fix CUDA 10.x path
      
      * safer guard around CUBLAS constants, remove unreferenced variable
      
      * more guard changes
      
      * guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
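Written out unfused, the module added here computes two GEMMs with a GeLU in between; cublasLt lets the bias adds and activation fuse into the GEMM epilogues. A reference sketch of the math (the fused apex module's exact API is not shown here):

```python
import torch
import torch.nn.functional as F

def dense_gelu_dense(x, w1, b1, w2, b2):
    """Unfused reference for the fused GEMM+GeLU+GEMM module:
    y = gelu(x @ w1.T + b1) @ w2.T + b2."""
    h = F.gelu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)
```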
  29. 01 Sep, 2021 2 commits
  30. 17 May, 2021 1 commit
  31. 19 Apr, 2021 1 commit
  32. 17 Apr, 2021 1 commit
  33. 15 Apr, 2021 1 commit
• Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac
      Sudhakar Singh authored
      * Add unit tests for fused-novograd
      
      * Fix: tensors should reside on the same device
      
* Fix: the CUDA stream should be queried on the same device the tensors reside on. Found this while debugging the fused NovoGrad multi-device unit test (pattern sketched below)
      
      * fixed issues mentioned in the comments
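The multi-device fix above comes down to querying the CUDA stream on the tensor's own device rather than on the current default device. A hedged sketch of the pattern; the placeholder work stands in for the optimizer's real kernel launches:

```python
import torch

def work_per_device(params):
    """Do per-tensor work on each tensor's own device, using that
    device's current stream; mixing devices and streams is the bug
    class the commit above fixes."""
    for p in params:
        with torch.cuda.device(p.device):
            stream = torch.cuda.current_stream(p.device)
            with torch.cuda.stream(stream):
                p.mul_(1.0)  # placeholder for the real kernel launch
```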
  34. 25 Feb, 2021 1 commit