Commits · 87fc412541546909e2b14cfe776dd28f57197332 · OpenDAS / apex

08 Aug, 2022 1 commit

Skip the failing unit tests from the FusedRMSNorm PR (#85) · 87fc4125

Hubert Lu authored Aug 08, 2022



* Skip the failing unit tests from the FusedRMSNorm PR

* Update test_lamb.py
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

87fc4125

05 Aug, 2022 1 commit

Enable FusedRMSNorm (#78) · c97ebfab

Hubert Lu authored Aug 05, 2022



* FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)

* FusedRMSNorm based on FusedLayerNorm

* refactor duplicated kernels

* delete comments

* delete comments

* cleanup

* cleanup

* cleanup, fixed clobbering forward_affine_mixed_dtypes

* fix pybind naming and add MixedFused test

* undo skipping

* check elementwise_affine

* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

* fix and generate docs for FusedRMSNorm (#1285)

* [FusedRMSNorm doc] document where epsilon is added (#1295)

* [FusedRMSNorm doc] add epsilon to formula

* correct

* better wording

* Fix some bugs

* Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs

* Fix NaN issues in FusedRMSNorm

* Update test_fused_layer_norm.py

* Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm

* Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
Co-authored-by: eqy <eddiey@nvidia.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

c97ebfab

14 Dec, 2021 2 commits
- Conditionally define autocast_dtypes for different torch versions · 68364b49
  Hubert Lu authored Dec 14, 2021
  
  68364b49
- Skip failing unit tests (#61) · d150afdc
  Hubert Lu authored Dec 13, 2021
```
* Skip failing unit tests

* Modify the test skipping messages
```
  d150afdc
09 Dec, 2021 1 commit

Add fused mixed precision lamb optimizer. (#1237) · d11ddccf

Kevin Stephano authored Dec 08, 2021

* Add fused mixed precision lamb optimizer.

* Fix device usage in constructor.

* Fix sending param_group tensor state to device.

* Remove unneeded device set.

d11ddccf

01 Dec, 2021 1 commit
- Update run_rocm_distributed.sh · 3f3da214
  Hubert Lu authored Dec 01, 2021
  
  3f3da214
22 Nov, 2021 1 commit
- Update run_rocm.sh · 405956c3
  Hubert Lu authored Nov 22, 2021
```
Change python3.6 to python
```
  405956c3
19 Nov, 2021 3 commits

Add unit tests for Apex extensions and distributed Apex · 15498555
Hubert Lu authored Nov 19, 2021

15498555

minimal bert pipeline parallel test (#1216) · aa756cec

eqy authored Nov 18, 2021

* minimal bert pipeline parallel test

* fix global and cleanup

* use get_forward_backward_func

* cleanup and fix some tests

aa756cec

[POC] Support Megatron-LM's `rampup_batch_size` argument (#1212) · 35336133

Masaki Kozuki authored Nov 19, 2021

* init logging use

* fix

* clean up

* fp32 p2p comm

* init

* Dynamic global batch size with `MegatronPretrainingSampler`

I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have requirements on global batch size, total number of samples, local minibatch size, etc. that I'm not yet familiar with.

* revive original pipeline parallel test

* update MULTIGPU_TEST: add dynamic batchsize test

* run MegatronPretrainingRandomSampler

* fix comment

* fix

* update

* cosmetic

* add note

* Apply 2 suggestion(s) to 2 file(s)

* change following https://github.com/NVIDIA/apex/pull/1210

* fix

35336133

10 Nov, 2021 1 commit
- check in (#1210) · 2205cff2
  eqy authored Nov 09, 2021
  
  2205cff2
27 Oct, 2021 2 commits

Pipeline Model Parallel (#1202) · 63d5dd63

Masaki Kozuki authored Oct 27, 2021



* Init apex.ppu (pipeline model parallel utility)

Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1

* removing get_args and replace import - phase 2

* move ppu to apex.transformer.pipeline_parallel

* update two __init__.py

* update READMEs

* mpu -> parallel_state & tensor_parallel

* fix

* remove non-pipeline files

* separate schedules.py - phase 1

* dissect schedules.py

* data_iterators -> batch

* remove optimizer from forward_backward_step funcs

* init test

* Apply 2 suggestion(s) to 2 file(s)

* fix cyclic import

* fix syntax of Callable

* fix - 1

* move directory as testing used for pp test as well

* add some functions for num microbatches calculator

* model is a list in pipeline parallel

* skip build num microbatch calculator

* fix test

* assert -> raise

* skip args printing

* specify tensor shape everywhere even if None - phase 1

* private timers

* passing tensor shape & dtype around

* update dtype handling by introducing helper func

* write helper func to reduce cyclomatic complexity

* remove duplicate

* update

* move split_tensor_into_1d_equal_chunks to avoid cyclic import

* tmp

* cosmetic

* move gather_split_1d_tensor to avoid cyclic imports

* remove debug print

* add outer loop

* early return if possible

* cosmetic

* passing around tensor shape

* refactor test

* add script to learn batch sampler behavior

* update

* minibatch splitter

* add minibatch splitter

* split minibatch into microbatches

* minor changes

* uncomment split batch for test sake

* set as attribute

* study the behavior of no pipelining

* debug 1

* reflect test util namespace change

* update readme

* cosmetic in test

* add model build helper func for interleaving sched

* adding model builder from megatron

* can be cyclic import

* fix

* enable interleaving test, but failing even if forward only

* fix batch preparation

* add explanation

* print data parallel size

* fix typo

* Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>

* update

* type hint for jit

* fix forward_backward_no_pipelining test

* pipeline forward backward seem to hang if not forward only

* fix typo

* debug

* add p2p test

* simplify

* fix

* tentative

* set both tmp and pmp to 1

* init

* fix typo

* fix

* fix path of divide

* set seed for tmp

* update upon Eddie comment

* fix typo

* adding failing data loader test

* fix

* megatron still failing

* check in

* with the nested loop of new order, interleaving seems fine

* cosmetic change

* make `forward_backward_pipelining_with_interleaving` private

* warn users that interleaving sched is unstable

* move noop handler to no pipelining

* comment out rank_print

* make `build_model` more flexible

* skip megatron test tentatively

* correctly comment out rank_print

* correctly comment out rank_print

* correctly comment out rank_print

* skip appropriately

* remove wip p2p comm test

* update type hint of model_provider_func

* disable tf32 in each test script

* skip interleaving w/ backward

* rename as mpu is the old name

* remove broken case

* expose build_model func

* delete `dist.ring_exchange` func call and `use_ring_exchange` argument

* nit fixes

* check in

* remove unused file

* update the list

* update tensor shape

* remove mixed dtype case

* use torch.distributed.run

* 2020 -> 2021

* another 2020 -> 2021

* docstring & type hint

* fix teardown

* update

* change to experimental

* check if warned
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

63d5dd63

Revert "Enable MLP unit tests on ROCm" · aee9f00d
hubertlu authored Oct 27, 2021
```
This reverts commit 964e61f1.
```
aee9f00d

26 Oct, 2021 1 commit
- Enable MLP unit tests on ROCm · 964e61f1
  hubertlu authored Oct 26, 2021
  
  964e61f1
23 Oct, 2021 1 commit

Use out-of-place to avoid D2D copy in tensor parallel cross entropy (#1198) · 3303b3e7

Masaki Kozuki authored Oct 23, 2021



* switch from clone to out-of-place subtract

* Update apex/mpu/cross_entropy.py

* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

3303b3e7

20 Oct, 2021 1 commit
- Revert test_fused_layer_norm.py because torch.cuda.is_bf16_supported is missing in PyTorch 1.9 · d36b3c63
  Hubert Lu authored Oct 20, 2021
  
  d36b3c63
19 Oct, 2021 1 commit
- Revert back to the test_fused_optimizer.py in upstream to solve multiple unit test errors · 93f3a3bc
  Hubert Lu authored Oct 19, 2021
  
  93f3a3bc
08 Oct, 2021 1 commit
- Remove `custom_fwd`/`custom_bwd` from fused softmax (#1188) · 14ccf598
  Masaki Kozuki authored Oct 09, 2021
```
* run backward

* remove custom_fwd/custom_bwd
```
  14ccf598
06 Oct, 2021 1 commit
- ColumnParallelLinearWithAsyncAllreduce autocast support (#1183) · b3da6036
  Masaki Kozuki authored Oct 06, 2021
```
* [ColumnParallelLinear] Test behavior in autocast

* fix test

* casts manually to autocast dtype
```
  b3da6036
02 Oct, 2021 1 commit

transformer utils (#1181) · 365fdc18

Masaki Kozuki authored Oct 02, 2021


Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

15 Apr, 2021 1 commit

Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac

Sudhakar Singh authored Apr 15, 2021

* Add unit tests for fused-novograd

* Fix: tensors should reside on the same device

* Fix: Cudastream should be called on the same device on which the tensors reside. Found this while debugging the fused novograd multi-device unit test

* fixed issues mentioned in the comments

59d2f7ac

25 Jan, 2021 1 commit

fix bugs in syncbn (#46) · 3f49dbf0

Jeff Daily authored Jan 25, 2021

- incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure

3f49dbf0

21 Jan, 2021 1 commit
- use __launch_bounds__ for multi_tensor_apply (#44) · 5baa68d3
  Jeff Daily authored Jan 21, 2021
```
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
```
  5baa68d3
18 Jan, 2021 1 commit
- skip failing tests on ROCm · 13c8d152
  Jeff Daily authored Jan 18, 2021
  
  13c8d152
15 Jan, 2021 1 commit
- Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm · ff232fb8
  Sarunya Pumma authored Nov 28, 2020
  
  ff232fb8
31 Dec, 2020 2 commits
- missing import statement · 41bbf93c
  lcskrishna authored Dec 31, 2020
  
  41bbf93c
- skip the unit tests · 5bae299e
  lcskrishna authored Dec 31, 2020
  
  5bae299e
01 Dec, 2020 1 commit

DistributedFusedAdam Model Parallelism Support (Megatron) (#981) · 6b7e77b0

Kexin Yu authored Dec 01, 2020



DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>

6b7e77b0

04 Nov, 2020 1 commit

Fix LayerNorm op on ROCm (#36) · 7eed38aa

Ashish Farmer authored Nov 04, 2020

* fix warp size in WARP_SHFL* in layernorm

* enable fused_layer_norm tests on ROCm

7eed38aa

05 Aug, 2020 2 commits

Enable mlp_cuda extension. (#28) · d2f6d04a

Chaitanya Sri Krishna Lolla authored Aug 05, 2020

* enable mlp cuda

* add setup changes and tests

* skip the unit tests

* updated conditions for empty array

* removed hip platform conditions

d2f6d04a

set device guard for multi tensor optimizer implementations (#927) · 274cc063

ngimel authored Aug 05, 2020

* add device guards to the optimizers

* add untracked file

* set deviceGuard in multi_tensor_apply

* address review comments; fix lamb

* indent

* typo

274cc063

31 Jul, 2020 1 commit
- skipping bfloat16 mgpu tests (#32) · 8dd19e3b
  Chaitanya Sri Krishna Lolla authored Jul 31, 2020
  
  8dd19e3b
10 Jul, 2020 1 commit

Enable sync batchnorm extension. (#27) · 9c80f6d3

Chaitanya Sri Krishna Lolla authored Jul 10, 2020

* Enable sync batchnorm

* enable syncbn properly

* update the unit tests

* update tests

* update conditions for welford_merge_element

* updated conditions based on comments.

9c80f6d3

07 Jul, 2020 1 commit
- skip newer tests · eba809d7
  lcskrishna authored Jul 07, 2020
  
  eba809d7
06 Jul, 2020 1 commit

[sync BN] (#792) · 1ff54b8f

jjsjann123 authored Jul 06, 2020

* [sync BN]

support non-uniform batch size across process group.

TODO: test should be added once cleaned up.

* updating unit tests

* new unit tests for different inputs

* cleaning

1ff54b8f

23 Jun, 2020 3 commits
- add test case for non-zero weight decay · ad50ce9a
  Kexin Yu authored Jun 23, 2020
  
  ad50ce9a
- test nvlamb; hyperparams consistent with adam/adagrad tests · cd3d6d12
  Kexin Yu authored Jun 23, 2020
  
  cd3d6d12
- add test for FusedLAMB · 9774ce0d
  Kexin Yu authored Jun 22, 2020
  
  9774ce0d
03 Jun, 2020 1 commit

bfloat16 support for mgpu (#19) · b0c7d09f

rohithkrn authored Jun 03, 2020

* bfloat16 support for apex DDP

* enable mgpu tests for fp16 and bf16

* update Dockerfile

b0c7d09f

26 May, 2020 1 commit
- enable bfloat16 for optimizers · 85549903
  rohithkrn authored May 26, 2020
  
  85549903