Commits · 4e7a2a8e44577c96dc1f0446c78520244c49738d · OpenDAS / apex

28 Nov, 2023 1 commit
- fix up for torch2.1 · 4e7a2a8e
  flyingdown authored Nov 28, 2023
  
  4e7a2a8e
08 May, 2023 1 commit
- add README_HIP · 2c6c0f28
  flyingdown authored May 08, 2023
```
fix test for torch 1.10.0
```
  2c6c0f28
23 Apr, 2023 3 commits

Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 06053e19

aspanday authored Jan 24, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py test passes except for bfloat16 for Adam. There seems to be a bug in this test that needs to be resolved.
For now skipping test_bfloat16 for Adam in the unittest.
Ran 17 other tests and ALL other tests pass!
More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization

This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
The L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>

06053e19

Unskip some unit tests related to issue #82 (#98) · 2951440a

Hubert Lu authored Dec 06, 2022

* Unskip some unit tests related to issue #82

* Ensure test_state_dict uses capturable=True for torch.optim.Adam

* Fix TestFusedAdam tests in test_fused_optimizer.py

2951440a

Consider both contiguous and channels_last tensors for FusedSGD (#97) · 9a13347c

Hubert Lu authored Dec 06, 2022

* Consider both contiguous and channels_last tensors for FusedSGD

* Consider all the memory formats in fused_sgd

* Add a unit test script for nhwc fused_sgd

9a13347c

25 Jan, 2023 1 commit

Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27

aspanday authored Jan 24, 2023

* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py test passes except for bfloat16 for Adam. There seems to be a bug in this test that needs to be resolved.
For now skipping test_bfloat16 for Adam in the unittest.
Ran 17 other tests and ALL other tests pass!
More details on the effects of these changes can be found here: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization

This commit sets BLOCK_SIZE=1024 only for the optimizer kernels.
The L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise the allclose check fails.

* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipifRocm to skip test_bfloat16 in Adam.
Co-authored-by: aspanday <aspanday@amd.com>

14db5c27

06 Dec, 2022 2 commits

Unskip some unit tests related to issue #82 (#98) · 4dcf30a6

Hubert Lu authored Dec 06, 2022

* Unskip some unit tests related to issue #82

* Ensure test_state_dict uses capturable=True for torch.optim.Adam

* Fix TestFusedAdam tests in test_fused_optimizer.py

4dcf30a6

Consider both contiguous and channels_last tensors for FusedSGD (#97) · 9ebc53e5

Hubert Lu authored Dec 06, 2022

* Consider both contiguous and channels_last tensors for FusedSGD

* Consider all the memory formats in fused_sgd

* Add a unit test script for nhwc fused_sgd

9ebc53e5

26 Aug, 2022 1 commit

cached cast fix (#90) · a27b4e43

Hubert Lu authored Aug 26, 2022



* Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast

* Unskip the unit tests related to len(cached_x.grad_fn.next_functions) == 1
Co-authored-by: David Fan <jiafa@microsoft.com>

a27b4e43

10 Aug, 2022 1 commit
- Skip a failing test introduced by an upstream PyTorch regression · cc5f83b5
  hubertlu-tw authored Aug 10, 2022
  
  cc5f83b5
09 Aug, 2022 5 commits
- Remove some comments in run_test.py · cebbb04f
  hubertlu-tw authored Aug 09, 2022
  
  cebbb04f
- Remove run_pyprof_data and run_pyprof_nvtx unit tests · 4d567459
  hubertlu-tw authored Aug 09, 2022
  
  4d567459
- Update L0 unit test script · ced59fcc
  hubertlu-tw authored Aug 09, 2022
  
  ced59fcc
- Skip a flaky unit test · 8a8eb34f
  hubertlu-tw authored Aug 09, 2022
  
  8a8eb34f
- Skip some flaky unit tests · 975a0e53
  hubertlu-tw authored Aug 09, 2022
  
  975a0e53
08 Aug, 2022 3 commits
- Un-skip some tests and skip some flaky tests · 1b7b02ef
  hubertlu-tw authored Aug 08, 2022
  
  1b7b02ef
- Add a wrapper to skip flaky unit tests. · 4cfbe05c
  hubertlu-tw authored Aug 08, 2022
  
  4cfbe05c
- Skip the failing unit tests from the FusedRMSNorm PR (#85) · 87fc4125
  Hubert Lu authored Aug 08, 2022
```
* Skip the failing unit tests from the FusedRMSNorm PR

* Update test_lamb.py
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
```
  87fc4125
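
The "wrapper to skip flaky unit tests" mentioned in 4cfbe05c is not shown on this page. The following is only a minimal sketch of how such a wrapper could look using the standard `unittest` module. The names `skip_flaky_test` and `SKIP_FLAKY_TESTS` are hypothetical and are not taken from the repository.

```python
import functools
import os
import unittest


def flaky_tests_skipped():
    # SKIP_FLAKY_TESTS is a hypothetical environment variable for this sketch.
    return os.getenv("SKIP_FLAKY_TESTS", "0") == "1"


def skip_flaky_test(test_fn):
    """Decorator (hypothetical name): skip the test when SKIP_FLAKY_TESTS=1."""

    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        # The check runs at call time, so the variable can be set per test run.
        if flaky_tests_skipped():
            raise unittest.SkipTest("flaky test skipped")
        return test_fn(*args, **kwargs)

    return wrapper
```

Decorating a test method with `@skip_flaky_test` reports it as skipped rather than failed when the variable is set, and runs it unchanged otherwise.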
05 Aug, 2022 1 commit

Enable FusedRMSNorm (#78) · c97ebfab

Hubert Lu authored Aug 05, 2022



* FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)

* FusedRMSNorm based on FusedLayerNorm

* refactor duplicated kernels

* delete comments

* delete comments

* cleanup

* cleanup

* cleanup, fixed clobbering forward_affine_mixed_dtypes

* fix pybind naming and add MixedFused test

* undo skipping

* check elementwise_affine

* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

* fix and generate docs for FusedRMSNorm (#1285)

* [FusedRMSNorm doc] document where epsilon is added (#1295)

* [FusedRMSNorm doc] add epsilon to formula

* correct

* better wording

* Fix some bugs

* Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs

* Fix NaN issues in FusedRMSNorm

* Update test_fused_layer_norm.py

* Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm

* Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
Co-authored-by: eqy <eddiey@nvidia.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

c97ebfab
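
Several items in c97ebfab concern where epsilon enters the RMSNorm formula. As a rough illustration only, the sketch below assumes epsilon is added to the mean square before the square root is taken. It is a plain-Python reference for a single vector, not the fused kernel, and the function name `rms_norm` is hypothetical.

```python
import math


def rms_norm(x, weight=None, eps=1e-5):
    """Reference RMSNorm over one vector: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    Assumption for this sketch: eps is added to the mean square, inside the root.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    if weight is None:
        # No affine weight: behaves like elementwise_affine=False.
        weight = [1.0] * len(x)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With epsilon inside the root, an all-zero input yields zeros instead of a division by zero, which is one common reason for placing it there.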

29 Jul, 2022 2 commits
- Unskip run_transformer unit tests · bbf2c8d0
  hubertlu-tw authored Jul 29, 2022
  
  bbf2c8d0
- Update test_fused_layer_norm.py · 0df6c4c3
  hubertlu-tw authored Jul 29, 2022
  
  0df6c4c3
26 Jul, 2022 1 commit

Fix bug when initializing model-parallel process groups for GPT-3 (#1435) · fb21698e

Tim Moon authored Jul 26, 2022

* Hack to enable training GPT-3

Seems to fix bug from #1416

* Add test to initialize model-parallelism for decoder-only Transformers

Namely GPT-3.

fb21698e

25 Jul, 2022 1 commit
- [transformer] update tests (#1428) · e57d9e79
  Aidyn-A authored Jul 25, 2022
  
  e57d9e79
20 Jul, 2022 1 commit

[transformer] UCC async test (#1417) · a29a698f

Aidyn-A authored Jul 20, 2022

* add test

* update batch sizes

* update batch sizes

* small updates

* delete comment

* add async comm

* add sync if needed

* update tests

* remove redundant imports

* code cleanup

* minor updates

* update dtype for comparison

* fix dtypes

* fix typo

* modify sizes and use common_utils.find_free_port

* fix typo and use double precision

* revert some changes, create test for profiling on L1

* remove redundant line

* revert UCC_TLS and add sync to fwd_bwd

* code clean up

* code clean up

* modify BERT test

* add comment

a29a698f

14 Jul, 2022 1 commit

Time dimension shape check for fused scale mask softmax kernel (#1421) · 1337e81e

Sandeep Subramanian authored Jul 13, 2022



* Time dimension shape check for fused scale mask softmax kernel
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add shape test
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix mask shape
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

1337e81e

11 Jul, 2022 1 commit

update: mpu for t5 rpe (#1416) · 5ff5a884

Perkz Zheng authored Jul 12, 2022



* update: mpu for t5 rpe

* update: add rpe mpu group test

* fix semicolon bugs
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

* fix semicolon bugs
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

5ff5a884

07 Jul, 2022 1 commit
- Remove `pyprof` and `reparameterization` (#1404) · 8a7a3325
  Masaki Kozuki authored Jul 06, 2022
```
* remove pyprof

* remove reparameterization

* remove pyprof test

* clean up
```
  8a7a3325
23 Jun, 2022 2 commits

[transformer] Port Sequence Parallelism (takeover of #1396) (#1400) · 3ff1a10f

Masaki Kozuki authored Jun 23, 2022

* it looks possible to remove this file

* add communication collectives

* update Column|RowParallelLinear

* update checkpoint function

* update function name

* parity between public and private collectives

* row parallel linear

* column parallel linear

* sequence parallel: p2p comm

fix typo

* sequence parallel: pipeline parallel

* fix typo

* add layernorm with sequence_parallel_enabled attr

* class variable -> member variable

* fix col parallel test with sequence parallel

* Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`

* add cases pretending to test sequence_parallel

* Apply 2 suggestion(s) to 1 file(s)

* update sequence_parallel_enabled docstring

* update docstring: order of tensor dimensions, sequence_parallel_enabled behavior

* Divide sequence_length if sequence parallel

tensor shape should be updated if sequence parallel is enabled.

* cherry-pick https://github.com/NVIDIA/Megatron-LM/commit/8474e6e54fcb9dfa37aea039352f9fb485fb6f61

* type annotation

* Fix matmul call in RowParallelLinear

Fix `sequence_parallel_enabled` to `False` as you can see in
https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514

* update rowparallellinear test

* fix `loss_weight` is not defined in test_layers

* @eqy's comment

* mixed fused layer norm

* fix typo

* misc

* test_layers cleanup

* Skip Bert/GPT script

Since these two models haven't been updated for sequence parallel yet, e.g. the change of dimension order from (batch, sequence, feature) to (sequence, batch, feature) and the global variables of arguments

* debug part 1/N: comment out `x.retain_grad`

* debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled

* debug 3/N: add pipeline test with parallel mlp

* Fix handling `self.input_tensor` and argument

* tp2pp4 ModelType.encoder_or_decoder is failing, which may be my fault: the backward complains that the output and grad_output shapes don't match

* revert debug 1/N

* defer tensor model parallel size > 1

* split tensor in sequence dim

* cosmetic

* cosmetic: remove archaic comment

* enable TP>1 for encoder_and_decoder as well

* set requires_grad=True always...

* Set `scatter_gather_tensors_in_pipeline` to :obj:`False`

so that NeMo Megatron's GPT works with sequence parallel enabled.

* brush up comment of `requires_grad()`

There's a possibility that PyTorch DistributedDataParallel hangs
when some tensor (or parameter) doesn't require grad according to @ptrblck.
This forced `requires_grad` in my understanding is different from that.

* misc changes of scatter_gather_tensors_in_pipeline comment

* guard for torch_ucc

* cosmetic changes related to tests

* update command line arguments

* update TransformerLanguageModel

* rename

* move gpt to gpt.py

* update bert

* add all_gather for params in sequence parallel region

* misc. some diffs were lost during rebasing...

* updates for non sequence parallel execution

* gpt with sequence parallel

* Apply 2 suggestion(s) to 2 file(s)

* update tensor&pipeline parallel size

* why is `sequence_parallel_enabled` not supplied!? Did I mess up when rebasing?

* cosmetic fix

* correct key is sequence_parallel_enabled

3ff1a10f
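
The 3ff1a10f message notes that the sequence length is divided when sequence parallelism is enabled, and that the dimension order is (sequence, batch, feature). A minimal sketch of that shape rule follows. It assumes the split is across the tensor-model-parallel ranks, and the function and parameter names are hypothetical rather than the repository's API.

```python
def pipeline_tensor_shape(seq_length, micro_batch_size, hidden_size,
                          tensor_parallel_size=1,
                          sequence_parallel_enabled=False):
    """Return the (sequence, batch, feature) shape exchanged between stages.

    Sketch only: assumes the sequence dimension is split evenly across
    tensor-model-parallel ranks when sequence parallelism is enabled.
    """
    if sequence_parallel_enabled:
        if seq_length % tensor_parallel_size != 0:
            raise ValueError(
                "seq_length must be divisible by tensor_parallel_size")
        seq_length //= tensor_parallel_size
    return (seq_length, micro_batch_size, hidden_size)
```

For example, with a sequence length of 2048 and 8 tensor-parallel ranks, each rank would handle a 256-token slice of the sequence dimension under this assumption.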

Move distributed Adam unit test to contrib dir (#1406) · 57f890a7

Tim Moon authored Jun 22, 2022

* Increase default bucket size in distributed Adam

* Move distributed Adam unit test to contrib tests

Integrate into unit testing framework

* Tweak hyperparameters for dist Adam optimizer test

Improves numerical stability so we can keep tight tolerances. Adopting suggestions from @crcrpar.

* Use distributed test infrastructure in distributed Adam unit test

Suggestion from @crcrpar.

57f890a7

22 Jun, 2022 1 commit

Temporary Solution to Let `FusedAdam` support BFloat16 (#1407) · 81f8ba79

Masaki Kozuki authored Jun 22, 2022

* add temporary dispatch of double, float, half, bfloat16

* fusedadam of bfloat16

* Add bfloat16 path to FusedAdam

81f8ba79

14 Jun, 2022 2 commits
- Update documentation to reflect DistributedFusedAdam uses AdamW · 846f7f8a
  Tim Moon authored Jun 14, 2022
```
Adjust test options to have tighter tolerances.
```
  846f7f8a
- Update dist Adam test to use updated API · e2af089c
  Tim Moon authored Jun 13, 2022
  
  e2af089c
31 May, 2022 1 commit

Do pipeline parallelism tests in double because TF32 environment variables can... · 265b451d

eqy authored May 31, 2022

Do pipeline parallelism tests in double because TF32 environment variables can be painful to manage across test suites (#1391)

* check in

* skip interleaved with 2 GPU

* change type annotation

* address comments thanks @crcrpar @Aidyn-A

265b451d

20 May, 2022 1 commit

Add grad check in test pipeline parallel fwd bwd (#1386) · ab5fc48f

Aidyn-A authored May 20, 2022

* add grad check

* change assert

* minor changes

* revert unnecessary changes

* suggested changes

* fix tensor comparison

* small changes

ab5fc48f

19 May, 2022 2 commits

Test `len(model) > 1` in `test_pipelining_with_interleaving` (#1384) · da1f7f2f

eqy authored May 18, 2022



* check in

* type

* cleanup

* cleanup

* fix function call

* Apply suggestions from code review
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

da1f7f2f

[Pipeline-Parallelism][TF32] Disable TF32 for Pipeline-Parallel numerical checks (#1382) · 891d57d3
eqy authored May 18, 2022
```
* check in

* fancy context style
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
```
891d57d3

18 May, 2022 1 commit

[transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1

Masaki Kozuki authored May 18, 2022



* NcclDistributedTestBase

* fix stupid mistake

* add UCC test

* add UCC backend

* torch ucc tests

* allows for UCC backend

* Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense

* Apply 4 suggestion(s) to 1 file(s)

* mix&match NCCL & UCC

* use both ucc&nccl in gpt

* UCC for Pipeline Parallel, NCCL for the others

* conditionally use ucc

* make ucc guards more friendly

* test raises when torch_ucc isn't available

* Change to member variable from class variable
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>

* pass async_comm to train, I mistakenly dropped it during the rebase

* fix typo: functionality

* Enable tensor parallel only when device count > 4

I want pipeline model parallel world size to be >= 4 because
previously I saw GPT/BERT failing when only UCC is used.
So I'm speculating that there's some gotcha around pipeline size of 4.

* Add nvidia driver version guard
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>

* move world_size as it was not correctly reflected

* keep an eye on the nvml api thing

* import unittest
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>

3490b9e1

12 May, 2022 1 commit

Async pipeline parallel (#1373) · 3fe35211

eqy authored May 12, 2022

* initial check in

* fix

* fix test

* address some review comments and cleanup

* fix

* bookmark

* fix sync placement to come before gather

* similar fix for non-gather case

* add async bert

* update gpt minimal test

* allow selection of default pp test

* fix bert test

* cleanup

* cleanup

3fe35211

11 May, 2022 1 commit

[transformer] add loss comparison to test_pipeline_parallel_fwd_bwd (#1374) · 68440264

Aidyn-A authored May 11, 2022

* add loss comparison to test_pipeline_parallel_fwd_bwd

* applied some suggested changes

* update test_pipeline_parallel_fwd_bwd.py

* update test_pipeline_parallel_fwd_bwd.py 2

* minor update

* update test_pipeline_parallel_fwd_bwd.py 3

68440264

29 Apr, 2022 1 commit
- [transformer][pipeline parallel] fix typo in test (#1370) · c3018b13
  eqy authored Apr 29, 2022
```
* fix typo

* Update test_pipeline_parallel_fwd_bwd.py
```
  c3018b13