1. 25 Jul, 2022 1 commit
  2. 20 Jul, 2022 1 commit
    • [transformer] UCC async test (#1417) · a29a698f
      Aidyn-A authored
      * add test
      
      * update batch sizes
      
      * update batch sizes
      
      * small updates
      
      * delete comment
      
      * add async comm
      
      * add sync if needed
      
      * update tests
      
      * remove redundant imports
      
      * code cleanup
      
      * minor updates
      
      * update dtype for comparison
      
      * fix dtypes
      
      * fix typo
      
      * modify sizes and use common_utils.find_free_port
      
      * fix typo and use double precision
      
      * revert some changes, create test for profiling on L1
      
      * remove redundant line
      
      * revert UCC_TLS and add sync to fwd_bwd
      
      * code clean up
      
      * code clean up
      
      * modify BERT test
      
      * add comment
      a29a698f
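      The `common_utils.find_free_port` bullet above refers to PyTorch's test utility for picking a rendezvous port. A minimal sketch of that pattern, with the surrounding process-group setup assumed rather than taken from apex's actual test harness:

      ```python
      import os

      from torch.testing._internal.common_utils import find_free_port

      # Pick a free TCP port for the rendezvous before spawning workers.
      os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
      os.environ.setdefault("MASTER_PORT", str(find_free_port()))

      # Each spawned worker would then initialize its process group, e.g.
      # torch.distributed.init_process_group("ucc", rank=rank, world_size=world_size)
      ```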
  3. 14 Jul, 2022 1 commit
  4. 11 Jul, 2022 1 commit
  5. 07 Jul, 2022 1 commit
  6. 23 Jun, 2022 2 commits
    • [transformer] Port Sequence Parallelism (takeover of #1396) (#1400) · 3ff1a10f
      Masaki Kozuki authored
      * it looks possible to remove this file
      
      * add communication collectives
      
      * update Column|RowParallelLinear
      
      * update checkpoint function
      
      * update function name
      
      * parity between public and private collectives
      
      * row parallel linear
      
      * column parallel linear
      
      * sequence parallel: p2p comm
      
      fix typo
      
      * sequence parallel: pipeline parallel
      
      * fix typo
      
      * add layernorm with sequence_parallel_enabled attr
      
      * class variable -> member variable
      
      * fix col parallel test with sequence parallel
      
      * Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`
      
      * add cases pretending to test sequence_parallel
      
      * Apply 2 suggestion(s) to 1 file(s)
      
      * update sequence_parallel_enabled docstring
      
      * update docstring: order of tensor dimensions, sequence_parallel_enabled behavior
      
      * Divide sequence_length if sequence parallel is enabled

      The tensor shape should be updated when sequence parallelism is enabled (see the sketch after this entry).
      
      * cherry-pick https://github.com/NVIDIA/Megatron-LM/commit/8474e6e54fcb9dfa37aea039352f9fb485fb6f61
      
      * type annotation
      
      * Fix matmul call in RowParallelLinear
      
      Fix `sequence_parallel_enabled` to `False` as you can see in
      https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514
      
      * update rowparallellinear test
      
      * fix: `loss_weight` is not defined in test_layers
      
      * @eqy's comment
      
      * mixed fused layer norm
      
      * fix typo
      
      * misc
      
      * test_layers cleanup
      
      * Skip Bert/GPT script
      
      Since these two models haven't been updated for sequence parallelism yet, e.g. the change of dimension order from (batch, sequence, feature) to (sequence, batch, feature) and the global argument variables
      
      * debug part 1/N: comment out `x.retain_grad`
      
      * debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled
      
      * debug 3/N: add pipeline test with parallel mlp
      
      * Fix handling `self.input_tensor` and argument
      
      * tp2pp4 ModelType.encoder_or_decoder is failing, which may be my fault: the backward pass complains that the output and grad_output shapes don't match
      
      * revert debug 1/N
      
      * defer tensor model parallel size > 1
      
      * split tensor in sequence dim
      
      * cosmetic
      
      * cosmetic: remove archaic comment
      
      * enable TP>1 for encoder_and_decoder as well
      
      * set requires_grad=True always...
      
      * Set `scatter_gather_tensors_in_pipeline` to :obj:`False`
      
      so that NeMo Megatron's GPT works with sequence parallelism enabled.
      
      * brush up comment of `requires_grad()`
      
      According to @ptrblck, PyTorch DistributedDataParallel can hang
      when some tensor (or parameter) doesn't require grad.
      As I understand it, this forced `requires_grad` is a different matter.
      
      * misc changes of scatter_gather_tensors_in_pipeline comment
      
      * guard for torch_ucc
      
      * cosmetic changes related to tests
      
      * update command line arguments
      
      * update TransformerLanguageModel
      
      * rename
      
      * move gpt to gpt.py
      
      * update bert
      
      * add all_gather for params in sequence parallel region
      
      * misc. some diffs were lost during rebasing...
      
      * updates for non sequence parallel execution
      
      * gpt with sequence parallel
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * update tensor&pipeline parallel size
      
      * why is `sequence_parallel_enabled` not supplied!? Did I mess up when rebasing?
      
      * cosmetic fix
      
      * correct key is sequence_parallel_enabled
      3ff1a10f
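      The "Divide sequence_length if sequence parallel" and "split tensor in sequence dim" bullets above amount to sharding activations of shape (sequence, batch, hidden) along the sequence dimension across the tensor-parallel ranks. A minimal sketch of the idea, not apex's actual helper:

      ```python
      import torch
      import torch.distributed as dist

      def split_along_sequence_dim(x: torch.Tensor, tp_group: dist.ProcessGroup) -> torch.Tensor:
          """Keep only this rank's chunk of the sequence dimension (dim 0)."""
          world_size = dist.get_world_size(group=tp_group)
          rank = dist.get_rank(group=tp_group)
          assert x.size(0) % world_size == 0, "sequence length must be divisible by the TP world size"
          chunk = x.size(0) // world_size
          return x[rank * chunk:(rank + 1) * chunk].contiguous()
      ```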
    • Move distributed Adam unit test to contrib dir (#1406) · 57f890a7
      Tim Moon authored
      * Increase default bucket size in distributed Adam
      
      * Move distributed Adam unit test to contrib tests
      
      Integrate into unit testing framework
      
      * Tweak hyperparameters for dist Adam optimizer test
      
      Improves numerical stability so we can keep tight tolerances. Adopting suggestions from @crcrpar.
      
      * Use distributed test infrastructure in distributed Adam unit test
      
      Suggestion from @crcrpar.
      57f890a7
  7. 22 Jun, 2022 1 commit
  8. 14 Jun, 2022 2 commits
  9. 31 May, 2022 1 commit
  10. 20 May, 2022 1 commit
  11. 19 May, 2022 2 commits
  12. 18 May, 2022 1 commit
    • [transformer] Allow for different backend for Pipeline Parallel ProcessGroups (#1380) · 3490b9e1
      Masaki Kozuki authored
      * NcclDistributedTestBase
      
      * fix stupid mistake
      
      * add UCC test
      
      * add UCC backend
      
      * torch ucc tests
      
      * allows for UCC backend
      
      * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense
      
      * Apply 4 suggestion(s) to 1 file(s)
      
      * mix&match NCCL & UCC
      
      * use both ucc&nccl in gpt
      
      * UCC for Pipeline Parallel, NCCL for the others
      
      * conditionally use ucc
      
      * make ucc guards more friendly
      
      * test raises when torch_ucc isn't available
      
      * Change to member variable from class variable
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * pass async_comm to train, I mistakenly dropped it during the rebase
      
      * fix typo: functionality
      
      * Enable tensor parallel only when device count > 4
      
      I want pipeline model parallel world size to be >= 4 because
      previously I saw GPT/BERT failing when only UCC is used.
      So I'm speculating that there's some gotcha around pipeline size of 4.
      
      * Add nvidia driver version guard
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      
      * move world_size as it was not correctly reflected
      
      * keep an eye on the nvml api thing
      
      * import unittest
      Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
      3490b9e1
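      A hedged sketch of the backend mix described above: NCCL stays the default backend while the pipeline-parallel group is created with UCC. The group construction is illustrative only and assumes a UCC-enabled PyTorch (or the torch_ucc plugin), not apex's `parallel_state` bookkeeping:

      ```python
      import os

      import torch.distributed as dist

      os.environ.setdefault("UCX_TLS", "tcp,cuda_copy")    # UCX transports, per the commit

      dist.init_process_group(backend="nccl")              # assumes launch via torchrun
      pipeline_ranks = list(range(dist.get_world_size()))  # illustrative: one pipeline over all ranks
      pp_group = dist.new_group(ranks=pipeline_ranks, backend="ucc")
      ```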
  13. 12 May, 2022 1 commit
    • Async pipeline parallel (#1373) · 3fe35211
      eqy authored
      * initial check in
      
      * fix
      
      * fix test
      
      * address some review comments and cleanup
      
      * fix
      
      * bookmark
      
      * fix sync placement to come before gather
      
      * similar fix for non-gather case
      
      * add async bert
      
      * update gpt minimal test
      
      * allow selection of default pp test
      
      * fix bert test
      
      * cleanup
      
      * cleanup
      3fe35211
  14. 11 May, 2022 1 commit
  15. 29 Apr, 2022 1 commit
  16. 07 Apr, 2022 1 commit
  17. 25 Mar, 2022 1 commit
    • [transformer] Format & Test Refactoring (#1325) · a0ed4151
      Masaki Kozuki authored
      * try PyTorch custom TestCase class
      
      * revert
      
      * initial working example
      
      * update
      
      * data utils
      
      * fix imports
      
      * hardcode backend to nccl
      
      * fix signature
      
      * fix typo
      
      * mapping
      
      * set device
      
      * init
      
      * refactor x entropy
      
      * remove unused import & destroy model parallel
      
      * refactor random
      
      * fix test
      
      * remove migrated tests
      
      * refactor
      
      * init
      
      * separate affine weight init
      
      * init model parallel
      
      * split more
      
      * weight init fix part 1
      
      * use cpu init for consistency between native and tensor parallel
      
      * black
      
      * add col parallel
      
      * use a 3D tensor of square matrix for column parallel linear
      
      * skip the failing cases
      
      * migrate layers test
      
      * pipeline parallel forward/backward
      
      * fix typo
      
      * fix typo
      
      * fix
      
      * fix pipeline world size
      
      * black
      
      * rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py
      
      * stop logging
      
      * set log level
      
      * black
      
      * license and format
      
      * fix
      
      * skip tf32 as matrices are small
      
      * remove potentially inappropriate license
      
      * Apply suggestions from code review
      
      * remove `TODO` comment
      
      * `torch.testing.assert_allclose` -> `torch.testing.assert_close`
      
      * remove comment-outs
      
      * remove unused import
      
      * minor fix
      a0ed4151
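      Two of the test-side tweaks above ("skip tf32 as matrices are small" and the move to `torch.testing.assert_close`) can be sketched roughly as follows; the tensors are placeholders, not the refactored tests themselves:

      ```python
      import torch

      # Disable TF32 so small matmuls can be compared with tight tolerances.
      torch.backends.cuda.matmul.allow_tf32 = False
      torch.backends.cudnn.allow_tf32 = False

      a = torch.randn(8, 8)
      b = a.clone()
      torch.testing.assert_close(a @ a.t(), b @ b.t())
      ```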
  18. 26 Feb, 2022 1 commit
  19. 25 Feb, 2022 1 commit
  20. 23 Feb, 2022 1 commit
  21. 04 Feb, 2022 1 commit
  22. 31 Jan, 2022 1 commit
  23. 28 Jan, 2022 2 commits
    • small changes in test and logger format (#1278) · b1c75f6f
      Masaki Kozuki authored
      * cosmetic refactor in test
      
      * log with PID
      
      * log more info: rank, pid, filename, lineNo
      b1c75f6f
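      A rough sketch of a logger format carrying the extra fields mentioned above (rank, PID, filename, line number); the exact format string is an assumption, not apex's:

      ```python
      import logging
      import os

      rank = int(os.getenv("RANK", "0"))
      logging.basicConfig(
          level=logging.INFO,
          format=f"[rank {rank} | pid %(process)d] %(filename)s:%(lineno)d %(levelname)s: %(message)s",
      )
      logging.getLogger(__name__).info("logger format with rank, pid, filename, and line number")
      ```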
    • allow for `None` batch (#1280) · a960fe8c
      Masaki Kozuki authored
      * have get_kth_microbatch deal with None batch
      
      * broadcast based on tensor parallel rank
      
      * dtype
      
      * remove unnecessary .cuda()
      
      Processes with tensor parallel rank != 0 don't need to prepare `torch.utils.data.DataLoader` instances, which means the `batch` argument of the `get_kth_microbatch` function can be `None`, but the current implementation doesn't allow for that (see the sketch after this entry).
      a960fe8c
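      A minimal sketch of the behavior this commit enables; the slicing and the explicit `micro_batch_size` parameter are illustrative, not apex's actual `get_kth_microbatch` signature:

      ```python
      from typing import List, Optional

      import torch

      def get_kth_microbatch(batch: Optional[List[torch.Tensor]], k: int,
                             micro_batch_size: int) -> Optional[List[torch.Tensor]]:
          """Return the k-th microbatch, or None when this rank received no batch."""
          if batch is None:
              return None
          start = k * micro_batch_size
          return [t[start:start + micro_batch_size] for t in batch]
      ```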
  24. 21 Jan, 2022 1 commit
  25. 17 Dec, 2021 1 commit
    • Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249) · b88c507e
      Masaki Kozuki authored
      
      * let users specify the dtype for p2p comm, taking the possibility of O2-style AMP into account (see the sketch after this entry)
      
      * add `dtype` argument to forward_backward functions
      
      * fix
      
      * better message
      
      * add docstring of dtype
      
      * add a link to dtype logic of p2p comm
      b88c507e
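      The motivation above is that with O2-style AMP the activations crossing pipeline stages are half precision even though master weights stay fp32, so the caller should be able to state the communication dtype explicitly. A small sketch of how that choice might be made (the helper name and flags are assumptions):

      ```python
      import torch

      def infer_p2p_dtype(params_dtype: torch.dtype, fp16: bool, bf16: bool) -> torch.dtype:
          """Pick the dtype used for pipeline p2p communication."""
          if fp16:
              return torch.half
          if bf16:
              return torch.bfloat16
          return params_dtype

      # The result would be passed as the `dtype` argument of the forward_backward function.
      ```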
  26. 16 Dec, 2021 1 commit
  27. 14 Dec, 2021 1 commit
  28. 10 Dec, 2021 2 commits
    • Cherry-pick Megatron-LM's changes in pipeline model parallel for T5 (#1232) · 0e25fcc4
      Masaki Kozuki authored
      * update parallel_state
      
      * update pipeline common funcs - forward_step and backward_step
      
      * update pipelining w/o interleaving
      
      * type hint
      
      * merge utils into without_interleaving
      
      Motivation: functions in utils are only used by
      forward_backward_pipelining_without_interleaving
      
      * fix handling of `model_type`
      
      * fix import of DDP
      
      * update set_input_tensor method
      
      * fix
      
      * cosmetic
      
      * update model
      
      * refactor pipeline test scripts
      0e25fcc4
    • Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222) · ab7af058
      Rishi Puri authored
      
      * minimal bert pipeline parallel test
      
      * first draft of gpt minimal test
      
      * framework to scale up the gpt2 test for variety of distributed setups
      
      * adding gpt_minimal_test to list of multigpu tests
      Co-authored-by: Eddie Yan <eddiey@nvidia.com>
      Co-authored-by: riship <riship@nvidia.com>
      ab7af058
  29. 09 Dec, 2021 1 commit
  30. 19 Nov, 2021 2 commits
    • minimal bert pipeline parallel test (#1216) · aa756cec
      eqy authored
      * minimal bert pipeline parallel test
      
      * fix global and cleanup
      
      * use get_forward_backward_func
      
      * cleanup and fix some tests
      aa756cec
    • [POC] Support Megatron-LM's `rampup_batch_size` argument (#1212) · 35336133
      Masaki Kozuki authored
      * init logging use
      
      * fix
      
      * clean up
      
      * fp32 p2p comm
      
      * init
      
      * Dynamic global batch size with `MegatronPretrainingSampler` (see the sketch after this entry)

      I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have requirements on
      global batch size, total number of samples, local minibatch size, etc. that I'm not yet familiar with.
      
      * revive original pipeline parallel test
      
      * update MULTIGPU_TEST: add dynamic batchsize test
      
      * run MegatronPretrainingRandomSampler
      
      * fix comment
      
      * fix
      
      * update
      
      * cosmetic
      
      * add note
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * change following https://github.com/NVIDIA/apex/pull/1210
      
      * fix
      35336133
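      A rough sketch of the ramp-up rule behind `rampup_batch_size` as described above: grow the global batch size from a start value toward the target in fixed increments, spreading the ramp over a given number of consumed samples. The function is illustrative; the real logic lives in the num-microbatches calculator and enforces extra divisibility constraints.

      ```python
      def global_batch_size_at(consumed_samples: int, start: int, increment: int,
                               ramp_samples: int, target: int) -> int:
          """Global batch size after `consumed_samples` samples of a linear ramp-up."""
          steps = (target - start) // increment
          if steps <= 0:
              return target
          samples_per_step = max(ramp_samples // steps, 1)
          current_step = min(consumed_samples // samples_per_step, steps)
          return start + current_step * increment
      ```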
  31. 10 Nov, 2021 1 commit
  32. 27 Oct, 2021 1 commit
    • Pipeline Model Parallel (#1202) · 63d5dd63
      Masaki Kozuki authored
      * Init apex.ppu (pipeline model parallel utility)
      
      Reference commit:
      
      ```
      commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
      Merge: 14f2c684 7b293d9b
      Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
      Date:   Wed Sep 22 22:57:54 2021 -0700
      
          Merge branch 'add_BOS' into 'main'
      
          Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
      
          See merge request ADLR/megatron-lm!328
      ```
      
      * removing get_args and replace import - phase 1
      
      * removing get_args and replace import - phase 2
      
      * move ppu to apex.transformer.pipeline_parallel
      
      * update two __init__.py
      
      * update READMEs
      
      * mpu -> parallel_state & tensor_parallel
      
      * fix
      
      * remove not pipeline files
      
      * separate schedules.py - phase 1
      
      * dissect schedules.py
      
      * data_iterators -> batch
      
      * remove optimizer from forward_backward_step funcs
      
      * init test
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * fix cyclic import
      
      * fix syntax of Callable
      
      * fix - 1
      
      * move directory as testing used for pp test as well
      
      * add some functions for num microbatches calculator
      
      * model is a list in pipeline parallel
      
      * skip build num microbatch calculator
      
      * fix test
      
      * assert -> raise
      
      * skip args printing
      
      * specify tensor shape everywhere even if None - phase 1
      
      * private timers
      
      * passing tensor shape & dtype around
      
      * update dtype handling by introducing helper func
      
      * write helper func to reduce cyclomatic complexity
      
      * remove duplicate
      
      * update
      
      * move split_tensor_into_1d_equal_chunks to avoid cyclic import
      
      * tmp
      
      * cosmetic
      
      * move gather_split_1d_tensor to avoid cyclic imports
      
      * remove debug print
      
      * add outer loop
      
      * early return if possible
      
      * cosmetic
      
      * passing around tensor shape
      
      * refactor test
      
      * add script to learn batch sampler behavior
      
      * update
      
      * minibatch splitter
      
      * add minibatch splitter
      
      * split minibatch into microbatches
      
      * minor changes
      
      * uncomment split batch for test sake
      
      * set as attribute
      
      * study the behavior of no pipelining
      
      * debug 1
      
      * reflect test util namespace change
      
      * update readme
      
      * cosmetic in test
      
      * add model build helper func for interleaving sched
      
      * adding model builder from megatron
      
      * can be a cyclic import
      
      * fix
      
      * enable interleaving test, but failing even if forward only
      
      * fix batch preparation
      
      * add explanation
      
      * print data parallel size
      
      * fix typo
      
      * Add Megatron style GPT model by Rishi
      Co-authored-by: Rishi Puri <riship@nvidia.com>
      
      * update
      
      * type hint for jit
      
      * fix forward_backward_no_pipelining test
      
      * pipeline forward backward seems to hang if not forward only
      
      * fix typo
      
      * debug
      
      * add p2p test
      
      * simplify
      
      * fix
      
      * tentative
      
      * set both tmp and pmp to 1
      
      * init
      
      * fix typo
      
      * fix
      
      * fix path of divide
      
      * set seed for tmp
      
      * update upon Eddie comment
      
      * fix typo
      
      * adding failing data loader test
      
      * fix
      
      * megatron still failing
      
      * check in
      
      * with the new order of the nested loops, interleaving seems fine
      
      * cosmetic change
      
      * make `forward_backward_pipelining_with_interleaving` private
      
      * warn users that interleaving sched is unstable
      
      * move noop handler to no pipelining
      
      * comment out rank_print
      
      * make `build_model` more flexible
      
      * skip megatron test tentatively
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * skip appropriately
      
      * remove wip p2p comm test
      
      * update type hint of model_provider_func
      
      * disable tf32 in each test script
      
      * skip interleaving w/ backward
      
      * rename as mpu is the old name
      
      * remove broken case
      
      * expose build_model func
      
      * delete `dist.ring_exchange` func call and `use_ring_exchange` argument
      
      * nit fixes
      
      * check in
      
      * remove unused file
      
      * update the list
      
      * update tensor shape
      
      * remove mixed dtype case
      
      * use torch.distributed.run
      
      * 2020 -> 2021
      
      * another 2020 -> 2021
      
      * docstring & type hint
      
      * fix teardown
      
      * update
      
      * change to experimental
      
      * check if warned
      Co-authored-by: Rishi Puri <riship@nvidia.com>
      Co-authored-by: Eddie Yan <eddiey@nvidia.com>
      63d5dd63
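      One recurring piece above ("minibatch splitter", "split minibatch into microbatches") is the step that feeds the pipeline schedules. A minimal sketch, assuming every tensor in the minibatch is batch-major (dim 0):

      ```python
      from typing import List

      import torch

      def split_into_microbatches(batch: List[torch.Tensor],
                                  micro_batch_size: int) -> List[List[torch.Tensor]]:
          """Split each tensor along dim 0 into equally sized microbatches."""
          splits = [t.split(micro_batch_size, dim=0) for t in batch]
          return [list(micro) for micro in zip(*splits)]

      # Example: a minibatch of 8 samples split into 4 microbatches of 2.
      tokens = torch.randint(0, 100, (8, 16))
      labels = torch.randint(0, 100, (8, 16))
      micro_batches = split_into_microbatches([tokens, labels], micro_batch_size=2)
      assert len(micro_batches) == 4 and micro_batches[0][0].shape == (2, 16)
      ```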
  33. 23 Oct, 2021 1 commit
  34. 08 Oct, 2021 1 commit