- 25 Feb, 2022 3 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Masaki Kozuki authored
- 23 Feb, 2022 4 commits
  - Masaki Kozuki authored
  - Masaki Kozuki authored
  - Thor Johnsen authored: Change data type for virtual tensors to float
  - Thor Johnsen authored
- 15 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 12 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 11 Feb, 2022 1 commit
  - Stas Bekman authored:
    * [FusedRMSNorm doc] add epsilon to formula
    * correct
    * better wording
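The epsilon mentioned in the doc fix above sits under the square root of the RMS statistic. As a hedged illustration only (plain Python, not the apex CUDA implementation; the input `x` and weight `g` here are made up for the example):

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Root-mean-square normalization with epsilon added under the
    square root for numerical stability:
        y_i = x_i / sqrt(mean(x^2) + eps) * g_i
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, weight)]

# With eps=0 this reduces to plain RMS normalization:
# mean(x^2) = (9 + 16) / 2 = 12.5
out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```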
- 10 Feb, 2022 1 commit
  - Masaki Kozuki authored
- 07 Feb, 2022 1 commit
  - eqy authored
- 04 Feb, 2022 1 commit
  - eqy authored:
    * FusedRMSNorm based on FusedLayerNorm
    * refactor duplicated kernels
    * delete comments
    * delete comments
    * cleanup
    * cleanup
    * cleanup, fixed clobbering forward_affine_mixed_dtypes
    * fix pybind naming and add MixedFused test
    * undo skipping
    * check elementwise_affine
    * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py ("Oof, nice catch, thanks")
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 01 Feb, 2022 2 commits
  - ChongyuNVIDIA authored:
    * Add the permutation related support as the extension for the ASP lib.
    * [Fix] Track the permutation sequence for the progressive channel swap strategy.
    * Fix the corner case where one layer is not sparse but needs to apply permutation due to its siblings.
    * Fix the deprecated functions in ASP unit tests.
    * Fix the sparsity info typo in the ASP lib.
    * [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
    * Update the README.md with identical random seed setting and NeurIPS info.
    * Integrate the Pybind11 enhancement of permutation search into the ASP lib.
  - Masaki Kozuki authored:
    * add kwarg of `custom_sync_context_handler`
    * add kwargs to ignore `custom_sync_context_handler` when it is mistakenly passed to fwd/bwd funcs
- 31 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * Free output tensor on each pipeline stage for a smaller memory footprint (see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb)
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff
    * ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4
    * remove position embedding group in destroy
    * pass deallocate_pipeline_outputs to backward_step
    * fix typo
    * missing deallocate_pipeline_outputs
    * fix typo: grad_ouptut -> grad_output
    * update tests
    * remove addressed todo
    * test with data parallel size of 2 if there are 8 or more GPUs
  - chochowski authored:
    * fix graph capture failure; fix norm computation with full_ar and clip_after
    * fix group range to compute l2_norm
    Co-authored-by: seryilmaz <seryilmaz@nvidia.com>
    Co-authored-by: mchochowski <mchochowski@nvidia.com>
- 29 Jan, 2022 1 commit
  - Burc Eryilmaz authored
- 28 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * cosmetic refactor in test
    * log with PID
    * log more info: rank, pid, filename, lineNo
  - Masaki Kozuki authored:
    * have get_kth_microbatch deal with None batch
    * broadcast based on tensor parallel rank
    * dtype
    * remove unnecessary .cuda()
    Processes with tensor parallel rank != 0 don't need to prepare `torch.utils.data.DataLoader` instances, so the `batch` argument of `get_kth_microbatch` can be `None`; the current implementation doesn't allow for that.
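The `None`-batch handling described above can be sketched framework-free; the function name mirrors the one in the commit, but this is a simplified stand-in for illustration, not the apex implementation:

```python
def get_kth_microbatch(batch, k, micro_batch_size):
    """Return the k-th microbatch slice of a global batch.

    Ranks that load no data (tensor parallel rank != 0 in the
    commit above) pass batch=None, which is simply propagated
    instead of raising.
    """
    if batch is None:
        return None
    start = k * micro_batch_size
    return batch[start:start + micro_batch_size]

first = get_kth_microbatch(list(range(8)), 1, 4)   # [4, 5, 6, 7]
none_case = get_kth_microbatch(None, 0, 4)         # None
```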
- 21 Jan, 2022 2 commits
  - Masaki Kozuki authored:
    * add keyword argument of `grad_scaler`
    * update test
    * pass dtype to fwd_step_func
    * add log
    * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting
    * option to turn off autocast inside forward_step function, as some users activate `autocast` outside the fwd/bwd functions
    * add missing arg of disable_autocast
    * reorder args of no pipeline
  - eqy authored: CC @crcrpar @ptrblck
- 19 Jan, 2022 1 commit
  - Masaki Kozuki authored
- 13 Jan, 2022 1 commit
  - Shintaro Iwasaki authored
- 17 Dec, 2021 1 commit
  - Masaki Kozuki authored: Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249)
    * let users specify dtype for p2p comm, taking the possibility of O2-style AMP into account
    * add `dtype` argument to forward_backward functions
    * fix
    * better message
    * add docstring of dtype
    * add a link to dtype logic of p2p comm
- 16 Dec, 2021 2 commits
  - Masaki Kozuki authored
  - eqy authored:
    * reduce bert memory usage, placeholder data for gpt
    * update gpt test
    * fix
    * Update tests/L0/run_transformer/run_bert_minimal_test.py (remove debugging indexing)
    * Update tests/L0/run_transformer/run_bert_minimal_test.py (cleanup)
    Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
- 15 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * apply formatter & remove duplicate func def
    * dry CUDA_HOME None check
    * `--threads 4`
  - Masaki Kozuki authored
- 14 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * merge .so files
    * odr
    * fix build
    * update import
    * apply psf/black with max line length of 120
    * update
    * fix
    * update
    * build fixed again but undefined symbol again
    * fix 2, still layer norm grad is undefined
    * remove unused cpp files
    * without layer_norm.cuh, import works
    * import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit causing the shared objects to fail to link `HostApplyLayerNorm` and `HostLayerNormGradient`?
    * clean up layer norm
  - eqy authored
- 10 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * update parallel_state
    * update pipeline common funcs - forward_step and backward_step
    * update pipelining w/o interleaving
    * type hint
    * merge utils into without_interleaving (motivation: functions in utils are only used by forward_backward_pipelining_without_interleaving)
    * fix handling of `model_type`
    * fix import of DDP
    * update set_input_tensor method
    * fix
    * cosmetic
    * update model
    * refactor pipeline test scripts
  - Rishi Puri authored: Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222)
    * minimal bert pipeline parallel test
    * first draft of gpt minimal test
    * framework to scale up the gpt2 test for a variety of distributed setups
    * adding gpt_minimal_test to list of multigpu tests
    Co-authored-by: Eddie Yan <eddiey@nvidia.com>
    Co-authored-by: riship <riship@nvidia.com>
- 09 Dec, 2021 2 commits
  - Masaki Kozuki authored:
    * pass `self.mask_additive`
    * clang-format
    * removing THCState
  - Kevin Stephano authored:
    * Add fused mixed precision lamb optimizer.
    * Fix device usage in constructor.
    * Fix sending param_group tensor state to device.
    * Remove unneeded device set.
- 19 Nov, 2021 3 commits
  - eqy authored:
    * minimal bert pipeline parallel test
    * fix global and cleanup
    * use get_forward_backward_func
    * cleanup and fix some tests
  - Masaki Kozuki authored
    Co-authored-by: Sangkug Lym <slym@nvidia.com>
  - Masaki Kozuki authored:
    * init logging use
    * fix
    * clean up
    * fp32 p2p comm
    * init
    * Dynamic global batch size with `MegatronPretrainingSampler`. I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirements for global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now.
    * revive original pipeline parallel test
    * update MULTIGPU_TEST: add dynamic batchsize test
    * run MegatronPretrainingRandomSampler
    * fix comment
    * fix
    * update
    * cosmetic
    * add note
    * Apply 2 suggestion(s) to 2 file(s)
    * change following https://github.com/NVIDIA/apex/pull/1210
    * fix
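A pretraining sampler of the kind named above hands each data-parallel rank a contiguous slice of every global batch. As a simplified, framework-free sketch of that partitioning idea (an assumed behavior for illustration, not the Megatron/apex code):

```python
def pretraining_batch(consumed_samples, global_batch_size,
                      micro_batch_size, data_parallel_rank,
                      data_parallel_size):
    """Return this rank's sample indices for the next global batch.

    Each rank owns a contiguous chunk of the global batch; the chunk
    size must divide evenly into microbatches.
    """
    samples_per_rank = global_batch_size // data_parallel_size
    assert samples_per_rank % micro_batch_size == 0
    start = consumed_samples + data_parallel_rank * samples_per_rank
    return list(range(start, start + samples_per_rank))

# rank 1 of 2, global batch of 8: this rank gets the second half
idx = pretraining_batch(0, 8, 2, 1, 2)   # [4, 5, 6, 7]
```

Changing `global_batch_size` between calls is what makes the batch size dynamic; the consumed-sample offset keeps ranks aligned across batches.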
- 10 Nov, 2021 2 commits
  - Masaki Kozuki authored
  - eqy authored