Commits · c4e85f7bf144cb0e368da96d339a6cbd9882cea5 · OpenDAS / apex

19 Jan, 2022 1 commit
- pass flags to transducer joint kernel (#1273) · c4e85f7b
  Masaki Kozuki authored Jan 18, 2022
  
  c4e85f7b
13 Jan, 2022 1 commit
- support new path to CUDAGeneratorImpl.h (#1267) · b2fdf9c4
  Shintaro Iwasaki authored Jan 13, 2022
  
  b2fdf9c4
17 Dec, 2021 1 commit

Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249) · b88c507e

Masaki Kozuki authored Dec 17, 2021

* let users specify dtype for p2p comm taking the possibility of O2 style AMP into account

* add `dtype` argument to forward_backward functions

* fix

* better message

* add docstring of dtype

* add a link to dtype logic of p2p comm

b88c507e

16 Dec, 2021 2 commits

version guard (#1253) · e8473822
Masaki Kozuki authored Dec 16, 2021

e8473822

Reduce OOM potential and report it if it happens in BERT test (#1250) · e0f5ea8c

eqy authored Dec 15, 2021



* reduce bert memory usage, placeholder data for gpt

* update gpt test

* fix

* Update tests/L0/run_transformer/run_bert_minimal_test.py

remove debugging indexing
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

* Update tests/L0/run_transformer/run_bert_minimal_test.py

cleanup
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

e0f5ea8c

15 Dec, 2021 2 commits
- Add `--threads 4` to `extra_compile_args["nvcc"]` (#1251) · f63dac80
  Masaki Kozuki authored Dec 15, 2021
```
* apply formatter & remove duplicate func def

* dry CUDA_HOME None check

* `--threads 4`
```
  f63dac80
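
For context on the `--threads 4` item: `--threads` is an nvcc option (available from CUDA 11.2) that lets nvcc run compilation steps for multiple target architectures in parallel. The following is a minimal sketch of gating the flag on the toolkit version. The helper name, its signature, and the example flag lists are assumptions for illustration, not a copy of apex's `setup.py`.

```python
from typing import Dict, List, Tuple


def append_nvcc_threads(
    nvcc_args: List[str], cuda_version: Tuple[int, int], threads: int = 4
) -> List[str]:
    """Return a copy of nvcc_args with `--threads N` appended when supported."""
    # Assumption: nvcc accepts `--threads` starting with CUDA 11.2.
    if cuda_version >= (11, 2):
        return nvcc_args + ["--threads", str(threads)]
    # Older toolkits: leave the flags untouched.
    return list(nvcc_args)


# Hypothetical flag sets; a real build would detect cuda_version from nvcc.
extra_compile_args: Dict[str, List[str]] = {
    "cxx": ["-O3"],
    "nvcc": append_nvcc_threads(["-O3", "--use_fast_math"], cuda_version=(11, 4)),
}
print(extra_compile_args["nvcc"])
```

Keeping the version check in one helper avoids repeating it for every extension that passes flags to nvcc.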
- check `model_parallel` is initialized in `build_model` (#1248) · 1cd1181d
  Masaki Kozuki authored Dec 15, 2021
  
  1cd1181d
14 Dec, 2021 2 commits

Faster `--fast_multihead_attn` build (#1245) · 7ec8ed67

Masaki Kozuki authored Dec 14, 2021

* merge .so files

* odr

* fix build

* update import

* apply psf/black with max line length of 120

* update

* fix

* update

* build fixed again but undefined symbol again

* fix 2, still layer norm grad is undefined

* remove unused cpp files

* without layer_norm.cuh, import works

* import fast_multihead_attn works...

but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that
kept the shared objects from linking `HostApplyLayerNorm` and
`HostLayerNormGradient`?

* clean up layer norm

7ec8ed67

check size in kth microbatch (#1247) · ed94d0bb
eqy authored Dec 13, 2021

ed94d0bb

10 Dec, 2021 2 commits

Cherry-pick Megatron-LM's changes in pipeline model parallel for T5 (#1232) · 0e25fcc4

Masaki Kozuki authored Dec 10, 2021

* update parallel_state

* update pipeline common funcs - forward_step and backward_step

* update pipelining w/o interleaving

* type hint

* merge utils into without_interleaving

Motivation: functions in utils are only used by
forward_backward_pipelining_without_interleaving

* fix handling of `model_type`

* fix import of DDP

* update set_input_tensor method

* fix

* cosmetic

* update model

* refactor pipeline test scripts

0e25fcc4

Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222) · ab7af058

Rishi Puri authored Dec 09, 2021

* minimal bert pipeline parallel test

* first draft of gpt minimal test

* framework to scale up the gpt2 test for variety of distributed setups

* adding gpt_minimal_test to list of multigpu tests
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: riship <riship@nvidia.com>

ab7af058

09 Dec, 2021 2 commits

Remove `THCState` from `apex/contrib/multihead_attn` (#1239) · 1203099a
Masaki Kozuki authored Dec 09, 2021
```
* pass `self.mask_additive`

* clang-format

* removing THCState
```
1203099a

Add fused mixed precision lamb optimizer. (#1237) · 3c8f5161

Kevin Stephano authored Dec 08, 2021

* Add fused mixed precision lamb optimizer.

* Fix device usage in constructor.

* Fix sending param_group tensor state to device.

* Remove unneeded device set.

3c8f5161

19 Nov, 2021 3 commits

minimal bert pipeline parallel test (#1216) · aa756cec

eqy authored Nov 18, 2021

* minimal bert pipeline parallel test

* fix global and cleanup

* use get_forward_backward_func

* cleanup and fix some tests

aa756cec

porting GradScaler (#1220) · fcae8fa3

Masaki Kozuki authored Nov 19, 2021


Co-authored-by: Sangkug Lym <slym@nvidia.com>

fcae8fa3

[POC] Support Megatron-LM's `rampup_batch_size` argument (#1212) · 35336133

Masaki Kozuki authored Nov 19, 2021

* init logging use

* fix

* clean up

* fp32 p2p comm

* init

* Dynamic global batch size with `MegatronPretrainingSampler`

I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirement for
global batch size, total number of samples, local minibatch size, etc. which I'm not familiar with for now

* revive original pipeline parallel test

* update MULTIGPU_TEST: add dynamic batchsize test

* run MegatronPretrainingRandomSampler

* fix comment

* fix

* update

* cosmetic

* add note

* Apply 2 suggestion(s) to 2 file(s)

* change following https://github.com/NVIDIA/apex/pull/1210

* fix

35336133
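
As background for the dynamic global batch size mentioned in this commit: Megatron-LM's `rampup_batch_size` takes three values (start batch size, batch size increment, ramp-up samples) and grows the global batch size in steps as samples are consumed. The sketch below illustrates that idea only. The function name and signature are hypothetical, and it is not the calculator code from apex or Megatron-LM.

```python
def current_global_batch_size(
    consumed_samples: int,
    start_batch_size: int,
    batch_size_increment: int,
    rampup_samples: int,
    global_batch_size: int,
) -> int:
    """Global batch size to use after `consumed_samples` samples (illustrative)."""
    diff = global_batch_size - start_batch_size
    if diff < 0 or diff % batch_size_increment != 0:
        raise ValueError("global batch size must be start + k * increment")
    num_increments = diff // batch_size_increment
    # No ramp-up needed, or ramp-up already finished.
    if num_increments == 0 or consumed_samples > rampup_samples:
        return global_batch_size
    # Each increment is held for an equal share of the ramp-up samples.
    samples_per_increment = rampup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    return min(start_batch_size + steps * batch_size_increment, global_batch_size)


# Hypothetical schedule: ramp from 16 to 40 in steps of 8 over 300 samples.
for consumed in (0, 100, 250, 300):
    print(consumed, current_global_batch_size(consumed, 16, 8, 300, 40))
```

With a fixed micro-batch size and data-parallel size, a changing global batch size means the number of micro-batches per step changes as well, which is why a sampler and a micro-batch calculator both have to follow the schedule.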

10 Nov, 2021 3 commits
- conditionally import amp_C (#1211) · 25bfcb91
  Masaki Kozuki authored Nov 10, 2021
  
  25bfcb91
- check in (#1210) · 2205cff2
  eqy authored Nov 09, 2021
  
  2205cff2
- check in (#1205) · fa8bd7e6
  eqy authored Nov 09, 2021
  
  fa8bd7e6
27 Oct, 2021 2 commits

`FastLayerNorm` compat with `autocast` (#1203) · ae757634

Masaki Kozuki authored Oct 27, 2021



* Persistent LayerNorm: Multi-CTA Rewrite

* autocast support
Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>

ae757634

Pipeline Model Parallel (#1202) · 63d5dd63

Masaki Kozuki authored Oct 27, 2021



* Init apex.ppu (pipeline model parallel utility)

Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1

* removing get_args and replace import - phase 2

* move ppu to apex.transformer.pipeline_parallel

* update two __init__.py

* update READMEs

* mpu -> parallel_state & tensor_parallel

* fix

* remove not pipeline files

* separate schedules.py - phase 1

* dissect schedules.py

* data_iterators -> batch

* remove optimizer from forward_backward_step funcs

* init test

* Apply 2 suggestion(s) to 2 file(s)

* fix cyclic import

* fix syntax of Callable

* fix - 1

* move directory as testing used for pp test as well

* add some functions for num microbatches calculator

* model is a list in pipeline parallel

* skip build num microbatch calculator

* fix test

* assert -> raise

* skip args printing

* specify tensor shape everywhere even if None - phase 1

* private timers

* passing tensor shape & dtype around

* update dtype handling by introducing helper func

* write helper func to reduce cyclomatic complexity

* remove duplicate

* update

* move split_tensor_into_1d_equal_chunks to avoid cyclic import

* tmp

* cosmetic

* move gather_split_1d_tensor to avoid cyclic imports

* remove debug print

* add outer loop

* early return if possible

* cosmetic

* passing around tensor shape

* refactor test

* add script to learn batch sampler behavior

* update

* minibatch splitter

* add minibatch splitter

* split minibatch into microbatches

* minor changes

* uncomment split batch for test sake

* set as attribute

* study the behavior of no pipelining

* debug 1

* reflect test util namespace change

* update readme

* cosmetic in test

* add model build helper func for interleaving sched

* adding model builder from megatron

* can be cyclic import

* fix

* enable interleaving test, but failing even if forward only

* fix batch preparation

* add explanation

* print data parallel size

* fix typo

* Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>

* update

* type hint for jit

* fix forward_backward_no_pipelining test

* pipeline forward backward seem to hang if not forward only

* fix typo

* debug

* add p2p test

* simplify

* fix

* tentative

* set both tmp and pmp to 1

* init

* fix typo

* fix

* fix path of divide

* set seed for tmp

* update upon Eddie comment

* fix typo

* adding failing data loader test

* fix

* megatron still failing

* check in

* with the nested loop of new order, interleaving seems fine

* cosmetic change

* make `forward_backward_pipelining_with_interleaving` private

* warn users that interleaving sched is unstable

* move noop handler to no pipelining

* comment out rank_print

* make `build_model` more flexible

* skip megatron test tentatively

* correctly comment out rank_print

* skip appropriately

* remove wip p2p comm test

* update type hint of model_provider_func

* disable tf32 in each test script

* skip interleaving w/ backward

* rename as mpu is the old name

* remove broken case

* expose build_model func

* delete `dist.ring_exchange` func call and `use_ring_exchange` argument

* nit fixes

* check in

* remove unused file

* update the list

* update tensor shape

* remove mixed dtype case

* use torch.distributed.run

* 2020 -> 2021

* another 2020 -> 2021

* docstring & type hint

* fix teardown

* update

* change to experimental

* check if warned
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

63d5dd63

23 Oct, 2021 1 commit

Use out-of-place to avoid D2D copy in tensor parallel cross entropy (#1198) · 3303b3e7

Masaki Kozuki authored Oct 23, 2021



* switch from clone to out-of-place subtract

* Update apex/mpu/cross_entropy.py

* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

3303b3e7

18 Oct, 2021 1 commit

remove THC headers/functions (#1192) · 0c7d8e3f

Masaki Kozuki authored Oct 19, 2021

Changes include
- THC headers removal
- TH macros replacement
- fix some typos in comments

0c7d8e3f

16 Oct, 2021 1 commit
- replace (#1191) · 60821f53
  Masaki Kozuki authored Oct 16, 2021
  
  60821f53
14 Oct, 2021 2 commits

change chunking scheme for full-allreduce case, add parameter order argument, both to enable contiguous chunking of allgather (#1190) · 1d5f7e55

Burc Eryilmaz authored Oct 13, 2021

1d5f7e55

Fix dist lamb (#1185) · d9a46fde

Nan Zheng authored Oct 14, 2021

1. remove the weight broadcast in the constructor
2. disable unnecessary allreduces for clip-after-ar

d9a46fde

13 Oct, 2021 1 commit
- check in (#1189) · 4e9fae9b
  eqy authored Oct 13, 2021
  
  4e9fae9b
08 Oct, 2021 2 commits
- Remove `custom_fwd`/`custom_bwd` from fused softmax (#1188) · 14ccf598
  Masaki Kozuki authored Oct 09, 2021
```
* run backward

* remove custom_fwd/custom_bwd
```
  14ccf598
- check in (#1187) · 3ad9db2a
  eqy authored Oct 07, 2021
  
  3ad9db2a
07 Oct, 2021 1 commit
- Update layer_norm_cuda_kernel.cu (#1184) · 5adf7bc2
  eqy authored Oct 06, 2021
  
  5adf7bc2
06 Oct, 2021 1 commit
- ColumnParallelLinearWithAsyncAllreduce autocast support (#1183) · b3da6036
  Masaki Kozuki authored Oct 06, 2021
```
* [ColumnParallelLinear] Test behavior in autocast

* fix test

* casts manually to autocast dtype
```
  b3da6036
02 Oct, 2021 1 commit

transformer utils (#1181) · 365fdc18

Masaki Kozuki authored Oct 02, 2021


Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

30 Sep, 2021 1 commit
- use cuda caching allocator from pytorch (#1180) · bdac244e
  X Wang authored Sep 30, 2021
  
  bdac244e
28 Sep, 2021 1 commit
- cleanup missing THCDeviceUtils.cuh header (#1177) · 2a559c51
  X Wang authored Sep 28, 2021
  
  2a559c51
24 Sep, 2021 2 commits
- Fix typo in contrib FusedLamb. (#1172) · 70d4a0ba
  romerojosh authored Sep 24, 2021
  
  70d4a0ba
- THCDeviceUtils.cuh -> ATen/cuda/DeviceUtils.cuh (#1173) · 76daa454
  Masaki Kozuki authored Sep 24, 2021
  
  76daa454
08 Sep, 2021 1 commit

enable ninja (#1164) · 9ce0a10f

Masaki Kozuki authored Sep 08, 2021

- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`

9ce0a10f
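
The two changes listed for this commit amount to moving include paths out of the raw compiler flags and into a dedicated list. A minimal sketch of that separation follows. The helper `split_include_flags` and the example flags are hypothetical and not taken from apex's `setup.py`.

```python
from typing import List, Tuple


def split_include_flags(compile_args: List[str]) -> Tuple[List[str], List[str]]:
    """Split compiler args into (include_dirs, remaining_args)."""
    include_dirs: List[str] = []
    remaining: List[str] = []
    args = iter(compile_args)
    for arg in args:
        if arg == "-I":
            # Separated form: the directory is the next argument.
            path = next(args, None)
            if path is not None:
                include_dirs.append(path)
        elif arg.startswith("-I"):
            # Joined form: strip the `-I` prefix.
            include_dirs.append(arg[2:])
        else:
            remaining.append(arg)
    return include_dirs, remaining


include_dirs, nvcc_args = split_include_flags(
    ["-O3", "-I/path/to/dir", "-I", "csrc", "--use_fast_math"]
)
print(include_dirs)  # ['/path/to/dir', 'csrc']
print(nvcc_args)     # ['-O3', '--use_fast_math']
```

The resulting lists could then be passed as `include_dirs=include_dirs` and `extra_compile_args={"nvcc": nvcc_args}` when constructing a `CUDAExtension`.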

04 Sep, 2021 1 commit

fix CUBLAS guards (#1162) · 54b93919

Burc Eryilmaz authored Sep 04, 2021



* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

* safer guard around CUBLAS constants, remove unreferenced variable

* more guard changes

* guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

54b93919

02 Sep, 2021 2 commits
- Merge pull request #1161 from NVIDIA/optional_caller_supplied_communicator · ae1cdd64
  Thor Johnsen authored Sep 02, 2021
```
Optional NCCL communicator argument to init method
```
  ae1cdd64
- Optional NCCL communicator argument to init method · e777bddb
  Thor Johnsen authored Sep 02, 2021
  
  e777bddb