Commits · e95c3b9ccc9daa747d00de03693abb0259edee02 · OpenDAS / apex

25 Feb, 2022 1 commit
- add setter of pipeline model parallel split rank (#1306) · e95c3b9c
  Masaki Kozuki authored Feb 25, 2022
  
  e95c3b9c
23 Feb, 2022 1 commit
- access to pipeline_model_parallel_split_rank (#1300) · 069ff336
  Masaki Kozuki authored Feb 23, 2022
  
  069ff336
04 Feb, 2022 1 commit

FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) · 684c4733

eqy authored Feb 03, 2022



* FusedRMSNorm based on FusedLayerNorm

* refactor duplicated kernels

* delete comments

* delete comments

* cleanup

* cleanup

* cleanup, fixed clobbering forward_affine_mixed_dtypes

* fix pybind naming and add MixedFused test

* undo skipping

* check elementwise_affine

* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

684c4733
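
Note on the commit above: RMS normalization, which T5 uses in place of standard layer normalization, skips the mean subtraction and the bias term and rescales each input by its root mean square. The following is a minimal pure-Python reference sketch of that math only, not the fused CUDA kernel added in #1274; the function name `rms_norm_reference` and the default `eps` are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def rms_norm_reference(
    x: Sequence[float],
    weight: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> List[float]:
    """Hypothetical reference: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i."""
    # Mean of squares over the normalized dimension (no mean subtraction).
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms for v in x]
    # weight=None stands in for the case without an elementwise affine scale.
    if weight is None:
        return normed
    return [n * w for n, w in zip(normed, weight)]


out = rms_norm_reference([3.0, 4.0], eps=0.0)
```

Passing `weight=None` plays the role of disabling the elementwise affine scale in this sketch; with `eps=0.0` the mean of the squared outputs is 1.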

31 Jan, 2022 1 commit

T5 pipeline parallel changes (#1279) · 0da60e10

Masaki Kozuki authored Jan 31, 2022

* Free output tensor on each pipeline stage for smaller memory footprint

see:
https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb

* ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff

* ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4

* remove position embedding group in destroy

* pass deallocate_pipeline_outputs to backward_step

* fix typo

* missing deallocate_pipeline_outputs

* fix typo: grad_ouptut -> grad_output

* update tests

* remove accessed todo

* test with data parallel size of 2 if there are 8 or more GPUs

0da60e10

28 Jan, 2022 2 commits

small changes in test and logger format (#1278) · b1c75f6f
Masaki Kozuki authored Jan 28, 2022
* cosmetic refactor in test

* log with PID

* log more info: rank, pid, filename, lineNo
b1c75f6f

allow for `None` batch (#1280) · a960fe8c

Masaki Kozuki authored Jan 28, 2022

* have get_kth_microbatch deal with None batch

* broadcast based on tensor parallel rank

* dtype

* remove unnecessary .cuda()

Processes with tensor parallel rank != 0 don't need to prepare any `torch.utils.data.DataLoader` instances, so the `batch` argument of `get_kth_microbatch` can be `None`, but the implementation did not previously allow for it.

a960fe8c
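
The `None`-batch behavior described in the commit above can be sketched roughly as follows. This is a simplified illustration over plain Python lists, not the apex implementation: the name `get_kth_microbatch_sketch` and the explicit `micro_batch_size` parameter are assumptions, and the real function works on tensors.

```python
from typing import List, Optional, Sequence


def get_kth_microbatch_sketch(
    batch: Optional[Sequence[list]],
    k: int,
    micro_batch_size: int,
) -> Optional[List[list]]:
    """Hypothetical helper: slice the k-th microbatch out of each field of a batch."""
    # Ranks that never built a data loader pass None and get None back.
    if batch is None:
        return None
    start = k * micro_batch_size
    end = start + micro_batch_size
    # Each field (e.g. tokens, labels) is sliced along its leading dimension.
    return [field[start:end] for field in batch]


no_data = get_kth_microbatch_sketch(None, 0, 2)
second = get_kth_microbatch_sketch([[1, 2, 3, 4], [5, 6, 7, 8]], 1, 2)
```

Returning `None` early keeps the caller's loop identical on every rank; only the rank that owns the data does any slicing.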

21 Jan, 2022 1 commit

Grad scaler (#1277) · 2a4ab177

Masaki Kozuki authored Jan 21, 2022

* add keyword argument of `grad_scaler`

* update test

* pass dtype to fwd_step_func

* add log

* calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting

* option to turn off autocast inside forward_step function

Some users activate `autocast` outside the fwd/bwd functions.

* add missing arg of disable_autocast

* reorder args of no pipeline

2a4ab177

17 Dec, 2021 1 commit

Add an argument of `dtype` to forward_backward functions to specify the dtype... · b88c507e

Masaki Kozuki authored Dec 17, 2021

Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249)

* let users specify dtype for p2p comm taking the possibility of O2 style AMP into account

* add `dtype` argument to forward_backward functions

* fix

* better message

* add docstring of dtype

* add a link to dtype logic of p2p comm

b88c507e

16 Dec, 2021 1 commit

Reduce OOM potential and report it if it happens in BERT test (#1250) · e0f5ea8c

eqy authored Dec 15, 2021



* reduce bert memory usage, placeholder data for gpt

* update gpt test

* fix

* Update tests/L0/run_transformer/run_bert_minimal_test.py

remove debugging indexing
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

* Update tests/L0/run_transformer/run_bert_minimal_test.py

cleanup
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

e0f5ea8c

14 Dec, 2021 1 commit
- check size in kth microbatch (#1247) · ed94d0bb
  eqy authored Dec 13, 2021
  
  ed94d0bb
10 Dec, 2021 2 commits

Cherry-pick Megatron-LM's changes in pipeline model parallel for T5 (#1232) · 0e25fcc4

Masaki Kozuki authored Dec 10, 2021

* update parallel_state

* update pipeline common funcs - forward_step and backward_step

* update pipelining w/o interleaving

* type hint

* merge utils into without_interleaving

Motivation: functions in utils are only used by
forward_backward_pipelining_without_interleaving

* fix handling of `model_type`

* fix import of DDP

* update set_input_tensor method

* fix

* cosmetic

* update model

* refactor pipeline test scripts

0e25fcc4

Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel)... · ab7af058

Rishi Puri authored Dec 09, 2021


Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222)

* minimal bert pipeline parallel test

* first draft of gpt minimal test

* framework to scale up the gpt2 test for variety of distributed setups

* adding gpt_minimal_test to list of multigpu tests
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: riship <riship@nvidia.com>

ab7af058

09 Dec, 2021 1 commit

Add fused mixed precision lamb optimizer. (#1237) · 3c8f5161

Kevin Stephano authored Dec 08, 2021

* Add fused mixed precision lamb optimizer.

* Fix device usage in constructor.

* Fix sending param_group tensor state to device.

* Remove unneeded device set.

3c8f5161

19 Nov, 2021 2 commits

minimal bert pipeline parallel test (#1216) · aa756cec

eqy authored Nov 18, 2021

* minimal bert pipeline parallel test

* fix global and cleanup

* use get_forward_backward_func

* cleanup and fix some tests

aa756cec

[POC] Support Megatron-LM's `rampup_batch_size` argument (#1212) · 35336133

Masaki Kozuki authored Nov 19, 2021

* init logging use

* fix

* clean up

* fp32 p2p comm

* init

* Dynamic global batch size with `MegatronPretrainingSampler`

I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have requirements on global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with yet

* revive original pipeline parallel test

* update MULTIGPU_TEST: add dynamic batchsize test

* run MegatronPretrainingRandomSampler

* fix comment

* fix

* update

* cosmetic

* add note

* Apply 2 suggestion(s) to 2 file(s)

* change following https://github.com/NVIDIA/apex/pull/1210

* fix

35336133
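
For context on the commit above, a batch-size ramp-up of this kind can be sketched as a step schedule: start small and add a fixed increment at evenly spaced sample counts until the target global batch size is reached. The sketch assumes the three values (start batch size, increment, ramp-up samples) behave as in Megatron-LM's argument description; the function name and parameters are hypothetical and this is not the code from #1212.

```python
def ramped_global_batch_size(
    consumed_samples: int,
    start: int,
    increment: int,
    rampup_samples: int,
    target: int,
) -> int:
    """Hypothetical helper: global batch size after `consumed_samples` samples."""
    if start > target or increment <= 0 or (target - start) % increment != 0:
        raise ValueError("target - start must be a non-negative multiple of increment")
    # Once the ramp-up budget is spent (or there is nothing to ramp), use the target.
    if start == target or consumed_samples >= rampup_samples:
        return target
    num_increments = (target - start) // increment
    samples_per_increment = rampup_samples / num_increments
    # Number of whole increments completed so far.
    steps = int(consumed_samples / samples_per_increment)
    return start + steps * increment


schedule = [ramped_global_batch_size(n, 4, 4, 100, 12) for n in (0, 49, 50, 99, 100)]
```

Because the global batch size changes during training, the number of microbatches per step has to be recomputed as samples are consumed, which is why the commit pairs this with a dynamic batch sampler.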

10 Nov, 2021 1 commit
- check in (#1210) · 2205cff2
  eqy authored Nov 09, 2021
  
  2205cff2
27 Oct, 2021 1 commit

Pipeline Model Parallel (#1202) · 63d5dd63

Masaki Kozuki authored Oct 27, 2021



* Init apex.ppu (pipeline model parallel utility)

Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1

* removing get_args and replace import - phase 2

* move ppu to apex.transformer.pipeline_parallel

* update two __init__.py

* update READMEs

* mpu -> parallel_state & tensor_parallel

* fix

* remove not pipeline files

* separate schedules.py - phase 1

* dissect schedules.py

* data_iterators -> batch

* remove optimizer from forward_backward_step funcs

* init test

* Apply 2 suggestion(s) to 2 file(s)

* fix cyclic import

* fix syntax of Callable

* fix - 1

* move directory as testing used for pp test as well

* add some functions for num microbatches calculator

* model is a list in pipeline parallel

* skip build num microbatch calculator

* fix test

* assert -> raise

* skip args printing

* specify tensor shape everywhere even if None - phase 1

* private timers

* passing tensor shape & dtype around

* update dtype handling by introducing helper func

* write helper func to reduce cyclomatic complexity

* remove duplicate

* update

* move split_tensor_into_1d_equal_chunks to avoid cyclic import

* tmp

* cosmetic

* move gather_split_1d_tensor to avoid cyclic imports

* remove debug print

* add outer loop

* early return if possible

* cosmetic

* passing around tensor shape

* refactor test

* add script to learn batch sampler behavior

* update

* minibatch splitter

* add minibatch splitter

* split minibatch into microbatches

* minor changes

* uncomment split batch for test sake

* set as attribute

* study the behavior of no pipelining

* debug 1

* reflect test util namespace change

* update readme

* cosmetic in test

* add model build helper func for interleaving sched

* adding model builder from megatron

* can be cyclic import

* fix

* enable interleaving test, but failing even if forward only

* fix batch preparation

* add explanation

* print data parallel size

* fix typo

* Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>

* update

* type hint for jit

* fix forward_backward_no_pipelining test

* pipeline forward backward seem to hang if not forward only

* fix typo

* debug

* add p2p test

* simplify

* fix

* tentative

* set both tmp and pmp to 1

* init

* fix typo

* fix

* fix path of divide

* set seed for tmp

* update upon Eddie comment

* fix typo

* adding failing data loader test

* fix

* megatron still failing

* check in

* with the nested loop of new order, interleaving seems fine

* cosmetic change

* make `forward_backward_pipelining_with_interleaving` private

* warn users that interleaving sched is unstable

* move noop handler to no pipelining

* comment out rank_print

* make `build_model` more flexible

* skip megatron test tentatively

* correctly comment out rank_print

* skip appropriately

* remove wip p2p comm test

* update type hint of model_provider_func

* disable tf32 in each test script

* skip interleaving w/ backward

* rename as mpu is the old name

* remove broken case

* expose build_model func

* delete `dist.ring_exchange` func call and `use_ring_exchange` argument

* nit fixes

* check in

* remove unused file

* update the list

* update tensor shape

* remove mixed dtype case

* use torch.distributed.run

* 2020 -> 2021

* another 2020 -> 2021

* docstring & type hint

* fix teardown

* update

* change to experimental

* check if warned
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

63d5dd63

23 Oct, 2021 1 commit

Use out-of-place to avoid D2D copy in tensor parallel cross entropy (#1198) · 3303b3e7

Masaki Kozuki authored Oct 23, 2021



* switch from clone to out-of-place subtract

* Update apex/mpu/cross_entropy.py

* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

3303b3e7

08 Oct, 2021 1 commit
- Remove `custom_fwd`/`custom_bwd` from fused softmax (#1188) · 14ccf598
  Masaki Kozuki authored Oct 09, 2021
```
* run backward

* remove custom_fwd/custom_bwd
```
  14ccf598
06 Oct, 2021 1 commit
- ColumnParallelLinearWithAsyncAllreduce autocast support (#1183) · b3da6036
  Masaki Kozuki authored Oct 06, 2021
```
* [ColumnParallelLinear] Test behavior in autocast

* fix test

* casts manually to autocast dtype
```
  b3da6036
02 Oct, 2021 1 commit

transformer utils (#1181) · 365fdc18

Masaki Kozuki authored Oct 02, 2021


Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

15 Apr, 2021 1 commit

Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac

Sudhakar Singh authored Apr 15, 2021

* Add unit tests for fused-novograd

* Fix: tensors should reside on the same device

* Fix: Cudastream should be called on the same device on which the tensors reside. Found this while debugging the fused novograd multi-device unit test

* fixed issues mentioned in the comments

59d2f7ac

01 Dec, 2020 1 commit

DistributedFusedAdam Model Parallelism Support (Megatron) (#981) · 6b7e77b0

Kexin Yu authored Dec 01, 2020



DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>

6b7e77b0

05 Aug, 2020 1 commit

set device guard for multi tensor optimizer implementations (#927) · 274cc063

ngimel authored Aug 05, 2020

* add device guards to the optimizers

* add untracked file

* set deviceGuard in multi_tensor_apply

* address review comments; fix lamb

* indent

* typo

274cc063

23 Jun, 2020 3 commits
- add test case for non-zero weight decay · ad50ce9a
  Kexin Yu authored Jun 23, 2020
  
  ad50ce9a
- test nvlamb; hyperparams consistent with adam/adagrad tests · cd3d6d12
  Kexin Yu authored Jun 23, 2020
  
  cd3d6d12
- add test for FusedLAMB · 9774ce0d
  Kexin Yu authored Jun 22, 2020
  
  9774ce0d
14 May, 2020 1 commit
- Add FusedAdagrad (#822) · 3bae8c83
  Andrew Tulloch authored May 14, 2020
  
  3bae8c83
30 Apr, 2020 1 commit

Improvements to apex.mlp (#804) · 31aceeaa

Deyu Fu authored Apr 30, 2020

* update fused bias relu backward kernel

* add support for not requiring first layer dgrad

* fix bug: wrong layer in requires grad

* add infrastructure for optional bias and activation; currently only supports no bias and no relu

* make bias and relu optional separately

* add sigmoid activation option

31aceeaa

22 Apr, 2020 2 commits

initial commit to add Multilayer Perceptron (MLP) extension (#790) · 71511faf
Deyu Fu authored Apr 22, 2020

71511faf

Fix LARC with mixed precision (#793) · 2ec84ebd

Vinicius Reis authored Apr 22, 2020

The LARC optimizer wraps an underlying optimizer and then needs to be passed
to amp.initialize for mixed precision. There were 3 different crashes happening
in this situation; this commit fixes all of them and adds a unit test.

I don't know if the 'LARC' in sys.modules check ever worked. In my setup, the
entry in sys.modules is 'apex.parallel.LARC'. Checking if the variable is
defined seems more reliable though.

2ec84ebd
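
The observation about `sys.modules` in the commit above follows from how Python registers modules: keys are fully qualified names, so a check against the short name of a submodule does not match. A small standard-library sketch of the same effect, using `json.decoder` as a stand-in for `apex.parallel.LARC` (the helper name `is_loaded` is made up for illustration):

```python
import sys
import json.decoder  # importing the submodule registers it under its full dotted name


def is_loaded(name: str) -> bool:
    """Hypothetical helper: True if `name` is a key in sys.modules."""
    return name in sys.modules


short_name_found = is_loaded("decoder")            # short name: not a key
qualified_name_found = is_loaded("json.decoder")   # full dotted name: present
```

This is why a check on the short name can silently never fire, and why testing whether the imported name is defined is the more dependable option mentioned in the commit message.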

31 Mar, 2020 1 commit
- Add support for bool datatype (#601) (#603) · ca00adac
  Jeff Bowles authored Mar 31, 2020
  
  ca00adac
27 Feb, 2020 1 commit
- NHWC support for multi tensor apply (#732) · de6378f5
  mcarilli authored Feb 26, 2020
```
* NHWC support for multi tensor apply

* compilation fix for version<=1.4
```
  de6378f5
03 Oct, 2019 1 commit

Disable tests for mixed opt_levels, add bitwise accurate test of parameters (#520) · 0b74bfd9

ptrblck authored Oct 03, 2019

* increase atol for Half-Float comparison to 1.5e-4

* disable tests for different opt_levels

* reset atol

* add bitwise accurate comparison

0b74bfd9

03 Sep, 2019 1 commit

Fix issues in fused_adam (#469) · 7fa74925

Deyu Fu authored Sep 03, 2019

* move import of amp_C to __init__()

* make fp16/32 separate lists to support mixed param types, disable double test

* make zero_grad consistent between adam/novograd/lamb

7fa74925

27 Aug, 2019 1 commit

Enable Checkpointing (#420) · dec4fdd6

ptrblck authored Aug 27, 2019

* add state_dict, load_state_dict

* add test_restoring, test_loss_scale_decrease

* disable amp outputs for checkpoint tests

* add test for amp.state_dict, cleanup

* add state_dict patch, add test

* fixed testing, cleanup

* add readme for checkpointing

* add docs to source/amp

* add review changes to doc

dec4fdd6

17 Aug, 2019 1 commit
- disable breaking test until switch to test against upstream v1.2.0 · f855f856
  Deyu Fu authored Aug 16, 2019
  
  f855f856
15 Aug, 2019 1 commit
- Undefined name: import os for line 134 · 453eefa5
  Christian Clauss authored Aug 15, 2019
  
  453eefa5
13 Aug, 2019 2 commits

Reverse to Fused* naming, clean up accordingly: · 007c5947

Deyu Fu authored Aug 13, 2019

FusedSGD now works as before
FusedAdam now works with O1/O2, no longer fuses scaling and casting
Removed special backend handling for FusedAdam
Moved and updated test for FusedAdam into run_optimizers
Removed legacy tests for optimizers.FP16_optimizer and FusedAdam in run_mixed_adam

007c5947

Adding PyProf to Apex (#404) · 880ab925

Marek Kolodziej authored Aug 13, 2019


Co-authored-by: Aditya Agrawal <aditya.iitb@gmail.com>
Co-authored-by: Marek Kolodziej <mkolod@gmail.com>

880ab925