- 14 Jun, 2022 2 commits
- 31 May, 2022 1 commit
-
-
eqy authored
Do pipeline parallelism tests in double because TF32 environment variables can be painful to manage across test suites (#1391) * check in * skip interleaved with 2 GPU * change type annotation * address comments thanks @crcrpar @Aidyn-A
-
- 20 May, 2022 1 commit
-
-
Aidyn-A authored
* add grad check * change assert * minor changes * revert unnecessary changes * suggested changes * fix tensor comparison * small changes
-
- 19 May, 2022 2 commits
-
-
eqy authored
* check in * type * cleanup * cleanup * fix function call * Apply suggestions from code review Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
eqy authored
* check in * fancy context style Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 18 May, 2022 1 commit
-
-
Masaki Kozuki authored
* NcclDistributedTestBase * fix stupid mistake * add UCC test * add UCC backend * torch ucc tests * allows for UCC backend * Set `UCX_TLS` to `tcp,cuda_copy` & Use DDP iff it makes sense * Apply 4 suggestion(s) to 1 file(s) * mix&match NCCL & UCC * use both ucc&nccl in gpt * UCC for Pipeline Parallel, NCCL for the others * conditionally use ucc * make ucc guards more friendly * test raises when torch_ucc isn't available * Change to member variable from class variable Co-authored-by:
Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com> * pass async_comm to train, I mistakenly dropped it during the rebase * fix typo: functionality * Enable tensor parallel only when device count > 4 I want pipeline model parallel world size to be >= 4 because previously I saw GPT/BERT failing when only UCC is used. So I'm speculating that there's some gotcha around pipeline size of 4. * Add nvidia driver version guard Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com> * move world_size as it was not correctly reflected * keep eye on the nvml api thing * import unittest
Co-authored-by: Aidyn Aitzhan <31858918+Aidyn-A@users.noreply.github.com>
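The mix-and-match of backends described in this commit (UCC for the pipeline-parallel point-to-point group, NCCL for the others) can be sketched roughly as below. The function and group-kind names are hypothetical illustrations, not apex's actual API:

```python
# Hypothetical sketch of per-group backend selection: UCC for the
# pipeline-parallel (p2p) group when available, NCCL everywhere else.
# choose_backend and the group-kind strings are illustrative names only.

def choose_backend(group_kind: str, ucc_available: bool) -> str:
    """Pick a communication backend for a given process group kind."""
    if group_kind == "pipeline" and ucc_available:
        return "ucc"
    return "nccl"

if __name__ == "__main__":
    for kind in ("tensor", "pipeline", "data"):
        print(kind, "->", choose_backend(kind, ucc_available=True))
```

In the real tests the guard also checks that `torch_ucc` is importable and raises a friendly error when it is not, per the commit message.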
-
- 12 May, 2022 1 commit
-
-
eqy authored
* initial check in * fix * fix test * address some review comments and cleanup * fix * bookmark * fix sync placement to come before gather * similar fix for non-gather case * add async bert * update gpt minimal test * allow selection of default pp test * fix bert test * cleanup * cleanup
-
- 11 May, 2022 1 commit
-
-
Aidyn-A authored
* add loss comparison to test_pipeline_parallel_fwd_bwd * applied some suggested changes * update test_pipeline_parallel_fwd_bwd.py * update test_pipeline_parallel_fwd_bwd.py 2 * minor update * update test_pipeline_parallel_fwd_bwd.py 3
-
- 29 Apr, 2022 1 commit
-
-
eqy authored
* fix typo * Update test_pipeline_parallel_fwd_bwd.py
-
- 07 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* add test * destroy model parallel was missing
-
- 25 Mar, 2022 1 commit
-
-
Masaki Kozuki authored
* try PyTorch custom TestCase class * revert * initial working example * update * data utils * fix imports * hardcode backend to nccl * fix signature * fix typo * mapping * set device * init * refactor x entropy * remove unused import & destroy model parallel * refactor random * fix test * remove migrated tests * refactor * init * separate affine weight init * init model parallel * split more * weight init fix part 1 * use cpu init for consistency btwn native and tensor parallel * black * add col parallel * use a 3D tensor of square matrix for column parallel linear * skip the failing cases * migrate layers test * pipeline parallel forward/backward * fix typo * fix typo * fix * fix pipeline world size * black * rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py * stop logging * set log level * black * license and format * fix * skip tf32 as matrices are small * remove potentially inappropriate license * Apply suggestions from code review * remove `TODO` comment * `torch.testing.assert_allclose` -> `torch.testing.assert_close` * remove comment-outs * remove unused import * minor fix
-
- 26 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
* fuse grad accumulation w/ weight grad Co-authored-by: Sangkug Lym <slym@nvidia.com> * fp32 training path * not using *args, **kwargs * backward: moved the tensor dimension conversion Co-authored-by: Sangkug Lym <slym@nvidia.com> * move files to csrc/megatron * fix fp32 path * fix typo * add to in order to select the correct custom extension * fix typo * comment on import guard * update test: enable gradient_accumulation_fusion * 86 * remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>
-
- 25 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 23 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 04 Feb, 2022 1 commit
-
-
eqy authored
* FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 31 Jan, 2022 1 commit
-
-
Masaki Kozuki authored
* Free output tensor on each pipeline stage for smaller memory footprint see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb * ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff * ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4 * remove position embedding group in destroy * pass deallocate_pipeline_outputs to backward_step * fix typo * missing deallocate_pipeline_outputs * fix typo: grad_ouptut -> grad_output * update tests * remove accessed todo * test with data parallel size of 2 if there's equal to or more than 8 gpus
-
- 28 Jan, 2022 2 commits
-
-
Masaki Kozuki authored
* cosmetic refactor in test * log with PID * log more info: rank, pid, filename, lineNo
-
Masaki Kozuki authored
* have get_kth_microbatch deal with None batch * broadcast based on tensor parallel rank * dtype * remove unnecessary .cuda() Processes of tensor parallel rank != 0 don't need to prepare `torch.utils.data.DataLoader` instances, which means the `batch` argument of the `get_kth_microbatch` function can be `None`, but the current implementation doesn't allow for it.
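The idea behind this change can be sketched in a few lines (an illustrative stand-in, not apex's actual implementation; the slicing helper here is a hypothetical simplification):

```python
# Sketch of letting get_kth_microbatch accept batch=None: tensor-parallel
# ranks != 0 hold no DataLoader, so they pass None and get None back
# instead of crashing on indexing. Illustrative only, not apex's code.

from typing import Optional, Sequence

def get_kth_microbatch(batch: Optional[Sequence], k: int,
                       micro_batch_size: int) -> Optional[Sequence]:
    """Return the k-th micro-batch slice of a minibatch, or None."""
    if batch is None:
        return None
    start = k * micro_batch_size
    return batch[start:start + micro_batch_size]

if __name__ == "__main__":
    minibatch = list(range(8))
    print(get_kth_microbatch(minibatch, 1, 4))  # second micro-batch slice
    print(get_kth_microbatch(None, 1, 4))       # rank without a DataLoader
```

In the real code, rank 0 then broadcasts the micro-batch to the other tensor-parallel ranks, which is why they never need their own loader.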
-
- 21 Jan, 2022 1 commit
-
-
Masaki Kozuki authored
* add keyword argument of `grad_scaler` * update test * pass dtype to fwd_step_func * add log * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting * option to turn off autocast inside forward_step function As some users activate `autocast` outside the fwd/bwd functions. * add missing arg of disable_autocast * reorder args of no pipeline
-
- 17 Dec, 2021 1 commit
-
-
Masaki Kozuki authored
Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249) * let users specify dtype for p2p comm taking the possibility of O2 style AMP into account * add `dtype` argument to forward_backward functions * fix * better message * add docstring of dtype * add a link to dtype logic of p2p comm
-
- 16 Dec, 2021 1 commit
-
-
eqy authored
* reduce bert memory usage, placeholder data for gpt * update gpt test * fix * Update tests/L0/run_transformer/run_bert_minimal_test.py remove debugging indexing Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com> * Update tests/L0/run_transformer/run_bert_minimal_test.py cleanup Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 14 Dec, 2021 1 commit
-
-
eqy authored
-
- 10 Dec, 2021 2 commits
-
-
Masaki Kozuki authored
* update parallel_state * update pipeline common funcs - forward_step and backward_step * update pipelining w/o interleaving * type hint * merge utils into without_interleaving Motivation: functions in utils are only used by forward_backward_pipelining_without_interleaving * fix handling of `model_type` * fix import of DDP * update set_input_tensor method * fix * cosmetic * update model * refactor pipeline test scripts
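The non-interleaved schedule this commit consolidates is the Megatron-style 1F1B shape: each stage runs some warm-up forward steps, alternates one forward with one backward in steady state, then drains the remaining backwards. A rough sketch of the step counts (hypothetical helper name; not apex's exact code):

```python
# Hedged sketch of the 1F1B (non-interleaved) pipeline schedule shape.
# Earlier stages need more warm-up forwards so that later stages have
# work in flight; the last stage runs pure 1F1B from the start.

def schedule_counts(num_microbatches: int, pp_size: int, pp_rank: int):
    """Return (warmup forward steps, steady-state 1F1B pairs) per stage."""
    warmup = min(pp_size - pp_rank - 1, num_microbatches)
    steady = num_microbatches - warmup
    return warmup, steady

if __name__ == "__main__":
    for rank in range(4):
        print("stage", rank, "->", schedule_counts(8, 4, rank))
```

Note the degenerate case: with fewer microbatches than stages, some ranks spend the whole run in warm-up/drain and never hit steady state.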
-
Rishi Puri authored
Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222) * minimal bert pipeline parallel test * first draft of gpt minimal test * framework to scale up the gpt2 test for variety of distributed setups * adding gpt_minimal_test to list of multigpu tests
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: riship <riship@nvidia.com>
-
- 09 Dec, 2021 1 commit
-
-
Kevin Stephano authored
* Add fused mixed precision lamb optimizer. * Fix device usage in constructor. * Fix sending param_group tensor state to device. * Remove unneeded device set.
-
- 19 Nov, 2021 2 commits
-
-
eqy authored
* minimal bert pipeline parallel test * fix global and cleanup * use get_forward_backward_func * cleanup and fix some tests
-
Masaki Kozuki authored
* init logging use * fix * clean up * fp32 p2p comm * init * Dynamic global batch size with `MegatronPretrainingSampler` I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirement for global batch size, total number of samples, local minibatch size, etc. which I'm not familiar with for now * revive original pipeline parallel test * update MULTIGPU_TEST: add dynamic batchsize test * run MegatronPretrainingRandomSampler * fix comment * fix * update * cosmetic * add note * Apply 2 suggestion(s) to 2 file(s) * change following https://github.com/NVIDIA/apex/pull/1210 * fix
-
- 10 Nov, 2021 1 commit
-
-
eqy authored
-
- 27 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility) Reference commit: ``` commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main) Merge: 14f2c684 7b293d9b Author: Mohammad Shoeybi <mshoeybi@nvidia.com> Date: Wed Sep 22 22:57:54 2021 -0700 Merge branch 'add_BOS' into 'main' Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives See merge request ADLR/megatron-lm!328 ``` * removing get_args and replace import - phase 1 * removing get_args and replace import - phase 2 * move ppu to apex.transformer.pipeline_parallel * update two __init__.py * update READMEs * mpu -> parallel_state & tensor_parallel * fix * remove not pipeline files * separate schedules.py - phase 1 * dissect schedules.py * data_iterators -> batch * remove optimizer from forward_backward_step funcs * init test * Apply 2 suggestion(s) to 2 file(s) * fix cyclic import * fix syntax of Callable * fix - 1 * move directory as testing used for pp test as well * add some functions for num microbatches calculator * model is a list in pipeline parallel * skip build num microbatch calculator * fix test * assert -> raise * skip args printing * specify tensor shape everywhere even if None - phase 1 * private timers * passing tensor shape & dtype around * update dtype handling by introducing helper func * write helper func to reduce cyclomatic complexity * remove duplicate * update * move split_tensor_into_1d_equal_chunks to avoid cyclic import * tmp * cosmetic * move gather_split_1d_tensor to avoid cyclic imports * remove debug print * add outer loop * early return if possible * cosmetic * passing around tensor shape * refactor test * add script to learn batch sampler behavior * update * minibatch splitter * add minibatch splitter * split minibatch into microbatches * minor changes * uncomment split batch for test sake * set as attribute * study the behavior of no pipelining * debug 1 * reflect test util 
namespace change * update readme * cosmetic in test * add model build helper func for interleaving shced * adding model builder from megatron * canbe cyclic import * fix * enable interleaving test, but failing even if forward only * fix batch preparation * add explanation * print data parallel size * fix typo * Add Megatron style GPT model by Rishi Co-authored-by:Rishi Puri <riship@nvidia.com> * update * type hint for jit * fix forward_backward_no_pipelining test * pipeline forward backward seem to hang if not forward only * fix typo * debug * add p2p test * simplify * fix * tentative * set both tmp and pmp to 1 * init * fix typo * fix * fix path of divide * set seed for tmp * update upon Eddie comment * fix typo * adding failing data loader test * fix * megatron still failing * check in * with the nested loop of new order, interleaving seems fine * cosmetic change * make `forward_backward_pipelining_with_interleaving private * warn users that interleaving sched is unstable * move noop handler to no pipelining * comment out rank_print * make `build_model` more flexible * skip megatron test tentatively * correctly comment out rank_print * correctly comment out rank_print * correctly comment out rank_print * skip appropriately * remove wip p2p comm test * update type hint of model_provider_func * disable tf32 in each test script * skip interleaving w/ backward * rename as mpu is the old name * remove broken case * expose build_model func * delete `dist.ring_exchange` func call and `use_ring_exchange` argument * nit fixes * check in * remove unused file * update the list * update tensor shape * remove mixed dtype case * use torch.distributed.run * 2020 -> 2021 * another 2020 -> 2021 * docstring & type hint * fix teardown * update * change to experimental * check if warned Co-authored-by:
Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
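One detail worth unpacking from this commit: "model is a list in pipeline parallel" refers to the interleaved (virtual pipeline) schedule, where each stage owns several non-contiguous model chunks. The usual Megatron-style placement can be sketched as follows (illustrative helper; apex's `build_model` differs in detail):

```python
# Sketch of virtual-pipeline ("interleaved") chunk placement: with P
# pipeline stages and V chunks per stage, stage s owns chunks
# s, s + P, s + 2P, ..., so build_model returns a *list* per rank.
# Illustrative only.

def chunks_for_stage(stage: int, pp_size: int, virtual_size: int):
    """Indices of the model chunks hosted by a given pipeline stage."""
    return [stage + v * pp_size for v in range(virtual_size)]

if __name__ == "__main__":
    for s in range(4):
        print("stage", s, "->", chunks_for_stage(s, 4, 2))
```

This round-robin placement is what lets the interleaved schedule shrink pipeline bubbles, at the cost of more p2p communication per microbatch.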
-
- 23 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* switch from clone to out-of-place subtract * Update apex/mpu/cross_entropy.py * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Eddie Yan <eddiey@nvidia.com>
-
- 08 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* run backward * remove custom_fwd/custom_bwd
-
- 06 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* [ColumnParallelLinear] Test behavior in autocast * fix test * casts manually to autocast dtype
-
- 02 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>
-
- 15 Apr, 2021 1 commit
-
-
Sudhakar Singh authored
* Add unit tests for fused-novograd * Fix: tensors should reside on the same device * Fix: the CUDA stream should be obtained on the same device on which the tensors reside. Found this during debugging the fused novograd multi-device unit test * fixed issues mentioned in the comments
-
- 01 Dec, 2020 1 commit
-
-
Kexin Yu authored
DistributedFusedAdam Model Parallelism Support (Megatron)
Co-authored-by: Kexin Yu <kexiny@nvidia.com>
Co-authored-by: Kexin Yu <kexinznzn@gmail.com>
-
- 05 Aug, 2020 1 commit
-
-
ngimel authored
* add device guards to the optimizers * add untracked file * set deviceGuard in multi_tensor_apply * address review comments; fix lamb * indent * typo
-
- 06 Jul, 2020 1 commit
-
-
jjsjann123 authored
* [sync BN] support non-uniform batch size across process group. TODO: test should be added once cleaned up. * updating unit tests * new unit tests for different inputs * cleaning
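Why non-uniform batch sizes need special handling in synchronized batch norm: the global mean and variance must be weighted by each process's element count, not computed as a plain average of per-process means. A pure-Python illustration of the arithmetic (the real kernel all-reduces sums and counts across the process group; names here are illustrative):

```python
# Weighted global statistics for sync BN with non-uniform per-process
# batch sizes: reduce (sum_x, sum_x2, count) per process, then combine.
# Averaging per-process means directly would be wrong when counts differ.

def global_mean_var(per_process):
    """per_process: list of (sum_x, sum_x2, count) triples, one per rank."""
    total = sum(c for _, _, c in per_process)
    mean = sum(s for s, _, _ in per_process) / total
    ex2 = sum(s2 for _, s2, _ in per_process) / total
    return mean, ex2 - mean * mean  # biased variance, as BN uses

if __name__ == "__main__":
    # rank 0 holds [1.0, 3.0], rank 1 holds [5.0] (non-uniform sizes)
    stats = [(4.0, 10.0, 2), (5.0, 25.0, 1)]
    print(global_mean_var(stats))  # same result as pooling all 3 elements
```

Here the naive average of per-rank means would give (2.0 + 5.0) / 2 = 3.5, while the correct pooled mean over [1, 3, 5] is 3.0, which is exactly the bug class this commit's non-uniform tests guard against.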
-
- 23 Jun, 2020 2 commits