- 09 Dec, 2021 3 commits
-
Masaki Kozuki authored
* pass `self.mask_additive`
* clang-format
* removing THCState
-
Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
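
A hypothetical usage sketch for the new optimizer; the import path and class name are assumptions inferred from the commit message, and the constructor is assumed to follow the standard `torch.optim` interface.

```python
# Hypothetical sketch -- import path and class name are assumptions
# based on the commit message, not a documented API.
import torch
from apex.contrib.optimizers import FusedMixedPrecisionLamb  # assumed location

model = torch.nn.Linear(1024, 1024).cuda().half()
# fp16 params with an fp32 master copy is the usual mixed-precision setup
opt = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

loss = model(torch.randn(8, 1024, device="cuda", dtype=torch.float16)).float().sum()
loss.backward()
opt.step()        # fused multi-tensor LAMB update
opt.zero_grad()
```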
-
hubertlu-tw authored
-
- 08 Dec, 2021 1 commit
-
Jithun Nair authored
IFU-2021-10-15 (+ remove redundant defines + C10_CUDA_CHECK)
-
- 06 Dec, 2021 2 commits
-
Hubert Lu authored
-
Masaki Kozuki authored
Changes include:
- THC headers removal
- TH macros replacement
- fix some typos in comments

Conflicts:
	apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu
	apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu
	apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu
	apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu
	apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu
	apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu
	apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu
	apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu
	apex/contrib/csrc/multihead_attn/strided_batched_gemm.h
-
- 03 Dec, 2021 2 commits
-
hubertlu-tw authored
-
hubertlu-tw authored
-
- 02 Dec, 2021 4 commits
-
Jithun Nair authored
* Use --cuda_ext flag to build all supported extensions
* Don't remove --cuda_ext since it'll be needed to build other extensions
* Need to clear all cmdline args so setup.py doesn't complain (see the sketch below)
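
A minimal sketch of the argv-clearing pattern the last bullet refers to, assuming a setuptools-based setup.py like apex's:

```python
# Minimal sketch: consume custom build flags before setuptools parses
# argv, so setup() doesn't complain about options it doesn't recognize.
import sys
from setuptools import setup

build_cuda_ext = "--cuda_ext" in sys.argv
if build_cuda_ext:
    sys.argv.remove("--cuda_ext")  # clear the flag for setuptools

ext_modules = []
if build_cuda_ext:
    # CUDAExtension entries for every supported extension would be
    # appended here.
    pass

setup(name="example", ext_modules=ext_modules)
```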
-
Hubert Lu authored
Add more unit tests for both distributed and extensions
-
hubertlu-tw authored
-
Hubert Lu authored
-
- 01 Dec, 2021 2 commits
- 29 Nov, 2021 1 commit
-
X Wang authored
-
- 22 Nov, 2021 1 commit
-
Hubert Lu authored
Change python3.6 to python
-
- 19 Nov, 2021 5 commits
-
Hubert Lu authored
-
Hubert Lu authored
-
eqy authored
* minimal bert pipeline parallel test
* fix global and cleanup
* use get_forward_backward_func
* cleanup and fix some tests
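
A hedged sketch of how such a test selects a schedule; it assumes the apex.transformer API of this period (exact signatures may have differed), and `forward_step`, `model`, and `batch` are illustrative placeholders.

```python
# Sketch only: assumes torch.distributed and apex model-parallel state
# are already initialized; signatures are approximate for this period.
import torch
from apex.transformer import parallel_state
from apex.transformer.pipeline_parallel import get_forward_backward_func

def forward_step(batch, model):
    # Megatron-style contract: return the output plus a loss-reducing fn.
    out = model(batch)
    return out, lambda t: {"loss": t.float().mean()}

model = [torch.nn.Linear(16, 16).cuda()]   # one module per (virtual) stage
batch = torch.randn(4, 16, device="cuda")

fwd_bwd = get_forward_backward_func(
    virtual_pipeline_model_parallel_size=None,  # no interleaving
    pipeline_model_parallel_size=parallel_state.get_pipeline_model_parallel_world_size(),
)
losses = fwd_bwd(forward_step, batch, model, forward_only=True, tensor_shape=None)
```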
-
Masaki Kozuki authored
Co-authored-by: Sangkug Lym <slym@nvidia.com>
-
Masaki Kozuki authored
* init logging use
* fix
* clean up
* fp32 p2p comm
* init
* Dynamic global batch size with `MegatronPretrainingSampler` (see the sketch after this list). I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirements for global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now
* revive original pipeline parallel test
* update MULTIGPU_TEST: add dynamic batch-size test
* run MegatronPretrainingRandomSampler
* fix comment
* fix
* update
* cosmetic
* add note
* Apply 2 suggestion(s) to 2 file(s)
* change following https://github.com/NVIDIA/apex/pull/1210
* fix
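
The dynamic-batch-size behavior above boils down to carving each global batch into fixed-size microbatches for the pipeline schedule; a self-contained illustration (not the apex/Megatron sampler itself):

```python
# Illustrative only -- not the apex/Megatron sampler implementation.
import torch

def split_into_microbatches(batch: torch.Tensor, micro_batch_size: int):
    # A global batch must divide evenly into microbatches for pipelining.
    assert batch.size(0) % micro_batch_size == 0
    return list(torch.split(batch, micro_batch_size, dim=0))

global_batch = torch.randn(32, 128)   # 32 samples this step
micro = split_into_microbatches(global_batch, micro_batch_size=8)
assert len(micro) == 4                # 32 / 8 pipeline microbatches
```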
-
- 18 Nov, 2021 1 commit
-
Abhishree authored
-
- 17 Nov, 2021 2 commits
-
X Wang authored
-
Masaki Kozuki authored
-
- 10 Nov, 2021 3 commits
-
Masaki Kozuki authored
-
eqy authored
-
eqy authored
-
- 02 Nov, 2021 3 commits
-
Hubert Lu authored
Enable multihead attention
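
A rough usage sketch of the contrib fused self-attention being enabled here; the constructor and forward arguments are approximations of the apex.contrib API and may differ in this revision.

```python
# Rough sketch -- argument names and shapes are approximate, not
# verified against this revision of apex.contrib.
import torch
from apex.contrib.multihead_attn import SelfMultiheadAttn

attn = SelfMultiheadAttn(1024, 16, dropout=0.1, impl="fast").cuda().half()

# Inputs are (seq_len, batch, embed_dim), fp16, on the GPU.
x = torch.randn(64, 8, 1024, device="cuda", dtype=torch.float16)
out, _ = attn(x, x, x, key_padding_mask=None, need_weights=False,
              attn_mask=None, is_training=True)
```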
-
Hubert Lu authored
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
-
hubertlu-tw authored
-
- 01 Nov, 2021 3 commits
-
hubertlu-tw authored
Fix rocblas_gemmex namespace
Fix namespace
Clean up comments
-
hubertlu-tw authored
Enable HIP float to half conversion
-
hubertlu-tw authored
Fix some spacing
-
- 29 Oct, 2021 2 commits
-
Peng authored
-
hubertlu-tw authored
-
- 28 Oct, 2021 1 commit
-
hubertlu-tw authored
-
- 27 Oct, 2021 3 commits
-
Masaki Kozuki authored
* Persistent LayerNorm: Multi-CTA Rewrite
* autocast support

Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>
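
A brief sketch, assuming the persistent kernels are exposed through the contrib FastLayerNorm module (only certain hidden sizes have dedicated kernels):

```python
# Sketch: assumes apex.contrib.layer_norm.FastLayerNorm fronts the
# persistent kernels touched in this commit.
import torch
from apex.contrib.layer_norm import FastLayerNorm

ln = FastLayerNorm(1024).cuda()
x = torch.randn(8, 512, 1024, device="cuda")

with torch.cuda.amp.autocast():  # the commit adds autocast support
    y = ln(x)
```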
-
Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility). Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while
    multi-threading to prevent crashes and hangs due to connection
    keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1
* removing get_args and replace import - phase 2
* move ppu to apex.transformer.pipeline_parallel
* update two __init__.py
* update READMEs
* mpu -> parallel_state & tensor_parallel
* fix
* remove non-pipeline files
* separate schedules.py - phase 1
* dissect schedules.py
* data_iterators -> batch
* remove optimizer from forward_backward_step funcs
* init test
* Apply 2 suggestion(s) to 2 file(s)
* fix cyclic import
* fix syntax of Callable
* fix - 1
* move directory as testing is used for the pp test as well
* add some functions for num microbatches calculator
* model is a list in pipeline parallel
* skip build num microbatch calculator
* fix test
* assert -> raise
* skip args printing
* specify tensor shape everywhere even if None - phase 1
* private timers
* passing tensor shape & dtype around
* update dtype handling by introducing helper func
* write helper func to reduce cyclomatic complexity
* remove duplicate
* update
* move split_tensor_into_1d_equal_chunks to avoid cyclic import
* tmp
* cosmetic
* move gather_split_1d_tensor to avoid cyclic imports
* remove debug print
* add outer loop
* early return if possible
* cosmetic
* passing around tensor shape
* refactor test
* add script to learn batch sampler behavior
* update
* minibatch splitter
* add minibatch splitter
* split minibatch into microbatches
* minor changes
* uncomment split batch for test's sake
* set as attribute
* study the behavior of no pipelining
* debug 1
* reflect test util namespace change
* update readme
* cosmetic in test
* add model build helper func for interleaving sched
* adding model builder from megatron
* can be cyclic import
* fix
* enable interleaving test, but failing even if forward only
* fix batch preparation
* add explanation
* print data parallel size
* fix typo
* Add Megatron-style GPT model by Rishi (Co-authored-by: Rishi Puri <riship@nvidia.com>)
* update
* type hint for jit
* fix forward_backward_no_pipelining test
* pipeline forward backward seems to hang if not forward only
* fix typo
* debug
* add p2p test
* simplify
* fix
* tentative
* set both tmp and pmp to 1
* init
* fix typo
* fix
* fix path of divide
* set seed for tmp
* update upon Eddie's comment
* fix typo
* adding failing data loader test
* fix
* megatron still failing
* check in
* with the nested loop of new order, interleaving seems fine
* cosmetic change
* make `forward_backward_pipelining_with_interleaving` private
* warn users that the interleaving sched is unstable
* move noop handler to no pipelining
* comment out rank_print
* make `build_model` more flexible
* skip megatron test tentatively
* correctly comment out rank_print
* correctly comment out rank_print
* correctly comment out rank_print
* skip appropriately
* remove wip p2p comm test
* update type hint of model_provider_func
* disable tf32 in each test script
* skip interleaving w/ backward
* rename as mpu is the old name
* remove broken case
* expose build_model func
* delete `dist.ring_exchange` func call and `use_ring_exchange` argument
* nit fixes
* check in
* remove unused file
* update the list
* update tensor shape
* remove mixed dtype case
* use torch.distributed.run
* 2020 -> 2021
* another 2020 -> 2021
* docstring & type hint
* fix teardown
* update
* change to experimental
* check if warned

Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
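
A minimal initialization sketch for the renamed namespaces (`parallel_state` replacing the old `mpu`); launcher and argument details are approximate for this revision.

```python
# Sketch: launch with `python -m torch.distributed.run --nproc_per_node=4 ...`
# (this commit switches the tests to torch.distributed.run); argument
# order/names for initialize_model_parallel are approximate.
import torch
from apex.transformer import parallel_state

torch.distributed.init_process_group(backend="nccl")
# 2-way tensor parallel x 2-way pipeline parallel across 4 ranks
parallel_state.initialize_model_parallel(2, 2)

print("tp rank:", parallel_state.get_tensor_model_parallel_rank(),
      "pp rank:", parallel_state.get_pipeline_model_parallel_rank())
```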
-
- 26 Oct, 2021 1 commit
-
hubertlu authored
-