Commits · fa8bd7e68c791805e555195a246ee4d8be464dcd · OpenDAS / apex

10 Nov, 2021 1 commit
- check in (#1205) · fa8bd7e6
  eqy authored Nov 09, 2021
  
  fa8bd7e6

27 Oct, 2021 2 commits

`FastLayerNorm` compat with `autocast` (#1203) · ae757634

Masaki Kozuki authored Oct 27, 2021



* Persistent LayerNorm: Multi-CTA Rewrite

* autocast support

Co-authored-by: Young-Jun Ko <youngjun.ko@gmail.com>

ae757634

Pipeline Model Parallel (#1202) · 63d5dd63

Masaki Kozuki authored Oct 27, 2021



* Init apex.ppu (pipeline model parallel utility)

Reference commit:

```
commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
Merge: 14f2c684 7b293d9b
Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
Date:   Wed Sep 22 22:57:54 2021 -0700

    Merge branch 'add_BOS' into 'main'

    Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives

    See merge request ADLR/megatron-lm!328
```

* removing get_args and replace import - phase 1

* removing get_args and replace import - phase 2

* move ppu to apex.transformer.pipeline_parallel

* update two __init__.py

* update READMEs

* mpu -> parallel_state & tensor_parallel

* fix

* remove non-pipeline files

* separate schedules.py - phase 1

* dissect schedules.py

* data_iterators -> batch

* remove optimizer from forward_backward_step funcs

* init test

* Apply 2 suggestion(s) to 2 file(s)

* fix cyclic import

* fix syntax of Callable

* fix - 1

* move directory, as testing is used for pp test as well

* add some functions for num microbatches calculator

* model is a list in pipeline parallel

* skip build num microbatch calculator

* fix test

* assert -> raise

* skip args printing

* specify tensor shape everywhere even if None - phase 1

* private timers

* passing tensor shape & dtype around

* update dtype handling by introducing helper func

* write helper func to reduce cyclomatic complexity

* remove duplicate

* update

* move split_tensor_into_1d_equal_chunks to avoid cyclic import

* tmp

* cosmetic

* move gather_split_1d_tensor to avoid cyclic imports

* remove debug print

* add outer loop

* early return if possible

* cosmetic

* passing around tensor shape

* refactor test

* add script to learn batch sampler behavior

* update

* minibatch splitter

* add minibatch splitter

* split minibatch into microbatches

* minor changes

* uncomment split batch for test sake

* set as attribute

* study the behavior of no pipelining

* debug 1

* reflect test util namespace change

* update readme

* cosmetic in test

* add model build helper func for interleaving sched

* adding model builder from megatron

* can be cyclic import

* fix

* enable interleaving test, but failing even if forward only

* fix batch preparation

* add explanation

* print data parallel size

* fix typo

* Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>

* update

* type hint for jit

* fix forward_backward_no_pipelining test

* pipeline forward backward seems to hang if not forward only

* fix typo

* debug

* add p2p test

* simplify

* fix

* tentative

* set both tmp and pmp to 1

* init

* fix typo

* fix

* fix path of divide

* set seed for tmp

* update upon Eddie's comment

* fix typo

* adding failing data loader test

* fix

* megatron still failing

* check in

* with the nested loop of new order, interleaving seems fine

* cosmetic change

* make `forward_backward_pipelining_with_interleaving` private

* warn users that interleaving sched is unstable

* move noop handler to no pipelining

* comment out rank_print

* make `build_model` more flexible

* skip megatron test tentatively

* correctly comment out rank_print

* correctly comment out rank_print

* correctly comment out rank_print

* skip appropriately

* remove wip p2p comm test

* update type hint of model_provider_func

* disable tf32 in each test script

* skip interleaving w/ backward

* rename as mpu is the old name

* remove broken case

* expose build_model func

* delete `dist.ring_exchange` func call and `use_ring_exchange` argument

* nit fixes

* check in

* remove unused file

* update the list

* update tensor shape

* remove mixed dtype case

* use torch.distributed.run

* 2020 -> 2021

* another 2020 -> 2021

* docstring & type hint

* fix teardown

* update

* change to experimental

* check if warned

Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>

63d5dd63
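
Several items in the #1202 list above deal with splitting a minibatch into microbatches and with a "num microbatches calculator". The following is a minimal sketch of that idea in plain Python, not the code from `apex.transformer.pipeline_parallel`. The function names `num_microbatches` and `split_into_microbatches` are hypothetical, the relation between global batch size, micro batch size, and data parallel size is an assumption borrowed from the Megatron-LM convention, and lists stand in for tensors.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def num_microbatches(global_batch_size: int, micro_batch_size: int, data_parallel_size: int) -> int:
    """Number of microbatches each data-parallel rank processes per step (assumed convention)."""
    samples_per_round = micro_batch_size * data_parallel_size
    # Raise instead of assert so the check survives `python -O`.
    if global_batch_size % samples_per_round != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"micro batch size x data parallel size ({samples_per_round})"
        )
    return global_batch_size // samples_per_round


def split_into_microbatches(minibatch: Sequence[T], micro_batch_size: int) -> List[Sequence[T]]:
    """Cut one minibatch into consecutive, equally sized microbatches."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    if len(minibatch) % micro_batch_size != 0:
        raise ValueError("minibatch length must be a multiple of micro_batch_size")
    return [
        minibatch[start:start + micro_batch_size]
        for start in range(0, len(minibatch), micro_batch_size)
    ]


# Hypothetical numbers: 32 samples per step, microbatches of 4, 2 data-parallel ranks.
n_micro = num_microbatches(global_batch_size=32, micro_batch_size=4, data_parallel_size=2)
# Each rank sees 16 samples per step and feeds them through the pipeline 4 at a time.
local_minibatch = list(range(16))
microbatches = split_into_microbatches(local_minibatch, micro_batch_size=4)
```

With these numbers each rank runs four forward/backward passes per optimizer step, one per microbatch.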

23 Oct, 2021 1 commit

Use out-of-place to avoid D2D copy in tensor parallel cross entropy (#1198) · 3303b3e7

Masaki Kozuki authored Oct 23, 2021



* switch from clone to out-of-place subtract

* Update apex/mpu/cross_entropy.py

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Eddie Yan <eddiey@nvidia.com>

3303b3e7

18 Oct, 2021 1 commit

remove THC headers/functions (#1192) · 0c7d8e3f

Masaki Kozuki authored Oct 19, 2021

Changes include
- THC headers removal
- TH macros replacement
- fix some typos in comments

0c7d8e3f

16 Oct, 2021 1 commit
- replace (#1191) · 60821f53
  Masaki Kozuki authored Oct 16, 2021
  
  60821f53

14 Oct, 2021 2 commits

change chunking scheme for full-allreduce case, add parameter order argument,... · 1d5f7e55

Burc Eryilmaz authored Oct 13, 2021

change chunking scheme for full-allreduce case, add parameter order argument, both to enable contiguous chunking of allgather (#1190)

1d5f7e55

Fix dist lamb (#1185) · d9a46fde

Nan Zheng authored Oct 14, 2021

1. remove the weight broadcast in the constructor
2. disable unnecessary allreduces for clip-after-ar

d9a46fde

13 Oct, 2021 1 commit
- check in (#1189) · 4e9fae9b
  eqy authored Oct 13, 2021
  
  4e9fae9b

08 Oct, 2021 2 commits
- Remove `custom_fwd`/`custom_bwd` from fused softmax (#1188) · 14ccf598
  Masaki Kozuki authored Oct 09, 2021
```
* run backward

* remove custom_fwd/custom_bwd
```
  14ccf598
- check in (#1187) · 3ad9db2a
  eqy authored Oct 07, 2021
  
  3ad9db2a

07 Oct, 2021 1 commit
- Update layer_norm_cuda_kernel.cu (#1184) · 5adf7bc2
  eqy authored Oct 06, 2021
  
  5adf7bc2

06 Oct, 2021 1 commit
- ColumnParallelLinearWithAsyncAllreduce autocast support (#1183) · b3da6036
  Masaki Kozuki authored Oct 06, 2021
```
* [ColumnParallelLinear] Test behavior in autocast

* fix test

* casts manually to autocast dtype
```
  b3da6036

02 Oct, 2021 1 commit

transformer utils (#1181) · 365fdc18

Masaki Kozuki authored Oct 02, 2021


Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

30 Sep, 2021 1 commit
- use cuda caching allocator from pytorch (#1180) · bdac244e
  X Wang authored Sep 30, 2021
  
  bdac244e

28 Sep, 2021 1 commit
- cleanup missing THCDeviceUtils.cuh header (#1177) · 2a559c51
  X Wang authored Sep 28, 2021
  
  2a559c51

24 Sep, 2021 2 commits
- Fix typo in contrib FusedLamb. (#1172) · 70d4a0ba
  romerojosh authored Sep 24, 2021
  
  70d4a0ba
- THCDeviceUtils.cuh -> ATen/cuda/DeviceUtils.cuh (#1173) · 76daa454
  Masaki Kozuki authored Sep 24, 2021
  
  76daa454

08 Sep, 2021 1 commit

enable ninja (#1164) · 9ce0a10f

Masaki Kozuki authored Sep 08, 2021

- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`

9ce0a10f
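
The two changes listed for #1164 amount to moving include paths out of the raw compiler flags and into the extension's `include_dirs` argument. A minimal sketch of that separation in plain Python follows. The helper `split_include_flags` and the example flags are hypothetical and are not taken from apex's `setup.py`.

```python
from typing import List, Tuple


def split_include_flags(args: List[str]) -> Tuple[List[str], List[str]]:
    """Separate `-I<dir>` and `-I <dir>` flags from the other compiler arguments."""
    include_dirs: List[str] = []
    remaining: List[str] = []
    it = iter(args)
    for arg in it:
        if arg == "-I":
            # Two-token form: the directory is the next argument.
            directory = next(it, None)
            if directory is not None:
                include_dirs.append(directory)
        elif arg.startswith("-I"):
            # One-token form: strip the "-I" prefix.
            include_dirs.append(arg[2:])
        else:
            remaining.append(arg)
    return include_dirs, remaining


# Hypothetical flag list mixing optimization flags and include paths.
include_dirs, nvcc_flags = split_include_flags(
    ["-O3", "-I/path/to/dir", "--use_fast_math", "-I", "csrc"]
)
# The results would then be passed along the lines of
#   CUDAExtension(..., include_dirs=include_dirs,
#                 extra_compile_args={"nvcc": nvcc_flags})
# which is left as a comment so the sketch runs without PyTorch installed.
```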

04 Sep, 2021 1 commit

fix CUBLAS guards (#1162) · 54b93919

Burc Eryilmaz authored Sep 04, 2021



* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

* safer guard around CUBLAS constants, remove unreferenced variable

* more guard changes

* guard against cublas version instead of cuda

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

54b93919

02 Sep, 2021 13 commits
- Merge pull request #1161 from NVIDIA/optional_caller_supplied_communicator · ae1cdd64
  Thor Johnsen authored Sep 02, 2021
```
Optional NCCL communicator argument to init method
```
  ae1cdd64
- Optional NCCL communicator argument to init method · e777bddb
  Thor Johnsen authored Sep 02, 2021
  
  e777bddb
- Merge pull request #1160 from NVIDIA/bug_fix_in_wgrad · 9b880665
  Thor Johnsen authored Sep 02, 2021
```
Bug fix in wgrad
```
  9b880665
- Bug fix in wgrad · 9e295728
  Thor Johnsen authored Sep 02, 2021
  
  9e295728
- Merge pull request #1159 from NVIDIA/more_bug_fixes · 0506fe36
  Thor Johnsen authored Sep 02, 2021
```
Bug fixes
```
  0506fe36
- Revert some changes · 8c4a0075
  Thor Johnsen authored Sep 02, 2021
  
  8c4a0075
- Bug fixes · 8cdcc821
  Thor Johnsen authored Sep 02, 2021
  
  8cdcc821
- Merge pull request #1158 from NVIDIA/bug_fixes · 0cb1cb3b
  Thor Johnsen authored Sep 02, 2021
```
Various bug fixes in fused spatial parallel bottleneck block
```
  0cb1cb3b
- More detailed output · 67a0ffcb
  Thor Johnsen authored Sep 02, 2021
  
  67a0ffcb
- Bug fixes · bc9114c9
  Thor Johnsen authored Sep 02, 2021
  
  bc9114c9
- option to set param views to flat buffer (#1152) · 17eec271
  Burc Eryilmaz authored Sep 02, 2021
```
* option to set param views to flat buffer

* remove redundant variables in init_stage1
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Co-authored-by: ptrblck <ptrblck@users.noreply.github.com>
```
  17eec271
- use prescaling for collective (#1157) · 2e98baa7
  Burc Eryilmaz authored Sep 02, 2021
  
  2e98baa7
- Add full all-reduce code path for DistributedFusedAdam (#1146) · 1cb9c5c3
  Kexin Yu authored Sep 01, 2021
```
* add full all-reduce code path

* debug

* debug
Co-authored-by: ptrblck <ptrblck@users.noreply.github.com>
```
  1cb9c5c3

01 Sep, 2021 5 commits

Merge pull request #1154 from NVIDIA/rework_spatial_bottleneck_code_split · d934eca3

Thor Johnsen authored Sep 01, 2021

```
Add functions to compute grad_out1, grad_out1_halo
```

d934eca3

Add functions to compute grad_out1, grad_out1_halo · b6980a0d

Thor Johnsen authored Sep 01, 2021

b6980a0d

Seryilmaz/fuse norm into scale (#1149) · 4d190db6

Burc Eryilmaz authored Sep 01, 2021



* fuse norm into scale

* add fused norm into dlamb

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

4d190db6

Seryilmaz/more cublas lt (#1147) · 6af09dd9

Burc Eryilmaz authored Aug 31, 2021



* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

6af09dd9

Merge pull request #1148 from azrael417/thorsten-view-fix · 9d86158d

Kexin Yu authored Aug 31, 2021

```
wrapper function for flat view creation in _lazy_init_stage2
```

9d86158d

31 Aug, 2021 2 commits
- Merge pull request #1151 from NVIDIA/spatial_fast_bottleneck · ed713c84
  Thor Johnsen authored Aug 31, 2021
```
Spatially Distributed Fast Bottleneck block
```
  ed713c84
- Add module tests · bbc95c0a
  Thor Johnsen authored Aug 31, 2021
  
  bbc95c0a