Commits · a0ed4151c036d9d860e1ce8b9f5eee4f04092c4b · OpenDAS / apex

25 Mar, 2022 2 commits

[transformer] Format & Test Refactoring (#1325) · a0ed4151

Masaki Kozuki authored Mar 24, 2022

* try PyTorch custom TestCase class

* revert

* initial working example

* update

* data utils

* fix imports

* hardcode backend to nccl

* fix signature

* fix typo

* mapping

* set device

* init

* refactor x entropy

* remove unused import & destroy model parallel

* refactor random

* fix test

* remove migrated tests

* refactor

* init

* separate affine weight init

* init model parallel

* split more

* weight init fix part 1

* use cpu init for consistency btwn native and tensor parallel

* black

* add col parallel

* use a 3D tensor of square matrix for column parallel linear

* skip the failing cases

* migrate layers test

* pipeline parallel forward/backward

* fix typo

* fix typo

* fix

* fix pipeline world size

* black

* rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py

* stop logging

* set log level

* black

* license and format

* fix

* skip tf32 as matrices are small

* remove potentially inappropriate license

* Apply suggestions from code review

* remove `TODO` comment

* `torch.testing.assert_allclose` -> `torch.testing.assert_close`

* remove comment-outs

* remove unused import

* minor fix

a0ed4151

[transformer] `parallel_state`: Position Embedding (#1343) · f10b4b89
Masaki Kozuki authored Mar 24, 2022
```
* update

* Add comment to `destroy_model_parallel`
```
f10b4b89

24 Mar, 2022 1 commit

Add CUDA Focal Loss Implementation (#1337) · 28f8539c

Masaki Kozuki authored Mar 24, 2022



Take-over of #1097

* Add fast CUDA focal loss implementation.

* Enable fast math for CUDA focal loss.

* Correct typo.

* replace deprecated macros

* TORCH_CUDA_CHECK -> AT_CUDA_CHECK

The former is defined in torch/csrc/profiler/cuda.cpp, so it is usually not available.
The latter, however, is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.

* add test

* clean up

* guard for torchvision
Co-authored-by: Wil Kong <alpha0422@gmail.com>

28f8539c

18 Mar, 2022 1 commit

Minor `README.md` edit + docs update from @crcrpar (#1334) · feae3851

eqy authored Mar 17, 2022



* update ngc link and dockerhub container tag

* update

* update

* update

* Update README.md
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

feae3851

16 Mar, 2022 1 commit
- [transformer] Warn only when `gradient_accumulation_fusion` is `True` and... · 7950a82d
  Masaki Kozuki authored Mar 15, 2022
```
[transformer] Warn only when `gradient_accumulation_fusion` is `True` and `fused_weight_gradient_mlp_cuda` is missing (#1317)
```
  7950a82d
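
The condition named in this commit title can be sketched as below. This is a minimal illustration, not the apex implementation: only `gradient_accumulation_fusion` and `fused_weight_gradient_mlp_cuda` come from the commit message, while the helper names `extension_available` and `maybe_warn_missing_extension` and the warning text are assumptions.

```python
import importlib.util
import warnings

# Name of the optional CUDA extension, taken from the commit title.
EXTENSION_NAME = "fused_weight_gradient_mlp_cuda"


def extension_available(name: str = EXTENSION_NAME) -> bool:
    """Return True if the optional extension module can be located."""
    return importlib.util.find_spec(name) is not None


def maybe_warn_missing_extension(gradient_accumulation_fusion: bool,
                                 available: bool) -> bool:
    """Hypothetical helper: warn only if fusion is requested but unavailable.

    Returns True when a warning was emitted.
    """
    if gradient_accumulation_fusion and not available:
        warnings.warn(
            f"`gradient_accumulation_fusion` is True but `{EXTENSION_NAME}` "
            "is missing; the fused path cannot be used."
        )
        return True
    return False
```

With this shape, users who never enable the flag see no warning even when the extension was not built.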
15 Mar, 2022 4 commits
- Add Template of Bug Report (#1321) · a56e88dc
  Masaki Kozuki authored Mar 15, 2022
```
* initial issue_template -- bug

* Apply suggestions from code review
Co-authored-by: eqy <eqy@cs.washington.edu>
```
  a56e88dc
- Update cudnn-frontend submodule (#1327) · 5bed56a7
  Yuanzhe Dong authored Mar 15, 2022
```
* Move forward cudnn-frontend

* update throw_if to adapt cudnn frontend
```
  5bed56a7
- Merge pull request #1329 from NVIDIA/leave_bottleneck_masks_as_bool · 1a43f292
  Thor Johnsen authored Mar 15, 2022
```
Leave bottleneck masks as bool
```
  1a43f292
- Leave bottleneck masks as bool · bd7c1a0f
  Thor Johnsen authored Mar 14, 2022
  
  bd7c1a0f
11 Mar, 2022 1 commit

contrib/fmha: Add option to zero out tensors before math (#1322) · 7e1c22d0

chochowski authored Mar 11, 2022



* extend api to allow forced memory zeroing (empty() does not do it)

* typo fix

* ctx change

* move zeroing flag to ctx

* update test
Co-authored-by: mchochowski <mchochowski@nvidia.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

7e1c22d0

08 Mar, 2022 4 commits
- Revert "Deprecate reparameterization module (#1316)" (#1319) · 44c30436
  Masaki Kozuki authored Mar 08, 2022
```
This reverts commit adbe075a.
```
  44c30436
- Revert "officially deprecate and clarify the plan of pyprof removal (#1315)" (#1320) · 79143c31
  Masaki Kozuki authored Mar 08, 2022
```
This reverts commit 74e04667.
```
  79143c31
- Deprecate reparameterization module (#1316) · adbe075a
  Masaki Kozuki authored Mar 08, 2022
  
  adbe075a
- officially deprecate and clarify the plan of pyprof removal (#1315) · 74e04667
  Masaki Kozuki authored Mar 08, 2022
  
  74e04667
01 Mar, 2022 1 commit
- [transformer] Update `build_model` function to support encoder&decoder model (#1307) · 59978d5e
  Masaki Kozuki authored Feb 28, 2022
```
* update build_model to support enc&dec model

* fix typo: cur_sargs -> cur_args

* enc&dec path: correctly update pre/post process
```
  59978d5e
27 Feb, 2022 1 commit
- build fused grad accum w/ wgrad only if cuda>10 (#1312) · 47c269b6
  Masaki Kozuki authored Feb 26, 2022
  
  47c269b6
26 Feb, 2022 1 commit

[transformer] Fuse grad accumulation with wgrad (#1297) · ddc08039

Masaki Kozuki authored Feb 25, 2022



* fuse grad accumulation w/ weight grad
Co-authored-by: Sangkug Lym <slym@nvidia.com>

* fp32 training path

* not using *args, **kwargs

* backward: moved the tensor dimension conversion
Co-authored-by: Sangkug Lym <slym@nvidia.com>

* move files to csrc/megatron

* fix fp32 path

* fix typo

* add  to  in order to select the correct custom extension

* fix typo

* comment on import guard

* update test: enable gradient_accumulation_fusion

* 86

* remove redundant call of `test_column_parallel_linear`
Co-authored-by: Sangkug Lym <slym@nvidia.com>

ddc08039

25 Feb, 2022 3 commits
- add setter of pipeline model parallel split rank (#1306) · e95c3b9c
  Masaki Kozuki authored Feb 25, 2022
  
  e95c3b9c
- [transformer] use logger in microbatches module (#1302) · 17e1a1f6
  Masaki Kozuki authored Feb 25, 2022
  
  17e1a1f6
- skip FastLayerNorm (#1305) · 4506a687
  Masaki Kozuki authored Feb 24, 2022
  
  4506a687
23 Feb, 2022 4 commits
- be more flexible (#1299) · 199fa834
  Masaki Kozuki authored Feb 23, 2022
  
  199fa834
- access to pipeline_model_parallel_split_rank (#1300) · 069ff336
  Masaki Kozuki authored Feb 23, 2022
  
  069ff336
- Merge pull request #1301 from NVIDIA/bug_fix_in_fast_bottleneck · ab1a93a7
  Thor Johnsen authored Feb 23, 2022
```
Change data type for virtual tensors to float
```
  ab1a93a7
- Change data type for virtual tensors to float · 51e81314
  Thor Johnsen authored Feb 23, 2022
  
  51e81314
15 Feb, 2022 1 commit
- taking channels last 3d into account (#1284) · 39fc7ccf
  Masaki Kozuki authored Feb 15, 2022
  
  39fc7ccf
12 Feb, 2022 1 commit
- cast for `-Wc++11-narrowing` (#1288) · 1e218749
  Masaki Kozuki authored Feb 11, 2022
  
  1e218749
11 Feb, 2022 1 commit
- [FusedRMSNorm doc] document where epsilon is added (#1295) · c8c00ef5
  Stas Bekman authored Feb 11, 2022
```
* [FusedRMSNorm doc] add epsilon to formula

* correct

* better wording
```
  c8c00ef5
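
To make the placement of epsilon concrete, here is a small pure-Python reference. It assumes the form y = x / sqrt(mean(x^2) + eps) * gamma, with epsilon added to the mean square inside the root; the function name `rms_norm` and the 1-D list interface are illustrative only and are not the `FusedRMSNorm` API.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], gamma: Sequence[float],
             eps: float = 1e-5) -> List[float]:
    """Reference RMS normalization over a 1-D vector.

    Assumption: eps is added to the mean square, inside the square root.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]
```

Because epsilon sits inside the root, an all-zero input yields zeros instead of a division by zero.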
10 Feb, 2022 1 commit
- 8.6 requires CUDA 11.1 (#1289) · e1aa1fc1
  Masaki Kozuki authored Feb 10, 2022
  
  e1aa1fc1
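
A build script can gate target architectures on the toolkit version along these lines. This is a sketch under assumptions: the function name `gencode_flags` and the baseline `sm_70` target are hypothetical, and only the 8.6 / CUDA 11.1 pairing comes from the commit title.

```python
from typing import List, Tuple


def gencode_flags(cuda_version: Tuple[int, int]) -> List[str]:
    """Return hypothetical nvcc -gencode flags for a (major, minor) CUDA version."""
    flags = ["-gencode", "arch=compute_70,code=sm_70"]
    if cuda_version >= (11, 0):
        # Compute capability 8.0 needs CUDA 11.0 or newer.
        flags += ["-gencode", "arch=compute_80,code=sm_80"]
    if cuda_version >= (11, 1):
        # Compute capability 8.6 needs CUDA 11.1 or newer.
        flags += ["-gencode", "arch=compute_86,code=sm_86"]
    return flags
```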
07 Feb, 2022 1 commit
- fix and generate docs for FusedRMSNorm (#1285) · a786ca0c
  eqy authored Feb 07, 2022
  
  a786ca0c
04 Feb, 2022 1 commit

FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274) · 684c4733

eqy authored Feb 03, 2022



* FusedRMSNorm based on FusedLayerNorm

* refactor duplicated kernels

* delete comments

* delete comments

* cleanup

* cleanup

* cleanup, fixed clobbering forward_affine_mixed_dtypes

* fix pybind naming and add MixedFused test

* undo skipping

* check elementwise_affine

* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>

684c4733

01 Feb, 2022 2 commits

Add the permutation related support as the extension for asp lib. (#1194) · 89edb819

ChongyuNVIDIA authored Feb 02, 2022

* Add the permutation related support as the extension for asp lib.

* [Fix] Track the permutation sequence for progressive channel swap strategy.

* Fix the corner case where a layer is not sparse but still needs permutation applied because of its siblings.

* Fix the deprecated functions in ASP unit tests.

* Fix the sparsity info typo in ASP lib.

* [Enhancement] Set an identical random seed on all GPUs so that permutation search generates the same results.

* Update the README.md with identical random seed setting and NeurIPS info.

* Integrate the Pybind11 enhancement of permutation search into ASP lib.

89edb819

transformer: Allows for custom sync context in no pipelining forward backward function (#1281) · 79c01877
Masaki Kozuki authored Jan 31, 2022
```
* add kwarg of `custom_sync_context_handler`

* add kwargs to ignore custom_sync_context_handler, which was mistakenly passed to fwd/bwd funcs
```
79c01877
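
One way to picture the `custom_sync_context_handler` keyword is the sketch below. It assumes the common pattern in which all microbatches except the last run inside a "no sync" context and the last one runs outside it; the driver `run_microbatches` and its `step` callback are hypothetical names, not the apex function signature.

```python
from contextlib import contextmanager, nullcontext
from typing import Callable, ContextManager, Iterable, Optional


def run_microbatches(
    microbatches: Iterable[int],
    step: Callable[[int], None],
    custom_sync_context_handler: Optional[Callable[[], ContextManager]] = None,
) -> None:
    """Hypothetical driver: all but the last microbatch run inside the context."""
    batches = list(microbatches)
    if not batches:
        return
    # Fall back to a do-nothing context when no handler is given.
    handler = custom_sync_context_handler or nullcontext
    with handler():
        for mb in batches[:-1]:
            step(mb)
    # The last microbatch runs outside the context, where syncing would happen.
    step(batches[-1])


# Usage: record the order of events with a toy context manager.
events = []


@contextmanager
def recording_context():
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


run_microbatches([0, 1, 2], events.append, recording_context)
```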

31 Jan, 2022 2 commits

T5 pipeline parallel changes (#1279) · 0da60e10

Masaki Kozuki authored Jan 31, 2022

* Free output tensor on each pipeline stage for smaller memory footprint

see:
https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb

* ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff

* ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4

* remove position embedding group in destroy

* pass deallocate_pipeline_outputs to backward_step

* fix typo

* missing deallocate_pipeline_outputs

* fix typo: grad_ouptut -> grad_output

* update tests

* remove accessed todo

* test with data parallel size of 2 if there are 8 or more GPUs

0da60e10

fix group range to compute l2_norm (#1266) · a47d1a76

chochowski authored Jan 31, 2022



* fix graph capture failure, fix norm computation with full_ar and clip_after

* fix group range to compute l2_norm
Co-authored-by: seryilmaz <seryilmaz@nvidia.com>
Co-authored-by: mchochowski <mchochowski@nvidia.com>

a47d1a76

29 Jan, 2022 1 commit
- add inline asm 128-bit counter (#1265) · 2eafdb3d
  Burc Eryilmaz authored Jan 29, 2022
  
  2eafdb3d
28 Jan, 2022 2 commits

small changes in test and logger format (#1278) · b1c75f6f
Masaki Kozuki authored Jan 28, 2022
```
* cosmetic refactor in test

* log with PID

* log more info: rank, pid, filename, lineNo
```
b1c75f6f
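
A log format carrying rank, PID, filename and line number can be built with the standard `logging` module, as sketched here. `%(process)d`, `%(filename)s` and `%(lineno)d` are standard `LogRecord` attributes; the `RankFilter` class, the `make_logger` helper and the exact format string are assumptions for illustration, and the rank is passed in by hand rather than read from a distributed backend.

```python
import logging

# Hypothetical format string; `rank` is supplied by the filter below.
LOG_FORMAT = (
    "[rank %(rank)s pid %(process)d] "
    "%(filename)s:%(lineno)d %(levelname)s %(message)s"
)


class RankFilter(logging.Filter):
    """Attach a `rank` attribute to every record so the formatter can use it."""

    def __init__(self, rank: int) -> None:
        super().__init__()
        self.rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        record.rank = self.rank
        return True


def make_logger(name: str, rank: int) -> logging.Logger:
    """Create a logger whose handler prints rank, pid, filename and line number."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(RankFilter(rank))
    logger.addHandler(handler)
    return logger
```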

allow for `None` batch (#1280) · a960fe8c

Masaki Kozuki authored Jan 28, 2022

* have get_kth_microbatch deal with None batch

* broadcast based on tensor parallel rank

* dtype

* remove unnecessary .cuda()

Processes whose tensor parallel rank != 0 don't need to prepare any `torch.utils.data.DataLoader` instances, so the `batch` argument of `get_kth_microbatch` can be `None`, but the current implementation doesn't allow for it.

a960fe8c
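
The `None` handling described above can be sketched with plain Python sequences standing in for tensors. The name `get_kth_microbatch_sketch`, its parameter list and the list-of-fields batch layout are assumptions for illustration; the real function works on tensors.

```python
from typing import List, Optional, Sequence


def get_kth_microbatch_sketch(
    batch: Optional[Sequence[Sequence[int]]],
    k: int,
    micro_batch_size: int,
) -> Optional[List[Sequence[int]]]:
    """Slice the k-th microbatch from each field, or pass None through."""
    if batch is None:
        # Ranks that did not load data have nothing to slice.
        return None
    start = k * micro_batch_size
    end = start + micro_batch_size
    return [field[start:end] for field in batch]
```

A rank without a data loader can then call the function unconditionally and receive `None` back.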

21 Jan, 2022 2 commits

Grad scaler (#1277) · 2a4ab177

Masaki Kozuki authored Jan 21, 2022

* add keyword argument of `grad_scaler`

* update test

* pass dtype to fwd_step_func

* add log

* calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting

* option to turn off autocast inside forward_step function

Some users activate `autocast` outside the fwd/bwd functions.

* add missing arg of disable_autocast

* reorder args of no pipeline

2a4ab177

Fix missing `dtype` in `recv_forward` (#1276) · 45cd1001
eqy authored Jan 21, 2022
```
CC @crcrpar @ptrblck
```
45cd1001

19 Jan, 2022 1 commit
- pass flags to transducer joint kernel (#1273) · c4e85f7b
  Masaki Kozuki authored Jan 18, 2022
  
  c4e85f7b