- 20 Apr, 2022 1 commit
Thor Johnsen authored
Peer memory halo exchange
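For readers unfamiliar with the term: a halo exchange swaps the boundary rows of a spatially partitioned tensor with neighboring ranks. Below is a minimal, hypothetical sketch of the idea using generic torch.distributed P2P calls; the function and variable names are illustrative only, and this is not the Apex peer-memory implementation, which relies on direct GPU peer access rather than these calls.

```python
# Hypothetical sketch of a halo exchange between neighboring ranks.
# Assumes an initialized process group; not Apex's peer-memory kernel.
import torch
import torch.distributed as dist

def exchange_halos(x: torch.Tensor, halo: int) -> torch.Tensor:
    """Exchange `halo` rows with the previous/next rank along dim 0."""
    rank, world = dist.get_rank(), dist.get_world_size()
    prev_rank, next_rank = (rank - 1) % world, (rank + 1) % world

    send_up, send_down = x[:halo].contiguous(), x[-halo:].contiguous()
    recv_up, recv_down = torch.empty_like(send_up), torch.empty_like(send_down)

    reqs = [
        dist.isend(send_up, prev_rank),
        dist.isend(send_down, next_rank),
        dist.irecv(recv_up, prev_rank),
        dist.irecv(recv_down, next_rank),
    ]
    for r in reqs:
        r.wait()

    # Stitch the received halo rows around the local shard.
    return torch.cat([recv_up, x, recv_down], dim=0)
```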
-
- 19 Apr, 2022 1 commit
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 14 Apr, 2022 1 commit
Thor Johnsen authored
-
- 13 Apr, 2022 1 commit
Thor Johnsen authored
-
- 08 Apr, 2022 3 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 07 Apr, 2022 2 commits
Masaki Kozuki authored
* add warning to pyprof
* add warning to reparameterization

Note: this module is already not import-able, as follows:

```
(base) root@c4bb3f161482:/vscode/apex# python -c 'import torch; import apex; from apex import reparameterization'
/vscode/apex/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/vscode/apex/apex/reparameterization/__init__.py:2: FutureWarning: reparameterization will be removed by the end of June, 2022
  warnings.warn("reparameterization will be removed by the end of June, 2022", FutureWarning)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/vscode/apex/apex/reparameterization/__init__.py", line 4, in <module>
    from .weight_norm import WeightNorm
  File "/vscode/apex/apex/reparameterization/weight_norm.py", line 3, in <module>
    from ..fp16_utils import Fused_Weight_Norm
ImportError: cannot import name 'Fused_Weight_Norm' from 'apex.fp16_utils' (/vscode/apex/apex/fp16_utils/__init__.py)
```
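For context, the deprecation pattern referenced here is simply a FutureWarning emitted from the package's __init__.py; the warning text below is taken from the log above, and the surrounding file layout is assumed.

```python
# Minimal sketch of the deprecation warning added to apex.pyprof's __init__.py.
import warnings

warnings.warn(
    "pyprof will be removed by the end of June, 2022",
    FutureWarning,
)
```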
Masaki Kozuki authored
* add test
* destroy model parallel was missing
-
- 05 Apr, 2022 2 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
- 03 Apr, 2022 1 commit
Thor Johnsen authored
-
- 02 Apr, 2022 4 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 01 Apr, 2022 3 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 31 Mar, 2022 3 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 30 Mar, 2022 2 commits
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion
  The following modules are enabled using cuDNN runtime fusion:
  1) Conv-Bias-ReLU (+backward)
  2) Conv-Bias (+backward)
  3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest
  - Remove redundant dtype casts
  - Simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
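A minimal sketch of exercising a conv + bias + ReLU pattern under torch.cuda.amp.autocast, as the unittest change above describes; the modules below are plain PyTorch stand-ins, not the Apex fused implementations, and the shapes are arbitrary.

```python
# Conv + bias + ReLU under autocast (requires a CUDA device).
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=True).cuda()
x = torch.randn(4, 8, 32, 32, device="cuda", requires_grad=True)

with torch.cuda.amp.autocast():
    y = F.relu(conv(x))      # conv + bias + ReLU, the pattern targeted by runtime fusion
y.float().sum().backward()   # exercise the backward path as well
```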
-
Thor Johnsen authored
-
- 29 Mar, 2022 2 commits
Thor Johnsen authored
-
Thor Johnsen authored
-
- 28 Mar, 2022 1 commit
Thor Johnsen authored
-
- 25 Mar, 2022 7 commits
yjk21 authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Masaki Kozuki authored
* try PyTorch custom TestCase class
* revert
* initial working example
* update
* data utils
* fix imports
* hardcode backend to nccl
* fix signature
* fix typo
* mapping
* set device
* init
* refactor cross entropy
* remove unused import & destroy model parallel
* refactor random
* fix test
* remove migrated tests
* refactor
* init
* separate affine weight init
* init model parallel
* split more
* weight init fix part 1
* use cpu init for consistency between native and tensor parallel
* black
* add col parallel
* use a 3D tensor of square matrix for column parallel linear
* skip the failing cases
* migrate layers test
* pipeline parallel forward/backward
* fix typo
* fix typo
* fix
* fix pipeline world size
* black
* rm `run_pipeline_parallel_test` in favor of test_pipeline_parallel_fwd_bwd.py
* stop logging
* set log level
* black
* license and format
* fix
* skip tf32 as matrices are small
* remove potentially inappropriate license
* Apply suggestions from code review
* remove `TODO` comment
* `torch.testing.assert_allclose` -> `torch.testing.assert_close`
* remove comment-outs
* remove unused import
* minor fix
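For context, a hypothetical sketch of the test pattern described above: initialize a process group with the NCCL backend, pin one device per rank, and compare results with torch.testing.assert_close. The Apex-specific model-parallel setup/teardown calls are omitted, and the launch command in the comment is an assumption.

```python
# Launched e.g. via `torchrun --nproc_per_node=2 this_file.py` (assumed).
import os
import torch
import torch.distributed as dist

def run_worker() -> None:
    dist.init_process_group(backend="nccl")   # backend hardcoded to nccl
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)         # one GPU per process

    torch.manual_seed(0)                      # same data on every rank
    x = torch.randn(4, 4, device="cuda")
    expected = x * dist.get_world_size()
    dist.all_reduce(x)                        # in-place sum across ranks
    torch.testing.assert_close(x, expected)   # replaces deprecated assert_allclose

    dist.destroy_process_group()

if __name__ == "__main__":
    run_worker()
```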
-
Masaki Kozuki authored
* update
* Add comment to `destroy_model_parallel`
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 24 Mar, 2022 4 commits
Thor Johnsen authored
-
Masaki Kozuki authored
Take-over of #1097
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* Add fast CUDA focal loss implementation.
* Enable fast math for CUDA focal loss.
* Correct typo.
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK
  The former is defined in torch/csrc/profiler/cuda.cpp so it's not available usually.
  The latter however is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK.
* add test
* clean up
* guard for torchvision

Co-authored-by: Wil Kong <alpha0422@gmail.com>
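For context, a reference (non-fused) focal-loss computation in plain PyTorch, following the standard formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); this Python sketch only illustrates the math and does not represent the fast CUDA kernel added above.

```python
# Reference binary focal loss in plain PyTorch (not the Apex CUDA kernel).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), averaged over elements."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()
```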
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 23 Mar, 2022 2 commits
Thor Johnsen authored
-
Thor Johnsen authored
-