- 08 Aug, 2022 1 commit
-
-
hubertlu-tw authored
-
- 29 Jul, 2022 1 commit
-
-
hubertlu-tw authored
-
- 26 Jul, 2022 1 commit
-
-
Tim Moon authored
* Improvements in distributed Adam optimizer for Megatron
  Add option to allocate gradient buckets out of one large buffer. Add option to initialize params in user-provided order. Perform communication when saving optimizer state. Support param sync with any dtype.
* Style fixes in distributed Adam helper classes
  Review suggestions from @crcrpar
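A minimal usage sketch for the optimizer described above. The import path matches apex's contrib optimizers, but the `contiguous_grad_buffer` keyword is an assumption inferred from the "one large buffer" option in the commit message, not a confirmed signature.
```python
# Sketch: driving the distributed Adam optimizer from apex.contrib.
# `contiguous_grad_buffer` is an assumed flag name for the single large gradient buffer.
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = DistributedFusedAdam(
    model.parameters(),
    lr=1e-4,
    contiguous_grad_buffer=True,  # assumed keyword; see lead-in note
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```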
-
- 21 Jul, 2022 1 commit
-
-
Thor Johnsen authored
-
- 14 Jul, 2022 1 commit
-
-
Masaki Kozuki authored
* follow the current signature
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* call .backward on outputs
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
* update the other caller of _softmax_backward_data
  Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
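The signature being followed here is PyTorch's private `torch._softmax_backward_data`, whose last argument moved from a tensor to a dtype in newer releases. A hedged sketch of a version-tolerant caller; treating 1.11 as the cutover version is an assumption.
```python
# Sketch: tolerate both signatures of the private torch._softmax_backward_data.
import torch
from packaging.version import parse as parse_version

_NEW_SIGNATURE = parse_version(torch.__version__.split("+")[0]) >= parse_version("1.11")

def softmax_backward(grad_output, output, dim, inputs):
    if _NEW_SIGNATURE:
        # Newer releases take the input dtype as the last argument.
        return torch._softmax_backward_data(grad_output, output, dim, inputs.dtype)
    # Older releases take the input tensor itself.
    return torch._softmax_backward_data(grad_output, output, dim, inputs)
```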
-
- 05 Jul, 2022 1 commit
-
-
Tim Moon authored
* Add features to distributed Adam for Megatron support
  Support gradient clipping, gradient scaling, FP32 grad accumulation, and multiple dtypes and devices.
* Restore closure arg to distributed Adam
  Review suggestion from @crcrpar
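Since the commit adds gradient scaling, a training step would follow the standard PyTorch scaler pattern shown below; how the distributed optimizer consumes the scaler internally is not spelled out in the commit, so this is the generic pattern only.
```python
# Sketch: mixed-precision step with gradient scaling around an optimizer,
# as the grad-scaling support above would be exercised in a training loop.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, batch, target, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch), target)
    scaler.scale(loss).backward()  # FP32 grad accumulation happens inside the optimizer
    scaler.step(optimizer)
    scaler.update()
    return loss
```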
-
- 23 Jun, 2022 1 commit
-
-
Tim Moon authored
* Increase default bucket size in distributed Adam
* Move distributed Adam unit test to contrib tests
  Integrate into unit testing framework
* Tweak hyperparameters for dist Adam optimizer test
  Improves numerical stability so we can keep tight tolerances. Adopting suggestions from @crcrpar.
* Use distributed test infrastructure in distributed Adam unit test
  Suggestion from @crcrpar.
-
- 22 Jun, 2022 1 commit
-
-
Tim Moon authored
* Gradient clipping routine with fused kernels
  Identical API to PyTorch. Falls back to the PyTorch implementation when not computing the L2 norm.
* Add unit test for gradient clipping
* Add fp16 case to gradient clipping unit test
* Tweaks to grad clipping unit test
  Review suggestions from @crcrpar
* Debug gradient clipping tests
  When checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
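Because the fused routine keeps the PyTorch API, its call site should mirror `torch.nn.utils.clip_grad_norm_`; the apex.contrib import path below is an assumption, with a fallback to the reference implementation.
```python
# Sketch: fused gradient clipping with the same API as torch.nn.utils.clip_grad_norm_.
import torch

try:
    from apex.contrib.clip_grad import clip_grad_norm_  # assumed module path
except ImportError:
    from torch.nn.utils import clip_grad_norm_  # reference implementation

model = torch.nn.Linear(16, 16).cuda()
model(torch.randn(4, 16, device="cuda")).sum().backward()

# norm_type=2.0 hits the fused L2 fast path described above;
# other norm types fall back to the PyTorch implementation.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
```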
-
- 16 Jun, 2022 1 commit
-
-
Kevin Stephano authored
Remove legacy fuser usage from multihead attention in contrib in favor of the default, which should be nvfuser. Modify test scripts to activate fusion. (#1403)
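Test scripts of this era typically activate fusion with the TorchScript fuser context manager; a sketch with an illustrative scripted function, not the contrib multihead-attention module itself.
```python
# Sketch: selecting nvfuser ("fuser2") for a TorchScript function.
import torch

@torch.jit.script
def bias_gelu(x, bias):
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y / 1.41421356237))

x = torch.randn(128, 1024, device="cuda", requires_grad=True)
bias = torch.randn(1024, device="cuda", requires_grad=True)

with torch.jit.fuser("fuser2"):  # "fuser2" selects nvfuser
    for _ in range(3):           # a few warm-up iterations let the fuser kick in
        out = bias_gelu(x, bias)
        out.sum().backward()
```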
-
- 14 Jun, 2022 1 commit
-
-
Tim Moon authored
Adjust test options to have tighter tolerances.
-
- 13 Jun, 2022 1 commit
-
-
Tim Moon authored
-
- 31 May, 2022 1 commit
-
-
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent a lengthy if-statement
-
- 29 Apr, 2022 1 commit
-
-
yjk21 authored
-
- 21 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
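A generic sketch of a runtime version guard used to skip tests, in the spirit of the bullets above; the specific version threshold and guarded feature are assumptions, not the exact guard from the commit.
```python
# Sketch: skip tests at runtime based on the installed PyTorch version.
import unittest
import torch
from packaging.version import parse as parse_version

TORCH_AT_LEAST_1_12 = parse_version(torch.__version__.split("+")[0]) >= parse_version("1.12")

@unittest.skipUnless(TORCH_AT_LEAST_1_12, "requires PyTorch >= 1.12")
class GuardedTest(unittest.TestCase):
    def test_feature(self):
        self.assertTrue(TORCH_AT_LEAST_1_12)  # placeholder body
```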
-
- 19 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 14 Apr, 2022 2 commits
-
-
mahathis authored
* Add support for the memory format API (torch.channels_last) in GBN
  Group Batch Norm (GBN) is an NHWC operation: it assumes that the underlying memory format of an input tensor is NHWC, and it originally did not support PyTorch's memory_format API. To support PyTorch's memory_format API, i.e., .to(memory_format=...) or .contiguous(memory_format=...), we add a torch_channels_last flag to indicate whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag allows GBN to handle the memory formats of input tensors properly.
  An example of using memory_format in GBN:
  """
  from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
  GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
  """
  GBN handles the following cases:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN generates a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input tensor to torch.channels_last and generates a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN generates a torch.contiguous_format output tensor.
* Add GBN unit tests for channels_last memory format
Co-authored-by: hubertlu-tw <hubertlu@amd.com>
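Extending the example quoted in the commit message, a hedged end-to-end sketch of case 1 (input already converted to channels_last); tensor shapes, the half-precision input, and the .cuda() placement are illustrative assumptions.
```python
# Sketch: feed a channels_last input to the contrib Group Batch Norm.
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True).cuda()

x = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)  # case 1 in the list above

y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```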
-
Thor Johnsen authored
-
- 13 Apr, 2022 2 commits
-
-
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* clean up layer norm
* Fix some bugs
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
Thor Johnsen authored
-
- 08 Apr, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 06 Apr, 2022 1 commit
-
-
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs
-
- 05 Apr, 2022 2 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 03 Apr, 2022 1 commit
-
-
Thor Johnsen authored
-
- 02 Apr, 2022 4 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 01 Apr, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 31 Mar, 2022 3 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-
Thor Johnsen authored
-
- 30 Mar, 2022 2 commits
-
-
Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion
  The following modules are enabled using cuDNN runtime fusion:
  1) Conv-Bias-ReLU (+backward)
  2) Conv-Bias (+backward)
  3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest
  - Remove redundant dtype casts
  - Simulate the usage in the unittest by using torch.cuda.amp.autocast
  Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
* Fixed save_for_backward
  Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
  Co-authored-by: root <root@luna-0277.selene.nvidia.com>
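A sketch of the reference (unfused) Conv-Bias-ReLU path under autocast, i.e. the computation the cuDNN runtime-fused modules above replace; the fused apex.contrib entry points themselves are not called here because their exact signatures are not spelled out in the commit message.
```python
# Sketch: reference Conv-Bias-ReLU under autocast on channels_last tensors.
import torch
import torch.nn.functional as F

x = torch.randn(32, 64, 56, 56, device="cuda").to(memory_format=torch.channels_last)
weight = torch.randn(128, 64, 3, 3, device="cuda").to(memory_format=torch.channels_last)
bias = torch.randn(128, device="cuda")

with torch.cuda.amp.autocast():
    # A fused module computes the same result without materializing intermediates.
    y_ref = F.relu(F.conv2d(x, weight, bias, stride=1, padding=1))
```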
-
Thor Johnsen authored
-
- 29 Mar, 2022 2 commits
-
-
Thor Johnsen authored
-
Thor Johnsen authored
-