- 15 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Add setup_simple.py for debugging the compilation issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve the filename collision between *.cpp files containing to-be-hipified code and *.cu files
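A stripped-down setup script of this kind builds only the one extension, so its compile errors are not buried in the output of the full apex build. A minimal sketch, assuming the usual apex source layout under csrc/megatron/ (the actual contents of setup_simple.py are not shown in this log):
"""
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Build only the scaled_masked_softmax_cuda extension in isolation.
setup(
    name="scaled_masked_softmax_cuda",
    ext_modules=[
        CUDAExtension(
            name="scaled_masked_softmax_cuda",
            sources=[
                "csrc/megatron/scaled_masked_softmax.cpp",
                "csrc/megatron/scaled_masked_softmax_cuda.cu",
            ],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
"""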
-
- 14 Apr, 2022 1 commit
-
-
mahathis authored
* Add support for the memory_format API (torch.channels_last) in GBN
  Group Batch Norm (GBN) is an NHWC operation: it assumes the underlying memory format of an input tensor is NHWC and originally did not support PyTorch's memory_format API. To support the memory_format API, i.e., .to(memory_format=...) or .contiguous(memory_format=...), we add a torch_channels_last flag that indicates whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag lets GBN handle the memory format of its input tensors properly.
  An example of using memory_format with GBN:
  """
  from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
  GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
  """
  The cases GBN handles are:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN produces a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input to torch.channels_last and produces a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN produces a torch.contiguous_format output tensor.
* Add GBN unit tests for the channels_last memory format

Co-authored-by: hubertlu-tw <hubertlu@amd.com>
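A short usage sketch of cases 1 and 2 above. The tensor shape, dtype, and planes value are illustrative assumptions; GBN kernels typically expect half-precision CUDA tensors:
"""
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1,
                       torch_channels_last=True).cuda().half()

# Case 1: channels_last input -> channels_last output.
x = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)
y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)

# Case 2: contiguous_format input is converted internally,
# and the output is still channels_last.
x2 = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
y2 = gbn(x2)
assert y2.is_contiguous(memory_format=torch.channels_last)
"""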
-
- 13 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* Merge the .so files
* ODR (one-definition-rule) fixes
* Fix build
* Update imports
* Apply psf/black with a max line length of 120
* Update
* Fix
* Update
* Build fixed again, but an undefined symbol remains
* Fix 2: the layer norm gradient symbol is still undefined
* Remove unused .cpp files
* Without layer_norm.cuh, the import works
* Importing fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* Clean up layer norm
* Fix some bugs

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
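A sketch of the "merge the .so files" idea: instead of building one extension per kernel, all multihead-attn sources go into a single fast_multihead_attn extension, so duplicated helpers such as the layer-norm functions live in exactly one object and the ODR/undefined-symbol problems disappear. The frontend .cpp name below is an assumption; the .cu names are taken from the conflict list elsewhere in this log:
"""
from torch.utils.cpp_extension import CUDAExtension

# One extension (one .so) for the whole --fast_multihead_attn build.
# HostApplyLayerNorm / HostLayerNormGradient are kept in a single
# translation unit instead of being duplicated across several .so files.
fast_multihead_attn = CUDAExtension(
    name="fast_multihead_attn",
    sources=[
        "apex/contrib/csrc/multihead_attn/multihead_attn_frontend.cpp",  # assumed frontend name
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu",
        "apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu",
        "apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu",
        "apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu",
        "apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu",
    ],
    extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
)
"""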
-
- 06 Apr, 2022 1 commit
-
-
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make the rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs
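One way such backward compatibility can be wired up is a version check in setup.py that only defines a guarding macro on sufficiently new ROCm builds of PyTorch, so older versions simply compile the old code path. A hedged sketch; the macro name and version cutoff are assumptions, not the actual values used in #74:
"""
import torch
from packaging.version import parse

extra_gemm_flag_defines = []
# Only newer ROCm builds of PyTorch expose what is needed for
# rocblas_gemm_flags_fp16_alt_impl; older versions silently fall back.
if torch.version.hip is not None and parse(torch.__version__.split("+")[0]) >= parse("1.10"):
    extra_gemm_flag_defines.append("-DUSE_ROCBLAS_GEMM_FLAGS_FP16_ALT_IMPL")

# extra_gemm_flag_defines is then appended to the extension's
# extra_compile_args, and the C++/HIP code wraps the flag usage in
# #ifdef USE_ROCBLAS_GEMM_FLAGS_FP16_ALT_IMPL ... #endif.
"""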
-
- 23 Mar, 2022 1 commit
-
-
Hubert Lu authored
* Add rocblas_alt_impl flag in MLP
* Refactor the rocblas_alt_impl implementation and only use it for backprop
-
- 18 Mar, 2022 1 commit
-
-
athitten authored
* Add the missing flags argument in the gemm_switch_fp32accum call
* Add the rocblas_alt_impl flag in MHA
  <rev> Add the rocblas_alt_impl flag for all backward gemms in the MHA module
* Use an ifdef for rocblas_gemm_flags_fp16_alt_impl to target various AMD hardware

Co-authored-by: hubertlu-tw <hubertlu@amd.com>
-
- 11 Mar, 2022 1 commit
-
-
Pruthvi Madugundu authored
-
- 16 Feb, 2022 1 commit
-
-
hubertlu-tw authored
-
- 28 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 26 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 25 Jan, 2022 2 commits
- 21 Jan, 2022 1 commit
-
-
athitten authored
Remove an unnecessary debug print statement.
-
- 14 Dec, 2021 3 commits
-
-
Jithun Nair authored
IFU-master-2021-12-08
-
Hubert Lu authored
-
Hubert Lu authored
* Skip failing unit tests
* Modify the test-skipping messages
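The skip pattern itself is the standard unittest one; a generic sketch, where the conditions and message texts are placeholders rather than the actual ones from this commit:
"""
import unittest
import torch

class TestFusedExtension(unittest.TestCase):
    @unittest.skipIf(not torch.cuda.is_available(), "requires a GPU")
    def test_forward(self):
        self.assertTrue(True)  # placeholder body

    @unittest.skip("Skipped on ROCm: known failure, tracked separately")
    def test_known_failure(self):
        self.assertTrue(False)  # never runs while skipped

if __name__ == "__main__":
    unittest.main()
"""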
-
- 13 Dec, 2021 1 commit
-
-
Hubert Lu authored
-
- 09 Dec, 2021 5 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
* Pass `self.mask_additive`
* clang-format
* Remove THCState
-
Kevin Stephano authored
* Add fused mixed precision LAMB optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
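A usage sketch, assuming the new optimizer is exposed as FusedMixedPrecisionLamb somewhere under apex.contrib.optimizers; the import path, class name, and hyperparameters below are assumptions, not taken from the commit:
"""
import torch
# Assumed location of the new optimizer; adjust to the actual module path.
from apex.contrib.optimizers.fused_lamb_mp import FusedMixedPrecisionLamb

model = torch.nn.Linear(1024, 1024).cuda().half()
# Mixed-precision LAMB: fp16 model parameters, with fp32 master weights and
# optimizer state maintained internally by the fused kernel.
optimizer = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(8, 1024, device="cuda", dtype=torch.half)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
"""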
-
Hubert Lu authored
-
hubertlu-tw authored
-
- 08 Dec, 2021 1 commit
-
-
Jithun Nair authored
IFU-2021-10-15 (+ remove redundant defines + C10_CUDA_CHECK)
-
- 06 Dec, 2021 2 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
Changes include:
- THC headers removal
- TH macros replacement
- fix some typos in comments

Conflicts:
apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu
apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu
apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu
apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu
apex/contrib/csrc/multihead_attn/strided_batched_gemm.h
-
- 03 Dec, 2021 2 commits
-
-
hubertlu-tw authored
-
hubertlu-tw authored
-
- 02 Dec, 2021 4 commits
-
-
Jithun Nair authored
* Use the --cuda_ext flag to build all supported extensions
* Don't remove --cuda_ext, since it'll be needed to build other extensions
* Need to clear all cmdline args so setup.py doesn't complain
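The last bullet refers to the common pattern of consuming custom build flags before setuptools parses the command line; a minimal sketch of that pattern (the surrounding CI script is not shown here):
"""
import sys

# Consume the custom flag and strip it from sys.argv so that setuptools
# does not reject it as an unknown option.
if "--cuda_ext" in sys.argv:
    build_cuda_extensions = True
    sys.argv.remove("--cuda_ext")
else:
    build_cuda_extensions = False
"""
On the pip side, such flags are typically forwarded with --global-option, e.g. pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./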
-
Hubert Lu authored
Add more unit tests for both distributed and extensions
-
hubertlu-tw authored
-
Hubert Lu authored
-
- 01 Dec, 2021 2 commits
- 29 Nov, 2021 1 commit
-
-
X Wang authored
-
- 22 Nov, 2021 1 commit
-
-
Hubert Lu authored
Change python3.6 to python
-
- 19 Nov, 2021 5 commits
-
-
Hubert Lu authored
-
Hubert Lu authored
-
eqy authored
* Minimal BERT pipeline parallel test
* Fix global and clean up
* Use get_forward_backward_func
* Clean up and fix some tests
-
Masaki Kozuki authored
Co-authored-by: Sangkug Lym <slym@nvidia.com>
-
Masaki Kozuki authored
* Initial logging use
* Fix
* Clean up
* fp32 p2p comm
* Init
* Dynamic global batch size with `MegatronPretrainingSampler`. I couldn't make this script work with `MegatronPretrainingRandomSampler`, because the random sampler seems to have requirements for global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now.
* Revive the original pipeline parallel test
* Update MULTIGPU_TEST: add a dynamic batch size test
* Run MegatronPretrainingRandomSampler
* Fix comment
* Fix
* Update
* Cosmetic
* Add note
* Apply 2 suggestion(s) to 2 file(s)
* Change following https://github.com/NVIDIA/apex/pull/1210
* Fix
-