Commits · 1fd257e2cd777f1ef7df37590f6dc6b2a73cc518 · OpenDAS / apex

19 Oct, 2021 1 commit
- Enable the following modules in apex/contrib: · 1fd257e2
  Abhishree authored Oct 19, 2021
```
1) multihead_attn
2) xentropy
3) fused_adam and distributed_fused_adam
```
04 Oct, 2021 1 commit
- in multi tensor apply, skip empty tensors (#54) · 297ab210
  Jeff Daily authored Oct 04, 2021
  
07 Sep, 2021 1 commit

Enable group batch norm (--bnp) on ROCm (only bn_group = 1) (#51) · e57c84e0

sarunyap authored Sep 07, 2021

* Enable group batch norm (--bnp) on ROCm (only bn_group = 1)

Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1).
The multi-GPU case (bn_group > 1) will be revisited in the future.

The following are the main changes:

1) Use MIOpen data structures/functions in HIP instead of CUDNN
2) For the warp-level primitive code, we ensure that the code operates
   on 64-thread-wide warps instead of 32-thread-wide warps
3) Disable all the bn_group > 1 paths

Notes:

1) Multi-stream is not tested.
2) Performance has not been optimized.

* Fix bnp hipification

Avoid calling hipify-perl in setup.py and rely on PyTorch's internal
hipification mechanism.

* Make bnp data pointers contiguous

The contrib group batch norm implementation assumes that all input
tensors are contiguous.  When non-contiguous tensors are passed to the
function, it returns incorrect results.  This commit explicitly calls
.contiguous() on all input tensors before accessing their data
pointers.
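
To illustrate why the explicit .contiguous() call matters, here is a
minimal pure-Python sketch.  It is not the bnp code: the helper names and
the flat-list storage model are assumptions made for illustration only.
It shows a reader that ignores strides returning the wrong element order
for a transposed view.

```python
from itertools import product

def read_with_strides(storage, shape, strides):
    """Read elements in logical (row-major) order, honoring the strides."""
    out = []
    for index in product(*(range(n) for n in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        out.append(storage[offset])
    return out

def read_assuming_contiguous(storage, shape):
    """Read the way a kernel does when it assumes a dense row-major layout."""
    count = 1
    for n in shape:
        count *= n
    return storage[:count]

storage = [0, 1, 2, 3, 4, 5]      # a 2x3 row-major buffer
shape, strides = (3, 2), (1, 3)   # transposed view over the same buffer

expected = read_with_strides(storage, shape, strides)
wrong = read_assuming_contiguous(storage, shape)

# A .contiguous()-style copy lays the logical elements out densely,
# after which the stride-unaware reader sees the right order.
contiguous_copy = list(expected)
fixed = read_assuming_contiguous(contiguous_copy, shape)
```

Here `wrong` differs from `expected`, while `fixed` matches it.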

* Fix HIP lane id in bnp

Fix typo

* Fix ReLU bitmask for HIP in bnp

The ReLU bitmask is derived using the __ballot function, which returns
a 64-bit value in HIP.  This commit fixes the ReLU bitmask storage size
and offsets on ROCm.

This patch also fixes the kernel to set ReLU bitmask to 1 when the data
is less than or equal to zero (not only less than).  Not doing so can
cause a stability issue.
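
The bitmask construction described above can be sketched in plain
Python.  This is a simulation under stated assumptions, not the kernel
code: `ballot` and `relu_bitmask` are hypothetical helpers, and a Python
list stands in for the lanes of one warp.

```python
def ballot(predicates):
    """Simulate a warp ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask

def relu_bitmask(values):
    """Set the bit for every lane whose value is less than or equal to zero."""
    return ballot(v <= 0.0 for v in values)

WARP_SIZE = 64  # lanes per warp in this simulation, per the 64-thread warps above

values = [1.0] * WARP_SIZE
values[0] = -1.0    # negative: bit set
values[1] = 0.0     # exactly zero: bit must also be set
values[63] = -2.0   # highest lane: needs bit 63

mask = relu_bitmask(values)
```

Because lane 63 contributes bit 63, the mask no longer fits in 32 bits,
which is why the storage size and offsets have to change.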

* Remove multiple of 64 offset for HIP in bnp

The multiple of 64 offset is not necessary.

* Use FP16 intermediate output to determine whether to rectify in bnp

Group batch norm takes FP16 tensors and produces FP16 output.  However,
all arithmetic operations are done in FP32, so intermediate outputs are
in FP32.  In the fusion kernels, ReLU examines the FP32 intermediate
output to decide whether to rectify it.  ReLU must rectify the
intermediate output if it is less than or "equal" to zero.  The
intermediate FP32 output can be slightly above zero and still become
zero when it is converted to FP16.  In this case, the output is not
rectified when it should be.  Since the output is not rectified in the
forward pass, the gradient is not rectified in the backward pass.  This
can cause a stability issue.

This patch can have a negative impact on the performance of group batch
norm as we perform FP32-FP16 conversion multiple times.
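
The rounding effect described above can be reproduced with the Python
standard library.  This is a sketch, not the kernel logic: the helper
names are hypothetical, and `struct` round-trips are used to emulate
FP32 and FP16 storage because Python floats are FP64.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def rectify_on_fp32(x):
    """Decide on the FP32 intermediate value."""
    return to_fp32(x) <= 0.0

def rectify_on_fp16(x):
    """Decide on the value after conversion to FP16."""
    return to_fp16(x) <= 0.0

tiny = 1e-10  # positive in FP32, but underflows to zero in FP16

fp32_decision = rectify_on_fp32(tiny)
fp16_decision = rectify_on_fp16(tiny)
```

For `tiny`, the FP32 check does not rectify even though the stored FP16
output is zero, whereas the FP16 check does.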

* Disable dispatchX ParallelSums in HIP in bnp

dispatchX is not required for the bn_group = 1 case.

* Use traditional load/store for HIP in bnp

The built-in function has a high floating-point rounding error.  Thus,
we replace it with the traditional load/store.  Because doing so breaks
the aligned pointer property in the load/store functions, we
conservatively use traditional load/store for all memory accesses.

* Replace shfl_down with shfl_sync in parallel sums for HIP in bnp

This commit separates the HIP code from the CUDA code in parallel sums.

* Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp

Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is
no longer needed.

* Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp

* Add test for bnp

The test evaluates correctness of batch norm, batch norm + ReLU, and
batch norm + add + ReLU against the reference implementation.

For the forward activation output, we validate it against PyTorch's
implementation.  The group batch norm activation output must be allclose
to the PyTorch activation output for the test to pass.

For the backward gradient output, we validate it against the Python
implementation.  Due to floating-point rounding error in the batch
norm implementation, the group batch norm gradient output might not be
allclose to the Python implementation output when ReLU is used,
although the majority of the elements are very close to each other.
Thus, we use a norm difference threshold instead of allclose to
determine whether the test passes or fails.
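
The difference between the two acceptance criteria can be sketched as
follows.  The tolerances, the threshold, and the helper names are
assumptions chosen for illustration; the values actually used by the
test are not stated here.

```python
import math

def allclose(actual, reference, rtol=1e-3, atol=1e-3):
    """Element-wise check: every element must be within tolerance."""
    return all(abs(a - r) <= atol + rtol * abs(r)
               for a, r in zip(actual, reference))

def relative_norm_diff(actual, reference):
    """L2 norm of the difference, relative to the L2 norm of the reference."""
    diff = math.sqrt(sum((a - r) ** 2 for a, r in zip(actual, reference)))
    ref = math.sqrt(sum(r * r for r in reference))
    return diff / ref if ref else diff

reference = [1.0] * 1000
actual = list(reference)
actual[0] = 1.05  # a single outlier; every other element matches exactly

THRESHOLD = 1e-2  # hypothetical norm difference threshold

passes_allclose = allclose(actual, reference)
passes_norm = relative_norm_diff(actual, reference) < THRESHOLD
```

A single outlier fails the element-wise check, while the norm-based
check still passes because the remaining elements agree.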

* Use the warp size variable rather than hard-coding the warp size in bnp

Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use
warpSize in the device kernels instead of hard coding the warp size.

01 Sep, 2021 2 commits
- Merge pull request #53 from ROCmSoftwarePlatform/hipify_workaround_include_dirs · 37d8410c
  Jithun Nair authored Sep 01, 2021
```
work around hipify not finding headers
```
- work around hipify not finding headers · 888e72ad
  Jeff Daily authored Sep 01, 2021
  
31 Aug, 2021 2 commits
- Merge pull request #52 from ROCmSoftwarePlatform/add_distributed_fused_lamb · 02ada95d
  Jithun Nair authored Aug 31, 2021
```
add distributed fused lamb
```
- enable --distributed_lamb for rocm · 955256d1
  Jeff Daily authored Aug 31, 2021
  
25 Jun, 2021 2 commits
- Merge pull request #50 from ROCmSoftwarePlatform/numeric_torch_version_check · 95797c8d
  Jeff Daily authored Jun 25, 2021
```
Make torch version check numeric
```
- Make torch version check numeric · 799785ab
  Jithun Nair authored Jun 25, 2021
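
The commit body does not show the check itself.  As a hedged
illustration of why a numeric comparison is needed, the sketch below
uses a hypothetical `numeric_version` helper to compare version strings
as integer tuples instead of as text.

```python
import re

def numeric_version(version):
    """Extract (major, minor) as integers from a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))

# String comparison orders "1.10.0" before "1.7.0" because "1" < "7".
lexicographic = "1.10.0" >= "1.7.0"

# Integer tuples compare the way version numbers are meant to.
numeric = numeric_version("1.10.0") >= numeric_version("1.7.0")
```

Only the leading major and minor fields are parsed, so suffixes such as
a local build tag are ignored.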
  
04 Mar, 2021 3 commits
- Merge pull request #48 from ROCmSoftwarePlatform/IFU-2020-03-04 · 107f1ff5
  Jeff Daily authored Mar 04, 2021
```
IFU-2020-03-04
```
- Merge remote-tracking branch 'upstream/master' into IFU-2020-03-04 · c285a67c
  Jeff Daily authored Mar 04, 2021
  
- Merge pull request #47 from ROCmSoftwarePlatform/revert_workaround · dde39c9f
  Peng authored Mar 04, 2021
```
Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)"
```
25 Feb, 2021 1 commit
- Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)" · fbb8cd93
  Jeff Daily authored Feb 25, 2021
```
This reverts commit bdd481d1.
```
23 Feb, 2021 1 commit
- fast layer norm (#1037) · e2083df5
  yjk21 authored Feb 23, 2021
  
10 Feb, 2021 1 commit

fix import container_abcs issue (#1049) · a78ccf0b

Shoufa Chen authored Feb 10, 2021

* copy-paste friendly

* fix import container_abcs issue

Nightly PyTorch has removed `container_abcs` from `torch._six`.
https://github.com/pytorch/pytorch/commit/58eb23378f2a376565a66ac32c93a316c45b6131#diff-b3c160475f0fbe8ad50310f92d3534172ba98203387a962b7dc8f4a23b15cf4dL35

* keep existing import for PyTorch 1.7 and earlier
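
One way to keep the old import working while supporting newer PyTorch
is a guarded import.  This is a minimal sketch and an assumption, not
necessarily the form used in this commit; it falls back to the standard
library's `collections.abc` when `torch._six` does not provide
`container_abcs`.

```python
try:
    # PyTorch 1.7 and earlier expose container_abcs here.
    from torch._six import container_abcs
except ImportError:
    # Fallback when torch._six or its container_abcs is unavailable.
    import collections.abc as container_abcs

def is_sequence(obj):
    """Example use: the alias behaves the same on either import path."""
    return isinstance(obj, container_abcs.Sequence)
```

Catching `ImportError` also covers the case where the module itself is
missing, since `ModuleNotFoundError` is a subclass of it.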

25 Jan, 2021 1 commit

fix bugs in syncbn (#46) · 3f49dbf0

Jeff Daily authored Jan 25, 2021

- incorrect use of __shfl_down
- fix warp size assumptions
- update unit tests to exit on failure
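
The commit does not spell out what was wrong.  As a hedged illustration
of how a warp size assumption can break a shuffle-down reduction, the
simulation below models one warp as a Python list.  The functions are
hypothetical stand-ins, not the syncbn kernels.

```python
def shfl_down(values, delta):
    """Simulate shuffle-down: lane i reads lane i + delta, or keeps its
    own value when the source lane is out of range."""
    width = len(values)
    return [values[i + delta] if i + delta < width else values[i]
            for i in range(width)]

def warp_reduce_sum(values, assumed_warp_size):
    """Tree reduction whose starting offset comes from the assumed warp size.
    Lane 0 holds the result."""
    vals = list(values)
    offset = assumed_warp_size // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]

lanes = list(range(64))  # one value per lane of a 64-wide warp

correct = warp_reduce_sum(lanes, assumed_warp_size=64)
truncated = warp_reduce_sum(lanes, assumed_warp_size=32)
```

With the 32-lane assumption, lane 0 only accumulates the first 32
lanes, so the upper half of the warp is silently dropped.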

21 Jan, 2021 2 commits
- fix cross-compiled ROCm builds when no GPUs detected (#45) · c1e88fae
  Jeff Daily authored Jan 21, 2021
  
- use __launch_bounds__ for multi_tensor_apply (#44) · 5baa68d3
  Jeff Daily authored Jan 21, 2021
```
use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
```
20 Jan, 2021 1 commit
- cuda rng changes for graph capture with apex MHA (#1025) · eefb1ba2
  Burc Eryilmaz authored Jan 20, 2021
```
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
19 Jan, 2021 1 commit
- Merge pull request #43 from ROCmSoftwarePlatform/IFU-2021-01-18 · 85b56d01
  Jeff Daily authored Jan 19, 2021
```
IFU-2021-01-18
```
18 Jan, 2021 5 commits
- skip failing tests on ROCm · 13c8d152
  Jeff Daily authored Jan 18, 2021
  
- missing #include <c10/cuda/CUDAGuard.h> · 4ebf2b90
  Jeff Daily authored Jan 18, 2021
  
- update setup.py to more closely align with upstream · 2332c4d6
  Jeff Daily authored Jan 18, 2021
```
Mostly whitespace and formatting issues are addressed.
The diff with upstream is reduced, so the ROCm changes are clearer.
```
- Merge remote-tracking branch 'upstream/master' · dcc7b513
  Jeff Daily authored Jan 18, 2021
```
Conflicts:
csrc/multi_tensor_apply.cuh
setup.py
tests/L0/run_optimizers/test_adagrad.py
tests/L0/run_optimizers/test_fused_optimizer.py
tests/L0/run_optimizers/test_lamb.py
```
- Merge pull request #42 from sarunyap/reduce-block-fix · d061bf20
  Jeff Daily authored Jan 18, 2021
```
Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm
```
15 Jan, 2021 1 commit
- Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm · ff232fb8
  Sarunya Pumma authored Nov 28, 2020
  
31 Dec, 2020 3 commits
- Merge pull request #41 from lcskrishna/cl/skip-tests · 76e4e054
  Chaitanya Sri Krishna Lolla authored Dec 31, 2020
```
Skip the unit tests
```
- missing import statement · 41bbf93c
  lcskrishna authored Dec 31, 2020
  
- skip the unit tests · 5bae299e
  lcskrishna authored Dec 31, 2020
  
17 Dec, 2020 3 commits
- Merge pull request #1015 from jpool-nv/patch-1 · 154c6336
  Thor Johnsen authored Dec 17, 2020
```
Update ASP README to highlight default recipe
```
- Update ASP README to highlight default recipe · 56914d4f
  jpool-nv authored Dec 17, 2020
```
The recipe was presented after some non-standard API calls, so this change moves the suggested usage up, gives it its own section, and reinforces the suggested usage in the non-standard section.
```
- Merge pull request #38 from lcskrishna/cl/rocm-hipify-revamp · 663d5a4d
  Chaitanya Sri Krishna Lolla authored Dec 16, 2020
```
Hipify revamp changes for apex extensions on ROCm.
```
16 Dec, 2020 1 commit
- update readme and minor changes · 3fdb8db9
  lcskrishna authored Dec 16, 2020
  
15 Dec, 2020 4 commits
- fixed spelling mistakes · 8efd60b2
  lcskrishna authored Dec 15, 2020
  
- update readme and add a note about deprecating old hipification process · 3b917de4
  lcskrishna authored Dec 14, 2020
  
- fix compile args for multi-tensor extension · f4ad42c1
  lcskrishna authored Dec 14, 2020
  
- refactor based on latest hipify revamp · 91003340
  lcskrishna authored Dec 14, 2020
  
10 Dec, 2020 1 commit
- cleanup of extensions · 539bad24
  lcskrishna authored Dec 10, 2020
  
09 Dec, 2020 2 commits
- updated hipify changes for apex contrib · 9b4c68c7
  lcskrishna authored Dec 08, 2020
  
- update setup file for rocm due to newer hipify changes · ef209a74
  lcskrishna authored Dec 08, 2020
  