Commits · f79993d92cfefbaf20a670dcbd61749ac70aed28 · OpenDAS / apex

15 Oct, 2021 1 commit
- Merge remote-tracking branch 'upstream/master' into IFU-master-2021-10-15 · f79993d9
  hubertlu-tw authored Oct 15, 2021
  
  f79993d9
14 Oct, 2021 2 commits
- change chunking scheme for full-allreduce case, add parameter order argument,... · 1d5f7e55
  Burc Eryilmaz authored Oct 13, 2021

  change chunking scheme for full-allreduce case, add parameter order argument, both to enable contiguous chunking of allgather (#1190)

  1d5f7e55
- Fix dist lamb (#1185) · d9a46fde
  Nan Zheng authored Oct 14, 2021

  1. remove the weight broadcast in the constructor
  2. disable unnecessary allreduces for clip-after-ar

  d9a46fde
13 Oct, 2021 1 commit
- check in (#1189) · 4e9fae9b
  eqy authored Oct 13, 2021
  
  4e9fae9b
08 Oct, 2021 2 commits
- Remove `custom_fwd`/`custom_bwd` from fused softmax (#1188) · 14ccf598
  Masaki Kozuki authored Oct 09, 2021
```
* run backward

* remove custom_fwd/custom_bwd
```
  14ccf598
- check in (#1187) · 3ad9db2a
  eqy authored Oct 07, 2021
  
  3ad9db2a
07 Oct, 2021 1 commit
- Update layer_norm_cuda_kernel.cu (#1184) · 5adf7bc2
  eqy authored Oct 06, 2021
  
  5adf7bc2
06 Oct, 2021 1 commit
- ColumnParallelLinearWithAsyncAllreduce autocast support (#1183) · b3da6036
  Masaki Kozuki authored Oct 06, 2021
```
* [ColumnParallelLinear] Test behavior in autocast

* fix test

* casts manually to autocast dtype
```
  b3da6036
04 Oct, 2021 1 commit
- in multi tensor apply, skip empty tensors (#54) · 297ab210
  Jeff Daily authored Oct 04, 2021
  
  297ab210
02 Oct, 2021 1 commit
- transformer utils (#1181) · 365fdc18
  Masaki Kozuki authored Oct 02, 2021

Co-authored-by: Piotr Bialecki <pbialecki@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Sangkug Lym <slym@nvidia.com>

365fdc18

30 Sep, 2021 1 commit
- use cuda caching allocator from pytorch (#1180) · bdac244e
  X Wang authored Sep 30, 2021
  
  bdac244e
28 Sep, 2021 1 commit
- cleanup missing THCDeviceUtils.cuh header (#1177) · 2a559c51
  X Wang authored Sep 28, 2021
  
  2a559c51
24 Sep, 2021 2 commits
- Fix typo in contrib FusedLamb. (#1172) · 70d4a0ba
  romerojosh authored Sep 24, 2021
  
  70d4a0ba
- THCDeviceUtils.cuh -> ATen/cuda/DeviceUtils.cuh (#1173) · 76daa454
  Masaki Kozuki authored Sep 24, 2021
  
  76daa454
08 Sep, 2021 1 commit
- enable ninja (#1164) · 9ce0a10f
  Masaki Kozuki authored Sep 08, 2021

- passing include directories to `CUDAExtension`'s `include_dirs` argument
- removing `-I/path/to/dir` arguments from `extra_compile_args`

9ce0a10f
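
The two changes listed for the ninja commit amount to moving include paths out of the compiler flags and into the extension's `include_dirs` argument. A minimal sketch of that split in plain Python follows; the helper name `split_include_flags` and the example paths are hypothetical and are not taken from apex's `setup.py`.

```python
def split_include_flags(extra_compile_args):
    """Separate `-I<dir>` / `-I <dir>` flags from the remaining compiler flags.

    Hypothetical helper for illustration; returns (include_dirs, other_flags).
    """
    include_dirs, other_flags = [], []
    args = iter(extra_compile_args)
    for arg in args:
        if arg == "-I":
            # Form "-I", "<dir>": the directory is the next argument.
            include_dirs.append(next(args))
        elif arg.startswith("-I"):
            # Form "-I<dir>": strip the two-character prefix.
            include_dirs.append(arg[2:])
        else:
            other_flags.append(arg)
    return include_dirs, other_flags


# Example flags (assumed, for illustration only).
flags = ["-O3", "-Icsrc", "-I", "third_party/include", "--use_fast_math"]
include_dirs, nvcc_flags = split_include_flags(flags)
# include_dirs would then be passed as CUDAExtension(..., include_dirs=include_dirs)
# and nvcc_flags as extra_compile_args={"nvcc": nvcc_flags}.
```

Keeping the directories in `include_dirs` lets the build backend see them as structured data rather than opaque flag strings.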

07 Sep, 2021 1 commit
- Enable group batch norm (--bnp) on ROCm (only bn_group = 1) (#51) · e57c84e0
  sarunyap authored Sep 07, 2021

* Enable group batch norm (--bnp) on ROCm (only bn_group = 1)

Enable NHWC group batch norm on a single GPU on ROCm (bn_group = 1).
The multi-GPU case (bn_group > 1) will be revisited in the future.

The following are the main changes:

1) Use MIOpen data structures/functions in HIP instead of CUDNN
2) For the warp-level primitive code, we ensure that the code operates
   on 64-thread-wide warps instead of 32-thread-wide warps
3) Disable all the bn_group > 1 paths

Notes:

1) Multi-stream is not tested.
2) We have not optimized for performance.

* Fix bnp hipification

Avoid calling hipify-perl in setup.py and rely on PyTorch's internal
hipification mechanism.

* Make bnp data pointers contiguous

The contrib group batch norm implementation assumes that all input
tensors are contiguous.  When non-contiguous tensors are passed to the
function, it gives a wrong result.  This commit explicitly calls
.contiguous() to make all input tensors contiguous before accessing
them.
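
To illustrate why a kernel that assumes contiguous input gives a wrong result on a non-contiguous tensor, here is a small pure-Python model of strided storage. It is only a sketch: the function names are hypothetical, and `make_contiguous` merely mimics what `.contiguous()` does conceptually (copying into row-major order).

```python
def read_strided(buffer, shape, strides):
    """Read a 2-D view element by element, honouring its strides."""
    rows, cols = shape
    return [[buffer[r * strides[0] + c * strides[1]] for c in range(cols)]
            for r in range(rows)]


def read_assuming_contiguous(buffer, shape):
    """Read a 2-D view the way a kernel would if it assumed row-major layout."""
    rows, cols = shape
    return [[buffer[r * cols + c] for c in range(cols)] for r in range(rows)]


def make_contiguous(buffer, shape, strides):
    """Copy a strided view into a fresh row-major buffer."""
    rows, cols = shape
    return [buffer[r * strides[0] + c * strides[1]]
            for r in range(rows) for c in range(cols)]


storage = [0, 1, 2, 3, 4, 5]          # 2x3 row-major data: [[0, 1, 2], [3, 4, 5]]
shape_t, strides_t = (3, 2), (1, 3)   # transposed 3x2 view: non-contiguous

expected = read_strided(storage, shape_t, strides_t)
wrong = read_assuming_contiguous(storage, shape_t)
fixed = read_assuming_contiguous(
    make_contiguous(storage, shape_t, strides_t), shape_t)
```

Here `wrong` differs from `expected` because the strides are ignored, while `fixed` matches it after the copy.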

* Fix HIP lane id in bnp

Fix typo

* Fix ReLU bitmask for HIP in bnp

The ReLU bitmask is derived by using the __ballot function which returns
a 64-bit value in HIP.  This commit fixes the ReLU bitmask storage size
and offsets on ROCm.

This patch also fixes the kernel to set ReLU bitmask to 1 when the data
is less than or equal to zero (not only less than).  Not doing so can
cause a stability issue.
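
The effect of the warp width on the bitmask storage can be sketched with a host-side Python simulation. This is not the kernel code: `ballot` only models the idea of packing one predicate bit per lane, and the input values are made up for illustration.

```python
def ballot(predicates):
    """Pack one bit per lane into an integer, lane 0 in the least significant bit."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def relu_bitmask(values, warp_size):
    """Return one mask word per warp; a set bit marks an element with value <= 0."""
    words = []
    for start in range(0, len(values), warp_size):
        warp = values[start:start + warp_size]
        words.append(ballot(v <= 0 for v in warp))
    return words


# 128 illustrative values: negative on even lanes, positive on odd lanes,
# with lane 1 set to exactly zero to exercise the "<=" condition.
values = [(-1.0 if i % 2 == 0 else 1.0) for i in range(128)]
values[1] = 0.0

cuda_words = relu_bitmask(values, 32)   # 4 words, each fits in 32 bits
hip_words = relu_bitmask(values, 64)    # 2 words, each needs 64 bits
```

With a 64-lane warp each mask word needs twice the storage and there are half as many words, which is why both the storage size and the offsets have to change.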

* Remove multiple of 64 offset for HIP in bnp

The multiple of 64 offset is not necessary.

* Use FP16 intermediate output to determine whether to rectify in bnp

Group batch norm takes FP16 tensors and produces the FP16 output,
however, all arithmetic operations are done in FP32, thus intermediate
outputs are in FP32.  For the fusion kernels, ReLU examines the FP32
intermediate output to decide whether to rectify it.  ReLU must rectify
the intermediate output if it is less than or "equal" to zero.  There is
a chance that the intermediate FP32 output is very close to zero, and
when it is converted to FP16, it becomes zero.  In this case, this
output is not rectified when it should be.  Since the output is not
rectified in the forward pass, the gradient is not rectified in the
backward pass.  This can cause a stability issue.

This patch can have a negative impact on the performance of group batch
norm as we perform FP32-FP16 conversion multiple times.
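
The rounding case described above can be reproduced on the host with Python's `struct` module, whose `e` format is IEEE half precision. This is only a sketch of the numeric effect: a Python float stands in for the FP32 intermediate, and the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def should_rectify_wide(x):
    """Decision taken on the wide (FP32-like) intermediate value."""
    return x <= 0.0


def should_rectify_fp16(x):
    """Decision taken on the value after conversion to FP16."""
    return to_fp16(x) <= 0.0


# Positive, but far below the smallest FP16 subnormal (about 6e-8),
# so it becomes exactly zero once converted to FP16.
tiny = 1e-9

wide_decision = should_rectify_wide(tiny)
fp16_decision = should_rectify_fp16(tiny)
```

The two decisions disagree for `tiny`: the wide value is not rectified, yet the stored FP16 output is zero and should be treated as rectified.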

* Disable dispatchX ParallelSums in HIP in bnp

dispatchX is not required for the bn_group = 1 case.

* Use traditional load/store for HIP in bnp

The built-in function has a high floating point rounding error.  Thus,
we replace it with the traditional load/store.  Doing so breaks the
aligned pointer property in the load/store functions.  We conservatively
use traditional load/store for all memory access.

* Replace shfl_down with shfl_sync in parallel sums for HIP in bnp

This commit separates the HIP code from the CUDA code in parallel sums

* Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp

Since the built-in function is removed, -U__HIP_NO_HALF_CONVERSIONS__ is
no longer needed.

* Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp

* Add test for bnp

The test evaluates correctness of batch norm, batch norm + ReLU, and
batch norm + add + ReLU against the reference implementation.

For the forward activation output, we validate it against PyTorch's
implementation.  The group batch norm activation output must be allclose
with the PyTorch activation output for the test to pass.

For the backward gradient output, we validate it against the Python
implementation.  Due to the floating point rounding error in the batch
norm implementation, the group batch norm gradient output might not be
allclose with the Python implementation output when ReLU is being used
although the majority of the elements are very close to each other.
Thus, we use the norm difference threshold to determine whether the test
is passed or failed instead of allclose.
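
A minimal sketch of the difference between the two checks, in plain Python. The element-wise check follows the usual `|a - b| <= atol + rtol * |b|` rule; the data and the 0.05 threshold are assumptions chosen for illustration and are not the values used by the actual test.

```python
import math


def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Element-wise closeness: every element must be within tolerance."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


def relative_norm_diff(a, b):
    """L2 norm of the difference, relative to the L2 norm of the reference."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ref = math.sqrt(sum(y * y for y in b))
    return diff / ref


reference = [1.0] * 1000
candidate = list(reference)
candidate[0] = 0.0   # a single gradient element flipped by a ReLU rounding difference

passes_allclose = allclose(candidate, reference)
norm_diff = relative_norm_diff(candidate, reference)
passes_norm = norm_diff < 0.05   # hypothetical threshold
```

A single mismatching element fails the element-wise check, while the norm-based check tolerates it as long as the overall difference stays small.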

* Use the warp size variable rather than hard-coding the warp size in bnp

Use C10_WARP_SIZE from c10/macros/Macros.h in the host functions and use
warpSize in the device kernels instead of hard coding the warp size.

e57c84e0

04 Sep, 2021 1 commit
- fix CUBLAS guards (#1162) · 54b93919
  Burc Eryilmaz authored Sep 04, 2021

* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

* safer guard around CUBLAS constants, remove unreferenced variable

* more guard changes

* guard against cublas version instead of cuda

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>

54b93919

02 Sep, 2021 13 commits
- Merge pull request #1161 from NVIDIA/optional_caller_supplied_communicator · ae1cdd64
  Thor Johnsen authored Sep 02, 2021
```
Optional NCCL communicator argument to init method
```
  ae1cdd64
- Optional NCCL communicator argument to init method · e777bddb
  Thor Johnsen authored Sep 02, 2021
  
  e777bddb
- Merge pull request #1160 from NVIDIA/bug_fix_in_wgrad · 9b880665
  Thor Johnsen authored Sep 02, 2021
```
Bug fix in wgrad
```
  9b880665
- Bug fix in wgrad · 9e295728
  Thor Johnsen authored Sep 02, 2021
  
  9e295728
- Merge pull request #1159 from NVIDIA/more_bug_fixes · 0506fe36
  Thor Johnsen authored Sep 02, 2021
```
Bug fixes
```
  0506fe36
- Revert some changes · 8c4a0075
  Thor Johnsen authored Sep 02, 2021
  
  8c4a0075
- Bug fixes · 8cdcc821
  Thor Johnsen authored Sep 02, 2021
  
  8cdcc821
- Merge pull request #1158 from NVIDIA/bug_fixes · 0cb1cb3b
  Thor Johnsen authored Sep 02, 2021
```
Various bug fixes in fused spatial parallel bottleneck block
```
  0cb1cb3b
- More detailed output · 67a0ffcb
  Thor Johnsen authored Sep 02, 2021
  
  67a0ffcb
- Bug fixes · bc9114c9
  Thor Johnsen authored Sep 02, 2021
  
  bc9114c9
- option to set param views to flat buffer (#1152) · 17eec271
  Burc Eryilmaz authored Sep 02, 2021
```
* option to set param views to flat buffer

* remove redundant variables in init_stage1

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
Co-authored-by: ptrblck <ptrblck@users.noreply.github.com>
```
  17eec271
- use prescaling for collective (#1157) · 2e98baa7
  Burc Eryilmaz authored Sep 02, 2021
  
  2e98baa7
- Add full all-reduce code path for DistributedFusedAdam (#1146) · 1cb9c5c3
  Kexin Yu authored Sep 01, 2021
```
* add full all-reduce code path

* debug

* debug

Co-authored-by: ptrblck <ptrblck@users.noreply.github.com>
```
  1cb9c5c3
01 Sep, 2021 7 commits
- Merge pull request #53 from ROCmSoftwarePlatform/hipify_workaround_include_dirs · 37d8410c
  Jithun Nair authored Sep 01, 2021
```
work around hipify not finding headers
```
  37d8410c
- Merge pull request #1154 from NVIDIA/rework_spatial_bottleneck_code_split · d934eca3
  Thor Johnsen authored Sep 01, 2021
```
Add functions to compute grad_out1, grad_out1_halo
```
  d934eca3
- Add functions to compute grad_out1, grad_out1_halo · b6980a0d
  Thor Johnsen authored Sep 01, 2021
  
  b6980a0d
- work around hipify not finding headers · 888e72ad
  Jeff Daily authored Sep 01, 2021
  
  888e72ad
- Seryilmaz/fuse norm into scale (#1149) · 4d190db6
  Burc Eryilmaz authored Sep 01, 2021
```
* fuse norm into scale

* add fused norm into dlamb

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  4d190db6
- Seryilmaz/more cublas lt (#1147) · 6af09dd9
  Burc Eryilmaz authored Aug 31, 2021
```
* support for fused dense layer with cublasLt, fusion in both fprop and bprop

* fix typo causing syntax error

* add fused GEMM+gelu+GEMM module

* fix typo for workspace size

* update cublas check for 11600

* add tests for fused dense layer

* fix CUDA 10.x path

Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
```
  6af09dd9
- Merge pull request #1148 from azrael417/thorsten-view-fix · 9d86158d
  Kexin Yu authored Aug 31, 2021
```
wrapper function for flat view creation in _lazy_init_stage2
```
  9d86158d
31 Aug, 2021 3 commits
- Merge pull request #52 from ROCmSoftwarePlatform/add_distributed_fused_lamb · 02ada95d
  Jithun Nair authored Aug 31, 2021
```
add distributed fused lamb
```
  02ada95d
- enable --distributed_lamb for rocm · 955256d1
  Jeff Daily authored Aug 31, 2021
  
  955256d1
- Merge pull request #1151 from NVIDIA/spatial_fast_bottleneck · ed713c84
  Thor Johnsen authored Aug 31, 2021
```
Spatially Distributed Fast Bottleneck block
```
  ed713c84