- 21 Oct, 2021 1 commit
  - Jeff Daily authored
- 19 Oct, 2021 1 commit
  - Abhishree authored: 1) multihead_attn 2) xentropy 3) fused_adam and distributed_fused_adam
- 04 Oct, 2021 1 commit
  - Jeff Daily authored
- 07 Sep, 2021 1 commit
  - sarunyap authored:
    * Enable group batch norm (--bnp) on ROCm (bn_group = 1 only): enable NHWC group batch norm on a single GPU on ROCm; the multi-GPU case (bn_group > 1) will be revisited in the future. Main changes: 1) use MIOpen data structures/functions in HIP instead of cuDNN; 2) ensure the warp-level primitive code operates on a 64-thread-wide warp instead of a 32-thread-wide one; 3) disable all bn_group > 1 paths. Notes: multi-stream is not tested, and performance has not been optimized.
    * Fix bnp hipification: avoid calling hipify-perl in setup.py and rely on PyTorch's internal hipification mechanism.
    * Make bnp data pointers contiguous: the contrib group batch norm implementation assumes all input tensors are contiguous and gives wrong results otherwise, so .contiguous() is now called explicitly on every input tensor before it is accessed.
    * Fix HIP lane id in bnp (typo fix).
    * Fix the ReLU bitmask for HIP in bnp: the bitmask is derived with the __ballot function, which returns a 64-bit value in HIP, so the bitmask storage size and offsets are fixed on ROCm. The kernel now also sets the ReLU bitmask when the data is less than or equal to zero (not only less than); not doing so can cause a stability issue. (See the sketch below.)
    * Remove the multiple-of-64 offset for HIP in bnp: it is not necessary.
    * Use the FP16 intermediate output to decide whether to rectify in bnp: group batch norm takes FP16 tensors and produces FP16 output, but all arithmetic is done in FP32, so intermediate outputs are FP32. The fusion kernels previously inspected the FP32 intermediate output to decide whether ReLU should rectify it (ReLU must rectify values that are less than or equal to zero). An FP32 value very close to zero can become exactly zero when converted to FP16, in which case the output is not rectified when it should be; the gradient is then not rectified in the backward pass either, which can cause a stability issue. This patch can hurt group batch norm performance because the FP32-FP16 conversion is performed multiple times.
    * Disable dispatchX ParallelSums in HIP in bnp: dispatchX is not required for the bn_group = 1 case.
    * Use traditional load/store for HIP in bnp: the built-in function has a high floating-point rounding error, so it is replaced with traditional load/store. Because this breaks the aligned-pointer property of the load/store functions, traditional load/store is used conservatively for all memory accesses.
    * Replace shfl_down with shfl_sync in parallel sums for HIP in bnp; this separates the HIP code from the CUDA code in parallel sums.
    * Remove -U__HIP_NO_HALF_CONVERSIONS__ for HIP in bnp: with the built-in function removed, the flag is no longer needed.
    * Preserve CUDA's ReLU condition path for USE_ADD_RELU in bnp.
    * Add a test for bnp: the test checks the correctness of batch norm, batch norm + ReLU, and batch norm + add + ReLU against reference implementations. The forward activation output must be allclose with PyTorch's activation output for the test to pass. The backward gradient output is validated against a Python implementation; due to floating-point rounding error in the batch norm implementation, the gradient output might not be allclose with the Python result when ReLU is used, even though the majority of elements are very close, so a norm-difference threshold is used as the pass/fail criterion instead of allclose.
    * Use the warp size variable rather than hard-coding the warp size in bnp: use C10_WARP_SIZE from c10/macros/Macros.h in host functions and warpSize in device kernels.
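A minimal CUDA/HIP sketch of the warp-width point above (the kernel name, mask layout, and the `__HIPCC__` guard are illustrative assumptions, not the actual bnp code): HIP's `__ballot` covers a 64-thread wavefront and returns a 64-bit mask, while CUDA's `__ballot_sync` covers a 32-thread warp and returns a 32-bit mask, so the ReLU bitmask storage is sized for the wider case and the rectify test uses `<= 0` rather than `< 0`.

```cuda
#include <cstdint>

// Illustrative sketch only: one 64-bit mask word per warp/wavefront,
// sized for the wider ROCm case.
__global__ void relu_mask_sketch(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 uint64_t* __restrict__ relu_mask,
                                 int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  float v = (idx < n) ? in[idx] : 1.f;  // padding lanes report "not rectified"

  // Rectify when the value is <= 0 (not only < 0), as the commit notes.
  bool rectified = (v <= 0.f);

#ifdef __HIPCC__
  // HIP: __ballot returns a 64-bit mask for a 64-thread wavefront (warpSize == 64).
  uint64_t ballot = __ballot(rectified);
#else
  // CUDA: __ballot_sync returns a 32-bit mask for a 32-thread warp (warpSize == 32).
  uint64_t ballot = __ballot_sync(0xffffffffu, rectified);
#endif

  if (idx < n) {
    out[idx] = rectified ? 0.f : v;
    // The first lane of each warp stores the mask; 64-bit storage fits both cases.
    if (idx % warpSize == 0) {
      relu_mask[idx / warpSize] = ballot;
    }
  }
}
```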
- 01 Sep, 2021 2 commits
  - Jithun Nair authored: work around hipify not finding headers
  - Jeff Daily authored
- 31 Aug, 2021 2 commits
  - Jithun Nair authored: add distributed fused lamb
  - Jeff Daily authored
- 25 Jun, 2021 2 commits
  - Jeff Daily authored: Make torch version check numeric
  - Jithun Nair authored
- 04 Mar, 2021 3 commits
  - Jeff Daily authored: IFU-2020-03-04
  - Jeff Daily authored
  - Peng authored: Revert "pass all TensorListMetadata as pointer to pinned host memory (#13)"
- 25 Feb, 2021 1 commit
  - Jeff Daily authored: This reverts commit bdd481d1.
- 23 Feb, 2021 1 commit
  - yjk21 authored
- 10 Feb, 2021 1 commit
  - Shoufa Chen authored:
    * copy-paste friendly
    * fix import container_abcs issue: nightly PyTorch has removed `container_abcs` from `torch._six` (https://github.com/pytorch/pytorch/commit/58eb23378f2a376565a66ac32c93a316c45b6131#diff-b3c160475f0fbe8ad50310f92d3534172ba98203387a962b7dc8f4a23b15cf4dL35)
    * keep existing for pytorch1.7 and earlier
- 25 Jan, 2021 1 commit
  - Jeff Daily authored: fix incorrect use of __shfl_down, fix warp size assumptions, and update unit tests to exit on failure
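A minimal sketch of the kind of warp reduction this fix concerns, with a hypothetical helper name and `__HIPCC__` assumed as the HIP-compiler guard (not the actual apex code): the mask-less `__shfl_down` is replaced by `__shfl_down_sync` on CUDA, and the loop starts from `warpSize / 2` instead of a hard-coded 16 so it also covers ROCm's 64-thread wavefront.

```cuda
// warp_reduce_sum_sketch: sum a value across one warp/wavefront without
// assuming a 32-thread warp (hypothetical helper, for illustration only).
template <typename T>
__device__ T warp_reduce_sum_sketch(T val) {
#ifdef __HIPCC__
  // HIP: warpSize == 64; HIP's __shfl_down takes no mask argument.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);
#else
  // CUDA: warpSize == 32; use the _sync variant with a full-warp mask.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffffu, val, offset);
#endif
  return val;  // lane 0 holds the warp-wide sum
}
```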
- 21 Jan, 2021 2 commits
  - Jeff Daily authored
  - Jeff Daily authored: use __launch_bounds__(1024) for multi_tensor_apply, re-enable skipped tests
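A brief sketch of what the `__launch_bounds__(1024)` qualifier does, using a placeholder kernel rather than the real multi_tensor_apply: it promises the compiler the kernel is never launched with more than 1024 threads per block, so register allocation is constrained and a 1024-thread launch cannot fail for lack of registers.

```cuda
// Placeholder kernel (hypothetical name, for illustration only).
__global__ void __launch_bounds__(1024)
scale_kernel_sketch(float* __restrict__ data, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= scale;
}

// Host-side launch with the maximum block size the qualifier allows:
//   scale_kernel_sketch<<<(n + 1023) / 1024, 1024>>>(d_data, 2.0f, n);
```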
- 20 Jan, 2021 1 commit
  - Burc Eryilmaz authored (Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>)
- 19 Jan, 2021 1 commit
  - Jeff Daily authored: IFU-2021-01-18
- 18 Jan, 2021 5 commits
  - Jeff Daily authored
  - Jeff Daily authored
  - Jeff Daily authored: mostly whitespace and formatting fixes; the diff with upstream is reduced and the ROCm changes are clearer
  - Jeff Daily authored: merge with conflicts in csrc/multi_tensor_apply.cuh, setup.py, tests/L0/run_optimizers/test_adagrad.py, tests/L0/run_optimizers/test_fused_optimizer.py, and tests/L0/run_optimizers/test_lamb.py
  - Jeff Daily authored: Fix reduce_block_into_lanes for multi_tensor_l2norm for ROCm
- 15 Jan, 2021 1 commit
  - Sarunya Pumma authored
- 31 Dec, 2020 3 commits
  - Chaitanya Sri Krishna Lolla authored: Skip the unit tests
  - lcskrishna authored
  - lcskrishna authored
- 17 Dec, 2020 3 commits
  - Thor Johnsen authored: Update ASP README to highlight default recipe
  - jpool-nv authored: the recipe was presented after some non-standard API calls, so this moves the suggested usage up, gives it its own section, and reinforces the suggested usage in the non-standard section
  - Chaitanya Sri Krishna Lolla authored: Hipify revamp changes for apex extensions on ROCm
- 16 Dec, 2020 1 commit
  - lcskrishna authored
- 15 Dec, 2020 4 commits
  - lcskrishna authored
  - lcskrishna authored
  - lcskrishna authored
  - lcskrishna authored
- 10 Dec, 2020 1 commit
  - lcskrishna authored
- 09 Dec, 2020 1 commit
  - lcskrishna authored