- 15 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Add setup_simple.py for debugging the compilation issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve the filename collision between *.cpp files containing to-be-hipified code and *.cu files
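A stripped-down setup script of this kind builds only the one extension, so its compile errors are not buried in the output of the full apex build. A minimal sketch, assuming the usual apex source layout under csrc/megatron/ (the actual contents of setup_simple.py are not shown in this log):
"""
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Build only the scaled_masked_softmax_cuda extension in isolation.
setup(
    name="scaled_masked_softmax_cuda",
    ext_modules=[
        CUDAExtension(
            name="scaled_masked_softmax_cuda",
            sources=[
                "csrc/megatron/scaled_masked_softmax.cpp",
                "csrc/megatron/scaled_masked_softmax_cuda.cu",
            ],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
"""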
-
- 14 Apr, 2022 1 commit
-
-
mahathis authored
* Add support for the memory_format API (torch.channels_last) in GBN
  Group Batch Norm (GBN) is an NHWC operation: it assumes the underlying memory format of an input tensor is NHWC and originally did not support PyTorch's memory_format API. To support the memory_format API, i.e., .to(memory_format=...) or .contiguous(memory_format=...), we add a torch_channels_last flag that indicates whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag lets GBN handle the memory format of its input tensors properly.
  An example of using memory_format with GBN:
  """
  from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
  GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
  """
  The cases GBN handles are:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN produces a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input to torch.channels_last and produces a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN produces a torch.contiguous_format output tensor.
* Add GBN unit tests for the channels_last memory format

Co-authored-by: hubertlu-tw <hubertlu@amd.com>
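A short usage sketch of cases 1 and 2 above. The tensor shape, dtype, and planes value are illustrative assumptions; GBN kernels typically expect half-precision CUDA tensors:
"""
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1,
                       torch_channels_last=True).cuda().half()

# Case 1: channels_last input -> channels_last output.
x = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)
y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)

# Case 2: contiguous_format input is converted internally,
# and the output is still channels_last.
x2 = torch.randn(8, planes, 32, 32, device="cuda", dtype=torch.half)
y2 = gbn(x2)
assert y2.is_contiguous(memory_format=torch.channels_last)
"""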
-
- 13 Apr, 2022 1 commit
-
-
Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* Merge the .so files
* ODR (one-definition-rule) fixes
* Fix build
* Update imports
* Apply psf/black with a max line length of 120
* Update
* Fix
* Update
* Build fixed again, but an undefined symbol remains
* Fix 2: the layer norm gradient symbol is still undefined
* Remove unused .cpp files
* Without layer_norm.cuh, the import works
* Importing fast_multihead_attn works... but why? Was the unnecessary `#include "layer_norm.cuh"` the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`?
* Clean up layer norm
* Fix some bugs

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
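A sketch of the "merge the .so files" idea: instead of building one extension per kernel, all multihead-attn sources go into a single fast_multihead_attn extension, so duplicated helpers such as the layer-norm functions live in exactly one object and the ODR/undefined-symbol problems disappear. The frontend .cpp name below is an assumption; the .cu names are taken from the conflict list elsewhere in this log:
"""
from torch.utils.cpp_extension import CUDAExtension

# One extension (one .so) for the whole --fast_multihead_attn build.
# HostApplyLayerNorm / HostLayerNormGradient are kept in a single
# translation unit instead of being duplicated across several .so files.
fast_multihead_attn = CUDAExtension(
    name="fast_multihead_attn",
    sources=[
        "apex/contrib/csrc/multihead_attn/multihead_attn_frontend.cpp",  # assumed frontend name
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu",
        "apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu",
        "apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu",
        "apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu",
        "apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu",
        "apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu",
    ],
    extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
)
"""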
-
- 06 Apr, 2022 1 commit
-
-
Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make the rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs
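One way such backward compatibility can be wired up is a version check in setup.py that only defines a guarding macro on sufficiently new ROCm builds of PyTorch, so older versions simply compile the old code path. A hedged sketch; the macro name and version cutoff are assumptions, not the actual values used in #74:
"""
import torch
from packaging.version import parse

extra_gemm_flag_defines = []
# Only newer ROCm builds of PyTorch expose what is needed for
# rocblas_gemm_flags_fp16_alt_impl; older versions silently fall back.
if torch.version.hip is not None and parse(torch.__version__.split("+")[0]) >= parse("1.10"):
    extra_gemm_flag_defines.append("-DUSE_ROCBLAS_GEMM_FLAGS_FP16_ALT_IMPL")

# extra_gemm_flag_defines is then appended to the extension's
# extra_compile_args, and the C++/HIP code wraps the flag usage in
# #ifdef USE_ROCBLAS_GEMM_FLAGS_FP16_ALT_IMPL ... #endif.
"""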
-
- 23 Mar, 2022 1 commit
-
-
Hubert Lu authored
* Add rocblas_alt_impl flag in MLP
* Refactor the rocblas_alt_impl implementation and only use it for backprop
-
- 18 Mar, 2022 1 commit
-
-
athitten authored
* Add the missing flags argument in the gemm_switch_fp32accum call
* Add the rocblas_alt_impl flag in MHA
  <rev> Add the rocblas_alt_impl flag for all backward gemms in the MHA module
* Use an ifdef for rocblas_gemm_flags_fp16_alt_impl to target various AMD hardware

Co-authored-by: hubertlu-tw <hubertlu@amd.com>
-
- 11 Mar, 2022 1 commit
-
-
Pruthvi Madugundu authored
-
- 16 Feb, 2022 1 commit
-
-
hubertlu-tw authored
-
- 28 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 26 Jan, 2022 1 commit
-
-
Jithun Nair authored
-
- 25 Jan, 2022 2 commits
- 21 Jan, 2022 1 commit
-
-
athitten authored
Remove an unnecessary debug print statement.
-
- 14 Dec, 2021 3 commits
-
-
Jithun Nair authored
IFU-master-2021-12-08
-
Hubert Lu authored
-
Hubert Lu authored
* Skip failing unit tests
* Modify the test-skipping messages
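The skip pattern itself is the standard unittest one; a generic sketch, where the conditions and message texts are placeholders rather than the actual ones from this commit:
"""
import unittest
import torch

class TestFusedExtension(unittest.TestCase):
    @unittest.skipIf(not torch.cuda.is_available(), "requires a GPU")
    def test_forward(self):
        self.assertTrue(True)  # placeholder body

    @unittest.skip("Skipped on ROCm: known failure, tracked separately")
    def test_known_failure(self):
        self.assertTrue(False)  # never runs while skipped

if __name__ == "__main__":
    unittest.main()
"""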
-
- 13 Dec, 2021 1 commit
-
-
Hubert Lu authored
-
- 09 Dec, 2021 5 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
* Pass `self.mask_additive`
* clang-format
* Remove THCState
-
Kevin Stephano authored
* Add fused mixed precision LAMB optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
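A usage sketch, assuming the new optimizer is exposed as FusedMixedPrecisionLamb somewhere under apex.contrib.optimizers; the import path, class name, and hyperparameters below are assumptions, not taken from the commit:
"""
import torch
# Assumed location of the new optimizer; adjust to the actual module path.
from apex.contrib.optimizers.fused_lamb_mp import FusedMixedPrecisionLamb

model = torch.nn.Linear(1024, 1024).cuda().half()
# Mixed-precision LAMB: fp16 model parameters, with fp32 master weights and
# optimizer state maintained internally by the fused kernel.
optimizer = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(8, 1024, device="cuda", dtype=torch.half)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
"""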
-
Hubert Lu authored
-
hubertlu-tw authored
-
- 08 Dec, 2021 1 commit
-
-
Jithun Nair authored
IFU-2021-10-15 (+ remove redundant defines + C10_CUDA_CHECK)
-
- 06 Dec, 2021 2 commits
-
-
Hubert Lu authored
-
Masaki Kozuki authored
Changes include:
- THC headers removal
- TH macros replacement
- fix some typos in comments

Conflicts:
apex/contrib/csrc/multihead_attn/additive_masked_softmax_dropout_cuda.cu
apex/contrib/csrc/multihead_attn/encdec_multihead_attn_cuda.cu
apex/contrib/csrc/multihead_attn/encdec_multihead_attn_norm_add_cuda.cu
apex/contrib/csrc/multihead_attn/masked_softmax_dropout_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_additive_mask_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_bias_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_cuda.cu
apex/contrib/csrc/multihead_attn/self_multihead_attn_norm_add_cuda.cu
apex/contrib/csrc/multihead_attn/strided_batched_gemm.h
-
- 03 Dec, 2021 2 commits
-
-
hubertlu-tw authored
-
hubertlu-tw authored
-
- 02 Dec, 2021 4 commits
-
-
Jithun Nair authored
* Use the --cuda_ext flag to build all supported extensions
* Don't remove --cuda_ext, since it'll be needed to build other extensions
* Need to clear all cmdline args so setup.py doesn't complain
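The last bullet refers to the common pattern of consuming custom build flags before setuptools parses the command line; a minimal sketch of that pattern (the surrounding CI script is not shown here):
"""
import sys

# Consume the custom flag and strip it from sys.argv so that setuptools
# does not reject it as an unknown option.
if "--cuda_ext" in sys.argv:
    build_cuda_extensions = True
    sys.argv.remove("--cuda_ext")
else:
    build_cuda_extensions = False
"""
On the pip side, such flags are typically forwarded with --global-option, e.g. pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./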
-
Hubert Lu authored
Add more unit tests for both distributed and extensions
-
hubertlu-tw authored
-
Hubert Lu authored
-
- 01 Dec, 2021 2 commits
- 29 Nov, 2021 1 commit
-
-
X Wang authored
-
- 22 Nov, 2021 1 commit
-
-
Hubert Lu authored
Change python3.6 to python
-
- 19 Nov, 2021 5 commits
-
-
Hubert Lu authored
-
Hubert Lu authored
-
eqy authored
* Minimal BERT pipeline parallel test
* Fix global and clean up
* Use get_forward_backward_func
* Clean up and fix some tests
-
Masaki Kozuki authored
Co-authored-by: Sangkug Lym <slym@nvidia.com>
-
Masaki Kozuki authored
* Initial logging use
* Fix
* Clean up
* fp32 p2p comm
* Init
* Dynamic global batch size with `MegatronPretrainingSampler`. I couldn't make this script work with `MegatronPretrainingRandomSampler`, because the random sampler seems to have requirements for global batch size, total number of samples, local minibatch size, etc. that I'm not familiar with for now.
* Revive the original pipeline parallel test
* Update MULTIGPU_TEST: add a dynamic batch size test
* Run MegatronPretrainingRandomSampler
* Fix comment
* Fix
* Update
* Cosmetic
* Add note
* Apply 2 suggestion(s) to 2 file(s)
* Change following https://github.com/NVIDIA/apex/pull/1210
* Fix
-