- 18 Sep, 2023 1 commit
flyingdown authored

- 12 Jun, 2023 1 commit

flyingdown authored
2. Add the environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used
3. Add DCU version information; rename the whl package; update the installation steps in the README
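For reference, opting in via the new switch would look roughly like this (a minimal sketch; how the extension parses the value is an assumption, not the committed behavior):

```python
import os

# Hedged sketch: APEX_ROCBLAS_GEMM_ALLOW_HALF is described as a switch for the
# fp16 (fp16r) rocBLAS GEMM path. The "1"-means-enabled convention below is an
# assumption; a getenv-style opt-in like this is the common pattern.
os.environ["APEX_ROCBLAS_GEMM_ALLOW_HALF"] = "1"  # opt in before the extension loads

def rocblas_half_gemm_allowed() -> bool:
    return os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"

assert rocblas_half_gemm_allowed()
```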

- 23 Apr, 2023 1 commit

luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS
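A usage sketch for the new optimizer (hypothetical: the import path and constructor arguments are assumptions; only the FusedLARS name, the nesterov flow, and NHWC support come from the commit message):

```python
import torch
# Hypothetical sketch -- import path and constructor signature are assumed,
# not the committed API.
from apex.optimizers import FusedLARS

model = torch.nn.Conv2d(3, 64, 3).cuda().to(memory_format=torch.channels_last)  # NHWC
opt = FusedLARS(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

x = torch.randn(8, 3, 32, 32, device="cuda").to(memory_format=torch.channels_last)
model(x).sum().backward()
opt.step()
opt.zero_grad()
```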

- 23 Mar, 2023 1 commit

luise.chen authored
* Add fused_lars optimizer
* Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
* Add flow of using nesterov in FusedLARS

- 11 Nov, 2022 1 commit

flyingdown authored

- 08 Nov, 2022 2 commits

flyingdown authored

flyingdown authored

- 21 Sep, 2022 1 commit

Hubert Lu authored
* Make index_mul_2d extension backward compatible for Atomic header include
* Typo

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

- 19 Sep, 2022 1 commit

Hubert Lu authored
* Remove redundant imports and enable ninja for MHA extension
* Remove redundant CUDAExtension imports

- 08 Sep, 2022 1 commit

Hubert Lu authored
* Enable --transducer extension for ROCm
* Enable --transducer unit tests for ROCm
* Skip some failing tests in test_transducer_joint.py
* Skip test_transducer_joint_pack for transducer extension
* Keep transducer extension CUDA-compatible
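The standard pattern for skipping a test on ROCm while keeping it enabled on CUDA looks like this (a sketch; the exact decorator used in test_transducer_joint.py is an assumption):

```python
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # True on a HIP/ROCm build of PyTorch

class TestTransducerJoint(unittest.TestCase):
    # Test name taken from the commit message; the real test body does more.
    @unittest.skipIf(IS_ROCM, "test_transducer_joint_pack is skipped on ROCm")
    def test_transducer_joint_pack(self):
        self.assertTrue(torch.cuda.is_available())

if __name__ == "__main__":
    unittest.main()
```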

- 07 Sep, 2022 1 commit

hubertlu-tw authored

- 23 Aug, 2022 2 commits

hubertlu-tw authored

hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>

- 22 Aug, 2022 1 commit

hubertlu-tw authored

- 07 Jul, 2022 1 commit

Masaki Kozuki authored
* remove pyprof
* remove reparameterization
* remove pyprof test
* clean up

- 31 May, 2022 1 commit

Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compat for new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent lengthy if-statement

- 21 Apr, 2022 1 commit

Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
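A runtime version guard of the kind listed here typically compares the installed CUDA version once and skips tests accordingly (a sketch; the guarded feature and the 11.0 threshold are assumptions):

```python
import unittest
import torch
from packaging.version import parse

# Hypothetical threshold: gate a code path behind a minimum CUDA runtime version.
REQUIRED_CUDA = parse("11.0")
cuda_version = parse(torch.version.cuda) if torch.version.cuda else None

def cuda_is_new_enough() -> bool:
    return cuda_version is not None and cuda_version >= REQUIRED_CUDA

class TestGuarded(unittest.TestCase):
    @unittest.skipUnless(cuda_is_new_enough(), "requires CUDA >= 11.0")
    def test_feature(self):
        self.assertTrue(torch.cuda.is_available())
```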

- 19 Apr, 2022 1 commit

Masaki Kozuki authored
* bump version
* add guard
* fix the cond

- 15 Apr, 2022 1 commit

Hubert Lu authored
* Add setup_simple.py for debugging the compiling issue of scaled_masked_softmax_cuda
* Comment out CUDA-specific implementations
* Resolve filename collision of *.cpp files with to-hipify code and *.cu files
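A setup_simple.py for this purpose would plausibly build just the one extension so compiler errors can be read in isolation (a sketch of what such a script might contain; the source paths and flags are assumptions):

```python
# setup_simple.py -- minimal sketch: build only the scaled_masked_softmax_cuda
# extension so its compile errors can be inspected in isolation.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="scaled_masked_softmax_cuda",
    ext_modules=[
        CUDAExtension(
            name="scaled_masked_softmax_cuda",
            # Source paths assumed; apex keeps these under csrc/.
            sources=[
                "csrc/megatron/scaled_masked_softmax.cpp",
                "csrc/megatron/scaled_masked_softmax_cuda.cu",
            ],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```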

- 13 Apr, 2022 1 commit

Hubert Lu authored
* Faster `--fast_multihead_attn` build (#1245)
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? The unnecessary `#include "layer_norm.cuh"` was apparently the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`
* clean up layer norm
* Fix some bugs

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>

- 06 Apr, 2022 1 commit

Hubert Lu authored
Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
* First attempt to make rocblas flag backward compatible
* Fix some bugs
* Fix some bugs
* Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
* Add groupbn extension unit tests for ROCm
* Fix some bugs

- 05 Apr, 2022 2 commits

Thor Johnsen authored

Thor Johnsen authored

- 30 Mar, 2022 1 commit

Gil Shomron authored
* Enabled Conv-Bias-ReLU fusion. The following modules are enabled using cuDNN runtime fusion: 1) Conv-Bias-ReLU (+backward), 2) Conv-Bias (+backward), 3) Conv-Bias-Mask-ReLU (+backward)
* Casts cleanup and autocast in unittest: remove redundant dtype casts; simulate the usage in the unittest by using torch.cuda.amp.autocast
* Fixed save_for_backward

Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: root <root@luna-0277.selene.nvidia.com>
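For orientation, this is the unfused reference computation the cuDNN runtime-fused modules replace, run under autocast as in the unittest (a sketch; it does not call the fused apex modules themselves):

```python
import torch
import torch.nn.functional as F

# Reference (unfused) computation that the cuDNN runtime-fused kernels replace;
# useful as a numerical baseline in a unit test.
x = torch.randn(8, 32, 16, 16, device="cuda").to(memory_format=torch.channels_last)
w = torch.randn(64, 32, 3, 3, device="cuda").to(memory_format=torch.channels_last)
b = torch.randn(64, device="cuda")

with torch.cuda.amp.autocast():               # mirror the unittest's autocast usage
    y = F.relu(F.conv2d(x, w, b, padding=1))  # Conv -> Bias -> ReLU, unfused
```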

- 25 Mar, 2022 1 commit

Thor Johnsen authored

- 24 Mar, 2022 1 commit

Masaki Kozuki authored
Take-over of #1097
* Add fast CUDA focal loss implementation
* Enable fast math for CUDA focal loss
* Correct typo
* replace deprecated macros
* TORCH_CUDA_CHECK -> AT_CUDA_CHECK: the former is defined in torch/csrc/profiler/cuda.cpp, so it is not usually available; the latter is defined in ATen/cuda/Exceptions.h as an alias of C10_CUDA_CHECK
* add test
* clean up
* guard for torchvision

Co-authored-by: Wil Kong <alpha0422@gmail.com>
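The "guard for torchvision" item is the usual optional-dependency guard (a sketch; where the guard actually lives in the test code is an assumption):

```python
import unittest

# Guard an optional torchvision dependency so the focal-loss test suite still
# imports cleanly when torchvision is absent.
try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False

@unittest.skipUnless(HAS_TORCHVISION, "test requires torchvision")
class TestFocalLoss(unittest.TestCase):
    def test_import(self):
        self.assertTrue(HAS_TORCHVISION)
```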

- 23 Mar, 2022 1 commit

Thor Johnsen authored

- 11 Mar, 2022 1 commit

Pruthvi Madugundu authored

- 27 Feb, 2022 1 commit

Masaki Kozuki authored

- 26 Feb, 2022 1 commit

Masaki Kozuki authored
* fuse grad accumulation w/ weight grad
* fp32 training path
* not using *args, **kwargs
* backward: moved the tensor dimension conversion
* move files to csrc/megatron
* fix fp32 path
* fix typo
* add to in order to select the correct custom extension
* fix typo
* comment on import guard
* update test: enable gradient_accumulation_fusion
* 86
* remove redundant call of `test_column_parallel_linear`

Co-authored-by: Sangkug Lym <slym@nvidia.com>

- 10 Feb, 2022 1 commit

Masaki Kozuki authored

- 01 Feb, 2022 1 commit

ChongyuNVIDIA authored
* Add the permutation related support as the extension for asp lib.
* [Fix] Track the permutation sequence for progressive channel swap strategy.
* Fix the corner case that one layer is not sparse, but needs to apply permutation due to its siblings.
* Fix the deprecated functions in ASP unit tests.
* Fix the sparsity info typo in ASP lib.
* [Enhancement] Set the identical random seed for all GPUs to make sure the same results are generated in permutation search.
* Update the README.md with identical random seed setting and NeurIPS info.
* Integrate the Pybind11 enhancement of permutation search into ASP lib.
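Setting the identical random seed on every GPU/process, per the [Enhancement] item above, usually looks like this (a sketch; ASP's actual entry point for this is an assumption):

```python
import random
import numpy as np
import torch

def set_identical_seed(seed: int = 1) -> None:
    # Same seed everywhere so the permutation search explores the same
    # candidate sequence on every rank and all GPUs agree on the result.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_identical_seed(1)
```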

- 28 Jan, 2022 1 commit

Jithun Nair authored

- 19 Jan, 2022 1 commit

Masaki Kozuki authored

- 13 Jan, 2022 1 commit

Shintaro Iwasaki authored

- 16 Dec, 2021 1 commit

Masaki Kozuki authored

- 15 Dec, 2021 1 commit

Masaki Kozuki authored
* apply formatter & remove duplicate func def
* DRY the CUDA_HOME None check
* `--threads 4`
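The last two items concern the build script: one reusable CUDA_HOME check instead of repeated ones, and nvcc's parallel-compilation flag (a sketch; the helper name is hypothetical, and `--threads` requires nvcc >= 11.2):

```python
# Sketch of the setup.py pattern implied by these items; `check_cuda_home` is a
# hypothetical helper name used for illustration.
from torch.utils.cpp_extension import CUDA_HOME

def check_cuda_home() -> None:
    # Single, reusable (DRY) guard instead of repeating the None check per extension.
    if CUDA_HOME is None:
        raise RuntimeError("CUDA_HOME not set; install the CUDA toolkit first.")

check_cuda_home()
# nvcc >= 11.2 can compile a translation unit with multiple threads:
nvcc_flags = ["-O3", "--threads", "4"]
```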

- 14 Dec, 2021 1 commit

Masaki Kozuki authored
* merge .so files
* odr
* fix build
* update import
* apply psf/black with max line length of 120
* update
* fix
* update
* build fixed again but undefined symbol again
* fix 2, still layer norm grad is undefined
* remove unused cpp files
* without layer_norm.cuh, import works
* import fast_multihead_attn works... but why? The unnecessary `#include "layer_norm.cuh"` was apparently the culprit that kept the shared objects from linking `HostApplyLayerNorm` and `HostLayerNormGradient`
* clean up layer norm

- 09 Dec, 2021 2 commits

Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.
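A plausible way to drive the new optimizer (a sketch; the class name follows the commit message, but the import path and constructor arguments are assumptions):

```python
import torch
# Import path and signature assumed for illustration; see apex.optimizers for
# the class this commit actually adds.
from apex.optimizers import FusedMixedPrecisionLamb

model = torch.nn.Linear(1024, 1024).cuda().half()
opt = FusedMixedPrecisionLamb(model.parameters(), lr=1e-3, weight_decay=0.01)

out = model(torch.randn(32, 1024, device="cuda", dtype=torch.half))
out.float().sum().backward()
opt.step()
opt.zero_grad()
```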

Kevin Stephano authored
* Add fused mixed precision lamb optimizer.
* Fix device usage in constructor.
* Fix sending param_group tensor state to device.
* Remove unneeded device set.