- 19 Sep, 2023 1 commit
-
-
root authored
-
- 18 Sep, 2023 1 commit
-
-
flyingdown authored
-
- 06 Sep, 2023 1 commit
-
-
Pruthvi Madugundu authored
This reverts commit 8fc9b21f.
-
- 11 Aug, 2023 1 commit
-
-
Pruthvi Madugundu authored
-
- 12 Jun, 2023 1 commit
-
-
flyingdown authored
2. Add environment variable APEX_ROCBLAS_GEMM_ALLOW_HALF to control whether fp16r is used. 3. Add DCU version information; rename the whl package; update the installation steps in the README.
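A hedged sketch of how such an environment toggle is typically consumed; the accepted values ("0"/"1") and the Python-side helper are assumptions for illustration, since the actual switch lives in the ROCm GEMM code path:

```python
import os

# Assumption: "1" allows the fp16 (fp16r) rocBLAS GEMM path, anything else keeps it
# disabled. Set the variable before the extension is loaded/used.
os.environ.setdefault("APEX_ROCBLAS_GEMM_ALLOW_HALF", "1")

def gemm_allow_half() -> bool:
    # Hypothetical helper mirroring how the flag could be read on the Python side.
    return os.environ.get("APEX_ROCBLAS_GEMM_ALLOW_HALF", "0") == "1"

print("fp16r GEMM allowed:", gemm_allow_half())
```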
-
- 23 Apr, 2023 5 commits
-
-
Pruthvi Madugundu authored
-
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578), with follow-up nit fixes
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit `b005ec62b9`
* Use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
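A minimal sketch of the API migrations listed above (illustrative values, not code from the commits themselves):

```python
import torch
from collections.abc import Sequence  # replaces the removed torch._six container aliases

# torch.empty() replaces the legacy torch.Tensor(...) constructor for
# uninitialized storage and must be given explicit sizes.
buf = torch.empty(2, 3, dtype=torch.float16)

# torch.tensor() is the right call when concrete initializer values exist.
mask = torch.tensor([1, 0, 1, 1], dtype=torch.bool)

assert isinstance([1, 2, 3], Sequence)  # std collections.abc instead of torch._six
```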
-
luise.chen authored
* GroupBN: reduce buffering to better hide calculations in some loops of length OUTER_LOOPS
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels to improve resnet50
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels, ~10% E2E improvement on resnet50
-
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
-
hubertlu-tw authored
-
- 30 Mar, 2023 1 commit
-
-
Pruthvi Madugundu authored
-
- 01 Mar, 2023 1 commit
-
-
Hubert Lu authored
* Replace torch.Tensor with torch.empty (#1578), with follow-up nit fixes
* torch.empty() must have args (#1584)
* Use `torch.tensor` to create a tensor with initializer values (#1588)
* Update apex/contrib/sparsity/sparse_masklib.py
* Retire `torch._six` as per the upstream commit `b005ec62b9`
* Use std collections.abc
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 13 Feb, 2023 1 commit
-
-
luise.chen authored
* GroupBN: reduce buffering to better hide calculations in some loops of length OUTER_LOOPS
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN and BN_relu kernels to improve resnet50
* GroupBN: use C_ELEMENTS_PER_CTA=64 for BN_add_relu kernels, ~10% E2E improvement on resnet50
-
- 20 Dec, 2022 1 commit
-
-
Pruthvi Madugundu authored
* Update register keyword handling for C++17: the 'register' storage-class keyword is removed in C++17, so it is kept active only for C++14 and lower.
* Updates to the code
-
- 09 Dec, 2022 1 commit
-
-
hubertlu-tw authored
-
- 14 Nov, 2022 1 commit
-
-
flyingdown authored
-
- 08 Nov, 2022 1 commit
-
-
flyingdown authored
-
- 21 Sep, 2022 1 commit
-
-
Hubert Lu authored
* Make the index_mul_2d extension backward compatible with respect to the Atomic header include
* Fix typo
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
-
- 08 Sep, 2022 1 commit
-
-
Hubert Lu authored
* Enable the --transducer extension for ROCm
* Enable the --transducer unit tests for ROCm
* Skip some failing tests in test_transducer_joint.py
* Skip test_transducer_joint_pack for the transducer extension
* Keep the transducer extension CUDA-compatible
-
- 23 Aug, 2022 2 commits
-
-
hubertlu-tw authored
-
hanbao authored
Co-authored-by: Han Bao <hbao@nvidia.com>
-
- 22 Aug, 2022 2 commits
-
-
Thor Johnsen authored
-
hubertlu-tw authored
-
- 08 Aug, 2022 1 commit
-
-
hubertlu-tw authored
-
- 29 Jul, 2022 1 commit
-
-
hubertlu-tw authored
-
- 26 Jul, 2022 1 commit
-
-
Tim Moon authored
* Improvements in the distributed Adam optimizer for Megatron: add an option to allocate gradient buckets out of one large buffer, add an option to initialize params in user-provided order, perform communication when saving optimizer state, and support param sync with any dtype.
* Style fixes in distributed Adam helper classes (review suggestions from @crcrpar)
-
- 21 Jul, 2022 1 commit
-
-
Thor Johnsen authored
-
- 14 Jul, 2022 1 commit
-
-
Masaki Kozuki authored
* Follow the current signature
* Call .backward on outputs
* Update the other caller of _softmax_backward_data
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
-
- 05 Jul, 2022 1 commit
-
-
Tim Moon authored
* Add features to distributed Adam for Megatron support: gradient clipping, gradient scaling, FP32 grad accumulation, and multiple dtypes and devices.
* Restore the closure arg to distributed Adam (review suggestion from @crcrpar)
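A hedged sketch of driving the contrib distributed Adam optimizer with the closure-style step() mentioned above; the constructor keywords beyond the parameter list are standard Adam arguments, and the torch.distributed process-group setup is assumed to be done elsewhere:

```python
import torch
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = DistributedFusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))

def closure():
    optimizer.zero_grad()
    out = model(torch.randn(8, 1024, device="cuda"))
    loss = out.float().pow(2).mean()
    loss.backward()
    return loss

loss = optimizer.step(closure)  # closure-style step, as restored by the commit above
```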
-
- 23 Jun, 2022 1 commit
-
-
Tim Moon authored
* Increase the default bucket size in distributed Adam
* Move the distributed Adam unit test to the contrib tests and integrate it into the unit testing framework
* Tweak hyperparameters for the dist Adam optimizer test to improve numerical stability so tight tolerances can be kept (suggestions from @crcrpar)
* Use the distributed test infrastructure in the distributed Adam unit test (suggestion from @crcrpar)
-
- 22 Jun, 2022 1 commit
-
-
Tim Moon authored
* Gradient clipping routine with fused kernels: identical API to PyTorch; falls back to the PyTorch impl when not computing the L2 norm.
* Add a unit test for gradient clipping, plus an fp16 case
* Tweaks to the grad clipping unit test (review suggestions from @crcrpar)
* Debug gradient clipping tests: when checking that incorrect results produce assertion errors, make sure to generate a discrepancy outside the range of numerical error.
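A hedged usage sketch of the fused gradient-clipping routine; the import path is an assumption (check apex/contrib for the exact module), while the call signature mirrors torch.nn.utils.clip_grad_norm_ as the commit states:

```python
import torch
from apex.contrib.clip_grad import clip_grad_norm_  # assumed import path

model = torch.nn.Linear(512, 512).cuda()
model(torch.randn(4, 512, device="cuda")).sum().backward()

# Same API as torch.nn.utils.clip_grad_norm_: with the L2 norm the fused
# kernels are used; other norm types fall back to the PyTorch implementation.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
print(total_norm)
```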
-
- 16 Jun, 2022 1 commit
-
-
Kevin Stephano authored
Remove legacy fuser usage from multihead attention in contrib in favor of the default, which should be nvfuser. Modify test scripts to activate fusion. (#1403)
-
- 14 Jun, 2022 1 commit
-
-
Tim Moon authored
Adjust test options to have tighter tolerances.
-
- 13 Jun, 2022 1 commit
-
-
Tim Moon authored
-
- 31 May, 2022 1 commit
-
-
Hubert Lu authored
* Make rocblas_gemm_flags_fp16_alt_impl backward-compatible with the new naming
* Use BACKWARD_PASS_GUARD_CLASS to prevent a lengthy if-statement
-
- 29 Apr, 2022 1 commit
-
-
yjk21 authored
-
- 21 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* guard
* update
* remove unnecessary version guard
* runtime version guard
* cosmetic
* skip tests appropriately
-
- 19 Apr, 2022 1 commit
-
-
Masaki Kozuki authored
* bump version
* add guard
* fix the cond
-
- 14 Apr, 2022 2 commits
-
-
mahathis authored
* Add support for the memory format API (torch.channels_last) in GBN.
  Group Batch Norm (GBN) is an NHWC operation: it assumes that the underlying memory format of an input tensor is NHWC, and it originally did not support PyTorch's memory_format API. To support PyTorch's memory_format API, i.e. .to(memory_format=...) or .contiguous(memory_format=...), we add the torch_channels_last flag to indicate whether the workload adopts the PyTorch memory_format API by setting memory_format=torch.channels_last. This flag allows GBN to handle the memory formats of input tensors properly.
  An example of using memory_format in GBN:
    """
    from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC
    GBN = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True)
    """
  The cases that GBN handles are as follows:
  1. torch_channels_last=True and the input tensor's memory_format is torch.channels_last: GBN generates a torch.channels_last output tensor.
  2. torch_channels_last=True and the input tensor's memory_format is torch.contiguous_format: GBN converts the input tensor to torch.channels_last and generates a torch.channels_last output tensor.
  3. torch_channels_last=False and the input tensor's memory_format is torch.contiguous_format: GBN generates a torch.contiguous_format output tensor.
* Add GBN unit tests for the channels_last memory format
Co-authored-by: hubertlu-tw <hubertlu@amd.com>
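A short runnable sketch of case 1 from the list above; the half-precision dtype, tensor sizes, and device placement are assumptions, while the constructor arguments come from the commit message:

```python
import torch
from apex.contrib.groupbn.batch_norm import BatchNorm2d_NHWC

planes = 64
# torch_channels_last=True signals that the workload uses PyTorch's memory_format API.
gbn = BatchNorm2d_NHWC(planes, fuse_relu=True, bn_group=1, torch_channels_last=True).cuda()

# Case 1: input already channels_last -> output stays channels_last.
x = torch.randn(8, planes, 56, 56, device="cuda", dtype=torch.half)
x = x.contiguous(memory_format=torch.channels_last)
y = gbn(x)
assert y.is_contiguous(memory_format=torch.channels_last)
```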
-
Thor Johnsen authored
-