Commits · 4b8ffef40d988059b41c4abcd1bb7a5a76b52a8e · OpenDAS / TransformerEngine

01 Nov, 2024 2 commits

[JAX] Fix for Disable FusedAttn with FFI by default (#1304) · 4b8ffef4
Phuong Nguyen authored Nov 01, 2024
```
rm default value for NVTE_JAX_FUSED_ATTN_WITH_FFI
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
4b8ffef4

Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam (#1078) · 05c0fb02

Kunlun Li authored Nov 02, 2024



* Add precision aware fused adam
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Minor changes based on review comments.
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>

---------
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

05c0fb02

31 Oct, 2024 2 commits

[TE/JAX] Disable FusedAttn with FFI by default (#1298) · 23caab3f
Phuong Nguyen authored Oct 31, 2024
```
* disable fused attn with ffi

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
23caab3f

[TE/JAX] Custom call with FFI - lowering all attributes with bind all (#1289) · 9dddb36d

Phuong Nguyen authored Oct 31, 2024



* lowering a dict of attrs

* improve err message with line and func info

* implement a product() for ffi dimensions

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

9dddb36d

30 Oct, 2024 2 commits

[JAX] Consolidate FFI and old descriptor implementation for fused attention. (#1295) · c036765b

Michael Goldfarb authored Oct 29, 2024

Consolidate FFI and old descriptor implementation for fused attention.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

c036765b

Add missed arguments of apply_rotary_pos_emb in MHA (#1296) · ed1e85c4

Xiaowei Ren authored Oct 29, 2024



* add missed arguments of apply_rotary_pos_emb in MHA
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove an unnecessary f
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add one more assert for cp_group len
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ed1e85c4

29 Oct, 2024 2 commits

Add check for GPU availability in attention (#1287) · 8bdb54fe

Charlene Yang authored Oct 29, 2024



* check if GPU is available
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8bdb54fe

[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common (#1067) · 933294dc

Alp Dener authored Oct 29, 2024



* moved userbuffers code to TE/common
Signed-off-by: Alp Dener <adener@nvidia.com>

* moved comm+GEMM overlap code to TE/common
Signed-off-by: Alp Dener <adener@nvidia.com>

* removed PyTorch dependency from comm+GEMM overlap in TE/common
Signed-off-by: Alp Dener <adener@nvidia.com>

* added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common
Signed-off-by: Alp Dener <adener@nvidia.com>

* updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code
Signed-off-by: Alp Dener <adener@nvidia.com>

* updated unit tests to work with refactored comm+GEMM overlap code
Signed-off-by: Alp Dener <adener@nvidia.com>

* added a pylint exception to comm+GEMM overlap test runner
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixing linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* added documentation for te.initialize_ub
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed compile errors when building with NVTE_UB_WITH_MPI=1
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed default bootstrap backend
Signed-off-by: Alp Dener <adener@nvidia.com>

* switched default bootstrap backend priority to MPI > Gloo > NCCL
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* updated bootstrap backend documentation
Signed-off-by: Alp Dener <adener@nvidia.com>

* close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv
Signed-off-by: Alp Dener <adener@nvidia.com>

* added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed incorrect read of environment variables
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping
Signed-off-by: Alp Dener <adener@nvidia.com>

* moved multicast support check to cuda_runtime.h and replaced cudaDeviceGetProp call with cached sm_count()
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* removed commented out old code and replaced external collective function type defines with aliases
Signed-off-by: Alp Dener <adener@nvidia.com>

* compile-time CUDA version guard for CUDA Driver Multicast attribute
Signed-off-by: Alp Dener <adener@nvidia.com>

* added compile-time CUDA version guards to Multicast code in Userbuffers
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* condensed UB docs, corrected const violations
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed incorrect UB type reporting for P2P overlaps, comment reformatting
Signed-off-by: Alp Dener <adener@nvidia.com>

* add docstring to tex.ubuf_built_with_mpi()
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

933294dc
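
The bullet "switched default bootstrap backend priority to MPI > Gloo > NCCL" describes a fallback order. A minimal sketch of that kind of selection follows, assuming the set of usable backends is already known; `BOOTSTRAP_PRIORITY`, `pick_bootstrap_backend`, and the `requested` override are hypothetical names for illustration, not the TE API.

```python
# Hypothetical illustration of a "first available wins" priority order.
# The order mirrors the commit message: MPI > Gloo > NCCL.
BOOTSTRAP_PRIORITY = ("mpi", "gloo", "nccl")


def pick_bootstrap_backend(available, requested=None):
    """Pick a bootstrap backend name from the set of available ones.

    An explicit request wins if it is available; otherwise the first
    entry of BOOTSTRAP_PRIORITY that is available is returned.
    """
    if requested is not None:
        if requested not in available:
            raise ValueError(f"requested backend {requested!r} is not available")
        return requested
    for name in BOOTSTRAP_PRIORITY:
        if name in available:
            return name
    raise RuntimeError("no bootstrap backend available")
```

With only Gloo and NCCL present, the sketch falls back to Gloo; NCCL is used only when nothing earlier in the order is available.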

28 Oct, 2024 1 commit

[PyTorch] Remove fast param getter from modules (#1291) · 35bbe740

Tim Moon authored Oct 28, 2024



* Add fallback for fast param getter
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove fast param getter
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

35bbe740

25 Oct, 2024 3 commits

[C/PyTorch] Add max_t support for THD (#1244) · 7fb22c37

Charlene Yang authored Oct 25, 2024



* WIP: add max_t support for THD
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* WIP: save tensors for debug and point to new FE
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix stats in bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix stats in fwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add docstring for DPA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add docstring
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: first try on adding max_b and max_t
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit c3d522e9f5aef3c8ddfec5bf6ff24c3db97bb059.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "WIP: first try on adding max_b and max_t"

This reverts commit 3bc01ebaf2aa846fd16634e2d33b0d0f5803a076.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update docstring and fix max_seqlen logic for thd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert two lines of change in docstring
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add get_max_b/t
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix max_seqlen code and docstring
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* success: add max_b/max_t
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove debug code
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change max_b/max_t buckets
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix b vs orig_b
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix b vs orig_b with 0 fill
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE for T3HD/TH3D
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add max_b to conversion kernels
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix changes after last merge
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add Jax support for max_t
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update FE to 1.8.0-rc
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE to 1.8.0
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* code review/formatting fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix Stats shape for <9.6
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* return nullptr for offset_stats when cudnn < 9.6
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add more version control
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

7fb22c37

[C/PyTorch] Add THD MQA/GQA (#1266) · 83f9cc09

Charlene Yang authored Oct 25, 2024



* add THD MQA/GQA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix nvte_get_fused_attn_backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

83f9cc09

[TE/JAX] Update required JAX version for FFI custom calls with cudaGraph (#1285) · 7cef7566
Phuong Nguyen authored Oct 25, 2024
```
Update jax version for ffi
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
7cef7566

24 Oct, 2024 3 commits

[Paddle] Update type names for Paddle 3.0 (#1286) · 7a5fd0c9
Tim Moon authored Oct 24, 2024
```
Update class names for Paddle 3.0
Signed-off-by: Tim Moon <tmoon@nvidia.com>
```
7a5fd0c9

[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose,... · 18c2234c

Hua Huang authored Oct 24, 2024


[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)

* Add TransposeFFI, test passed
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add ActLuFP8FFI; fix TransposeFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add QuantizeFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add FusedAttnForwardFFI and some unit tests
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Minor fix
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revise FusedAttnForwardFFI()
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add FFI_CudaGraph_Traits

All tests passed, ready for merge
Signed-off-by: Hua Huang <huah@nvidia.com>

* Bug fix for FFI data type mismatch

Also add a safeguard on the entrance to FFI function
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

18c2234c

[JAX] Fix correctness of JAX fused attention with CP and improve numerics... · 20c75295

Michael Goldfarb authored Oct 24, 2024


[JAX] Fix correctness of JAX fused attention with CP and improve numerics check in unit tests (#1282)

Fix correctness of JAX fused attention with CP.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

20c75295

22 Oct, 2024 2 commits

Add THD + GQA support (#1260) · d9b4bfb5

Reese Wang authored Oct 23, 2024



Add THD + GQA support for cuDNN >= 9.6
Signed-off-by: Reese Wang <rewang@nvidia.com>

d9b4bfb5

Fused Attention Support 64-bit Ragged Offsets for Large THD Tensors (#1230) · 7b18f235

Michael Goldfarb authored Oct 22, 2024



* Use 64-bit offsets for cuDNN 9.5+
* Align workspace tensors to 16B.
* Fix bug where std::accumulate overflowed on large tensor shapes.
* Only support 64-bit offsets on arbitrary sequence length fp16 backend.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

7b18f235
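
Two of the bullets above concern arithmetic that is easy to get wrong: an element-count product that overflows a 32-bit accumulator, and rounding sizes up to a 16-byte boundary. The sketch below simulates both in Python. It is an illustration only: the TE code is C++, and the helper names (`wrap_int32`, `product_int32`, `product_wide`, `align_up`) are hypothetical. In C++, `std::accumulate` accumulates in the type of its initial value, so an `int` initial value gives a 32-bit accumulator on common platforms.

```python
from functools import reduce
import operator


def wrap_int32(value):
    """Reinterpret an integer as a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def product_int32(dims):
    """Multiply dims while wrapping to 32 bits after every step.

    Simulates accumulating a shape product in a 32-bit integer.
    """
    acc = 1
    for d in dims:
        acc = wrap_int32(acc * d)
    return acc


def product_wide(dims):
    """Multiply dims without wrapping (Python ints are unbounded)."""
    return reduce(operator.mul, dims, 1)


def align_up(nbytes, alignment=16):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


# Made-up shape with 2**33 elements, too many for a 32-bit count.
shape = (65536, 16, 128, 64)
wrapped = product_int32(shape)   # wraps around
exact = product_wide(shape)      # true element count
```

For this shape the wrapped product comes out as 0 while the exact count is 8589934592, which is why a wider accumulator (and 64-bit offsets) is needed for large THD tensors.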

17 Oct, 2024 4 commits

Fix seq_dim in CP implementation (#1264) · a488b8b1
Xiaowei Ren authored Oct 17, 2024
```
fix seq_dim in CP implementation
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
```
a488b8b1

[TE/JAX] Enabling CudaGraph for custom calls with FFI (#1228) · 12f30ead

Phuong Nguyen authored Oct 17, 2024



* register CmdBufferCompatible traits via C++ API

* renamed FFI_Traits

* use register_ffi_target()

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

12f30ead

[Bugfix] Fix bias for 0-dim tensors in gemm (#1246) · 8e97c8da

Xin Yao authored Oct 17, 2024



* fix bias for 0-dim tensor
Signed-off-by: Xin Yao <xiny@nvidia.com>

* add check
Signed-off-by: Xin Yao <xiny@nvidia.com>

* use numel() instead of nullptr
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>

8e97c8da

[PyTorch] Fix wgrads for GroupedLinear when weights don't require grad (#1258) · 2d7020e2

Xin Yao authored Oct 17, 2024



Fix wgrad for GroupedLinear when weights don't require grad
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

2d7020e2

16 Oct, 2024 5 commits

[PyTorch] Fix FP8 activation recompute (#1254) · a5181512
Kirthi Shankar Sivamani authored Oct 16, 2024
```
Fix FP8 activation recompute
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
a5181512

Upgrade pylint to 3.3.1 (#1257) · 6e90fcb7

Kirthi Shankar Sivamani authored Oct 16, 2024



* Upgrade pylint and first round formatting
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* round 2
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* round 3
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Format and fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Paddle lint
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Reviews
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* More linting
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Run formatter
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Paddle lint
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

6e90fcb7

[PyTorch] Drop FA as an installation requirement (#1226) · 161b1d98

Charlene Yang authored Oct 15, 2024



* WIP: make FA2 optional
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* WIP: fix logic
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor tweak
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add L1 test to test all supported FA versions
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update version to 2.1.1 and trim L1 tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update onnxruntime version
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove onnxruntime from L1 FA versions tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

161b1d98

fix assertion bug for SWA API in TE-JAX (#1242) · 43b9e1ee

Md Fahim Faysal Khan authored Oct 15, 2024



fixed assertion bug for SWA
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

43b9e1ee

[PyTorch] Build custom ORT ops before running ONNX export tests (#1252) · f6b766bd

Tim Moon authored Oct 15, 2024



* Build custom ORT ops before running ONNX tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove ONNX from context parallelism tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Export ONNX ops that do compute in FP32

Matches internal impl of TE kernels.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add build script for custom ORT ops
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

f6b766bd

15 Oct, 2024 1 commit

Check for backend support in Jax context parallel fused attention test (#1227) · 20c55e46
Michael Goldfarb authored Oct 15, 2024
```
Update test to check support for context parallel attention.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
```
20c55e46

14 Oct, 2024 1 commit

Do not link against CUDA driver when building (#1240) · 86f07be4
Tim Moon authored Oct 14, 2024
```
Signed-off-by: Tim Moon <tmoon@nvidia.com>
```
86f07be4

12 Oct, 2024 1 commit

[PyTorch] Let Fused RoPE support CP with THD format (#1238) · 55dcbb4b

Xin Yao authored Oct 12, 2024



* Let Fused RoPE support THD with CP
Signed-off-by: Xin Yao <xiny@nvidia.com>

* add comment
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>

55dcbb4b

11 Oct, 2024 2 commits

Add FlashAttention3 to CP implementations (#1232) · b36bd0a4

Xiaowei Ren authored Oct 11, 2024



* fa2 function import renaming
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* refine fa_fwd_kwargs and fa_bwd_kwargs
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* import FA3 functions for CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix output of FA3 fwd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix rng_state in a2a implementation with FA3
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* hack lse correction for packed lse format
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make CP thd out correction work with packed lse
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix for packed softmax_lse
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix softmax_lse shape
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* change lse_packed to constexpr
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

b36bd0a4
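
Several bullets above mention "lse correction" and "out correction". The general idea, when attention is computed chunk by chunk (as in context parallelism), is that each chunk returns its output together with the log-sum-exp (LSE) of its scores, and the partial results are rescaled and summed. A minimal sketch of that technique for a single query with scalar values follows; the function names are hypothetical, and this is not TE's implementation nor its packed LSE layout.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted average over one chunk, plus the chunk's LSE."""
    m = max(scores)                              # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    lse = m + math.log(total)
    return out, lse


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two chunk results into the result over both chunks."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    # Each chunk's output is reweighted by its share of the softmax mass.
    out = out_a * math.exp(lse_a - lse) + out_b * math.exp(lse_b - lse)
    return out, lse


# Made-up scores and values for one query over four keys.
scores = [0.5, -1.0, 2.0, 0.3]
values = [1.0, 2.0, 3.0, 4.0]

full_out, full_lse = partial_attention(scores, values)
out_a, lse_a = partial_attention(scores[:2], values[:2])
out_b, lse_b = partial_attention(scores[2:], values[2:])
merged_out, merged_lse = merge_partials(out_a, lse_a, out_b, lse_b)
```

The merged result matches the one computed over all keys at once up to floating-point rounding, so the merge depends on each chunk reporting its LSE in the shape the merging code expects.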

Fix bug in torch compile and seqdim is integer (#1217) · 9ee2dbdd

李金梁 authored Oct 12, 2024



* Fix bug in torch compile and seqdim is integer
Signed-off-by: 李金梁 <975761915@qq.com>

* Update attention.py

change the jit_fuser to torch.compile on flash_attn_fwd_out_correction
Signed-off-by: 李金梁 <975761915@qq.com>

* Annotate fused functions
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: 李金梁 <975761915@qq.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

9ee2dbdd

10 Oct, 2024 2 commits

Small fixes to Float8Tensor (#1225) · 3b89c36f

Przemyslaw Tredak authored Oct 10, 2024



* Fixes to Float8Tensor
Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Przemyslaw Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

3b89c36f

[JAX] Expose sliding window attn to TE-JAX API (#1205) · 85e60e64

Hua Huang authored Oct 10, 2024



* Expose JAX sliding window attn API
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* No SWA in context parallel; fix RNG seed in test
Signed-off-by: Hua Huang <huah@nvidia.com>

* Handle SWA API discrepancy in cuDNN and Python
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add SWA API for flax, all tests passed

Will update tests/jax/test_praxis_layers.py next
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update test_praxis_layers.py for SWA, test passed
Signed-off-by: Hua Huang <huah@nvidia.com>

* Use tuple window_size; update for PR #1212
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add and adjust some pytest.skip
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revised following Reese Wang's comments

Still need further debugging:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:

These errors do not exist in the previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix no-SWA test case errors in previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add Padding mask w/ sliding windows sanity tests
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use float32 for the reference code softmax calculation
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>

85e60e64

09 Oct, 2024 3 commits

[PyTorch] Improve `get_qkv_layout` (#1214) · 5b6546c8

Charlene Yang authored Oct 09, 2024



* improve get_attention_backend logic
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* polish logic and wording
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove redundant comment
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5b6546c8

[PyTorch] Add documentation for FP8 attention checkpointing (#1223) · 2d875521

Charlene Yang authored Oct 09, 2024



* add extra_state change description for different TE versions
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add FAQ page
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FAQ page
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix extra_state tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

2d875521

[PyTorch] Debug dtype casting in operation-based API (#1202) · 5b89f1ad

Tim Moon authored Oct 08, 2024



* Handle Float8Tensor when casting module dtype

Keep data in Float8Tensor and only change nominal dtype. Monkey-patch PyTorch module casting functions to handle Float8Tensor. Add tests.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Respect autocast dtype in linear op
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Suppress linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Suppress linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak comments

Review suggestion from @ptrendx
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5b89f1ad

08 Oct, 2024 1 commit

[PyTorch] Miscellaneous fixes for FA3 attention (#1174) · e762592e

Charlene Yang authored Oct 08, 2024



* add qkv descales to FA3
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix sbhd shapes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* force the same dtype when comparing FA3 and cuDNN FP8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "force the same dtype when comparing FA3 and cuDNN FP8"

This reverts commit 19e7f877026a19a32d2f02c6c9de20df4ae2e064.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* force the same dtype when comparing FA3 and cuDNN FP8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add try/except for FA3 when custom qkv descales are not supported
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace FA3 installation warning with a debug logging message
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove unused imports
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* avoid varlen_func for FP8 and improve messaging
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add SWA support for FA3
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change preference reason for FP8 logic
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

e762592e

07 Oct, 2024 2 commits

Fix cuDNN sliding window size (#1212) · c3b3cd21

Charlene Yang authored Oct 07, 2024



* adjust window size to (i-window_size_left,i] for cuDNN
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reduce the window to make any errors more pronounced
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c3b3cd21
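
The interval `(i-window_size_left,i]` named in the commit above can be illustrated with a small mask builder. This is only a sketch of the half-open interval as the commit message writes it, not the TransformerEngine or cuDNN code; the function name `sliding_window_mask` and the pure-Python list representation are assumptions made for illustration.

```python
def sliding_window_mask(seq_len, window_size_left):
    """Build mask[i][j], True where query position i may attend to key position j.

    Assumes the window for query i is the half-open interval
    (i - window_size_left, i]: the left bound is excluded, the right bound
    (the query's own position) is included.
    """
    return [
        [(i - window_size_left) < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window_size_left=2)
# Query 3 sees keys 2 and 3 only; key 1 sits on the excluded left bound.
row_3 = mask[3]
```

With this reading each query attends to at most `window_size_left` keys, which is why an off-by-one at the left bound changes the result.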

Hierarchical CP implementation (Ulysses + Ring) (#1209) · c24a4c41

Xiaowei Ren authored Oct 07, 2024



* change API for hierarchical CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* move fp8 code before qkv reshape
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* try to insert A2A for hierarchical CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make fwd work
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove a redundant sync
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make bwd of hierarchical CP work
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix dout a2a in bwd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix q_f16 with fp8
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* assert hierarchical CP implementation does not support THD format
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* assert hierarchical CP does not support attn bias
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add unit test for hierarchical CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix cp_comm_type in unit test
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix and code cleaning
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* an assert info change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* dout shape fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* move function definitions to the front of the first call
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix tensor view comments
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* refine CP unit test
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* save cp_size_a2a and rank_a2a in fwd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add more explanations of cp_group in doc_string
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

c24a4c41

04 Oct, 2024 1 commit

[PyTorch] Minor optimizations to reduce CPU overheads in modules (#1191) · 9d976bcd

Tim Moon authored Oct 03, 2024



* CPU perf optimization in linear autograd function

Avoid enable_grad context when possible in cast function. Cache distributed group properties.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* CPU perf optimization in prepare_forward function

Avoid torch.nn.Module impl of __setattr__.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid module import in TE module forwards
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use fast getter for params
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Reuse tensor dims in linear autograd func
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Apply optimizations to grouped linear
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid deepcopy in tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Move _fast_setattr logic to __setattr__ method
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

9d976bcd
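
The "Avoid torch.nn.Module impl of __setattr__" item in the commit above refers to a general pattern: `torch.nn.Module.__setattr__` inspects every assigned value to decide whether it is a parameter, buffer, or submodule, so frequent assignments of plain flags pay for checks they do not need. A minimal sketch of the bypass follows, assuming a stand-in base class rather than PyTorch itself. `SlowBase`, `Tracked`, `FastModule`, and the attribute names in `_FAST_ATTRS` are all hypothetical and are not taken from the TransformerEngine source.

```python
class Tracked:
    """Hypothetical marker for values the base class must register."""


class SlowBase:
    """Stand-in for a base class whose __setattr__ does bookkeeping on every assignment."""

    def __init__(self):
        object.__setattr__(self, "registry", {})
        object.__setattr__(self, "slow_path_calls", 0)

    def __setattr__(self, name, value):
        # Count the call and run the type check, as a registering base class would.
        object.__setattr__(self, "slow_path_calls", self.slow_path_calls + 1)
        if isinstance(value, Tracked):
            self.registry[name] = value
        object.__setattr__(self, name, value)


class FastModule(SlowBase):
    # Hypothetical names of plain attributes that never need registration.
    _FAST_ATTRS = frozenset({"fp8", "sequence_parallel"})

    def __setattr__(self, name, value):
        if name in self._FAST_ATTRS:
            # Skip the base-class bookkeeping for known plain attributes.
            object.__setattr__(self, name, value)
        else:
            super().__setattr__(name, value)


m = FastModule()
m.fp8 = True          # fast path, no bookkeeping
m.weight = Tracked()  # slow path, value is registered
```

Routing only a fixed set of names through the fast path keeps registration intact for everything else, which is the main risk when overriding `__setattr__` on a registering base class.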