Commits · f0ed3d509f299198bec849c81fc8eda3dae77f3f · OpenDAS / TransformerEngine

24 Apr, 2024 1 commit

[PyTorch] Avoid using LRU cache for cu_seqlens (#798) · f0ed3d50

Kirthi Shankar Sivamani authored Apr 24, 2024



* Try using global buffer for cu_seqlens
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Avoid using functools.lru_cache
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

f0ed3d50

16 Apr, 2024 1 commit

[C/PyTorch] Add FP8 DPA and MHA (#768) · 83a4c219

cyanguwa authored Apr 15, 2024



* WIP: fp8 v1 fprop integration
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add debug info
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add more debug info
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fprop working for h1; w/ debug info
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add bprop
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* cleanup; bprop running but has mismatches
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add gitlab frontend as submodule
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up and add back v0.9.2 FE support; fprop/bprop passing with 5e-2 tols
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix after merge; add bias_b/h to caching descriptor
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* distinguish fwd/bwd tensor types for bprop
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix for F16 cases; include added dqkv_type and d_scale_dp
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* adjust out shape for bwd in test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add casting from/to FP8 to DPA module
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: bshd_bshd_bshd layout
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: support all sbhd/bshd layouts
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add qkvpacked and kvpacked support in both FusedAttnFunc and C levels
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove qkvpacked/kvpacked calls in DPA module (used for testing)
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove tp setup; add allow_non_contiguous; update FE; revert to sbh3d in tests; clean up
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add NVTE_FP8_DPA_BWD to control whether to use FP8 bwd or F16 bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix MQA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix MQA/GQA in FP8 v1 API
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE to 705d8e3, with API change
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* test causal mask
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* restrict mha_fill for THD format
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fused attn with CP and comment out is_alibi code
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up FE0.9 vs FE1.0 FP8 implementations, and related unit tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change NVTE_FP8_DPA_BWD default to 1, and fix its use in qkvpacked/kvpacked APIs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint and self.tp_size/group in FusedAttention()
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE to 6902c94
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add FP8 MHA support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to FE v1.3.0
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes for FP8 MHA with different configs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* emit stats regardless of is_training
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix linear when input is not Float8Tensor
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix d_out type when f16 bprop
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix user buffer for layernorm_linear/linear and revert two FP8 casts in MHA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add docstring for fp8_dpa/mha in recipe
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix backend selection to avoid FA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace transpose with transpose_2d
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* use RMSE for FP8 unit tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace two more transpose with transpose_2d
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add FP8 initialization to FusedAttention
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rm docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Revert "add FP8 initialization to FusedAttention"

This reverts commit 15fffd825d6f23f31ea709b16ba01dfd61efabf8.
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change order of ctxs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add back docs and mark as beta
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes for tests and docs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

83a4c219

12 Apr, 2024 1 commit

[PyTorch] cuda graph support (#575) · 73f8d90f

Kirthi Shankar Sivamani authored Apr 12, 2024



* FP8 cuda graphs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Charlene Yang <charleney@nvidia.com>

* Fix numerics
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* exclude torch compile from numerics tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* More numerics fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix CI
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rm fusion from unfused path
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Charlene Yang <charleney@nvidia.com>

73f8d90f

06 Apr, 2024 1 commit

Enable DGRAD RS overlap (#754) · e3de4037

Jaemin Choi authored Apr 05, 2024



* Enable DGRAD RS overlap
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* fix lint; apply suggestions
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

e3de4037

03 Apr, 2024 1 commit

Revert "Update FA version to 2.5.6 (#714)" · 47276e1b

Kirthi Shankar Sivamani authored Apr 02, 2024

This reverts commit 965803c9.

47276e1b

21 Mar, 2024 2 commits

TP-RS overlap with send/recv ring-exchange (#724) · b855656b

Sangkug Lym authored Mar 21, 2024



* TP-RS overlap with send/recv

Atomic GEMM based TP-RS overlap with send/recv
Signed-off-by: Sangkug Lym <slym@nvidia.com>

Specify userbuffer overlap method of each overlap instance
Signed-off-by: Sangkug Lym <slym@nvidia.com>

P2P TP-RS overlap with fp8 GEMM outputs
Signed-off-by: Sangkug Lym <slym@nvidia.com>

Fix TP-RS overlap with send/recv
Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup
Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup
Signed-off-by: Sangkug Lym <slym@nvidia.com>

* linting
Signed-off-by: Sangkug Lym <slym@nvidia.com>

* fix typo
Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

b855656b

[PyTorch] Update context parallel softmax lse correction func (#716) · 59bfc17b

Kite0011 authored Mar 21, 2024



[PyTorch] Update context parallel softmax lse correction func.
Signed-off-by: kitefang <kitefang@tencent.com>
Co-authored-by: kitefang <kitefang@tencent.com>

59bfc17b
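
For readers unfamiliar with the softmax log-sum-exp (LSE) correction used in context parallelism, the idea can be sketched as follows. This is a minimal sketch in plain Python with scalar values, not the function changed in #716; `merge_lse`, `attend`, and `merge_partial` are hypothetical names. It assumes each rank attends to its own chunk of keys and returns a partial output plus the LSE of its scores, which are then combined.

```python
import math


def merge_lse(lse_a, lse_b):
    """Stable log(exp(lse_a) + exp(lse_b)) that never exponentiates a large value."""
    hi, lo = max(lse_a, lse_b), min(lse_a, lse_b)
    return hi + math.log1p(math.exp(lo - hi))


def attend(scores, values):
    """Softmax-weighted average of scalar values; returns (output, lse)."""
    m = max(scores)
    denom = sum(math.exp(s - m) for s in scores)
    out = sum(math.exp(s - m) * v for s, v in zip(scores, values)) / denom
    return out, m + math.log(denom)


def merge_partial(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key chunks."""
    lse = merge_lse(lse_a, lse_b)
    out = out_a * math.exp(lse_a - lse) + out_b * math.exp(lse_b - lse)
    return out, lse


scores = [0.1, 2.0, -1.0, 0.5]
values = [1.0, 2.0, 3.0, 4.0]
full = attend(scores, values)
merged = merge_partial(*attend(scores[:2], values[:2]),
                       *attend(scores[2:], values[2:]))
```

Here `merged` agrees with `full` up to floating-point rounding, and `merge_lse` stays finite for inputs such as `1000.0`, where a direct `math.exp` would overflow.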

20 Mar, 2024 1 commit

Update FA version to 2.5.6 (#714) · 965803c9

Kirthi Shankar Sivamani authored Mar 20, 2024

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

965803c9

06 Mar, 2024 1 commit

[PyTorch] Adjusted the logic of MHA and DPA to enable speculative decoding (#668) · b459ccc9

Oleg Goncharov authored Mar 06, 2024



* Modified MHA and DPA logic to use causal softmax and FA for inference
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Adjusted unfused attention and softmax logic for inference
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cleaned up the code per pylint
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added test cases to evaluate numerics of incremental decoding
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Apply suggestions from code review
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

* Apply suggestions from code review [sequence start-end]
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

* Apply suggestions from code review [inference_params offset update]
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

* Fixed bug in KV-cache indices and updated test suite
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added inference_params description and applied suggestions from the code review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Adjusted absolute tolerances in numerics tests
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Cleaned up the files per pylint
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

b459ccc9

28 Feb, 2024 1 commit

[C/PyTorch/Jax] Add support for more bias shapes (#677) · b8eea8aa

cyanguwa authored Feb 28, 2024



* added support for arbitrary bias shapes for fused_attn
Signed-off-by: Alp Dener <adener@nvidia.com>

* Fix linting
Signed-off-by: Alp Dener <adener@nvidia.com>

* Add b1ss/bhss/11ss bias shapes when not requiring dBias
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add bias_b/h to plan cache
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fixed compile errors after PR653 merge
Signed-off-by: Alp Dener <adener@nvidia.com>

* updated JAX unittests for new bias shapes
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed mismatched mask type checking
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected skip condition
Signed-off-by: Alp Dener <adener@nvidia.com>

* fix selection logic for A100s
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* corrected skip checks for bias shapes
Signed-off-by: Alp Dener <adener@nvidia.com>

* resolved test issues but neginf with float16 is still problematic with JAX
Signed-off-by: Alp Dener <adener@nvidia.com>

* new bias shapes passing TE JAX CI for seqlen <= 512, seq_q == seq_kv and h_q == h_kv conditions
Signed-off-by: Alp Dener <adener@nvidia.com>

* TE/JAX fused attn tests for new bias shapes passing with neg_inf=-2**27 for Bfloat16 and -2**15 for Float16
Signed-off-by: Alp Dener <adener@nvidia.com>

* code style fixes and test parameter ID cleanup
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed incorrect skip condition for backward fused attn test
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Alp Dener <adener@nvidia.com>

b8eea8aa

24 Feb, 2024 1 commit

[PyTorch] Non-reentrant mode for activation recompute (#670) · 82bc797f

Alp Dener authored Feb 23, 2024



* added non-reentrant mode support to TE checkpoint
Signed-off-by: Alp Dener <adener@nvidia.com>

* updated get_cuda_rng_tracker kwarg to get_rng_state_tracker to remain consistent with other TE API
Signed-off-by: Alp Dener <adener@nvidia.com>

* docstring cleanup
Signed-off-by: Alp Dener <adener@nvidia.com>

* added mechanism to disable bias_gelu_nvfusion in LayerNormMLP when checkpointing in non-reentrant mode
Signed-off-by: Alp Dener <adener@nvidia.com>

* refactored checkpoint and recompute hook names to match PyTorch implementation
Signed-off-by: Alp Dener <adener@nvidia.com>

* Fixed incorrect reference before assignment
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed argument error in calling native PyTorch checkpoint
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting errors for missing docstrings
Signed-off-by: Alp Dener <adener@nvidia.com>

* Fix lint
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* bias GELU fusion consistency between checkpoint test and reference comparison
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

82bc797f

15 Feb, 2024 1 commit

Use fused implementation of RoPE in MultiHeadAttention (#658) · 8d62d5c2

Przemyslaw Tredak authored Feb 15, 2024



* Use fused implementation of RoPE in MultiHeadAttention
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix freqs dtype
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

8d62d5c2

08 Feb, 2024 1 commit

[C++/PyTorch] Add alibi_slopes support (#608) · 94de051f

cyanguwa authored Feb 08, 2024



* test alibi between fa and fu
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move alibi slopes and bias to global to avoid repeating calculation
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix alibi slopes/bias generation
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix _is_flash_attention_supported to allow alibi type
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable padding mask when alibi is used for fused attn arbi backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add support for custom [n_heads] alibi_slopes in flash, fused, unfused attention
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up last commit
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove alibi_type=none tests as they are unnecessary
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update cudnn-frontend to 1.0.2
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change bias/dbias shape to allow b,1/1,h/b,h in arbi backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* tweak tests for arbi post_scale_bias [1,h,s,s] or alibi_slopes [n_heads]
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change bias/dbias shape in max512 backend - incomplete
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove max512 changes from last commit and disable max512 (and arbi temporarily) for [b, h, s, s]; pending cuDNN backend support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up and tweak backend selection logic
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace || with () in docstring
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix bias shape for max512 backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* combine slopes/bias generation to one function get_alibi() and fix alibi tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix PR557 bugs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/pytorch/attention.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

* encapsulate global alibi tensors into a dict cache
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reduce alibi slopes test size
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to cudnn-frontend 1.0.3
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* use dBias shape to define bias_b/bias_h because jax materializes dBias rather than Bias in bwd abstract
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

94de051f
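
As background for the `alibi_slopes` commits above, here is a minimal sketch of how per-head ALiBi slopes and the resulting bias are commonly generated. It is not the `get_alibi()` function from #608: `alibi_slopes` and `alibi_bias` are hypothetical helpers, the head count is assumed to be a power of two, query and key lengths are assumed equal, and nested lists stand in for a `[n_heads, s, s]` tensor.

```python
def alibi_slopes(n_heads):
    """Geometric slope sequence 2^(-8/n), 2^(-16/n), ... for n_heads heads."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles a power-of-two head count")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


def alibi_bias(slopes, seq_len):
    """Bias[h][i][j] = -slope_h * |j - i|, added to attention scores before softmax."""
    return [
        [[-slope * abs(j - i) for j in range(seq_len)] for i in range(seq_len)]
        for slope in slopes
    ]


slopes = alibi_slopes(4)
bias = alibi_bias(slopes, seq_len=3)
print(slopes)  # [0.25, 0.0625, 0.015625, 0.00390625]
```

A custom `[n_heads]` slope list, as mentioned in the commit messages, would simply replace the output of `alibi_slopes` in this sketch.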

06 Feb, 2024 1 commit

[PyTorch] Refactor caching of cumulative sequence lengths (#630) · da30634a

Tim Moon authored Feb 05, 2024



Do not cache sequence lengths based on layer number
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

da30634a

03 Feb, 2024 1 commit

Update cudnn-frontend to 1.0.3 to fix cuDNN v9 SDPA NaNs (#650) · 2aee0591

cyanguwa authored Feb 02, 2024



* Update cudnn frontend to 1.0.3 to fix cudnn v9 NaNs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* make d_out contiguous for bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove cudnnDestroy to let torch handle it
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/pytorch/attention.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/pytorch/attention.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/pytorch/attention.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

2aee0591

30 Jan, 2024 1 commit

Fixed offloading for PyT version/ Added Attention activation offloading support/ Native FP8 support (#632) · 44574def

Selvaraj Anandaraj authored Jan 29, 2024

* Fixed offloading for PyT version/ Added Attention activation offloading support/ Native FP8 support
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Removed activation offloading for fused attention
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Fixed the illegal memory access issue for activation offloading of attention
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Removed the version guard
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Pipeline failures fix
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Fixed lint errors
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

* Lint error fix
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

44574def

26 Jan, 2024 1 commit

[PyTorch] Fix MultiheadAttention docstring (#634) · e531cd2f

Isaac Ong authored Jan 26, 2024

Fix MHA docstring
Signed-off-by: Isaac Ong <isaacong.jw@gmail.com>

e531cd2f

25 Jan, 2024 1 commit

[Common][PyTorch] Fused `apply_rotary_pos_emb` (#517) · 6c1a8bb5

Xin Yao authored Jan 26, 2024



* fused apply rope
Signed-off-by: Xin Yao <xiny@nvidia.com>

* Apply suggestions from code review
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>

* resolve comments
Signed-off-by: Xin Yao <xiny@nvidia.com>

* make rotary_percent optional
Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix ci
Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix test
Signed-off-by: Xin Yao <xiny@nvidia.com>

* add rope test to qa
Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix linting
Signed-off-by: Xin Yao <xiny@nvidia.com>

* sync apex: add transpose_output_memory
Signed-off-by: Xin Yao <xiny@nvidia.com>

* small fix
Signed-off-by: Xin Yao <xiny@nvidia.com>

* sync apex: fuse sin/cos
Signed-off-by: Xin Yao <xiny@nvidia.com>

* sync apex: fused rope for thd format
Signed-off-by: Xin Yao <xiny@nvidia.com>

* fix lint
Signed-off-by: Xin Yao <xiny@nvidia.com>

* Fix license headers
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* add support for bshd format
Signed-off-by: Xin Yao <xiny@nvidia.com>

* support different seq length
Signed-off-by: Xin Yao <xiny@nvidia.com>

* update
Signed-off-by: Xin Yao <xiny@nvidia.com>

* update copyright
Signed-off-by: Xin Yao <xiny@nvidia.com>

* remove transpose_output_memory
Signed-off-by: Xin Yao <xiny@nvidia.com>

* Make outputs contiguous in SBHD case
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>

6c1a8bb5
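
For context on what a rotary position embedding (RoPE) kernel computes, an unfused reference in plain Python might look like the sketch below. It is not the fused kernel from #517: `rope_angles`, `rotate_half`, and `apply_rope` are hypothetical names, it handles a single head vector with an even number of elements, and it assumes the common "rotate half" pairing with base 10000.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Rotation angle per element; elements k and k + dim/2 share one angle."""
    half = dim // 2
    inv_freq = [base ** (-2.0 * k / dim) for k in range(half)]
    return [position * f for f in inv_freq] * 2


def rotate_half(x):
    """Map [x1, x2] (two halves) to [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    """Rotate each (x[k], x[k + dim/2]) pair by its position-dependent angle."""
    angles = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [
        xi * math.cos(a) + ri * math.sin(a)
        for xi, ri, a in zip(x, rotated, angles)
    ]


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
q_rot = apply_rope(q, position=5)
```

Because each pair is rotated, the vector norm is unchanged, and the dot product of a rotated query and key depends only on the difference between their positions.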

24 Jan, 2024 1 commit

[PyTorch] forward attention_type in MultiHeadAttention (#621) · bea70f2e

Marks101 authored Jan 24, 2024



[PyTorch] fix forward attention_type in MultiheadAttention
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

bea70f2e

20 Jan, 2024 1 commit

Fix failing CI due to PR #557 merge (#616) · bacefdbb

Sudhakar Singh authored Jan 19, 2024



fix failing tests due to PR #557
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

bacefdbb

18 Jan, 2024 1 commit

make TransformerLayer accept a `bshd` or `sbhd` tensor format (#557) · 36047fd7

Sudhakar Singh authored Jan 18, 2024



* make TransformerLayer accept a `bshd` or `sbhd` tensor format
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Fixes from feedback
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* more feedback fixes
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove incorrect info from docstring
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix from feedback
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

36047fd7

10 Jan, 2024 1 commit

[PyTorch] upgrade context parallelism implementations (#572) · 94f54d71

Xiaowei Ren authored Jan 09, 2024



* try to use cuDNN fused attention for context parallelism
Signed-off-by: xren <xren@nvidia.com>

* assert CP is only supported with NVTE_F16_arbitrary_seqlen
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* port fused attn api to context parallelism
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add one more assert
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* assert CP does not support padded tokens
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add qkv_format into CP implementation
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove qkv_format from CP function
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix qkv_format
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix bwd error with FA v2
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make cp implementation support non-causal masking
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove redundant asserts for CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor assert information change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* assert core attn bias has not been supported with CP yet
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make CP work with window_sizes of [-1, -1] and [-1, 0]
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add draft code for fa test with cp
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* move fused attn test to a specific folder
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add assert_close to flash attn cp test
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add more tests for CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add optional arguments for FA v2.4+
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add skip condition for CP test
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* class and function naming fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* docstring fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* do not use fused attn if backend does not work with CP
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* create a separate folder for CP test as it needs multi-GPUs
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add attn_mask_type check in attn_forward_func_with_cp
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code format fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: xren <xren@nvidia.com>
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

94f54d71

08 Jan, 2024 2 commits

[PyTorch] Refactor parameter splitting in Linear and LayerNormLinear (#590) · bb759adc

Tim Moon authored Jan 08, 2024



* Refactor parameter split in Linear module

Remove module state from noop_cat. Support arbitrary names in parameter split. Handle tensor parallelism.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make noop_cat a standalone operation
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update parameter splits in LayerNormLinear
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug case without bias

Fix pylint complaints.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove unused import
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

bb759adc

Use jit_fuser for bias-dropout-add fusion (#589) · 7ce7dfe5

Jaemin Choi authored Jan 08, 2024



* Use jit_fuser for bias-dropout-add fusion
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>

* Use jit_fuser for CP FA kernel
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

7ce7dfe5

06 Jan, 2024 1 commit

Bump FlashAttn version and add deterministic option for FAv2 (#585) · f2bd53c4

Kirthi Shankar Sivamani authored Jan 06, 2024



* Deterministic FA, bump minimum supported version
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix MQA/GQA
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address review comments
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

f2bd53c4

05 Jan, 2024 1 commit

Fix UB names in MHA (#588) · 1bb8b6eb

Przemyslaw Tredak authored Jan 05, 2024

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

1bb8b6eb

03 Jan, 2024 3 commits

Respect PyTorch determinism flag (#582) · d155eaac

Przemyslaw Tredak authored Jan 02, 2024


Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

d155eaac

Provide pre-computed max sequence to remove unnecessary kernels and D2H copies (#555) · b90b638d

Sangkug Lym authored Jan 03, 2024



* Provide pre-computed max sequence to remove unnecessary kernels and D2H copies
Signed-off-by: Sangkug Lym <slym@nvidia.com>

* Tweak comments
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

b90b638d

Change the copyright to include 2024 (#583) · cd798c97

Przemyslaw Tredak authored Jan 02, 2024

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

cd798c97

02 Jan, 2024 1 commit

Avoid redundant computation for cu_seqlens (#535) · fad3044b

Hongbin Liu authored Jan 02, 2024



avoid redundant computation for cu_seqlens
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

fad3044b

16 Dec, 2023 1 commit

[PyTorch] Add sliding window support to FlashAttention (#551) · 27aa609c

cyanguwa authored Dec 15, 2023



* add sliding window to FA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix forward logic
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change bert test to causal as unfused does not support padding
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix FlashAttention for v2-2.3 versions
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* verify FA swa works
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix mask related restrictions and duplicate code after merge
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix swa test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add docstring for get_swa func
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move repeated code into a function
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert mask change
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism filter and fix FA warning message
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add message for determinism filter
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* simplify check_set_window_size()
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix check_set_window_size in transformer layers
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix indent
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

27aa609c

15 Dec, 2023 2 commits

Disable dynamo for Fused Attention (#558) · 7e7f0920

Przemyslaw Tredak authored Dec 15, 2023



* Disable dynamo for Fused Attention
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Added test
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

7e7f0920

[PyTorch] Fix bug in micro batched inference with rotary embeddings (#536) · 37b3b7a7

Fabian Joswig authored Dec 15, 2023



[fix] fixed micro batched inference with RoPE
Signed-off-by: Fabian Joswig <fabian.joswig@deepl.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

37b3b7a7

07 Dec, 2023 1 commit

Integrate cuDNN frontend v1 to fused attention (#497) · 32db3928

cyanguwa authored Dec 07, 2023



* Integrate cuDNN frontend v1 to fused attention and miscellaneous fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax/paddle for unit tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax/pytorch lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* simplify stride generation
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix and/or logic in get_backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix flag_max512 and test_numerics
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove v.contiguous() since get_qkv_layout covers it
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* skip fp8 tests for sm89
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* further fix jax CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert mask type to comma-separated list
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix last two commits
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* integrate v1/pre-release-5
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* cleanup prerelease5 integration and fix FA2.1 commit
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* force dropout to 0 if not training
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* testing bias/alibi and padding+causal; add alibi to unfused DPA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* set flag_arb to false when non determinism is not allowed
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* followup on prev commit; remove redundant python env var setting
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: minor tweaks for tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* prepare for tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix determinism logic for fused attn
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add bias to bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix gpt_checkpointing/dpa_accuracy problem
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix some seg fault issues
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add failure notes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove use of non-deter var for backend selection
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix for lint and CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix workspace size in bwd and uncomment bias test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix get_alibi and remove check_support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update tests status
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove workspace_opt from FADescriptor_v1
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable arbitrary backend + post scale bias in Jax; waiting on PR 525
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up bhsd order
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* swap bias/rng_state order in aux_ctx_tensor and add bias to aux_ctx_tensor in _qkvpacked/_kvpacked API
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove support for padding_causal + cross for max512
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change alibi bias to float32 for bias_1_4/5 tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* further clean up tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix thd fwd output shape for FlashAttention and add backend info for DPA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix definition of workspace limit when dbias is present
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* further tweak DP_WORKSPACE_LIMIT definition
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disallow alibi+no_mask for sdpa flash and update alibi tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update jax/paddle after PR525 and fix DP_WORKSPACE_LIMIT for dbias Jax tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable dbias for non-hopper archs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix layernorm lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove unused arg for lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove build dir in setup.py
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* change selection logic to prefer fused attn on sm90
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix distributed jax test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix h and s order in header
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to cudnn fe v1 branch
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove manual setting of workopt path due to dbias after v1 update
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix paddle CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add post_scale_bias and alibi to sdpa flash support matrix
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix support matrix in header files
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move headers back to .cu and change seed/offset to int64
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update Megatron commit in L1 test and remove all prints in fused attn test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix L1 Megatron test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8 arg in L1 Megatron script
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* print only when debug flag is on
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove checkpointing loading to avoid loading other tests results
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

32db3928

28 Nov, 2023 1 commit

Use non-deprecated PyTorch methods to silence warnings (#541) · 54e46e21

Deepak Narayanan authored Nov 28, 2023



Getting warnings of the following form using ToT TE:

```
/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py:852: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
   data_ptr = grad_outputs[0].storage().data_ptr()
```
Signed-off-by: Deepak Narayanan <2724038+deepakn94@users.noreply.github.com>

54e46e21

17 Nov, 2023 1 commit

Disable FAv2.1+ for causal mask in cross attention (#522) · da55d247

cyanguwa authored Nov 17, 2023



* disable FAv2.1 if causal+cross attn
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove comment and add warning
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* include both causal and padding+causal
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add a space
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

da55d247

15 Nov, 2023 1 commit

Fix flash-attn checks and RoPE DPA (#506) · 7f2f7dd2

cyanguwa authored Nov 14, 2023



* fix condition checks related to FA head_dim
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* force q,k,v contiguous when RoPE is in use
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Expand FA version
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

7f2f7dd2

03 Nov, 2023 1 commit

fix bwd error of context parallelism implementation with FA v2 (#498) · 74eb7c33

Xiaowei Ren authored Nov 03, 2023



fix bwd error with FA v2
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

74eb7c33

23 Oct, 2023 1 commit

[PyTorch] Fixes and tests for FP8 + activation recompute (#487) · 427c736d

Kirthi Shankar Sivamani authored Oct 23, 2023



* initial test fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Drop eval for selective checkpointing tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove redundant recompute for FA
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CI fix; Decouple fused attention and numerics tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

427c736d

17 Oct, 2023 1 commit

Improve documentation (#478) · 0963020f

Kirthi Shankar Sivamani authored Oct 16, 2023



Improve docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

0963020f