- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Dec, 2024 1 commit
-
-
Charlene Yang authored
* add swa (left,0) + padding + brcm support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * final fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * upgrade to FE 1.9-rc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip thd + CP + fused attn tests for cuDNN 9.6+ due to different stats shapes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
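
For readers unfamiliar with the terms in this commit, the following is a minimal sketch of a sliding-window mask with padding. It assumes that "swa (left,0)" means each query may attend to at most `left` earlier keys and to no later ones, and that "brcm" means a bottom-right-aligned causal mask. The function name `swa_padding_mask`, its parameters, and the convention that `True` means "may attend" are all hypothetical and chosen for illustration; this is not the library's implementation.

```python
def swa_padding_mask(seqlen_q, seqlen_kv, max_q, max_kv, left):
    """Build a (max_q x max_kv) boolean mask; True means "may attend".

    Hypothetical helper for illustration only. Rows >= seqlen_q and
    columns >= seqlen_kv are padding and stay False.
    """
    mask = [[False] * max_kv for _ in range(max_q)]
    # Bottom-right alignment: the last real query lines up with the last real key.
    offset = seqlen_kv - seqlen_q
    for i in range(seqlen_q):
        diag = i + offset  # key position aligned with query i
        for j in range(seqlen_kv):
            # Window (left, 0): up to `left` keys before the diagonal, none after it.
            if diag - left <= j <= diag:
                mask[i][j] = True
    return mask


# Three real tokens padded to length four, window of one previous key.
square = swa_padding_mask(seqlen_q=3, seqlen_kv=3, max_q=4, max_kv=4, left=1)

# Fewer queries than keys: the window follows the bottom-right diagonal.
cross = swa_padding_mask(seqlen_q=2, seqlen_kv=4, max_q=2, max_kv=4, left=1)
```

With equal query and key lengths the bottom-right and top-left alignments coincide; they differ only when the two lengths differ, as in the `cross` case.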
-
- 18 Dec, 2024 3 commits
-
-
Phuong Nguyen authored
* Move test distributed encoder to L0 distributed test suite --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Reese Wang <rewang@nvidia.com>
-
Charlene Yang authored
* WIP: fix get_swa_mask for padding Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix mask type setting Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix the order of checking valid swa and changing mask type Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revamp to get full mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Charlene Yang authored
add weights_only=False for torch.load Signed-off-by:Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 17 Dec, 2024 2 commits
-
-
Reese Wang authored
* Add util functions to attn_mask_type Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add util functions to qkv_layout Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix THD cross reference code Signed-off-by:
Reese Wang <rewang@nvidia.com> * Remove explicit segment_pad, encoding it to segment_ids Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add jax.jit, replace _token with segment_ids, rename bias shape enum Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add comment for make_mask Signed-off-by:
Reese Wang <rewang@nvidia.com> * Clean code Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add doc strings for the added functions Signed-off-by:
Reese Wang <rewang@nvidia.com> * Remove cache for fa deterministic, which causes UT failures Signed-off-by:
Reese Wang <rewang@nvidia.com> * Rename fixture to avoid conflict Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
Charlene Yang authored
add max_t for KV Signed-off-by:Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 16 Dec, 2024 1 commit
-
-
Youngeun Kwon authored
* draft implementation of fsdp2 fp8 all gather Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * fix the convergence issue Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * Add warning Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable lint error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the lint error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * fix lint error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * add comments Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * add ref Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * add related tests Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Dec, 2024 2 commits
-
-
Phuong Nguyen authored
* softmax custom calls with correct encapsulates * rm jax deprecated features --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
Jingyue Wu authored
-
- 12 Dec, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
Add Jeremy to ci users Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Phuong Nguyen authored
* fix ctx.aval_out indexing for workspace * add cudnn init to prepare phase of norm custom calls * add thread_local for norm registry instance --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 10 Dec, 2024 1 commit
-
-
Reese Wang authored
* Bug Fix: Use default factory to avoid sharing mutable default values --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
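
The commit above does not show its code here, so the following is only a general sketch of the Python pitfall it names: a mutable default that is shared between instances, and the `dataclasses.field(default_factory=...)` idiom that avoids it. The class names `SharedConfig` and `SafeConfig` and the `tags` attribute are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass, field


class SharedConfig:
    # Pitfall: a class-level list is one object shared by every instance.
    tags = []


@dataclass
class SafeConfig:
    # default_factory is called once per instance, so each gets its own list.
    tags: list = field(default_factory=list)


a, b = SharedConfig(), SharedConfig()
a.tags.append("x")  # also visible through b, because the list is shared

c, d = SafeConfig(), SafeConfig()
c.tags.append("x")  # d.tags is unaffected
```

Note that `@dataclass` rejects a bare `tags: list = []` with a `ValueError` for exactly this reason, so the shared-state bug usually enters through plain classes or through default objects that the dataclass check does not catch.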
-
- 06 Dec, 2024 2 commits
-
-
Phuong Nguyen authored
* cuDNN normalization integration * TE Norm refactor * TE Norm APIs changes. --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
Debug Mcore integration test. Avoid FP8 on Ampere and older. Generate synthetic data instead of depending on external data. Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 05 Dec, 2024 3 commits
-
-
Xiaowei Ren authored
* always have padding mask type for both flash and fused attentions Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a redundant assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com>
-
Tim Moon authored
Store module extra state in tensor Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
Tim Moon authored
Debug jobs to deploy nightly docs Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 04 Dec, 2024 1 commit
-
-
Michael Goldfarb authored
Scale sequence length in CP tests to avoid tiny sizes. Signed-off-by:Michael Goldfarb <mgoldfarb@nvidia.com>
-
- 02 Dec, 2024 2 commits
-
-
Youngeun Kwon authored
* draft implementation Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * compile error fix Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * fix compile error Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * remove print Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Edit comments Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * edit the bulk-overlap test case Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add version guard Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add runtime version guard Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * fix the version guard Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> --------- Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Update list of CI users Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update list of CI users Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 27 Nov, 2024 1 commit
-
-
Xiaowei Ren authored
* retain_graph=True for grouped gemm Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove an unnecessary retain_graph=True Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make retain_graph in graph capture configurable Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com>
-
- 25 Nov, 2024 2 commits
-
-
Michael Goldfarb authored
Moved framework agnostic THD kernels to common. --------- Signed-off-by:Michael Goldfarb <mgoldfarb@nvidia.com>
-
buptzyb authored
* Align RNG tracker with megatron Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> * Fix module_params order and warmup bug in cudagraph Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> * Add fp8_group argument and fix fp8 accuracy issue for cudagraph Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> * Add TE modules and weights filters to support MoE models Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> * Revert self.fp8 Signed-off-by:
Robin Zhang <robinz@nvidia.com> * Use hooks to filter module params Signed-off-by:
Robin Zhang <robinz@nvidia.com> * Filter all TE modules in hooks Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> * Format code Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update graph.py Signed-off-by:
Xin Yao <yaox12@outlook.com> * Revert CudaRNGStatesTracker Signed-off-by:
Robin Zhang <robinz@nvidia.com> * Format Update Signed-off-by:
Yifei Song <yifeis@nvidia.com> * Revert "Use hooks to filter module params" This reverts commit 73a22e2e8bcf43ec84c23bc844b8d16d06626e26. Signed-off-by:
Yifei Song <yifeis@nvidia.com> * Remove filtering module params Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Signed-off-by:
Yifei Song <yifeis@nvidia.com> Co-authored-by:
Yifei Song <yifeis@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <yaox12@outlook.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 22 Nov, 2024 1 commit
-
-
Tim Moon authored
* Add helper function to convert C++ container to string Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 21 Nov, 2024 1 commit
-
-
Tim Moon authored
* Handle deprecated `hidden_size` arg in norm modules Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Support initializing norm ops on CPU Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add integration test for Megatron-LM Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename Mcore integration test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Handle case in RMSNorm where hidden dim is not provided Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 20 Nov, 2024 1 commit
-
-
Charlene Yang authored
* fix GQA error message Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Nov, 2024 3 commits
-
-
Kenichi Maehashi authored
use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR Signed-off-by:Kenichi Maehashi <webmaster@kenichimaehashi.com>
-
Przemek Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
Tim Moon authored
* Add activation ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Update to use QuantizedTensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Respect PyTorch autograd dtype Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename CastFloat8 op to Quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for fused dSwiGLU-cast-transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Nov, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Limit to one call of ctx.saved_tensors per autograd bwd Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Remove manual FP8 scale update for FP8 params Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Nov, 2024 2 commits
-
-
Tim Moon authored
Debug ONNX export with te.Sequential. ONNX export assumes that all state dict objects are tensors, even extra state. Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
Jennifer Zhou authored
fix an int conversion error Signed-off-by:Jennifer Zhou <jennifer@jezh.me>
-
- 12 Nov, 2024 1 commit
-
-
Hua Huang authored
* FFI for all softmax functions Signed-off-by:
Hua Huang <huah@nvidia.com> * FFI for FusedAttnBackward and Dequantize. FusedAttnBackward passed all tests in test_fused_attn.py. Dequantize is not used currently; finish it for completeness. Signed-off-by:
Hua Huang <huah@nvidia.com> * Fix FusedAttnBackward FFI pybind & simplify Signed-off-by:
Hua Huang <huah@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert changes to tests/jax/test_fused_attn.py Signed-off-by:
Hua Huang <huah@nvidia.com> --------- Signed-off-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 11 Nov, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Fix file extensions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * upgrade paddle container for CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Ming-Xu Huang authored
* Implement ring attention primitive for Jax. Signed-off-by:
Michael Goldfarb <mgoldfarb@nvidia.com> Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Michael Goldfarb <mgoldfarb@nvidia.com> Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Michael Goldfarb <mgoldfarb@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 08 Nov, 2024 1 commit
-
-
Phuong Nguyen authored
* split cudnn utils from fused_attn/util --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Nov, 2024 1 commit
-
-
Phuong Nguyen authored
* added prepare phase for the FusedAttnForwardFFI * enabled FusedAttnForwardFFI by default * moved prepare phase into pybind --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Nov, 2024 1 commit
-
-
Hua Huang authored
* FFI for some transpose & activation functions Signed-off-by:
Hua Huang <huah@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove comments in transformer_engine/jax/csrc/extensions/activation.cpp Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com> Signed-off-by:
Hua Huang <huangh1994@outlook.com> --------- Signed-off-by:
Hua Huang <huah@nvidia.com> Signed-off-by:
Hua Huang <huangh1994@outlook.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-