Commits · 0d874a4e7f1b73692cc6f4d7ee06542680bc15a5 · OpenDAS / TransformerEngine

02 Jan, 2026 1 commit
- Update copyright to include year 2026 (#2553) · 830ef60f
  Kirthi Shankar Sivamani authored Jan 02, 2026
```
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
  830ef60f
11 Nov, 2025 1 commit

[PyTorch] FSDP2 Support for TE (#2245) · 29537c96

vthumbe1503 authored Nov 10, 2025



* fix for float8 tensor fsdp2 training
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* zeros_like should return fp32 for fsdp2 to work
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* minor cleanup
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix unsharded weights not releasing memory
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* implement using fsdp preallgather and postallgather functions
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* FSDP2 works on Hopper/L40
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor comment
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* some fixes for fp8 + handwavy changes for mxfp8
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* only transpose saved for backward pass allgather in case of L40/Hoppergst
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* missed minor change to hopper use-case
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* communicate only required data in mxfp8, fix for updating weight usages when required instead of doing upfront in fwd pass
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* changes for meta Dtensors for weights and better all gather data handling in fsdp hook functions
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* better solution to figure out forward pass in FSDP2
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* adress review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* everything functioning except hack for transformerlayer
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix merge conflict
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* revert change of commit id for cudnnt-frontend
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* unnecessary change
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor issues with linting, add some comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor stuff
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert space removal

Add default usage handling for rowwise and columnwise data.
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix the fsdp state collection issue, and minor review comments addressing
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* revert change for dgrad redundant computation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* bug: get fsdp param group's training state instead of root training state; address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address coderabbit review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* adress review comments; fix fp8 allgather test to do after fsdp lazy init
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove detach
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* do what makes sense
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/float8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* adress review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* have better dtype for fsdp_post_all_gather arguments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* minor comment
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* improve comment
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix the error in CI
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* minor comment add
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* accidentally removed view function
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix minor bug for h100
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* minor addition
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* implement padding removal/addition for allgather
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint error
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* adress review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* improve the reset parameter logic for dtensors
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* other cosmetic changes
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* cosmetic changes
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* cosmetic changes
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/module/layernorm_linear.py
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

29537c96

17 Oct, 2025 1 commit

Fix test of FSDP2 by correcting init logic and applying autocast (#2105) · c593bcef

Neil Tenenholtz authored Oct 17, 2025



* Fix test of FSDP2 by correcting init logic and applying autocast

This fixes multiple issues in the FSDP2 test, namely
1. Previously fp8 init was performed when `args.fp8_init == False`. I have updated the logic to match what I presume was intended by leveraging the nullcontext context manager.
2. `te.fp8_autocast` was previously not called; the recipe was created but was unused. The autocast context manager now wraps the model's computation.
Signed-off-by: Neil Tenenholtz <ntenenz@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix typo
Signed-off-by: Neil Tenenholtz <ntenenz@users.noreply.github.com>

* Update tests/pytorch/distributed/run_fsdp2_model.py
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix bug when constructing context for model init
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Neil Tenenholtz <ntenenz@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

c593bcef

14 Oct, 2025 1 commit

Generalize quantization APIs for FP8/FP4/.. recipes (#2256) · 85a91997

Kirthi Shankar Sivamani authored Oct 14, 2025



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tets
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

85a91997

29 Apr, 2025 1 commit
- [DCU] fix fsdp2 · 16de530e
  yuguo authored Apr 29, 2025
  
  16de530e
02 Jan, 2025 1 commit
- Update copyright to include 2025 (#1388) · c9ea6be9
  Kirthi Shankar Sivamani authored Jan 02, 2025
```
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
  c9ea6be9
16 Dec, 2024 1 commit

Enabling FP8 all-gather for TE Float8Tensor when using Torch FSDP2 (#1358) · 0196ed44

Youngeun Kwon authored Dec 16, 2024



* draft implementation of fsdp2 fp8 all gather
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix the convergence issue
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add warning
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* disable lint error
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix the lint error
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix lint error
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint error
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint error
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add comments
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add ref
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* add related tests
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0196ed44