Commits · 6273cede50f50f6e48314fddb9d22da2d16ef871 · OpenDAS / TransformerEngine

24 Oct, 2025 1 commit

[PyTorch] Support delay_wgrad_compute cudagraph (#1948) · 6273cede

buptzyb authored Oct 24, 2025



* support cudagraph dw
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* fix lint
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* fix ci
Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

6273cede

23 Oct, 2025 3 commits

[PyTorch Debug] Fix issue with microbatching + debug value caching (#2108) · 021e1e62

Paweł Gadziński authored Oct 24, 2025



* fix perf issue
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

021e1e62

[JAX] Make SR rng state always 2D (num_devices, 4) to fix partitioning issue (#2294) · e2f2a0b4

jberchtold-nvidia authored Oct 23, 2025



* Make SR rng state always 2D (num_devices, 4)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax impl
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix test shape
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

e2f2a0b4

Overhaul the compilation for the arch-specific features (#2279) · eb34783c

Przemyslaw Tredak authored Oct 22, 2025



* Added sm_120f to the build
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the arch specific handling
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Support for CUDA<12.9
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved through the rest of the files
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Common cases
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Remove pure 100 from the list
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* CMake changes (not yet working)
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Do not pass the arch-specific thing from build_tools
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved some of the files to arch-specific compilation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix and also changing the order of compilation to hopefully get the compilation time lower
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the files overwriting custom compile properties
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually make this whole thing work
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add space to the error message
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Apply suggestions from code review
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Fixes from review
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changing the naming to be more intuitive
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add missing cassert include for device-side asserts
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

eb34783c

22 Oct, 2025 3 commits

[JAX] Defer TE/JAX cublas shape check on fp8 gemms until lowering (#2292) · 2ac3c168

jberchtold-nvidia authored Oct 22, 2025

Defer cublas check on fp8 gemms until lowering
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

2ac3c168

[JAX] NVFP4 recipe with option to enable/disable SR, RHT, and 2D quantization (#2270) · 818b30cc

jberchtold-nvidia authored Oct 22, 2025



* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

818b30cc

[PyTorch] Decouple python quantization classes and refactor custom quantization (#2276) · ce2e8bd1

Evgeny Tsykunov authored Oct 22, 2025



* rename experimental -> custom_recipes
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Decouple python base classes (api)
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* update test_custom_recipe
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Rename experimental -> custom
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Minor
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix import
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Update tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Evgeny Tsykunov <e.tsykunov@gmail.com>

* Update tests/pytorch/test_custom_recipe.py
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Evgeny Tsykunov <e.tsykunov@gmail.com>

* quantization_base -> quantized_tensor rename
Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny Tsykunov <e.tsykunov@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ce2e8bd1

21 Oct, 2025 2 commits

Add post-processing API for FP8 primary weights to support CUDA Graph (#2266) · 2712bb95

Kunlun Li authored Oct 22, 2025



* Add post-processing API for FP8 primary weights to support CUDA Graph
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Add post-processing support for plain pytorch tensors
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Update type hint
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

2712bb95

[PyTorch][MOE] Support NVFP4 Grouped Linear (#2215) · b4a1d4d6

Zhongbo Zhu authored Oct 20, 2025



* pipeclean, fix nvfp4 padding of 32 alignment
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* numerical test passed
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix CI failure with test_cast_master_weights_to_fp8 (in a hacky way)
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* found CUDA mis-aligned address error in training in multi-swizzle, hack the vec_load_size to 1 to unblock
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* leave comments about alignment issue
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fused bulk alloc nvfp4
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix RHT sign mask CPU overhead
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve comments
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Remove incorrect logic that treats 0-D tensor as uninitialized

Tensor shape logic still requires treating 0-D tensor as uninitialized.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix invalid conversion from tensor to int
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

b4a1d4d6

20 Oct, 2025 2 commits

[PyTorch] Fix CI failures due to deterministic attention backend (#2288) · bd55e7ba

Kirthi Shankar Sivamani authored Oct 20, 2025



* Fix CI failures due to deterministic attention
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some more cleanup
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix debug test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

bd55e7ba

Fix error with triton 3.5 (#2286) · dd7ab715

fzyzcjy authored Oct 20, 2025



* Update permutation.py
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

* Update permutation.py
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

* Update transformer_engine/pytorch/triton/permutation.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/pytorch/triton/permutation.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

dd7ab715

18 Oct, 2025 1 commit

Wheels for cuda 13 (#2278) · fd234d80

Kirthi Shankar Sivamani authored Oct 18, 2025



* Support wheel build for cuda 13
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes for cu13 runtime, format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add documentation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better error handling
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix jax sdist
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Modify function names
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

fd234d80

17 Oct, 2025 4 commits

Make `CanonicalizeGemmInput()` support non-TN layout FP8 GEMM on Blackwell... · ee384ab5

Alp Dener authored Oct 17, 2025

Make `CanonicalizeGemmInput()` support non-TN layout FP8 GEMM on Blackwell with column-wise/transposed data (#2233)

Modified CanonicalizeGemmInput() logic to pull from column-wise data for FP8 GEMM on Blackwell when row-wise is not available.
Signed-off-by: Alp Dener <adener@nvidia.com>

ee384ab5

Bump up FA to 2.8.3 (#2282) · a7a69ca6

Haowen Zheng authored Oct 18, 2025


Signed-off-by: 将来 <jianglai.zhw@alibaba-inc.com>
Co-authored-by: 将来 <jianglai.zhw@alibaba-inc.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

a7a69ca6

fall back after failing ldconfig-based lib loading for cuDNN (#2277) · bd380048

Tim Geypens authored Oct 17, 2025

Signed-off-by: Tim Geypens <tim.geypens@gmail.com>

bd380048

NVFP4 Move RHT BLAS to GPU (#2275) · 05dc1e62

Kevin Tong authored Oct 17, 2025



* CUDA RHT
Signed-off-by: Kevin Tong <kevin@augmentcode.com>

* Fix cuda graphs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix bug where RHT mask is tensor instead of int
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Kevin Tong <kevin@augmentcode.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

05dc1e62

16 Oct, 2025 2 commits

[PyTorch] Add record_stream and untyped_storage func op in QuantizedTensor (#2144) · 81c363bf

xiaoxi-wangfj authored Oct 17, 2025



* [PyTorch] Add record_stream and untyped_storage func op in QuantizedTensor
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* Update transformer_engine/pytorch/tensor/float8_blockwise_tensor.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

---------
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

81c363bf

Added support for DistOpt with offloading with MoEs (#2264) · 452c7374

Selvaraj Anandaraj authored Oct 16, 2025

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

452c7374

15 Oct, 2025 1 commit

[PyTorch Debug] Fix issue with start_end_list logging feature (#2252) · 4c572f04

Paul Gibbons authored Oct 15, 2025



* fixes for start_end_list usage in TE debug
Signed-off-by: Paul Gibbons <pgibbons@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Paul Gibbons <pgibbons@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

4c572f04

14 Oct, 2025 4 commits

[PyTorch] Bump minimum cuDNN version for fused attention with FP8 current scaling (#2236) · fd2f589f

Tim Moon authored Oct 14, 2025



* Require cuDNN 9.14.0+ for fused attention with FP8 current scaling
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

fd2f589f

Generalize quantization APIs for FP8/FP4/.. recipes (#2256) · 85a91997

Kirthi Shankar Sivamani authored Oct 14, 2025



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

85a91997

[JAX] Add BRCM support for THD (#2242) · ca6fedcf

Kshitij Lakhani authored Oct 14, 2025



* Add BRCM support when creating a test mask for fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for BRCM to correctly generate the mask needed for calculating the seqlens and offsets for THD
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip drop=0 and no_bias case for BRCM as cuDNN does not support this
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip BRCM test cases where max_seqlen_q > max_seqlen_kv
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Refactor the segment id run length code for BRCM seqoffset and seqlens calculations
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the drop inequality skip condition in fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Adjust the BRCM id name in the test to make it consistent
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the brcm mask condition.
Fix the condition for cross attn type pattern to only apply for brcm
Change the num segments per sequence to 3 instead of 2
Reduce one test pattern data size and make it such that it triggers brcm
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix incorrectly changed dtype to numpy bool_ rather than native python bool
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Restore the numsegments to earlier value
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add example for THD BRCM
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ca6fedcf

[PyTorch] Use Quantization API for reference NVFP4 recipe (#2259) · dfacd9f7

Evgeny Tsykunov authored Oct 14, 2025



* Fix update_quantized in ref nvfp4 quantizer
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Subclass quantization API
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Use recipe.Custom and quantizer factories for reference NVFP4
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Linter fix
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

dfacd9f7

13 Oct, 2025 5 commits

FSDP grad fusion support (#2191) · a3b749b1

Selvaraj Anandaraj authored Oct 13, 2025



* FSDP grad fusion support
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Re-factored grad overwriting usage
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* Update transformer_engine/pytorch/ops/basic/basic_linear.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/backward_linear_add.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/backward_linear_scale.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@nvidia.com>

* Update transformer_engine/pytorch/ops/fused/userbuffers_backward_linear.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@nvidia.com>

* Modified API usage, added arg details
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

a3b749b1

[JAX] Add assertion message to amax -> scale computation (#2263) · 76e1af33

jberchtold-nvidia authored Oct 13, 2025

assertion check
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

76e1af33

[Common][JAX] Improve error message for cublas fp8 gemm with incorrect shape (#2261) · 8c364b4d

jberchtold-nvidia authored Oct 13, 2025



* Improve error message for cublas fp8 gemm with incorrect shape
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Removed unnecessary non-contracting size check
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename inner dim -> leading dim
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

8c364b4d

Disable torch autocast context in rope forward pass (#2240) · 8eec2004

Peter St. John authored Oct 13, 2025


Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

8eec2004

Offloading support for multiple attention layouts (#2024) · 7ad130ef

Selvaraj Anandaraj authored Oct 13, 2025



* Added multi-layout support for attention
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* Comment/cleanup
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* Bug fix on import time
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

7ad130ef

09 Oct, 2025 5 commits

Don't pickle an empty dict in LayerNorm and pt base modules (#2253) · dd9433e7

Peter St. John authored Oct 09, 2025

Don't pickle an empty dict in LayerNorm and BasicOperation layers
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

dd9433e7

[JAX] NVFP4 support in TE/JAX (#2254) · 8a7ab3dd

jberchtold-nvidia authored Oct 09, 2025


Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

8a7ab3dd

Update minimum python version to 3.10 and add checks in CI (#2247) · e99be1b6

Kirthi Shankar Sivamani authored Oct 09, 2025



* Update minimum python version to 3.10 and update CI
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

e99be1b6

[PyTorch] Deprecate old `float8_tensor.py` (#2250) · 9bf4175f

Kirthi Shankar Sivamani authored Oct 08, 2025

Deprecate old float8_tensor.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

9bf4175f

Disallow pure E5M2 recipe for `Float8BlockScaling` (#2251) · e37e33e1

Kirthi Shankar Sivamani authored Oct 08, 2025

Catch unsupported GEMM during recipe init
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

e37e33e1

08 Oct, 2025 2 commits

[JAX] Async issuing D2H memcpy for grouped_gemm group_sizes array (#2213) · af2a0c16

Hua Huang authored Oct 08, 2025



* Try async copy of grouped GEMM group_sizes data
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

af2a0c16

[PyTorch] Unblock fused bgrad quantization path for nvfp4 (#2246) · 66f9b3cb

Kirthi Shankar Sivamani authored Oct 07, 2025

Unblock path for fusing NVFP4 quantize and bgrad
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

66f9b3cb

07 Oct, 2025 2 commits

`NVFP4BlockScaling` recipe docs (#2241) · 76bced54

Kirthi Shankar Sivamani authored Oct 07, 2025



* Improve docstring for NVFP4 recipe
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add NVFP4BlockScaling to recipe docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Grammar
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* improve wording
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/recipe/__init__.py
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/recipe/__init__.py
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/recipe/__init__.py
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/recipe/__init__.py
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/common/recipe/__init__.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

76bced54

[JAX] Activation/Normalization to output amax for later quantization in CurrentScaling (#2238) · 127b6d3a

Phuong Nguyen authored Oct 07, 2025



* reuse amax for current scaling
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

127b6d3a

06 Oct, 2025 1 commit

[JAX] Fix for GEMM + fuse bias + AllReduce (#2230) · 0db0f4d2

Phuong Nguyen authored Oct 06, 2025



* not fuse bias for output all reduction case + unit tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* norm to reduce dgamma along tpsp as well
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix test_distributed_layernorm byte counts
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* increase tols for jax_gemm
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

0db0f4d2

04 Oct, 2025 2 commits

Fix FP8 current scaling attention logic (#2234) · 08779fd8

Kirthi Shankar Sivamani authored Oct 03, 2025



* Fix in FP8 attention selection logic
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Improve logic
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

08779fd8

Fix bug where CUTLASS kernel was not being compiled for SM90a (#2235) · 5be81251

Tim Moon authored Oct 03, 2025

Signed-off-by: Tim Moon <tmoon@nvidia.com>

5be81251