Commits · c2efa7144ae8d0560a67a71151a9a0afdb4d46d2 · OpenDAS / TransformerEngine

03 Dec, 2025 12 commits
- Merge branch 'develop_v2.9' into release_v2.9 · c2efa714
  wenjh authored Dec 03, 2025
  
- Don't compile ptx code · 214228c1
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
- Make release_v2.9 compile pass · 99e60246
  wenjh authored Dec 03, 2025
  
- Merge branch 'develop_v2.9' into release_v2.9 · 3a0747a9
  wenjh authored Dec 03, 2025
  
- Fix build error · cbb14a5f
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
- Remove unsupported files · 98c5534c
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
- Merge branch 'develop_v2.9' into release_v2.9 · 33062330
  wenjh authored Dec 03, 2025
  
- Fix build error · b3dcfc28
  wenjh authored Dec 03, 2025
  
- Merge branch 'develop_v2.9' into release_v2.9 · adb1e9c5
  wenjh authored Dec 03, 2025
  
- Update build and install command · 1e3c6a25
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
- Merge nv release_v2.9 · 0a5016b1
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
- Merge nv main up to v2.10.0.dev0 · 063ef88d
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
26 Nov, 2025 2 commits

- Merge branch 'develop_v2.8' into 'main' · 91670b05
  wenjh authored Nov 26, 2025

  [DCU] Skip some tests in test_sanity.py

  See merge request dcutoolkit/deeplearing/TransformerEngine!61

- Merge branch 'fix_develop2.8_zc' into 'develop_v2.8' · 3a040217
  wenjh authored Nov 26, 2025
```
[DCU] Fix some bugs

See merge request dcutoolkit/deeplearing/TransformerEngine!56
```

12 Nov, 2025 4 commits
- Merge branch 'develop_v2.8' into 'main' · e3780e3a
  wenjh authored Nov 12, 2025
```
Fix build error

See merge request dcutoolkit/deeplearing/TransformerEngine!60
```
- Fix build error · a622988a
  wenjh authored Nov 12, 2025
  
- Merge branch 'develop_v2.8' into 'main' · a145a62a
  wenjh authored Nov 12, 2025
```
Fix hipblaslt handle manage

See merge request dcutoolkit/deeplearing/TransformerEngine!59
```
- Fix hipblaslt handle manage · f4bd89eb
  wenjh authored Nov 12, 2025
  
08 Nov, 2025 2 commits
- Merge branch 'develop_v2.8' into 'main' · e32965ff
  wenjh authored Nov 08, 2025
```
Fix user args core dump in mt

See merge request dcutoolkit/deeplearing/TransformerEngine!57
```
- Fix user args core dump in mt · a13c52ad
  wenjh authored Nov 08, 2025
  
03 Nov, 2025 8 commits
- [DCU] fix some bugs in test_numerics.py · f7c66e28
  zhaochao authored Nov 03, 2025
  
- [DCU] Skip configurations that FlashAttention does not support · 87682fe2
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] Resolve the issue of checkpoint test weights not existing. · 9d34e27a
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] Fix the bug in test_onnx_export.py under L0 · d5cd815f
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] Skip alpha non-1 tests · ef65dd33
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] fix bug with cannot import name 'use_lightop_w8a8' from 'transformer_engine.pytorch.utils' · 3d36696b
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] Skip some tests in test_cuda_graphs.py under L0 · 2fc4b10c
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
- [DCU] Skip some tests in test_sanity.py · 6af7b77d
  zhaochao authored Nov 03, 2025
```
Signed-off-by: zhaochao <zhaochao1@sugon.com>
```
31 Oct, 2025 2 commits

- Merge branch 'TE_develop2.8' into 'develop_v2.8' · 3a5755b1
  wenjh authored Oct 31, 2025

  [DCU] Fix memory overflow and test-distributed in L1_pytorch_distributed_unittest

  See merge request dcutoolkit/deeplearing/TransformerEngine!49

- [JAX] Ensure JAX reference impl uses an accurate backend in our tests (#2322) · 70f53666
  jberchtold-nvidia authored Oct 30, 2025
```
Ensure JAX reference impl uses an accurate backend
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```

30 Oct, 2025 4 commits

[PyT] Bump the min version expected to support FP8 current scaling... · 9cc089a2

Kshitij Lakhani authored Oct 30, 2025

[PyT] Bump the min version expected to support FP8 current scaling determinism on Blackwell (#2316)

* Bump the min version expected to support FP8 cs det on Blackwell
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Disable fused attn for cudnn < 9.14 for FP8 CS. Disable fused attn for cudnn < 9.18 for FP8 deterministic CS
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


[PyTorch] Fix attention backend and tests for `sm120` (#2320) · 0acd0e7d

Kirthi Shankar Sivamani authored Oct 30, 2025



* Fix attention backend and tests for sm120
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Disable MLA only for backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>


[JAX] Fix: Skip determinism tests for bprop for all sm >=100 (#2315) · fe9b1509

Kshitij Lakhani authored Oct 30, 2025



* Fix: Skip determinism tests for bprop for all sm >=100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add username to TODO
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100+
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


[PyTorch] Fix CI failures due to deterministic attention backend (#2288) · fa71964f

Kirthi Shankar Sivamani authored Oct 20, 2025



* Fix CI failures due to deterministic attention
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some more cleanup
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix debug test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>


28 Oct, 2025 1 commit

[PyTorch] Add max_logit support for MuonClip (#2195) · c4c185db

Charlene Yang authored Oct 24, 2025



* add max_score for fused/unfused F16 non-CP
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* calculate max per head instead of max over all heads
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fused attn max_score shape
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert FE to github
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update FE to 1.15.0-rc
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix merge
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* reduce ew kernels; fix causal masks; add more tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fix to tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove logic for flash-attn
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add CP support for p2p/a2a/all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor improvements of implementation/tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* WIP: add thd support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add thd to UnfusedDPA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update to FE 1.15
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove unneeded changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable unfused for thd + pad_between_seqs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable thd for unfused until bug is fixed
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename max_score to max_logit
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable fused attn + thd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>


24 Oct, 2025 4 commits

Overhaul the compilation for the arch-specific features (#2279) · 8b9849a2

Przemyslaw Tredak authored Oct 22, 2025



* Added sm_120f to the build
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the arch specific handling
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Support for CUDA<12.9
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved through the rest of the files
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Common cases
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Remove pure 100 from the list
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* CMake changes, (not yet working)
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Do not pass the arch-specific thing from build_tools
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved some of the files to arch-specific compilation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix and also changing the order of compilation to hopefully get the
compilation time lower
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the files overwriting custom compile properties
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually make this whole thing work
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add space to the error message
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Apply suggestions from code review
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Fixes from review
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changing the naming to be more intuitive
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add missing cassert include for device-side asserts
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>


Include TE core headers in final build (#2291) · 9b75db37
Kirthi Shankar Sivamani authored Oct 22, 2025
```
Include TE core headers in build
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```

[JAX] NVFP4 recipe with option to enable/disable SR, RHT, and 2D quantization (#2270) · 7e72d411

jberchtold-nvidia authored Oct 22, 2025



* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>


Wheels for cuda 13 (#2278) · c2a643d5

Kirthi Shankar Sivamani authored Oct 18, 2025



* Support wheel build for cuda 13
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes for cu13 runtime, format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add documentation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better error handling
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix jax sdist
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Modify function names
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>


17 Oct, 2025 1 commit

[JAX] Fix imports in test for deprecated jax.experimental.pjit (#2274) · 739c6565

Kshitij Lakhani authored Oct 16, 2025



* Fix imports in test for deprecated jax.experimental.pjit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix: Pass NamedSharding instead of PartitionSpec to compare_ops() so that when the in and out sharding is used to create a jitted function, it has the mesh info
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
