Commits · 0c4618804636c341aa0a153b2263d6c56184fd60 · OpenDAS / TransformerEngine

02 Sep, 2025 2 commits
- Fix build error of cpp unit test · 0c461880
  wenjh authored Sep 02, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  0c461880
- Fix build problems while not support fp4 · 8e0fd518
  wenjh authored Sep 02, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  8e0fd518
28 Aug, 2025 2 commits
- [DCU] fix quantize bug · d86ee4c8
  yuguo authored Aug 28, 2025
  
  d86ee4c8
- [DCU] fix bugs · 546bb548
  yuguo authored Aug 28, 2025
  
  546bb548
27 Aug, 2025 2 commits
- [DCU] fix compile · 5b6190b2
  yuguo authored Aug 27, 2025
  
  5b6190b2
- Merge commit '734bcedd' of... · 87e3e56e
  yuguo authored Aug 27, 2025
```
Merge commit '734bcedd' of https://github.com/NVIDIA/TransformerEngine
```
  87e3e56e
26 Aug, 2025 4 commits
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 2f11bd2e
  yuguo authored Aug 26, 2025
  
  2f11bd2e
- [DCU] fix · 4927d10e
  yuguo authored Aug 26, 2025
  
  4927d10e
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 9d26d942
  yuguo authored Aug 26, 2025
  
  9d26d942
- [DCU] fix · 2e870ed9
  yuguo authored Aug 26, 2025
  
  2e870ed9
25 Aug, 2025 4 commits
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 11bc1775
  yuguo authored Aug 25, 2025
  
  11bc1775
- [DCU] fix moe tensorwise int8 · 059d92e2
  yuguo authored Aug 25, 2025
  
  059d92e2
- Merge branch 'develop_v2.5' · e12a1085
  wenjh authored Aug 25, 2025
  
  e12a1085
- Fix some test problem in pytorch unittest · 62550505
  wenjh authored Aug 25, 2025
  
  62550505
23 Aug, 2025 3 commits
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 374b85bd
  yuguo authored Aug 23, 2025
  
  374b85bd
- [DCU] tensorwise int8 gemm support bias · 11864d3d
  yuguo authored Aug 23, 2025
  
  11864d3d
- [DCU] fix tensorwise int8 moe bugs · 32edae18
  yuguo authored Aug 23, 2025
  
  32edae18
21 Aug, 2025 4 commits
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 1b971e27
  yuguo authored Aug 21, 2025
  
  1b971e27
- fix · 0cf10d1c
  yuguo authored Aug 21, 2025
  
  0cf10d1c
- Merge branch 'develop_v2.5' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine · 20065c44
  yuguo authored Aug 21, 2025
  
  20065c44
- [DCU] tensorwise int8 train opt · 7a923605
  yuguo authored Aug 21, 2025
  
  7a923605
20 Aug, 2025 1 commit
- Merge branch 'develop_v2.5_swap' into 'develop_v2.5' · 686e93cd
  yuguo authored Aug 20, 2025
```
add swap env

See merge request dcutoolkit/deeplearing/TransformerEngine!40
```
  686e93cd
19 Aug, 2025 1 commit
- add swap env · d19a5a44
  evt_fugx1 authored Aug 19, 2025
  
  d19a5a44
18 Aug, 2025 5 commits

- Changed VERSION to 2.8.0.dev0 · 734bcedd
  Przemek Tredak authored Aug 18, 2025
```
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
```
  734bcedd
- [JAX] Fix for TE GEMM - Always AllGather RHS non-contracting dims with FSDP axis (#2075) · 3fc1e4bf
  Phuong Nguyen authored Aug 18, 2025
```
* fix fsdp
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
  3fc1e4bf

- [PyTorch] Check if the given recipe is supported in `fp8_autocast` (#2073) · 0e3e270f
  Xin Yao authored Aug 19, 2025

* check if the given recipe is supported in fp8_autocast
Signed-off-by: Xin Yao <xiny@nvidia.com>

* resolve comments
Signed-off-by: Xin Yao <xiny@nvidia.com>

* check only when enabled
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

  0e3e270f

- Update list of authorized CI users (#2078) · 988af0fd
  Tim Moon authored Aug 18, 2025

* Update list of authorized CI users
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update .github/workflows/trigger-ci.yml
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

  988af0fd

- [JAX] Fix Flax variable creation when quantizers are created directly from a recipe (#2079) · 757fd1cf
  jberchtold-nvidia authored Aug 18, 2025
```
Fix flax variables when creating quantizers directly from a recipe
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  757fd1cf

16 Aug, 2025 1 commit

- fix: fixes multi head attention for context parallel: rotary embedding to use... · 6ba98d43
  jomitchellnv authored Aug 15, 2025

fix: fixes multi head attention for context parallel: rotary embedding to use padded cu_seq_lens (#2077)

fix: fixes mha to use padded cu_seq_lens during cp
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

  6ba98d43

15 Aug, 2025 4 commits

- Fuse linear+scale+add (#2042) · c654e4fe
  Jan Bielak authored Aug 15, 2025

* Add `nvte_cublas_gemm_scaled`
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Support use of `alpha` and `beta` in `tex.generic_gemm`
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Support use of `alpha` and `beta` in `general_gemm`
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Support use of `alpha` and `beta` in `BasicLinear._functional_forward` and `BasicLinear._functional_backward`
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Add `ForwardLinearScaleAdd` fusion
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Add `BackwardLinearScale` fusion
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Remove calls to `validate_gemm_scale` from `BasicLinear`
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

---------
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

  c654e4fe

- Merge branch 'develop_v2.5' · 592c9f40
  wenjh authored Aug 15, 2025
  
  592c9f40
- Avoid acc problem of test_gpt_*_activation_recompute · c4bb6049
  wenjh authored Aug 15, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  c4bb6049

- [JAX] Trim dist fused attn tests in L1 (#2050) · 92f431bf
  Kshitij Lakhani authored Aug 14, 2025

* Move some dist fused attn tests to L2
1. TestReorderCausalLoadBalancing: Run two (non symmetric) BSHD/SBHD data shape combination
2. TestDistributedSelfAttn: Run only one (smaller) BSHD type data shape combination
3. TestDistributedCrossAttn: Run only one (smaller) BSHD type data shape combination
4. TestDistributedContextParallelSelfAttn: Run all cp1 combinations
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Use pytest_parametrize_wrapper for splitting fused attn distributed JAX tests as L1 and L2
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Undo pytest -k split commands in qa scripts
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix usage of pytest_parametrize_wrapper in test_distributed_fused_attn
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove test code for L2 dist residing in L2 test.sh
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Add comments for code. Swap the test data shapes in REORDER_CAUSAL_LOAD_BALANCING_DATA_SHAPES
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add L0 to the data shape dictionaries in the distributed test
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

  92f431bf

14 Aug, 2025 6 commits

- [Core] Add launch bounds to swizzle kernels (#2076) · 12065ac2
  Kirthi Shankar Sivamani authored Aug 14, 2025

Add launch bounds to swizzle kernel, use empty scale inv
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

  12065ac2

- [PyTorch] Disable fused dbias-quantize kernel for unsupported recipes (#2007) · a169e9e7
  Tim Moon authored Aug 13, 2025

* Unfused impl for dbias-quantize
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Unfused impl for dact-dbias-quantize
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable fused bgrad-quantize for unsupported recipes
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove unfused dbias-quantize impls

Not supported in the core lib.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Support unfused impls in tex functions
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweaks
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused imports
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

  a169e9e7

- [PyTorch] Avoid registering FP8 scale update in ops without backward pass (#2063) · 26b4b71a
  Tim Moon authored Aug 13, 2025

Avoid registering FP8 recipe update in ops without backward pass
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

  26b4b71a

- [PyTorch] Register weight and bias params in linear op (#2027) · ccbc8cf4
  Tim Moon authored Aug 13, 2025

* Register weight/bias params in linear op
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak docs
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure linear op checkpoint is backward-compatible
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Check for invalid case before setting bias
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

  ccbc8cf4

- [Common] Reduce CUDA driver calls (#2067) · c582f6be
  Xin Yao authored Aug 14, 2025

* reduce driver calls
Signed-off-by: Xin Yao <xiny@nvidia.com>

* reduce driver calls
Signed-off-by: Xin Yao <xiny@nvidia.com>

* adjust tests to capture this
Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

  c582f6be

- fix: update grad_output quant to avoid redundant work (#1736) · 44fbe9e6
  Kshiteej K authored Aug 14, 2025

* fix: update grad_output quant to avoid redundant work
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>

* add test
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>

* don't keep only columnwise quant if requires_dgrad=False
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>

* fix stray merge
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>

* fix for ctx.use_bias is True case
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>

* Skip if FP8 not available
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

  44fbe9e6

13 Aug, 2025 1 commit

- [JAX] Cleanup the MLP warning for TE GEMM + TP (#2054) · bbddcb92
  Phuong Nguyen authored Aug 13, 2025

* fix pspec check
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleaning
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add docstring
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* use dict.get()
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix lint
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

  bbddcb92