- 15 Jul, 2025 2 commits
  - wenjh authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 11 Jul, 2025 3 commits
  - wenjh authored
  - wenjh authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 09 Jul, 2025 4 commits
  - yuguo authored
  - wenjh authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 08 Jul, 2025 2 commits
  - yuguo authored
- 03 Jul, 2025 2 commits
  - wenjh authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 02 Jul, 2025 1 commit
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 01 Jul, 2025 2 commits
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
  - wenjh authored
    Add env to choose the block length of blockwise quantize.
    Fix pytest of blockwise error.
    Resolve new API in int8 GEMM test.
    Fix incorrect launch param.
    Fix 1D blockwise (64) accumulation error.
    Signed-off-by: wenjh <wenjh@sugon.com>
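A rough picture of what an env-selected block length means for blockwise quantization: a minimal NumPy sketch, not the repository's kernel. The environment variable name and helper functions here are hypothetical (the commit log does not give the real names), and only the 1D case with one scale per block is shown.

```python
import os
import numpy as np

# Hypothetical env var name; the commit log does not give the real one.
BLOCK_LEN = int(os.environ.get("BLOCKWISE_QUANT_BLOCKLEN", "128"))  # e.g. 64 or 128

def blockwise_int8_quantize(x: np.ndarray, block_len: int = BLOCK_LEN):
    """1D blockwise int8 quantization: one scale per block of block_len values."""
    x = x.reshape(-1)
    pad = (-x.size) % block_len                      # pad so the length divides evenly
    xp = np.pad(x, (0, pad)).reshape(-1, block_len)  # (num_blocks, block_len)
    amax = np.abs(xp).max(axis=1, keepdims=True)     # per-block absolute max
    scale = np.where(amax == 0, 1.0, amax / 127.0)   # int8 payload range [-127, 127]
    q = np.clip(np.rint(xp / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def blockwise_int8_dequantize(q, scale, orig_size):
    return (q.astype(np.float32) * scale).reshape(-1)[:orig_size]

x = np.random.randn(1000).astype(np.float32)
q, s = blockwise_int8_quantize(x)
x_hat = blockwise_int8_dequantize(q, s, x.size)
assert np.abs(x - x_hat).max() <= np.abs(x).max() / 100  # coarse round-trip check
```

A smaller block length (e.g. 64, as in the 1D blockwise(64) fix above) gives finer-grained scales at the cost of more scale metadata per tensor.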
- 20 Jun, 2025 5 commits
- 19 Jun, 2025 5 commits
- 18 Jun, 2025 6 commits
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
  - wenjh authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
  - yuguo authored
  - wenjh authored
    Signed-off-by: wenjh <wenjh@sugon.com>
- 17 Jun, 2025 2 commits
  - yuguo authored
  - yuguo authored
    Merge commit 'a69692ac' of https://github.com/NVIDIA/TransformerEngine
- 16 Jun, 2025 2 commits
- 13 Jun, 2025 4 commits
  - Przemek Tredak authored
    Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
  - Kshitij Lakhani authored
    * Add support for Fused Attn MLA with head_dim_qk != head_dim_v:
      - Modify is_fused_attn_kernel_available() to accept different head dims for qk and v
      - Modify FusedAttnHelper to accept different head dims for qk and v, and update the dim asserts in parse_qkv_aval()
      - Modify FusedAttnFwdPrimitive and FusedAttnBwdPrimitive to accept different head dims for qk and v
      - Modify the Fused Attn cpp and csrc extension API calls to accept different head dims for qk and v
      - Modify DotProductAttention call() to extract head dims separately for qk and v
      - Modify the FusedAttn tests to accommodate the FusedAttn API changes
      - Add a test case for head_dim_qk != head_dim_v (failing)
      - Modify the baseline JAX implementation to reshape the output tensor based on v dims, not q dims
    * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
    * Fix context dims in general DPA in test_fused_attn
    * Fix the output tensor dim by using the v head dim rather than the q head dim; add JAX fused attn test cases with head_dim_qk != head_dim_v across data types and attention types
    * Modify the fused attn JAX unit test case for head_dim_qk != head_dim_v
    * Use the new FusedAttnRunner signature (separate hidden dims for qk and v) in the Fused Attn distributed tests; code cleanup
    * Fix usage of the is_fused_attn signature in distributed tests
    * Remove unnecessary assert
    Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
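The shape change at the heart of this series (the attention output follows the v head dim, not the q head dim) can be seen in a self-contained reference. This is a plain NumPy sketch of single-head dot-product attention, not the TE/JAX implementation itself:

```python
import numpy as np

def dpa_reference(q, k, v):
    """Dot-product attention where head_dim_qk may differ from head_dim_v.

    q: (seq_q, head_dim_qk), k: (seq_kv, head_dim_qk), v: (seq_kv, head_dim_v)
    Returns (seq_q, head_dim_v): the output shape follows v, not q.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (seq_q, seq_kv)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v                              # (seq_q, head_dim_v)

seq_q, seq_kv, head_dim_qk, head_dim_v = 4, 6, 192, 128  # MLA-style: qk dim != v dim
q = np.random.randn(seq_q, head_dim_qk)
k = np.random.randn(seq_kv, head_dim_qk)
v = np.random.randn(seq_kv, head_dim_v)
assert dpa_reference(q, k, v).shape == (seq_q, head_dim_v)
```

This is why the baseline fix above reshapes the output by v dims: only q and k must share a head dim (for the q-k dot product), while the output inherits its last dim from v.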
  - Charlene Yang authored
    * add support for head dim > 128
    * remove debugging
    * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
    * raise tols slightly to tolerate 1/2048 mismatches
    * fix is_training for test_te_layer
    * add bprop support for blackwell
    * minor tweak for format
    * fix backend selection results
    * bump sm100 to sm100+
    * add sq=1 test for MLA
    * enable sq=1 for bprop
    * minor tweak in comments
    * fix head_dim logic and remove pytest skip
    * add FE fix for d>128
    * update FE again to take in small fixes
    * add cuDNN version info in L0 tests
    * increase tols for Unfused + large dim
    * Revert "add cuDNN version info in L0 tests" (reverts commit 3e1b426ca5319a2c0540b9e73bba7047d0e583e5)
    * fix tols for Unfused
    Signed-off-by: Charlene Yang <charleney@nvidia.com>
    Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
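"Raise tols slightly to tolerate 1/2048 mismatches" suggests a comparison that allows a small, bounded fraction of elementwise outliers instead of requiring every element to fall within atol/rtol. A hedged sketch of such a check; the helper name and default thresholds are illustrative, not TE's actual test code:

```python
import numpy as np

def assert_allclose_with_mismatches(actual, expected, rtol=1e-3, atol=1e-3,
                                    max_mismatch_frac=1.0 / 2048):
    """Like np.testing.assert_allclose, but tolerates a bounded fraction of outliers."""
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    mismatch = np.abs(actual - expected) > atol + rtol * np.abs(expected)
    frac = mismatch.mean()
    if frac > max_mismatch_frac:
        raise AssertionError(
            f"{int(mismatch.sum())} / {mismatch.size} elements mismatched "
            f"({frac:.2e} > allowed {max_mismatch_frac:.2e})"
        )

# One outlier among 4096 elements stays within the 1/2048 budget.
ref = np.zeros(4096)
out = ref.copy()
out[0] = 1.0  # a single large mismatch
assert_allclose_with_mismatches(out, ref)  # passes: 1/4096 <= 1/2048
```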
  - yuguo authored
    [DCU] fix blockwise int8 train issues in megatron
    See merge request dcutoolkit/deeplearing/TransformerEngine!30