- 14 May, 2025 2 commits
- 13 May, 2025 1 commit
- yuguo authored
- 09 May, 2025 1 commit
- yuguo authored
  Merge commit '04c730c0' of https://github.com/NVIDIA/TransformerEngine
- 08 May, 2025 4 commits
- yuguo authored
- wenjh authored
  The accuracy problem in test_grouped_linear_accuracy and test_grouped_gemm arises because the test output and the reference output are computed with different kernels. Setting NVTE_FORCE_ROCM_GEMM=1 forces these tests to call the ROCm GEMM, so both outputs use the same kernel. Signed-off-by: wenjh <wenjh@sugon.com>
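For illustration, the environment-variable guard described in this commit might look like the following minimal Python sketch. Only the `NVTE_FORCE_ROCM_GEMM` variable name comes from the commit message; the function name and its placement are assumptions:

```python
import os

def use_rocm_gemm() -> bool:
    # Hypothetical sketch of the dispatch guard: when NVTE_FORCE_ROCM_GEMM=1,
    # always take the ROCm GEMM path so that the test output and the
    # reference output are produced by the same kernel.
    return os.environ.get("NVTE_FORCE_ROCM_GEMM", "0") == "1"
```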
- wenjh authored
  Default to hipMallocAsync rather than hipMalloc in rocblas_gemm, and add support for fp16_fp16_fp32 in rocblas_gemm. Signed-off-by: wenjh <wenjh@sugon.com>
- 07 May, 2025 2 commits
- 06 May, 2025 5 commits
- yuguo authored
- wenjh authored
  Fix the launch bounds of multi_tensor_apply_kernel and thd_out_correction_kernel. Signed-off-by: wenjh <wenjh@sugon.com>
- yuguo authored
- wenjh authored
  Fix launch parameters that exceed the launch bounds (256) for kernels in rocm_gemm.cu. Signed-off-by: wenjh <wenjh@sugon.com>
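The fix above amounts to making sure no kernel is launched with more threads per block than its compiled launch bound. A minimal Python sketch of that clamping logic (the helper name is hypothetical; only the bound of 256 comes from the commit message):

```python
def clamp_block_size(requested: int, launch_bound: int = 256) -> int:
    # Hypothetical sketch: a kernel compiled with __launch_bounds__(256)
    # must not be launched with more threads per block than that bound,
    # so the requested block size is clamped before the kernel launch.
    return min(requested, launch_bound)
```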
- 30 Apr, 2025 1 commit
- wenjh authored
  [RocblasGemm] Provide support for AB(bf16)D(fp32). Signed-off-by: wenjh <wenjh@sugon.com>
- 29 Apr, 2025 5 commits
- 28 Apr, 2025 1 commit
- yuguo authored
- 27 Apr, 2025 2 commits
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>
- wenjh authored
  Passing the rmsnorm parameters by reference corrupted the program with a 'nil' error. Signed-off-by: wenjh <wenjh@sugon.com>
- 25 Apr, 2025 5 commits
- yuguo authored
- panning authored
  The Python API `rmsnorm_forward` returns 3 values rather than 2 starting from v2.3. Signed-off-by: wenjh <wenjh@sugon.com>
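Callers that must work across both versions can unpack the return value defensively. A hypothetical compatibility shim, assuming the 2-tuple and 3-tuple shapes described in the commit (the element names `_mu` and `rsigma` are assumptions, not confirmed by the source):

```python
def unpack_rmsnorm(result):
    # Hypothetical shim: rmsnorm_forward returned 2 values before v2.3
    # and returns 3 values from v2.3 on; accept either shape and return
    # only the two elements common to both.
    if len(result) == 3:
        out, _mu, rsigma = result  # middle-element name is an assumption
    else:
        out, rsigma = result
    return out, rsigma
```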
- yuguo authored
- 24 Apr, 2025 2 commits
- wenjh authored
  Due to the difference in warp size between NVIDIA (32) and DTK (64), all OperatorTest/CTDBiasTestSuite.TestCTDBias/* cases fail except:
  * OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat32X65536X128
  * OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat16X65536X128
  * OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xbfloat16X65536X128
  * OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat8e5m2X65536X128
  * OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat8e4m3X65536X128
  This commit fixes those failures. Signed-off-by: wenjh <wenjh@sugon.com>
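The warp-size mismatch matters because the same thread block decomposes into a different number of warps on each platform, which changes how warp-level reductions and shuffles partition the data. A small illustration (the helper name is mine; the warp sizes 32 and 64 come from the commit message):

```python
def warps_per_block(threads_per_block: int, warp_size: int) -> int:
    # Illustration of why warp-size-dependent kernels diverge between
    # NVIDIA (warp size 32) and DTK/AMD (wavefront size 64): the same
    # thread block maps onto a different number of warps, so any code
    # hard-coded for 32 lanes partitions the work incorrectly at 64.
    return (threads_per_block + warp_size - 1) // warp_size
```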
- wenjh authored
  The compiler generated incorrect code, crashing the following test cases:
  * OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X2048X12288
  * OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X65536X128
  * OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X256X65536
  This commit fixes those test cases. Signed-off-by: wenjh <wenjh@sugon.com>
- 23 Apr, 2025 2 commits
- 22 Apr, 2025 1 commit
- yuguo authored
- 18 Apr, 2025 4 commits
- yuguo authored
- Przemek Tredak authored
  Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
- Hongbin Liu authored
  * split wgrad for GroupedLinear
  * support wgrad split for linear and ln_linear
  * add comments and fix WeightGradStore
  * support bias and fix unit tests
  * support fuse_grad_accumulation=false
  * add wgrad split for layernorm_mlp
  * fix unittest
  * add unittest for distributed interface, apply Dener's suggestion
  * replace split_bw with delay_wgrad_compute
  * Update transformer_engine/pytorch/module/layernorm_mlp.py
  * Update transformer_engine/pytorch/module/linear.py
  * Update transformer_engine/pytorch/module/layernorm_linear.py
  * remove comments
  * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
  Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
  Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
  Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
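The `WeightGradStore` named in the commit above supports delaying the weight-gradient (wgrad) GEMM instead of computing it inline in the backward pass. A pure-Python sketch of that deferred-wgrad pattern, under the assumption that the store queues (input, grad_output) pairs; only the class name `WeightGradStore` comes from the commit message, and all method names here are hypothetical:

```python
from collections import deque

class WeightGradStore:
    """Hypothetical sketch of delayed wgrad computation: the backward pass
    enqueues (input, grad_output) pairs, and the actual wgrad GEMM runs
    later, e.g. overlapped with pipeline-parallel communication."""

    def __init__(self):
        self._queue = deque()

    def put(self, inp, grad_out):
        # Defer the wgrad GEMM; just remember its operands.
        self._queue.append((inp, grad_out))

    def pop(self):
        # For a linear layer y = x @ W.T the weight gradient is
        # wgrad[o][i] = sum_b grad_out[b][o] * inp[b][i].
        inp, grad_out = self._queue.popleft()
        n_out, n_in = len(grad_out[0]), len(inp[0])
        return [[sum(g[o] * x[i] for g, x in zip(grad_out, inp))
                 for i in range(n_in)] for o in range(n_out)]
```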
- Phuong Nguyen authored
  rm pax/praxis
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
  Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
- 17 Apr, 2025 2 commits
- wdykas authored
  * re merge request
  * add docstring
  * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
  Signed-off-by: Peter Dykas <wdykas@nvidia.com>
  Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
- Xin Yao authored
  * move swizzle scaling factor to cpp
  * resolve comments
  * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
  Signed-off-by: Xin Yao <xiny@nvidia.com>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>