- 13 Jun, 2025 2 commits
- 12 Jun, 2025 4 commits
- wenjh authored
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>
- wenjh authored
- wenjh authored
  Same intention as commit 3e38a2ea; this commit improves accuracy.
  Signed-off-by: wenjh <wenjh@sugon.com>
- 11 Jun, 2025 2 commits
- 10 Jun, 2025 2 commits
- 09 Jun, 2025 7 commits
- wenjh authored
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>
- yuguo authored
  [DCU] fix. See merge request dcutoolkit/deeplearing/TransformerEngine!24
- yuguo authored
- yuguo authored
  [DCU] support casting the master weight to int8. See merge request dcutoolkit/deeplearing/TransformerEngine!23
- yuguo authored
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>
- 06 Jun, 2025 1 commit
- wenjh authored
  The quantize_transpose_vector_blockwise function uses more than 64 KB of LDS when the input type is fp32, but the maximum LDS size on DCU is 64 KB, so as a workaround we stage data in LDS as bf16.
  Signed-off-by: wenjh <wenjh@sugon.com>
- 05 Jun, 2025 2 commits
- yuguo authored
- 04 Jun, 2025 4 commits
- yuguo authored
- yuguo authored
  Merge branch 'develop_v2.3' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine into develop_v2.3
- yuguo authored
- 28 May, 2025 2 commits
- 27 May, 2025 7 commits
- 26 May, 2025 3 commits
- wenjh authored
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>
- wenjh authored
  Use OCP fp8. Workaround: test_cast_float8blockwise.cu links the wrong std::max.
  Signed-off-by: wenjh <wenjh@sugon.com>
- 23 May, 2025 2 commits
- yuguo authored
- 22 May, 2025 2 commits
- wenjh authored
- wenjh authored
  Signed-off-by: wenjh <wenjh@sugon.com>