- 12 Jun, 2025 1 commit
-
wenjh authored
Same intention as commit 3e38a2ea; this commit improves accuracy.
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 11 Jun, 2025 1 commit
-
yuguo authored
-
- 10 Jun, 2025 1 commit
-
yuguo authored
-
- 09 Jun, 2025 4 commits
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
yuguo authored
-
yuguo authored
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 06 Jun, 2025 1 commit
-
wenjh authored
The quantize_transpose_vector_blockwise function needs more than 64 KB of LDS when the input type is fp32, but the maximum LDS size on DCU is 64 KB, so as a workaround we store the data in LDS as bf16.
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 05 Jun, 2025 2 commits
-
yuguo authored
-
- 04 Jun, 2025 4 commits
-
yuguo authored
-
yuguo authored
Merge branch 'develop_v2.3' of http://10.16.6.30/dcutoolkit/deeplearing/TransformerEngine into develop_v2.3
-
yuguo authored
-
- 28 May, 2025 2 commits
- 27 May, 2025 7 commits
- 26 May, 2025 3 commits
-
wenjh authored
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
wenjh authored
Use OCP FP8. Workaround: test_cast_float8blockwise.cu links the wrong std::max.
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 23 May, 2025 2 commits
-
yuguo authored
-
- 22 May, 2025 4 commits
-
wenjh authored
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 21 May, 2025 4 commits
- 20 May, 2025 4 commits