Commits · 8fc9d8f113b6d44c03400bf07ac95b3b4eb86e8c · OpenDAS / TransformerEngine

23 Jan, 2026 2 commits

Fix issues related to L0cpp tests · 8fc9d8f1

maxiao3 authored Jan 23, 2026



1. Resolve out-of-bounds issues for types struct
2. Fix TestFusedCastFloat8Vectorwise test case failure
Signed-off-by: maxiao3 <maxiao3@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!73

8fc9d8f1

[DCU] Remove redundant shared memory in rowwise kernel · 261e476b

zc20020701 authored Jan 23, 2026


Signed-off-by: zhaochao <zhaochao1@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!72
Co-authored-by: zhaochao <zhaochao1@sugon.com>

261e476b

12 Jan, 2026 1 commit
- Fix building on nmz · 0fce42f7
  wenjh authored Jan 12, 2026
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  0fce42f7
09 Jan, 2026 1 commit
- Fix swizzle, swap_first_dims and RMSNorm issues · e6f2caf5
  wenjh authored Jan 09, 2026
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  e6f2caf5
08 Jan, 2026 1 commit

Fix tests of L0 test_numeric and L1 test_fusible_ops · 953b6d68

wenjh authored Jan 08, 2026


Signed-off-by: wenjh <wenjh@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!67

953b6d68

07 Jan, 2026 1 commit
- Add nmz support · dc86f372
  wenjh authored Jan 07, 2026
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  dc86f372
31 Dec, 2025 1 commit
- Add bias fwd/bwd at group gemm · cb2fe806
  wenjh authored Dec 31, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  cb2fe806
15 Dec, 2025 1 commit
- Fix blaslt group gemm crash · e698a0a7
  wenjh authored Dec 15, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  e698a0a7
10 Dec, 2025 2 commits
- [DCU] fix mip and compile issues · 121d9224
  tabuchixiangcai3 authored Dec 10, 2025
```
Signed-off-by: Tangao <2205747538@qq.com>
```
  121d9224
- [DCU] Fix compilation unable to find nvte-extract_ded_and_offset · ba058648
  tabuchixiangcai3 authored Dec 10, 2025
```
Signed-off-by: Tangao <2205747538@qq.com>
```
  ba058648
03 Dec, 2025 3 commits
- Make release_v2.9 compile pass · 99e60246
  wenjh authored Dec 03, 2025
  
  99e60246
- Fix build error · cbb14a5f
  wenjh authored Dec 03, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  cbb14a5f
- Fix build error · b3dcfc28
  wenjh authored Dec 03, 2025
  
  b3dcfc28
14 Nov, 2025 1 commit

[JAX] Make all jax attention calls use non-packed common calls (#2358) · b88f727b

Paweł Gadziński authored Nov 14, 2025



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* add notes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* small fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b88f727b

13 Nov, 2025 1 commit
- [PyTorch] Fix amax computation using output_t data in normalization (#2355) · d0d40631
  Evgeny Tsykunov authored Nov 13, 2025
```
Fix amax computation using output_t data in normalization
Signed-off-by: Evgeny <etsykunov@nvidia.com>
```
  d0d40631
12 Nov, 2025 3 commits

[Feature] Enable rope application with offsets for training (#2188) · e4bfa628

Sudhakar Singh authored Nov 12, 2025



* enable applying rope offsets in backward
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add tests for rope offsets for thd/bshd/sbhd formats
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fixes
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

e4bfa628

Fix build error · a622988a
wenjh authored Nov 12, 2025

a622988a
Fix hipblaslt handle manage · f4bd89eb
wenjh authored Nov 12, 2025

f4bd89eb

10 Nov, 2025 1 commit

Move Triton to common (#2359) · 5ea83432

Teddy Do authored Nov 10, 2025



* move triton to common and change paths
Signed-off-by: tdophung <tdophung@nvidia.com>

* Formatting
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>

5ea83432

08 Nov, 2025 1 commit
- Fix user args core dump in mt · a13c52ad
  wenjh authored Nov 08, 2025
  
  a13c52ad
07 Nov, 2025 2 commits

[common] Remove kvpacked and qkvpacked attention functions for every kernel type. (#2287) · 3454f84d

Paweł Gadziński authored Nov 07, 2025



* code drop
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* deprecated compile-time warning + \warning -> \deprecated
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

3454f84d

Disable cuDNN attention for known IMA and NaNs (#2344) · 26aad6b0

Kirthi Shankar Sivamani authored Nov 07, 2025



* Fix cuDNN backend selection for more cases. Add CG as an option as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix logic
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix cuDNN checks
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add more checks
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix cuDNN version
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix error message
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add check for window size
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

26aad6b0

06 Nov, 2025 1 commit
- Fix out of bounds access in the FP4 dequantize kernel (#2346) · f3b97c26
  Przemyslaw Tredak authored Nov 06, 2025
```
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
```
  f3b97c26
31 Oct, 2025 1 commit
- [Common] Deleted unused header (#2324) · e7227af9
  Oleg Goncharov authored Oct 31, 2025
```
Deleted unused header
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
```
  e7227af9
30 Oct, 2025 2 commits

[Common] Split cast/gated kernels by scaling mode (#2248) · 0e80c847

Oleg Goncharov authored Oct 30, 2025



* Separated gated and dequantize kernels
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Separated quantize, dequantize and gated functions
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed lint issues
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed persistent lint issues
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Added missing compute capability 10.0 check for Quantize FP8 TMA kernels
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed the issue which was added again by autofix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Changed files description. Completely removed non-identity activations from the NVFP4 transpose test suite
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Removed unsupported template arguments in NVFP4 quantize
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed undefined symbol error
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed condition
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

* Fixed CUDA version check
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Changed arch conditions order
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Clean up
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Small fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Small fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the PR review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Split quantize helper into two (FWD and BWD) functions
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Moved activation functions from cast.cu. Removed cast.cu from the fast-math compilation list
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Enabled fast math for activations by default
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math for activations by default
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0e80c847

CMake to respect MAX_JOBS or NVTE_MAX_JOBS (#2319) · f0295f9d
Phuong Nguyen authored Oct 30, 2025
```
fix max jobs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
```
f0295f9d

27 Oct, 2025 1 commit

Remove `nvidia-mathdx` dependency (#2295) · d7c9777e

Kirthi Shankar Sivamani authored Oct 27, 2025



* Remove nvidia-mathdx dep
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix SR
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add comment
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

d7c9777e

25 Oct, 2025 1 commit

[PyTorch] Add max_logit support for MuonClip (#2195) · 87cb26c6

Charlene Yang authored Oct 24, 2025



* add max_score for fused/unfused F16 non-CP
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* calculate max per head instead of max over all heads
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fused attn max_score shape
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert FE to github
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update FE to 1.15.0-rc
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix merge
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* reduce ew kernels; fix causal masks; add more tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fix to tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove logic for flash-attn
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: add CP support for p2p/a2a/all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor improvements of implementation/tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* WIP: add thd support
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add thd to UnfusedDPA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for lint
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update to FE 1.15
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove unneeded changes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable unfused for thd + pad_between_seqs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable thd for unfused until bug is fixed
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename max_score to max_logit
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix all_gather
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* disable fused attn + thd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

87cb26c6

24 Oct, 2025 1 commit
- [Common] Fix checks in quantize_transpose_vector_blockwise_fp4 (#2299) · 060811c9
  jberchtold-nvidia authored Oct 24, 2025
```
fix checks in unoptimized non-rht fp4 quantize kernel
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  060811c9
23 Oct, 2025 1 commit

Overhaul the compilation for the arch-specific features (#2279) · eb34783c

Przemyslaw Tredak authored Oct 22, 2025



* Added sm_120f to the build
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the arch specific handling
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Support for CUDA<12.9
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved through the rest of the files
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Common cases
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Remove pure 100 from the list
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* CMake changes (not yet working)
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Do not pass the arch-specific thing from build_tools
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Moved some of the files to arch-specific compilation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix and also changing the order of compilation to hopefully get the compilation time lower
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the files overwriting custom compile properties
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually make this whole thing work
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add space to the error message
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Apply suggestions from code review
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* Fixes from review
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changing the naming to be more intuitive
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add missing cassert include for device-side asserts
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>

eb34783c

21 Oct, 2025 1 commit

[PyTorch][MOE] Support NVFP4 Grouped Linear (#2215) · b4a1d4d6

Zhongbo Zhu authored Oct 20, 2025



* pipeclean, fix nvfp4 padding of 32 alignment
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* numerical test passed
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix CI failure with test_cast_master_weights_to_fp8 (in a hacky way)
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* found CUDA mis-aligned address error in training in multi-swizzle, hack the vec_load_size to 1 to unblock
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* leave comments about alignment issue
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fused bulk alloc nvfp4
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix RHT sign mask CPU overhead
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve comments
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Remove incorrect logic that treats 0-D tensor as uninitialized

Tensor shape logic still requires treating 0-D tensor as uninitialized.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix invalid conversion from tensor to int
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

b4a1d4d6

18 Oct, 2025 1 commit

Wheels for cuda 13 (#2278) · fd234d80

Kirthi Shankar Sivamani authored Oct 18, 2025



* Support wheel build for cuda 13
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fixes for cu13 runtime, format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add documentation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better error handling
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix jax sdist
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Modify function names
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

fd234d80

17 Oct, 2025 2 commits

Make `CanonicalizeGemmInput()` support non-TN layout FP8 GEMM on Blackwell... · ee384ab5

Alp Dener authored Oct 17, 2025

Make `CanonicalizeGemmInput()` support non-TN layout FP8 GEMM on Blackwell with column-wise/transposed data (#2233)

Modified CanonicalizeGemmInput() logic to pull from column-wise data for FP8 GEMM on Blackwell when row-wise is not available.
Signed-off-by: Alp Dener <adener@nvidia.com>

ee384ab5

fall back after failing ldconfig-based lib loading for cuDNN (#2277) · bd380048
Tim Geypens authored Oct 17, 2025
```
Signed-off-by: Tim Geypens <tim.geypens@gmail.com>
```
bd380048

16 Oct, 2025 2 commits
- [DCU] remove redundant gemm · 47077129
  yuguo authored Oct 16, 2025
  
  47077129
- [DCU] Fix memory overflow and test-distributed in L1_pytorch_distributed_unittest · 2a64c9a6
  tabuchixiangcai3 authored Oct 16, 2025
```
Signed-off-by: Tangao <2205747538@qq.com>
```
  2a64c9a6
15 Oct, 2025 3 commits
- Fix typo · a26a0c30
  wenjh authored Oct 15, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  a26a0c30
- [DCU] fix compile issues · aa62d24c
  yuguo authored Oct 15, 2025
  
  aa62d24c
- [DCU] fix compile issues · 8d5cd8c6
  yuguo authored Oct 15, 2025
  
  8d5cd8c6
14 Oct, 2025 1 commit

Generalize quantization APIs for FP8/FP4/.. recipes (#2256) · 85a91997

Kirthi Shankar Sivamani authored Oct 14, 2025



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

85a91997