Commits · 9df0c4a347a14ce4d028a16684697a0b38d11a8f · OpenDAS / TransformerEngine

13 Feb, 2026 1 commit

[JAX] TE Permutation integration to Maxtext (#2672) · 5d112e3c

Teddy Do authored Feb 13, 2026

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add more pieces missing from the cherry-pick of Jeremy's PR for inspecting
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix some tracing issues when integrating into MaxText
Signed-off-by: tdophung <tdophung@nvidia.com>

* Have sort_chunks_by_index handle situations where input buffer is larger than num tokens
Signed-off-by: tdophung <tdophung@nvidia.com>

* remove unnecessary assert and comments
Signed-off-by: JAX Toolbox <jax@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove Jeremy's PR for inspect ffi
Signed-off-by: JAX Toolbox <jax@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* revert changes to the amax file; also change comment on TE
Signed-off-by: JAX Toolbox <jax@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Signed-off-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>

5d112e3c

12 Feb, 2026 4 commits

Get rid of nvshmem dependency for cuBLASMp integration (#2661) · 496620a9

vcherepanov-nv authored Feb 12, 2026



* Remove nvshmem usage
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Renamings
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* NCCL dependency
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Check for not yet allocated workspace
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Address greptile comments
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Add a comment per greptile
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix a typo
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Display human-readable cuBLASMp error message
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

---------
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

496620a9

Add sigmoid GLU (#2656) · 33ca6150

Kim, Jin (Jay@SKT) authored Feb 13, 2026



* Add sigmoid GLU
Signed-off-by: Kim, Jin <jinn.kim@sk.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Kim, Jin <jinn.kim@sk.com>

* Add test for GLU op
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix incorrect reshape
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Apply suggestion from @timmoon10
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Add omitted tests for GLU op
Signed-off-by: Kim, Jin <jinn.kim@sk.com>

* Add GLU activation type support in JAX extension
Signed-off-by: Kim, Jin <jinn.kim@sk.com>

* [PyTorch] Add Sigmoid activation for GLU support in numerics test (#2656)
Signed-off-by: Kim, Jin <jinn.kim@sk.com>

---------
Signed-off-by: Kim, Jin <jinn.kim@sk.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

33ca6150
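
For readers unfamiliar with the activation named in this entry, a sigmoid GLU splits the last dimension in half and gates one half with the sigmoid of the other. The sketch below is a plain-Python reference only, not the kernel added by the commit; the function names are hypothetical, and the choice of which half acts as the gate is an assumption (it follows the common `a * sigmoid(b)` convention and may differ from the layout used in the library).

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def glu_reference(row):
    """Reference sigmoid GLU over one row (hypothetical helper).

    Assumption: the first half holds the values and the second half
    holds the gates, i.e. out = a * sigmoid(b).
    """
    if len(row) % 2 != 0:
        raise ValueError("last dimension must be even to split into two halves")
    half = len(row) // 2
    values, gates = row[:half], row[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]
```

The output has half as many elements as the input, which is the property that makes the reshape in such kernels easy to get wrong.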

[Common] Fuse pre-swizzling into grouped MXFP8 quantization kernel (#2630) · 93d51c82

Oleg Goncharov authored Feb 12, 2026



* Added GEMM-ready preswizzling option
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

93d51c82

fix(build): Handle namespace packages for PyPI CUDA detection (#2580) · c4175fca

Santosh Bhavani authored Feb 11, 2026



fix: handle nvidia namespace packages where __file__ is None
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c4175fca
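
The situation this fix targets can be illustrated in a few lines: a namespace package has no `__init__.py`, so its `__file__` is `None` (or absent) and any code that calls `os.path.dirname(module.__file__)` fails. The sketch below is an assumption about the general approach, not the build script's actual code; `package_dirs` is a hypothetical helper that falls back to `__path__`.

```python
import os
import types


def package_dirs(module):
    """Return the directories a package lives in (hypothetical helper).

    Regular packages expose __file__; namespace packages have
    __file__ set to None (or missing) and expose only __path__.
    """
    file_attr = getattr(module, "__file__", None)
    if file_attr is not None:
        return [os.path.dirname(file_attr)]
    # Namespace package: __path__ may list several directories.
    return [str(p) for p in getattr(module, "__path__", [])]


# Simulated modules, so the example does not depend on installed wheels.
regular = types.ModuleType("regular_pkg")
regular.__file__ = "/site-packages/regular_pkg/__init__.py"

namespace = types.ModuleType("namespace_pkg")
namespace.__file__ = None
namespace.__path__ = ["/site-packages/namespace_pkg"]
```

Returning a list rather than a single path keeps the caller honest about the fact that a namespace package can span more than one directory.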

11 Feb, 2026 2 commits

[C] NVFP4 quantization for `GroupedTensor` (#2655) · 402ea54b

Kirthi Shankar Sivamani authored Feb 12, 2026



* NVFP4 GroupedQuantize
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix fp4
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Remove unnecessary file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>

402ea54b

[PyTorch] Python `GroupedTensor` (#2654) · ac81c85b

Kirthi Shankar Sivamani authored Feb 11, 2026



* PyTorch-Python GroupedTensor
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/pytorch/tensor/storage/grouped_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove mxfp8 gq test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix recipe tests and FP8 weights
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix device test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Disable grouped weights for unsupported recipes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

ac81c85b

10 Feb, 2026 1 commit

[pyTorch] Fix the compilation warnings (#2663) · b09ff7e9

Przemyslaw Tredak authored Feb 10, 2026



* Fix the compilation warnings for the PyTorch extension
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Apply suggestion from @greptile-apps[bot]
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

b09ff7e9

07 Feb, 2026 1 commit

[Common] Bucket batch size with higher granularity for THD (#2653) · dccf67e7

Charlene Yang authored Feb 06, 2026



bucket max_b with more granularity when >512
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

dccf67e7

06 Feb, 2026 2 commits

[Common] MXFP8 kernel for grouped tensors (#2586) · 73939472

Oleg Goncharov authored Feb 06, 2026



* Rebased to main
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed the year to 2026
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added compilation guards
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Added BWD pass
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Added dbias and dact tests. Refactoring.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Added grouped MXFP8 DACT and ACT API and tests
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed a typo
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes from the review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed requirement for last dim from mod128 to mod32
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added alignment checks when tensor descriptors are modified
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>

73939472

Fix exp2f_rcp to properly handle nan and 0xFE cases (#2647) · 71971e33

Jacket authored Feb 05, 2026

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>

71971e33

04 Feb, 2026 2 commits

Fix undefined use_int8 error · 99a1c744

wenjh authored Feb 04, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

99a1c744

Remove dump code of tensorwise_int8_bgrad_kernel · 2bb532fb

wenjh authored Feb 04, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

2bb532fb

03 Feb, 2026 1 commit

[Common] Fix NVFP4 tuned-kernel numerics (#2639) · 29b84c16

Oleg Goncharov authored Feb 03, 2026



* Fixed scaling-factor computation for FP32 to match the reference implementation.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Uncommented the tuned kernel path
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

29b84c16

30 Jan, 2026 2 commits

Fix minimum version of cublas for grouped gemm (#2631) · c3769cb7

Paweł Gadziński authored Jan 30, 2026



* version change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

c3769cb7

Fix out-of-bounds issues for types struct in common/common.h · d2c77acc

wenjh authored Jan 30, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

d2c77acc

28 Jan, 2026 1 commit

[common] Add support for cuBLASLt GEMM for GroupedTensor (#2502) · b9f40131

Paweł Gadziński authored Jan 28, 2026



* code drop
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add FP8 scale support and fix alignment for grouped GEMM

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Grouped GEMM: code cleanup and NULL C support

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Grouped GEMM: per-matrix alpha/beta support

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix alpha/beta numel - use SimpleTensor::numel()
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor: move grouped GEMM to separate file and cleanup API
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/gemm/config.h
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* changed
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* suggestions
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* refactored hopper tensor selection
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>

b9f40131
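
The semantics described in this entry (per-matrix alpha/beta, validation of their length, and an optional C when beta is zero) can be written down as a small reference. This is a plain-Python sketch under stated assumptions, not the `nvte_grouped_gemm` implementation: the function names are hypothetical, each group member is computed as `D[i] = alpha[i] * A[i] @ B[i] + beta[i] * C[i]`, and omitting C is assumed to be valid only when every beta is 0.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm_reference(a_list, b_list, alpha, beta, c_list=None):
    """Reference grouped GEMM with per-matrix scaling (hypothetical helper)."""
    num_tensors = len(a_list)
    if len(b_list) != num_tensors:
        raise ValueError("A and B must contain the same number of matrices")
    # Mirrors the validation rule in the commit message:
    # alpha/beta need exactly num_tensors elements.
    if len(alpha) != num_tensors or len(beta) != num_tensors:
        raise ValueError("alpha and beta must have exactly num_tensors elements")
    # Assumption: C may be omitted only when it cannot contribute.
    if c_list is None and any(b != 0 for b in beta):
        raise ValueError("C may be omitted only when every beta is 0")

    outputs = []
    for i in range(num_tensors):
        product = matmul(a_list[i], b_list[i])
        if c_list is None:
            d = [[alpha[i] * v for v in row] for row in product]
        else:
            d = [
                [alpha[i] * v + beta[i] * c for v, c in zip(row, c_row)]
                for row, c_row in zip(product, c_list[i])
            ]
        outputs.append(d)
    return outputs
```

A reference of this kind is mainly useful in tests, where each group member can have a different shape and its own scaling factors.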

23 Jan, 2026 4 commits

Fix issues related to L1cpp tests · 284d3f6f

maxiao3 authored Jan 23, 2026



1. nvte_dgelu not found
2. fsdp_group is not None
3. Change CPUOffloadEnabled to cpp_offload_v1
Signed-off-by: maxiao3 <maxiao3@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!74

284d3f6f

Fix issues related to L0cpp tests · 8fc9d8f1

maxiao3 authored Jan 23, 2026



1. Resolve out-of-bounds issues for types struct
2. Fix TestFusedCastFloat8Vectorwise test case failure
Signed-off-by: maxiao3 <maxiao3@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!73

8fc9d8f1

[DCU] Remove redundant shared memory in rowwise kernel · 261e476b

zc20020701 authored Jan 23, 2026


Signed-off-by: zhaochao <zhaochao1@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!72
Co-authored-by: zhaochao <zhaochao1@sugon.com>

261e476b

[Common] Disabled the tuned NVFP4 kernels (#2615) · a0a89a8e

Oleg Goncharov authored Jan 23, 2026



* Disabled the tuned NVFP4 kernels
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math in cpp tests
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

a0a89a8e

22 Jan, 2026 1 commit

Add support for SWA (left, right) with FusedAttention (#2477) · c6a92a4d

Sudhakar Singh authored Jan 22, 2026

* SWA (left, right) with FusedAttention changes cherry-picked from https://github.com/NVIDIA/TransformerEngine/pull/1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix test_kv_cache failures
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove dead code and make conditions better
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* feedback from Charlene
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* plumb `bottom_right_diagonal` through jax
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add missing fields
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

c6a92a4d
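
To make the `(left, right)` window and the `bottom_right_diagonal` flag in this entry concrete, the sketch below builds the boolean attention mask they imply. It is an illustration under assumptions, not the FusedAttention code path: `swa_mask` is a hypothetical helper, a negative window size is taken to mean "unbounded" on that side, and the bottom-right case is assumed to shift the diagonal by `seqlen_kv - seqlen_q`, as is common in attention libraries.

```python
def swa_mask(seqlen_q, seqlen_kv, left, right, bottom_right_diagonal=True):
    """Return mask[i][j] == True where query i may attend to key j.

    Hypothetical reference: the window is centred on the diagonal, which
    is aligned to the bottom-right corner of the score matrix when
    bottom_right_diagonal is True and to the top-left corner otherwise.
    """
    offset = seqlen_kv - seqlen_q if bottom_right_diagonal else 0
    mask = []
    for i in range(seqlen_q):
        diag = i + offset  # key index on the diagonal for query i
        row = []
        for j in range(seqlen_kv):
            within_left = left < 0 or j >= diag - left
            within_right = right < 0 or j <= diag + right
            row.append(within_left and within_right)
        mask.append(row)
    return mask
```

With `seqlen_q != seqlen_kv` the two alignments select different keys for the same window, which is why the flag has to be computed consistently in both the forward and backward calls.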

21 Jan, 2026 3 commits

[pyTorch] CPU performance optimizations (#2439) · 605786f4

Przemyslaw Tredak authored Jan 21, 2026



* PoC of the changes
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Simpler solution and fixes
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Speculative moving the validate_name to the constructor
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in an observable way
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

605786f4

Fixed the year to 2026 (#2611) · 36f4e451

Oleg Goncharov authored Jan 21, 2026

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

36f4e451

[Common] Tuned NVFP4 cast kernel (#2412) · fbb16f4a

Oleg Goncharov authored Jan 21, 2026



* Implemented persistent nvfp4 kernel
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix FP4 guard in ptx
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixes per PR review. Added parameter to turn off the persistency
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Use the tuned kernel also for the rowwise only quantization
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Macros renaming
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fbb16f4a

20 Jan, 2026 1 commit

[Common] Enable determinism for cuDNN >= 9.18.1 on Blackwell (#2584) · 27fc168e

Charlene Yang authored Jan 20, 2026



* update FE to 1.17
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update tests/jax/test_fused_attn.py

fix typo
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix the AI fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix Jax extension call
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix version check in Jax test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix non-/determinism logic and CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix formatting
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* reduce Jax CI tests for determinism
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

27fc168e

17 Jan, 2026 1 commit

Add logic for block-scaled tensors with GEMM swizzled scales (#2486) · 99df8810

Tim Moon authored Jan 16, 2026



* Add general C API for setting tensor params
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Implement general accessors for NVTETensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Refactor tex swizzling to skip if scales are already swizzled
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add checks for non-swizzled scales in MXFP8 and NVFP4 kernels
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support pre-swizzled scales in MXFP8Tensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tex function to swizzle MXFP8 scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in inplace swizzle function
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak comments to use "compact/swizzled format"
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* MXFP8 quantize kernel with pre-swizzled scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose pre-swizzled scales in modules
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in multi-swizzle
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support MXFP8 gated activations with swizzled scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add PyTorch infrastructure for pre-swizzled NVFP4 tensors
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Deprecate DSv3-specific quantization logic in C API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove support for DSv3 compact data from quantizer
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove DSv3 compact data format from core lib
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in FP8 all-gather
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX to use new swizzled scale API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestion from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestions from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update C++ swizzle test with swizzled scales API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Return default tensor params when querying params for invalid NVTETensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug DSv3 FP8 test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug Userbuffers test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure gated activations populate FP8 transpose if needed
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestions from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable pre-swizzling with debug quantizer
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix merge conflicts and review suggestions

Update copyright years. Tweak comments. Fix various complaints from @greptile-apps.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use explicitly sized types in config accessors

Miscellaneous review suggestions from @ptrendx.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Make util header for function that computes swizzled scale index
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Apply suggestions from @greptile-apps
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Update expected error message in FP8 block-scaling test
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @yaox12
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

99df8810

16 Jan, 2026 1 commit

[JAX] Custom partitioning for Permutation primitives (#2591) · a652730f

Teddy Do authored Jan 16, 2026



* initial impl, not tested
Signed-off-by: tdophung <tdophung@nvidia.com>

* consolidate different unpermute primitives with with_pad and with_merging_probs booleans. Implement partitioning for all permutation primitives
Signed-off-by: tdophung <tdophung@nvidia.com>

* Add distributed test for non-padding permutation
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix issues in distributed test for padding permutation. Make common kernel zero-initialize output permuted scales, permuted probs and output tokens
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* revert zeroing in triton common kernel as it is a race condition. Instead, add extra input (aliased with output) buffer to inner primitive of permutation on jax side to pass in zero-initialized buffers created with jnp.zeros
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix utils to handle input output aliasing in autotuned kernels
Signed-off-by: tdophung <tdophung@nvidia.com>

* Clean up comments, and add more comments explaining input output alias in utils
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint and greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix issues that lint fixing introduced
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

a652730f

15 Jan, 2026 2 commits

fix: enable opt for cutlass sources to avoid infinite compile time (#2595) · 6a34b657

Jacket authored Jan 15, 2026

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>

6a34b657

(Bug fix) Fix accuracy issue for blockwise scaling+E8 scale on Blackwell (#2589) · fcfa0c3c

Hongbin Liu authored Jan 15, 2026



* bug fix
Signed-off-by: hongbinl <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/swizzle/swizzle_block_scaling.cu

Mask to 8 bits to prevent potential bit overlap
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/swizzle/swizzle_block_scaling.cu
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>

* fix bug in 2d too
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

---------
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

fcfa0c3c
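
The "Mask to 8 bits to prevent potential bit overlap" note in the commit above can be illustrated with a minimal sketch. This is not the code from `swizzle_block_scaling.cu`; the function names and the four-bytes-per-word layout are assumptions made for illustration only. The idea is that when several 8-bit scale values are packed into one 32-bit word, masking each value before shifting keeps a stray high bit from leaking into a neighboring byte.

```python
# Hypothetical helpers: pack four 8-bit scale values into one 32-bit word,
# with the lowest index stored in the lowest byte.

def pack_scales(scales):
    word = 0
    for i, s in enumerate(scales):
        # Mask to 8 bits so bits above bit 7 cannot spill into the next byte.
        word |= (s & 0xFF) << (8 * i)
    return word


def pack_scales_unmasked(scales):
    word = 0
    for i, s in enumerate(scales):
        # No mask: any bit above bit 7 overlaps the neighboring byte.
        word |= s << (8 * i)
    return word


# The first value carries a stray bit 8 (0x17F instead of 0x7F).
dirty = [0x17F, 0x00, 0x02, 0x03]
print(hex(pack_scales(dirty)))           # second byte stays 0x00
print(hex(pack_scales_unmasked(dirty)))  # second byte is corrupted to 0x01
```

With the mask, each value can only affect its own byte, so an out-of-range input degrades one scale rather than silently corrupting its neighbor.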

12 Jan, 2026 1 commit

Fix building on nmz · 0fce42f7

wenjh authored Jan 12, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

0fce42f7

10 Jan, 2026 1 commit

Debug doc generation (#2576) · 2f8ae81c

Tim Moon authored Jan 09, 2026



Debug Doxygen and LaTeX warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

2f8ae81c

09 Jan, 2026 1 commit

Fix swizzle, swap_first_dims and RMSNorm issues · e6f2caf5

wenjh authored Jan 09, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

e6f2caf5

08 Jan, 2026 1 commit

Fix tests of L0 test_numeric and L1 test_fusible_ops · 953b6d68

wenjh authored Jan 08, 2026


Signed-off-by: wenjh <wenjh@sugon.com>

See merge request dcutoolkit/deeplearing/TransformerEngine!67

953b6d68

07 Jan, 2026 2 commits

[NVFP4][MOE] Bug Fix for NVFP4 Grouped Quant (#2564) · de51c96b

Zhongbo Zhu authored Jan 07, 2026



* fix
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve review comments
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Comment tweaks
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

de51c96b

Add nmz support · dc86f372

wenjh authored Jan 07, 2026

Signed-off-by: wenjh <wenjh@sugon.com>

dc86f372

06 Jan, 2026 1 commit

[Common] Fix long compile time in padding.cu on arch 75 (#2562) · df69100c

jberchtold-nvidia authored Jan 06, 2026



* Fix long compile time in padding.cu
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

df69100c

05 Jan, 2026 1 commit

Fix out of bound ID passed to `cutlass::arch::NamedBarrier::sync` (#2554) · 4f364c8e

Kirthi Shankar Sivamani authored Jan 05, 2026

Fix barrier ID
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

4f364c8e

02 Jan, 2026 1 commit

Update copyright to include year 2026 (#2553) · 830ef60f

Kirthi Shankar Sivamani authored Jan 02, 2026

Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

830ef60f

31 Dec, 2025 1 commit

Fix overflow of padding/unpadding kernel (#2548) · 697b52cb

刘俊 authored Dec 31, 2025


Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

697b52cb