Commits · 52ee5ea06737f1e1604d154b943aa51fab9b0f3d · OpenDAS / TransformerEngine

23 Jan, 2026 1 commit

Fix bugs in permutation custom partitioning (#2617) · 52ee5ea0

Teddy Do authored Jan 22, 2026



* Use correct block size for workspace in row id map creation, also shard workspace correctly based on 2nd dim of routing_map/row_id map
Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

* reduce size of largest test case in the single-GPU scenario to fit on L40 and A100 in the CI lineup
Signed-off-by: tdophung <hanhdp99@gmail.com>

---------
Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
Signed-off-by: tdophung <hanhdp99@gmail.com>
Co-authored-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

52ee5ea0

22 Jan, 2026 4 commits

Add support for SWA (left, right) with FusedAttention (#2477) · c6a92a4d

Sudhakar Singh authored Jan 22, 2026

* SWA (left, right) with FusedAttention changes cherry-picked from https://github.com/NVIDIA/TransformerEngine/pull/1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix test_kv_cache failures
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove dead code and make conditions better
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* feedback from Charlene
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* plumb `bottom_right_diagonal` through jax
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add missing fields
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

c6a92a4d
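The `(left, right)` window and the `bottom_right_diagonal` flag mentioned in #2477 can be illustrated with a small pure-Python mask builder. This is only a sketch of the common sliding-window convention, not TransformerEngine or cuDNN code: the function name `swa_mask`, the use of a negative value to mean "unbounded", and the exact alignment rule are assumptions made for illustration.

```python
def swa_mask(s_q, s_kv, left, right, bottom_right_diagonal=False):
    """Return an s_q x s_kv grid of booleans; True means query i may attend to key j.

    Hypothetical helper for illustration only.
    Assumption: a negative `left` or `right` means no limit on that side.
    """
    # With bottom-right alignment the window diagonal ends at the last key,
    # so every query position is shifted by the length difference.
    shift = s_kv - s_q if bottom_right_diagonal else 0
    mask = []
    for i in range(s_q):
        center = i + shift
        row = []
        for j in range(s_kv):
            ok_left = left < 0 or j >= center - left
            ok_right = right < 0 or j <= center + right
            row.append(ok_left and ok_right)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for aligned in (False, True):
        print("bottom_right_diagonal =", aligned)
        for row in swa_mask(2, 4, left=1, right=0, bottom_right_diagonal=aligned):
            print("".join("x" if allowed else "." for allowed in row))
```

When `s_q == s_kv` both alignments give the same mask; the flag only matters when the query and key/value lengths differ, as they do in KV-cache inference.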

[PyT] Update THD sink attention logic for cudnn >=9.18.0 (#2568) · 0f0e229b

Chen Cui authored Jan 22, 2026



* Update THD sink attention logic for newer cudnn versions

THD Sink attention is supported in 9.18.0
Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update thd sink attention logic for cp>1
Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add unit test for thd + sink attention
Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address comments
Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* do not skip thd cp sink attention test
Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* disable deterministic mode for sink attention
Signed-off-by: Chen Cui <chcui@nvidia.com>

---------
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

0f0e229b

Permutation to always return group_size/tokens_per_expert (#2613) · 3d46bf61
Teddy Do authored Jan 22, 2026
```
return tokens_per_experts always
Signed-off-by: tdophung <tdophung@nvidia.com>
```
3d46bf61

[JAX] Fix cb.CUDAOptions usage for Triton 3.6.0 (#2610) · 8bf37f0e

jberchtold-nvidia authored Jan 21, 2026



* Fix cb.CUDAOptions usage for Triton 3.6.0
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update utils.py
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8bf37f0e

21 Jan, 2026 3 commits

[pyTorch] CPU performance optimizations (#2439) · 605786f4

Przemyslaw Tredak authored Jan 21, 2026



* PoC of the changes
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Simpler solution and fixes
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Speculative moving the validate_name to the constructor
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in an observable way
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

605786f4

Fixed the year to 2026 (#2611) · 36f4e451
Oleg Goncharov authored Jan 21, 2026
```
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
```
36f4e451

[Common] Tuned NVFP4 cast kernel (#2412) · fbb16f4a

Oleg Goncharov authored Jan 21, 2026



* Implemented persistent nvfp4 kernel
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix FP4 guard in ptx
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixes per PR review. Added parameter to turn off the persistency
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Use the tuned kernel also for the rowwise only quantization
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Macros renaming
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fbb16f4a

20 Jan, 2026 2 commits

[Common] Enable determinism for cuDNN >= 9.18.1 on Blackwell (#2584) · 27fc168e

Charlene Yang authored Jan 20, 2026



* update FE to 1.17
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update tests/jax/test_fused_attn.py

fix typo
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix the AI fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix Jax extension call
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix version check in Jax test
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix non-/determinism logic and CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix formatting
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* reduce Jax CI tests for determinism
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

27fc168e

Changed VERSION to 2.13.0.dev0 · dfdd3820
Przemek Tredak authored Jan 20, 2026
```
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
```
dfdd3820

17 Jan, 2026 1 commit

Add logic for block-scaled tensors with GEMM swizzled scales (#2486) · 99df8810

Tim Moon authored Jan 16, 2026



* Add general C API for setting tensor params
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Implement general accessors for NVTETensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Refactor tex swizzling to skip if scales are already swizzled
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add checks for non-swizzled scales in MXFP8 and NVFP4 kernels
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support pre-swizzled scales in MXFP8Tensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tex function to swizzle MXFP8 scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in inplace swizzle function
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak comments to use "compact/swizzled format"
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* MXFP8 quantize kernel with pre-swizzled scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose pre-swizzled scales in modules
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in multi-swizzle
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support MXFP8 gated activations with swizzled scales
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add PyTorch infrastructure for pre-swizzled NVFP4 tensors
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Deprecate DSv3-specific quantization logic in C API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove support for DSv3 compact data from quantizer
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove DSv3 compact data format from core lib
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in FP8 all-gather
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX to use new swizzled scale API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestion from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestions from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update C++ swizzle test with swizzled scales API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Return default tensor params when querying params for invalid NVTETensor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug DSv3 FP8 test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug Userbuffers test failures
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure gated activations populate FP8 transpose if needed
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestions from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable pre-swizzling with debug quantizer
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix merge conflicts and review suggestions

Update copyright years. Tweak comments. Fix various complaints from @greptile-apps.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use explicitly sized types in config accessors

Miscellaneous review suggestions from @ptrendx.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Make util header for function that computes swizzled scale index
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Apply suggestions from @greptile-apps
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Update expected error message in FP8 block-scaling test
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @yaox12
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

99df8810

16 Jan, 2026 1 commit

[JAX] Custom partitioning for Permutation primitives (#2591) · a652730f

Teddy Do authored Jan 16, 2026



* initial impl, not tested
Signed-off-by: tdophung <tdophung@nvidia.com>

* consolidate different unpermute primitives with with_pad and with_merging_probs booleans. Implement partitioning for all permutation primitives
Signed-off-by: tdophung <tdophung@nvidia.com>

* Add distributed test for non-padding permutation
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix issues in distributed test for padding permutation. Make common kernel zero-initialize output permuted scales, permuted probs and output tokens
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* revert zeroing in triton common kernel as it is a race condition. Instead, add extra input (aliased with output) buffer to inner primitive of permutation on jax side to pass in zero-initialized buffers created with jnp zeros
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix utils to handle input output aliasing in autotuned kernels
Signed-off-by: tdophung <tdophung@nvidia.com>

* Clean up comments, and add more comments explaining input output alias in utils
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix lint and greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix issues that lint fixing introduced
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

a652730f

15 Jan, 2026 5 commits

fix: enable opt for cutlass sources to avoid infinite compile time (#2595) · 6a34b657
Jacket authored Jan 15, 2026
```
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
```
6a34b657

[JAX] Install Cmake in TE/JAX build Github Action (#2603) · 6cbdb042

jberchtold-nvidia authored Jan 15, 2026



* install cmake in jax build github action
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update build.yml
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

6cbdb042

[JAX] Disable fused attention in encoder tests for determinism (#2601) · 2236292a
jberchtold-nvidia authored Jan 15, 2026
```
disable fused attention in encoder tests for determinism
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
2236292a

docs: Update README Latest News section (#2583) · 4df43dbe

Santosh Bhavani authored Jan 14, 2026



* Move older news to Previous
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* Add Nov 2025 news entries
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

---------
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

4df43dbe

(Bug fix) Fix accuracy issue for blockwise scaling+E8 scale on Blackwell (#2589) · fcfa0c3c

Hongbin Liu authored Jan 15, 2026



* bug fix
Signed-off-by: hongbinl <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/swizzle/swizzle_block_scaling.cu

Mask to 8 bits to prevent potential bit overlap
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/swizzle/swizzle_block_scaling.cu
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>

* fix bug in 2d too
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

---------
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu  <lhb8125@users.noreply.github.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

fcfa0c3c
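The review note in #2589, "Mask to 8 bits to prevent potential bit overlap", describes a general hazard when several byte-sized scale values are packed into one wider word. The log does not show the kernel code, so the following is only a minimal sketch of that hazard in plain Python; `pack_scales` and `pack_scales_unmasked` are hypothetical names and the four-bytes-per-word layout is an assumption.

```python
def pack_scales(scales):
    """Pack byte-sized scale values into one word, least significant byte first.

    Hypothetical illustration: each value is masked to 8 bits before shifting,
    so a value wider than a byte cannot spill into the neighbouring lane.
    """
    word = 0
    for lane, value in enumerate(scales):
        word |= (value & 0xFF) << (8 * lane)
    return word


def pack_scales_unmasked(scales):
    """Same packing without the mask, shown only to expose the overlap."""
    word = 0
    for lane, value in enumerate(scales):
        word |= value << (8 * lane)
    return word


if __name__ == "__main__":
    scales = [0x00, 0x1FF, 0x00, 0x00]  # lane 1 carries a stray ninth bit
    print(hex(pack_scales(scales)))
    print(hex(pack_scales_unmasked(scales)))
```

Without the mask, the stray high bit of lane 1 lands in lane 2 and silently changes that lane's scale.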

14 Jan, 2026 1 commit

Revert adding pytorch-triton as a build requirement (#2592) · bd007993

Teddy Do authored Jan 14, 2026



* Remove pytorch-triton as a requirement and remove auto-fetching pytorch-triton as it is a placeholder in PyPI
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix docstring
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

bd007993

13 Jan, 2026 2 commits

ONNX: Fix FP8 quantization for the second MLP in LayerNormMLP (#2577) · 69636a08

Victor Oliveira authored Jan 13, 2026



ONNX: Fix FP8 quantization for the second MLP in LayerNormMLP
Signed-off-by: Victor Oliveira <victor.oliveira@getcruise.com>

69636a08

[PyTorch] Bunch of fixes for cpu offloading (#2535) · fe8fad59

Paweł Gadziński authored Jan 13, 2026



* code drop
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* test fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fe8fad59

10 Jan, 2026 1 commit

Debug doc generation (#2576) · 2f8ae81c

Tim Moon authored Jan 09, 2026



Debug Doxygen and LaTeX warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

2f8ae81c

09 Jan, 2026 2 commits

Update list of authorized CI users (#2581) · 32f403fd
Tim Moon authored Jan 09, 2026
```
Update list of CI users
Signed-off-by: Tim Moon <tmoon@nvidia.com>
```
32f403fd

[JAX] Refactor and trim TE JAX Attn testing (#2542) · 5f0e3b93

Kshitij Lakhani authored Jan 08, 2026



* Pick a leaner set of combinations for TE JAX CP attn tests such that only those cp,dp,tp combinations are picked where cp*dp*tp is equal to num gpus
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Consolidate the test cases run for different B,S,H,D and QKV layout
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Code and comments clean up
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Make FP16 + GQA test cross attn instead of self attn to generalize the test
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5f0e3b93
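The selection rule described in #2542 (keep only the cp, dp, tp combinations whose product equals the number of GPUs) can be sketched as follows. The helper name `mesh_configs` and the candidate sizes are assumptions for illustration; the actual test parametrization is not shown in the log.

```python
from itertools import product


def mesh_configs(num_gpus, sizes=(1, 2, 4, 8)):
    """Return every (cp, dp, tp) triple whose product uses exactly num_gpus devices.

    Hypothetical helper: `sizes` is an assumed list of candidate axis sizes.
    """
    return [
        (cp, dp, tp)
        for cp, dp, tp in product(sizes, repeat=3)
        if cp * dp * tp == num_gpus
    ]


if __name__ == "__main__":
    for config in mesh_configs(4):
        print(config)
```

Filtering this way drops configurations that would leave devices idle or need more devices than are present, which is what keeps the test matrix lean.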

08 Jan, 2026 1 commit

Solve pytorch-triton and triton package contention (#2540) · 5f828c25

Teddy Do authored Jan 07, 2026



* Add triton version detection logic, and NVTE_USE_PYTORCH_TRITON knob for jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* change build requirements and installation to reflect new option
Signed-off-by: tdophung <tdophung@nvidia.com>

* reduce boilerplate comments
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix typo
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* make env var more precise
Signed-off-by: tdophung <tdophung@nvidia.com>

* make env variables checking consistent
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5f828c25
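A rough idea of what "triton version detection" combined with an environment knob could look like is sketched below. Only the variable name `NVTE_USE_PYTORCH_TRITON` comes from the commit message; the accepted value (`"1"`), the preference order, and the helper names `installed_version` and `choose_triton` are assumptions, and the real logic in #2540 may differ.

```python
import os
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def choose_triton(env=None, lookup=installed_version):
    """Pick which Triton distribution to report (hypothetical policy).

    Assumption: NVTE_USE_PYTORCH_TRITON=1 means "prefer pytorch-triton";
    any other value means "prefer the standalone triton package".
    """
    env = os.environ if env is None else env
    prefer_pytorch = env.get("NVTE_USE_PYTORCH_TRITON", "0") == "1"
    order = ["pytorch-triton", "triton"] if prefer_pytorch else ["triton", "pytorch-triton"]
    for name in order:
        version = lookup(name)
        if version is not None:
            return name, version
    return None, None


if __name__ == "__main__":
    print(choose_triton())
```

Passing `env` and `lookup` explicitly keeps the policy testable without installing either package.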

07 Jan, 2026 3 commits

Fix 50% comparison mismatch in sort_chunks_by_index (Cont.) (#2575) · 08dc786c

Teddy Do authored Jan 07, 2026



* force initialization to int32
Signed-off-by: tdophung <tdophung@nvidia.com>

* address greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>

* del useless comments, add more restriction to int32
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>

08dc786c

[NVFP4][MOE] Bug Fix for NVFP4 Grouped Quant (#2564) · de51c96b

Zhongbo Zhu authored Jan 07, 2026



* fix
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve review comments
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Comment tweaks
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

de51c96b

Fix 50% comparison mismatch in sort_chunks_by_index (#2566) · 702fc5ee

Teddy Do authored Jan 06, 2026



* force initialization to int32
Signed-off-by: tdophung <tdophung@nvidia.com>

* address greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>

702fc5ee

06 Jan, 2026 3 commits

[JAX] Fix test_layer to support fused attention and adjust test encoder... · 404a3ee0

jberchtold-nvidia authored Jan 06, 2026


[JAX] Fix test_layer to support fused attention and adjust test encoder tolerance to account for minor diff (#2563)

Fix failing unit tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

404a3ee0

[Common] Fix long compile time in padding.cu on arch 75 (#2562) · df69100c

jberchtold-nvidia authored Jan 06, 2026



* Fix long compile time in padding.cu
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

df69100c

[docs] Getting started refactor (#2534) · a9767407

Paweł Gadziński authored Jan 06, 2026



* docs: Add comprehensive Getting Started guide with benchmarks

- Add new Getting Started documentation with PyTorch and JAX tutorials
- Include benchmark scripts demonstrating TE performance benefits
- Add CSS styling for code output and tabs
- Replace old quickstart notebooks with improved documentation
- Add transformer layer diagram (SVG)
- Update docs configuration and workflow
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* 2026 in copyright
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

a9767407

05 Jan, 2026 2 commits

Add tests that reset_parameters doesn't change parameter initial value ranges (#2550) · c90a9214

Peter St. John authored Jan 04, 2026



* Add tests for 2528 and 2529
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* Update tests/pytorch/test_deferred_init.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update tests/pytorch/test_deferred_init.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c90a9214

Fix out of bound ID passed to `cutlass::arch::NamedBarrier::sync` (#2554) · 4f364c8e
Kirthi Shankar Sivamani authored Jan 05, 2026
```
Fix barrier ID
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
4f364c8e

02 Jan, 2026 3 commits

[PyTorch] Fix garbage initialized permuted_scale (#2547) · c988548f
xiaoxi-wangfj authored Jan 03, 2026
```
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
```
c988548f

Document environment variables (#2552) · 27dc83bf

Kirthi Shankar Sivamani authored Jan 02, 2026



* Document envvars
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add remaining envvars
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* More missing ones
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update docs/envvars.rst
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update docs/envvars.rst
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

27dc83bf

Update copyright to include year 2026 (#2553) · 830ef60f
Kirthi Shankar Sivamani authored Jan 02, 2026
```
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
830ef60f

31 Dec, 2025 3 commits

[PyTorch] Support cudagraph recomputation (#2518) · 324be332

Robin Zhang authored Jan 01, 2026



* replace autograd.grad with autograd.backward
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* get/set graphable rng state
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* fix lint
Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

324be332

Fix overflow of padding/unpadding kernel (#2548) · 697b52cb

刘俊 authored Dec 31, 2025


Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

697b52cb

[JAX] Fix incorrect calculation of segment pos from segment ids in user-facing API (#2523) · 26c82db6

Kshitij Lakhani authored Dec 31, 2025



* Fix incorrect calculation of segment pos from segment ids for THD cases and load balanced cases in from_segment_ids_and_pos. Enforce passing of segment_pos for THD cases and load balanced cases
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Correct the assert condition
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify fused attn tests to pass new args to from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Calculate seg ids before pos
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* 1. Change the signature for from_segment_ids_and_pos()
2. Add support for THD in from_segment_ids_and_pos()
3. Assert if load balanced segment_ids is passed to generate a segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Pass keyword-only args by name
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix typo to use seg_ids instead of segment_ids
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix comments
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Modify the function call to differentiate between load balancing and actually reordered segment_ids and segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix the is_segment_ids_reordered to be set only when CP and load balancing
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix comments for from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Code clean up

for more information, see https://pre-commit.ci



Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

26c82db6
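
The problem this commit addresses can be illustrated with a minimal pure-Python sketch. It is not the TransformerEngine implementation: `segment_pos_from_ids` and `load_balance_reorder` are hypothetical helpers, and the chunk ordering used for load balancing is an assumption chosen only to show why positions cannot be recomputed from segment ids that have already been reordered.

```python
def segment_pos_from_ids(segment_ids):
    """Position of each token within its contiguous run of equal segment ids."""
    pos, prev, count = [], None, 0
    for sid in segment_ids:
        if sid != prev:
            count = 0  # a new run starts, so positions restart at 0
        pos.append(count)
        count += 1
        prev = sid
    return pos


def load_balance_reorder(values, cp_size):
    """Hypothetical reordering (an assumption, not TE's code): split the
    sequence into 2 * cp_size chunks and give rank r chunks r and
    2 * cp_size - 1 - r."""
    n_chunks = 2 * cp_size
    chunk = len(values) // n_chunks
    chunks = [values[i * chunk:(i + 1) * chunk] for i in range(n_chunks)]
    out = []
    for r in range(cp_size):
        out += chunks[r] + chunks[n_chunks - 1 - r]
    return out


segment_ids = [1, 1, 1, 1, 2, 2, 2, 2]
segment_pos = segment_pos_from_ids(segment_ids)

# Correct: reorder ids and positions together.
reordered_ids = load_balance_reorder(segment_ids, cp_size=2)
reordered_pos = load_balance_reorder(segment_pos, cp_size=2)

# Incorrect: derive positions from ids that were already reordered.
naive_pos = segment_pos_from_ids(reordered_ids)

print(reordered_pos)
print(naive_pos)
```

In this sketch the two position lists differ, which is why the caller has to supply `segment_pos` explicitly once the ids have been reordered.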

27 Dec, 2025 1 commit

[PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization (#1921) · 5ba01faa

xiaoxi-wangfj authored Dec 27, 2025



* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization

1. Fuse `moe_permute_with_probs` + `Fp8Padding` and `moe_unpermute` + `Fp8Unpadding`,
   which removes the explicit padding/unpadding of MoE experts, improving performance and reducing peak GPU memory usage.
2. Add tests for fused permute/pad and unpermute/unpad.
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch/Common] Fuse permute+pad and unpermute+unpad support with_merging_probs
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch] format code
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [Common] perf: expert_idx loaded once
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* fix: pad_offsets can be None
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* add padding + merging probs bwd support. Not tested
Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix garbage initialized act grad
Signed-off-by: tdophung <tdophung@nvidia.com>

* all test passing for jax permutation + pad
Signed-off-by: tdophung <tdophung@nvidia.com>

* change tokens_per_experts APIs to num_out_tokens with conservative allocation of worst case padding for output buffer
Signed-off-by: tdophung <tdophung@nvidia.com>

* change test permutation to reduce test time
Signed-off-by: tdophung <tdophung@nvidia.com>

* triggering PR refresh
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* Remove some test cases from the PyTorch side. Add a separate token_dispatch test as a sanity check in case combine accidentally undoes a dispatch error in the roundtrip test. Distinguish between L0 and L2 test cases in JAX
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* remove chance for inefficiency in moving between CPU and GPU, remove redundant primitive using a new static bool for padding, add assert for align size
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix lint in jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* account for JAX versions both newer and older than 0.8.2; adjust GPU Triton binding accordingly
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix typo
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>

5ba01faa
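
The "conservative allocation of worst case padding" mentioned in this commit can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the bound shown (each expert wastes at most `align_size - 1` padding rows) is a general arithmetic fact, not a confirmed description of the formula TransformerEngine uses.

```python
def padded_count(n, align_size):
    """Round n up to the next multiple of align_size."""
    return -(-n // align_size) * align_size


def worst_case_padded_tokens(num_out_tokens, num_experts, align_size):
    """Upper bound on the padded total when per-expert counts are unknown:
    each expert can add at most align_size - 1 padding rows."""
    return num_out_tokens + num_experts * (align_size - 1)


# Hypothetical per-expert token counts, used only to compare against the bound.
tokens_per_expert = [5, 0, 17, 10]
align_size = 16

exact = sum(padded_count(n, align_size) for n in tokens_per_expert)
bound = worst_case_padded_tokens(
    sum(tokens_per_expert), len(tokens_per_expert), align_size
)
print(exact, bound)
```

Sizing the output buffer from `num_out_tokens` alone avoids needing the per-expert counts on the host before allocation, at the cost of some unused rows.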

22 Dec, 2025 1 commit

Fix ptxas compilation on sm103 for triton kernels (#2539) · 97a09c29

Teddy Do authored Dec 22, 2025



* add Triton ptxas path for GB300 so that ptxas can be located, avoiding compilation errors
Signed-off-by: tdophung <tdophung@nvidia.com>

* add these flags in advance to prevent future breaks when ops are extended to multiple GPUs
Signed-off-by: tdophung <tdophung@nvidia.com>

* add this also to L1
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>

97a09c29