- 07 Jan, 2026 1 commit
-
-
Teddy Do authored
* force initialization to int32
* address greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>
-
- 06 Jan, 2026 3 commits
-
-
jberchtold-nvidia authored
[JAX] Fix test_layer to support fused attention and adjust test encoder tolerance to account for minor diff (#2563)
Fix failing unit tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* Fix long compile time in padding.cu
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Paweł Gadziński authored
* docs: Add comprehensive Getting Started guide with benchmarks
  - Add new Getting Started documentation with PyTorch and JAX tutorials
  - Include benchmark scripts demonstrating TE performance benefits
  - Add CSS styling for code output and tabs
  - Replace old quickstart notebooks with improved documentation
  - Add transformer layer diagram (SVG)
  - Update docs configuration and workflow
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix
* fix
* 2026 in copyright
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 05 Jan, 2026 2 commits
-
-
Peter St. John authored
* Add tests for 2528 and 2529
* Update tests/pytorch/test_deferred_init.py
* Update tests/pytorch/test_deferred_init.py
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fix barrier ID
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 Jan, 2026 3 commits
-
-
xiaoxi-wangfj authored
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Document envvars
* Add remaining envvars
* More missing ones
* Update docs/envvars.rst
* Update docs/envvars.rst
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 Dec, 2025 3 commits
-
-
Robin Zhang authored
* replace autograd.grad with autograd.backward
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* get/set graphable rng state
* fix lint
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
刘俊 authored
Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kshitij Lakhani authored
* Fix incorrect calculation of segment pos from segment ids for THD cases and load-balanced cases in from_segment_ids_and_pos. Enforce passing of segment_pos for THD cases and load-balanced cases
* Correct the assert condition
* Modify fused attn tests to pass new args to from_segment_ids_and_pos()
* Calculate seg ids before pos
* 1. Change the signature for from_segment_ids_and_pos() 2. Add support for THD in from_segment_ids_and_pos() 3. Assert if load-balanced segment_ids is passed to generate a segment_pos
* Pass keyword-only args by name
* nit: Fix typo to use seg_ids instead of segment_ids
* nit: Fix comments
* Modify the function call to differentiate between load balancing and actually reordered segment_ids and segment_pos
* Fix is_segment_ids_reordered to be set only when CP and load balancing
* Fix comments for from_segment_ids_and_pos()
* Code clean up; fix lint errors
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
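The commit above is about deriving segment positions from segment ids in packed (THD-style) sequences. As a rough illustration of why the naive derivation only holds for contiguous segments, here is a minimal pure-Python sketch (a hypothetical helper, not TE's actual `from_segment_ids_and_pos`): position restarts at 0 whenever the segment id changes, which is exactly the assumption that breaks once load balancing reorders tokens, hence the enforced explicit `segment_pos` in that case.

```python
def segment_pos_from_ids(segment_ids):
    """Derive per-token positions from segment ids, assuming each
    segment occupies a contiguous run of tokens (no reordering)."""
    pos, prev, p = [], None, 0
    for sid in segment_ids:
        if sid != prev:   # new segment starts: reset the position counter
            p = 0
        pos.append(p)
        p += 1
        prev = sid
    return pos

# Two packed segments [1,1,1] and [2,2] followed by one padding token (id 0):
print(segment_pos_from_ids([1, 1, 1, 2, 2, 0]))  # [0, 1, 2, 0, 1, 0]
```

If the ids were reordered for context-parallel load balancing (e.g. `[1, 2, 1, 2, 1, 0]`), this derivation would produce wrong positions, which is why reordered inputs must carry their own `segment_pos`.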
-
- 27 Dec, 2025 1 commit
-
-
xiaoxi-wangfj authored
* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization
  1. Fused `moe_permute_with_probs` + `Fp8Padding` and fused `moe_unpermute` + `Fp8Unpadding`, which removes the explicit padding/unpadding of MoE experts, improving performance and reducing peak GPU memory usage.
  2. Add tests of fused permute/pad and unpermute/unpad.
* [PyTorch/Common] Fuse permute+pad and unpermute+unpad support with_merging_probs
* [PyTorch] Format code
* [Common] Perf: load expert_idx once
* fix: pad_offsets can be None
* Add padding + merging probs bwd support (not tested)
* Fix garbage-initialized act grad
* All tests passing for JAX permutation + pad
* Change tokens_per_experts APIs to num_out_tokens with conservative allocation of worst-case padding for the output buffer
* Change test permutation to reduce test time
* Trigger PR refresh
* Format code
* Remove some test cases from the PyTorch side. Add a separate token_dispatch test for sanity in case combine accidentally undoes an error on dispatch in the roundtrip test. Add distinction between L0 and L2 in test cases in JAX
* Remove chance for inefficiency in moving between CPU and GPU, remove redundant primitive using a new static bool for padding, add assert for align size
* Fix lint in JAX
* Account for JAX both newer and older than version 0.8.2; adjust GPU Triton binding accordingly
* Fix typo
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>
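One bullet above mentions "conservative allocation of worst-case padding for the output buffer". The arithmetic behind that idea can be sketched as follows (an assumption about the sizing logic, not TE's actual allocator): if each of `num_experts` token groups is independently padded up to a multiple of `align`, each group can grow by at most `align - 1` tokens, so a buffer of that worst-case size is always large enough regardless of how tokens are routed.

```python
def worst_case_padded_size(num_out_tokens, num_experts, align):
    """Upper bound on the padded token count when each expert's group
    is rounded up to a multiple of `align` (illustrative sketch)."""
    return num_out_tokens + num_experts * (align - 1)

# 1000 routed tokens, 8 experts, FP8-friendly alignment of 16:
print(worst_case_padded_size(1000, 8, 16))  # 1120
```

Allocating for the worst case up front avoids a device-to-host sync to read the exact per-expert counts before sizing the output.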
-
- 22 Dec, 2025 1 commit
-
-
Teddy Do authored
* Add Triton ptxas path for GB300 so it can be found, to avoid compilation errors
* Add these flags in advance to prevent future breaks when ops are extended to multi-GPU
* Add this also to L1
Signed-off-by: tdophung <tdophung@nvidia.com>
-
- 20 Dec, 2025 2 commits
-
-
Zhongbo Zhu authored
* rowwise colwise RHT group quant v1
* remove local array RW
* change wait_barrier
* fast math options
* use mult to replace div
* format
* bulk move random states
* greptile
* lint
* revert to use divides
* avoid fp32 bf16 round-trip in RHT cast fusion
* trigger fastmath by toggling NVTE_RHT_CAST_FUSION_USE_FAST_MATH
* integrate row col RHT fusion, functional
* numerics aligned
* style
* remove device sync
* 128 padding
* revert colwise rng state creation because of row-col fused kernel
* fix CI, linter
* refactor RS for generating two random values
* Avoid invalid configs with templated kernel
* fix acc pipeline init with 0 arrival count
* restore rowwise-only mode
* switch to dynamic atomic scheduler
* Avoid instantiating group RHT+cast kernel without row-wise or col-wise output
* Include fast math option in quantization config
* Fix linter warnings and review nits
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Use TE license
* Fix bug where kernel is always launched on stream
* Restore BF16 intermediate downcast in fused RHT-cast kernels
* fix numerical test of grouped kernel
* Make sure row-wise and col-wise quantization use different RNG seeds
* Restore autoformatter
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
jberchtold-nvidia authored
[JAX] Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes (#2485)
* Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Warning fallback
* Adjust test tolerances slightly for encoder tests due to change in backend
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 19 Dec, 2025 2 commits
-
-
jberchtold-nvidia authored
* Handle meshes set with jax.set_mesh
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Sudhakar Singh authored
* Add early return back (removed in 2427)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Make sure Float8Tensor.contiguous supports autograd. Expand quantized tensor tests to check identity ops.
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
-
- 18 Dec, 2025 2 commits
-
-
oliver könig authored
Signed-off-by: oliver könig <okoenig@nvidia.com>
-
LucienXian authored
* Fix meta device check failure when passing torch.device objects
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: LucienXian <fl.xian@foxmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
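The fix above is the classic pitfall of comparing a device argument as if it were always a string. A simplified stand-in sketch (hypothetical helper, not TE's code; `FakeDevice` stands in for `torch.device`, which exposes a `.type` attribute) shows a check that accepts both forms:

```python
from collections import namedtuple

# Stand-in for torch.device, which has a `.type` attribute ("meta", "cuda", ...).
FakeDevice = namedtuple("FakeDevice", ["type"])

def is_meta_device(device):
    """Accept either a string like "meta" / "meta:0" or a device object."""
    if isinstance(device, str):
        return device.split(":")[0] == "meta"
    return getattr(device, "type", None) == "meta"

print(is_meta_device("meta"))              # True
print(is_meta_device(FakeDevice("meta")))  # True
print(is_meta_device(FakeDevice("cuda")))  # False
```

A naive `device == "meta"` comparison would return False for a device object even when it is in fact the meta device, which matches the failure mode the commit describes.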
-
- 17 Dec, 2025 3 commits
-
-
jberchtold-nvidia authored
* Tutorial for integrating TE/JAX quantization into an existing framework
* Add TODOs
* Support NVFP4 SR RNG key, move wrapper module into TE itself, fix bfloat16 cast
* Update docstrings
* Fix QKV proj and out proj in Flax example transformer layer
* Use fused attention in quickstart_jax example
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remat policy
* Add tutorial to docs
* Update title
* Remove unused dtype from TE DPA module
* Fix notebook title
* Fix lint
* Add explanation of Flax module wrapper
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Przemyslaw Tredak authored
* Add ccache support to TE and use it in GitHub Actions
* Move to allowed action with sccache
* Properly handle sccache
* Fix typo
* Remove ccache from the custom Docker workflows where we can't run the action in the container
* JAX already uses the same CMake options to build the extension, so there is no need to set CXX too
* Remove the unnecessary env variables
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Jinhang Choi authored
Reset weight ws cache for NVFP4TensorStorage
Signed-off-by: Jinhang Choi <jinhangc@nvidia.com>
-
- 16 Dec, 2025 1 commit
-
-
vcherepanov-nv authored
* Use GEMM-AR fallback on newer cuBLASMp
* Remove test skip logic completely
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
-
- 15 Dec, 2025 3 commits
-
-
Paweł Gadziński authored
* Skip delayed wgrad tests in distributed numerics when debug mode is enabled
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
kwyss-nvidia authored
* Check calling convention for amax switch. Wgrad GEMMs with colwise x colwise require rowwise data via general_gemm. Since dy has both for dgrad and wgrad, the brittleness has likely not affected results.
* Clear rowwise data when applicable.
* Update test with columnwise cases.
* Check enum value rather than implicit cast.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
-
Yashaswi Karnati authored
* Fix CE loss with ignore idx
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remove fix comments
* Fallback divisor to 1
* Add args for n_rows and n_non_ignore
* Fuse n_non_ignore into softmax kernel
* Fix incorrect arg
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
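The "fallback divisor to 1" bullet above addresses a well-known edge case in mean-reduced cross entropy with an ignore index. A hedged pure-Python sketch of the pattern (illustrative only, not the fused Triton kernel): losses at ignored positions are dropped from both the sum and the count, and when every target is ignored the divisor falls back to 1 so the result is 0.0 rather than a 0/0 NaN.

```python
import math

def ce_mean(logprobs, targets, ignore_index=-100):
    """Mean cross entropy over non-ignored targets.
    `logprobs[i][t]` is the log-probability of class t for token i."""
    total, n_non_ignore = 0.0, 0
    for lp, t in zip(logprobs, targets):
        if t == ignore_index:
            continue  # ignored tokens contribute neither loss nor count
        total += -lp[t]
        n_non_ignore += 1
    return total / max(n_non_ignore, 1)  # fallback divisor of 1

lp = [[math.log(0.5), math.log(0.5)]] * 2
print(ce_mean(lp, [0, -100]))     # mean over the one valid token
print(ce_mean(lp, [-100, -100]))  # 0.0, not NaN
```

Without the fallback, an all-ignored batch divides by zero, which is exactly the failure the commit fixes.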
-
- 12 Dec, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add triton dep
* Fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
-
- 11 Dec, 2025 4 commits
-
-
Kshitij Lakhani authored
* Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default
* Add TODO
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Change the warning check in the P2P helper to warn against using the scan loop
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Robin Zhang authored
Convert sample tuple to list in reuse
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
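The tuple-to-list conversion above comes down to mutability: reusing a cached sample means overwriting its entries in place, which tuples forbid. A minimal illustration of the pattern (assumed behavior of the fix, not the TE code itself):

```python
sample = (1, 2, 3)  # a cached sample arriving as an immutable tuple

try:
    sample[0] = 10
except TypeError:
    pass  # tuples reject in-place assignment

sample = list(sample)  # convert once when reusing
sample[0] = 10         # in-place update now works
print(sample)  # [10, 2, 3]
```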
-
Robin Zhang authored
set_all_rng_states in set_states
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Evgeny Tsykunov authored
* Add separate RNG states for columnwise quantization with Stochastic Rounding
* Fix single tensor path
Signed-off-by: Evgeny <etsykunov@nvidia.com>
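For readers unfamiliar with the technique named above: stochastic rounding rounds a value up or down with probability proportional to its distance to the two neighboring grid points, so rounding is unbiased in expectation. A conceptual pure-Python sketch (not the CUDA kernel; grid step and seeds are made up), with independent RNG states so that row-wise and column-wise quantization draw different random numbers, as the commit adds:

```python
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, up with probability equal to the
    fractional distance past the lower grid point."""
    lo = (x // step) * step
    frac = (x - lo) / step
    return lo + step if rng.random() < frac else lo

# Separate RNG states: row-wise and col-wise passes must not reuse
# the same random sequence (the point of the fix above).
rng_rowwise = random.Random(0)
rng_colwise = random.Random(1)

x = 0.3
r = stochastic_round(x, 0.25, rng_rowwise)
c = stochastic_round(x, 0.25, rng_colwise)
assert r in (0.25, 0.5) and c in (0.25, 0.5)
```

Sharing one RNG state between the two quantization directions would correlate their rounding errors; distinct states keep the two quantized copies statistically independent.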
-
- 10 Dec, 2025 3 commits
-
-
Charlene Yang authored
* Update FE; initial pass at THD
* Produce Stats+Max instead of Max+Sum_Exp
* Revert "produce Stats+Max instead of Max+Sum_Exp" (this reverts commit c7d2b77b2da9ff3f68344097284187ac427eeb6a)
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Paweł Gadziński authored
* Code drop
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
jberchtold-nvidia authored
* Make softmax_type in FFI optional
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Add warn message
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Dec, 2025 4 commits
-
-
Teddy Do authored
* Branch off of initial permutation jax-triton PR
* Set 0 as the size of dummy tensors to reduce memory usage
* Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation
* Implement primitives, wrapper, test for wrapper; edit Triton binding to accommodate scalars
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Change implementation of VJP functions to match the correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test VJP funcs properly
* Formatting
* Fix pylint
* Fix test to compare to the correct reference impl; relax one tolerance for grad compare; fix lint the right way
* Fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule
* Add forgotten file from prev commit
* Format
* Merge with_probs into without_probs
* Add asserts and fix lint
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Teddy Do authored
Change order
Signed-off-by: tdophung <tdophung@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fixes to runtime loading logic and add missing deps
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Dec, 2025 1 commit
-
-
vthumbe1503 authored
* Bug fixed, test added
* Fix contiguous
* Revert unnecessary change
* Revert another change
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Address review comments
* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
* Address review comments
* Missed adding renamed file
* Fix minor issue
* Fix CI issue
* Fix the test for bfloat16
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-