Commits · 2f8ae81c3b78db38f5ace8735eedb66269159c91 · OpenDAS / TransformerEngine

10 Jan, 2026 1 commit

Tim Moon authored Jan 09, 2026



Debug Doxygen and LaTeX warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

2f8ae81c

08 Jan, 2026 1 commit

Solve pytorch-triton and triton package contention (#2540) · 5f828c25

Teddy Do authored Jan 07, 2026



* Add triton version detection logic, and NVTE_USE_PYTORCH_TRITON knob for jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* change build requirements and installation to reflect new option
Signed-off-by: tdophung <tdophung@nvidia.com>

* reduce boilerplate comments
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix typo
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* make env var more precise
Signed-off-by: tdophung <tdophung@nvidia.com>

* make env variable checking consistent
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5f828c25
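
The commit above mentions Triton version detection and an `NVTE_USE_PYTORCH_TRITON` knob. As a rough sketch only (the helper names, the accepted truthy strings, and the selection policy are assumptions for illustration, not the code from the PR), consistent env-var parsing plus package detection could look like this:

```python
import os
from importlib import metadata


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean environment variable the same way everywhere.

    The accepted truthy strings are an assumption of this sketch.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pick_triton_distribution() -> str:
    """Hypothetical policy: prefer pytorch-triton only when the knob is set."""
    if env_flag("NVTE_USE_PYTORCH_TRITON"):
        return "pytorch-triton"
    return "triton"
```

Routing every check through one helper such as `env_flag` is one way to keep env-var handling consistent across build and runtime code.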

07 Jan, 2026 1 commit

[NVFP4][MOE] Bug Fix for NVFP4 Grouped Quant (#2564) · de51c96b

Zhongbo Zhu authored Jan 07, 2026



* fix
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* resolve review comments
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Comment tweaks
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

de51c96b

06 Jan, 2026 1 commit

[Common] Fix long compile time in padding.cu on arch 75 (#2562) · df69100c

jberchtold-nvidia authored Jan 06, 2026



* Fix long compile time in padding.cu
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

df69100c

05 Jan, 2026 1 commit

Fix out of bound ID passed to `cutlass::arch::NamedBarrier::sync` (#2554) · 4f364c8e

Kirthi Shankar Sivamani authored Jan 05, 2026

Fix barrier ID
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

4f364c8e

02 Jan, 2026 2 commits

[PyTorch] Fix garbage initialized permuted_scale (#2547) · c988548f

xiaoxi-wangfj authored Jan 03, 2026

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>

c988548f

Update copyright to include year 2026 (#2553) · 830ef60f

Kirthi Shankar Sivamani authored Jan 02, 2026

Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

830ef60f

31 Dec, 2025 3 commits

[PyTorch] Support cudagraph recomputation (#2518) · 324be332

Robin Zhang authored Jan 01, 2026



* replace autograd.grad with autograd.backward
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* get/set graphable rng state
Signed-off-by: Robin Zhang <robinz@nvidia.com>

* fix lint
Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

324be332

Fix overflow of padding/unpadding kernel (#2548) · 697b52cb

刘俊 authored Dec 31, 2025


Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

697b52cb

[JAX] Fix incorrect calculation of segment pos from segment ids in user-facing API (#2523) · 26c82db6

Kshitij Lakhani authored Dec 31, 2025



* Fix incorrect calculation of segment pos from segment ids for THD cases and load balanced cases in from_segment_ids_and_pos. Enforce passing of segment_pos for THD cases and load balanced cases
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Correct the assert condition
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify fused attn tests to pass new args to from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Calculate seg ids before pos
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* 1. Change the signature for from_segment_ids_and_pos()
2. Add support for THD in from_segment_ids_and_pos()
3. Assert if load balanced segment_ids is passed to generate a segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Pass keyword-only args by name
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix typo to use seg_ids instead of segment_ids
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix comments
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Modify the function call to differentiate between load balancing and actually reordered segment_ids and segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix the is_segment_ids_reordered to be set only when CP and load balancing
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix comments for from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Code clean up




Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

26c82db6
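
To illustrate why segment positions cannot always be derived from segment ids, here is a minimal sketch. The function name and the run-length rule are assumptions made for this example, not the logic of `from_segment_ids_and_pos()`: positions restart at 0 whenever the segment id changes, which only holds while each segment is a contiguous run.

```python
def segment_pos_from_ids(segment_ids):
    """Position of each token inside its segment, assuming contiguous segments."""
    positions = []
    prev_id = None
    pos = 0
    for seg_id in segment_ids:
        # A change of id starts a new segment, so the position restarts at 0.
        pos = 0 if seg_id != prev_id else pos + 1
        positions.append(pos)
        prev_id = seg_id
    return positions


contiguous_ids = [1, 1, 1, 2, 2, 3]
true_pos = segment_pos_from_ids(contiguous_ids)
print(true_pos)  # [0, 1, 2, 0, 1, 0]

# Hypothetical load-balancing reorder: tokens of one segment are no longer adjacent.
order = [0, 3, 1, 4, 2, 5]
reordered_ids = [contiguous_ids[i] for i in order]
reordered_pos = [true_pos[i] for i in order]

# The run-length rule no longer recovers the true positions after reordering,
# so the caller has to supply segment_pos explicitly in that case.
assert segment_pos_from_ids(reordered_ids) != reordered_pos
```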

27 Dec, 2025 1 commit

[PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization (#1921) · 5ba01faa

xiaoxi-wangfj authored Dec 27, 2025



* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization

1. Fuse `moe_permute_with_probs` + `Fp8Padding` and `moe_unpermute` + `Fp8Unpadding`,
  which removes the explicit padding/unpadding of MoE experts, improves performance, and reduces peak GPU memory usage.
2. Add tests for fused permute/pad and unpermute/unpad.
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch/Common] Fuse permute+pad and unpermute+unpad support with_merging_probs
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch]format code
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [Common]perf expert_idx loaded once
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* fix: pad_offsets can be None
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* add padding + merging probs bwd support. Not tested
Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix garbage initialized act grad
Signed-off-by: tdophung <tdophung@nvidia.com>

* all test passing for jax permutation + pad
Signed-off-by: tdophung <tdophung@nvidia.com>

* change tokens_per_experts APIs to num_out_tokens with conservative allocation of worst case padding for output buffer
Signed-off-by: tdophung <tdophung@nvidia.com>

* change test permutation to reduce test time
Signed-off-by: tdophung <tdophung@nvidia.com>

* triggering PR refresh
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* Remove some test cases from the PyTorch side. Add a separate token_dispatch test for sanity, in case combine accidentally undoes an error in dispatch during the round-trip test. Add a distinction between L0 and L2 test cases in JAX
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* remove chance for inefficiency in moving between CPU and GPU, remove redundant primitive using a new static bool for padding, add assert for align size
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix lint in jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* account for both jax newer and older than version 0.8.2. Adjusted gpu triton binding accordingly
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix typo
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>

5ba01faa
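
The padding arithmetic behind a fused permute+pad can be sketched as follows. This is an illustration under assumptions: the helper names and the alignment value of 16 are made up for the example, and the real kernels operate on GPU tensors rather than Python lists. Each expert's token count is rounded up to a multiple of the alignment, and a conservative output buffer reserves the worst-case padding for every expert.

```python
def pad_to_multiple(n: int, align: int) -> int:
    """Round n up to the next multiple of align."""
    assert align > 0, "alignment must be positive"
    return -(-n // align) * align


def padded_offsets(tokens_per_expert, align):
    """Start offset of each expert's rows in the padded output, plus the total."""
    offsets = [0]
    for n in tokens_per_expert:
        offsets.append(offsets[-1] + pad_to_multiple(n, align))
    return offsets


def worst_case_padded_tokens(num_out_tokens: int, num_experts: int, align: int) -> int:
    """Upper bound on padded rows: each expert adds at most align - 1 padding rows."""
    return num_out_tokens + num_experts * (align - 1)


tokens_per_expert = [5, 16, 0, 17]
print(padded_offsets(tokens_per_expert, 16))                    # [0, 16, 32, 32, 64]
print(worst_case_padded_tokens(sum(tokens_per_expert), 4, 16))  # 98
```

Sizing the buffer from `num_out_tokens` alone avoids needing the per-expert counts on the host before allocation, at the cost of some unused rows.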

20 Dec, 2025 2 commits

[PyTorch][NVFP4][MOE] NVFP4 Grouped Quantize with Hadamard Transform (#2411) · eb8e792b

Zhongbo Zhu authored Dec 20, 2025



* rowwise colwise RHT group quant v1
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* remove local array RW
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* change wait_barrier
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fast math options
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* use mult to replace div
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* format
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* bulk move random states
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* greptile
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* lint
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* revert to use divides
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* avoid fp32 bf16 round-trip in RHT cast fusion
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* trigger fastmath by toggle NVTE_RHT_CAST_FUSION_USE_FAST_MATH
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* integrate row col rht fusion, functional
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* numerics aligned
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* style
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* remove device sync
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* 128 padding
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* revert colwise rng state creation because of row-col fused kernel
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix CI, linter
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* refactor RS for generating two random values
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Avoid invalid configs with templated kernel
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix acc pipeline init with 0 arrival count
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* restore rowwise-only mode
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* switch to dynamic atomic scheduler
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Avoid instantiating group RHT+cast kernel without row-wise or col-wise output
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Include fast math option in quantization config
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings and review nits
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Use TE license
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug where kernel is always launched on stream
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Restore BF16 intermediate downcast in fused RHT-cast kernels
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix numerical test of grouped kernel
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Make sure row-wise and col-wise quantization use different RNG seeds
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Restore autoformatter
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

eb8e792b

[JAX] Remove unused TE DPA module dtype which fixes cuDNN backend detection to... · 47902e96

jberchtold-nvidia authored Dec 19, 2025


[JAX] Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes (#2485)

* Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Warning fallback
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* adjust test tolerances slightly for encoder tests due to change in backend
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

47902e96

19 Dec, 2025 2 commits

[JAX] Handle meshes set with jax.set_mesh (#2532) · d46d5db4

jberchtold-nvidia authored Dec 19, 2025



* Handle meshes set with jax.set_mesh
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

d46d5db4

[PyTorch] Make sure Float8Tensor.contiguous supports autograd (#2533) · 6fd62098

Sudhakar Singh authored Dec 18, 2025



* add early return back (removed in 2427)
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Make sure Float8Tensor.contiguous supports autograd

Expand quantized tensor tests to check identity ops.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

6fd62098

18 Dec, 2025 1 commit

Fix meta device check failure when passing torch.device objects (#2519) · 14ddb430

LucienXian authored Dec 18, 2025



* Fix meta device check failure when passing torch.device objects
Signed-off-by: LucienXian <fl.xian@foxmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: LucienXian <fl.xian@foxmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

14ddb430

17 Dec, 2025 2 commits

[JAX] Add tutorial for integrating TE/JAX quantization into an existing framework (#2423) · 442513c5

jberchtold-nvidia authored Dec 17, 2025



* Tutorial for integrating TE/JAX quantization into an existing framework
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add todos
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* support nvfp4 sr rng key, move wrapper module into TE itself, fix bfloat16 cast
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update docstrings
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix QKV proj and out proj in Flax example transformer layer
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use fused attention in quickstart_jax example
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remat policy
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add tutorial to docs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update title
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* remove unused dtype from TE DPA module
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix notebook title
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add explanation of flax module wrapper
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

442513c5

Reset cache logic of weight workspace for NVFP4TensorStorage (#2524) · dbd0197e

Jinhang Choi authored Dec 16, 2025

reset weight ws cache for NVFP4TensorStorage
Signed-off-by: Jinhang Choi <jinhangc@nvidia.com>

dbd0197e

15 Dec, 2025 2 commits

Check calling convention for amax switch. (#2506) · b215116a

kwyss-nvidia authored Dec 15, 2025



* Check calling convention for amax switch.

Wgrad gemms with colwise x colwise require
rowwise data via general_gemm. Since dy
has both for dgrad and wgrad, the brittleness
has likely not affected results.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Clear rowwise data when applicable.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test with columnwise cases.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Check enum value rather than implicit cast.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

b215116a

fix ce loss calculation when some tokens are ignored (#2476) · 36f2dfd2

Yashaswi Karnati authored Dec 15, 2025



* fix ce loss with ignore idx
Signed-off-by: ykarnati <ykarnati@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: ykarnati <ykarnati@nvidia.com>

* remove fix comments
Signed-off-by: ykarnati <ykarnati@nvidia.com>

* fallback divisor to 1
Signed-off-by: ykarnati <ykarnati@nvidia.com>

* have arg for n_rows and n_non_ignore
Signed-off-by: ykarnati <ykarnati@nvidia.com>

* fuse n_non_ignore to softmax kernel
Signed-off-by: ykarnati <ykarnati@nvidia.com>

* fix incorrect arg
Signed-off-by: ykarnati <ykarnati@nvidia.com>

---------
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

36f2dfd2
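
The idea behind the cross-entropy fix can be sketched in plain Python. This is an illustration, not the fused kernel: the function name is hypothetical and `ignore_index=-100` is assumed here because it is the PyTorch default. The mean is taken over the non-ignored tokens only, and the divisor falls back to 1 when every token is ignored.

```python
import math


def mean_cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over rows whose target is not ignore_index."""
    total = 0.0
    n_non_ignore = 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored tokens contribute neither loss nor count
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
        n_non_ignore += 1
    # Fall back to a divisor of 1 so an all-ignored batch yields 0 instead of NaN.
    return total / max(n_non_ignore, 1)


logits = [[0.0, 0.0], [5.0, 1.0], [0.0, 0.0]]
targets = [0, -100, 1]
loss = mean_cross_entropy(logits, targets)
```

With these inputs the two counted rows each contribute `log(2)`, so the mean is `log(2)`. Dividing by the total row count (3) instead would understate the loss by a third.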

11 Dec, 2025 4 commits

[JAX] Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default (#2503) · 887a4fca

Kshitij Lakhani authored Dec 11, 2025



* Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add TODO
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Change the warning check in P2P helper to warn against using scan loop
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

887a4fca

[PyTorch] Convert sample tuple to list in cudagraph input reuse (#2426) · 50352325

Robin Zhang authored Dec 12, 2025



Convert sample tuple to list in reuse
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

50352325

[PyTorch] Update RNG global states in tracker set_states (#2501) · 811e0908

Robin Zhang authored Dec 12, 2025



set_all_rng_states in set_states
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

811e0908

Add separate RNG states for column-wise quantization with Stochastic Rounding (#2487) · a5694f26

Evgeny Tsykunov authored Dec 11, 2025



* Add separate RNG states for columnwise quantization with Stochastic Rounding
Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Fix single tensor path
Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------
Signed-off-by: Evgeny <etsykunov@nvidia.com>

a5694f26
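
A toy sketch of stochastic rounding with separate random streams for the row-wise and column-wise passes is shown below. The function name, the seeds, and the use of Python's `random.Random` are assumptions for illustration; the actual change concerns GPU RNG states. One plausible motivation is that independent streams keep the rounding errors of the two outputs from being correlated.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down or up, rounding up with probability equal to its fractional part."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


# Independent generators for the two quantization passes (seeds are arbitrary).
row_rng = random.Random(1234)
col_rng = random.Random(5678)

values = [0.25, 1.5, 2.75]
rowwise = [stochastic_round(v, row_rng) for v in values]
colwise = [stochastic_round(v, col_rng) for v in values]
```

Because the rounding is unbiased, the average of many rounded samples approaches the original value even though each individual result is an integer.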

10 Dec, 2025 2 commits

[PyTorch] Add THD support for max_logit/MuonClip (#2480) · 93c5c65b

Charlene Yang authored Dec 10, 2025



* update FE; initial pass at thd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* produce Stats+Max instead of Max+Sum_Exp
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "produce Stats+Max instead of Max+Sum_Exp"

This reverts commit c7d2b77b2da9ff3f68344097284187ac427eeb6a.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

93c5c65b

[JAX] Make softmax_type in FFI optional (#2491) · 5afbb0e1

jberchtold-nvidia authored Dec 09, 2025



* Make softmax_type in FFI optional
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add warn message
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5afbb0e1

09 Dec, 2025 4 commits

Jax primitives for permutation on single GPU (#2473) · 46c6ef31

Teddy Do authored Dec 09, 2025



* branch off of initial permutation jax-triton PR
Signed-off-by: tdophung <tdophung@nvidia.com>

* Set 0 as the size of dummy tensors to reduce memory usage.
Signed-off-by: tdophung <tdophung@nvidia.com>

* Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation
Signed-off-by: tdophung <tdophung@nvidia.com>

* Implement primitives, wrapper, test for wrapper, edit triton binding to accommodate scalars
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Change implementation of VJP functions to match correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test vjp funcs properly
Signed-off-by: tdophung <tdophung@nvidia.com>

* formatting
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix pylint
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix test to compare to the correct reference impl. relax 1 tol for grad compare, fix lint the right way
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule
Signed-off-by: tdophung <tdophung@nvidia.com>

* added forgotten file in prev commit
Signed-off-by: tdophung <tdophung@nvidia.com>

* format
Signed-off-by: tdophung <tdophung@nvidia.com>

* merge with_probs to without_probs
Signed-off-by: tdophung <tdophung@nvidia.com>

* add asserts and fix lint
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

46c6ef31

Fix the sm120 compilation with CUDA 12 (#2482) · dbaa02d0

Przemyslaw Tredak authored Dec 09, 2025

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

dbaa02d0

[PyTorch] Change order of args in another permutation triton kernel (#2488) · e05f87e1

Teddy Do authored Dec 09, 2025

change order
Signed-off-by: tdophung <tdophung@nvidia.com>

e05f87e1

Fix runtime lib loading logic (#2297) · 8ef3a33d

Kirthi Shankar Sivamani authored Dec 09, 2025



Fixes to runtime loading logic and add missing deps
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

8ef3a33d

08 Dec, 2025 1 commit

[Pytorch][Bug]MXFP8 Split tensor Bug fix (#2427) · c09411d8

vthumbe1503 authored Dec 09, 2025



* bug fixed, test added
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix contiguous
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert unnecessary change
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert another change
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* missed adding renamed file
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix minor issue
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix ci issue
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix the test for bfloat16
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

c09411d8

06 Dec, 2025 1 commit

[JAX] Add CP + THD + AG + Striped>1 + SWA support (#2379) · fd0cd12e

Kshitij Lakhani authored Dec 05, 2025



* Add generic stripe_height support for load balancing
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix imports in test for deprecated jax.experimental.pjit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add test case for stripe_height greater than 1. Add stripe_height arg to reordering methods
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add Striped 1 and 4 test cases. Refactor the Load Balancing test case. Fix the incorrect shape in striping inverse reordering
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Modify test code for CP + AG + THD + stripe height greater than 1
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add stripe_height arg to fused attn and fused attn fwd API. Add appropriate mask checks for AG+THD+CP and pick BRCM to be executed per rank. Add Fused Attn Primitive for CP + THD +AG + Striping. Add a method to reorder and all gather segment ids and offsets for kv
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway testing commit
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add comments in primitive registration process
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway test commit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Undoing incorrect rebase/merge leftovers
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway test commits
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add support for calculating q and kv seqlens and offsets per rank for CP+THD+AG+SW+Striped>1 primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Augment jax primitive register code comments
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Fix the array sizes and padding values returned for seqlens and offsets to fit what the fused attn primitive non cp computation
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support in new primitive for softmax_offset related changes. Put in missing primitive registering line in again. Increase the seqoffsets arrays lengths by 1
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Add new set of helper functions for seqlens and seqoffsets for AG+THD+CP+Stripe>1 which accounts for batching and seq offsets size b+1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add backward primitive for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify tests for backward primitive for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Move stripe_height along with other static args in fused_attn_bwd rule. Fix typo in CP+AG+THD+Striped>1 primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Code clean up: remove older version for calculating seqlens and offsets for CP+AG+THD+striped>1 primitive
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add test for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix missing var
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add SWA tests for AG+Striped>1+CP+THD+SWA
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Restoring test code
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove assert preventing SWA code path in CP+AG+Striped primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Parametrize num_segments_per_seq in tests
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Clean up test code
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Clean up test code in TE common
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Clean up debug statements
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Rename stripe_height to stripe_size
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Code clean up and add additional comments
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

nit: Apply suggestions from code review
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

Fix type on fused attn tests
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Fix seqoffsets length to be passed onto FusedAttn primitive as it is b and not b+1 needed by cuDNN
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove commented code
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

Fix linting issues
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Fix incorrect greptile change
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Skip THD test cases for CP + AG + Dual chunk. Skip BSHD cases for CP + AG + Striped>1. Correct the layout and shape parameters passed to the tests
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Pass stripe_size explicitly for ring attn tests for THD cases
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove TODO
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Explicitly fail if THD + AG is being used with a non padding causal mask
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* nit: Correct the ID for the test dist fused attn tests to account for cp*2 which is done under the hood
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Set num_segments_per_seq defaults to None instead of 0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Augment comments. Add ValueError for stripe_size=0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Test only 1 num_segments_per_seq combination for CP+AG+THD+Striped>1+SWA instead of 2. Modify the num segments and window size to easy-to-debug values
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Default stripe_size to None instead of 0. Modify stripe_size check for <=0 instead of ==0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove incorrectly added file
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Explicitly pass zero sized arrays for seg ids and pos in the CP + AG + Striped primitive rather than using the seqlens or the offsets as placeholders
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix linting errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add a deep dive doc for CP+THD+AG+Stripe>1+SWA regarding design considerations and decisions
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Put docs and pngs into their own separate dir
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Replace png screenshots with markdown code blocks for the attention patterns. Remove unnecessary pngs
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add doc file to index.rst. Fix grammatical errors
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fd0cd12e

05 Dec, 2025 1 commit

Fix bugs from refactoring C++ tensor class (#2481) · f0572aa5

Tim Moon authored Dec 04, 2025



Remove assumption in quantize/activation kernels that data buffer is initialized
Signed-off-by: Tim Moon <tmoon@nvidia.com>

f0572aa5

04 Dec, 2025 1 commit

[Core] Fix inconsistent logic in C++ tensor class (#2330) · 61822061

Tim Moon authored Dec 04, 2025



* Initialize empty tensors with shape=[0] instead of shape=[].
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix runtime crash in LayerNorm

Still seeing correctness issues.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure norm workspace sizes are not zero
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove assumption in swizzle kernel that data is available.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove assumption in multi-swizzle kernel that data is available.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove unnecessary explicit call to default constructor
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid accessing tensor data pointer if tensor has no entries
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Apply suggestions from code review
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/common/swizzle/swizzle.cu
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestions from @ptrendx and @greptile-apps
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Prefer using row-wise/col-wise shape based on which has data
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix merge conflict, expand docs, fix inconsistency in dim function
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Change Tensor::has_data to check whether tensor is initialized, not whether pointer is valid.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Review suggestion from @greptile-apps
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Debug incorrect tensor initialization in tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Clarify comments that has_data does not guarantee safe pointer accesses
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug test failure when computing amaxes
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

61822061

02 Dec, 2025 3 commits

Add primary weights fp8 support for mxfp8 (#2055) · d126cdd6

Kunlun Li authored Dec 03, 2025



* Add primary weights fp8 support for mxfp8
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Fix unit test and add better error log to unit test
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Move post all-gather processing out of for loop
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Add descriptions and ASCII diagrams for partial cast and partial amax functions
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Minor fix based on greptile bot
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix compilation errors due to arch-specific PTX instructions
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove unused noop flag from C API
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Expose test_partial_cast
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Skip mxfp8 partial cast test if mxfp8 is not available
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Fix pytest error
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* pylint ignore unused manual_post_all_gather_processing
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Fix error when using is_mxfp8_available
Signed-off-by: kunlunl <kunlunl@nvidia.com>

---------
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

d126cdd6

[Common] NVTEGroupedTensor class and helpers (#2388) · 14b53313

Phuong Nguyen authored Dec 02, 2025



* add grouped_tensor classes and helpers
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm non-contiguous option and dptrs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* address comments + rework CheckIn/OutputGroupedTensor
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix for compilation
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* make first_dims/last_dims optional + data.shape 2d
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added assertion
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* rs conflicts
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add data.shape info
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added logical shape field
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* compilation fix
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixed issues raised by greptile
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* return default dtype when grouped_tensor is empty
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* use has_data() for dim queries
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update comments
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix index bound
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Update transformer_engine/common/transformer_engine.cpp
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Update transformer_engine/common/transformer_engine.cpp
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* restore Tensor.has_data() + add experimental marks
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* restore Tensor::has_columnwise_data
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* cleanup
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

14b53313

[JAX] Triton binding (#2437) · f1512b21

Phuong Nguyen authored Dec 02, 2025



* init triton binding with test case/example

* added Triton as TE-JAX test dependency

* grid with blocksize from autotune
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

f1512b21

26 Nov, 2025 2 commits

Docs fix (#2301) · df39a7c2

Paweł Gadziński authored Nov 26, 2025



* init
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* line length
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* subtitle --- fix in many files:
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* cross entropy _input -> input rename
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* a lot of small fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* torch_version() change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add missing module and fix warnings
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* removed trailing whitespace
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/api/pytorch.rst
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* Fix import
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix more imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix NumPy docstring parameter spacing and indentation

- Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
- Fix inconsistent indentation in cpu_offload.py docstring
- Modified 51 Python files across transformer_engine/pytorch
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

df39a7c2

[PyTorch] Avoid initializing recipe state in fusible op base class constructor (#2421) · 9ca89e97

Tim Moon authored Nov 25, 2025



Do not initialize recipe state in base op class

Op attrs may not be set. Move recipe state initialization to linear op constructor.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

9ca89e97

25 Nov, 2025 1 commit

[PyTorch Debug] Debug support for GroupedLinear (#1953) · 9f61f8a5

Paweł Gadziński authored Nov 26, 2025



* main
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* add
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

9f61f8a5