Commits · 26c82db6cd04a5d33c650404b837987f81de79d2 · OpenDAS / TransformerEngine

31 Dec, 2025 1 commit

[JAX] Fix incorrect calculation of segment pos from segment ids in user-facing API (#2523) · 26c82db6

Kshitij Lakhani authored Dec 31, 2025



* Fix incorrect calculation of segment pos from segment ids for thd cases and load balanced cases in from_segment_ids_and_pos. Enforce passing of segment_pos for THD cases and lod balanced cases
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Correct the assert condition
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify fused attn tests to pass new args to from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Calculate seg ids before pos
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* 1. Change the signature for from_segment_ids_and_pos()
2. Add support for THD in from_segment_ids_and_pos()
3. Assert if load balanced segment_ids is passed to generate a segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Pass keyword-only args by name
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix typo to use seg_ids instead of segment_ids
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* nit: Fix comments
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Modify the function call to differentiate between load balancing and actually reordered segment_ids and segment_pos
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix the is_segment_ids_reordered to be set only when CP and load balancing
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix comments for from_segment_ids_and_pos()
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Code clean up

for more information, see https://pre-commit.ci



Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

26c82db6

27 Dec, 2025 1 commit

[PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization (#1921) · 5ba01faa

xiaoxi-wangfj authored Dec 27, 2025



* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization

1.Fused `moe_permute_with_probs` + `Fp8Padding` and fused `moe_unpermute` + `Fp8Unpadding`,
  that can remove the explicit padding/unpadding of moe expert, improved performance and reduced peak gpu memory usage.
2.Add tests of fused permute/pad and unpermute/unpad.
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch/Common] Fuse permute+pad and unpermute+unpad support with_merging_probs
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [PyTorch]format code
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* [Common]perf expert_idx loaded once
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* fix: pad_offsets can be None
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

* add padding + merging probs bwd support. Not tested
Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix garbage initialized act grad
Signed-off-by: tdophung <tdophung@nvidia.com>

* all test passing for jax permutation + pad
Signed-off-by: tdophung <tdophung@nvidia.com>

* change tokens_per_experts APIs to num_out_tokens with conservative allocation of worst case padding for output buffer
Signed-off-by: tdophung <tdophung@nvidia.com>

* change test permutation to reduce test time
Signed-off-by: tdophung <tdophung@nvidia.com>

* triggering PR refresh
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* Remove some tests cases from pytorch side. Add a separate toekn_dispatch test for sanity in case combine accidentally undo an error on dispatch in the roundtrip test. Add distinction between L0 and L2 in test cases in jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* remove chance for inefficiency in moving between CPU and GPU, remove redundant primitive using a new static bool for padding, add assert for align size
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix lint in jax
Signed-off-by: tdophung <tdophung@nvidia.com>

* account for both jax newer and older than version 0.8.2. Adjusted gpu triton binding accordingly
Signed-off-by: tdophung <tdophung@nvidia.com>

* format code
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix typo
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>

5ba01faa

09 Dec, 2025 1 commit

Jax primitives for permutation on single GPU (#2473) · 46c6ef31

Teddy Do authored Dec 09, 2025



* branch off of initial permutation jax-triton PR
Signed-off-by: tdophung <tdophung@nvidia.com>

* Set 0 as the size of dummy tensors to reduce memory usage.
Signed-off-by: tdophung <tdophung@nvidia.com>

* Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation
Signed-off-by: tdophung <tdophung@nvidia.com>

* Implement primitives, wrapper, test for wrapper, edit trit
on binding to accomodate scalars
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Change implemementation of VJP functions to match correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test vjp funcs properly
Signed-off-by: tdophung <tdophung@nvidia.com>

* formatting
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix pylint
Signed-off-by: tdophung <tdophung@nvidia.com>

* fix test to compare to the correct reference impl. relax 1 tol for grad compare, fix lint the rightway
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule
Signed-off-by: tdophung <tdophung@nvidia.com>

* added forgotten file in prev commit
Signed-off-by: tdophung <tdophung@nvidia.com>

* format
Signed-off-by: tdophung <tdophung@nvidia.com>

* merge with_probs to without_probs
Signed-off-by: tdophung <tdophung@nvidia.com>

* add aserts and fix lint
Signed-off-by: tdophung <tdophung@nvidia.com>

---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

46c6ef31

06 Dec, 2025 1 commit

[JAX] Add CP + THD + AG + Striped>1 + SWA support (#2379) · fd0cd12e

Kshitij Lakhani authored Dec 05, 2025



* Add generic stripe_height support for load balancing
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix imports in test for deprecated jax.experimental.pjit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add test case for stripe_height greater than 1. Add stripe_height arg to reordering methods
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add Striped 1 and 4 test cases. Refactor the Load Balancing test case. Fix the incorrect shape in striping inverser reordering
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Modify test code for CP + AG + THD + stripe height greater than 1
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add stripe_height arg to fused attn and fused attn fwd API. Add appropriate mask checks for AG+THD+CP and pick BRCM to be executed per rank. Add Fused Attn Primitive for CP + THD +AG + Striping. Add a method to reorder and all gather segment ids and offsets for kv
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway testing commit
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add comments in primitive registration process
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway test commit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Undoing incorrect rebase/merge leftovers
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* TMP: Throwaway test commits
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Add support for calculating q and kv seqlens and offsets per rank for CP+THD+AG+SW+Striped>1 primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Augment jax primitive register code comments
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* Fix the array sizes and padding values returned for seqlens and offsets to fit what the fused attn primitive non cp computation
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support in new primitive for softmax_offset related changes. Put in missing primitive registering line in again. Increase the seqoffsets arrays lengths by 1
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Add new set of helper functions for seqlens and seqoffsets fo AG+THD+CP+Stripe>1 which accounts for batching and seq offsets size b+1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add backward primitive for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Modify tests for backward primitive for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Move stripe_height along with other static args in fused_attn_bwd rule. Fix typo in CP+AG+TH+Striped>1 primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Code clean up: remove older version for calculating seqlens and offsets for CP+AG+THD+striped>1 primitive
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add test for CP+THD+AG+Striped>1
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix missing var
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add SWA tests for AG+Striped>1+CP+THD+SWA
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Restoring test code
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove assert preventing SWA code path in CP+AG+Striped primitive
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Parametrize num_segments_per_seq in tests
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Clean up test code
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Clean up test code in TE common
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Clean up debug statements
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Rename stripe_height to stripe_size
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Code clean up and add additional comments
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

nit: Apply suggestions from code review
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

Fix type on fused attn tests
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Fix seqoffsets length to be passed onto FusedAttn primitive as it is b and not b+1 needed by cuDNN
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove commented code
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

Fix linting issues
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Fix incorrect greptile change
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Skip THD test cases for CP + AG + Dual chunk. Skip BSHD cases for CP + AG + Striped>1. Correct the layout and shapr parameters passed to the tests
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Pass stripe_size explicitly for ring attn tests for THD cases
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove TODO
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Explicitly fail if THD + AG is being used with a non padding causal mask
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* nit: Correct the ID for the test dist fused attn tests to account for cp*2 which is done under the hood
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Set num_segments_per_seq defaults to None instead of 0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Augment comments. Add ValueError for stripe_size=0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Test only 1 num_segments_per_seq combination for CP+AG+THD+Striped>1+SWA instead of 2. Modify the num segments and window size to easily to debug values
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Default stripe_size to None instead of 0. Modify stripe_size check for <=0 instead of ==0
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove incorrectly added file
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Explicitly pass zero sized arrays for seg ids and pos in the CP + AG + Striped primitive rather than using the seqlens or the offsets as placeholders
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix linting errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add a deep dive doc for CP+THD+AG+Stripe>1+SWA regarding design considerations and decisions
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Put docs and pngs into it's separate dir
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Replace png screenshots with markdown coe blocks for the attention patterns. Remove unecessary pngs
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add doc file to index.rst. Fix grammatical errors
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fd0cd12e

02 Dec, 2025 2 commits

[JAX] Make test_layer.py tolerances stricter (#2306) · cc42a577

jberchtold-nvidia authored Dec 02, 2025



* wip
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* wip
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* revert change to utils.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

cc42a577

[JAX] Triton binding (#2437) · f1512b21

Phuong Nguyen authored Dec 02, 2025



* init triton binding with test case/example

* added Triton as TE-JAX test dependency

* grid with blocksize from autotune
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

f1512b21

18 Nov, 2025 1 commit

[JAX] Add support for sink attention in JAX (#2225) · 15cefbc5

Paweł Gadziński authored Nov 18, 2025



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* removed packed versions
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* jax
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix:
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* sofmtax_fusion -> softmax_fusion_type
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

15cefbc5

14 Nov, 2025 1 commit

[JAX] Improve support and testing for direct recipe usage without autocast contexts (#2366) · a0754757

jberchtold-nvidia authored Nov 14, 2025



* Refactor to avoid storing a global quantization config so direct recipe passing works as intended
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_split_accumulator for current scaling recipe
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix tests that pass direct recipe and were missing quantize meta set
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Revert "fix use_split_accumulator for current scaling recipe"

This reverts commit a74ab7df812ec0a069b1bdd208debb93ec25a900.
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix ci failures
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix amax_history post_init
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update transformer_engine/jax/quantize/quantizer.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix ci failures
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix ci issue
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* address comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* make recipe assertion classes in test_recipe_characteristics not inherit from unittest.TestCase
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

a0754757

13 Nov, 2025 1 commit

[JAX] Support for checkpointing quantizations (#2356) · 67d63d02

jberchtold-nvidia authored Nov 13, 2025



* Support for checkpointing quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add jaxpr test for quant checkpoint name
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Revert "Support for checkpointing quantizations"

This reverts commit f7b784940369d0da2a77c57fa6ea744e883c5832.
Signed-off-by: JAX Toolbox <jax@nvidia.com>

* Checkpoint quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* revert other files
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* move checkpointing to VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix ci failure
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>

67d63d02

10 Nov, 2025 1 commit

[JAX] Fused layers argument default values changed (#2347) · 7a585983

Teddy Do authored Nov 10, 2025



* Changing default activations in MLP, TransformerLayer, dropout rate after FC1 to 0, and return_layernorm_output to False
Signed-off-by: tdophung <tdophung@nvidia.com>

* Fixing the failing tests by hard coding  arguments to the previous values instead of relying on newer default values
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

7a585983

07 Nov, 2025 1 commit

[JAX] Add test to check jaxpr that amax is reused for nvfp4 recipe (#2348) · 4ff3eed1

jberchtold-nvidia authored Nov 06, 2025



* Add test to check jaxpr that amax is reused for nvfp4 recipe
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move test to test_helper.py and rename file
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

4ff3eed1

03 Nov, 2025 1 commit

[JAX] L1_jax_distributed_test suit with individual executions (#2321) · c57ffc51

Phuong Nguyen authored Nov 03, 2025



* L1 rework
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* comment out test_multi_process_grouped_gemm for now
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm e5m2 from test norm + MXFP8
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

c57ffc51

30 Oct, 2025 1 commit

[JAX] Fix: Skip determinism tests for bprop for all sm >=100 (#2315) · 5e8a9a96

Kshitij Lakhani authored Oct 30, 2025



* Fix: Skip determinism tests for bprop for all sm >=100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add username to TODO
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100+
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5e8a9a96

23 Oct, 2025 1 commit

[JAX] Make SR rng state always 2D (num_devices, 4) to fix partitioning issue (#2294) · e2f2a0b4

jberchtold-nvidia authored Oct 23, 2025



* Make SR rng state always 2D (num_devices, 4)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax impl
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix test shape
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

e2f2a0b4

22 Oct, 2025 1 commit

[JAX] NVFP4 recipe with option to enable/disable SR, RHT, and 2D quantization (#2270) · 818b30cc

jberchtold-nvidia authored Oct 22, 2025



* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

818b30cc

17 Oct, 2025 1 commit

[JAX] Fix imports in test for deprecated jax.experimental.pjit (#2274) · 9dd61922

Kshitij Lakhani authored Oct 16, 2025



* Fix imports in test for deprecated jax.experimental.pjit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix: Pass NamedSharding instead of PartitionSpec to compare_ops() so that when the in and out sharding is used to create a jitted function, it has the mesh info
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

9dd61922

14 Oct, 2025 2 commits

Generalize quantization APIs for FP8/FP4/.. recipes (#2256) · 85a91997

Kirthi Shankar Sivamani authored Oct 14, 2025



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tets
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

85a91997

[JAX] Add BRCM support for THD (#2242) · ca6fedcf

Kshitij Lakhani authored Oct 14, 2025



* Add BRCM support when creating a test mask for fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for BRCM to correctly generate the mask needed for calculating the seqlens and offsets for THD
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip drop=0 and no_bias case for BRCM as cuDNN does not suport this
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip BRCM test cases where max_seqlen_q > max_seqlen_kv
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Refactor the segment id run length code for BRCM seqoffset and seqlens calculations
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the drop inequality skip condition in fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Adjust the BRCM id name in the test to make it consistent
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the brcm mask condition.
Fix the condition for cross atnn type pattern to only apply for brcm
Change the num segments per sequence to 3 instead of 2
Reduce one test pattern data size and make it such that it triggers brcm
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix incorrectly changed dtype to numpy bool_ rather than native python bool
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Restore the numsegments to earlier value
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add example for THD BRCM
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ca6fedcf

09 Oct, 2025 1 commit

[JAX] NVFP4 support in TE/JAX (#2254) · 8a7ab3dd

jberchtold-nvidia authored Oct 09, 2025


Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

8a7ab3dd

08 Oct, 2025 1 commit

[JAX] Async issuing D2H memcpy for grouped_gemm group_sizes array (#2213) · af2a0c16

Hua Huang authored Oct 08, 2025



* Try async copy of grouped GEMM group_sizes data
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

af2a0c16

06 Oct, 2025 1 commit

[JAX] Fix for GEMM + fuse bias + AllReduce (#2230) · 0db0f4d2

Phuong Nguyen authored Oct 06, 2025



* not fuse bias for output all reduction case + unit tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* norm to reduce dgamma along tpsp as well
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix test_distributed_layernorm byte counts
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* increase tols for jax_gemm
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

0db0f4d2

03 Oct, 2025 1 commit

[JAX] Clamped Swiglu Integration (#2194) · b840898b

vthumbe1503 authored Oct 03, 2025


Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
*Jax integration for clamped swiglu. This is the continuation of PR which added Clamped Swiglu(used in GPT OSS) support in TE along with Pytorch integration. This PR hooks up the clamped swiglu and dswiglu's nvte APIs to TE Jax.

b840898b

29 Sep, 2025 1 commit
- [JAX] Address tolerance check for current scaling dact dbias (#2211) · dfeef1a2
  jberchtold-nvidia authored Sep 29, 2025
```
Address tolerance check for current scaling dact
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  dfeef1a2
23 Sep, 2025 1 commit

[JAX] Local-Amax for Current-Scaling (#2183) · a92a0ad2

Ming-Xu Huang authored Sep 23, 2025



* Adding Amax Primitive and related args.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Enable local-amax for current-scaling and optionally run AR aross FSDP/TP/SP.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding doc for Amax Primitive.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the function name conflict.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Modification as feedback suggested.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix errors from lint.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong amax-scope in the bwd.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Added more description for amax-scope
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong attribute name.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Keep dim for AmaxCalcuation.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove keepDim and add shardy_rule
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix shardy_rule
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove extra-collective bytes from ref_coll_count due to local amax.
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Ming-Xu Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

a92a0ad2

15 Sep, 2025 1 commit

Lower precision gated-act to accelerate FP8 current-scaling. (#2153) · cd2034f3

Ming-Xu Huang authored Sep 15, 2025



* Applying the original precision as Norm outputs' and activation compuations.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding knob to control norm output precision.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Removing the knob and applying lower-precision norm with current-scaling only.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the error when quantizer==None
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>

cd2034f3

08 Sep, 2025 1 commit

Fixing few issues with multi-process launching. (#2155) · aa06107c

Ming-Xu Huang authored Sep 08, 2025



* Fixing few issues with multi-process launching.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

aa06107c

05 Sep, 2025 1 commit

[JAX] NoScaleTensor wrapper for non-quantized data (#2136) · c47f329b

jberchtold-nvidia authored Sep 05, 2025



* Custom call tests passing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Cast non-quantized kernels to input dtype in VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename HighPrecisionTensor to NoScaleTensor
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use NoScaleTensor in pure JAX impls where it was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

c47f329b

03 Sep, 2025 1 commit

[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) · f378eaf2

Kshitij Lakhani authored Sep 03, 2025



* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix the skip message
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Add check for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add support to get all devs in the process for jax
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Code clean up
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Represent attn bias using enum instead of string
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

f378eaf2

27 Aug, 2025 2 commits

[JAX] Decouple Recipe and ScalingMode (#1728) · c9508000

jberchtold-nvidia authored Aug 27, 2025



* Decouple recipe and scaling mode
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

c9508000

FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) · 62a57dd4

Ming-Xu Huang authored Aug 27, 2025



* FP8 AllGather in FP8 GroupedGEMM

1. Support current scaling FP8 quantation with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Slightly refactor
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding documents of new args.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding unit-tests.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding license.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move unit-tests to L1.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move quantizaer store/reset into FP8 only.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding all layout support for Blackwell+
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adopt the feedback from code-review.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed the wrong stream used by d2d in groupedGEMM FFI.
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

62a57dd4

26 Aug, 2025 1 commit

[JAX] Add `tpsp_resource` in the `MeshResource` map (#2113) · d770886f

Phuong Nguyen authored Aug 26, 2025



* clean up sharding
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added tpsp_resource
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* update tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rework test for MeshResource
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add mesh_resource into fp8_autocast in test_helper.py
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

d770886f

25 Aug, 2025 1 commit
- [JAX] Add Transformer Layer tests for pre_scale_bias and post_scale_bias (#2104) · 47ab4a74
  Kshitij Lakhani authored Aug 25, 2025
```
Add Transformer Layer tests for pre_scale_bias and post_scale_bias
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
```
  47ab4a74
20 Aug, 2025 1 commit

[JAX] Error checking for mesh resource and update GemmPrimitive to use... · bc99a88d

jberchtold-nvidia authored Aug 20, 2025


[JAX] Error checking for mesh resource and update GemmPrimitive to use global_mesh_resource().fsdp_resource (#2088)

* Enforce global MeshResource is set
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use global_mesh_resource().fsdp_resource in gemm primitive
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update gemm.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

bc99a88d

15 Aug, 2025 1 commit

[JAX] Trim dist fused attn tests in L1 (#2050) · 92f431bf

Kshitij Lakhani authored Aug 14, 2025



* Move some dist fused attn tests to L2
1. TestReorderCausalLoadBalancing: Run two (non symmetric) BSHD/SBHD data shape combination
2. TestDistributedSelfAttn: Run only one (smaller) BSHD type data shape combination
3. TestDistributedCrossAttn: Run only one (smaller) BSHD type data shape combination
4. TestDistributedContextParallelSelfAttn: Run all cp1 combinations
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Use pytest_parametrize_wrapper for splitting fused attn distributed JAX tests as L1 and L2
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Undo pytest -k split commands in qa scripts
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix usage of pytest_parametrize_wrapper in test_distributed_fused_attn
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove test code for L2 dist residing in L2 test.sh
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Add comments for code. Swap the test data shapes in REORDER_CAUSAL_LOAD_BALANCING_DATA_SHAPES
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add L0 to the data shape dictionaries in the distributed test
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

92f431bf

13 Aug, 2025 1 commit

[JAX] Add L2_jax_distributed_unittest (#2060) · ec65ba3c

jberchtold-nvidia authored Aug 12, 2025



* Add L2_jax_distributed_unittest
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add L1 entry for NORM_INPUT_SHAPES that was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

ec65ba3c

08 Aug, 2025 1 commit

[JAX] Remove cudaGraph compatible trait from GroupedGemmFFI and GroupedQuantizeFFI (#2048) · 9f9b4816

Phuong Nguyen authored Aug 08, 2025



* rm cudaGraph compatible trait from GroupedGEMM and groupedQuantize
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add grouped_gemm jitting in the unit test
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

9f9b4816

07 Aug, 2025 1 commit

[JAX] TE Gemm custom call clean up (#2030) · cae1c436

Phuong Nguyen authored Aug 07, 2025



* rm batch_dim, sequence_dim, sequence_parallel_output
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm lhs_quantized_colwise and rhs_quantized_colwise
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm unnecessary transpose_batch_sequence arg from some modules
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

cae1c436

06 Aug, 2025 1 commit

[JAX] Reduce L1 tests/jax/test_distributed_softmax.py test runtime (#2031) · 6d178b4e

jberchtold-nvidia authored Aug 06, 2025



* Pytest timings
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Reduce softmax test shape sizes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Switch softmax tests to use shardy by default
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

6d178b4e

24 Jul, 2025 1 commit

[JAX] Helper to disable TE custom calls + disable GemmPrimitive for non-MXFP8 recipes. (#1962) · 2a293456

Phuong Nguyen authored Jul 23, 2025



* add manage_primitives() helper

* disable GEMM primitives for non-MXFP8 recipes

* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE

* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples

* fix use_jax_gemm contextmanager
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

2a293456

23 Jul, 2025 1 commit
- [JAX] Fix current scaling test_helper.py and enable test_helper.py in L0 (#1990) · 992ba01d
  jberchtold-nvidia authored Jul 23, 2025
```
Fix current scaling test_helper.py and enable test_helper.py in L0
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
```
  992ba01d