Commits · 67d63d02f3efe1b8e0984788cc4e9ebf93bfd703 · OpenDAS / TransformerEngine

13 Nov, 2025 1 commit

[JAX] Support for checkpointing quantizations (#2356) · 67d63d02

jberchtold-nvidia authored Nov 13, 2025



* Support for checkpointing quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add jaxpr test for quant checkpoint name
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Revert "Support for checkpointing quantizations"

This reverts commit f7b784940369d0da2a77c57fa6ea744e883c5832.
Signed-off-by: JAX Toolbox <jax@nvidia.com>

* Checkpoint quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* revert other files
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* move checkpointing to VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix ci failure
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>

67d63d02

10 Nov, 2025 1 commit

[JAX] Fused layers argument default values changed (#2347) · 7a585983

Teddy Do authored Nov 10, 2025



* Changing default activations in MLP, TransformerLayer, dropout rate after FC1 to 0, and return_layernorm_output to False
Signed-off-by: tdophung <tdophung@nvidia.com>

* Fixing the failing tests by hard-coding arguments to the previous values instead of relying on newer default values
Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

7a585983

07 Nov, 2025 1 commit

[JAX] Add test to check jaxpr that amax is reused for nvfp4 recipe (#2348) · 4ff3eed1

jberchtold-nvidia authored Nov 06, 2025



* Add test to check jaxpr that amax is reused for nvfp4 recipe
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move test to test_helper.py and rename file
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

4ff3eed1

03 Nov, 2025 1 commit

[JAX] L1_jax_distributed_test suite with individual executions (#2321) · c57ffc51

Phuong Nguyen authored Nov 03, 2025



* L1 rework
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* comment out test_multi_process_grouped_gemm for now
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm e5m2 from test norm + MXFP8
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

c57ffc51

30 Oct, 2025 1 commit

[JAX] Fix: Skip determinism tests for bprop for all sm >=100 (#2315) · 5e8a9a96

Kshitij Lakhani authored Oct 30, 2025



* Fix: Skip determinism tests for bprop for all sm >=100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add username to TODO
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100+
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

5e8a9a96

23 Oct, 2025 1 commit

[JAX] Make SR rng state always 2D (num_devices, 4) to fix partitioning issue (#2294) · e2f2a0b4

jberchtold-nvidia authored Oct 23, 2025



* Make SR rng state always 2D (num_devices, 4)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax impl
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix test shape
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

e2f2a0b4

22 Oct, 2025 1 commit

[JAX] NVFP4 recipe with option to enable/disable SR, RHT, and 2D quantization (#2270) · 818b30cc

jberchtold-nvidia authored Oct 22, 2025



* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add test for SR state preservation across VJP boundaries
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix sharding of SR rng state
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* update tolerances slightly now that SR is enabled
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use hashlib for deterministic hashes across runs for SR
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename uses_rht on scaled tensors to has_applied_rht
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add assert
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move decision of whether to use RHT into helper.py and add dedicated RHT tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix use_rht attr usage
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix pure-jax rht usage criteria
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Adjust tolerances after rebase
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

818b30cc
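The "Use hashlib for deterministic hashes across runs for SR" step above can be illustrated with a small sketch. Python's built-in `hash()` on strings is salted per process, so it cannot seed stochastic rounding (SR) reproducibly, whereas a `hashlib` digest is stable. The helper names `seed_words_from_name` and `stochastic_round`, the tensor name, and the choice of four 32-bit words are assumptions made for illustration; this is not the TE implementation.

```python
import hashlib
import math
import random


def seed_words_from_name(name: str, num_words: int = 4) -> list:
    """Derive up to eight 32-bit seed words from a string, stably across runs.

    Hypothetical helper: a SHA-256 digest does not depend on PYTHONHASHSEED,
    so every process derives the same words from the same name.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()  # 32 bytes
    return [
        int.from_bytes(digest[4 * i : 4 * i + 4], "little")
        for i in range(num_words)
    ]


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x to a neighbouring integer, rounding up with probability frac(x)."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


# Illustrative tensor name; any stable string works.
words = seed_words_from_name("layer0.fc1.kernel")
rng = random.Random(words[0])
samples = [stochastic_round(0.3, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # close to 0.3: SR is unbiased on average
```

Each individual sample is 0 or 1, but the average stays near the unrounded value, which is the property that makes SR attractive for very low precision formats.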

17 Oct, 2025 1 commit

[JAX] Fix imports in test for deprecated jax.experimental.pjit (#2274) · 9dd61922

Kshitij Lakhani authored Oct 16, 2025



* Fix imports in test for deprecated jax.experimental.pjit
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix: Pass NamedSharding instead of PartitionSpec to compare_ops() so that when the in and out sharding is used to create a jitted function, it has the mesh info
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>

9dd61922

14 Oct, 2025 2 commits

Generalize quantization APIs for FP8/FP4/.. recipes (#2256) · 85a91997

Kirthi Shankar Sivamani authored Oct 14, 2025



* Initial API change
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change all imports and api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix typo
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix more tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix docs, tests, and make Jax change as well
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change internal uses of fp8_autocast
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Address nits
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rename file
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CG function, and small test fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change instances of make_graphed_callables internally
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix distributed tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Review
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix test and add more docs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cleanup test imports and minimize internal file imports
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Make is_bf16_available public
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Better docs and better api
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply suggestions from code review
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* fix nvfp4 test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

85a91997

[JAX] Add BRCM support for THD (#2242) · ca6fedcf

Kshitij Lakhani authored Oct 14, 2025



* Add BRCM support when creating a test mask for fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for BRCM to correctly generate the mask needed for calculating the seqlens and offsets for THD
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip drop=0 and no_bias case for BRCM as cuDNN does not support this
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Skip BRCM test cases where max_seqlen_q > max_seqlen_kv
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Refactor the segment id run length code for BRCM seqoffset and seqlens calculations
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the drop inequality skip condition in fused attn
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Adjust the BRCM id name in the test to make it consistent
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix the brcm mask condition.
Fix the condition for cross attn type pattern to only apply for brcm
Change the num segments per sequence to 3 instead of 2
Reduce one test pattern data size and make it such that it triggers brcm
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix lint errors
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Fix incorrectly changed dtype to numpy bool_ rather than native python bool
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Restore the numsegments to earlier value
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add example for THD BRCM
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

ca6fedcf
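A minimal sketch of the two ideas this commit touches, assuming BRCM denotes a bottom-right-aligned causal mask and that segment id 0 marks padding in a packed (THD) sequence. The function names are hypothetical and the code is pure Python for illustration; it is not the test helper from the PR.

```python
def segment_run_lengths(segment_ids: list) -> tuple:
    """Return (seqlens, offsets) for runs of equal non-zero segment ids.

    Assumes id 0 marks padding, an illustrative convention.
    """
    seqlens = []
    offsets = []
    prev = 0
    for pos, seg in enumerate(segment_ids):
        if seg != 0 and seg != prev:
            # A new segment starts at this position.
            offsets.append(pos)
            seqlens.append(1)
        elif seg != 0:
            # Same segment as the previous token: extend the current run.
            seqlens[-1] += 1
        prev = seg
    return seqlens, offsets


def bottom_right_causal_mask(seqlen_q: int, seqlen_kv: int) -> list:
    """True where query i may attend to key j, with the diagonal ending
    in the bottom-right corner of the (seqlen_q, seqlen_kv) matrix."""
    shift = seqlen_kv - seqlen_q
    return [[j <= i + shift for j in range(seqlen_kv)] for i in range(seqlen_q)]


seqlens, offsets = segment_run_lengths([1, 1, 1, 2, 2, 0, 0])
mask = bottom_right_causal_mask(2, 4)
```

With `seqlen_q > seqlen_kv` the shift becomes negative and the first query rows see no keys at all, which may be one reason such cases are skipped in the tests.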

09 Oct, 2025 1 commit

[JAX] NVFP4 support in TE/JAX (#2254) · 8a7ab3dd

jberchtold-nvidia authored Oct 09, 2025


Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

8a7ab3dd

08 Oct, 2025 1 commit

[JAX] Async issuing D2H memcpy for grouped_gemm group_sizes array (#2213) · af2a0c16

Hua Huang authored Oct 08, 2025



* Try async copy of grouped GEMM group_sizes data
Signed-off-by: Hua Huang <huah@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

af2a0c16

06 Oct, 2025 1 commit

[JAX] Fix for GEMM + fuse bias + AllReduce (#2230) · 0db0f4d2

Phuong Nguyen authored Oct 06, 2025



* not fuse bias for output all reduction case + unit tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* norm to reduce dgamma along tpsp as well
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix test_distributed_layernorm byte counts
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* increase tols for jax_gemm
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

0db0f4d2

03 Oct, 2025 1 commit

[JAX] Clamped Swiglu Integration (#2194) · b840898b

vthumbe1503 authored Oct 03, 2025


Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* JAX integration for clamped swiglu. This continues the PR that added clamped swiglu (used in GPT OSS) support to TE along with the PyTorch integration. This PR hooks up the nvte APIs for clamped swiglu and dswiglu to TE JAX.

b840898b
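For readers unfamiliar with the activation, here is a scalar sketch of a clamped SwiGLU. The clamping pattern and the default `limit` and `alpha` values follow the publicly available GPT-OSS reference model; they are assumptions and are not taken from this commit, and `clamped_swiglu` is a hypothetical name rather than the TE API.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clamped_swiglu(gate: float, linear: float,
                   limit: float = 7.0, alpha: float = 1.702) -> float:
    """Scalar clamped SwiGLU (assumed formulation, see the note above)."""
    gate = min(gate, limit)                    # gate branch: clamped from above only
    linear = max(-limit, min(linear, limit))   # linear branch: clamped on both sides
    return gate * _sigmoid(alpha * gate) * (linear + 1.0)


saturated = clamped_swiglu(100.0, 100.0)  # same as clamped_swiglu(7.0, 7.0)
```

The clamp bounds the output for outlier pre-activations, which is the property that distinguishes this variant from a plain SwiGLU.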

29 Sep, 2025 1 commit

[JAX] Address tolerance check for current scaling dact dbias (#2211) · dfeef1a2

jberchtold-nvidia authored Sep 29, 2025

* Address tolerance check for current scaling dact
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

dfeef1a2

23 Sep, 2025 1 commit

[JAX] Local-Amax for Current-Scaling (#2183) · a92a0ad2

Ming-Xu Huang authored Sep 23, 2025



* Adding Amax Primitive and related args.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Enable local-amax for current-scaling and optionally run AR across FSDP/TP/SP.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding doc for Amax Primitive.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the function name conflict.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Modification as feedback suggested.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix errors from lint.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong amax-scope in the bwd.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Added more description for amax-scope
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong attribute name.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Keep dim for AmaxCalcuation.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove keepDim and add shardy_rule
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix shardy_rule
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove extra-collective bytes from ref_coll_count due to local amax.
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Ming-Xu Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

a92a0ad2
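The difference between a local amax and an all-reduced amax can be sketched in a few lines. This is an illustration under stated assumptions: shards are plain Python lists, `max` over the per-shard values stands in for an all-reduce(max) collective, and `current_scaling_scales` is a hypothetical name. Only the FP8 E4M3 maximum of 448 is a property of the format itself.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def amax(values: list) -> float:
    """Largest absolute value in one shard."""
    return max(abs(v) for v in values)


def current_scaling_scales(shards: list, all_reduce: bool) -> list:
    """Return one quantization scale per shard.

    all_reduce=False: each shard scales by its own (local) amax.
    all_reduce=True: every shard uses the max over all shards, which
    stands in for an all-reduce(max) across devices.
    """
    per_shard = [amax(s) for s in shards]
    if all_reduce:
        per_shard = [max(per_shard)] * len(shards)
    # Guard against an all-zero shard to avoid dividing by zero.
    return [FP8_E4M3_MAX / a if a > 0 else 1.0 for a in per_shard]


shards = [[0.5, -2.0], [7.0, 1.0]]
local_scales = current_scaling_scales(shards, all_reduce=False)   # [224.0, 64.0]
global_scales = current_scaling_scales(shards, all_reduce=True)   # [64.0, 64.0]
```

Local scales use each shard's full dynamic range and need no communication, while the all-reduced variant gives every shard the same scale at the cost of a collective.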

15 Sep, 2025 1 commit

Lower precision gated-act to accelerate FP8 current-scaling. (#2153) · cd2034f3

Ming-Xu Huang authored Sep 15, 2025



* Applying the original precision to Norm outputs and activation computations.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding knob to control norm output precision.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Removing the knob and applying lower-precision norm with current-scaling only.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the error when quantizer==None
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>

cd2034f3

08 Sep, 2025 1 commit

Fixing few issues with multi-process launching. (#2155) · aa06107c

Ming-Xu Huang authored Sep 08, 2025



* Fixing few issues with multi-process launching.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

aa06107c

05 Sep, 2025 1 commit

[JAX] NoScaleTensor wrapper for non-quantized data (#2136) · c47f329b

jberchtold-nvidia authored Sep 05, 2025



* Custom call tests passing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Cast non-quantized kernels to input dtype in VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename HighPrecisionTensor to NoScaleTensor
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use NoScaleTensor in pure JAX impls where it was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

c47f329b

03 Sep, 2025 1 commit

[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) · f378eaf2

Kshitij Lakhani authored Sep 03, 2025



* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix the skip message
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Add check for sm100
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add support to get all devs in the process for jax
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Code clean up
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Represent attn bias using enum instead of string
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

f378eaf2

27 Aug, 2025 2 commits

[JAX] Decouple Recipe and ScalingMode (#1728) · c9508000

jberchtold-nvidia authored Aug 27, 2025



* Decouple recipe and scaling mode
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

c9508000

FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) · 62a57dd4

Ming-Xu Huang authored Aug 27, 2025



* FP8 AllGather in FP8 GroupedGEMM

1. Support current scaling FP8 quantization with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Slightly refactor
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding documents of new args.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding unit-tests.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding license.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move unit-tests to L1.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move quantizer store/reset into FP8 only.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding all layout support for Blackwell+
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adopt the feedback from code-review.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed the wrong stream used by d2d in groupedGEMM FFI.
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

62a57dd4

26 Aug, 2025 1 commit

[JAX] Add `tpsp_resource` in the `MeshResource` map (#2113) · d770886f

Phuong Nguyen authored Aug 26, 2025



* clean up sharding
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added tpsp_resource
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* update tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rework test for MeshResource
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add mesh_resource into fp8_autocast in test_helper.py
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

d770886f

25 Aug, 2025 1 commit

[JAX] Add Transformer Layer tests for pre_scale_bias and post_scale_bias (#2104) · 47ab4a74

Kshitij Lakhani authored Aug 25, 2025

* Add Transformer Layer tests for pre_scale_bias and post_scale_bias
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

47ab4a74

20 Aug, 2025 1 commit

[JAX] Error checking for mesh resource and update GemmPrimitive to use... · bc99a88d

jberchtold-nvidia authored Aug 20, 2025


[JAX] Error checking for mesh resource and update GemmPrimitive to use global_mesh_resource().fsdp_resource (#2088)

* Enforce global MeshResource is set
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use global_mesh_resource().fsdp_resource in gemm primitive
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update gemm.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

bc99a88d

15 Aug, 2025 1 commit

[JAX] Trim dist fused attn tests in L1 (#2050) · 92f431bf

Kshitij Lakhani authored Aug 14, 2025



* Move some dist fused attn tests to L2
1. TestReorderCausalLoadBalancing: Run two (non symmetric) BSHD/SBHD data shape combination
2. TestDistributedSelfAttn: Run only one (smaller) BSHD type data shape combination
3. TestDistributedCrossAttn: Run only one (smaller) BSHD type data shape combination
4. TestDistributedContextParallelSelfAttn: Run all cp1 combinations
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Use pytest_parametrize_wrapper for splitting fused attn distributed JAX tests as L1 and L2
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Undo pytest -k split commands in qa scripts
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix usage of pytest_parametrize_wrapper in test_distributed_fused_attn
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Remove test code for L2 dist residing in L2 test.sh
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

* Add comments for code. Swap the test data shapes in REORDER_CAUSAL_LOAD_BALANCING_DATA_SHAPES
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add L0 to the data shape dictionaries in the distributed test
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kshitij  Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com>

92f431bf

13 Aug, 2025 1 commit

[JAX] Add L2_jax_distributed_unittest (#2060) · ec65ba3c

jberchtold-nvidia authored Aug 12, 2025



* Add L2_jax_distributed_unittest
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add L1 entry for NORM_INPUT_SHAPES that was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

ec65ba3c

08 Aug, 2025 1 commit

[JAX] Remove cudaGraph compatible trait from GroupedGemmFFI and GroupedQuantizeFFI (#2048) · 9f9b4816

Phuong Nguyen authored Aug 08, 2025



* rm cudaGraph compatible trait from GroupedGEMM and groupedQuantize
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add grouped_gemm jitting in the unit test
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

9f9b4816

07 Aug, 2025 1 commit

[JAX] TE Gemm custom call clean up (#2030) · cae1c436

Phuong Nguyen authored Aug 07, 2025



* rm batch_dim, sequence_dim, sequence_parallel_output
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm lhs_quantized_colwise and rhs_quantized_colwise
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm unnecessary transpose_batch_sequence arg from some modules
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

cae1c436

06 Aug, 2025 1 commit

[JAX] Reduce L1 tests/jax/test_distributed_softmax.py test runtime (#2031) · 6d178b4e

jberchtold-nvidia authored Aug 06, 2025



* Pytest timings
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Reduce softmax test shape sizes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Switch softmax tests to use shardy by default
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

6d178b4e

24 Jul, 2025 1 commit

[JAX] Helper to disable TE custom calls + disable GemmPrimitive for non-MXFP8 recipes. (#1962) · 2a293456

Phuong Nguyen authored Jul 23, 2025



* add manage_primitives() helper

* disable GEMM primitives for non-MXFP8 recipes

* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE

* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples

* fix use_jax_gemm contextmanager
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

2a293456

23 Jul, 2025 1 commit

[JAX] Fix current scaling test_helper.py and enable test_helper.py in L0 (#1990) · 992ba01d

jberchtold-nvidia authored Jul 23, 2025

* Fix current scaling test_helper.py and enable test_helper.py in L0
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

992ba01d

22 Jul, 2025 1 commit

[Common] Improved performance of mxfp8 cast kernels (#1628) · cb504cda

Oleg Goncharov authored Jul 22, 2025



* Fixed conflicts
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Minor code refactoring to avoid unnecessary checks
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed typo
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed dBias accumulation error due to initialization. Minor code refactoring
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Test case to reproduce the init error
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed rowwise dbias error
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Changed ptx API
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Added a struct for two packed FP8 values
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Rolled back to scalar code for columnwise scaling due to its better performance
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor corrections
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Rebased on main
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixes per code review
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Removed constexpr in C++ test suite to build faster
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Computed activations are now numerically truncated to InputType before scaling. Improved test suite.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Minor refactoring
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor refactoring
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Modified mismatches checks of MXFP8 to address FP8 numerics
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Implemented Jeremy's fixes to JAX test suite with an intermediate downcast
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Reduced the dims of the test tensors to improve CI runtime
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed memory alignment issue. Compute dbias without downcast.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

cb504cda

21 Jul, 2025 1 commit

[Common] Skip cuDNN 9.10.0/9.10.1 due to bugs (#1937) · 0d802283

Charlene Yang authored Jul 21, 2025



* exclude 9.10.0/.1 for certain configs
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix kv_channels
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add get_backend to tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add init files
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix numerics and cuda graph tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove prints
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor changes after renaming
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix import structure and rename get_attention_backends
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix docs and benchmarks
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix get backend calls
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix get backend calls"

This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix docs and benchmarks"

This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix docs, benchmarks and pre-commit ci
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dpa/mha flash attn selection
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix rng states
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix ModelConfig
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix backend selection on Ampere
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix issues from last merge
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/utils.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove initialization of rng_states to None
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* redefine ModelConfig
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix typo
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix ModelConfig
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix seed for CP tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/test_sanity.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* move fixture from utils to individual tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CI
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

0d802283

19 Jul, 2025 1 commit

[JAX] Update tolerance of distributed layernorm MLP for FP8 (#1971) · ca7407e3

jberchtold-nvidia authored Jul 18, 2025



Update tolerance of distributed layernorm MLP for FP8
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

ca7407e3

18 Jul, 2025 1 commit

[JAX] Set `precision=HIGHEST` for the ref_grouped_gemm impl in the unit test (#1967) · 2d4644b7

Phuong Nguyen authored Jul 18, 2025



* set precision=HIGHEST for the ref_grouped_gemm impl in the unit test
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

2d4644b7

17 Jul, 2025 1 commit

[JAX] Remove unnecessary MXFP8 scale_inv padding (#1954) · 5350f277

Phuong Nguyen authored Jul 17, 2025



* remove unnecessary padding
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* adapt the test_distributed_layernorm byte count
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

5350f277

15 Jul, 2025 1 commit

[JAX] Resolve test conflict in JAX helper tests (#1916) · e7251f93

Emmanuel Ferdman authored Jul 16, 2025



* [JAX] Resolve test conflict in JAX helper tests
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* [JAX] Resolve test conflict in JAX helper tests
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

e7251f93

14 Jul, 2025 1 commit

[JAX] GEMM custom op (#1855) · 214e2a4a

Alp Dener authored Jul 14, 2025



* added XLA FFI custom op for TE/common nvte_cublas_gemm
Signed-off-by: Alp Dener <adener@nvidia.com>

started GemmPrimitive, abstract done
Signed-off-by: Alp Dener <adener@nvidia.com>

gemm custom op working with BF16, needs testing for FP8/MXFP8
Signed-off-by: Alp Dener <adener@nvidia.com>

converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call
Signed-off-by: Alp Dener <adener@nvidia.com>

BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue
Signed-off-by: Alp Dener <adener@nvidia.com>

fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
Signed-off-by: Alp Dener <adener@nvidia.com>

new GemmPrimitive passing all Dense tests
Signed-off-by: Alp Dener <adener@nvidia.com>

import cleanup and reverted code chunk movement
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused .transpose() implementations from ScaledTensors
Signed-off-by: Alp Dener <adener@nvidia.com>

all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl
Signed-off-by: Alp Dener <adener@nvidia.com>

removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused changes to ScaledTensor classes and debug prints
Signed-off-by: Alp Dener <adener@nvidia.com>

* minor unit test cleanup
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* FP8 tests passing on Blackwell but MXFP8 outputs NaN
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* MXFP8 issue traced to scale factor padding with NaNs instead of zeros
Signed-off-by: Alp Dener <adener@nvidia.com>

* padding scale with 2^-127 instead of nans
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix bug on rhs_scale_inv usage
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* cleanup E8M0 type converter use it in gemm.cpp
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* segfault fixed, passing all unittests on Blackwell
Signed-off-by: Alp Dener <adener@nvidia.com>

* fix for fuseddense tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix workspace alignment
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul
Signed-off-by: Alp Dener <adener@nvidia.com>

all unit tests passing on H100x8 node
Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



linting fixes
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed batch dimension numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed FP8 scale sharding rule when there are no FP8 scales
Signed-off-by: Alp Dener <adener@nvidia.com>

added error message for unsupported Shardy partitioner
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed test tolerances for FP8 cases
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed shardy test skip cases
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly
Signed-off-by: Alp Dener <adener@nvidia.com>

* added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* updated shardy rules for all custom ops to decouple block scale rules from their tensors
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* changed unit test use_jax_gemm option to be a context to preserve external custom op settings, tightened multi-GPU encoder test tolerances, changed gemm() API to use contracting_dims and batched_dims separately instead of dimension_numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed typo in test utils
Signed-off-by: Alp Dener <adener@nvidia.com>

* added sequence-first input warnings
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed datasets version for JAX examples
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverting modification to force_1x_quantization decision
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected gemm function syntax in unit tests
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

214e2a4a

11 Jul, 2025 1 commit

[JAX] Update distributed LayerNormMLP test tolerance for L40 (#1901) · 11fecc41

jberchtold-nvidia authored Jul 11, 2025



Update test tolerance for L40
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

11fecc41