Commits · 9f3e79bff824d3a9f10267dc414308011c87b093 · OpenDAS / TransformerEngine

03 Oct, 2025 1 commit

[JAX] Clamped Swiglu Integration (#2194) · b840898b

vthumbe1503 authored Oct 03, 2025


Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* JAX integration for clamped SwiGLU. This continues the earlier PR that added clamped SwiGLU (used in GPT OSS) support to TE along with the PyTorch integration. This PR hooks up the nvte APIs for clamped swiglu and dswiglu to TE JAX.

b840898b
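
For readers unfamiliar with the activation named in the commit above, here is a minimal scalar sketch of a clamped SwiGLU. It follows the formulation commonly associated with the GPT OSS reference code, not the TE kernels themselves: the function name, the `alpha=1.702` and `limit=7.0` defaults, and the `(linear + 1)` term are assumptions for illustration and are not taken from this commit.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clamped_swiglu(gate: float, linear: float,
                   alpha: float = 1.702, limit: float = 7.0) -> float:
    """Scalar clamped SwiGLU (illustrative; defaults are assumptions)."""
    gate = min(gate, limit)                   # clamp the gate from above only
    linear = max(-limit, min(linear, limit))  # clamp the linear branch on both sides
    swish = gate * _sigmoid(alpha * gate)     # gate * sigmoid(alpha * gate)
    return swish * (linear + 1.0)
```

Under these assumptions, inputs beyond the limit produce the same output as inputs exactly at the limit, which is what keeps the activation bounded for large pre-activations.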

27 Sep, 2025 1 commit

[JAX] CollectiveGemm (#2166) · d75bf43f

Phuong Nguyen authored Sep 27, 2025



* init cgemm + unit tests

* UB bootstrap with NCCL, no MPI dependency

* add NVLINK-P2P check + error message

* skip tests if no NVLINK available

* use std::vector to store ncclComm_t

* update misuse of TP warning
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

d75bf43f

27 Aug, 2025 1 commit

[JAX] Decouple Recipe and ScalingMode (#1728) · c9508000

jberchtold-nvidia authored Aug 27, 2025



* Decouple recipe and scaling mode
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

c9508000

21 Aug, 2025 1 commit

[ TE-JAX ] Expose cp_strategy argument to DPA api (#2090) · 20be25a3

Md Fahim Faysal Khan authored Aug 21, 2025



* added cp strategy arg to DPA api
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>

* converted DPA cp_strategy to string
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>

---------
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>

20be25a3

18 Aug, 2025 1 commit

[JAX] Fix Flax variable creation when quantizers are created directly from a recipe (#2079) · 757fd1cf

jberchtold-nvidia authored Aug 18, 2025

Fix flax variables when creating quantizers directly from a recipe
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

757fd1cf

12 Aug, 2025 1 commit

[JAX] Support custom recipe and custom collection name when creating quantizer sets (#2059) · 6a4e871e

jberchtold-nvidia authored Aug 12, 2025



* Support setting collection name for quantizer set Flax variables in TransformerEngineBase flax module
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support creating quantizer set from a recipe directly
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix debug error format string in gemm.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

6a4e871e

07 Aug, 2025 1 commit

[JAX] TE Gemm custom call clean up (#2030) · cae1c436

Phuong Nguyen authored Aug 07, 2025



* rm batch_dim, sequence_dim, sequence_parallel_output
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm lhs_quantized_colwise and rhs_quantized_colwise
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm unnecessary transpose_batch_sequence arg from some modules
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

cae1c436

06 Aug, 2025 1 commit

[JAX] Remove `dot_1_output_axes` usage in LayerNormMLP (#2029) · ed42b5ac

Phuong Nguyen authored Aug 06, 2025



* remove dot1_output_axes
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ed42b5ac

30 Jul, 2025 1 commit

[JAX] TE GEMM checkpointing policies (#2003) · 858755c0

jberchtold-nvidia authored Jul 30, 2025



* TE primitive checkpointing policies
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove batched gemm policy
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

858755c0

24 Jul, 2025 1 commit

[JAX] Fixing GemmPrimitive partitioning rules to handle tensor-parallelism... · 25a82192

Alp Dener authored Jul 24, 2025


[JAX] Fixing GemmPrimitive partitioning rules to handle tensor-parallelism correctly for sequence-parallel inputs (#1980)

* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected handling of FSDP sharding for the RHS operand
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting issues
Signed-off-by: Alp Dener <adener@nvidia.com>

* added assert on sequence-parallel options when GemmPrimitive is disabled
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

25a82192

16 Jul, 2025 1 commit

[JAX] Support Flax sharding constraints (#1933) · c0c12e20

jberchtold-nvidia authored Jul 16, 2025



* Support flax sharding constraints
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add warning for deprecated TE logical axes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update examples
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

c0c12e20

14 Jul, 2025 1 commit

[JAX] GEMM custom op (#1855) · 214e2a4a

Alp Dener authored Jul 14, 2025



* added XLA FFI custom op for TE/common nvte_cublas_gemm
Signed-off-by: Alp Dener <adener@nvidia.com>

started GemmPrimitive, abstract done
Signed-off-by: Alp Dener <adener@nvidia.com>

gemm custom op working with BF16, needs testing for FP8/MXFP8
Signed-off-by: Alp Dener <adener@nvidia.com>

converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call
Signed-off-by: Alp Dener <adener@nvidia.com>

BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue
Signed-off-by: Alp Dener <adener@nvidia.com>

fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
Signed-off-by: Alp Dener <adener@nvidia.com>

new GemmPrimitive passing all Dense tests
Signed-off-by: Alp Dener <adener@nvidia.com>

import cleanup and reverted code chunk movement
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused .transpose() implementations from ScaledTensors
Signed-off-by: Alp Dener <adener@nvidia.com>

all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl
Signed-off-by: Alp Dener <adener@nvidia.com>

removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused changes to ScaledTensor classes and debug prints
Signed-off-by: Alp Dener <adener@nvidia.com>

* minor unit test cleanup
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* FP8 tests passing on Blackwell but MXFP8 outputs NaN
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* MXFP8 issue traced to scale factor padding with NaNs instead of zeros
Signed-off-by: Alp Dener <adener@nvidia.com>

* padding scale with 2^-127 instead of nans
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix bug on rhs_scale_inv usage
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* cleanup E8M0 type converter use it in gemm.cpp
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* segfault fixed, passing all unittests on Blackwell
Signed-off-by: Alp Dener <adener@nvidia.com>

* fix for fuseddense tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix workspace alignment
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul
Signed-off-by: Alp Dener <adener@nvidia.com>

all unit tests passing on H100x8 node
Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



linting fixes
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed batch dimension numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed FP8 scale sharding rule when there are no FP8 scales
Signed-off-by: Alp Dener <adener@nvidia.com>

added error message for unsupported Shardy partitioner
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed test tolerances for FP8 cases
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed shardy test skip cases
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly
Signed-off-by: Alp Dener <adener@nvidia.com>

* added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* updated shardy rules for all custom ops to decouple block scale rules from their tensors
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* changed unit test use_jax_gemm option to be a context to preserve external custom op settings, tightened multi-GPU encoder test tolerances, changed gemm() API to use contracting_dims and batched_dims separately instead of dimension_numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed typo in test utils
Signed-off-by: Alp Dener <adener@nvidia.com>

* added sequence-first input warnings
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed datasets version for JAX examples
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverting modification to force_1x_quantization decision
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected gemm function syntax in unit tests
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

214e2a4a
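
The commit above traces an MXFP8 NaN issue to scale-factor padding and switches the padding to 2^-127. A short sketch of why that works, assuming the standard E8M0 encoding for MX block scales (an 8-bit exponent with bias 127, where byte `0xFF` is NaN): a zero byte decodes to the finite value 2^-127, so padding with zero bytes never introduces NaNs. The helper names below are hypothetical and are not TE functions.

```python
import math

E8M0_BIAS = 127
E8M0_NAN = 0xFF  # the only non-finite E8M0 encoding


def decode_e8m0(byte: int) -> float:
    """Decode one E8M0 scale byte: exponent only, no sign or mantissa."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 values are single bytes")
    if byte == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, byte - E8M0_BIAS)  # 2 ** (byte - 127)


def pad_scales(scales: list[int], padded_len: int, fill: int = 0x00) -> list[int]:
    """Right-pad a row of E8M0 scale bytes with a finite fill value (hypothetical helper)."""
    if padded_len < len(scales):
        raise ValueError("padded_len must be >= len(scales)")
    return scales + [fill] * (padded_len - len(scales))
```

With `fill=0x00` every padded entry decodes to 2^-127, whereas a fill of `0xFF` would decode to NaN.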

13 Jun, 2025 2 commits

[JAX] Add support for Fused Attn MLA head_dim_qk != head_dim_v (#1851) · 1ddfa0c6

Kshitij Lakhani authored Jun 13, 2025



* Add support for Fused Attn MLA head_dim_qk != head_dim_v
	Modify is_fused_attn_kernel_available() to accept different head_dims for qk and v
	Modify FusedAttnHelper to accept different head_dims for qk and v and modify assert dims checks in parse_qkv_aval()
	Modify FusedAttnFwdPrimitive and FusedAttnBwdPrimitive to accept different head_dims for qk and v
	Modify Fused Attn related cpp and csrc extension API calls to accept different head_dims for qk and v
	Modify DotProductAttention call() to extract head dims separately for qk and v
	Modify the FusedAttn tests to accommodate API changes in the FusedAttn API
	Add test case for head_dim_qk != head_dim_v (failing)
	Modify the baseline JAX appropriately to reshape the output vector based on v dims and not q dims
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix context dims in general DPA in test_fused_attn
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix dim for output tensor by replacing with v head dim rather than q head dim
Add test cases for jax fused attn where head_dim_qk != head_dim_v for a combination of data types and attention type
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Modify the fused attn jax unit test case for head dim qk != head dim v
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Use new FusedAttnRunner function signature for separate hidden dim for qk and v in Fused Attn distributed tests
Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix usage of is_fused_attn signature in distributed tests
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Remove unnecessary assert
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

1ddfa0c6

Add support for head_dim > 128 (#1797) · 71c76b6b

Charlene Yang authored Jun 14, 2025



* add support for head dim > 128
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* remove debugging
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* raise tols slightly to tolerate 1/2048 mismatches
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix is_training for test_te_layer
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add bprop support for blackwell
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor tweak for format
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix backend selection results
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* bump sm100 to sm100+
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add sq=1 test for MLA
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* enable sq=1 for bprop
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* minor tweak in comments
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix head_dim logic and remove pytest skip
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add FE fix for d>128
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* update FE again to take in small fixes
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* add cuDNN version info in L0 tests
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* increase tols for Unfused + large dim
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* Revert "add cuDNN version info in L0 tests"

This reverts commit 3e1b426ca5319a2c0540b9e73bba7047d0e583e5.
Signed-off-by: Charlene Yang <charleney@nvidia.com>

* fix tols for Unfused
Signed-off-by: Charlene Yang <charleney@nvidia.com>

---------
Signed-off-by: Charlene Yang <charleney@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

71c76b6b

16 May, 2025 1 commit

[JAX] Support logical partitioning axes in TE Flax modules (#1772) · 27612051

jberchtold-nvidia authored May 16, 2025



* [JAX] Update flax module param initialization to support logical partitioning axes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix ffn1 intermediate result being replicated
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add documentation and assert when logical_axes=None
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix bias in LayerNormMLP flax module
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix layer tests to not use nn_partitioning and instead use nn.with_logical_axes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

27612051

16 Apr, 2025 1 commit

Fix #1524 and other softmax mask functionality (#1681) · 0994fb48

Kshitij Lakhani authored Apr 15, 2025



* Add test cases for full coverage in jax/test_layer.py
- causal and window size None
- causal and window size default (-1,1)
- no_mask and window size default (-1,1)
- no_mask and window size default (2,2)
- padding and window size None
- padding_causal and window_size (2,2)
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Correct the condition where padding_causal_mask was being mapped to scaled upper triangle
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Fix Issue #1524
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add a runner and test cases for jax.flax.module.Softmax class for fwd pass only
Segregate runner classes for Softmax module and softmax primitives
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Simplify logic when picking softmax primitives and softmax jax framework calls
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Simplify the logic for performing jax based softmax
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Code clean up
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add support table for mask, SWA and Softmax type. Code linting
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Explicit SWA conditions in comments. Fix typo
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Resolve typo to remove None in SWA comments section
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0994fb48

09 Apr, 2025 1 commit

[JAX] Scaling Enum Abstracting (#1655) · 962d9c53

Phuong Nguyen authored Apr 09, 2025



* scaling enum abstract

* rm NVTE_ from ScalingMode names

* rework scaling mode enum in grouped gemm

* fix norm sharding

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

962d9c53

04 Apr, 2025 2 commits

[JAX] Flatten_axis for quantization and Sharding propagation fixes (#1644) · ff884e20

Phuong Nguyen authored Apr 04, 2025



* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add flatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ff884e20

[JAX-Q] Distributed MXFP8 flax layer tests (#1643) · be1f647c

jberchtold-nvidia authored Apr 04, 2025

MXFP8 flax layer tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

be1f647c

01 Apr, 2025 1 commit

[JAX] Refactor + MXFP8 + GroupedGEMM (#1627) · cf9a7c2f

Phuong Nguyen authored Mar 31, 2025



* refactor + mxfp8

* added grouped gemm

* rename linear to dense

* added cublas init phase for groupedGemm

* relax the tol of test encoder multiprocessing mxfp8 by 0.001
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>

cf9a7c2f

18 Feb, 2025 1 commit

[JAX] Flax with compute dtype inferred from input dtype. (#1485) · 6673f165

Phuong Nguyen authored Feb 18, 2025

flax module with compute dtype inferred from the inputs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

6673f165

14 Feb, 2025 4 commits

[JAX] Expose THD format to the flax module (#1480) · af7b2b44

Reese Wang authored Feb 15, 2025



* Expose THD to flax MHA module
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Enhance docs
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

af7b2b44

[JAX] Lint Fix (#1484) · 45e9d8b6

Phuong Nguyen authored Feb 14, 2025

JAX Lint Fix
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

45e9d8b6

[JAX] Fixes for CI failures with the latest JAX (#1469) · e19b8281

Phuong Nguyen authored Feb 14, 2025



* fixes L1 test

* fix test_multigpu_encoder

* fixes for other multi-encoder tests

* jax.extend.ffi to jax.ffi

* initialization with float32

* add init_dtype as an optional arg to all modules

* update use_scan query from xla flags

* relax threshold for test_encoder fp8

* relax the tols

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

e19b8281

[JAX] Flax params initialization with weight_dtype (#1481) · 24e4f955

Phuong Nguyen authored Feb 13, 2025



* initialization with weight_dtype
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

24e4f955

11 Feb, 2025 1 commit

[JAX] Flax module init with a given dtype (#1472) · b87e539d

Phuong Nguyen authored Feb 11, 2025



* flax module to init params with given dtype
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* all tests passed
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* remove unnecessary reshape for kernel
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* remove casting output of dot
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

b87e539d

08 Jan, 2025 1 commit

[JAX] Add THD + SWA unit tests (#1390) · b898cbe1

Reese Wang authored Jan 08, 2025



* Fix SWA mask for THD and forcing seqlen_kv >= seqlen_q for SWA
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Generalize sliding window mask
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix pylint
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>

b898cbe1
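
As an illustration of the sliding window mask that the commit above generalizes, here is a minimal sketch in plain Python. It assumes a `(left, right)` window tuple in which `-1` means unbounded on that side, and it aligns query position `i` with key position `i` (top-left alignment). Both conventions are assumptions for this example; the actual TE mask construction may differ, particularly when seqlen_kv > seqlen_q.

```python
def sliding_window_mask(seqlen_q: int, seqlen_kv: int, window_size: tuple) -> list:
    """Return mask[i][j] == True where query i may attend to key j.

    window_size is (left, right); -1 on either side means unbounded.
    """
    left, right = window_size
    mask = []
    for i in range(seqlen_q):
        row = []
        for j in range(seqlen_kv):
            within_left = left < 0 or j >= i - left     # at most `left` keys behind
            within_right = right < 0 or j <= i + right  # at most `right` keys ahead
            row.append(within_left and within_right)
        mask.append(row)
    return mask
```

Under these assumptions, `(-1, 0)` reproduces a causal mask, `(-1, -1)` allows full attention, and `(0, 0)` restricts each query to its own position.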

02 Jan, 2025 1 commit

Update copyright to include 2025 (#1388) · c9ea6be9

Kirthi Shankar Sivamani authored Jan 02, 2025

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c9ea6be9

04 Nov, 2024 1 commit

[JAX] Expose context parallel params to jax DPA api (#1292) · d7256866

Md Fahim Faysal Khan authored Nov 04, 2024



Exposed context parallel params to DPA api
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

---------
Signed-off-by: Md Fahim Faysal Khan <mdfahimfaysa@nvidia.com>
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Co-authored-by: Michael Goldfarb <mgoldfarb@nvidia.com>

d7256866

10 Oct, 2024 1 commit

[JAX] Expose sliding window attn to TE-JAX API (#1205) · 85e60e64

Hua Huang authored Oct 10, 2024



* Expose JAX sliding window attn API
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* No SWA in context parallel; fix RNG seed in test
Signed-off-by: Hua Huang <huah@nvidia.com>

* Handle SWA API discrepancy in cuDNN and Python
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add SWA API for flax, all tests passed

Will update tests/jax/test_praxis_layers.py next
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update test_praxis_layers.py for SWA, test passed
Signed-off-by: Hua Huang <huah@nvidia.com>

* Use tuple window_size; update for PR #1212
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add and adjust some pytest.skip
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Revised following Reese Wang's comments

Still need further debugging:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-KV_PACKED-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-NO_BIAS] - AssertionError:
FAILED test_fused_attn.py::TestFusedAttn::test_backward[NO_SWA-DROP_0.0-4-128-256-16-16-64-BF16-CROSS-SEPARATE-NO_MASK-POST_SCALE_BIAS-1HSS] - AssertionError:

These errors do not exist in the previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix no-SWA test case errors in previous commit
Signed-off-by: Hua Huang <huah@nvidia.com>

* Add Padding mask w/ sliding windows sanity tests
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use float32 for the reference code softmax calculation
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>

85e60e64

08 Aug, 2024 1 commit

[JAX] Support non-deterministic algo for cuDNN FA (#1056) · 86f27e12

Reese Wang authored Aug 08, 2024



* Support non-deterministic algo
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Refine the helper function name
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Move fixture to conftest.py
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

86f27e12

02 Aug, 2024 1 commit

Link attention docs to the main docs and fix errors reported by Sphinx (#1062) · 098e3006

Przemyslaw Tredak authored Aug 01, 2024



* Link attention docs to the main docs and fix errors reported by Sphinx
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Lower the version of nbsphinx
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the URL of example_attention.py to GitHub
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in the attention tutorial
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

098e3006

03 Jul, 2024 1 commit

[JAX] Add experimental internally used THD (packed) fused attn API (#964) · 687697a7

Reese Wang authored Jul 03, 2024



* Integrate experimental ragged offset
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Use per sequence based offsets
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Format
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove v/o_seq_offsets
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add FP16 sanity tests and remove forward tests from the automatically run tests
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Enhance input checks
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Separate fused attn into 2 different APIs and add the docs
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add experimental to the docs
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Fix lint
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Add runtime segments check
Signed-off-by: Reese Wang <rewang@nvidia.com>

* Remove finished TODO
Signed-off-by: Reese Wang <rewang@nvidia.com>

---------
Signed-off-by: Reese Wang <rewang@nvidia.com>

687697a7

14 Jun, 2024 1 commit

Apply formatting (#929) · 9416519d

Kirthi Shankar Sivamani authored Jun 13, 2024



* Apply formatting
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply formatting
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

9416519d

13 Jun, 2024 1 commit

[JAX] Splitting cpp_extensions.py (#899) · 5986342a

Phuong Nguyen authored Jun 13, 2024



* Split cpp_extensions.py, renamed mlp.py and fused_attn.py
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixed import in tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

5986342a

12 Jun, 2024 1 commit

[JAX] Rewrite the Format of FP8 Meta and Remove unused ShardingTypes. (#842) · dff11340

Ming-Xu Huang authored Jun 12, 2024



* Reformat FP8 Meta

1. Reformat FP8 meta to be one-set-per-tensor.
2. Remove fp8_max and scale_inv.
3. Remove unused functions in fp8.py
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix unit-tests
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove ShardingType and MajorShardingType
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix lint errors
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed unittests.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Rename a few variables.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Add jit to update_amax_list
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed naming error in LayernormMLP
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed bugs in test_distributed_layernorm_mlp.py
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>

dff11340

10 Jun, 2024 1 commit

[JAX] Made order of gated act consistent in all branches (#902) · 086a12fe

Phuong Nguyen authored Jun 10, 2024

* Made order of gated act consistent in all branches
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

086a12fe

22 May, 2024 1 commit

[JAX] Fixed the shape mismatching issue in MLP. (#859) · 82e5b4d2

Ming-Xu Huang authored May 22, 2024



* Fixed the shape mismatching issue in MLP.
Signed-off-by: Ming Huang <mingh@nvidia.com>

* Add a corresponding test
Signed-off-by: Ming Huang <mingh@nvidia.com>

---------
Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>

82e5b4d2

13 May, 2024 1 commit

[JAX] Adding Gated/Non-gated ReLU, Quick GeLU, Squared ReLU (#826) · c473f0e6

Phuong Nguyen authored May 13, 2024



* renamed gelu to act

* added relu, srelu, qgelu

* fixes initialization for layernorm_fp8_mlp tests

* moved activation_fp8 prim into testunit file

* Moved NVTE_Activation_Enum to common/.../activation.h

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

c473f0e6
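
For reference, the activations named in the commit above can be sketched with their usual textbook definitions: ReLU as max(x, 0), squared ReLU as ReLU(x)^2, and Quick GeLU as x * sigmoid(1.702 * x). The scalar helpers below are illustrative only and are not the TE primitives. The `gated` helper, including which operand the activation is applied to, is an assumption and does not reflect the operand order used in TE.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(x: float) -> float:
    return max(x, 0.0)


def squared_relu(x: float) -> float:
    return relu(x) ** 2


def quick_gelu(x: float) -> float:
    return x * _sigmoid(1.702 * x)  # sigmoid-based approximation of GeLU


def gated(act, a: float, b: float) -> float:
    """Gated variant (hypothetical helper): activate one branch, scale by the other."""
    return act(a) * b
```

For example, `gated(relu, 2.0, 5.0)` activates the first branch and multiplies by the second, while a negative first operand zeroes the product.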

03 May, 2024 1 commit

[JAX] Generalizing Activation Primitives (#810) · aad4e173

Phuong Nguyen authored May 03, 2024



* templated primitives and respective C++ functions
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes for LayerNormMLP, tests in test_custom_compute all passed
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* added default arg for pybind get_workspace_size funcs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes for TestTransFormer with non-gated act tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* renamed gelu to act
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* improved enum implementation, avoid using magic numbers
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Exposed C++ ActivationEnum to python side
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Changed error messages
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* changed conditional check on input shape for dbias_cast_transpose
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* changed dtype (tol) for bias grad tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fixes so that layer_norm_fp8_mlp can take bias = None
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Set bias = None in flax modules
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

aad4e173