Commits · b88f727b44d7779200a7f57c279805930a3883ff · OpenDAS / TransformerEngine

13 Nov, 2025 1 commit

[JAX] Support for checkpointing quantizations (#2356) · 67d63d02

jberchtold-nvidia authored Nov 13, 2025



* Support for checkpointing quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add jaxpr test for quant checkpoint name
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Revert "Support for checkpointing quantizations"

This reverts commit f7b784940369d0da2a77c57fa6ea744e883c5832.
Signed-off-by: JAX Toolbox <jax@nvidia.com>

* Checkpoint quantizations
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* revert other files
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* move checkpointing to VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* fix ci failure
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: JAX Toolbox <jax@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>

09 Oct, 2025 1 commit

[JAX] NVFP4 support in TE/JAX (#2254) · 8a7ab3dd

jberchtold-nvidia authored Oct 09, 2025


Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

07 Oct, 2025 1 commit

[JAX] Activation/Normalization to output amax for later quantization in CurrentScaling (#2238) · 127b6d3a

Phuong Nguyen authored Oct 07, 2025



* reuse amax for current scaling
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
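
The amax reuse described in this commit can be illustrated with a minimal pure-Python sketch. It assumes that per-tensor current scaling derives its scale as the FP8 maximum divided by the tensor's amax. The function names and the ReLU stand-in activation are hypothetical and are not TransformerEngine's API; 448.0 is the largest finite E4M3 value.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def relu_with_amax(values):
    """Hypothetical activation that also returns the amax of its output."""
    out = [max(v, 0.0) for v in values]
    amax = max((abs(v) for v in out), default=0.0)
    return out, amax


def current_scale(values, amax=None):
    """Per-tensor scale; reuses a precomputed amax when one is supplied."""
    if amax is None:
        # Without a cached amax, a second reduction over the tensor is needed.
        amax = max((abs(v) for v in values), default=0.0)
    # Assumption: fall back to a scale of 1.0 for an all-zero tensor.
    return FP8_E4M3_MAX / amax if amax > 0.0 else 1.0


activations, amax = relu_with_amax([-3.0, 0.5, 7.0, -1.25])
scale_reused = current_scale(activations, amax=amax)  # no extra reduction
scale_recomputed = current_scale(activations)         # recomputes amax
```

Both paths yield the same scale; the reused path simply skips the extra pass over the activation output.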

05 Sep, 2025 1 commit

[JAX] NoScaleTensor wrapper for non-quantized data (#2136) · c47f329b

jberchtold-nvidia authored Sep 05, 2025



* Custom call tests passing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Cast non-quantized kernels to input dtype in VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename HighPrecisionTensor to NoScaleTensor
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use NoScaleTensor in pure JAX impls where it was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

07 Aug, 2025 1 commit

[JAX] TE Gemm custom call clean up (#2030) · cae1c436

Phuong Nguyen authored Aug 07, 2025



* rm batch_dim, sequence_dim, sequence_parallel_output
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm lhs_quantized_colwise and rhs_quantized_colwise
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm unnecessary transpose_batch_sequence arg from some modules
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

24 Jul, 2025 1 commit

[JAX] Fixing GemmPrimitive partitioning rules to handle tensor-parallelism... · 25a82192

Alp Dener authored Jul 24, 2025


[JAX] Fixing GemmPrimitive partitioning rules to handle tensor-parallelism correctly for sequence-parallel inputs (#1980)

* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected handling of FSDP sharding for the RHS operand
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral
Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting issues
Signed-off-by: Alp Dener <adener@nvidia.com>

* added assert on sequence-parallel options when GemmPrimitive is disabled
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

14 Jul, 2025 1 commit

[JAX] GEMM custom op (#1855) · 214e2a4a

Alp Dener authored Jul 14, 2025



* added XLA FFI custom op for TE/common nvte_cublas_gemm
Signed-off-by: Alp Dener <adener@nvidia.com>

started GemmPrimitive, abstract done
Signed-off-by: Alp Dener <adener@nvidia.com>

gemm custom op working with BF16, needs testing for FP8/MXFP8
Signed-off-by: Alp Dener <adener@nvidia.com>

converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call
Signed-off-by: Alp Dener <adener@nvidia.com>

BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue
Signed-off-by: Alp Dener <adener@nvidia.com>

fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
Signed-off-by: Alp Dener <adener@nvidia.com>

new GemmPrimitive passing all Dense tests
Signed-off-by: Alp Dener <adener@nvidia.com>

import cleanup and reverted code chunk movement
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused .transpose() implementations from ScaledTensors
Signed-off-by: Alp Dener <adener@nvidia.com>

all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl
Signed-off-by: Alp Dener <adener@nvidia.com>

removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
Signed-off-by: Alp Dener <adener@nvidia.com>

removed unused changes to ScaledTensor classes and debug prints
Signed-off-by: Alp Dener <adener@nvidia.com>

* minor unit test cleanup
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* FP8 tests passing on Blackwell but MXFP8 outputs NaN
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* MXFP8 issue traced to scale factor padding with NaNs instead of zeros
Signed-off-by: Alp Dener <adener@nvidia.com>

* padding scale with 2^-127 instead of nans
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix bug on rhs_scale_inv usage
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* clean up E8M0 type converter and use it in gemm.cpp
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* segfault fixed, passing all unittests on Blackwell
Signed-off-by: Alp Dener <adener@nvidia.com>

* fix for fuseddense tests
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix workspace alignment
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul
Signed-off-by: Alp Dener <adener@nvidia.com>

all unit tests passing on H100x8 node
Signed-off-by: Alp Dener <adener@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



linting fixes
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed batch dimension numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed FP8 scale sharding rule when there are no FP8 scales
Signed-off-by: Alp Dener <adener@nvidia.com>

added error message for unsupported Shardy partitioner
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed test tolerances for FP8 cases
Signed-off-by: Alp Dener <adener@nvidia.com>

fixed shardy test skip cases
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly
Signed-off-by: Alp Dener <adener@nvidia.com>

* added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* updated shardy rules for all custom ops to decouple block scale rules from their tensors
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed linting errors
Signed-off-by: Alp Dener <adener@nvidia.com>

* changed unit test use_jax_gemm option to be a context to preserve external custom op settings, tightened multi-GPU encoder test tolerances, changed gemm() API to use contracting_dims and batched_dims separately instead of dimension_numbers
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed typo in test utils
Signed-off-by: Alp Dener <adener@nvidia.com>

* added sequence-first input warnings
Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fixed datasets version for JAX examples
Signed-off-by: Alp Dener <adener@nvidia.com>

* reverting modification to force_1x_quantization decision
Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected gemm function syntax in unit tests
Signed-off-by: Alp Dener <adener@nvidia.com>

---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
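
The MXFP8 padding fix in this commit (scales padded with 2^-127 instead of NaN) is easier to follow with the E8M0 encoding written out. The sketch below is pure Python and only illustrative: E8M0 stores an 8-bit exponent with a bias of 127 and reserves 0xFF for NaN, so a zero byte decodes to 2^-127. The helper names `e8m0_decode` and `pad_scales` are hypothetical and are not the helpers added in the commit.

```python
import math

E8M0_BIAS = 127   # exponent bias of the E8M0 scale format
E8M0_NAN = 0xFF   # the only NaN encoding in E8M0


def e8m0_decode(byte):
    """Decode one E8M0 byte into the power-of-two scale it represents."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 values are single bytes")
    if byte == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, byte - E8M0_BIAS)  # 2 ** (byte - 127)


def pad_scales(scales, multiple, pad_byte=0):
    """Pad a list of E8M0 bytes up to a multiple of `multiple`.

    Padding with byte 0 (2^-127) keeps every padded entry finite,
    whereas padding with 0xFF would introduce NaN scales.
    """
    remainder = len(scales) % multiple
    if remainder == 0:
        return list(scales)
    return list(scales) + [pad_byte] * (multiple - remainder)


padded = pad_scales([127, 128, 129], multiple=4)
decoded = [e8m0_decode(b) for b in padded]
```

Here the three real scales decode to 1.0, 2.0 and 4.0, and the single padded entry decodes to the finite value 2^-127 rather than NaN.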

18 Jun, 2025 1 commit

[JAX] TensorUsage + FP8 GEMM with all layouts handling on BW (#1844) · 3a298e6b

Phuong Nguyen authored Jun 18, 2025



* TensorUsage + FP8 GEMM with all layouts handling on BW
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

04 Apr, 2025 1 commit

[JAX] Flatten_axis for quantization and Sharding propagation fixes (#1644) · ff884e20

Phuong Nguyen authored Apr 04, 2025



* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add flatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
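
As a rough illustration of the flatten_axis option and the padding fix listed above, the pure-Python sketch below assumes that flatten_axis marks where an N-D shape is split into a 2D (rows, cols) view before quantization, and that the flattened first dimension is rounded up to a block multiple. Both helper names and the rounding rule are hypothetical; the commit's actual behavior may differ.

```python
import math


def flatten_to_2d(shape, flatten_axis):
    """Collapse the dims before and after flatten_axis into (rows, cols)."""
    ndim = len(shape)
    axis = flatten_axis % ndim  # allow negative axes such as -1
    if axis == 0:
        raise ValueError("flatten_axis must leave at least one leading dim")
    return math.prod(shape[:axis]), math.prod(shape[axis:])


def padded_rows(rows, block):
    """Round rows up to the next multiple of block (assumed padding rule)."""
    return -(-rows // block) * block


rows, cols = flatten_to_2d((4, 8, 16), flatten_axis=-1)   # split before last dim
rows_alt, cols_alt = flatten_to_2d((4, 8, 16), flatten_axis=1)
```

With a shape of (4, 8, 16), splitting at the last axis gives a 32 x 16 view, while splitting at axis 1 gives a 4 x 128 view; `padded_rows(33, 32)` shows a flattened first dimension that needs padding up to 64.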

01 Apr, 2025 1 commit

[JAX] Refactor + MXFP8 + GroupedGEMM (#1627) · cf9a7c2f

Phuong Nguyen authored Mar 31, 2025



* refactor + mxfp8

* added grouped gemm

* rename linear to dense

* added cublas init phase for groupedGemm

* relax the tol of test encoder multiprocessing mxfp8 by 0.001
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
