- 14 Jul, 2025 1 commit
-
-
Alp Dener authored
* added XLA FFI custom op for TE/common nvte_cublas_gemm
* started GemmPrimitive, abstract done
* gemm custom op working with BF16, needs testing for FP8/MXFP8
* converted TE GEMM API to use ScaledTensor and added an OS env flag to use TE GEMM under the general gemm() call
* BF16 tests passing; FP8 tests should be passing but contracting_dims has a scoping issue
* FP8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
* updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
* new GemmPrimitive passing all Dense tests
* import cleanup and reverted code chunk movement
* removed unused .transpose() implementations from ScaledTensors
* all custom call tests passing on Hopper; GEMM-related tests cover both GemmPrimitive and the native JAX impl
* removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
* removed unused changes to ScaledTensor classes and debug prints
* minor unit test cleanup
* FP8 tests passing on Blackwell but MXFP8 outputs NaN
* reverted dense and fuseddense changes; FP8 tests passing on Hopper and Blackwell, MXFP8 has issues with E5M2
* MXFP8 issue traced to scale factor padding with NaNs instead of zeros
* padding scale with 2^-127 instead of NaNs (Phuong Nguyen)
* fix bug in rhs_scale_inv usage (Phuong Nguyen)
* cleanup E8M0 type converter and use it in gemm.cpp (Phuong Nguyen)
* segfault fixed; passing all unit tests on Blackwell
* fix for fuseddense tests (Phuong Nguyen)
* fix workspace alignment (Phuong Nguyen)
* fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul; all unit tests passing on an H100x8 node
* linting fixes
* fixed batch dimension numbers
* fixed FP8 scale sharding rule when there are no FP8 scales
* added error message for the unsupported Shardy partitioner
* fixed test tolerances for FP8 cases
* fixed Shardy test skip cases
* moved reshape of encoder output in the encoder examples so the custom partitioning rules work correctly
* added helper functions for padding and unpadding block scales; changed GemmPrimitive to accept unpadded scales and pad them after sharding
* updated Shardy rules for all custom ops to decouple block-scale rules from their tensors
* fixed linting errors
* changed the unit test use_jax_gemm option to a context to preserve external custom op settings; tightened multi-GPU encoder test tolerances; changed the gemm() API to take contracting_dims and batched_dims separately instead of dimension_numbers
* fixed typo in test utils
* added sequence-first input warnings
* fixed datasets version for the JAX examples
* reverting modification to the force_1x_quantization decision
* corrected gemm function syntax in unit tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks (applied at several points); see https://pre-commit.ci
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
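The reworked gemm() API described above takes contracting_dims and batched_dims as separate arguments instead of a combined dimension_numbers tuple. A minimal sketch of what that convention corresponds to in plain JAX (the function name, defaults, and signature below are illustrative assumptions, not the repository's actual API):

import jax
import jax.numpy as jnp

def gemm_reference(lhs, rhs, contracting_dims=((1,), (0,)), batched_dims=((), ())):
    # Native-JAX reference: the two tuples map directly onto
    # lax.dot_general's dimension_numbers = (contracting_dims, batched_dims).
    return jax.lax.dot_general(lhs, rhs, (contracting_dims, batched_dims))

x = jnp.ones((8, 16), dtype=jnp.bfloat16)
w = jnp.ones((16, 32), dtype=jnp.bfloat16)
print(gemm_reference(x, w).shape)  # (8, 32)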
-
- 12 Jun, 2025 4 commits
-
-
Phuong Nguyen authored
* fixes for jittable grouped_quantize
* fixes for jittable grouped_gemm
* fix contracting_dim for wgrad gemm
* exclude jitted grouped_gemm from the unit test as it does not work with CUDA graphs
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8
* Fix GroupedGemmFFI cuBLAS workspace alignment bug
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
-
Phuong Nguyen authored
Revert "[JAX] GroupedDense v.2 without dynamic shape (#1721)" This reverts commit 5d01ef21 . Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8
* Fix GroupedGemmFFI cuBLAS workspace alignment bug
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Jun, 2025 1 commit
-
-
Phuong Nguyen authored
* refactor the multi_stream utils and implement nvte_multi_tensor_quantize in TE/Common
* implement GroupedQuantizer and grouped_quantize in TE/JAX
* fix logical_axes_names for the transpose tensor in ScaledTensor
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
-
- 05 Jun, 2025 1 commit
-
-
jberchtold-nvidia authored
* Fix 1x quantize kernel availability check on Hopper
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 02 Jun, 2025 1 commit
-
-
jberchtold-nvidia authored
* Use 1x quantization + jax transpose on BW for performance
* Use 1x quantization on Hopper as well, as it is also faster
* Undo architecture check helper function
* Lint
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
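The "1x quantization + jax transpose" change above replaces a second quantization pass with a plain transpose of the already-quantized data. A rough sketch of the idea, assuming a hypothetical quantize_1x helper (the real kernel and API differ):

import jax.numpy as jnp

def quantize_1x(x, q_dtype=jnp.float8_e4m3fn):
    # Current-tensor scaling: one amax, one scale, one quantize pass.
    amax = jnp.max(jnp.abs(x))
    scale = jnp.float32(448.0) / jnp.maximum(amax, 1e-12)  # 448 = E4M3 max
    data = (x.astype(jnp.float32) * scale).astype(q_dtype)
    return data, 1.0 / scale  # quantized data and scale_inv

x = jnp.ones((128, 256), dtype=jnp.bfloat16)
data, scale_inv = quantize_1x(x)
data_t = jnp.transpose(data)  # columnwise operand via transpose, no second quantize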
-
- 22 May, 2025 1 commit
-
-
jberchtold-nvidia authored
Make primitive names more granular so individual custom calls can be disabled with finer control
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 30 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
Fix distributed layernorm test failure
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 29 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
* Update test_helper.py and add a QuantizeConfig class for CurrentScaling
* WIP distributed current scaling
* Distributed current scaling (debugging): a distributed implementation with a replicated scale_inv works for layernorm_mlp but feels like a hack. Each device ends up with a different scale_inv value, though jax.debug.print only shows one of them; since JAX/XLA is told the scale is replicated, it assumes the values are all equal but never checks. Per-device scales therefore appear to work for current scaling, but this may be unstable and could fail if we or the user change the partitioning, or if XLA starts acting on the assumption that all the scale_inv values are identical.
* Implement distributed current scaling by computing a global amax and scale before quantization
* Add encoder and mnist tests for current scaling
* Add a primitive prefix to Shardy unique_vars to prevent factor conflicts when running unfused primitives for current scaling
* Remove scale_shape primitive arg that is no longer used
* Format
* Fix expected result on the multiprocessing encoder test
* Lint fix
* Update multiprocessing current scaling tolerances
* Uncomment test case that was disabled for testing
* Remove commented-out debug line
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
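The "global amax" fix above computes one amax across all devices before quantizing, so every device derives the same scale and scale_inv is genuinely replicated. A minimal sketch of that pattern using pmap and an all-reduce max (names and shapes are illustrative, not the library's implementation):

import jax
import jax.numpy as jnp

E4M3_MAX = 448.0  # max representable value of float8_e4m3fn

def quantize_with_global_amax(x_shard):
    local_amax = jnp.max(jnp.abs(x_shard))
    global_amax = jax.lax.pmax(local_amax, axis_name="dp")  # all-reduce max
    scale = E4M3_MAX / jnp.maximum(global_amax, 1e-12)
    data = (x_shard * scale).astype(jnp.float8_e4m3fn)
    return data, 1.0 / scale  # identical scale_inv on every device

n_dev = jax.local_device_count()
x = jnp.ones((n_dev, 4, 8), dtype=jnp.float32)
data, scale_inv = jax.pmap(quantize_with_global_amax, axis_name="dp")(x)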
-
- 22 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
* [JAX-Q] Single-GPU current scaling for JAX
* Fix scale check dtype for MXFP8 scales, affecting tests that use assert_bitwise_scaled_tensors
* Address comments
* Remove cast to fp32 for norm primitives now that the zero-centered gamma dtype issue is fixed
* Fix lint issue
* Remove unnecessary cast to fp32
* Lint
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 14 Apr, 2025 1 commit
-
-
Johannes Reifferscheid authored
* Add experimental Shardy support. Production use is not yet recommended.
---------
Signed-off-by: Johannes Reifferscheid <jreiffers@nvidia.com>
-
- 09 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* scaling enum abstract
* rm NVTE_ from ScalingMode names
* rework scaling mode enum in grouped gemm
* fix norm sharding
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 04 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout
* add flatten_axis option
* added gated act to test encoder
* sharding constraint fixes
* fix padding when the flattened first dim needs to be padded
* update test sizes so that padding is tested
* rm output sharding as it can be done in the flax module
* sharding scale_inv for mxfp8
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 01 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* refactor + mxfp8
* added grouped gemm
* rename linear to dense
* added cuBLAS init phase for groupedGemm
* relax the tolerance of the multiprocessing MXFP8 encoder test by 0.001
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 13 Mar, 2025 1 commit
-
-
Reese Wang authored
Make ffi compatible with jax 0.4
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Mar, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Fix wheel install after src install
* Fix JAX imports
* switch order of dirs when searching for the .so
* Use existing dir src build
* Fix lint
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
* fixes L1 test
* fix test_multigpu_encoder
* fixes for other multi-encoder tests
* jax.extend.ffi to jax.ffi
* initialization with float32
* add init_dtype as an optional arg to all modules
* update use_scan query from xla flags
* relax threshold for test_encoder fp8
* relax the tols
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
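The "jax.extend.ffi to jax.ffi" item above refers to JAX moving its FFI module to a new location. A small compatibility shim like the following (illustrative, not the repository's code) prefers the new path and falls back to the old one:

try:
    from jax import ffi            # newer JAX: the FFI API lives at jax.ffi
except ImportError:
    from jax.extend import ffi     # older JAX 0.4.x location

# Either way, downstream code can call ffi.register_ffi_target(...) and
# ffi.ffi_call(...) without caring which location the installed JAX provides.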
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 24 Oct, 2024 1 commit
-
-
Hua Huang authored
[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)
* Add TransposeFFI, test passed
* Add ActLuFP8FFI; fix TransposeFFI
* Add QuantizeFFI
* Add FusedAttnForwardFFI and some unit tests
* Minor fix
* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI
* Revise FusedAttnForwardFFI()
* Add FFI_CudaGraph_Traits; all tests passed, ready for merge
* Bug fix for FFI data type mismatch; also add a safeguard at the entrance to the FFI function
* [pre-commit.ci] auto fixes from pre-commit.com hooks (applied at several points); see https://pre-commit.ci
---------
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 19 Aug, 2024 1 commit
-
-
Frédéric Bastien authored
Signed-off-by: Frederic Bastien <fbastien@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 17 Jul, 2024 1 commit
-
-
Reese Wang authored
* Add enabled() to BasePrimitive
* Add layernorm/rmsnorm fallback
* Add cast_fp8 fallback
* Add transpose/cast_transpose XLA fallback
* Act_lu fallback
* Add transpose fallback
* Add softmax fallback
* Unify the use of _cast_fp8
* Add tests for NVTE_JAX_CUSTOM_CALLS_RE
---------
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
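The enabled() gate and the NVTE_JAX_CUSTOM_CALLS_RE tests above suggest a regex-driven switch between the TE custom call and the native-XLA fallback. A sketch of how such a gate could look (class names and the matching rule are assumptions for illustration, not the library's implementation):

import os
import re

class BasePrimitive:
    name = "base"

    @classmethod
    def enabled(cls):
        # Custom call is used only if its name matches the regex;
        # otherwise the native JAX/XLA fallback path is taken.
        pattern = os.getenv("NVTE_JAX_CUSTOM_CALLS_RE", ".*")
        return re.fullmatch(pattern, cls.name) is not None

class LayerNormFwdPrimitive(BasePrimitive):
    name = "te_layernorm_forward"  # hypothetical primitive name

# e.g. NVTE_JAX_CUSTOM_CALLS_RE="^(?!te_layernorm_forward$).+$" disables only
# this custom call and routes it to the native fallback.
print(LayerNormFwdPrimitive.enabled())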
-
- 14 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Apply formatting
* Apply formatting
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Jun, 2024 1 commit
-
-
Phuong Nguyen authored
* Split cpp_extensions.py, renamed mlp.py and fused_attn.py
* fixed import in tests
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-