Commits · c47f329b2084406093124851a3aeecb935183def · OpenDAS / TransformerEngine

05 Sep, 2025 1 commit

[JAX] NoScaleTensor wrapper for non-quantized data (#2136) · c47f329b

jberchtold-nvidia authored Sep 05, 2025



* Custom call tests passing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Cast non-quantized kernels to input dtype in VJPs
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename HighPrecisionTensor to NoScaleTensor
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Use NoScaleTensor in pure JAX impls where it was missing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

c47f329b

27 Aug, 2025 1 commit

[JAX] Decouple Recipe and ScalingMode (#1728) · c9508000

jberchtold-nvidia authored Aug 27, 2025



* Decouple recipe and scaling mode
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

c9508000

26 Aug, 2025 1 commit

[JAX] `ScaledTensor1x` to store `amax` (#2117) · 3d0ea80a

Phuong Nguyen authored Aug 26, 2025



* added amax as an optional arg
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

3d0ea80a

12 Aug, 2025 1 commit

[JAX] Support custom recipe and custom collection name when creating quantizer sets (#2059) · 6a4e871e

jberchtold-nvidia authored Aug 12, 2025



* Support setting collection name for quantizer set Flax variables in TransformerEngineBase flax module
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Support creating quantizer set from a recipe directly
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix debug error format string in gemm.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

6a4e871e

18 Jun, 2025 1 commit

[JAX] TensorUsage + FP8 GEMM with all layouts handling on BW (#1844) · 3a298e6b

Phuong Nguyen authored Jun 18, 2025



* TensorUsage + FP8 GEMM with all layouts handling on BW
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

3a298e6b

06 Jun, 2025 1 commit

[JAX] GroupedQuantizer and GroupedScaledTensor (#1666) · 7948779c

Phuong Nguyen authored Jun 06, 2025



* refactor the multi_stream utils + implement nvte_multi_tensor_quantize in TE/Common

* implement GroupedQuantizer and grouped_quantize in jaxx

* fix logical_axes_names for transpose tensor in ScaledTensor
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>

7948779c

06 May, 2025 1 commit

[JAX] Fix failing L2 JAX unit tests (#1735) · fe31af80

jberchtold-nvidia authored May 06, 2025



* Fix L2 test_custom_call_compute.py L2 tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix test_helper.py
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

fe31af80

29 Apr, 2025 1 commit

[JAX] Distributed Current Scaling (#1699) · 4ceb3d4c

jberchtold-nvidia authored Apr 28, 2025



* Update test_helper.py and add QuantizeConfig class for CurrentScaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* WIP distributed current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Distributed Current Scaling (debugging). Distributed implementation with replicated scale_inv works for layernorm_mlp but feels like a hack

Has different per-device scale_inv values, but jax.debug.print only shows one of them. Since we're telling JAX/XLA that this scale is replicated, I think it assumes all the values are equal. However, it doesn't actually check this, so it seems we are able to get away with per-device scales for current scaling but I am not sure how stable this will be and may randomly fail if us or the user changes partitioning at all or if XLA decides to actually act on the assumption that all these scale_invs are the same.
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Implement distributed current scaling by computing a global amax and scale before quantization
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add encoder and mnist tests for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Add primitive prefix to shardy unique_vars to prevent factor conflicts when performing unfused primitives for current scaling
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove scale_shape primitive arg that is no longer used
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix expected result on multiprocessing encoder test
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint fix
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update multiprocessing current scaling tolerances
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Uncomment test case that was disabled for testing
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove commented out debug line
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

4ceb3d4c

22 Apr, 2025 1 commit

[JAX] JAX Current Scaling (#1647) · 9a819334

jberchtold-nvidia authored Apr 22, 2025



* [JAX-Q] Single GPU current scaling for JAX
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix scale check dtype for MXFP8 scales affecting tests using assert_bitwise_scaled_tensors
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Address comments
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove cast to fp32 for norm primitives now that zero-centered gamma dtype issue is fixed
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Fix lint issue
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Remove unnecessary cast to fp32
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Lint
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

9a819334

09 Apr, 2025 1 commit

[JAX] Scaling Enum Abstracting (#1655) · 962d9c53

Phuong Nguyen authored Apr 09, 2025



* scaling enum abstract

* rm NVTE_ from ScalingMode names

* rework scaling mode enum in grouped gemm

* fix norm sharding

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

962d9c53

04 Apr, 2025 1 commit

[JAX] Flatten_axis for quantization and Sharding propagation fixes (#1644) · ff884e20

Phuong Nguyen authored Apr 04, 2025



* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add fatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ff884e20

01 Apr, 2025 1 commit

[JAX] Refactor + MXFP8 + GroupedGEMM (#1627) · cf9a7c2f

Phuong Nguyen authored Mar 31, 2025



* refactor + mxfp8

* added grouped gemm

* rename linear to dense

* added cublas init phase for groupedGemm

* relax the tol of test encoder multiprocessing mxfp8 by 0.001
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>

cf9a7c2f