Commits · c638c4362b0286e032a74d49055b890c4ca27061 · OpenDAS / TransformerEngine

12 Apr, 2025 1 commit

[QA] Extend error handling (#1660) · c638c436

linxiddd authored Apr 12, 2025



[QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions
Signed-off-by: Linxi Ding <linxid@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

c638c436

11 Apr, 2025 3 commits

[PyTorch] Add option in activation ops to cache input in FP8 (#1665) · 04642bf8

Tim Moon authored Apr 11, 2025



* Add option to cache activation input in FP8
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid casting to FP8 transpose
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip input caching if device is not supported
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add documentation that FP8 input caching is experimental
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

04642bf8

Make shape cache invalidation more conservative. (#1670) · dfb3c486

kwyss-nvidia authored Apr 11, 2025



Repeated calls to nvte_shape should not invalidate
previous data pointers.

It would be possible to avoid unnecessary comparisons
by duplicating some of the logic from shape() so that
the cache is only relevant when columnwise shapes are
involved. Whether this code duplication is preferable
to the comparisons that arise from the by-value semantics
of reusing shape is a judgment call.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

dfb3c486

Add user to TE CI (#1669) · 2856c3e0

Kirthi Shankar Sivamani authored Apr 10, 2025

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

2856c3e0

10 Apr, 2025 1 commit

Blockwise scaling linear quantization recipe (#1559) · a8f0fe03

kwyss-nvidia authored Apr 10, 2025



* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unnecessary type converter change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm like interfaces and naming.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Recipe setup for Linear modules.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use 12.9 feature test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Run against tensor dumps from internal library.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update FIXME to TODO with linked issue.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update full recompute feature to save recipe.

The recompute context uses the same recipe
and fp8 settings as the original fwd pass.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback. Avoid reusing quantizer objects.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update logic in module.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update for PP bug.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test numerics.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update force_power_of_2 scales in the recipe.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update usage method to satisfy upstream changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* fix subchannel recipe in distributed test with bf16 gather
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Edit and cleanup BF16 gather code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test import.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* support columnwise only mode to 1D quantize kernel
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format and move enum
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip alloc.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* try async bf16 gather
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format python code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Document and type code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytorch lint errors.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Don't set high precision dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add test for sanity and CG; fix CG for sequential?
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Keep make_quantizers API stable

Update num_quantizers instead to pass cuda_graph tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix import name.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rename recipe method.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip grouped linear sanity test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Set usage before BF16 gather.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* refactor for nvte_quantize_v2
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup nvte_quantize_v2
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Test fp32 scales.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Disable CUDA graph.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Simplify layernorm linear
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup layernorm linear.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* LayerNorm linear bwd gather logic.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Communication updates.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update transformer_engine/pytorch/ops/op.py

Apply MR comment change.
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>

* Lint fix.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Enable cuda graph tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reduce chance of spurious failure and reword.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Review suggestions from @timmoon10
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update CPP tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update common.h
Signed-off-by: Xin Yao <yaox12@outlook.com>

* Update test_float8blockwisetensor.py
Signed-off-by: Xin Yao <yaox12@outlook.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>

a8f0fe03

09 Apr, 2025 3 commits

[PyTorch] Debug checkpointing with te.Sequential (#1629) · 0da60449

Tim Moon authored Apr 09, 2025



* Debug checkpointing with te.Sequential
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0da60449

[PyTorch] Explicitly specify quantized tensor usages needed for linear op backward (#1646) · 20e95ba3

Tim Moon authored Apr 09, 2025

Explicitly specify quantized tensor usages needed for linear op backward
Signed-off-by: Tim Moon <tmoon@nvidia.com>

20e95ba3

[JAX] Scaling Enum Abstracting (#1655) · 962d9c53

Phuong Nguyen authored Apr 09, 2025



* scaling enum abstract

* rm NVTE_ from ScalingMode names

* rework scaling mode enum in grouped gemm

* fix norm sharding

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

962d9c53

08 Apr, 2025 2 commits

[PyTorch] Debug GEMM refactor (#1652) · 9d4e11ea

Tim Moon authored Apr 08, 2025



* Minor stylistic tweaks and typo fixes

Review suggestions from @ptrendx
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix incorrect col strides for MXFP8 matrices
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

9d4e11ea

Enable reuse of dummy wgrad tensor (#1651) · ba5dc5dd

vasunvidia authored Apr 08, 2025



* Use dummy wgrads for lower memory consumption
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Bug fix to avoid sharing gradients.
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Disable automatic use of batch_p2p_comm for CP2
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change weight to origin_weight for LN_LINEAR
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ba5dc5dd

07 Apr, 2025 5 commits

Subchannel Block quantized GEMM (#1545) · db2aaa9e

kwyss-nvidia authored Apr 07, 2025



* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unnecessary type converter change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm like interfaces and naming.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Refactor GEMM param canonicalization

Configure A and B matrices separately. Have separate code path for each scaling mode.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Prune number of tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

db2aaa9e

Removing NVTE_NO_SCALING (#1650) · b362a6e0

Phuong Nguyen authored Apr 07, 2025



* rm no scaling enum
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* update jax enum
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

b362a6e0

Support FP8 primary weight in FSDP training (#1630) · c84d1708

Jianbin Chang authored Apr 07, 2025



Support fp8 primary weight in fsdp training
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c84d1708

Fix cpp warnings (#1639) · a3ba4dff

Xin Yao authored Apr 07, 2025



* fix cpp warning
Signed-off-by: Xin Yao <xiny@nvidia.com>

* more fix
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>

a3ba4dff

[PyTorch][Common] Refactor RoPE (#1626) · ba605f18

Xin Yao authored Apr 07, 2025



* refactor to add cp support for sbhd/bshd
Signed-off-by: Xin Yao <xiny@nvidia.com>

* support interleaved
Signed-off-by: Xin Yao <xiny@nvidia.com>

* format
Signed-off-by: Xin Yao <xiny@nvidia.com>

* add interleaved to RotaryPositionEmbedding in test
Signed-off-by: Xin Yao <xiny@nvidia.com>

* update
Signed-off-by: Xin Yao <xiny@nvidia.com>

* merge sbhd/bshd and thd functions
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>

ba605f18

04 Apr, 2025 6 commits

[JAX] Flatten_axis for quantization and Sharding propagation fixes (#1644) · ff884e20

Phuong Nguyen authored Apr 04, 2025



* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add flatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ff884e20

[JAX-Q] Distributed MXFP8 flax layer tests (#1643) · be1f647c

jberchtold-nvidia authored Apr 04, 2025

MXFP8 flax layer tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

be1f647c

Blockwise float8 quantizer and quantized tensor class (#1513) · 1bbeab1c

kwyss-nvidia authored Apr 03, 2025



* Blockwise float8 quantizer and quantized tensor class.

The classes are configurable for 128x128 blocksize
and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively.

Scale tensors are stored in a format amenable to matrix multiplication,
however the integration of matmul is deferred as a separate story.

Fusions of quantization and DBIAS or activation functions are not yet
implemented, and the dequantization is currently implemented in torch.

Tests for quantization are included in C++ and pytorch layers, with
exact comparison to reference quantizer behavior as well as an attempt
to hit interesting branches through the API such as tensor creation
in pytorch and CPP and dequantization of row and columnwise usage.

Two CUDA kernels for quantization are included, and are direct ports
of equivalents in the kitchen repository, where a subchannel recipe
has been used for end to end training.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Apply linting changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Alignment for 1D scaling for GEMM edge case.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API name.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix merge conflict with name change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use common tensor map API.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API to use two scaling mode enums.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update some call sites.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Tests for torch tensor API surface.

Since the quantized tensor is a tensor
subclass, these tests exercise torch hooks.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse scale calculation between quantizer refs.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Save memory by dropping reference to saved tensors.

Issues previously observed are solved.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove constexpr parameters from kernel.

Code size is reduced with fewer constexpr params.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Merge conflict from rebase.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add shape implementations for block scaling.

nvte_shape was added upstream. Logic added
for block scaled fp8.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Move benchmark to te_playground
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove amax_epsilon and pow_2_scales from tensor.

Hardcodes the default values.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Lint changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fixup MR changes that broke.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Safer ifdef in kernel.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Documentation prose.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse compute_scale function from Current Scaling.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Bugfix on inf_value scale refactor.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove qopt calls from test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytest list.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to reference scale calc.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use ptx.cuh functions instead of cde.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update shape logic with allocation and reuse shape.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Usage defaults MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Copyright and header guard.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Updating torch dispatch code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix exception type.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use TypeInfo
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update CS scale update test to use updated ref impl
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX scaling mode enum
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip tests on Lovelace
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

1bbeab1c
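
The block scaling described in the 1bbeab1c commit message above (one scale per 128x128 tile or per 1x128 strip) can be illustrated with a small sketch. This is not the TransformerEngine kernel: the function name `block_scales`, its parameters, and the choice to round scales down to a power of two are assumptions made for illustration only. The only fixed fact used is that 448 is the largest finite value of the FP8 E4M3 format.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def block_scales(matrix, block_rows, block_cols, pow2_scales=True):
    """Compute one quantization scale per tile of a 2D list of floats.

    Hypothetical helper: tiles are block_rows x block_cols, e.g. 128x128
    for 2D block scaling or 1x128 for 1D block scaling.
    """
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block_rows):
        scale_row = []
        for c0 in range(0, cols, block_cols):
            # Largest absolute value inside this tile (edge tiles may be smaller).
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block_cols, cols))
            )
            if amax == 0.0:
                scale = 1.0  # all-zero tile: any scale works, pick 1
            else:
                scale = FP8_E4M3_MAX / amax
                if pow2_scales:
                    # Assumption: round down so scaled values stay in range.
                    scale = 2.0 ** math.floor(math.log2(scale))
            scale_row.append(scale)
        scales.append(scale_row)
    return scales


# Tiny demo with 1x2 strips and 2x2 tiles standing in for 1x128 and 128x128.
demo = [
    [1.0, -7.0, 0.0, 0.0],
    [448.0, 2.0, 0.5, 0.25],
]
strip_scales = block_scales(demo, 1, 2)
tile_scales = block_scales(demo, 2, 2)
```

Multiplying each tile by its scale before casting keeps the tile's largest magnitude within the FP8 range; dequantization divides by the same scale. Smaller tiles track local magnitudes more closely at the cost of storing more scales.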

[PyTorch] Debug weight matrix usages for dgrad GEMM (#1637) · 3e305f72

Tim Moon authored Apr 03, 2025



Make sure that weight matrix has required usages for dgrad GEMM
Signed-off-by: Tim Moon <tmoon@nvidia.com>

3e305f72

Introduce NVSHMEM based communication API for pytorch (#1430) · afa1f1b0

gdengk authored Apr 03, 2025



* add nvshmem based api support
Signed-off-by: gdeng <gdeng@nvidia.com>

* fix lint and license issue
Signed-off-by: gdeng <gdeng@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* remove asset
Signed-off-by: gdeng <gdeng@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix the lib
Signed-off-by: gdeng <gdeng@nvidia.com>

* address comments
Signed-off-by: gdeng <gdeng@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: gdeng <gdeng@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

afa1f1b0

Update list of authorized CI users (#1645) · da42e212

Tim Moon authored Apr 03, 2025

Update list of authorized users
Signed-off-by: Tim Moon <tmoon@nvidia.com>

da42e212

03 Apr, 2025 1 commit

Fix fp8_buf for Linear and LayerNormLinear (#1633) · e3e0375d

Kirthi Shankar Sivamani authored Apr 02, 2025



* Fix fp8_buf for Linear and LayerNormLinear
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

e3e0375d

02 Apr, 2025 1 commit

Update list of authorized CI users (#1636) · 31f5c2d8

Tim Moon authored Apr 02, 2025

Signed-off-by: Tim Moon <tmoon@nvidia.com>

31f5c2d8

01 Apr, 2025 6 commits

[JAX] Backward compatible Fixes (#1631) · 160be219

Phuong Nguyen authored Apr 01, 2025



* expose NVTE_FP8_COLLECTION_NAME, update_collections, get_delayed_scaling

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

160be219

[PyTorch] Debug NCCL communication overlapping in linear backward with FP8 data (#1620) · b0ad8ef0

Tim Moon authored Apr 01, 2025



* Overlap input all-gather with dgrad GEMM in FP8 linear layers
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add missing docstring
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b0ad8ef0

[PyTorch] Make breaking change in `InferenceParams.init` more explicit (#1619) · 56653520
Charlene Yang authored Apr 01, 2025

56653520

Bugfixes for LayerNormMLP (#1625) · 69365f88

guyueh1 authored Mar 31, 2025



* Fix GEMM+RS overlap for LayerNormMLP
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

* Fix error LayerNormMLP param.grad is None
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update dtype for wgrad GEMM
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

69365f88

[PyTorch] fix fuse_wgrad_accumulation in LayerNormMLP backward (#1618) · 77d64552

Marks101 authored Apr 01, 2025



* [PyTorch] fix general_gemm argument out_dtype in LayerNormMLP backward
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

77d64552

[JAX] Refactor + MXFP8 + GroupedGEMM (#1627) · cf9a7c2f

Phuong Nguyen authored Mar 31, 2025



* refactor + mxfp8

* added grouped gemm

* rename linear to dense

* added cublas init phase for groupedGemm

* relax the tol of test encoder multiprocessing mxfp8 by 0.001
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>

cf9a7c2f

31 Mar, 2025 3 commits

[PyTorch] Support default process group with FP8 current scaling (#1621) · be055eb0

Tim Moon authored Mar 31, 2025



* Handle case where FP8 current scaling quantizer gets default process group
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid canonicalizing TP group since it may not be initialized
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

be055eb0

[JAX] Add fast path for causal masking with segment IDs. (#1601) · 3b1f5a11

Michael Goldfarb authored Mar 31, 2025

Add fast path for causal masking with segment IDs.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>

3b1f5a11

fix a sync race error of softmax_lse in CP+THD+P2P (#1624) · 76187a5e

Xiaowei Ren authored Mar 31, 2025

fix a race error softmax_lse
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

76187a5e

27 Mar, 2025 1 commit

[PyTorch] Add tests for current scaling; misc related fixes (#1606) · 3bcd7f6f

Kirthi Shankar Sivamani authored Mar 27, 2025



* Cleanup sanity tests and add CS recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix sanity test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix CG capture with CS recipe
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix ops for CG
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

3bcd7f6f

25 Mar, 2025 7 commits

[PyTorch] Optimize MXFP8 all-gathers (#1581) · 0356010c

Tim Moon authored Mar 25, 2025



* Coalesce NCCL all-gathers for MXFP8 all-gather
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add missing import
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Cache quantized input tensor after linear module forward pass
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix linter warnings
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid unnecessarily allocating layernorm output in LayerNormLinear/LayerNormMLP
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0356010c

[PyTorch] Minor fixes for TE 2.2 (#1589) · 65c2798a

Charlene Yang authored Mar 26, 2025



* skip cuDNN 9.8 for KV caching
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert from max_seqlen_kv to max_sequence_length for InferenceParams
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename test_paged_attn to test_kv_cache
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove redundant None returns in bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add debug flags when no backend is found
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* skip kv_cache_accuracy tests for cuDNN 9.8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* truncate length of cu_seqlens for consistency with q/k/v shape
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add back padding_brcm for fused attn tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* re-enable kv_cache_accuracy test for 9.8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix cuDNN search dir
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fixes based on review
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove extra empty line
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

65c2798a

Fix mxfp8 columnwise data missing (#1593) · abbdd769

guyueh1 authored Mar 25, 2025



* Fix mxfp8 columnwise data missing when switching from validation to training
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>

* Fix when you interleave training and inference
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>

* refactor
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* rm useless code
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>

* Update transformer_engine/pytorch/module/base.py
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: guyueh1 <140554423+guyueh1@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix linter warnings
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
Signed-off-by: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

abbdd769

[PyTorch] Defer torch compilation steps until first function call (#1599) · cf00d537

Peter St. John authored Mar 25, 2025



* Defer torch compilation steps until first function call
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix function call in smoke test
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

cf00d537

[PyTorch] Fix issues for MCore DDP in grouped GEMM. (#1609) · b59d1d8b

Li Tao authored Mar 26, 2025



fix mcore DDP error
Signed-off-by: lit <lit@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

b59d1d8b

Remove deprecated interval arg to delayed scaling recipe (#1607) · 945a559b

Kirthi Shankar Sivamani authored Mar 25, 2025

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

945a559b

[JAX] Fixing importing in the encoder examples (#1600) · 3dc8c6bc

Phuong Nguyen authored Mar 25, 2025

import te before te_jax
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

3dc8c6bc