Commits · 80c5079cd5058ab5b5af456ad27c892e9f825f82 · OpenDAS / TransformerEngine

23 May, 2025 1 commit
- [DCU] surpport blockwise int8 quant · 80c5079c
  yuguo authored May 23, 2025
  
  80c5079c
20 May, 2025 2 commits
- [DCU] surpport delay_wgrad_compute in batchgemm · 460b006c
  yuguo authored May 20, 2025
  
  460b006c
- [DCU] variable ub streams add NVTE_UB_STREAM_NUMS · 196a213f
  yuguo authored May 20, 2025
  
  196a213f
14 May, 2025 2 commits
- [Workaround] multi tensor scale acc restrictions · 64d522ac
  wenjh authored May 14, 2025
  
  64d522ac
- Close Env NVTE_FORCE_ROCM_GEMM after tested gemm · 28726eaf
  wenjh authored May 14, 2025
  
  28726eaf
08 May, 2025 3 commits

[DCU] add batchgemm test · 9d0f1c9b
yuguo authored May 08, 2025

9d0f1c9b

[Workaround] Force NVTE_FORCE_ROCM_GEMM=1 · 6dfe66e9

wenjh authored May 08, 2025



The acc problem in test_grouped_linear_accuracy and test_grouped_gemm is
because calc test out and ref out using diff kernel.
Make NVTE_FORCE_ROCM_GEMM=1 can force these tests to call rocm gemm using
same kernel.
Signed-off-by: wenjh <wenjh@sugon.com>

6dfe66e9

[ROCBLAS_GEMM] Default use of hipMallocAsync · 7a47930f

wenjh authored May 08, 2025



Default use of hipMallocAsync rather than hipMalloc in rocblas_gemm and
add support of fp16_fp16_fp32 in rocblas_gemm.
Signed-off-by: wenjh <wenjh@sugon.com>

7a47930f

29 Apr, 2025 2 commits
- [DCU] fix fsdp2 · 16de530e
  yuguo authored Apr 29, 2025
  
  16de530e
- [PytorchUnitTest] Fix errors while running tests · 86f2e9a9
  wenjh authored Apr 29, 2025
```
Signed-off-by: wenjh <wenjh@sugon.com>
```
  86f2e9a9
28 Apr, 2025 1 commit
- [DCU] fix bad alloc · 11b6b7e4
  yuguo authored Apr 28, 2025
  
  11b6b7e4
18 Apr, 2025 1 commit

Split wgrad&dgrad from backward() to support a2a overlap (#1653) · 9f8aaddf

Hongbin Liu authored Apr 18, 2025



* split wgrad for GroupedLinear
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* support wgrad split for linear and ln_linear
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* add comments and fix WeightGradStore
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* support bias and fix unit tests
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* minor fix
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* support fuse_grad_accumulation=false
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add wgrad split for layernorm_mlp
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* minor fix
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix unittest
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* add unittest for distributed interface apply Dener's suggestion
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fix
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* replace split_bw with delay_wgrad_compute
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Update transformer_engine/pytorch/module/layernorm_mlp.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/pytorch/module/linear.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update transformer_engine/pytorch/module/layernorm_linear.py
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* remove comments
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

---------
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

9f8aaddf

17 Apr, 2025 1 commit
- [DCU] tmp fix overlap test · b9ec4909
  yuguo authored Apr 17, 2025
  
  b9ec4909
16 Apr, 2025 1 commit
- [DCU] tmp fix overlap allmethod · 07b750a2
  yuguo authored Apr 16, 2025
  
  07b750a2
15 Apr, 2025 3 commits

Add adam bf16 state with original fp32 kernel (#1640) · 86928e07

Li Tao authored Apr 16, 2025



* support adam bf16 state
Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* use fp32 kernel but keep bf16 optimizer states to save memory
Signed-off-by: lit <lit@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Signed-off-by: lit <lit@nvidia.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

86928e07

[PyTorch] More precise test for the CPU offloading. (#1668) · 66d6afbf

Paweł Gadziński authored Apr 15, 2025



* test change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* test fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* small changes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* small changes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* test
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* clear
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* base
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

66d6afbf

[PyTorch] Fix for checkpointing for callables. (#1679) · aee78831

Paweł Gadziński authored Apr 15, 2025



* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added test
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* test change
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* changed the test
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

aee78831

14 Apr, 2025 4 commits

[PyTorch] check and try to generate fp8 weight transpose cache before dgrad backward (#1648) · 5fdd7bb9

Jianbin Chang authored Apr 15, 2025



* Add fp8 weight transpose cache check in backward, and regenerated it if it does not exist
Signed-off-by: jianbinc <shjwudp@gmail.com>

* Properly handle fsdp shard model weight input.
Signed-off-by: jianbinc <shjwudp@gmail.com>

* move Float8Tensor to QuantizedTensor in cast_master_weights_to_fp8 UT
Signed-off-by: jianbinc <shjwudp@gmail.com>

* handle Float8TensorBase issue
Signed-off-by: jianbinc <shjwudp@gmail.com>

* fix bug in activation recompute
Signed-off-by: jianbinc <shjwudp@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

5fdd7bb9

[PyTorch][MoE] Enable New Recipes for Grouped Linear (#1525) · 4c9626e7

Xin Yao authored Apr 15, 2025



* Enable MXFP8 and Per-Tensor Current Scaling for Grouped Linear
Signed-off-by: Xin Yao <xiny@nvidia.com>

* enable float8blockwise
Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* update
Signed-off-by: Xin Yao <xiny@nvidia.com>

* remove grouped linear parallel mode test
Signed-off-by: Xin Yao <xiny@nvidia.com>

* update test
Signed-off-by: Xin Yao <xiny@nvidia.com>

* resolve comments
Signed-off-by: Xin Yao <xiny@nvidia.com>

* internal=False for now
Signed-off-by: Xin Yao <xiny@nvidia.com>

* remove unused import
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

4c9626e7

[DCU] tmp fix · 8fb50d09
yuguo authored Apr 14, 2025

8fb50d09

[MoE] Support new fp8 recipes for permute_fusion (#1649) · c8e7cc02

Autumn1998 authored Apr 14, 2025



* add support for new recipe on permute_fusion, rm fp unpermute
Signed-off-by: tongliu <tongliu@nvidia.com>

* fix lint
Signed-off-by: Xin Yao <xiny@nvidia.com>

* remove fp8 from index map
Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* skip unsupported tests
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: tongliu <tongliu@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: tongliu <tongliu@nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

c8e7cc02

11 Apr, 2025 2 commits

[PyTorch] Add option in activation ops to cache input in FP8 (#1665) · 04642bf8

Tim Moon authored Apr 11, 2025



* Add option to cache activation input in FP8
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid casting to FP8 transpose
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip input caching if device is not supported
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add documentation that FP8 input caching is experimental
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

04642bf8

[DCU] tmp fix p2p overlap · dfd264c3
yuguo authored Apr 11, 2025

dfd264c3

10 Apr, 2025 1 commit

Blockwise scaling linear quantization recipe (#1559) · a8f0fe03

kwyss-nvidia authored Apr 10, 2025



* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unecessary type converter change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm like interfaces and naming.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Recipe setup for Linear modules.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use 12.9 feature test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Run against tensor dumps from internal library.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update FIXME to TODO with linked issue.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update full recompute feature to save recipe.

The recompute context uses the same recipe
and fp8 settings as the original fwd pass.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback. Avoid reusing quantizer objects.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update logic in module.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update for PP bug.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test numerics.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update force_power_of_2 scales in the recipe.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update usage method to satisfy upstream changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* fix subchannel recipe in distributed test with bf16 gather
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Edit and cleanup BF16 gather code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test import.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* support columnwise only mode to 1D quantize kernel
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format and move enum
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip alloc.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* try async bf16 gather
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format python code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Document and type code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytorch lint errors.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Dont set high precision dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add test for sanity and CG; fix CG for sequential?
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Keep make_quantizers API stable

Update num_quantizers instead to pass cuda_graph tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix import name.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rename recipe method.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Skip grouped linear sanity test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Set usage before BF16 gather.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* refactor for nvte_quantize_v2
Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Format code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup nvte_quantize_v2
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Test fp32 scales.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Disable CUDA graph.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Simplify layernorm linear
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Cleanup layernorm linear.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* LayerNorm linear bwd gather logic.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Communication updates.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update transformer_engine/pytorch/ops/op.py

Apply MR comment change.
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>

* Lint fix.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Enable cuda graph tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reduce chance of spurious failure and reword.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Review suggestions from @timmoon10
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update CPP tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update common.h
Signed-off-by: Xin Yao <yaox12@outlook.com>

* Update test_float8blockwisetensor.py
Signed-off-by: Xin Yao <yaox12@outlook.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>

a8f0fe03

09 Apr, 2025 3 commits

[PyTorch] Debug checkpointing with te.Sequential (#1629) · 0da60449

Tim Moon authored Apr 09, 2025



* Debug checkpointing with te.Sequential
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

0da60449

[PyTorch] Explicitly specify quantized tensor usages needed for linear op backward (#1646) · 20e95ba3
Tim Moon authored Apr 09, 2025
```
Explicitly specify quantized tensor usages needed for linear op backward
Signed-off-by: Tim Moon <tmoon@nvidia.com>
```
20e95ba3
[DCU] fix · d8992315
yuguo authored Apr 09, 2025

d8992315

07 Apr, 2025 3 commits

Subchannel Block quantized GEMM (#1545) · db2aaa9e

kwyss-nvidia authored Apr 07, 2025



* Add GEMM logic for blockwise quantized tensors.

GEMM test cases included in pytorch integration.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update NVTE_BLOCK_SCALING for GEMM.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gate feature on CUDA 12.9
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Gemm typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove unecessary type converter change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reflect epilogue availability and test supported epilogues.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* GEMM simplifications from recipe branch.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Format py code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update GEMM DGelu tests to match support depending on output dtype.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Force pow2Scales in GEMM
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add GEMM test to pytorch test suite.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update import for GEMM test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add license.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update test gemm supported predicate.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use sgemm like interfaces and naming.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Rewrite GEMM comment.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR Feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Refactor GEMM param canonicalization

Configure A and B matrices separately. Have separate code path for each scaling mode.
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Prune number of tests.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

db2aaa9e

Support FP8 primary weight in FSDP training (#1630) · c84d1708

Jianbin Chang authored Apr 07, 2025



Support fp8 primary weight in fsdp training
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

c84d1708

[PyTorch][Common] Refactor RoPE (#1626) · ba605f18

Xin Yao authored Apr 07, 2025



* refactor to add cp support for sbhd/bshd
Signed-off-by: Xin Yao <xiny@nvidia.com>

* support interleaved
Signed-off-by: Xin Yao <xiny@nvidia.com>

* format
Signed-off-by: Xin Yao <xiny@nvidia.com>

* add interleaved to RotaryPositionEmbedding in test
Signed-off-by: Xin Yao <xiny@nvidia.com>

* update
Signed-off-by: Xin Yao <xiny@nvidia.com>

* merge sbhd/bshd and thd functions
Signed-off-by: Xin Yao <xiny@nvidia.com>

---------
Signed-off-by: Xin Yao <xiny@nvidia.com>

ba605f18

04 Apr, 2025 1 commit

Blockwise float8 quantizer and quantized tensor class (#1513) · 1bbeab1c

kwyss-nvidia authored Apr 03, 2025



* Blockwise float8 quantizer and quantized tensor class.

The classes are configurable for 128x128 blocksize
and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively.

Scale tensors are stored in a format emenable for matrix multiplication,
however the integration of matmul is deferred as a separate story.

Fusions of quantization and DBIAS or activation functions are not yet
implemented, and the dequantization is currently implemented in torch.

Tests for quantization are included in C++ and pytorch layers, with
exact comparison to reference quantizer behavior as well as an attempt
to hit interesting branches through the API such as tensor creation
in pytorch and CPP and dequantization of row and columnwise usage.

Two CUDA kernels for quantization are included, and are direct ports
of equivalents in the kitchen repository, where a subchannel recipe
has been used for end to end training.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Apply linting changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Alignment for 1D scaling for GEMM edge case.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API name.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix merge conflict with name change.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use common tensor map API.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API to use two scaling mode enums.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix typo.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update some call sites.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Tests for torch tensor API surface.

Since the quantized tensor is a tensor
subclass, these tests exercise torch hooks.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse scale calculation between quantizer refs.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Save memory by dropping reference to saved tensors.

Issues previously observed are solved.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove constexpr parameters from kernel.

Code size is reduced with fewer constexpr params.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Merge conflict from rebase.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add shape implementations for block scaling.

nvte_shape was added upstream. Logic added
for block scaled fp8.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Move benchmark to te_playground
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove amax_epsilon and pow_2_scales from tensor.

Hardcodes the default values.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Lint changes.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fixup MR changes that broke.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Safer ifdef in kernel.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Documentation prose.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse compute_scale function from Current Scaling.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Bugfix on inf_value scale refactor.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove qopt calls from test.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytest list.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to reference scale calc.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use ptx.cuh functions instead of cde.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update shape logic with allocation and reuse shape.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Usage defaults MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Copyright and header guard.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Updating torch dispatch code.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix exception type.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use TypeInfo
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update CS scale update test to use updated ref impl
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX scaling mode enum
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip tests on Lovelace
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

1bbeab1c

01 Apr, 2025 1 commit
- [DCU] fix fp8 · fbee8990
  yuguo authored Apr 01, 2025
  
  fbee8990
31 Mar, 2025 1 commit

[PyTorch] Support default process group with FP8 current scaling (#1621) · be055eb0

Tim Moon authored Mar 31, 2025



* Handle case where FP8 current scaling quantizer gets default process group
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix linter warning
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid canonicalizing TP group since it may not be initialized
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

be055eb0

27 Mar, 2025 1 commit

[PyTorch] Add tests for current scaling; misc related fixes (#1606) · 3bcd7f6f

Kirthi Shankar Sivamani authored Mar 27, 2025



* Cleanup sanity tests and add CS recipe tests
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix sanity test
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix CG capture with CS recipe
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix ops for CG
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

3bcd7f6f

25 Mar, 2025 3 commits

[PyTorch] Minor fixes for TE 2.2 (#1589) · 65c2798a

Charlene Yang authored Mar 26, 2025



* skip cuDNN 9.8 for KV caching
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert from max_seqlen_kv to max_sequence_length for InferenceParams
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename test_paged_attn to test_kv_cache
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove redundant None returns in bwd
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add debug flags when no backend is found
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* skip kv_cache_accuracy tests for cuDNN 9.8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* truncate length of cu_seqlens for consistency with q/k/v shape
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add back padding_brcm for fused attn tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* re-enable kv_cache_accuracy test for 9.8
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix cuDNN search dir
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fixes based on review
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove extra empty line
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

65c2798a

[PyTorch] Defer torch compilation steps until first function call (#1599) · cf00d537

Peter St. John authored Mar 25, 2025



* Defer torch compilation steps until first function call
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix function call in smoke test
Signed-off-by: Peter St. John <pstjohn@nvidia.com>

---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

cf00d537

Remove deprecated interval arg to delayed scaling recipe (#1607) · 945a559b
Kirthi Shankar Sivamani authored Mar 25, 2025
```
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```
945a559b

24 Mar, 2025 1 commit

Fix issues in fused_attn_bwd (#1574) · e14d1472

Xiaowei Ren authored Mar 24, 2025



* fix dtypes of fused_attn_bwd in CP+A2A
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix dtypes of fused_attn_bwd in CP+P2P
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix amax_per_step
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* clone scaling factors of fwd quantizers
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix fwd quantizers of CP+P2P
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* dequantize fp8 out in CP unit test
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* delete redundant None in FusedAttnFunc bwd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

e14d1472

22 Mar, 2025 1 commit

[PyTorch] Enable fp8_primary_weights for current scaling (#1544) · 86813893

Kunlun Li authored Mar 22, 2025



* Enable fp8_primary_weights for current scaling
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Use different cast_master_weights_to_fp8 functions depending on the type of quantizer
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* All amaxes of model_weights should participate in reduce-max
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Clear _high_precision_init_val automatically in cast_master_weights_to_fp8 function
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Merge all all-reduce on amaxes into one NCCL kernel
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add unit tests for multi_tensor_compute_scale_and_scale_inv and preserve_high_precision_init_val
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Fix conflicts
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add unit test for cast_master_weights_to_fp8
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* use mock group to initialize fp8_autocast to avoid reduction of amax_history by fp8_autocast_exit
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Remove with_computing_amax and with_computing_scale
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Move replace_raw_data from QuantizedTensor to utils.py
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Remove allow_empty_output argument from nvte_compute_amax and set it always be true
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Rename import guard of recipe_common.cuh to be align with other import guards
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Add unit test for replace_raw_data
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Add test_replace_raw_data into qa/L0_pytorch_unittest/test.sh
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Minor changes in comments
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Add randomness to the unit test of replace_raw_data
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* (Maybe need revert) Add tex.quantize_to_fragment
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* (Maybe needsto rrevert) Use nvte_quantize_noop in quantize_to_fragment
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Fix lint error
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Move high_precision_init_val test and replace_raw_data test to test_sanity.py
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove test_fp8_model_init.py and test_replace_raw_data.py
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Remove cast_master_weights_to_fp8 and replace_raw_data from __all__ of tensor.__init__.py
Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Move FP8 casting logic back from C++ tex funcs to Python
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* Remove unimplemented function from header
Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

86813893

20 Mar, 2025 1 commit
- [DCU] Preliminary adaptation · c520cba3
  yuguo authored Mar 20, 2025
  
  c520cba3