- 02 Jan, 2026 1 commit
Kirthi Shankar Sivamani authored
Update copyright to include 2026

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
- 27 Dec, 2025 1 commit
xiaoxi-wangfj authored
* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization
  1. Fused `moe_permute_with_probs` + `Fp8Padding` and fused `moe_unpermute` + `Fp8Unpadding`, removing the explicit padding/unpadding of MoE experts, improving performance and reducing peak GPU memory usage.
  2. Added tests for the fused permute/pad and unpermute/unpad ops.
  Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
* [PyTorch/Common] Fuse permute+pad and unpermute+unpad: support with_merging_probs
  Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
* [PyTorch] Format code
  Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
* [Common] Perf: load expert_idx only once
  Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
* Fix: pad_offsets can be None
  Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
  Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
* Add padding + merging-probs bwd support (not yet tested)
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Fix garbage-initialized act grad
  Signed-off-by: tdophung <tdophung@nvidia.com>
* All tests passing for JAX permutation + pad
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Change tokens_per_experts APIs to num_out_tokens, with conservative worst-case padding allocation for the output buffer
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Change the permutation test to reduce test time
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Trigger PR refresh
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Format code
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Remove some test cases from the PyTorch side. Add a separate token_dispatch test for sanity, in case combine accidentally undoes a dispatch error in the roundtrip test. Distinguish between L0 and L2 test cases in JAX.
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Format code
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Remove a potential inefficiency when moving between CPU and GPU, remove a redundant primitive by using a new static bool for padding, and add an assert for align size
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Fix lint in JAX
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Account for JAX versions both newer and older than 0.8.2; adjust the GPU Triton binding accordingly
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Format code
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Fix typo
  Signed-off-by: tdophung <tdophung@nvidia.com>
---------
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>
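The fused permute+pad idea above can be sketched in plain Python. Everything here (the function names, the alignment constant, the list-based buffers) is an illustrative assumption, not the Transformer Engine API: the point is that gathering tokens by expert directly into a pre-padded output buffer removes the separate pad pass and its extra copy of the activations.

```python
# Conceptual sketch (NOT the Transformer Engine API) of fusing MoE token
# permutation with per-expert padding to an FP8-friendly row alignment.

ALIGN = 16  # hypothetical row alignment required by the FP8 GEMM


def pad_to(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def fused_permute_pad(tokens, expert_idx, num_experts, align=ALIGN):
    """Gather tokens by expert and write them straight into a padded
    output buffer, instead of permuting first and padding in a second pass."""
    counts = [0] * num_experts
    for e in expert_idx:
        counts[e] += 1
    padded = [pad_to(c, align) for c in counts]
    # Start offset of each expert's padded slot in the output buffer.
    offsets = [0]
    for p in padded[:-1]:
        offsets.append(offsets[-1] + p)
    out = [None] * sum(padded)  # None rows stand in for zero-padding
    cursor = list(offsets)
    row_map = []                # saved for the matching unpermute+unpad
    for tok, e in zip(tokens, expert_idx):
        out[cursor[e]] = tok
        row_map.append(cursor[e])
        cursor[e] += 1
    return out, row_map


tokens = ["t0", "t1", "t2", "t3", "t4"]
expert_idx = [1, 0, 1, 0, 1]       # which expert each token routes to
out, row_map = fused_permute_pad(tokens, expert_idx, num_experts=2, align=4)
# out is 8 rows: expert 0's tokens padded to 4 rows, then expert 1's.
```

The matching unpermute+unpad would simply read `out[row_map[i]]` for each original token, which is why the fused forward records `row_map`.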
- 09 Dec, 2025 1 commit
Teddy Do authored
Change order

Signed-off-by: tdophung <tdophung@nvidia.com>
- 25 Nov, 2025 1 commit
Teddy Do authored
* Change order of arguments to make JAX work
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Make num_experts a tl.constexpr again
  Signed-off-by: tdophung <tdophung@nvidia.com>
---------
Signed-off-by: tdophung <tdophung@nvidia.com>
- 10 Nov, 2025 1 commit
Teddy Do authored
* Move Triton to common and change paths
  Signed-off-by: tdophung <tdophung@nvidia.com>
* Formatting
  Signed-off-by: tdophung <tdophung@nvidia.com>
---------
Signed-off-by: tdophung <tdophung@nvidia.com>
- 20 Oct, 2025 1 commit
fzyzcjy authored
* Update permutation.py
  Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
* Update permutation.py
  Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
* Update transformer_engine/pytorch/triton/permutation.py
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Update transformer_engine/pytorch/triton/permutation.py
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
- 02 Oct, 2025 1 commit
hx authored
* Fix overflow of int32 in permute kernels
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
---------
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
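For context on the overflow fix, here is a minimal sketch of the bug class, with made-up sizes (not taken from the actual kernels): a flat element offset computed as `row * hidden` in 32-bit arithmetic wraps once the tensor exceeds 2**31 elements, so permute kernels at large token counts need 64-bit indexing.

```python
# Sketch of an int32 indexing overflow, simulated in pure Python.
# Sizes below are hypothetical; the real kernels index large MoE
# activation tensors where num_tokens * hidden_size can exceed 2**31.

INT32_MAX = 2**31 - 1


def wrap_int32(x):
    """Simulate two's-complement int32 wraparound, as happens when a
    CUDA/Triton kernel does the multiply in 32-bit registers."""
    return (x + 2**31) % 2**32 - 2**31


num_tokens, hidden = 300_000, 8192   # ~2.46e9 elements > INT32_MAX
row = 299_999
offset32 = wrap_int32(row * hidden)  # what a 32-bit kernel computes: negative garbage
offset64 = row * hidden              # the fix: do the indexing in 64-bit
```

Because the wrapped offset goes negative, the kernel would read or write far outside the tensor, which is exactly the failure mode a 64-bit index computation removes.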
- 06 Aug, 2025 1 commit
xiaoxi-wangfj authored
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
- 14 Jul, 2025 1 commit
hx authored
* Optimize permute
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Fix lint
  Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
- 14 Apr, 2025 1 commit
Autumn1998 authored
* Add support for the new recipe in permute_fusion; remove fp unpermute
  Signed-off-by: tongliu <tongliu@nvidia.com>
* Fix lint
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* Remove fp8 from index map
  Signed-off-by: Xin Yao <xiny@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Skip unsupported tests
  Signed-off-by: Xin Yao <xiny@nvidia.com>
---------
Signed-off-by: tongliu <tongliu@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: tongliu <tongliu@nvidia.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 14 Mar, 2025 1 commit
hx authored
Fix CPU device import error

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
- 18 Feb, 2025 1 commit
hx authored
* Add prob permute; fix fp8tensor
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Revert unnecessary changes in UT
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Remove unnecessary probs dtype conversion
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Keep the output nums if probs is not provided
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Refine the docstring
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Fix lint
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Use fp32 compute type
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Style fix
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Fix empty input return
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* Separate prob-related functions out
  Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
---------
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
- 27 Jan, 2025 1 commit
hx authored
* Add mask-based MoE permutation
* Rename moe_chunk_permute to moe_sort_chunks_by_indices
* Fix __all__ in pytorch/permutation.py
* Fix func/var names and typos; update tolerances in UT
---------
Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
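A conceptual sketch of the chunk-sort semantics the `moe_sort_chunks_by_indices` rename refers to; this pure-Python version illustrates the idea (reordering whole variable-size chunks of a tensor by an index list), not the library's actual implementation or signature.

```python
# Illustrative sketch (not the Transformer Engine implementation):
# reorder whole chunks of a row list according to sorted_indices,
# where chunk i contains split_sizes[i] consecutive rows.


def sort_chunks_by_indices(rows, split_sizes, sorted_indices):
    """Return rows with its chunks rearranged into sorted_indices order."""
    # Compute the start offset of each chunk.
    starts, acc = [], 0
    for s in split_sizes:
        starts.append(acc)
        acc += s
    # Emit chunks in the requested order.
    out = []
    for i in sorted_indices:
        out.extend(rows[starts[i]:starts[i] + split_sizes[i]])
    return out


rows = ["a0", "a1", "b0", "c0", "c1", "c2"]
# Chunks: A = [a0, a1], B = [b0], C = [c0, c1, c2]; order [2, 0, 1] -> C, A, B
out = sort_chunks_by_indices(rows, [2, 1, 3], [2, 0, 1])
```

In the MoE setting the "chunks" are per-expert token groups, which is why this op pairs naturally with the mask-based permutation added in the same commit.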