- 27 Apr, 2025 1 commit
-
-
wenjh authored
Reference parameters of rmsnorm cause the program to crash with a 'nil' error. Signed-off-by: wenjh <wenjh@sugon.com>
-
- 25 Apr, 2025 5 commits
-
-
-
yuguo authored
-
panning authored
The Python API `rmsnorm_forward` returns 3 values rather than 2 as of V2.3. Signed-off-by: wenjh <wenjh@sugon.com>
-
-
yuguo authored
-
- 24 Apr, 2025 2 commits
-
-
wenjh authored
Because the warp size differs between NVIDIA (32) and DTK (64), all OperatorTest/CTDBiasTestSuite.TestCTDBias/* tests fail except:
* OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat32X65536X128
* OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat16X65536X128
* OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xbfloat16X65536X128
* OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat8e5m2X65536X128
* OperatorTest/CTDBiasTestSuite.TestCTDBias/bfloat16Xfloat8e4m3X65536X128
This commit is intended to fix the failing tests. Signed-off-by: wenjh <wenjh@sugon.com>
-
wenjh authored
Because the compiler generates incorrect code, the following test cases crash:
* OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X2048X12288
* OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X65536X128
* OperatorTest/CTTestSuite.TestCastTranspose/bfloat16Xbfloat16X256X65536
This commit is intended to fix these test cases. Signed-off-by: wenjh <wenjh@sugon.com>
-
- 23 Apr, 2025 2 commits
- 22 Apr, 2025 1 commit
-
-
yuguo authored
-
- 18 Apr, 2025 1 commit
-
-
yuguo authored
-
- 17 Apr, 2025 2 commits
- 16 Apr, 2025 1 commit
-
-
yuguo authored
-
- 14 Apr, 2025 1 commit
-
-
yuguo authored
-
- 11 Apr, 2025 2 commits
-
-
-
yuguo authored
-
- 10 Apr, 2025 2 commits
- 09 Apr, 2025 2 commits
-
-
-
yuguo authored
-
- 08 Apr, 2025 2 commits
- 01 Apr, 2025 5 commits
-
-
-
guyueh1 authored
* Fix GEMM+RS overlap for LayerNormMLP Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* Fix error LayerNormMLP param.grad is None Signed-off-by: Guyue Huang <guyueh@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update dtype for wgrad GEMM Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
-
Marks101 authored
* [PyTorch] fix general_gemm argument out_dtype in LayerNormMLP backward Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
yuguo authored
-
Phuong Nguyen authored
* refactor + mxfp8
* added grouped gemm
* rename linear to dense
* added cublas init phase for groupedGemm
* relax the tol of test encoder multiprocessing mxfp8 by 0.001 Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 31 Mar, 2025 4 commits
-
-
Tim Moon authored
* Handle case where FP8 current scaling quantizer gets default process group Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix linter warning Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid canonicalizing TP group since it may not be initialized Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Michael Goldfarb authored
Add fast path for causal masking with segment IDs. Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
-
Xiaowei Ren authored
fix a race error softmax_lse Signed-off-by: Xiaowei Ren <xren@nvidia.com>
-
yuguo authored
-
- 27 Mar, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Cleanup sanity tests and add CS recipe tests Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix sanity test Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix CG capture with CS recipe Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix ops for CG Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
yuguo authored
-
- 25 Mar, 2025 5 commits
-
-
Tim Moon authored
* Coalesce NCCL all-gathers for MXFP8 all-gather Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add missing import Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Cache quantized input tensor after linear module forward pass Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix linter warnings Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid unnecessarily allocating layernorm output in LayerNormLinear/LayerNormMLP Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Charlene Yang authored
* skip cuDNN 9.8 for KV caching Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert from max_seqlen_kv to max_sequence_length for InferenceParams Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* rename test_paged_attn to test_kv_cache Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove redundant None returns in bwd Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* add debug flags when no backend is found Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* skip kv_cache_accuracy tests for cuDNN 9.8 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* truncate length of cu_seqlens for consistency with q/k/v shape Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* add back padding_brcm for fused attn tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* re-enable kv_cache_accuracy test for 9.8 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix cuDNN search dir Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fixes based on review Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove extra empty line Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
guyueh1 authored
* Fix mxfp8 columnwise data missing when switching from validation to training Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
* Fix when you interleave training and inference Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
* refact Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* rm useless code Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
* Update transformer_engine/pytorch/module/base.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: guyueh1 <140554423+guyueh1@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix linter warnings Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
---------
Signed-off-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
Signed-off-by: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Guyue Huang <guyueh@login-preos02.a51.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Peter St. John authored
* Defer torch compilation steps until first function call Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix function call in smoke test Signed-off-by: Peter St. John <pstjohn@nvidia.com>
---------
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Li Tao authored
fix mcore DDP error Signed-off-by: lit <lit@nvidia.com> Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-