- 25 Aug, 2025 1 commit
-
-
yuguo authored
-
- 23 Aug, 2025 2 commits
- 21 Aug, 2025 2 commits
- 19 Aug, 2025 1 commit
-
-
evt_fugx1 authored
-
- 08 Aug, 2025 1 commit
-
-
yuguo authored
-
- 07 Aug, 2025 1 commit
-
-
yuguo authored
-
- 06 Aug, 2025 2 commits
- 05 Aug, 2025 1 commit
-
-
yuguo authored
-
- 25 Jul, 2025 2 commits
-
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 18 Jul, 2025 2 commits
-
-
yuguo authored
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 17 Jul, 2025 1 commit
-
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 16 Jul, 2025 1 commit
-
-
yuguo authored
-
- 15 Jul, 2025 2 commits
-
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
yuguo authored
-
- 11 Jul, 2025 1 commit
-
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 09 Jul, 2025 2 commits
-
-
yuguo authored
-
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 08 Jul, 2025 1 commit
-
-
yuguo authored
-
- 01 Jul, 2025 1 commit
-
-
wenjh authored
Add env to choose blocklen of blockwise quantize.
Signed-off-by: wenjh <wenjh@sugon.com>
Fix pytest of blockwise error
Signed-off-by: wenjh <wenjh@sugon.com>
Resolve new api in int8 gemm test
Signed-off-by: wenjh <wenjh@sugon.com>
Fix incorrect launch parm
Signed-off-by: wenjh <wenjh@sugon.com>
Fix 1D blockwise(64) acc error
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 20 Jun, 2025 2 commits
- 19 Jun, 2025 1 commit
-
-
yuguo authored
-
- 18 Jun, 2025 1 commit
-
-
yuguo authored
-
- 16 Jun, 2025 1 commit
-
-
yuguo authored
-
- 13 Jun, 2025 7 commits
-
-
Charlene Yang authored
* add support for head dim > 128
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* remove debugging
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* raise tols slightly to tolerate 1/2048 mismatches
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* fix is_training for test_te_layer
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* add bprop support for blackwell
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* minor tweak for format
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* fix backend selection results
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* bump sm100 to sm100+
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* add sq=1 test for MLA
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* enable sq=1 for bprop
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* minor tweak in comments
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* fix head_dim logic and remove pytest skip
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* add FE fix for d>128
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* update FE again to take in small fixes
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* add cuDNN version info in L0 tests
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* increase tols for Unfused + large dim
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* Revert "add cuDNN version info in L0 tests"
This reverts commit 3e1b426ca5319a2c0540b9e73bba7047d0e583e5.
Signed-off-by: Charlene Yang <charleney@nvidia.com>
* fix tols for Unfused
Signed-off-by: Charlene Yang <charleney@nvidia.com>
---------
Signed-off-by: Charlene Yang <charleney@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
yuguo authored
-
Oleg Goncharov authored
* Added support of FP4 data type
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Refactoring to BitsNum in progress
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed compilation errors. All C++ tests passed
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed a typo
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Added FP4 guard to TMA tensor descriptor data type
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed errors in JAX C++ extensions
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Removed dummy NVFP4 C++ test file
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Make pytorch changes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Refactored the code per the review notes. Fixed JAX build error.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Removed unnecessary static casts
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Typo fix
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
* Pass correct num bits to create_2D_tensor_map; fixes CI
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* inline funcs
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
* Add FP8 current scaling to te.Sequential tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Helper function for test/ref tensors does not produce quantized tensor by default
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add FP8 current scaling to distributed te.Sequential tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add FP8 current scaling to Userbuffers te.Sequential tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Debug MXFP8 tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Do not initialize quantized weights with column-wise usage in inference mode
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug in test
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Use no-grad mode instead of inference mode in tests
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Jan Bielak authored
* Flatten basic op params during fuser init
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)
* Add caching for is_non_tn_fp8_gemm_supported
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
* Pass fuser to _OperationFuserAutogradFunction.forward and moving computation to __init__
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd808991993958b670726896254b82fcb967fa07)
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Pass basic_op_kwargs and is_grad_enabled as parameters rather than in fuser
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Daniel Stokes authored
* Add support for overlapping wgrad NCCL AG with dgrad GEMM
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
* Remove unused wait on memcpy API from UB
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
* Add better commenting to MXFP8 overlap
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
---------
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Co-authored-by: dastokes <dastokes@dastokes-dvt-01.nvidia.com>
-
- 12 Jun, 2025 4 commits
-
-
Evgeny Tsykunov authored
* Support L2Norm basic op
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Add L2Norm module wrapper
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Expose qk_norm to MHA and transformer layer
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Move tests into separate file
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix pass
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Add license
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Remove module
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* Resolve comments
Signed-off-by: Evgeny <etsykunov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Selvaraj Anandaraj authored
* Added double buffering support initial commit
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
* Fixed bugs
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* Make only one double buffer creation
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* Fixed bug
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* Fixed typo
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* Fixed flag setting
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Merge conflict
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* lint fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
---------
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <anandaraj@wisc.edu>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Paweł Gadziński authored
typo fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fix for loading old ckpt formats
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-