- 15 Dec, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 13 Dec, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 11 Dec, 2025 1 commit
-
-
wenjh authored
Signed-off-by:
wenjh <wenjh@sugon.com> * Mutex group gemm Signed-off-by:
wenjh <wenjh@sugon.com> * do while group gemm Signed-off-by:
wenjh <wenjh@sugon.com> * Remove mutex Signed-off-by:
wenjh <wenjh@sugon.com>
-
- 26 Nov, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 12 Nov, 2025 5 commits
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
wenjh authored
-
wenjh authored
-
wenjh authored
-
- 08 Nov, 2025 1 commit
-
-
wenjh authored
-
- 16 Oct, 2025 2 commits
-
-
yuguo authored
-
tabuchixiangcai3 authored
Signed-off-by:Tangao <2205747538@qq.com>
-
- 18 Sep, 2025 2 commits
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
yuguo authored
-
- 12 Sep, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 11 Sep, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 02 Sep, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 28 Aug, 2025 2 commits
- 27 Aug, 2025 2 commits
-
-
Vladimir Cherepanov authored
* Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com>
-
yuguo authored
-
- 26 Aug, 2025 5 commits
-
-
jberchtold-nvidia authored
Revert "[Common] PDL for Blockwise Quantization (#2066)" This reverts commit ebca6153. Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
vcherepanov-nv authored
* Bump cuDNN FE to 1.14.0 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change submodule hash Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Pick up a cuDNN FE fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * New model configs in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude cuDNN backend for some configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com>
-
jberchtold-nvidia authored
Revert "[Common] PDL for Quantization Kernels (#2001)" This reverts commit bfab8c67. Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
Tim Moon authored
* Fix incorrect version checks for atomic GEMM Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix typo Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
yuguo authored
-
- 23 Aug, 2025 1 commit
-
-
yuguo authored
-
- 21 Aug, 2025 1 commit
-
-
yuguo authored
-
- 15 Aug, 2025 1 commit
-
-
Jan Bielak authored
* Add `nvte_cublas_gemm_scaled` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `tex.generic_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `general_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `BasicLinear._functional_forward` and `BasicLinear._functional_backward` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `ForwardLinearScaleAdd` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `BackwardLinearScale` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove calls to `validate_gemm_scale` from `BasicLinear` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 14 Aug, 2025 3 commits
-
-
Kirthi Shankar Sivamani authored
Add launch bounds to swizzle kernel, use empty scale inv Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Xin Yao authored
* reduce driver calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * reduce driver calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * adjust tests to capture this Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kshiteej K authored
* fix: update grad_output quant to avoid redundant work Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * add test Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * don't keep only columnwise quant if requires_dgrad=False Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * fix stray merge Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * fix for ctx.use_bias is True case Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * Skip if FP8 not available Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Aug, 2025 3 commits
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
Xin Yao authored
* enable PDL for blockwise quantization kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * add comment Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Xin Yao <xiny@nvidia.com>
-
- 12 Aug, 2025 2 commits
-
-
Jan Bielak authored
* Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 11 Aug, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 09 Aug, 2025 1 commit
-
-
Daniel Stokes authored
* fix: Add stream synchronization before destroying MPI communicator (#1979) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com>
* feat: Implement column-wise userbuffer overlap for comm+GEMM operations
  Add support for overlapping column-wise allgather communication with GEMM operations to improve training performance:
  * **Core infrastructure changes:**
    - Update bulk_overlap_columnwise_ag() to accept explicit stream parameter
    - Modify userbuffers send/recv loops to use rank-ordered iteration
    - Add userbuffers_send_all/recv_all function declarations
  * **Python integration:**
    - Add bulk_overlap_ag_with_external_gemm() C++ extension function
    - Expose new overlap function via pybind11 bindings
    - Update overlap method configurations to include more ring_exchange ops
  * **LayerNorm MLP optimization:**
    - Enable column-wise quantization for FC2 gradient output
    - Implement overlap of allgather communication with FC2 DGRAD GEMM
    - Use fill_userbuffers_buffer_for_all_gather for efficient buffering
  This optimization allows overlapping communication and computation phases more effectively, reducing training wall-clock time by hiding allgather latency behind GEMM execution. Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Working userbuffer overlapping API Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix overwriting bulk overlap UB object for layernormLinear Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Update external overlap to use tp size instead of nvsize to determine number of copies Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix linter error Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Explanatory comments of overlap logic Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix the UB fused ops tests Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix linter errors Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> --------- Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 08 Aug, 2025 1 commit
-
-
yuguo authored
-