- 03 Nov, 2025 4 commits
-
-
zhaochao authored
Signed-off-by:zhaochao <zhaochao1@sugon.com>
-
zhaochao authored
Signed-off-by:zhaochao <zhaochao1@sugon.com>
-
zhaochao authored
Signed-off-by:zhaochao <zhaochao1@sugon.com>
-
zhaochao authored
Signed-off-by:zhaochao <zhaochao1@sugon.com>
-
- 17 Oct, 2025 1 commit
-
-
tabuchixiangcai3 authored
Signed-off-by:Tangao <2205747538@qq.com>
-
- 16 Oct, 2025 1 commit
-
-
tabuchixiangcai3 authored
Signed-off-by:Tangao <2205747538@qq.com>
-
- 18 Sep, 2025 3 commits
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
zhujian authored
feature(FA3,MLA,CP): 1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with compile error fixed), PR-1604 support hdimQK != hdimV backward 2. Update get_attention_backend method because FA3 support MLA now 3. Add CP MLA support for FA3 4. Add unit tests for FA3 MLA CP 5. Update attention doc Signed-off-by:zhujian <zhujian.whu.cs@gmail.com>
-
alan yang authored
* feat: add cutlass group gemm support Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor multi tensor gemm interface Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor nvte_multi_stream_cublas_gemm func and add license info Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add unit test for cutlass group gemm Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add cutlass support type protect Signed-off-by:
Min Yang <min.yang@shopee.com> * add tests and fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: fix unit tests error Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: refactor host workspace malloc Signed-off-by:
Min Yang <min.yang@shopee.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * further relex threshold and add a env var to warn fall back Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Min Yang <min.yang@shopee.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
alan yang <89962857+cassiewilliam@users.noreply.github.com> Co-authored-by:
Min Yang <min.yang@shopee.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 17 Sep, 2025 1 commit
-
-
vthumbe1503 authored
* Test working as I think it should work Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * revert accidental change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Restrict the number of cases for unfused quantization, some fp8->fp8 cases are handled by cublas Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> fix merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> bug: missed a } in the code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixure Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Excluse CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversibscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantation with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizaer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Decouple Recipe and ScalingMode (#1728) * Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> feat: Add support for multiple quantization modes in the UB communicators (#2043) Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weired dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Create GPU reload buffers on main stream (#2131) * Create GPU relaod buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> mxfp8 unfused quant support, refined unit test, remove unecessary quantization code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> missed a quant code removal Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor bug fix Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixure Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Excluse CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversibscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantation with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizaer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Decouple Recipe and ScalingMode (#1728) * Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> feat: Add support for multiple quantization modes in the UB communicators (#2043) [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weired dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Create GPU reload buffers on main stream (#2131) * Create GPU relaod buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> minor code cleanup Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci minor cosmetics Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Address review comment Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci minor comment update Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fix CI failures for UB overlap changes (#2149) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> minor bug: quantizer should not be none for unfused quantization Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) * Fix failing tests for dropout=0.1 and bias for fused attn for blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the skip message Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Add check for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support to get all devs in the process for jax Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code clean up Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Represent attn bias using enum instead of string Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> fix linting error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119) * add noop to comp amax Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix for fp8 blockwise recipe Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * Update test_multi_process_distributed_grouped_gemm.py change accidentally added while merging Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * Update dense.py change accidentally added while merging Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address revie comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug solved: delayed scaling quantization with mxfp8 inputs didnt work Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the unit test error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * just to trigger ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Sep, 2025 2 commits
-
-
Tim Moon authored
* Add CPU offloading logic to ops. Fix test to compute dgrad. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure grads are contiguous in op backwards Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op-based MLP to CPU offloading tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handle different weight cache behavior on Hopper/Blackwell Add MXFP8 to CPU offload tests. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove MXFP8 test Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Paweł Gadziński authored
* fix underflows log issue Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 10 Sep, 2025 1 commit
-
-
jomitchellnv authored
Adds context parallelism utilities: moving cp shards to diff ranks and pad sequence to divisibility factory (#2129) * test - adds unit test for cp utilities and the utilites Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * assert line change Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com>
-
- 09 Sep, 2025 1 commit
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
- 08 Sep, 2025 1 commit
-
-
vasunvidia authored
* Fused RoPE with combined QKV input. Initial commit for Dropout with 8-bit RNG Fix documentation Initial commit for Fused QKV RoPE WIP Initial tests passing Enable rotary percent and margin Enable CP2, start_positions, interleaved Cleanup test Revert "Fix documentation" This reverts commit 53df10044e7769982bd4af2ae2628e6b7717e715. Revert "Initial commit for Dropout with 8-bit RNG" This reverts commit 301505e24031cbcd679069e1c2cd4d00eedf2dca. Cleanup. Minor cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Optimize kernels Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Misc. Cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Optimize kernel performance Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Move fused_qkv_rope test to test_fused_rope.py Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * apply shared memory optimization to separate fused rope kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 04 Sep, 2025 1 commit
-
-
Casper authored
* fix cross entropy Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Casper <casperbh.96@gmail.com> * fix comments Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: few more style issues Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: remove grad_output_stride (unnecessary) Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: only backward was broken Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Generalize cross entropy backward kernel to handle reduced and unreduced loss Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Casper <casperbh.96@gmail.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 03 Sep, 2025 2 commits
-
-
wenjh authored
Signed-off-by:wenjh <wenjh@sugon.com>
-
Daniel Stokes authored
Signed-off-by:djns99 <40156487+djns99@users.noreply.github.com>
-
- 31 Aug, 2025 1 commit
-
-
vasunvidia authored
* Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 29 Aug, 2025 1 commit
-
-
Daniel Stokes authored
-
- 28 Aug, 2025 2 commits
-
-
Paweł Gadziński authored
* Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Charlene Yang authored
* disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 26 Aug, 2025 3 commits
-
-
Paweł Gadziński authored
fix test on old device Signed-off-by:Pawel Gadzinski <pgadzinski@nvidia.com>
-
vcherepanov-nv authored
* Bump cuDNN FE to 1.14.0 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change submodule hash Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Pick up a cuDNN FE fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * New model configs in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude cuDNN backend for some configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com>
-
Xin Yao authored
expose more activation functions Signed-off-by:Xin Yao <xiny@nvidia.com>
-
- 25 Aug, 2025 2 commits
- 21 Aug, 2025 1 commit
-
-
yuguo authored
-
- 20 Aug, 2025 2 commits
-
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Jan Bielak authored
* Add rmsnorm_bwd_add Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add BackwardAddRMSNorm fused operation Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Try to optimize register usage in kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add separate BackwardAdd stage for the fused backward add Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
-
- 15 Aug, 2025 1 commit
-
-
Jan Bielak authored
* Add `nvte_cublas_gemm_scaled` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `tex.generic_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `general_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `BasicLinear._functional_forward` and `BasicLinear._functional_backward` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `ForwardLinearScaleAdd` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `BackwardLinearScale` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove calls to `validate_gemm_scale` from `BasicLinear` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 14 Aug, 2025 3 commits
-
-
Tim Moon authored
* Unfused impl for dbias-quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Unfused impl for dact-dbias-quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable fused bgrad-quantize for unsupported recipes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unfused dbias-quantize impls Not supported in the core lib. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support unfused impls in tex functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweaks Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused imports Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Xin Yao authored
* reduce driver calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * reduce driver calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * adjust tests to capture this Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kshiteej K authored
* fix: update grad_output quant to avoid redundant work Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * add test Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * don't keep only columnwise quant if requires_dgrad=False Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * fix stray merge Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * fix for ctx.use_bias is True case Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> * Skip if FP8 not available Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
kshitij12345 <kshitijkalambarkar@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Aug, 2025 1 commit
-
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * turn on userbuffers for layers without debug Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * working change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests and fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * update nvinspect version Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * docs change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix default Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix default Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix default Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 09 Aug, 2025 1 commit
-
-
Daniel Stokes authored
* fix: Add stream synchronization before destroying MPI communicator (#1979) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * feat: Implement column-wise userbuffer overlap for comm+GEMM operations Add support for overlapping column-wise allgather communication with GEMM operations to improve training performance: * **Core infrastructure changes:** - Update bulk_overlap_columnwise_ag() to accept explicit stream parameter - Modify userbuffers send/recv loops to use rank-ordered iteration - Add userbuffers_send_all/recv_all function declarations * **Python integration:** - Add bulk_overlap_ag_with_external_gemm() C++ extension function - Expose new overlap function via pybind11 bindings - Update overlap method configurations to include more ring_exchange ops * **LayerNorm MLP optimization:** - Enable column-wise quantization for FC2 gradient output - Implement overlap of allgather communication with FC2 DGRAD GEMM - Use fill_userbuffers_buffer_for_all_gather for efficient buffering This optimization allows overlapping communication and computation phases more effectively, reducing training wall-clock time by hiding allgather latency behind GEMM execution. Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Working userbuffer overlapping API Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix overwriting bulk overlap UB object for layernormLinear Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Update external overlap to use tp size instead of nvsize to determine number of copies Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix linter error Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Explanatory comments of overlap logic Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix the UB fused ops tests Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> * fix: Fix linter errors Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> --------- Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 08 Aug, 2025 2 commits
-
-
Paweł Gadziński authored
* turn on userbuffers for layers without debug Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * working change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests and fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * update nvinspect version Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix ci Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
yuguo authored
-
- 06 Aug, 2025 1 commit
-
-
hx authored
[PyTorch] fix input_quantizer usage for save_original_input; fix blockwise FP8 convert_and_update_tensor (#1978) * fix input_quantizer in save_original_input bwd Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * fix get shape of blockwise tensor with only compact colwise data Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * fix blockwise FP8 convert_and_update_tensor Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 01 Aug, 2025 1 commit
-
-
Autumn1998 authored
* fix bug if all values<0 Signed-off-by:
tongliu <tongliu@nvidia.com> * minor fix Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com>
-