- 12 Feb, 2026 2 commits
Kim, Jin (Jay@SKT) authored
* Add sigmoid GLU
  Signed-off-by: Kim, Jin <jinn.kim@sk.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  Signed-off-by: Kim, Jin <jinn.kim@sk.com>
* Add test for GLU op
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix incorrect reshape
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Apply suggestion from @timmoon10
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Add omitted tests for GLU op
  Signed-off-by: Kim, Jin <jinn.kim@sk.com>
* Add GLU activation type support in JAX extension
  Signed-off-by: Kim, Jin <jinn.kim@sk.com>
* [PyTorch] Add Sigmoid activation for GLU support in numerics test (#2656)
  Signed-off-by: Kim, Jin <jinn.kim@sk.com>
---------
Signed-off-by: Kim, Jin <jinn.kim@sk.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
* Add ops for MoE grouped MLP
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Move testing utility functions to util submodule
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Tweak docs
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Change order of tensor compatibility checks in noop_cat
  Review suggestion from @ptrendx.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add support for GLU interleaving in clamped SwiGLU
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 25 Jan, 2026 1 commit
Tim Moon authored
* Expose option for custom op fusions
  Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add tests for custom ops
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fix linter warnings and numerical test failures
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Tweak pattern matching logic with fixed window sizes
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Use TF32 tols in fused op tests
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Review suggestion from @greptile-apps
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Backpropagate fixes from #2622
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 17 Jan, 2026 1 commit
Tim Moon authored
* Add general C API for setting tensor params
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Implement general accessors for NVTETensor
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Refactor tex swizzling to skip if scales are already swizzled
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add checks for non-swizzled scales in MXFP8 and NVFP4 kernels
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Support pre-swizzled scales in MXFP8Tensor
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add tex function to swizzle MXFP8 scales
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug in inplace swizzle function
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Tweak comments to use "compact/swizzled format"
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* MXFP8 quantize kernel with pre-swizzled scales
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Expose pre-swizzled scales in modules
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug in multi-swizzle
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Support MXFP8 gated activations with swizzled scales
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add PyTorch infrastructure for pre-swizzled NVFP4 tensors
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Deprecate DSv3-specific quantization logic in C API
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Remove support for DSv3 compact data from quantizer
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Remove DSv3 compact data format from core lib
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug in FP8 all-gather
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix linter warnings
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Update JAX to use new swizzled scale API
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Review suggestion from @greptile-apps
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Review suggestions from @greptile-apps
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update C++ swizzle test with swizzled scales API
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Return default tensor params when querying params for invalid NVTETensor
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug DSv3 FP8 test failures
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug Userbuffers test failures
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Make sure gated activations populate FP8 transpose if needed
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Review suggestions from @greptile-apps
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Disable pre-swizzling with debug quantizer
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Review suggestion from @greptile-apps
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix merge conflicts and review suggestions
  Update copyright years. Tweak comments. Fix various complaints from @greptile-apps.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Use explicitly sized types in config accessors
  Miscellaneous review suggestions from @ptrendx.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Make util header for function that compute swizzled scale index
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Apply suggestions from @greptile-apps
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Update expected error message in FP8 block-scaling test
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Review suggestion from @yaox12
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
- 02 Jan, 2026 1 commit
Kirthi Shankar Sivamani authored
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 26 Nov, 2025 1 commit
Tim Moon authored
Do not initialize recipe state in base op class
Op attrs may not be set. Move recipe state initialization to linear op constructor.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 14 Oct, 2025 1 commit
Kirthi Shankar Sivamani authored
* Initial API change
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change all imports and api
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* format
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix typo
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix recipe tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix more tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix docs, tests, and make Jax change as well
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change internal uses of fp8_autocast
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Address nits
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* rename file
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* CG function, and small test fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change instances of make_graphed_callables internally
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix distributed tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Review
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Review
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix test and add more docs
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Cleanup test imports and minimize internal file imports
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Make is_bf16_available public
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Better docs and better api
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* format
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Apply suggestions from code review
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* fix nvfp4 test
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 30 Sep, 2025 1 commit
vthumbe1503 authored
* Test working as I think it should work Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> revert accidental change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Restrict the number of cases for unfused quantization, some fp8->fp8 cases are handled by cublas Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> fix merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> bug: missed a } in the code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantization with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Decouple Recipe and ScalingMode (#1728) * Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> feat: Add support for multiple quantization modes in the UB communicators (#2043) Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weird dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Create GPU reload buffers on main stream (#2131) * Create GPU reload buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> mxfp8 unfused quant support, refined unit test, remove unnecessary quantization code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> missed a quant code removal Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor bug fix Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor code cleanup Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor cosmetics Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Address review comment Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor comment update Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fix CI failures for UB overlap changes (#2149) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor bug: quantizer should not be none for unfused quantization Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) * Fix failing tests for dropout=0.1 and bias for fused attn for blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the skip message Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Add check for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support to get all devs in the process for jax Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code clean up Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Represent attn bias using enum instead of string Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> fix linting error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * initial draft of changes to get GPT oss based swiglu integrated, gated kernels needs to be fixed Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * redundant implementation for the pytorch to te hook up, refactoring to be done later Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * all gated kernels modified, pytest working for oss swiglu Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix the merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119) * add noop to comp amax Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix for fp8 blockwise recipe Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] fix cross entropy vanishing gradients (#2139) * fix cross entropy Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Casper <casperbh.96@gmail.com> * fix comments Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: few more style issues Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: remove grad_output_stride (unnecessary) Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: only backward was broken Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Generalize cross entropy backward kernel to handle reduced and unreduced loss Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Casper <casperbh.96@gmail.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fix bug when enabling --overlap-grad-reduce in mcore (#2142) * fix bugs when enabling --overlap-grad-reduce in mcore Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CI Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * format Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fix CUDA version in setup.py (#2132) * Fix CUDA version in setup.py Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Re-enable building comm-gemm tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * WAR for nvidia-nvshmem package Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] NoScaleTensor wrapper for non-quantized data (#2136) * Custom call tests passing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix shardy issue with amax being shape 1,1,1 instead of shape (1,) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add higher-precision VJP tests to test_distributed_layernorm_mlp Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Cast non-quantized kernels to input dtype in VJPs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename HighPrecisionTensor to NoScaleTensor Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use NoScaleTensor in pure JAX impls where it was missing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Fix GroupedScaledTensor creation with keyword arg (#2154) Fix GroupedScaledTensor creation Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fixing few issues with multi-process launching. (#2155) * Fixing few issues with multi-process launching. Signed-off-by:
Ming Huang <mingh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Update list of authorized CI users (#2152) Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> a bit of cleanup Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * accidentally had removed some activations, minor bug in the templated function Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * parent de9ef2fe450daae0d4ea1b647a37219f72814f66 author Varun Thumbe <vthumbe@nvidia.com> 1757373536 +0000 committer Varun Thumbe <vthumbe@nvidia.com> 1758262513 +0000 parent de9ef2fe450daae0d4ea1b647a37219f72814f66 author Varun Thumbe <vthumbe@nvidia.com> 1757373536 +0000 committer Varun Thumbe <vthumbe@nvidia.com> 1758262476 +0000 parent de9ef2fe450daae0d4ea1b647a37219f72814f66 author Varun Thumbe <vthumbe@nvidia.com> 1757373536 +0000 committer Varun Thumbe <vthumbe@nvidia.com> 1758262304 +0000 merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantization with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply Tim's suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weird dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Create GPU reload buffers on main stream (#2131) * Create GPU reload buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Fix CI failures for UB overlap changes (#2149) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> [JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) * Fix failing tests for dropout=0.1 and bias for fused attn for blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the skip message Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Add check for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support to get all devs in the process for jax Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code clean up Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Represent attn bias using enum instead of string Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119) * add noop to comp amax Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix for fp8 blockwise recipe Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> [PyTorch] fix cross entropy vanishing gradients (#2139) * fix cross entropy Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Casper <casperbh.96@gmail.com> * fix comments Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: few more style issues Signed-off-by:
Casper <casperbh.96@gmail.com> * fix: remove grad_output_stride (unnecessary) Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: only backward was broken Signed-off-by:
Casper <casperbh.96@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Generalize cross entropy backward kernel to handle reduced and unreduced loss Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Casper <casperbh.96@gmail.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Fix bug when enabling --overlap-grad-reduce in mcore (#2142) * fix bugs when enabling --overlap-grad-reduce in mcore Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CI Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * format Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix CUDA version in setup.py (#2132) * Fix CUDA version in setup.py Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Re-enable building comm-gemm tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * WAR for nvidia-nvshmem package Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> [JAX] NoScaleTensor wrapper for non-quantized data (#2136) * Custom call tests passing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix shardy issue with amax being shape 1,1,1 instead of shape (1,) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add higher-precision VJP tests to test_distributed_layernorm_mlp Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Cast non-quantized kernels to input dtype in VJPs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename HighPrecisionTensor to NoScaleTensor Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use NoScaleTensor in pure JAX impls where it was missing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> [JAX] Fix GroupedScaledTensor creation with keyword arg (#2154) Fix GroupedScaledTensor creation Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Fixing few issues with multi-process launching. (#2155) * Fixing few issues with multi-process launching. Signed-off-by:
Ming Huang <mingh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Update list of authorized CI users (#2152) Signed-off-by:
Tim Moon <tmoon@nvidia.com> Fused RoPE with combined QKV input. (#2122) * Fused RoPE with combined QKV input. Initial commit for Dropout with 8-bit RNG Fix documentation Initial commit for Fused QKV RoPE WIP Initial tests passing Enable rotary percent and margin Enable CP2, start_positions, interleaved Cleanup test Revert "Fix documentation" This reverts commit 53df10044e7769982bd4af2ae2628e6b7717e715. Revert "Initial commit for Dropout with 8-bit RNG" This reverts commit 301505e24031cbcd679069e1c2cd4d00eedf2dca. Cleanup. Minor cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Optimize kernels Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Misc. Cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Optimize kernel performance Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Move fused_qkv_rope test to test_fused_rope.py Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * apply shared memory optimization to separate fused rope kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * accidentally removed the copyright Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix linting issue Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * minor issue in comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * Commit is for another PR Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * revert changes since this belongs to another PR Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert change back since belongs to another PR Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changes belong to another PR Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert changes here Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Add bf16/fp32 token-per-expert to the MoE aux loss kernel (#2162) * add bf16/fp32 token-per-expert on the moe-loss-computation on router fusion Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [JAX] Scale swizzling via JAX transpose op (#2163) * add swizzle in jax Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added outer_impl Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Extract cpp distributed tests into a separate project (#2165) * Extract cpp distributed tests into a separate project Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove obsolete exclusion Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Run L1_cpp_distributed tests if at least 4 GPUs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Adds context parallelism utilities: moving cp shards to diff ranks and pad sequence to divisibility factory (#2129) * test - adds unit test for cp utilities and the utilities Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * assert line change Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * cleanup Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix linting error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci [PyTorch Debug] Fix issue with negative underflow% stat. (#2107) * fix underflows log issue Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Address review comments, fix mxfp8 kernel bug: was not passing clamped swiglu parameter correctly Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Lower precision gated-act to accelerate FP8 current-scaling. (#2153) * Applying the original precision as Norm outputs' and activation computations. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding knob to control norm output precision. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Removing the knob and applying lower-precision norm with current-scaling only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix the error when quantizer==None Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> [PyTorch] Support activation CPU offloading in fusible ops (#2158) * Add CPU offloading logic to ops. Fix test to compute dgrad. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure grads are contiguous in op backwards Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op-based MLP to CPU offloading tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handle different weight cache behavior on Hopper/Blackwell Add MXFP8 to CPU offload tests. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove MXFP8 test Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Do not use normalization forward + amax fusion if cuDNN backend is requested (#2174) * Do not use norm fwd + amax fusion if cudnn backend is requested Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Read environment variable directly to avoid include error Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Fix unjoined comm stream in UB communicator (#2160) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> FP8 Output Quantization for GEMM (#2123) * Test working as I think it should work Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * revert accidental change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Restrict the number of cases for unfused quantization, some fp8->fp8 cases are handled by cublas Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> fix merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> bug: missed a } in the code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantization with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Decouple Recipe and ScalingMode (#1728) * Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> feat: Add support for multiple quantization modes in the UB communicators (#2043) Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weird dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Create GPU reload buffers on main stream (#2131) * Create GPU reload buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> mxfp8 unfused quant support, refined unit test, remove unnecessary quantization code Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> missed a quant code removal Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> minor bug fix Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Add cuBLASMp-backed GEMM-like API to TE common (#1824) * Pick up cuBLASMp during build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Change lib order to fix link error Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Context creation, incomplete... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Test fixture Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A sanity AgGemm test, failing... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix axes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Take care of uneven distribution Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use MPI to get position of local matrices Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Refactor & fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-RS Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Gemm-AR, not working... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fixes Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Setting all-reduce epilogue for gemm-ar Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use supported shapes for GEMM-AR Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tolerance Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * First shot at fp8 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use TensorHolder in tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Support comm_sm_count Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Parametrize dtypes for A, B and D separately Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak scaling Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Amax ptr Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Flags parity with cublas_gemm, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Bias tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix bias test Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Aux, saving... Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * aux_ld Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * A fix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Use test::Tensor Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Set scale inv Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unsupported test configs Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Replace libcal with NCCL Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add NVTX markers to API functions Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Tweak GemmAr tests Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More test config Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix merge fallout Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove MPI dependency, comment API, add algo parameter Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem dependency Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix nvshmem build Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Exclude CommGemm tests from L0_cppunittest Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Add cpp_distributed sh file for CI Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Adapt tp TensorAllocator Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip GemmAr test on unsupported HW Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Oversubscribe is needed on some clusters Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incomplete libcal removal Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Move CI tests to L1 Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Rename context to include NVTE prefix Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove leftover code Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * NVTE_WITH_CUBLASMP off by default Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed NVTE_CHECK diag Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Comment API Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Include stdbool header for legacy C compilers Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove now unused argument Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * Abstract away cuBLASMp algo behind our own enum Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * More detailed shape diag messages Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/include/transformer_engine/comm_gemm.h Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> * Add license Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by:
Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com> Co-authored-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086) * FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantization with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Delay MeshResource validation until first usage (#2124) Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Decouple Recipe and ScalingMode (#1728) * Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128) * add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118) * add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121) Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Temporarily remove comm_gemm tests (#2133) Signed-off-by:
Vladimir Cherepanov <vcherepanov@nvidia.com> [PyTorch] Disable determinism for sm100 (#2130) * disable determinism for sm100+ and cudnn<9.14 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix remaining CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert some changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm100 from determinism table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch] ONNX export of FP8 Current Scaling (#2068) * Compute amax in normalization forward in current scaling in untuned kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * apply tims suggestions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> [PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134) use torch empty for empty shape instead of from_blob Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> build: pull cached wheels (#2127) * build: pull cached wheels Signed-off-by:
oliver könig <okoenig@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Signed-off-by:
oliver könig <okoenig@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> feat: Add support for multiple quantization modes in the UB communicators (#2043) [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074) * add checks to cuda kernel launch and cuda API calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * Remove exceptions from destructors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix weird dispatch in ln/rmsnorm Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> [PyTorch] Support bf16+fp8 cudagraph (#2098) * support bf16+fp8 model Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Dropout with 8-bit RNG (#2014) * Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Create GPU reload buffers on main stream (#2131) * Create GPU reload buffers on main stream Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> * Fixed typo Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> minor code cleanup Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci minor cosmetics Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Address review comment Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci minor comment update Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Fix CI failures for UB overlap changes (#2149) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> minor bug: quantizer should not be none for unfused quantization Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135) * Fix failing tests for dropout=0.1 and bias for fused attn for blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the skip message Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Add check for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support to get all devs in the process for jax Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code clean up Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Represent attn bias using enum instead of string Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> fix linting error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119) * add noop to comp amax Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix for fp8 blockwise recipe Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * Update test_multi_process_distributed_grouped_gemm.py change accidentally added while merging Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * Update dense.py change accidentally added while merging Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug solved: delayed scaling quantization with mxfp8 inputs didn't work Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the unit test error Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * just to trigger ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> TE Gemma tutorial attempt#2 (#1839) * add tutorial files and other local changes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove extraneous code for easy debu Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * make cuda graphs work with non-paged and paged attention Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * perf imp for kv cache ops Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add code for calibration Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * optimize kv_cache reindex and copy kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * changes to make quantizers work with fp8_calibration Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * avoid reindexing from python side Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename variable from previous commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use quantizer only if needed Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * functionality of the tutorial tested and perf checked Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove files and update headers/licenses Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * update header/license Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update tutorial for review Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make weights downloadable on the fly; remove extra print statements Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint and update comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add comma back, typo Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * sequence_start_positions should be None for training Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add paged attention numbers and update requirements.txt file Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * more fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make tutorial work on blackwell Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove gemma FT tutorial for now Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fixing the headings placement and rewording attention -> kv caching Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fixes from comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the images Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * misc fixes Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add more comments to te_gemma.py and cleanup utils.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add more information about the hierarchy of the classes used in the tutorial Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add better cuda graphs picture Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add updated cuda graphs pictures Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add illustrated cuda graphs Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * small fixes in documentation Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add torch.no_grad() to force reduced memory usage Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * some fixes from recent comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * more fixes from remaining comments Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add te_rope_emb to class desc Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix tutorial wording; add calibration fix to grouped_linear.py Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Fix memory overhead of linear layer when all gather from sequence parallel (#2125) * fix memory overhead of all gather from sequence parallel Signed-off-by:
Yuzhong Wang <yuzhongw@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/tensor/_internal/float8_blockwise_tensor_base.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * quick fix the errors when for UB buffers Signed-off-by:
Yuzhong Wang <yuzhongw@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/module/linear.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Avoid deallocating FP8 scale-invs since they are reused Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Yuzhong Wang <yuzhongw@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Fix incorrect TP rank calculation when using data parallel (#2179) Signed-off-by:
djns99 <40156487+djns99@users.noreply.github.com> [Pytorch] Add Cutlass Grouped GEMM Support for fine-grained MoE Model (#2045) * feat: add cutlass group gemm support Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor multi tensor gemm interface Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor nvte_multi_stream_cublas_gemm func and add license info Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add unit test for cutlass group gemm Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add cutlass support type protect Signed-off-by:
Min Yang <min.yang@shopee.com> * add tests and fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: fix unit tests error Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: refactor host workspace malloc Signed-off-by:
Min Yang <min.yang@shopee.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * further relax threshold and add an env var to warn on fallback Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Min Yang <min.yang@shopee.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
alan yang <89962857+cassiewilliam@users.noreply.github.com> Co-authored-by:
Min Yang <min.yang@shopee.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> [PyTorch] Support FA3 for MLA and with CP (#1907) feature(FA3,MLA,CP): 1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with compile error fixed), PR-1604 support hdimQK != hdimV backward 2. Update get_attention_backend method because FA3 support MLA now 3. Add CP MLA support for FA3 4. Add unit tests for FA3 MLA CP 5. Update attention doc Signed-off-by:
zhujian <zhujian.whu.cs@gmail.com> Fix cuDNN version checks when getting backend and for sm89 kv cache (#2185) * Fix cudnn version checks for kv cache for sm89. Add cudnn version check in preparation for 9.14 when getting backend Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor fix for cuDNN version condition check Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use limit=0.75 in clamped SwiGLU test Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * accidentally removed a line while resolving merge conflict Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * match pytorch implementation: dclamp should be 1 for borders of clamping limits as well Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix dswiglu quantization fusion bug Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * pass param by reference as much as possible Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * float should rather be bool: fix by copilot Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * { missed in activation.cpp Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * minor formatting change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * nvfp4 change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Signed-off-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com>
-
- 29 Sep, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add NVFP4 recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Frank Sun <frsun@nvidia.com> Co-authored-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
Zhongbo Zhu <zhongboz@nvidia.com> Co-authored-by:
Evgeny Tsykunov <etsykunov@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Teddy Do <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add MathDx dependency to GitHub builds Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Suggestions from GitHub Copilot Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move 2x shape logic from core to PyTorch Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix compilation errors with CUDA 12.1 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * SM 70 is not supported in CUDA 13 Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Typo Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Revert "Move 2x shape logic from core to PyTorch" This reverts commit f8b2a2d0111d9af690b43bb98ae448d9a430a185. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Added dequantize kernel for FP4 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add NVFP4 support with fusible ops Use logical tensor dims for PyTorch NVFP4 tensors. Temporarily add unfused dequantize impl. Fix bug where NVFP4 recipe was not configurable. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix logic for 2x shapes and move to PyTorch Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CG test model config Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Debug NVFP4 tensor size function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Proper handling of the RNG state Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Test SR properly Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix workspace size for GEMM heuristic. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix compile error in C++ NVFP4 test Some numeric errors when blocks are all zero. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix distributed test problem shape Signed-off-by:
zhongboz <zhongboz@nvidia.com> * proper assert dim for low precision AG TP Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up duplicated code in nvfp4_utils.cuh Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * pylint: disable=unused-argument Signed-off-by:
zhongboz <zhongboz@nvidia.com> * `nvte_cublas_gemm_v2` to take alpha pointer (#12) * make nvte_cublas_gemm_v2 to take alpha/beta pointers Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * users are expected to pass a valid C_tensor Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * typos Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * API to have const float* alpha Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Minor tweaks Support arbitrary beta scales. Increase workspace to be aligned to 128 bytes. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug IMA with alpha pointer Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support fused amax kernels with NVFP4 quantization Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable fused amax with cuDNN LayerNorm kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add NVFP4 cases to distributed tests for TE ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Change assert to NVTE_CHECK in the hadamard cast fusion Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix compile error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use global thread IDs for Philox subsequences Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add shape checks for NVFP4 cast kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Do not fuse amax if cuDNN normalization is forced by envvar Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Frank Sun <frsun@nvidia.com> Co-authored-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
Zhongbo Zhu <zhongboz@nvidia.com> Co-authored-by:
Evgeny Tsykunov <etsykunov@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Teddy Do <tdophung@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 31 Aug, 2025 1 commit
-
-
vasunvidia authored
* Add dropout kernel with 8-bit RNG Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix license Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid ambiguous types Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not enforce dropout prob is representable in 8 bits Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expand error message Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix small statistical bug from using less-equal instead of less-than Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary helper function in PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 26 Aug, 2025 1 commit
-
-
Xin Yao authored
expose more activation functions Signed-off-by: Xin Yao <xiny@nvidia.com>
-
- 20 Aug, 2025 1 commit
-
-
Jan Bielak authored
* Add rmsnorm_bwd_add Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add BackwardAddRMSNorm fused operation Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Try to optimize register usage in kernels Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add separate BackwardAdd stage for the fused backward add Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com>
-
- 15 Aug, 2025 1 commit
-
-
Jan Bielak authored
* Add `nvte_cublas_gemm_scaled` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `tex.generic_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `general_gemm` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Support use of `alpha` and `beta` in `BasicLinear._functional_forward` and `BasicLinear._functional_backward` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `ForwardLinearScaleAdd` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `BackwardLinearScale` fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove calls to `validate_gemm_scale` from `BasicLinear` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 14 Aug, 2025 1 commit
-
-
Tim Moon authored
* Unfused impl for dbias-quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Unfused impl for dact-dbias-quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable fused bgrad-quantize for unsupported recipes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unfused dbias-quantize impls Not supported in the core lib. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support unfused impls in tex functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweaks Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused imports Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 29 Jul, 2025 1 commit
-
-
Tim Moon authored
* remove reciprocal op Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Refactor Quantizer::create_tensor function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug when constructing FP8 tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add quantize function to C++ quantizers Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Prototype function to coerce Python quantized tensors to match quantizer Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use quantizer class in tex.quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 current scaling support for activation backward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable quantized GEMM output with FP8 current scaling Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add coerce_tensor functions for MXFP8 and DSv3 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Avoid quantizing empty tensors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use consistent shapes for FP8 transposes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * In attention impl, construct FP8 tensors with pre-initialized scale-invs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initialize MXFP8 scales to zero Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Store copy of quantizer when creating quantized tensors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure quantized tensors have private quantizer Avoid problems with in-place ops after quantizer usages are changed externally. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename "coerce_tensor" to "convert_and_update_tensor" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make sure CUDA context is available when launching NVRTC kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expose CUDA context creation function externally Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 Jul, 2025 1 commit
-
-
Tim Moon authored
* Add ops for dropout and constant scale Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 24 Jul, 2025 2 commits
-
-
Jan Bielak authored
* Mark output tensors as not deletable in backward Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add `in_place` kwarg to `MakeExtraOutput` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Rename `AddInPlace` to `AddExtraInput` and add an `in_place` kwarg Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Evgeny Tsykunov authored
* Increase intermediate precision and reuse tensors from fwd Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * JIT warmup only when required Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Recompute only rsqrt_norm Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Evgeny <etsykunov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 22 Jul, 2025 1 commit
-
-
Jan Bielak authored
* Refactor _OperationFuserAutogradFunction.forward to use fewer parameters Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit f8f59b1bb184e89468058521df4cfff029ad909c) * Rename `BackwardBiasActivation` to `BackwardActivationBias` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 397c58fc296f801fe4ad600aadc2daff3b78be45) * Use forward operation order in backward fused operations Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 2d37a9385069b066e6cdeff3eb9173c2079cb791) * Rename `prev_op_grad_input_quantizer` to `prev_op_grad_output_quantizer` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit d7ab5dfb23e216866f7f4fc4d7a99f625d329f1e) * Make OperationFuser persistent Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 77984d9715d31e87519dc6ea1e02c483a81355a7) * Distribute extra inputs to and collect extra outputs from multiple module groups in Sequential Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 0716aaad542e59f2c1ac4620167965a0334bbf71) * Take requires_grad into account when fusing operations Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Change get_quantizer to return None if no quantization recipe is used Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Refactor pre_first_forward Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix for failing `test_make_graphed_callables[fp8_recipe0-*-True-*-linear_op]` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix linting errors Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix fp8 meta tensors in CUDA Graph capture Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix failing distributed userbuffers tests Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 12 Jul, 2025 1 commit
-
-
Jan Bielak authored
* Fix clearing tensor data in backward removing is_first_op Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Misc fixes Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Use Linear weight dtype and device for compute consistently Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add backward dbias + quantize fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Pass recipe to OperationFuser to allow recipe-dependent fusions Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove redundant view from activations Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add bias activation backward fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 08 Jul, 2025 1 commit
-
-
Jan Bielak authored
Add test for `LayerNormMLP` implementation using `te.ops.Sequential` to `test_fusible_ops.py` (#1924) * Add e2e test for LayerNormMLP implemented with te.Sequential Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix bugs uncovered by test Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reshaping columnwise_data of MXFP8Tensor Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fix taking dtype from weight or grad_output in BasicLinear._functional_backward Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 01 Jul, 2025 1 commit
-
-
Jan Bielak authored
* Replace `is_float8_tensor` with `is_quantized_tensor` Replace free function `is_float8_tensor` with `is_quantized_tensor` in `_common.py` and use it throughout the `ops` codebase to check if a tensor is a (possibly internal) quantized tensor Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Pass next and previous op quantizers directly to op_forward and fuser_forward Change interface of `fuser_forward` and `op_forward` to no longer take preceding and following ops and instead take the following op's input quantizer and preceding op's input gradient's quantizer directly Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove redundant use of `detach` in `BasicLinear` Remove use of `detach` in `BasicLinear` for improved performance (enabled by not passing prev_op to backward) Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Handle saving internal tensors Handle saving internal tensors in `_OperationFuserAutogradFunction` using `prepare_for_saving` and `restore_from_saved` Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Use internal tensors Enable use of internal tensors in `BasicLinear` quantizers and fix issues resulting from internal tensors not having methods that regular tensors have Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jan Bielak <jbielak@nvidia.com> --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 13 Jun, 2025 1 commit
-
-
Tim Moon authored
* Add FP8 current scaling to te.Sequential tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Helper function for test/ref tensors does not produce quantized tensor by default Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 current scaling to distributed te.Sequential tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 current scaling to Userbuffers te.Sequential tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug MXFP8 tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 12 Jun, 2025 1 commit
-
-
Evgeny Tsykunov authored
* Support L2Norm basic op Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Add L2Norm module wrapper Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Expose qk_norm to MHA and transformer layer Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Move tests into separate file Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix pass Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Add license Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Remove module Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Resolve comments Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Evgeny <etsykunov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 29 May, 2025 1 commit
-
-
Tim Moon authored
* Linear op avoids saving input tensor if weight grad is not needed Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Linear op forward avoids producing quantized tensors with unnecessary usages Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid unnecessary usages in fused linear ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 11 Apr, 2025 1 commit
-
-
Tim Moon authored
* Add option to cache activation input in FP8 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid casting to FP8 transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Skip input caching if device is not supported Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add documentation that FP8 input caching is experimental Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 09 Apr, 2025 2 commits
-
-
Tim Moon authored
* Debug checkpointing with te.Sequential Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
Explicitly specify quantized tensor usages needed for linear op backward Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 25 Mar, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 Nov, 2024 1 commit
-
-
Tim Moon authored
* Add activation ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warning Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Update to use QuantizedTensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Respect PyTorch autograd dtype Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename CastFloat8 op to Quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for fused dSwiGLU-cast-transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Nov, 2024 1 commit
-
-
Tim Moon authored
* Remove manual FP8 scale update for FP8 params Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 05 Nov, 2024 1 commit
-
-
Tim Moon authored
* Add layer norm op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 cast op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add tests for linear and layernorm with FP8 output Signed-off-by:
Tim Moon <tmoon@nvidia.com> * RMSNorm op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Replace LayerNorm module with LayerNorm op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Replace RMSNorm module with RMSNorm op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add AMP support Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Do not save autograd context if grad mode is disabled Debugging ONNX export tests. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Forward args in pre_forward func to base op class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update to use QuantizedTensor class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Apply suggestions from code review Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Review suggestions from @ptrendx Rename "CastFloat8" op to "Quantize". Add more fine-grained control for SM margin. Add docs for legacy sequence_parallel kwarg. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use weight dtype as default compute dtype Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com>
-
- 09 Oct, 2024 1 commit
-
-
Tim Moon authored
* Handle Float8Tensor when casting module dtype Keep data in Float8Tensor and only change nominal dtype. Monkey-patch PyTorch module casting functions to handle Float8Tensor. Add tests. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Respect autocast dtype in linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Suppress linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Suppress linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweak comments Review suggestion from @ptrendx Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 11 Sep, 2024 1 commit
-
-
Tim Moon authored
* Add base class for tensor proxies Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move tensor detaching logic to tensor proxy base class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use Python wrappers to PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include transpose caching logic in proxy encode function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug dimension mismatch with amax history Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move dequantize logic to proxy_decode func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename to "QuantizedTensor" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename "proxy_detach" to "detach" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include transpose cache in detach and clone funcs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update FP8 workspaces with QuantizedTensor functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move logic for FP8 transpose cache in FP8 workspaces to base class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove cast-transpose logic from linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary args for Float8Tensor when using FP8 attr dict Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove __torch_function__ to QuantizedTensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update tests/pytorch/test_float8tensor.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug FP8 transpose test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug cast functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Aug, 2024 1 commit
-
-
Tim Moon authored
* Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op that adds extra output to fuser Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for GEMM+bias+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for dgrad+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx: Output tensor dtype and device take precedence over weight tensor in linear functional API. Move some index calculation to fuser constructor. Avoid some unnecessary dereferences. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update transformer_engine/pytorch/ops/fuser.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Jul, 2024 1 commit
-
-
Tim Moon authored
* Add basic infrastructure for Sequential module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 support in linear op. Runs, but need to validate. Runtime errors with non-FP8 params and FP8 compute, or FP8 params and non-FP8 compute. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add reshape op and unit test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add bias op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unfused linear op. Test does not pass with FP8. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug unfused linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add test for linear+bias op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add separate abstract classes for unfused and fused ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Consolidate unfused ops in submodule Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add linear-bias fused op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use fused cast-transpose in linear ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable GEMM+bias fusion with FP32 activations. Not supported by cuBLAS. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add parallel unit test for unfused linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor parallel tests to reduce job launches Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add all-reduce, all-gather, and reduce-scatter ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unused file Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug multi-GPU FP8 test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for FP8 scale updates. Still need to implement amax reductions. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add license boilerplate Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fuse GEMM+bias in row TP. Add documentation for unfused ops. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename pipeline to fuser. Expand documentation. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweak documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Preserve cached FP8 transpose between ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add option for fused wgrad accumulation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Directly output FP8 from linear if needed Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix cuDNN front-end commit Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use updated FP8 tensor API for transpose caching Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use updated API for FP8 scale updates Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add tests for non-default FP8 recipes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename UnfusedOperation to BasicOperation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add unit test to check amax reduction with fusable op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Operator autograd state no longer needs to be initialized Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial functional implementation of linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug fused linear+bias op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove autograd context from functional linear impl Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use functional linear impl in fused linear+bias op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename subdirectory from "fuser" to "ops". Avoid confusion with kernel fusers and graph compilers. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update with Float8Tensor changes in #820 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary CPU overheads Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Correctly pass FP8 metadata from next op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter errors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add convenience functions to manipulate Sequential class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update name of PyTorch extensions module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Clear saved tensor data in linear op after bprop Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix Pylint error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update name of PyTorch extensions module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix test name in QA script Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update name of PyTorch extensions module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Run distributed tests even when only 1 GPU is available Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Only run distributed tests with 2 GPUs if there are >=2 GPUs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci * Review suggestions from @sudhakarsingh27 and @ksivaman: Fix spelling of "fusible". Avoid "input" name in internal APIs. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update transformer_engine/pytorch/ops/__init__.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-