- 25 Jul, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Remove deprecated device arg Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 24 Jul, 2025 3 commits
-
-
Oleg Goncharov authored
* Fixed integer overflow when computing offsets Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kshitij Lakhani authored
Fix cuDNN versioning support in PyTorch DPA and fused attention Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fix cuDNN lib runtime loading and simplify Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 22 Jul, 2025 2 commits
-
-
Daniel Stokes authored
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
-
Oleg Goncharov authored
* Fixed conflicts Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor code refactoring to avoid unnecessary checks Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed typo Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed dBias accumulation error due to initialization. Minor code refactoring Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Test case to reproduce the init error Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed rowwise dbias error Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changed ptx API Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added a struct for two packed FP8 values Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Rolled back to scalar code for columnwise scaling due to its better performance Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Minor corrections Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rebased on main Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes per code review Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed constexpr in C++ test suite to build faster Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Computed activations are now numerically truncated to InputType before scaling. Improved test suite. Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor refactoring Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Minor refactoring Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Modified mismatches checks of MXFP8 to address FP8 numerics Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Implemented Jeremy's fixes to JAX test suite with an intermediate downcast Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reduced the dims of the test tensors to improve CI runtime Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed memory alignment issue. Compute dbias without downcast. Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 21 Jul, 2025 1 commit
-
-
Charlene Yang authored
* exclude 9.10.0/.1 for certain configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix kv_channels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add get_backend to tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add init files Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix numerics and cuda graph tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor changes after renaming Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix import structure and rename get_attention_backends Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix docs and benchmarks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get backend calls Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix get backend calls" This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix docs and benchmarks" This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix docs, benchmarks and pre-commit ci Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix dpa/mha flash attn selection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix rng states Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix backend selection on Ampere Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix issues from last merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update tests/pytorch/utils.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove initialization of rng_states to None Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * redefine ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ModelConfig Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix seed for CP tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update tests/pytorch/test_sanity.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move fixture from utils to individual tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 18 Jul, 2025 1 commit
-
-
Charlene Yang authored
* update cudnn-frontend to 1.13.0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable 9.11 for a bug Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix selection logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 17 Jul, 2025 1 commit
-
-
Charlene Yang authored
* optimize kv_cache reindex and copy kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid reindexing from python side Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename variable from previous commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 16 Jul, 2025 1 commit
-
-
Tim Moon authored
* Add dtype checks in multi-tensor Adam Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid throwing exception in destructor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Jul, 2025 2 commits
-
-
Tim Moon authored
* Add run-time version checks in cuBLAS GEMM wrapper Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add run-time version logic for multicast Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix namespace error Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Autumn1998 authored
* fix nondeterministic problem in CI Signed-off-by:
tongliu <tongliu@nvidia.com> * fix bug on mbs>1 Signed-off-by:
tongliu <tongliu@nvidia.com> * fix bug on sm dispatcher Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CI initial values Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com>
-
- 12 Jul, 2025 1 commit
-
-
Jan Bielak authored
* Fix clearing tensor data in backward removing is_first_op Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Misc fixes Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Use Linear weight dtype and device for compute consistently Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add backward dbias + quantize fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Pass recipe to OperationFuser to allow recipe-dependent fusions Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Remove redundant view from activations Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Add bias activation backward fusion Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 10 Jul, 2025 1 commit
-
-
Autumn1998 authored
* add router fusion Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ci Signed-off-by:
tongliu <tongliu@nvidia.com> * fix ci with cuda 12.3 Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Review suggestions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix CI sm89/80 Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 26 Jun, 2025 1 commit
-
-
xiaoxi-wangfj authored
* [PyTorch|common] Implement unpadding kernel for FP8 1. Add multi-tensor unpadding kernel 2. Replace split+cat with unpadding kernel in Fp8Padding and Fp8Unpadding 3. Add unpadding with padding unit tests Signed-off-by:
xiaoxi-wangfj <690912414@qq.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add license Signed-off-by:
Xin Yao <xiny@nvidia.com> * Update padding.cu Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
xiaoxi-wangfj <690912414@qq.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com>
-
- 16 Jun, 2025 1 commit
-
-
Hua Huang authored
* Support MXFP8 and handle empty matrices Signed-off-by:
Hua Huang <huah@nvidia.com> --------- Signed-off-by:
Hua Huang <huah@nvidia.com>
-
- 13 Jun, 2025 2 commits
-
-
Charlene Yang authored
* add support for head dim > 128 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * remove debugging Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * raise tols slightly to tolerate 1/2048 mismatches Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix is_training for test_te_layer Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add bprop support for blackwell Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweak for format Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix backend selection results Signed-off-by:
Charlene Yang <charleney@nvidia.com> * bump sm100 to sm100+ Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add sq=1 test for MLA Signed-off-by:
Charlene Yang <charleney@nvidia.com> * enable sq=1 for bprop Signed-off-by:
Charlene Yang <charleney@nvidia.com> * minor tweak in comments Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix head_dim logic and remove pytest skip Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add FE fix for d>128 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * update FE again to take in small fixes Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add cuDNN version info in L0 tests Signed-off-by:
Charlene Yang <charleney@nvidia.com> * increase tols for Unfused + large dim Signed-off-by:
Charlene Yang <charleney@nvidia.com> * Revert "add cuDNN version info in L0 tests" This reverts commit 3e1b426ca5319a2c0540b9e73bba7047d0e583e5. Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix tols for Unfused Signed-off-by:
Charlene Yang <charleney@nvidia.com> --------- Signed-off-by:
Charlene Yang <charleney@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Oleg Goncharov authored
* Added support of FP4 data type Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring to BitsNum in progress Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed compilation errors. All C++ tests passed Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed a typo Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added FP4 guard to TMA tensor descriptor data type Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed errors in JAX C++ extensions Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed dummy NVFP4 C++ test file Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Make pytorch changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Refactored the code per the review notes. Fixed JAX build error. Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed unnecessary static casts Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Typo fix Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Pass correct num bits to create_2D_tensor_map; fixes CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * inline funcs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Jun, 2025 3 commits
-
-
Phuong Nguyen authored
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8 * Fix GroupedGemmFFI cuBLAS workspace alignment bug Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Hua Huang <huah@nvidia.com>
-
Phuong Nguyen authored
Revert "[JAX] GroupedDense v.2 without dynamic shape (#1721)" This reverts commit 5d01ef21. Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* Implemented GroupedDense and TestGroupedDense for BF16, FP16, and FP8 * Fix GroupedGemmFFI cuBLAS workspace alignment bug Signed-off-by:
Hua Huang <huah@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Jun, 2025 3 commits
-
-
Alp Dener authored
* added missing deallocs in Userbuffers destroyer Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* refactor the multi_stream utils + implement nvte_multi_tensor_quantize in TE/Common * implement GroupedQuantizer and grouped_quantize in JAX * fix logical_axes_names for transpose tensor in ScaledTensor Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Ming Huang <mingh@nvidia.com>
-
Zhongbo Zhu authored
[PyTorch] FP8 Subchannel Recipe With FP8 Gather And Configurable Scaling Factor Tensor Swizzling (#1707) * functional kernel for columnwise + no-transpose option, still hacky Signed-off-by:
zhongboz <zhongboz@nvidia.com> * pass all quantizer unit tests Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor, add gemm ready api Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make format options private members, simplify api Signed-off-by:
zhongboz <zhongboz@nvidia.com> * swizzle scales right before gemm Signed-off-by:
zhongboz <zhongboz@nvidia.com> * bug fix of single layer test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * attempt to fix lint issue Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fp8 gather pass, need minor refine Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix return_layernorm_output_gathered case Signed-off-by:
zhongboz <zhongboz@nvidia.com> * remove special cases, add sanity check before gemm Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint ungrouped imports Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Implement dequantize for compact 1D blocks. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * add more unit test with dequantize compact supported Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint again Signed-off-by:
zhongboz <zhongboz@nvidia.com> * make ag for subchannel respect async Signed-off-by:
zhongboz <zhongboz@nvidia.com> * zero tolerance in distributed test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix zero tolerance test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve rebase issues Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint & format Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up Signed-off-by:
zhongboz <zhongboz@nvidia.com> * bug fix Signed-off-by:
zhongboz <zhongboz@nvidia.com> * relax rtol for fp32 distributed test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix some ci issue Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix ci test failure in debug mode Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Force row-wise and column-wise data to have same data format Prototype "all-gather usage" in quantizer. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove dead logic for high-precision AGs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug FP8 block-wise tensor tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug distributed test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Handle case where LayerNormLinear returns gathered norm output Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix debug mode Signed-off-by:
zhongboz <zhongboz@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Keith Wyss <kwyss@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 05 Jun, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Use versioned flavor of get driver entrypoint function Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Update the check to call the versioned API starting with CUDA 12.5 where it was added Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Dynamically find entrypoint functions Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Error checking Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Lint fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 02 Jun, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Minor build improvements Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 30 May, 2025 1 commit
-
-
Tim Moon authored
Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 29 May, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Changed the Tensor allocation strategy Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Disable debug flag Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the double free error Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fixed pyTorch recipe extension Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Hide TensorAllocator and fix the usage in LayerNorm Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleaning Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix permutation Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 28 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Fix single FW build with multi FW available Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Some fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * sug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 May, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Modify the test cases Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Make the tests reproducible on different machines Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fixed the cache of the gamma_in_weight_dtype setting Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Reinstate the tests Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * More verbose code and comments Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 22 May, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
Document all recipes Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Build support for cuda 13 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix build for cudnn 8.9*; cuda 12.1 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * readd include Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add missing docs for C API Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Grammar, typos, copy-paste errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove contiguous word Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better wording Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 May, 2025 1 commit
-
-
guyueh1 authored
* Fix split_overlap_rs aggregate=True chunk offset calculation Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add unit test for aggregate=True Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix unit test Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Guyue Huang <guyueh@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 19 May, 2025 1 commit
-
-
Evgeny Tsykunov authored
* Check tensor-recipe compatibility Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Tensor class in recipe, checking for *Base Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Extend recipe __repr__ with recipe_type Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Warn about recipe change Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Enable dynamic recipe change: clear fp8 workspace Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * TE 1.x checkpoint compatibility Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Disable warning for recipe wrappers Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Test recipe change Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use QuantizedTensorBase Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Fix circular import Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Revert previous circular import fix Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * Fix pytorch imports in common Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Let quantizer know about the recipe Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix imports Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> --------- Signed-off-by:
Evgeny Tsykunov <etsykunov@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Cleanup runtime library loading Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better comments and logic Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix catching stray builds Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix missing fw case Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor grammar Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix duplicate SO for editable installs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better comment for build ext Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Improve error msg Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 11 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* First pass refactor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * first pass Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * core compiles Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Include cuda dirs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compiles Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move grad outside autocast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix kv cache Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change src file name in cmake Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * move the kernels too Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move comment Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move comments around Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * more movement Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * move Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 May, 2025 1 commit
-
-
Tim Moon authored
* Initial work toward restoring UB support in te.Sequential Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Forward UB linear runs, but has numerical error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug UB forward tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor tweaks Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove Python checks for MXFP8 UB linear forward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add dim check for MXFP8 full tiles Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move QuantizedTensor logic out of UB comm and into Python helper function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Support MXFP8 AGs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Coalesce NCCL all-gathers for MXFP8 all-gather Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial impl of backward UB linear in te.Sequential Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug UB linear backward with no quantization. dgrad GEMM + dx RS is still broken. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix chunk dims for dgrad GEMM + dx RS Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debugging MXFP8 UB cases. Still failing with dy AG + wgrad GEMM. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use NCCL to overlap dy AG with dgrad GEMM Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug UB GEMM tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initial refactoring of linear module forward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor linear module backward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug linear module UB tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Tweak test tensor dims Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not store autograd context within wgrad GEMM closure Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update LayerNormLinear Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update LayerNormMLP Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug UB tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor style tweaks Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix incorrect usage for GEMM input with block-scaled FP8 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix RS out dims Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable dgrad GEMM + UB AG + NCCL AG overlapping Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Disable dgrad GEMM + UB AG + NCCL AG overlap in te.Sequential Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Restore support for internal quantized tensors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add tests for MXFP8 GEMM with UB Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 05 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Move multi tensors kernels from PyTorch extensions to core Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add int16 type to core (for storing fp32 param remainders) Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix core build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * same fix to scale Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix perf, memory, vars Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Re-add device guard for multi-device Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix junk output dtype for non-per tensor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for test and upgrade mcore version Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix core tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 01 May, 2025 1 commit
-
-
Tim Moon authored
Debug UB conversion function from NVTEShape to std::vector Signed-off-by: Tim Moon <tmoon@nvidia.com>
-