- 10 Nov, 2025 2 commits
-
-
Teddy Do authored
* Changing default activations in MLP, TransformerLayer, dropout rate after FC1 to 0, and return_layernorm_output to False Signed-off-by:
tdophung <tdophung@nvidia.com> * Fixing the failing tests by hard coding arguments to the previous values instead of relying on newer default values Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tdophung <tdophung@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Teddy Do authored
* move triton to common and change paths Signed-off-by:
tdophung <tdophung@nvidia.com> * Formatting Signed-off-by:
tdophung <tdophung@nvidia.com> --------- Signed-off-by:
tdophung <tdophung@nvidia.com>
-
- 07 Nov, 2025 4 commits
-
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * deprecated compile time warning + \warning -> \deprecated Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Kshitij Lakhani authored
* Default to fused attention in JAX DPA Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Consolidate documentation for DPA in JAX Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> * Correctly update the documentation for defaults in JAX DPA Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Fix cuDNN backend selection for more cases. Add CG as an option as well Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix logic Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cuDNN checks Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add more checks Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cuDNN version Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix error message Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add check for window size Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Michael Goldfarb authored
-
- 06 Nov, 2025 2 commits
-
-
Kunlun Li authored
* Make cast_master_weights_to_fp8 compatible with older MCore version Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Rename keep_columnwise to manual_post_all_gather_processing & Optimize unit test Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove redundant _test_mini_optimizer() Signed-off-by:
kunlunl <kunlunl@nvidia.com> --------- Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 05 Nov, 2025 2 commits
-
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 04 Nov, 2025 1 commit
-
-
Paweł Gadziński authored
* code drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 31 Oct, 2025 2 commits
-
-
Oleg Goncharov authored
Deleted unused header Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
-
jberchtold-nvidia authored
* Fix mesh resource requirement when no mesh Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * do not require meshresource if all axes are manual axes Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * remove abstract_mesh is None check Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 30 Oct, 2025 5 commits
-
-
Kshitij Lakhani authored
[PyT] Bump the min version expected to support FP8 current scaling determinism on Blackwell (#2316) * Bump the min version expected to support FP8 cs det on Blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Disable fused attn for cudnn < 9.14 for FP8 CS. Disable fused attn for cudnn < 9.18 for FP8 deterministic CS Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Oleg Goncharov authored
* Separated gated and dequantize kernels Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Separated quantize, dequantize and gated functions Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed lint issues Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed persistent lint issues Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added missing compute capability 10.0 check for Quantize FP8 TMA kernels Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed the issue which was added again by autofix Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changed files description. Completely removed non-identity activations from the NVFP4 transpose test suite Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Removed unsupported template arguments in NVFP4 quantize Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed undefined symbol error Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed condition Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Fixed CUDA version check Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changed arch conditions order Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Clean up Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Small fix Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Small fix Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixes per the PR review Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fix Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Split quantize helper into two (FWD and BWD) functions Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Moved activation functions from cast.cu. Removed cast.cu from the fast-math compilation list Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Enabled fast math for activations by default Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Disabled fast math for activations by default Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Fix attention backend and tests for sm120 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Disable MLA only for backward Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kshitij Lakhani authored
* Fix: Skip determinism tests for bprop for all sm >=100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add username to TODO Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100+ Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
fix max jobs Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 29 Oct, 2025 1 commit
-
-
vthumbe1503 authored
* changes working Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for onnx, minor comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * greptile review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * Update transformer_engine/pytorch/transformer.py Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_mlp.py Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * Update transformer_engine/pytorch/transformer.py Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * revert the name change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com>
-
- 28 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* jax norm + te quant Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 27 Oct, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Remove nvidia-mathdx dep Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix SR Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add comment Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tong Liu authored
dummy wgrad Signed-off-by:
tongliu <tongliu@nvidia.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Xin Yao <xiny@nvidia.com>
-
- 25 Oct, 2025 1 commit
-
-
Charlene Yang authored
* add max_score for fused/unfused F16 non-CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * calculate max per head instead of max over all heads Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn max_score shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert FE to github Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE to 1.15.0-rc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reduce ew kernels; fix causal masks; add more tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix to tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove logic for flash-attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add CP support for p2p/a2a/all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor improvements of implementation/tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: add thd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add thd to UnfusedDPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to FE 1.15 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unneeded changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable unfused for thd + pad_between_seqs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable thd for unfused until bug is fixed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename max_score to max_logit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable fused attn + thd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 24 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
fix checks in unoptimized non-rht fp4 quantize kernel Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
buptzyb authored
* support cudagraph dw Signed-off-by:
Robin Zhang <robinz@nvidia.com> * fix lint Signed-off-by:
Robin Zhang <robinz@nvidia.com> * fix ci Signed-off-by:
Robin Zhang <robinz@nvidia.com> --------- Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 Oct, 2025 3 commits
-
-
Paweł Gadziński authored
* fix perf issue Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
jberchtold-nvidia authored
* Make SR rng state always 2D (num_devices, 4) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax impl Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix test shape Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
Przemyslaw Tredak authored
* Added sm_120f to the build Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Change the arch specific handling Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Support for CUDA<12.9 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Moved through the rest of the files Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Common cases Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Remove pure 100 from the list Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * CMake changes, (not yet working) Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Do not pass the arch-specific thing from build_tools Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Moved some of the files to arch-specific compilation Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix and also changing the order of compilation to hopefully get the compilation time lower Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix for the files overwriting custom compile properties Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Actually make this whole thing work Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add space to the error message Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> * Apply suggestions from code review Co-authored-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> * Fixes from review Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Changing the naming to be more intuitive Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add missing cassert include for device-side asserts Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
-
- 22 Oct, 2025 3 commits
-
-
jberchtold-nvidia authored
Defer cublas check on fp8 gemms until lowering Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add test for SR state preservation across VJP boundaries Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix sharding of SR rng state Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update tolerances slightly now that SR is enabled Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use hashlib for deterministic hashes across runs for SR Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * rename uses_rht on scaled tensors to has_applied_rht Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add assert Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Move decision of whether to use RHT into helper.py and add dedicated RHT tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix use_rht attr usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax rht usage criteria Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Adjust tolerances after rebase Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
Evgeny Tsykunov authored
* rename experimental -> custom_recipes Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Decouple python base classes (api) Signed-off-by:
Evgeny <etsykunov@nvidia.com> * update test_custom_recipe Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Rename experimental -> custom Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Minor Signed-off-by:
Evgeny <etsykunov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix import Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Update tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Evgeny Tsykunov <e.tsykunov@gmail.com> * Update tests/pytorch/test_custom_recipe.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Evgeny Tsykunov <e.tsykunov@gmail.com> * quantization_base -> quantized_tensor rename Signed-off-by:
Evgeny <etsykunov@nvidia.com> --------- Signed-off-by:
Evgeny <etsykunov@nvidia.com> Signed-off-by:
Evgeny Tsykunov <e.tsykunov@gmail.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 Oct, 2025 2 commits
-
-
Kunlun Li authored
* Add post-processing API for FP8 primary weights to support CUDA Graph Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Add post-processing support for plain pytorch tensors Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Update type hint Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
kunlunl <kunlunl@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Zhongbo Zhu authored
* pipeclean, fix nvfp4 padding of 32 alignment Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * numerical test passed Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fix CI failure with test_cast_master_weights_to_fp8 (in a hacky way) Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * found CUDA mis-aligned address error in training in multi-swizzle, hack the vec_load_size to 1 to unblock Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * leave comments about alignment issue Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fused bulk alloc nvfp4 Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fix RHT sign mask CPU overhead Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fix Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * resolve comments Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * Remove incorrect logic that treats 0-D tensor as uninitialized. Tensor shape logic still requires treating 0-D tensor as uninitialized. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix invalid conversion from tensor to int Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 20 Oct, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Fix CI failures due to deterministic attention Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * some more cleanup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix debug test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
fzyzcjy authored
* Update permutation.py Signed-off-by:
fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> * Update permutation.py Signed-off-by:
fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> * Update transformer_engine/pytorch/triton/permutation.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/pytorch/triton/permutation.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 18 Oct, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Support wheel build for cuda 13 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for cu13 runtime, format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add documentation Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better error handling Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix jax sdist Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Modify function names Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Oct, 2025 4 commits
-
-
Alp Dener authored
Make `CanonicalizeGemmInput()` support non-TN layout FP8 GEMM on Blackwell with column-wise/transposed data (#2233). Modified `CanonicalizeGemmInput()` logic to pull from column-wise data for FP8 GEMM on Blackwell when row-wise is not available. Signed-off-by: Alp Dener <adener@nvidia.com>
-
Haowen Zheng authored
Signed-off-by:
将来 <jianglai.zhw@alibaba-inc.com> Co-authored-by:
将来 <jianglai.zhw@alibaba-inc.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Geypens authored
Signed-off-by: Tim Geypens <tim.geypens@gmail.com>
-
Kevin Tong authored
* CUDA RHT Signed-off-by:
Kevin Tong <kevin@augmentcode.com> * Fix cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix bug where RHT mask is tensor instead of int Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Kevin Tong <kevin@augmentcode.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-