- 30 Oct, 2025 1 commit
-
-
Kshitij Lakhani authored
* Fix: Skip determinism tests for bprop for all sm >=100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add username to TODO Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100+ Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 28 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* jax norm + te quant Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 25 Oct, 2025 1 commit
-
-
Charlene Yang authored
* add max_score for fused/unfused F16 non-CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * calculate max per head instead of max over all heads Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn max_score shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert FE to github Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE to 1.15.0-rc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reduce ew kernels; fix causal masks; add more tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix to tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove logic for flash-attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add CP support for p2p/a2a/all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor improvements of implementation/tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: add thd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add thd to UnfusedDPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to FE 1.15 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unneeded changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable unfused for thd + pad_between_seqs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable thd for unfused until bug is fixed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename max_score to max_logit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable fused attn + thd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 23 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
* Make SR rng state always 2D (num_devices, 4) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax impl Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix test shape Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 22 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
Defer cublas check on fp8 gemms until lowering Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add test for SR state preservation across VJP boundaries Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix sharding of SR rng state Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update tolerances slightly now that SR is enabled Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use hashlib for deterministic hashes across runs for SR Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * rename uses_rht on scaled tensors to has_applied_rht Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add assert Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Move decision of whether to use RHT into helper.py and add dedicated RHT tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix use_rht attr usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax rht usage criteria Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Adjust tolerances after rebase Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 18 Oct, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Support wheel build for cuda 13 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for cu13 runtime, format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add documentation Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better error handling Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix jax sdist Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Modify function names Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Oct, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Initial API change Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change all imports and api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix typo Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix recipe tets Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix more tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix docs, tests, and make Jax change as well Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change internal uses of fp8_autocast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address nits Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rename file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CG function, and small test fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change instances of make_graphed_callables internally Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix distributed tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix test and add more docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup test imports and minimize internal file imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Make is_bf16_available public Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better docs and better api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * fix nvfp4 test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Kshitij Lakhani authored
* Add BRCM support when creating a test mask for fused attn Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add support for BRCM to correctly generate the mask needed for calculating the seqlens and offsets for THD Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Skip drop=0 and no_bias case for BRCM as cuDNN does not suport this Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Skip BRCM test cases where max_seqlen_q > max_seqlen_kv Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Refactor the segment id run length code for BRCM seqoffset and seqlens calculations Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix the drop inequality skip condition in fused attn Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * nit: Adjust the BRCM id name in the test to make it consistent Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix the brcm mask condition. Fix the condition for cross atnn type pattern to only apply for brcm Change the num segments per sequence to 3 instead of 2 Reduce one test pattern data size and make it such that it triggers brcm Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint errors Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix incorrectly changed dtype to numpy bool_ rather than native python bool Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Restore the numsegments to earlier value Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add example for THD BRCM Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 13 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
assertion check Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* Improve error message for cublas fp8 gemm with incorrect shape Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Removed unnecessary non-contracting size check Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * rename inner dim -> leading dim Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 09 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Update minimum python version to 3.10 and update CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Oct, 2025 1 commit
-
-
Hua Huang authored
* Try async copy of grouped GEMM group_sizes data Signed-off-by:
Hua Huang <huah@nvidia.com> --------- Signed-off-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* reuse amax for current scaling Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* not fuse bias for output all reduction case + unit tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * norm to reduce dgamma along tpsp as well Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix test_distributed_layernorm byte counts Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * increase tols for jax_gemm Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 03 Oct, 2025 1 commit
-
-
vthumbe1503 authored
Signed-off-by:Varun Thumbe <vthumbe@nvidia.com> *Jax integration for clamped swiglu. This is the continuation of PR which added Clamped Swiglu(used in GPT OSS) support in TE along with Pytorch integration. This PR hooks up the clamped swiglu and dswiglu's nvte APIs to TE Jax.
-
- 02 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
Fix shard map issue Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
jberchtold-nvidia authored
Fix code block in fp8_autocast docstring Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 01 Oct, 2025 2 commits
-
-
Phuong Nguyen authored
fix rng_state shape Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
-
Phuong Nguyen authored
* rm using_global_amax_of_x Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 30 Sep, 2025 1 commit
-
-
jberchtold-nvidia authored
Load modules during initialize Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
JAX Toolbox <jax@nvidia.com>
-
- 27 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
* init cgemm + unit tests * UB bootstrap with NCCL, no MPI dependency * add NVLINK-P2P check + error message * skip tests if no NVLINK available * use std::vector to store ncclComm_t * update misuse of TP warning Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 23 Sep, 2025 2 commits
-
-
Phuong Nguyen authored
* Rework shardy rules * WAR for compound factor=1 Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Ming-Xu Huang authored
* Adding Amax Primitive and related args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Enable local-amax for current-scaling and optionally run AR aross FSDP/TP/SP. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding doc for Amax Primitive. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix the function name conflict. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Modification as feedback suggested. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix errors from lint. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix the wrong amax-scope in the bwd. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Added more description for amax-scope Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix the wrong attribute name. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Keep dim for AmaxCalcuation. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Remove keepDim and add shardy_rule Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix shardy_rule Signed-off-by:
Ming Huang <mingh@nvidia.com> * Remove extra-collective bytes from ref_coll_count due to local amax. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Signed-off-by:
Ming-Xu Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 22 Sep, 2025 2 commits
-
-
Charlene Yang authored
* first draft; debug plan failure Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * debug uid error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak params Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add grad in output Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix prints in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * address review comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix unfused grad; add softmax_type; add sink to bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix padding mask; add swa tests; remove requires_grad for off-by-one Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:
Chen Cui <chcui@nvidia.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix indent Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix non-determinism and shapes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add GQA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add CP A2A; dq/dk mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix CP A2A; need cleaner solution Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix CP A2A; pending cudnn kernel change Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix world size in unit test; avoid thd format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix kernel_backend, dtype in unit test; fix head_dim for FP8 Hopper Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix thd logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 context Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak CP logging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * allow no_mask/padding for SWA(left,0) Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "allow no_mask/padding for SWA(left,0)" This reverts commit 08b4ccc67a08b6882080b06aa715f541bb832aca. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add softmax_type to Jax Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add cuDNN version control Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * prettify tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip 9.13 for MLA, non 192/128 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename compare_with_error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * small cleanups and improvements Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix minor CI failures Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * force sink/dsink to be float32 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * switch FE to GH FE Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * return to GH TE main FE commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE to 1.14.1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up before CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * bump up cudnn version Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add backend selection guard for unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for softmax type enums in C Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Chen Cui <chcui@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* remove import jax.extend.ffi Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 18 Sep, 2025 1 commit
-
-
alan yang authored
* feat: add cutlass group gemm support Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor multi tensor gemm interface Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: refactor nvte_multi_stream_cublas_gemm func and add license info Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add unit test for cutlass group gemm Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add cutlass support type protect Signed-off-by:
Min Yang <min.yang@shopee.com> * add tests and fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: fix unit tests error Signed-off-by:
Min Yang <min.yang@shopee.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: refactor host workspace malloc Signed-off-by:
Min Yang <min.yang@shopee.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * update cutlass Signed-off-by:
Xin Yao <xiny@nvidia.com> * further relex threshold and add a env var to warn fall back Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Min Yang <min.yang@shopee.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
alan yang <89962857+cassiewilliam@users.noreply.github.com> Co-authored-by:
Min Yang <min.yang@shopee.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 15 Sep, 2025 1 commit
-
-
Ming-Xu Huang authored
* Applying the original precision as Norm outputs' and activation compuations. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding knob to control norm output precision. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Removing the knob and applying lower-precision norm with current-scaling only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix the error when quantizer==None Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com>
-
- 09 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
* add swizzle in jax Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added outer_impl Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 08 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
Fix GroupedScaledTensor creation Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Sep, 2025 1 commit
-
-
jberchtold-nvidia authored
* Custom call tests passing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix shardy issue with amax being shape 1,1,1 instead of shape (1,) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add higher-precision VJP tests to test_distributed_layernorm_mlp Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Cast non-quantized kernels to input dtype in VJPs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename HighPrecisionTensor to NoScaleTensor Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use NoScaleTensor in pure JAX impls where it was missing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 03 Sep, 2025 1 commit
-
-
Kshitij Lakhani authored
* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the skip message Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Add check for sm100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support to get all devs in the process for jax Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code clean up Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Represent attn bias using enum instead of string Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 27 Aug, 2025 5 commits
-
-
Phuong Nguyen authored
* add amax input to DBiasQuantizePrimitive and FFI Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make sure amax is init with zero Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix sharding rule Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
jberchtold-nvidia authored
* Decouple recipe and scaling mode Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Expose global QuantizeConfig instance as a getter Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format and lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Rename UsageType to TensorSource Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
-
jberchtold-nvidia authored
Delay MeshResource validation until first usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Ming-Xu Huang authored
* FP8 AllGather in FP8 GroupedGEMM 1. Support current scaling FP8 quantation with a given amax. 2. Support FP8 AG in fwd and BF16 RS in bwd. 3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Slightly refactor Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding documents of new args. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding unit-tests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding license. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move unit-tests to L1. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Move quantizaer store/reset into FP8 only. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adding all layout support for Blackwell+ Signed-off-by:
Ming Huang <mingh@nvidia.com> * Adopt the feedback from code-review. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed the wrong stream used by d2d in groupedGEMM FFI. Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 26 Aug, 2025 2 commits
-
-
Phuong Nguyen authored
* clean up sharding Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added tpsp_resource Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * update tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * rework test for MeshResource Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add mesh_resource into fp8_autocast in test_helper.py Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* added amax as an optional arg Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-