- 14 Feb, 2025 3 commits
-
-
Phuong Nguyen authored
JAX lint fix.
-
Phuong Nguyen authored
* fixes L1 test
* fix test_multigpu_encoder
* fixes for other multi-encoder tests
* jax.extend.ffi to jax.ffi
* initialization with float32
* add init_dtype as an optional arg to all modules
* update use_scan query from XLA flags
* relax threshold for test_encoder fp8
* relax the tols
-
Phuong Nguyen authored
* initialization with weight_dtype
-
- 13 Feb, 2025 1 commit
-
-
Xin Yao authored
* fix a bug for at::from_blob with nullptr
* fix a bug for non-TN
Co-authored-by: Tim Moon
-
- 12 Feb, 2025 2 commits
-
-
Przemyslaw Tredak authored
* Updated docs for TE 2.0
* Do not expose comm_gemm_overlap and cast_transpose_noop
* Made the figures larger
* Apply suggestions from code review
* Update quickstart_utils.py
* Change from review
Co-authored-by: Tim Moon, Kirthi Shankar Sivamani
-
Jaemin Choi authored
Co-authored-by: Tim Moon
-
- 11 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
* flax module to init params with given dtype
* all tests passed
* remove unnecessary reshape for kernel
* remove casting output of dot
* clean up
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
-
- 31 Jan, 2025 1 commit
-
-
Selvaraj Anandaraj authored
* Initial commit
* Fixed compilation errors
* Fixed syntax errors
* Fixed NaN issue when initial param value is zero
* Removed 64-bit indexing instantiation
* Made this feature an opt-in
* Removed arg from unscaled state
* Fixed compilation error
* Cleaned up errors
* Added support for checkpointing
* Fixed checkpointing logic
* Added tests
* Added assert failure for capturable mode
* Fixed pylint errors
Co-authored-by: Tim Moon
-
- 28 Jan, 2025 1 commit
-
-
Sergii Dymchenko authored
torch.log1p() is more accurate than torch.log() for small input values (https://pytorch.org/docs/stable/generated/torch.log1p.html). Found with TorchFix (https://github.com/pytorch-labs/torchfix/).
Co-authored-by: Xiaowei Ren, Tim Moon
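The advantage of log1p is easy to demonstrate with Python's standard `math` module (used here as a stand-in for the `torch` functions this commit touches): for tiny x, the sum `1 + x` rounds away the input before `log` ever sees it, while `log1p` computes log(1 + x) without forming the intermediate sum.

```python
import math

x = 1e-18  # far below double-precision epsilon (~2.2e-16)

naive = math.log(1.0 + x)   # 1.0 + 1e-18 rounds to exactly 1.0, so this is 0.0
accurate = math.log1p(x)    # evaluates log(1 + x) directly, keeping x's contribution

print(naive)     # 0.0 -- all information about x is lost
print(accurate)  # ~1e-18 -- correct to full precision
```

The same reasoning applies to `torch.log1p` versus `torch.log(1 + t)` element-wise on tensors.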
-
- 27 Jan, 2025 1 commit
-
-
hx authored
* add mask-based MoE permutation
* change moe_chunk_permute to moe_sort_chunks_by_indices
* fix __all__ in pytorch/permutation.py
* fix func/var names and typos; update tols in UT
Co-authored-by: Phuong Nguyen, Tim Moon
-
- 24 Jan, 2025 1 commit
-
-
Reese Wang authored
* POC for segment_ids/segment_pos
* Change segment_pos position
* Use RemainingArgs to solve number-of-parameters mismatches
* Test mask_descriptor for accommodating different mask representations
* Fix bugs
* Use descriptor in bwd
* Primitives only accept pure jnp arrays
* segment_ids/pos support POC
* Move seqlens/offsets generation to mask descriptor
* Rename MaskDescriptor to SequenceDescriptor
* Generalize get_seqlens_and_offsets
* Utilize sequence desc on FA bwd
* Migrate to new API
* Add docstrings
* Remove small inputs and test different input formats
* Fix lint
* Fix seed shardings
* Optimize sequence converting overhead
* Optimize seq_offsets calculation
* Fix up
* Fix conflicts
* Remove redundant line
-
- 22 Jan, 2025 1 commit
-
-
Tim Moon authored
* Avoid `parameters` function in op backward pass
-
- 21 Jan, 2025 1 commit
-
-
Charlene Yang authored
Only compare the recipe in AttentionParams.fp8_meta.
-
- 16 Jan, 2025 1 commit
-
-
Alp Dener authored
* corrected RS overlap BF16 output clashing with the Float8Tensor constructor
* fixed empty dgrad buffer dtype at initialization
-
- 15 Jan, 2025 1 commit
-
-
guyueh1 authored
* Add a compile option to build activation kernels with fast math
* Fix
* Apply suggestions from code review
Co-authored-by: Kirthi Shankar Sivamani
-
- 13 Jan, 2025 1 commit
-
-
Alp Dener authored
* support AG overlap in sequence-parallel Linear forward and RS overlap in sequence-parallel Linear backward
* implemented TP overlap support for column-parallel te.Linear
* fixed backward pass for column-parallel te.Linear with TP overlap; updated unit tests
* improved error messages when TP overlap options cannot be inferred in te.Linear
* fixed linting errors
* fixed incorrect TP overlap option asserts
-
- 10 Jan, 2025 1 commit
-
-
Xiaowei Ren authored
Take token-count quantization of fused attention into consideration for CP results correction (#1396)
* fix second-half lse shape
* bug fixes
-
- 08 Jan, 2025 4 commits
-
-
Xiaowei Ren authored
* make the pad_between_seqs check ignore padding at the end
* change the CP THD test to cover 0-length sequences
* minor change to flash func name
* only use the varlen flash-attention func while qkv_format is THD
* converge the code paths of flash and fused attention
* fix bwd compute with P2P
* remove redundant out_per_step view
* enable cuDNN > 9.6 and THD+GQA
* enable CP with FusedAttn+SWA+All_Gather
* code cleaning for cu_seqlens
* fix pylint errors
* fix lse_seqlen in THD out correction
-
Michael Goldfarb authored
Correct fused attention output after each step to reduce intermediate memory use.
-
Liyuan Liu authored
The current implementation would release the output of LayerNorm, leading to an error when `return_layernorm_output=True` is set.
Co-authored-by: Tim Moon
-
Reese Wang authored
* Fix the SWA mask for THD and force seqlen_kv >= seqlen_q for SWA
* Generalize the sliding-window mask
* Fix pylint
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
-
- 20 Dec, 2024 1 commit
-
-
Charlene Yang authored
* add swa (left,0) + padding + BRCM support
* final fixes
* upgrade to FE 1.9-rc
* fix jax tests
* skip thd + CP + fused attn tests for cuDNN 9.6+ due to different stats shapes
-
- 18 Dec, 2024 1 commit
-
-
Charlene Yang authored
* fix get_swa_mask for padding
* fix mask type setting
* fix the order of checking valid SWA and changing the mask type
* fix lint
* revamp to get the full mask
-
- 17 Dec, 2024 2 commits
-
-
Reese Wang authored
* Add util functions to attn_mask_type
* Add util functions to qkv_layout
* Fix THD cross-reference code
* Remove explicit segment_pad by encoding it into segment_ids
* Add jax.jit, replace _token with segment_ids, rename the bias shape enum
* Add a comment for make_mask
* Clean code
* Add docstrings for the added functions
* Remove the cache for fa deterministic, which caused UT failures
* Rename a fixture to avoid conflicts
-
Charlene Yang authored
Add max_t for KV.
-
- 16 Dec, 2024 1 commit
-
-
Youngeun Kwon authored
* draft implementation of FSDP2 FP8 all-gather
* fix the convergence issue
* add warning
* fix lint errors
* add comments
* add ref
* add related tests
-
- 14 Dec, 2024 1 commit
-
-
Phuong Nguyen authored
* softmax custom calls with correct encapsulation
* remove deprecated JAX features
-
- 12 Dec, 2024 1 commit
-
-
Phuong Nguyen authored
* fix ctx.aval_out indexing for workspace
* add cuDNN init to the prepare phase of norm custom calls
* add thread_local for the norm registry instance
-
- 10 Dec, 2024 1 commit
-
-
Reese Wang authored
* Bug fix: use a default factory to avoid sharing mutable default values
Co-authored-by: Phuong Nguyen
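The bug class this fixes is the classic shared-mutable-default pitfall. A minimal Python sketch (the function and field names here are illustrative, not the actual TE code): a mutable default is created once at definition time and shared by every caller, whereas `dataclasses.field(default_factory=...)` builds a fresh object per instance.

```python
from dataclasses import dataclass, field

def buggy_append(item, bucket=[]):
    # The [] default is evaluated once, when the function is defined,
    # so every call without `bucket` mutates the same shared list.
    bucket.append(item)
    return bucket

print(buggy_append(1))  # [1]
print(buggy_append(2))  # [1, 2] -- state from the previous call leaks

@dataclass
class AttnConfig:
    # default_factory runs at each instantiation, so every AttnConfig
    # gets its own list instead of one shared module-level object.
    segment_ids: list = field(default_factory=list)

a, b = AttnConfig(), AttnConfig()
a.segment_ids.append(0)
print(a.segment_ids, b.segment_ids)  # [0] [] -- b is unaffected
```

Note that a dataclass rejects a bare mutable default (`segment_ids: list = []` raises `ValueError`) precisely to force this pattern.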
-
- 06 Dec, 2024 1 commit
-
-
Phuong Nguyen authored
* cuDNN normalization integration
* TE Norm refactor
* TE Norm API changes
Co-authored-by: Tim Moon
-
- 05 Dec, 2024 2 commits
-
-
Xiaowei Ren authored
* always use the padding mask type for both flash and fused attention
* remove a redundant assert
-
Tim Moon authored
Store module extra state in a tensor.
-
- 02 Dec, 2024 1 commit
-
-
Youngeun Kwon authored
* draft implementation
* fix compile errors
* remove print
* edit comments
* edit the bulk-overlap test case
* add version guard
* add runtime version guard
* fix the version guard
-
- 27 Nov, 2024 1 commit
-
-
Xiaowei Ren authored
* retain_graph=True for grouped GEMM
* remove an unnecessary retain_graph=True
* make retain_graph configurable during graph capture
* typo fix
-
- 25 Nov, 2024 2 commits
-
-
Michael Goldfarb authored
Moved framework-agnostic THD kernels to common.
-
buptzyb authored
* Align RNG tracker with Megatron
* Fix module_params order and warmup bug in cudagraph
* Add fp8_group argument and fix FP8 accuracy issue for cudagraph
* Add TE module and weight filters to support MoE models
* Revert self.fp8
* Use hooks to filter module params
* Filter all TE modules in hooks
* Format code
* Update graph.py
* Revert CudaRNGStatesTracker
* Revert "Use hooks to filter module params" (reverts commit 73a22e2e8bcf43ec84c23bc844b8d16d06626e26)
* Remove filtering of module params
Co-authored-by: Yifei Song, Xin Yao, Tim Moon
-
- 22 Nov, 2024 1 commit
-
-
Tim Moon authored
* Add a helper function to convert a C++ container to a string
-
- 21 Nov, 2024 1 commit
-
-
Tim Moon authored
* Handle the deprecated `hidden_size` arg in norm modules
* Support initializing norm ops on CPU
* Add an integration test for Megatron-LM
* Rename the Mcore integration test
* Handle the case in RMSNorm where the hidden dim is not provided
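A common way to handle a deprecated argument like `hidden_size` is to accept it as an alias, emit a `DeprecationWarning`, and forward the value to the new parameter. A minimal sketch of that pattern (the names `make_layernorm` and `normalized_shape` are illustrative, not TE's actual signature):

```python
import warnings

def make_layernorm(normalized_shape=None, *, hidden_size=None):
    """Accept the deprecated `hidden_size` alias for `normalized_shape`."""
    if hidden_size is not None:
        warnings.warn(
            "`hidden_size` is deprecated, use `normalized_shape` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if normalized_shape is None:
            normalized_shape = hidden_size
    if normalized_shape is None:
        raise TypeError("normalized_shape (or the deprecated hidden_size) is required")
    return {"normalized_shape": normalized_shape}

# Old call sites keep working but see a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cfg = make_layernorm(hidden_size=1024)
print(cfg)                          # {'normalized_shape': 1024}
print(caught[0].category.__name__)  # DeprecationWarning
```

`stacklevel=2` makes the warning point at the caller's line rather than the shim itself, which is what users need to locate the deprecated call site.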
-