- 27 Feb, 2025 1 commit
-
-
Tim Moon authored
Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 26 Feb, 2025 3 commits
-
-
Tim Moon authored
* Skip context parallelism tests if not enough GPUs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Oleg Goncharov authored
* Added TMA alignment check to cast_fp8_1D Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use tensor const-ref instead of tensor const-ptr Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Selvaraj Anandaraj authored
* Added parallel cross entropy loss implementation using online softmax Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added tests Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added reshape of loss output Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added to test list Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * Added Triton dependency Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * Added copyright Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * Fixed lint errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Selvaraj Anandaraj <anandaraj@wisc.edu> * Fixed lint and triton failure Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * Removed flattening for scalars Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * Skip tests on Blackwell due to TE CI caveat Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added reason arg Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Do not register Triton dependency with setuptools Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> Signed-off-by:
Selvaraj Anandaraj <anandaraj@wisc.edu> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@cw-dfw-cs-001-login-01.cm.cluster> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 25 Feb, 2025 3 commits
-
-
Youngeun Kwon authored
* add remove_caches api Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * Update transformer_engine/pytorch/tensor/float8_tensor.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * explicit delete Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Youngeun Kwon <youngeunk@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
guyueh1 authored
* Fix a crash with module._apply(lambda t: t.cpu()) Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * Add comments Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * Make sure tensor is moved to dst device before quantizer quantizes Signed-off-by:
Guyue Huang <guyueh@nvidia.com> --------- Signed-off-by:
Guyue Huang <guyueh@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Charlene Yang authored
* minor fixes for attention Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <charleney@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 24 Feb, 2025 2 commits
-
-
Paweł Gadziński authored
* non-exit tests Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * reshape inp Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 22 Feb, 2025 2 commits
-
-
Kshitij Lakhani authored
* Remove dependency on transformer_engine::Tensor in attention.cu Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Templatize thd_partition_indices_kernel and thd_read_half_tensor_kernel kernels ONLY for invoking recompilation and not directly using the pre-compiled symbols in libtransformer.so Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Modify attention.cu for thd templatized kernels. Remove dependency on common.h Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move thd structs from libtransformer.so to framework extensions include header Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Consolidate and move thd_utils from common to framework extensions Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Remove template decorators around thd_partition_indices_kernel and thd_read_half_tensor_kernel Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
Use same API in optimizer zero_grad as PyT optimizers Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 20 Feb, 2025 2 commits
-
-
Xiaowei Ren authored
* commit some debug code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add more debug info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * debug code commit and typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * a typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove debug info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not return lse Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add amax_per_step for quantizers of CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FP8 + CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * dtype fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@login-preos01.a51.clusters.nvidia.com>
-
Kirthi Shankar Sivamani authored
* Fix te sequential for older pytorch versions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * FIxes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 19 Feb, 2025 3 commits
-
-
Xin Yao authored
* fix fuse_wgrad_accumulation for GroupedLinear Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix fuse_wgrad_accumulation for GroupedLinear Signed-off-by:
Xin Yao <xiny@nvidia.com> * update tests Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
Fix typo Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Zhenhuan Liu authored
* Fix issues for MCore DDP. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Remove force data release for CPU offloading. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Add preserved attributeds. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add main_grad to prevserved attributes. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Change prepare_for_saving to original tensor and add .data to CPU hook. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix for LayernormLinear in FP8. Signed-off-by:
Dennis Liu <denliu@nvidia.com> --------- Signed-off-by:
Dennis Liu <denliu@nvidia.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 18 Feb, 2025 2 commits
-
-
Phuong Nguyen authored
flax module with compute dtype inferred from the inputs Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
hx authored
* add prob permute; fix fp8tensor Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert unnecessary changes in UT Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * remove unnecessary probs dtype convert Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * keep the output nums if probs is not provided Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine the doc string Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * fix lint Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * use fp32 compute type Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * style fix Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * fix empty input return Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * separate prob related functions out Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 15 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 14 Feb, 2025 5 commits
-
-
Reese Wang authored
* Expose THD to flex MHA module Signed-off-by:
Reese Wang <rewang@nvidia.com> * Enhance docs Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Reese Wang authored
Fix issues when mask/sequence_descriptor is None Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
JAX Lint Fix Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* fixes L1 test * fix test_multigpu_encoder * fixes for other multi-encoder tests * jax.extend.ffi to jax.ffi * initialization with float32 * add init_dtype as an optional arg to all modules * update use_scan query from xla flags * relax threshold for test_encoder fp8 * relax the tols --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* initialization with weight_dtype Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 Feb, 2025 1 commit
-
-
Xin Yao authored
* fix a bug for at::from_blob with nullptr Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix a bug for non-TN Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 12 Feb, 2025 2 commits
-
-
Przemyslaw Tredak authored
* Updated docs for TE 2.0 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Do not expose comm_gemm_overlap and cast_transpose_noop Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Made the figures larger Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> * Update quickstart_utils.py Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change from review Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Jaemin Choi authored
Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jaemin Choi <jaeminc@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 11 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
* flax module to init params with given dtype Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * all tests passed Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * remove unneccessary reshape for kernel Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * remove casting output of dot Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 31 Jan, 2025 1 commit
-
-
Selvaraj Anandaraj authored
* Initial commit Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Fixed compilation errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Fixed syntax errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed NaN issue when initial param value is zero Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Removed 64 bit indexing instantiation Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Made this feature an opt-in Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Removed arg from unscaled state Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Fixed compilation error Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleaned up errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added support for checkpointing Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed checkpointing logic Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Added tests Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added assert failure for capturable mode Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed pylint errors Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 30 Jan, 2025 1 commit
-
-
Quentin Anthony authored
Signed-off-by:Quentin Anthony <qganthony@yahoo.com>
-
- 28 Jan, 2025 1 commit
-
-
Sergii Dymchenko authored
This function is more accurate than torch.log() for small values of input - https://pytorch.org/docs/stable/generated/torch.log1p.html Found with TorchFix https://github.com/pytorch-labs/torchfix/ Signed-off-by:
Sergii Dymchenko <sdym@meta.com> Co-authored-by:
Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 27 Jan, 2025 1 commit
-
-
hx authored
* add mask-based moe permutation * change moe_chunk_permute to moe_sort_chunks_by_indices * fix __all__ in pytorch/permutation.py * fix func/var names and typos; update tols in UT --------- Signed-off-by:
Hongxiao Bai <hongxiaob@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 24 Jan, 2025 1 commit
-
-
Reese Wang authored
* POC for segment_ids/segment_pos Signed-off-by:
Reese Wang <rewang@nvidia.com> * Change segment_pos position Signed-off-by:
Reese Wang <rewang@nvidia.com> * Use RemainingArgs to solve number of parameters mismatches Signed-off-by:
Reese Wang <rewang@nvidia.com> * Test mask_descriptor for accomendating different mask representations Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix bugs Signed-off-by:
Reese Wang <rewang@nvidia.com> * Use descriptor in bwd Signed-off-by:
Reese Wang <rewang@nvidia.com> * Primitives only accepts pure jnp array Signed-off-by:
Reese Wang <rewang@nvidia.com> * segment_ids/pos support POC Signed-off-by:
Reese Wang <rewang@nvidia.com> * Move seqlens/offsets generation to mask descriptor Signed-off-by:
Reese Wang <rewang@nvidia.com> * Rename MaskDescriptor to SequenceDescriptor Signed-off-by:
Reese Wang <rewang@nvidia.com> * Generalize get_seqlens_and_offsets Signed-off-by:
Reese Wang <rewang@nvidia.com> * Utilize sequence desc on FA bwd Signed-off-by:
Reese Wang <rewang@nvidia.com> * Migrate to new API Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add docstrings Signed-off-by:
Reese Wang <rewang@nvidia.com> * Remove small inputs and test different input format Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix lint Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix seed shardings Signed-off-by:
Reese Wang <rewang@nvidia.com> * Optimize sequence converting overhead Signed-off-by:
Reese Wang <rewang@nvidia.com> * Optimize seq_offsets calculation Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix up Signed-off-by:
Reese Wang <rewang@nvidia.com> * fix lint Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix conflicts Signed-off-by:
Reese Wang <rewang@nvidia.com> * Remove reduntant line Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
- 22 Jan, 2025 1 commit
-
-
Tim Moon authored
* Avoid `parameters` function in op backward pass Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 21 Jan, 2025 1 commit
-
-
Charlene Yang authored
only compare the recipe in AttentionParams.fp8_meta Signed-off-by:Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 17 Jan, 2025 1 commit
-
-
Michael Goldfarb authored
Consolidate the distributed fused attention tests to shared input generation and execition logic. Signed-off-by:Michael Goldfarb <mgoldfarb@nvidia.com>
-
- 16 Jan, 2025 1 commit
-
-
Alp Dener authored
* corrected RS overlap BF16 output clashing with Float8Tensor constructor Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed empty dgrad buffer dtype at initialization Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Jan, 2025 1 commit
-
-
guyueh1 authored
* Add a compile option to compile activation kernels with fast math Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * Fix Signed-off-by:
Guyue Huang <guyueh@nvidia.com> * Apply suggestions from code review Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
guyueh1 <140554423+guyueh1@users.noreply.github.com> --------- Signed-off-by:
Guyue Huang <guyueh@nvidia.com> Signed-off-by:
guyueh1 <140554423+guyueh1@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Jan, 2025 1 commit
-
-
Alp Dener authored
* support AG overlap in sequence-parallel Linear forward and RS overlap in sequence-parallel Linear backward Signed-off-by:
Alp Dener <adener@nvidia.com> * implemented TP overlap support for column-parallel te.Linear Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed backward pass for te.Linear column-parallel with TP overlap, updated unit tests Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * improved error messages for internal failure to infer TP overlap options in te.Linear Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed linting errors Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect TP overlap option asserts Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-