- 05 Sep, 2024 5 commits
-
-
Kirthi Shankar Sivamani authored
Revert "[C] Suppress 128-D warning from cudnn-frontend (#1158)" This reverts commit 206c1d92. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Charlene Yang authored
suppress 128D warning from cudnn-frontend Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Ruibin Cheung authored
* [TE/PyTorch][MoE] Add FP8 padding and unpadding module 1. Add multi-tensor padding kernel for FP8 with padding size = 16. 2. Add FP8Padding and Fp8Unpadding module 3. Add Padded GroupedLinear unit tests --------- Signed-off-by:
beinggod <zhangruibin@01.ai> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
Selvaraj Anandaraj authored
* Added offloading support FP8 attention Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> * Update transformer_engine/pytorch/attention.py Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Selvaraj Anandaraj <anandaraj@wisc.edu> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> Signed-off-by:
Selvaraj Anandaraj <anandaraj@wisc.edu> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Xin Yao authored
* fp8 mha with rope Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid index select in cast ops Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid index select in fused_attn_fwd Signed-off-by:
Xin Yao <xiny@nvidia.com> * rename is_first_module_in_mha to fp8_output Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move transpose to backward for fp8 input Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ut Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * update argument list for CP Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix for FA3 Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unnecessary copy of scale_inv Signed-off-by:
Xin Yao <xiny@nvidia.com> * skip fp8 dpa/mha tests when fa3 is not available Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix a merge bug Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 03 Sep, 2024 2 commits
-
-
Ruibin Cheung authored
[PyTorch] Add contiguous check for grouped gemm Signed-off-by:
beinggod <zhangruibin@01.ai> Co-authored-by:
beinggod <zhangruibin@01.ai> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Improvements for wheels Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for wheel build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move package finder to common Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI and distributed test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix paddle ci Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 30 Aug, 2024 2 commits
-
-
Xiaowei Ren authored
* fix qkv_dtype of FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * config cp correction dtype of FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code style change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * always do FP8 CP correction in FP32 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Charlene Yang authored
* fix FP8 logic when FA3 is not installed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak to make logic more explicit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * limit FA3 warning to Hopper and NVTE_FLASH_ATTN=1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * prefer fused attn for FP8 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 29 Aug, 2024 1 commit
-
-
Xin Yao authored
* remove dtype from args * update docs with permutation ops --------- Signed-off-by: Xin Yao <xiny@nvidia.com>
-
- 27 Aug, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 Aug, 2024 2 commits
-
-
Charlene Yang authored
* WIP: add fa3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: add benchmarks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * differentiate func/varlen_func Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix parsing keyword for FA3 and remove bshd->thd conversion for flash_attn_func Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add FP8 fwd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add FA3 FP8 fwd code and test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix assert for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FA3 FP8 logic and add tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA2 to <=2.6.3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak unit tests for base/mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * set constraints for FA3 for sm90 and causal_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert debug changes in benchmark script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Alexandros Koumparoulis authored
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
-
- 22 Aug, 2024 1 commit
-
-
NVJiangShao authored
* Add permutation functions * Add permutation ops * Remove the dependency on cutlass * Move permutation.py out of module dir * Rewrite the unit test and enable skipping if FP8 is unavailable * Rename exposed C++ API and reorder its parameters + take NVTETensor as inputs * Use Float8Tensor for FP8 input * Move dtype to ctx --------- Signed-off-by:
Jiang Shao <jiangs@nvidia.com> Co-authored-by:
Qi Zhang <qizhang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 21 Aug, 2024 4 commits
-
-
Charlene Yang authored
* add support for padding in UnfusedDPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for padding_causal/_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix padding_causal/_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * need to test max512 backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix mask logic in unfused Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use actual_seqlen for alibi/causal_bottom_right padding Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes and convert causal to causal_bottom_right for inference Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use causal in kv cache inference test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify get_alibi logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * simplify the non-padding path for get_alibi Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid batch_size loop in generating padding_causal/_bottom_right masks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Xiaowei Ren authored
* add window_size to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert swa+CP cannot work with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a new CP function for swa Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a missing dgrads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft fwd function for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * enable flash attention for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove an assert of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * call SWAFuncWithCP for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use 2hd layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change qkv_format check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a code comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * tensor shape bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tensor shape fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add function to compute cu_seqlens of a cp rank Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cu_seqlens and cu_seqlens_padded to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FlashAttention output sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix cu_seqlens_kv_per_step calculation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV for ending padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV tensors of FlashAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse correction Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove padded tokens of KV to save communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not need to zero dkv for FlashAttention any more Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero out tensors Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of cp test with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add simple code framework Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try not to have a separate CP function for SWA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * backup some code change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * back up code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * clean up fwd implementation of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * reduce kv chunk concat overheads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP and SWAFuncWithCP have same API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a docstring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * preliminary implementation of SWAFuncWithCP forward seems working Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix output shape of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code refactoring for FlashAttention and add a code placeholder for bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use gather_along_first_dim Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * finish the preliminary implementation of bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert condition Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft implementation of SWA+CP with FusedAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attention mask type of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add qkv_layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add missing window_size argument Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug and typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dout shape Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in fwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save chunk_ids_to_kv_ag in fwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in bwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix to cp stream sync Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * rename AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if window size is None Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix docstring of AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add env var for users to choose KV ag or KV p2p Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix window size in cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix pytest skip messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cp_comm_type into API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add deterministic knob in cuDNN fused attn backend Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * pass fp8 and fp8_meta to attn_func_with_cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert only Fused Attn can support FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a fwd draft implementation of FP8 + CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save fp8 and fp8_meta Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert sequence length divisible requirements Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove a redundant qkv_layout compute Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * if condition change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * some typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add support table of context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo and code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not print multiple disabling messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix aux_ctx_tensors of FP8 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix device in torch.arange and adjust code for the PR of MLA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * commit code change for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit more code change for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit more fp8 code for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fixes Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * cast merged CP results from FP32 to BF16 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix some bugs of FP8 dkv exchange Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add FP8 unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix typos and clean asserts Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix get_p2p_comm_info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dkv p2p exchange Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change FP8 dkv P2P to A2A Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add FP8+CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert amax reduction is needed for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove duplicated code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * destroy process group in CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove interval from fp8_recipe because it has been deprecated Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try to fix the failed CP test with the latest CI pipeline Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove redundant f before string Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change META_O_CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@cs-cw-dfw-login-01.cm.cluster>
-
Charlene Yang authored
* update FE to 1.6 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to 1.6.1-rc for testing Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to fe 1.6.1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Perform scale-inv update in cast-transpose kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update in cast and activation kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update in LayerNorm and RMSNorm kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update after FP8 GEMMs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fuse casts and scale-inv updates in linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fuse casts and scale-inv updates in layernorm-linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Simplify kernel to update FP8 scale-inv Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix typos Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug amax update in layernorm kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export Use quantization scaling factor in ONNX quantize op. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestion from @ptrendx Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug mismatched dtypes Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 20 Aug, 2024 1 commit
-
-
hXl3s authored
feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters (#1066) * Added ability for seqlen for transformer and mha layer Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Documentation for new parameters Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Add tests for THD layout, assert for THD layout with KV-Cache Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Fixed tests Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move THD logic in shape calculation, add missing optional in params Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Skip the THD test on GPUs older than Ampere Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 19 Aug, 2024 1 commit
-
-
Frédéric Bastien authored
Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 16 Aug, 2024 2 commits
-
-
Shijie authored
* support dtype casting fusion in FusedAdam Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * minor changes Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * fix lint Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * changes based on review comments Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * remove unused code Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * code refactor Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * fix typo Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * refactor Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * remove unused code Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Copy CUDA headers for framework sdists Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Shijie Wang <jaywan@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
Xiaowei Ren authored
* add window_size to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert swa+CP cannot work with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a new CP function for swa Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a missing dgrads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft fwd function for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * enable flash attention for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove an assert of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * call SWAFuncWithCP for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use 2hd layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change qkv_format check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a code comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * tensor shape bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tensor shape fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add function to compute cu_seqlens of a cp rank Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cu_seqlens and cu_seqlens_padded to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FlashAttention output sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix cu_seqlens_kv_per_step calculation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV for ending padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV tensors of FlashAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse correction Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove padded tokens of KV to save communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not need to zero dkv for FlashAttention any more Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero out tensors Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of cp test with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add simple code framework Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try not to have a separate CP function for SWA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * backup some code change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * back up code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * clean up fwd implementation of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * reduce kv chunk concat overheads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP and SWAFuncWithCP have same API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a docstring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * preliminary implementation of SWAFuncWithCP forward seems working Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix output shape of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code refactoring for FlashAttention and add a code placeholder for bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use gather_along_first_dim Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * finish the preliminary implementation of bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert condition Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft implementation of SWA+CP with FusedAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attention mask type of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add qkv_layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add missing window_size argument Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug and typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dout shape Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in fwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save chunk_ids_to_kv_ag in fwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in bwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix to cp stream sync Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * rename AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if window size is None Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix docstring of AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add env var for users to choose KV ag or KV p2p Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix window size in cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix pytest skip messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cp_comm_type into API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * assert sequence length divisible requirements Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support table of context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo and code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not print multiple disabling messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix device in torch.arange and adjust code for the PR of MLA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typos and clean asserts Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@cs-cw-dfw-login-01.cm.cluster>
-
- 15 Aug, 2024 2 commits
-
-
Charlene Yang authored
fix typos regarding t in thd Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Marks101 authored
Signed-off-by:
Markus Schnoes <markus.schnoes@gmx.de> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 14 Aug, 2024 3 commits
-
-
Tim Moon authored
* Bump minimum CUDA version to 12.0 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CUDA version check Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CMake build Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ksivaman and @ptrendx Remove logic for CUDA <12.0 in PyTorch and Paddle builds. Update version in docs and README. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Reese Wang authored
* Propagate sm_margin to the underlying layernorm kernels --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
Phuong Nguyen authored
* implemented custom call with ffi in csrc * moved headers of misc to misc.h, add ffi.h * ActLu and DActLu lowering with ffi_lowering * CastTranspose with ffi_lowering * enabled cudaGraph * added 4d input test case to TestActivationLu * added operand_output_aliases for CastTranspose * added env var NVTE_JAX_WITH_FFI, default value = 1 * replace casting ActivationEnum by taking its value --------- Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 Aug, 2024 2 commits
-
-
Charlene Yang authored
* merge k_channels and v_channels back to kv_channels and accept a tuple Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix isinstance call Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MLA tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
vasunvidia authored
* DGRAD-RS overlap bug fix This PR fixes a bug in enabling DGRAD-RS overlap by adding the layer to the correct method list. Previously, the RS-DGRAD overlap layer was incorrectly added to pipeline method list even if ring_exchange method is specified in config. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix for ring_exchange ReduceScatter ring_exchange RS uses main_stream for last GEMM chunk. But the send/recv streams wait for stream_compute during last chunk. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Aug, 2024 2 commits
-
-
Phuong Nguyen authored
* added threading build back * integrating threading for pytorch and paddle extensions * added messages --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
vasunvidia authored
Bug fix for num_warmup_iters=0 case Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Aug, 2024 1 commit
-
-
Tim Moon authored
* Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op that adds extra output to fuser Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for GEMM+bias+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for dgrad+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx Output tensor dtype and device take precedence over weight tensor in linear functional API. Move some index calculation to fuser constructor. Avoid some unnecessary dereferences. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update transformer_engine/pytorch/ops/fuser.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Aug, 2024 2 commits
-
-
Xin Yao authored
* use fused_multi_cast_transpose Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix input being empty tensor Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * allocate output tensors in C++ Signed-off-by:
Xin Yao <xiny@nvidia.com> * simplify code Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid cudaGetDriverEntryPoint Signed-off-by:
Xin Yao <xiny@nvidia.com> * reduce torch.Tensor() calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update test Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Alp Dener authored
[C/PyTorch] Fixed incorrect use of `torch.distributed.new_group()` when creating intra-node group in `initialize_ub()` (#1087) * updated initialize_ub() to use new_subgroups_by_enumeration() to generate intra-node groups, added new unit tests for TE layers with comm overlap Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 08 Aug, 2024 1 commit
-
-
Reese Wang authored
* Support non-deterministic algo Signed-off-by:
Reese Wang <rewang@nvidia.com> * Refine the helper function name Signed-off-by:
Reese Wang <rewang@nvidia.com> * Move fixture to conftest.py Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 06 Aug, 2024 4 commits
-
-
Charlene Yang authored
reduce the roundup of max_seqlen for THD to multiples of 64 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Charlene Yang authored
* fix logging in attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove logging in fwd/bwd methods due to CPU overhead Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix check_set_window_size messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix typo Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix window_size messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove redundant imports Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Reese Wang authored
* Support actlen = 0 after cuDNN 9.3.0 Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add runtime_segment < max_segment tests Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
Charlene Yang authored
* add multi-latent attention for DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Jax/Paddle API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix typo in test script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix too-many-boolean lint error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix lint" This reverts commit 67399a3a6f45bb4ce9e5eaa6bcce40b28e347e5b. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix stride check in get_qkv_layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix layout_thd tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge conflict Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix thd pad_between_seqs=False/True tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 02 Aug, 2024 1 commit
-
-
Li Tao authored
fix an argument issue when flash_attn>=2.5.7 Signed-off-by:
Li Tao <lit@nvidia.com> Co-authored-by:
Li Tao <lit@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-