- 31 Jan, 2024 3 commits
-
-
cyanguwa authored
* update cudnn cmake for v9
* add back license information
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
Fused RoPE computation in FP32
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
Do not allocate FP8 workspace buffers when params are FP8
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 30 Jan, 2024 2 commits
-
-
Tim Moon authored
* Replace paddle.fluid imports with paddle.base
* Remove paddle.fluid usage from tests
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Selvaraj Anandaraj authored
Fixed offloading for PyT version / Added attention activation offloading support / Native FP8 support (#632)
* Fixed offloading for PyT version, added attention activation offloading support and native FP8 support
* Removed activation offloading for fused attention
* Fixed the illegal memory access issue for activation offloading of attention
* Removed the version guard
* Pipeline failures fix
* Fixed lint errors
* Lint error fix
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
-
- 29 Jan, 2024 1 commit
-
-
Alp Dener authored
* Removed cudaMalloc/WorkspaceManager in JAX csrc; JAX custom ops now request buffers from XLA for their workspace tensors
* removed unused GEMM C++ API in TE-JAX
* fixed typo in layernorm_geglu_fp8_mlp and removed unnecessary shape reductions in primitives
* fixed import order for linting
* fixed custom op errors due to incorrect static arg nums in JAX jit
* shifted cudnnSetStream further down the kernel to avoid an error when executing a dummy kernel call with a nullptr stream
* fixed linting errors for blank lines
Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 26 Jan, 2024 2 commits
-
-
Shijie authored
* use separate qkv
* add support for GQA
* minor changes
* change rtol
* fix reshape issue
Signed-off-by: jaywan <jaywan@nvidia.com>
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
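The GQA support added above shares each key/value head across a group of query heads. As an illustrative numpy sketch of that idea (names and shapes are hypothetical, not TE's API):

```python
import numpy as np

def expand_kv_heads(kv, num_q_heads):
    """Repeat KV heads so attention can be computed per query head.

    kv: array of shape [seq, num_kv_heads, head_dim]; each KV head is
    shared by num_q_heads // num_kv_heads query heads.
    """
    seq, num_kv_heads, head_dim = kv.shape
    assert num_q_heads % num_kv_heads == 0, "query heads must split evenly into groups"
    group_size = num_q_heads // num_kv_heads
    return np.repeat(kv, group_size, axis=1)  # [seq, num_q_heads, head_dim]

# 2 KV heads expanded to serve 4 query heads: head order becomes [0, 0, 1, 1]
k = np.arange(4).reshape(2, 2, 1)
k_expanded = expand_kv_heads(k, 4)
```

In practice kernels avoid materializing the repeat and instead stride over the smaller KV tensor, which is what the "kv stride" style fixes in GQA backends address.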
-
Isaac Ong authored
Fix MHA docstring
Signed-off-by: Isaac Ong <isaacong.jw@gmail.com>
-
- 25 Jan, 2024 1 commit
-
-
Xin Yao authored
* fused apply rope
* Apply suggestions from code review
* resolve comments
* make rotary_percent optional
* fix ci
* fix test
* add rope test to qa
* fix linting
* sync apex: add transpose_output_memory
* small fix
* sync apex: fuse sin/cos
* sync apex: fused rope for thd format
* fix lint
* Fix license headers
* add support for bshd format
* support different seq length
* update
* update copyright
* remove transpose_output_memory
* Make outputs contiguous in SBHD case
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
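The fused-RoPE work above replaces an unfused rotary position embedding with a single kernel. As a reference for what is being fused, here is a minimal numpy version of one common rotary convention (rotate-half, per-position frequencies); this is illustrative, not TE's kernel or API:

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotary position embedding on x of shape [seq, dim], dim even.

    Pairs (x[:, i], x[:, i + dim//2]) are rotated by a position- and
    frequency-dependent angle; position 0 is left unchanged.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # per-pair frequency
    angles = np.outer(np.arange(seq), inv_freq)       # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1, x2) pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each pair undergoes a pure rotation, the per-row norm is preserved, and computing the sin/cos tables in float32 (the FP32 fix mentioned earlier in this log) avoids precision loss when the activations are half precision.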
-
- 24 Jan, 2024 3 commits
-
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Marks101 authored
[PyTorch] fix forward attention_type in MultiheadAttention
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Alp Dener authored
[PyTorch] Workaround for incorrect output from torch.cuda.is_bf16_compatible() on V100s and TU102s (#626)
* replaced torch.cuda.is_bf16_compatible() with an explicit sm_80 check via torch.cuda.get_device_capability()
* implemented te.utils.is_bf16_compatible() to replace the torch.cuda counterpart
Signed-off-by: Alp Dener <adener@nvidia.com>
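The workaround above boils down to checking the compute-capability major version directly instead of trusting `torch.cuda.is_bf16_compatible()`. A minimal sketch of that check (the standalone function and its tuple argument are illustrative; in TE the capability comes from `torch.cuda.get_device_capability()`):

```python
def is_bf16_compatible(device_capability):
    """BF16 tensor cores require Ampere (sm_80) or newer.

    device_capability: (major, minor) tuple, e.g. (7, 0) for V100,
    (7, 5) for TU102, (8, 0) for A100.
    """
    major, _ = device_capability
    return major >= 8

# With PyTorch this would be called as:
#   is_bf16_compatible(torch.cuda.get_device_capability())
```

V100 (sm_70) and TU102 (sm_75) correctly report incompatible under this check, which is the behavior the commit restores.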
-
- 23 Jan, 2024 2 commits
-
-
Luke Petre authored
* Support building using the manylinux docker image; libpython is only required for embedded python
* Be explicit about which python to use in cmake
* Remove cmake version check
Signed-off-by: Luke Petre <lpetre@midjourney.com>
Signed-off-by: Luke Petre <lpetre@gmail.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Alp Dener authored
* added missing parameter materialization on real device for LayerNorm and RMSNorm
* added new unittest for deferred initialization and modified parameter materialization to support standalone execution outside of FSDP
* restored tensor parallel attributes that were being wiped out by the parameter reset
* fixed incorrect order of fp8 metadata initialization
* added deferred init unittest to the QA script
Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 22 Jan, 2024 1 commit
-
-
Marks101 authored
Signed-off-by: Markus Schnoes <markus.schnoes@gmx.de>
-
- 21 Jan, 2024 1 commit
-
-
Selvaraj Anandaraj authored
Activation offloading to CPUs for the Linear, LayerNorm Linear and LayerNorm MLP modules (#571)
* Added support for activation offloading to CPUs
* Moving CPU offloading library to TE
* Restructured code, added switch to choose between weight/activation offloading
* Removed arg during constructor
* Fix nit-pick errors
* Documentation fixes
* Fix to the code block in docs
* Added offloading unit test
* Fixed formatting
* wgrad fusion fix, minor errors and lint
* Errors, test, lint
* RM test file
* Fixed stray PyT tensors in LayernormMLP getting offloaded
* Fixed typo
* Fix offloading for rmsnorm, rm test
* Fix errors
* Float8Tensor compatible offloading
* Cleanup
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Jan, 2024 1 commit
-
-
Sudhakar Singh authored
fix failing tests due to PR #557
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 19 Jan, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 18 Jan, 2024 1 commit
-
-
Sudhakar Singh authored
* make TransformerLayer accept a `bshd` or `sbhd` tensor format
* Fixes from feedback
* more feedback fixes
* remove incorrect info from docstring
* fix from feedback
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
-
- 17 Jan, 2024 1 commit
-
-
Alp Dener authored
* Implemented deferred initialization via the `device='meta'` option for te.Linear and added a new PyTorch example demonstrating its use with FullyShardedDataParallel execution
* correcting Float8Tensor initialization and fixing linting errors
* removed duplicate code from upstream rebase, local tests passing
* improved comments/documentation for FSDP example
* converted reset_parameters() into a base module function
* fixed Float8Tensor creation with deferred init, all tests passing locally
* extended deferred initialization to all TE modules
* fixed linting errors
* removed unnecessary reference to the parent module of the parameter, added clarifying comments in parameter reset
Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 16 Jan, 2024 1 commit
-
-
zlsh80826 authored
* Support num_gqa_groups arguments
* Add GQA support on the JAX bridge code
* Fix the kv stride of the arbitrary backend
* Completely rewrite fused attention tests and add GQA coverage
* Support unfused GQA
* Calculate seqlen before the primitive for better perf
* Add GQA layer tests
* Apply code style checks for te_jax
* Apply code style checks for tests
* Add num_gqa_groups doc
* Refine the qkv_type
* Correct the variable naming
* Handle Max512 CAUSAL
* Add WAR for the latest jax image
Signed-off-by: Reese Wang <rewang@nvidia.com>
-
- 12 Jan, 2024 2 commits
-
-
Tian Zheng authored
* Actively free tensor in bwd
* Add inplace support for fp8 casting; allow skipping weight update in fp8 meta update
* Support weight caching for Linear
* Add weight caching for LayernormLinear
* Add weight caching for LayerNormMLP
* Add weight caching for Transformer layer
* Add PP unittests
* Fix CI
Signed-off-by: Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com>
-
Ming-Xu Huang authored
* Adding Cast custom call
* Applying cast to the kernel of layernorm_fp8_dot
* Applying native cast to the kernel of fp8_dot
* Apply Cast and native cast to layernorm_geglu_fp8_dot
* Fix the bug to enable layernorm_geglu_fp8_dot in LayernormMlp
* Modified code with the review feedback
* Adding 2xACC control to FP8 GEMMs
* Set precision as a static arg
Signed-off-by: Ming Huang <mingh@nvidia.com>
-
- 11 Jan, 2024 1 commit
-
-
Tian Zheng authored
* Add SP for linear
* Add SP for LayerNormLinear
* Add SP for LayerNormMLP
* Add SP API for transformer layer
* Add sequence_parallel attr
* Add SP unittests for Transformer and Attention
* Fix compatibility with PaddleNLP
* Copyright
Signed-off-by: Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Jan, 2024 2 commits
-
-
Zhang Haitao authored
* support non-tensor inputs/outputs for checkpoint
* better format
* modify to avoid python loops
Signed-off-by: skydoorkai <htsantaclara@163.com>
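Supporting non-tensor inputs in an activation-checkpoint wrapper usually means separating the arguments into tensors (which must be saved and replayed through autograd) and everything else (which can be captured as-is). A hypothetical pure-Python sketch of that split/merge bookkeeping; `split_args`, `merge_args`, and the `is_tensor` predicate are illustrative names, not the actual API:

```python
def split_args(args, is_tensor):
    """Partition args into tensor and non-tensor groups, remembering layout.

    is_tensor: predicate (torch.is_tensor in the real setting).
    Returns (tensors, others, layout) where layout records, per position,
    which group the argument came from and its index within that group.
    """
    tensors, others, layout = [], [], []
    for a in args:
        if is_tensor(a):
            layout.append(("tensor", len(tensors)))
            tensors.append(a)
        else:
            layout.append(("other", len(others)))
            others.append(a)
    return tensors, others, layout

def merge_args(tensors, others, layout):
    """Reassemble the original argument order from the two groups."""
    pools = {"tensor": tensors, "other": others}
    return [pools[kind][i] for kind, i in layout]
```

Only the `tensors` list is handed to the autograd machinery; `others` (flags, strings, shapes) is stashed on the context and re-merged when the function is re-run in backward.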
-
Xiaowei Ren authored
* try to use cuDNN fused attention for context parallelism
* assert CP is only supported with NVTE_F16_arbitrary_seqlen
* port fused attn api to context parallelism
* add one more assert
* assert CP does not support padded tokens
* add qkv_format into CP implementation
* remove qkv_format from CP function
* fix qkv_format
* fix bwd error with FA v2
* make cp implementation support non-causal masking
* bug fix
* remove redundant asserts for CP
* minor assert information change
* assert core attn bias has not been supported with CP yet
* make CP work with window_sizes of [-1, -1] and [-1, 0]
* add draft code for fa test with cp
* move fused attn test to a specific folder
* add assert_close to flash attn cp test
* add more tests for CP
* add optional arguments for FA v2.4+
* minor change
* add skip condition for CP test
* class and function naming fix
* docstring fix
* do not use fused attn if backend does not work with CP
* create a separate folder for CP test as it needs multi-GPUs
* add attn_mask_type check in attn_forward_func_with_cp
* code format fix
Signed-off-by: xren <xren@nvidia.com>
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 08 Jan, 2024 3 commits
-
-
Tim Moon authored
* Refactor parameter split in Linear module: remove module state from noop_cat, support arbitrary names in parameter split, handle tensor parallelism
* Make noop_cat a standalone operation
* Update parameter splits in LayerNormLinear
* Debug case without bias; fix pylint complaints
* Remove unused import
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
Jaemin Choi authored
* Use jit_fuser for bias-dropout-add fusion
* Use jit_fuser for CP FA kernel
Signed-off-by: Jaemin Choi <jaeminc@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
cyanguwa authored
fix FP8 dims
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 06 Jan, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Deterministic FA, bump minimum supported version
* fixes
* Fix MQA/GQA
* Address review comments
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 05 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 03 Jan, 2024 3 commits
-
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Sangkug Lym authored
* Provide pre-computed max sequence length to remove unnecessary kernels and D2H copies
* Tweak comments
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
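The optimization above hinges on a small computation: with packed variable-length sequences described by cumulative lengths (`cu_seqlens`), the maximum sequence length can be derived once on the host and passed in, instead of being recomputed (with a device-to-host copy) inside every attention call. An illustrative numpy sketch, not TE's code:

```python
import numpy as np

def max_seqlen_from_cu_seqlens(cu_seqlens):
    """Longest individual sequence length from cumulative offsets.

    cu_seqlens: [0, len0, len0+len1, ...]; consecutive differences
    recover the per-sequence lengths.
    """
    return int(np.max(np.diff(cu_seqlens)))

# Three packed sequences of lengths 5, 7, and 2:
max_seqlen_from_cu_seqlens(np.array([0, 5, 12, 14]))  # 7
```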
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
- 02 Jan, 2024 1 commit
-
-
Hongbin Liu authored
avoid redundant computation for cu_seqlens
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Hongbin Liu <hongbinl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 18 Dec, 2023 1 commit
-
-
Alp Dener authored
* Linear and LayerNormLinear weight and bias buffer cleanup at the end of init when there is no parameter split
* fixed typo in tensor name
* fixed typo in tensor name
Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 16 Dec, 2023 2 commits
-
-
cyanguwa authored
* add sliding window to FA
* fix forward logic
* fix lint
* change bert test to causal as unfused does not support padding
* fix FlashAttention for v2-2.3 versions
* verify FA swa works
* fix mask related restrictions and duplicate code after merge
* fix swa test
* add docstring for get_swa func
* move repeated code into a function
* revert mask change
* add determinism filter and fix FA warning message
* add message for determinism filter
* simplify check_set_window_size()
* fix check_set_window_size in transformer layers
* fix indent
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
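Sliding-window attention restricts each position to a bounded span of neighbors. The sketch below builds such a mask following FlashAttention's `(left, right)` window convention, where -1 means unbounded on that side and `(-1, 0)` reduces to a plain causal mask; it is an illustrative numpy mask builder, not TE's implementation:

```python
import numpy as np

def swa_mask(seq_len, window_size):
    """Boolean attention mask: position i may attend to position j
    iff i - left <= j <= i + right.

    window_size: (left, right) window bounds; -1 disables the bound
    on that side, so (-1, 0) is ordinary causal masking and
    (-1, -1) allows full bidirectional attention.
    """
    left, right = window_size
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    allowed = np.ones((seq_len, seq_len), dtype=bool)
    if left >= 0:
        allowed &= j >= i - left   # limit how far back i can look
    if right >= 0:
        allowed &= j <= i + right  # limit how far ahead i can look
    return allowed
```

Fused kernels never materialize this dense mask; they skip whole tiles outside the band, which is where the speedup over a masked dense attention comes from.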
-
Tim Moon authored
* Update fp8_meta amax when copying into Float8Tensor
* Avoid amax when copying between Float8Tensors with fp8_metas
Signed-off-by: Tim Moon <tmoon@nvidia.com>
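FP8 delayed scaling keeps a rolling history of observed absolute maxima (amax) per tensor, and copying new data into an FP8 tensor should record its amax so the scale stays current. A minimal numpy sketch of that bookkeeping, assuming a fixed-length rolling history (the function and its shapes are illustrative, not TE's fp8_meta layout):

```python
import numpy as np

def update_amax(amax_history, new_values):
    """Push the max-abs of incoming values onto the rolling amax history.

    amax_history: 1-D array, most recent entry at index 0.
    Returns the updated history (oldest entry dropped).
    """
    amax_history = np.roll(amax_history, 1)
    amax_history[0] = np.abs(new_values).max()
    return amax_history
```

Skipping this update when copying between two FP8 tensors that already carry fp8_metas (the second commit) avoids a redundant reduction, since the source's amax is already recorded.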
-
- 15 Dec, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Disable dynamo for Fused Attention
* Added test
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-