- 21 Aug, 2024 2 commits
-
-
Charlene Yang authored
* update FE to 1.6 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to 1.6.1-rc for testing Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to fe 1.6.1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Perform scale-inv update in cast-transpose kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update in cast and activation kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update in LayerNorm and RMSNorm kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Perform scale-inv update after FP8 GEMMs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fuse casts and scale-inv updates in linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fuse casts and scale-inv updates in layernorm-linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Simplify kernel to update FP8 scale-inv Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix typos Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug amax update in layernorm kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export Use quantization scaling factor in ONNX quantize op. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestion from @ptrendx Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug mismatched dtypes Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
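The commit above moves the scale-inv update into the kernels that already perform the cast. A minimal sketch of the idea, assuming scale-inv is the reciprocal of the quantization scaling factor and is used to map scaled values back to their original range. The function names are hypothetical, and FP8 rounding is not modeled:

```python
def cast_with_scale_inv(values, scale):
    """Scale the values and produce scale_inv in the same step (hypothetical helper)."""
    scaled = [v * scale for v in values]
    scale_inv = 1.0 / scale  # written alongside the cast instead of in a separate pass
    return scaled, scale_inv


def dequantize(scaled, scale_inv):
    """Map scaled values back to the original range using scale_inv."""
    return [s * scale_inv for s in scaled]


scaled, scale_inv = cast_with_scale_inv([0.5, -1.25], scale=4.0)
restored = dequantize(scaled, scale_inv)
```

Producing both results in one place avoids the window in which the scaled data and its scale-inv could disagree.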
-
- 20 Aug, 2024 1 commit
-
-
hXl3s authored
feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters (#1066) * Added ability for seqlen for transformer and mha layer Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Documentation for new parameters Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Add tests for THD layout, assert for THD layout with KV-Cache Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Fixed tests Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move THD logic in shape calculation, add missing optional in params Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> * Skip the THD test on GPUs older than Ampere Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Lukasz Pierscieniewski <lukaszp@nvidia.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Przemek Tredak <ptredak@nvidia.com>
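For readers unfamiliar with the THD layout mentioned above: sequences of different lengths are packed along one token dimension, and their boundaries are commonly described by cumulative sequence lengths. A small sketch, assuming the usual convention of a leading zero followed by running totals. The helper name is hypothetical and is not the library's API:

```python
from itertools import accumulate


def cu_seqlens_from_lengths(seq_lens):
    """Return cumulative sequence lengths with a leading zero (hypothetical helper)."""
    return [0] + list(accumulate(seq_lens))


cu_seqlens = cu_seqlens_from_lengths([3, 5, 2])
total_tokens = cu_seqlens[-1]  # size of the packed token dimension
```

Sequence i then occupies tokens cu_seqlens[i] up to, but not including, cu_seqlens[i + 1].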
-
- 19 Aug, 2024 3 commits
-
-
Frédéric Bastien authored
Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Aug, 2024 2 commits
-
-
Shijie authored
* support dtype casting fusion in FusedAdam Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * minor changes Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * fix lint Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * changes based on review comments Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * remove unused code Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * code refactor Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * fix typo Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * refactor Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * remove unused code Signed-off-by:
Shijie Wang <jaywan@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Copy CUDA headers for framework sdists Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Shijie Wang <jaywan@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
Xiaowei Ren authored
* add window_size to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert swa+CP cannot work with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a new CP function for swa Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a missing dgrads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft fwd function for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * enable flash attention for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove an assert of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * call SWAFuncWithCP for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use 2hd layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change qkv_format check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a code comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * tensor shape bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tensor shape fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add function to compute cu_seqlens of a cp rank Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cu_seqlens and cu_seqlens_padded to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FlashAttention output sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix cu_seqlens_kv_per_step calculation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV for ending padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV tensors of FlashAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse correction Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove padded tokens of KV to save communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not need to zero dkv for FlashAttention any more Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero out tensors Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of cp test with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add simple code framework Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try not to have a separate CP function for SWA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * backup some code change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * back up code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * clean up fwd implementation of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * reduce kv chunk concat overheads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP and SWAFuncWithCP have same API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a docstring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * preliminary implementation of SWAFuncWithCP forward seems working Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix output shape of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code refactoring for FlashAttention and add a code placeholder for bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use gather_along_first_dim Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * finish the preliminary implementation of bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert condition Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft implementation of SWA+CP with FusedAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attention mask type of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add qkv_layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add missing window_size argument Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug and typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dout shape Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in fwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save chunk_ids_to_kv_ag in fwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in bwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix to cp stream sync Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * rename AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if window size is None Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix docstring of AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add env var for users to choose KV ag or KV p2p Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix window size in cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix pytest skip messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cp_comm_type into API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * assert sequence length divisible requirements Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support table of context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo and code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not print multiple disabling messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix device in torch.arange and adjust code for the PR of MLA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typos and clean asserts Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@cs-cw-dfw-login-01.cm.cluster>
-
- 15 Aug, 2024 2 commits
-
-
Charlene Yang authored
fix typos regarding t in thd Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Marks101 authored
Signed-off-by:
Markus Schnoes <markus.schnoes@gmx.de> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 14 Aug, 2024 6 commits
-
-
Tim Moon authored
* Bump minimum CUDA version to 12.0 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CUDA version check Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug CMake build Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ksivaman and @ptrendx Remove logic for CUDA <12.0 in PyTorch and Paddle builds. Update version in docs and README. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* add default path for ffi include * add an option to get XLA_HOME from env --------- Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
Remove total time measurement Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
Reese Wang authored
* Propagate sm_margin to the underlying layernorm kernels --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Switch to nltk>3.8.1 and new data Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix nltk install Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Phuong Nguyen authored
* implemented custom call with ffi in csrc * moved headers of misc to misc.h, add ffi.h * ActLu and DActLu lowering with ffi_lowering * CastTranspose with ffi_lowering * enabled cudaGraph * added 4d input test case to TestActivationLu * added operand_output_aliases for CastTranspose * added env var NVTE_JAX_WITH_FFI, default value = 1 * replace casting ActivationEnum by taking its value --------- Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 Aug, 2024 7 commits
-
-
Tim Moon authored
* Use minimal CUDA container for PyTorch GitHub build Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Accidentally installed PyTorch in wrong test Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug sanity test Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Install PyTorch build dependencies Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Include NumPy as a dependency Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Disable sanity import test Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Phuong Nguyen authored
rm test_dgeglu Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* add timing for build * using perf_counter --------- Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
Charlene Yang authored
* update example/benchmark scripts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix head_dim after MLA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update notebook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Charlene Yang authored
* merge k_channels and v_channels back to kv_channels and accept a tuple Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix isinstance call Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MLA tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
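The change above lets a single kv_channels argument describe either one shared head dimension or separate key and value head dimensions. A sketch of how such an argument might be normalized, assuming an int means both dimensions are equal and a tuple is ordered (key, value). The helper name is hypothetical and does not come from the library:

```python
def split_kv_channels(kv_channels):
    """Normalize an int or a (key, value) tuple into two head dimensions (hypothetical helper)."""
    if isinstance(kv_channels, int):
        # A single int is assumed to apply to both keys and values.
        return kv_channels, kv_channels
    k_channels, v_channels = kv_channels
    return k_channels, v_channels


same_dims = split_kv_channels(64)
mla_dims = split_kv_channels((192, 128))
```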
-
vasunvidia authored
* DGRAD-RS overlap bug fix This PR fixes a bug in enabling DGRAD-RS overlap by adding the layer to the correct method list. Previously, the RS-DGRAD overlap layer was incorrectly added to the pipeline method list even when the ring_exchange method was specified in the config. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix for ring_exchange ReduceScatter ring_exchange RS uses main_stream for last GEMM chunk. But the send/recv streams wait for stream_compute during last chunk. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 12 Aug, 2024 3 commits
-
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Phuong Nguyen authored
* added threading build back * integrating threading for pytorch and paddle extensions * added messages --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
vasunvidia authored
Bug fix for num_warmup_iters=0 case Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Aug, 2024 1 commit
-
-
Tim Moon authored
* Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op for in-place add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add op that adds extra output to fuser Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for GEMM+bias+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused op for dgrad+add Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add documentation Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Review suggestions from @ptrendx Output tensor dtype and device take precedence over weight tensor in linear functional API. Move some index calculation to fuser constructor. Avoid some unnecessary dereferences. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update transformer_engine/pytorch/ops/fuser.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Aug, 2024 2 commits
-
-
Xin Yao authored
* use fused_multi_cast_transpose Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix input being empty tensor Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * allocate output tensors in C++ Signed-off-by:
Xin Yao <xiny@nvidia.com> * simplify code Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid cudaGetDriverEntryPoint Signed-off-by:
Xin Yao <xiny@nvidia.com> * reduce torch.Tensor() calls Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update test Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Alp Dener authored
[C/PyTorch] Fixed incorrect use of `torch.distributed.new_group()` when creating intra-node group in `initialize_ub()` (#1087) * updated initialize_ub() to use new_subgroups_by_enumeration() to generate intra-node groups, added new unit tests for TE layers with comm overlap Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
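The fix above builds intra-node groups by enumeration. `torch.distributed.new_subgroups_by_enumeration` takes a list of rank lists, one per subgroup. A sketch of how such a list could be built, assuming ranks are assigned contiguously to nodes and every node hosts the same number of ranks. The helper name is hypothetical, and the sketch does not call PyTorch itself:

```python
def intra_node_rank_groups(world_size, ranks_per_node):
    """List the ranks on each node, assuming contiguous rank placement (hypothetical helper)."""
    if world_size % ranks_per_node != 0:
        raise ValueError("world_size must be divisible by ranks_per_node")
    return [
        list(range(start, start + ranks_per_node))
        for start in range(0, world_size, ranks_per_node)
    ]


groups = intra_node_rank_groups(world_size=8, ranks_per_node=4)
```

Every rank appears in exactly one list, which matches what an enumeration of disjoint subgroups needs.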
-
- 08 Aug, 2024 1 commit
-
-
Reese Wang authored
* Support non-deterministic algo Signed-off-by:
Reese Wang <rewang@nvidia.com> * Refine the helper function name Signed-off-by:
Reese Wang <rewang@nvidia.com> * Move fixture to conftest.py Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 06 Aug, 2024 6 commits
-
-
Tim Moon authored
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
Charlene Yang authored
reduce the roundup of max_seqlen for THD to multiples of 64 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
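Rounding max_seqlen up to a multiple of 64 can be illustrated with standard integer arithmetic. This is only a sketch of the rounding described in the commit title; the function name is hypothetical and the library's actual code may differ:

```python
def round_up_to_multiple(value, multiple=64):
    """Round a non-negative integer up to the nearest multiple (hypothetical helper)."""
    return ((value + multiple - 1) // multiple) * multiple


padded_max_seqlen = round_up_to_multiple(1000)
```

A smaller multiple wastes fewer padded positions per sequence than a coarser one would.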
-
Charlene Yang authored
* fix logging in attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove logging in fwd/bwd methods due to CPU overhead Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix check_set_window_size messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix typo Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix window_size messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove redundant imports Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Reese Wang authored
Add the missing 1HSS tests Signed-off-by: Reese Wang <rewang@nvidia.com>
-
Reese Wang authored
* Support actlen = 0 after cuDNN 9.3.0 Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add runtime_segment < max_segment tests Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
Charlene Yang authored
* add multi-latent attention for DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Jax/Paddle API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix typo in test script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix too-many-boolean lint error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "fix lint" This reverts commit 67399a3a6f45bb4ce9e5eaa6bcce40b28e347e5b. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix stride check in get_qkv_layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: fix layout_thd tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge conflict Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix thd pad_between_seqs=False/True tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 02 Aug, 2024 2 commits
-
-
Li Tao authored
fix an argument issue when flash_attn>=2.5.7 Signed-off-by:
Li Tao <lit@nvidia.com> Co-authored-by:
Li Tao <lit@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Przemyslaw Tredak authored
* Link attention docs to the main docs and fix errors reported by Sphinx Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Lower the version of nbsphinx Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * More fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Change the URL of example_attention.py to GitHub Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * More fixes in the attention tutorial Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 01 Aug, 2024 2 commits
-
-
Xiaowei Ren authored
* use 2hd layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change qkv_format check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a code comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * tensor shape bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tensor shape fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add function to compute cu_seqlens of a cp rank Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cu_seqlens and cu_seqlens_padded to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FlashAttention output sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix cu_seqlens_kv_per_step calculation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV for ending padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV tensors of FlashAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse correction Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove padded tokens of KV to save communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not need to zero dkv for FlashAttention any more Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero out tensors Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of cp test with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@cs-cw-dfw-login-01.cm.cluster>
-
Xin Yao authored
* fix workspaces and unfused bias in multi-stream cuBLAS * Expose num_streams via pybind * Fix C-compatibility * rm importing packaging in test_fused_attn.py --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-