- 17 Jul, 2025 1 commit
Charlene Yang authored
* optimize kv_cache reindex and copy kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid reindexing from python side Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename variable from previous commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 28 Apr, 2025 1 commit
Kshitij Lakhani authored
* Move MultiHeadAttention into its own file. Modify tests and files in t_e/pytorch to import from the new MHA module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost MHA changes from PR 1614 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move context parallelism code into its own file. Modify test and local imports of cp code accordingly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move softmax.py from pytorch/ to pytorch/d_p_a Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move Unfused and Fused attention to backends.py and some utils functions to pytorch/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost mark_activation_offload changes from PR 1678 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor attention dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Refactor dir structure. Make relevant symbols public in __init__ for attention and d_p_a dirs Move FA package imports to backends.py Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Modify tests to import attention modules correctly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Lint fixes Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up and fix typo Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allowing InferenceParams and RoPE imports from attention module and pytorch module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allow InferenceParams and RoPE imports via transformer_engine.pytorch and transformer_engine.pytorch.attention modules Remove unnecessary checks for check_set_window_size in MHA and TL Reorder backends such that smaller classes are at the start and larger ones at the end Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Reinstating changes from PR 1478 for rope.py lost during rebase conflict resolution Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint issues Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make imports leaner Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 01 Apr, 2025 1 commit
Charlene Yang authored
- 25 Mar, 2025 1 commit
Charlene Yang authored
* skip cuDNN 9.8 for KV caching Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert from max_seqlen_kv to max_sequence_length for InferenceParams Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename test_paged_attn to test_kv_cache Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove redundant None returns in bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug flags when no backend is found Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * skip kv_cache_accuracy tests for cuDNN 9.8 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * truncate length of cu_seqlens for consistency with q/k/v shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back padding_brcm for fused attn tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * re-enable kv_cache_accuracy test for 9.8 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN search dir Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes based on review Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove extra empty line Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 18 Mar, 2025 1 commit
Charlene Yang authored
* add paged attention; test_kv_cache_accuracy and test_paged_attn pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unnecessary change from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test_fused_attn pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unnecessary import in test_numerics Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add license for test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add to L0 test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update license for test_paged_attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update kv_cache_manager license Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix build issue from previous merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fix/preparation for inference/cuda graph Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, bshd/sbhd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, thd, no CG Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: non-paged, thd, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, using paged kernel Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: restructure kernels Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: paged, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: padding + BRCM Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: restructure IP, clean up Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix non-CG, fused Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix last commit Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: unfused, non-CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash-attn, non-CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash_attn_with_kvcache Signed-off-by:
Charlene Yang <charleney@nvidia.com> * commit two files missed by bcef6b34 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: thd_bshd_bshd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix last commit Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix 1c31b68d Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: add bshd_2sbhd, sbhd_2bshd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some cleanup Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: all qkv_format combinations and merge CM files Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some lint fixes Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: add docstring for IP Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix sequences_pre Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fixes for multi-layer Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: initial multi-layer test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: switch to flash_attn_varlen_func Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix unfused for separate q/kv format Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix fused for separate q/kv formats Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash attn + TELayer + 2 layers Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: unfused + TL + 2layers Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: all modules/backend Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: minor cleanup Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: FlashAttention on Hopper with 2.7.3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: FlashAttention + v3 from 39e7179 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: FlashAttention + v3 + FP8 + WIP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add backend support table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: separate use_flash_attention_2 and _3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: tweaks to paged attn script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: enable/disable certain cases for fused attn Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: small fixes for lint and cg Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: minor fixes for attn/infer Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix CP Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: readd page info to FADescriptor_v1 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweak to test_numerics.py Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix 9.5/9.7 sq/skv + mask logic Signed-off-by:
Charlene Yang <charleney@nvidia.com> * clean up Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more minor fixes for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test page_size=1 for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix t3hd/th3d strides Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix ckpt recompute and fa3 k_scale Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * raise dynamo recompile limit for test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove thunder test from L0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FA selection logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FA3 q_descale shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove page_table from IP.step() returns Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FP8 FlashAttn DPA fp8_dpa tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 note and L3 test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove redundant import in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * relax tols for TransformerLayers Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge 2 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FA import comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * relax tols for Ampere Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fa3 version and reduce messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 to its latest commit on main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add default values to IP and assertion to graph.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more comments in attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use custom_cache_manager instead of cache_manager Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Charlene Yang <charleney@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- 14 Mar, 2025 1 commit
Kshitij Lakhani authored
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py Move attention logging into a separate class in pytorch/d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Create FlashAttentionUtils class in pytorch/d_p_a/utils.py for versioning info Move versioning info out of pytorch/attention.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py Fix tests and imports for the above refactor change Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py Rename cumulative functions from using _cu_ to using _cumul_ to differentiate from CUDA cu calls protocol Rename tensor packaging methods with leading underscore to mark them as internal to the file Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it Modify tests and other files to import InferenceParams correctly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Modify docs api for InferenceParams Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Create d_p_a/rope.py and move RoPE methods from pytorch/attention.py to it Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix qa testing induced bug Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix incorrect pack_tensor arg type Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Resolve lint errors Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove typedef FAUtils for FlashAttentionUtils Use attn_log instead of att_log Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Fix lint error Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Fix the function name from get_cumul to the earlier get_cu Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Fix typos, explicit imports and remove extra comments Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>