- 04 Nov, 2024 1 commit

Xin Yao authored
* Let fp8 mha work with rope when cp is on
* fix and update ut
Signed-off-by: Xin Yao <xiny@nvidia.com>
-
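For context on the entry above (FP8 attention interacting with rotary embeddings under context parallelism), the core RoPE operation is a per-pair rotation whose angle grows with token position. A minimal pure-Python sketch of the math only, not TE's fused FP8 kernel:

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Apply rotary position embedding to one head vector x (even length).

    Each pair (x[2i], x[2i+1]) is rotated by angle position * base**(-2i/d),
    so dot products between rotated q and k depend only on relative position.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

# Rotating by position 0 is the identity.
print(rope_rotate([1.0, 0.0, 1.0, 0.0], 0))  # → [1.0, 0.0, 1.0, 0.0]
```

The relative-position property is what makes RoPE compatible with sharding schemes like CP: shifting both query and key positions by the same amount leaves their dot product unchanged.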
- 01 Nov, 2024 1 commit

Kunlun Li authored
* Add precision aware fused adam
* Minor changes based on review comments
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
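"Precision aware" here refers to the optimizer keeping a full-precision master copy of each weight while the model weight lives in a lower precision, so tiny Adam updates are not lost to rounding. A toy sketch of the idea (the coarse grid stands in for a BF16 cast; the function names are illustrative, not TE's FusedAdam API):

```python
def adam_step(m, v, grad, w, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update applied to the full-precision master weight w."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return m, v, w - lr * m_hat / (v_hat ** 0.5 + eps)

def to_low_precision(x, step=1e-2):
    """Stand-in for casting to BF16: snap the value to a coarse grid."""
    return round(x / step) * step

m = v = 0.0
master = 1.0                           # FP32 master copy owned by the optimizer
for t in range(1, 101):
    m, v, master = adam_step(m, v, 0.5, master, t)
model_weight = to_low_precision(master)  # what the forward pass would see
print(round(master, 4), model_weight)  # → 0.9 0.9
```

Updating the coarse weight directly would stall whenever a single step is smaller than the grid spacing; accumulating in the master copy and casting once keeps the trajectory correct.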
- 29 Oct, 2024 2 commits

Charlene Yang authored
skip some t3hd/th3d tests for MQA/GQA
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Alp Dener authored
* moved userbuffers code to TE/common
* moved comm+GEMM overlap code to TE/common
* removed PyTorch dependency from comm+GEMM overlap in TE/common
* added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common
* updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code
* updated unit tests to work with refactored comm+GEMM overlap code
* added a pylint exception to comm+GEMM overlap test runner
* fixed linting errors
* added documentation for te.initialize_ub
* fixed compile errors when building with NVTE_UB_WITH_MPI=1
* fixed default bootstrap backend
* switched default bootstrap backend priority to MPI > Gloo > NCCL
* updated bootstrap backend documentation
* close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv
* added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold
* automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures
* fixed incorrect read of environment variables
* corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping
* moved multicast support check to cuda_runtime.h and replaced cudaDeviceGetProp call with cached sm_count()
* removed commented-out old code and replaced external collective function type defines with aliases
* compile-time CUDA version guard for CUDA Driver Multicast attribute
* added compile-time CUDA version guards to Multicast code in Userbuffers
* condensed UB docs, corrected const violations
* fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels
* fixed incorrect UB type reporting for P2P overlaps, comment reformatting
* add docstring to tex.ubuf_built_with_mpi()
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
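The "bootstrap backend priority" item above boils down to picking the first available distributed backend in a fixed order. A sketch of that selection logic under stated assumptions (the function name and the shape of `available` are illustrative; the real logic lives inside TE's userbuffers initialization):

```python
def choose_bootstrap_backend(available, preference=("mpi", "gloo", "nccl")):
    """Pick the first available backend in priority order (MPI > Gloo > NCCL)."""
    for backend in preference:
        if backend in available:
            return backend
    raise RuntimeError(f"no usable bootstrap backend among {sorted(available)}")

# MPI is absent here, so Gloo wins over NCCL.
print(choose_bootstrap_backend({"gloo", "nccl"}))  # → gloo
```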
- 25 Oct, 2024 2 commits

Charlene Yang authored
* WIP: add max_t support for THD
* WIP: save tensors for debug and point to new FE
* fix stats in bwd
* fix stats in fwd
* add docstring for DPA
* add docstring
* WIP: first try on adding max_b and max_t
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" (reverts commit c3d522e9f5aef3c8ddfec5bf6ff24c3db97bb059)
* Revert "WIP: first try on adding max_b and max_t" (reverts commit 3bc01ebaf2aa846fd16634e2d33b0d0f5803a076)
* update docstring and fix max_seqlen logic for thd
* revert two lines of change in docstring
* WIP: add get_max_b/t
* fix max_seqlen code and docstring
* success: add max_b/max_t
* remove debug code
* change max_b/max_t buckets
* fix b vs orig_b
* fix b vs orig_b with 0 fill
* update FE for T3HD/TH3D
* add max_b to conversion kernels
* fix lint
* fix changes after last merge
* add Jax support for max_t
* update FE to 1.8.0-rc
* update FE to 1.8.0
* code review/formatting fixes
* fix Stats shape for cuDNN < 9.6
* return nullptr for offset_stats when cuDNN < 9.6
* add more version control
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
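The max_b/max_t bookkeeping above concerns the THD (packed, variable-length) layout, where batch size and total token count are derived from a cumulative-sequence-length array. A small sketch of those derived quantities (the bucketing of max_b/max_t mentioned in the commits is TE-internal and omitted here):

```python
def thd_stats(cu_seqlens):
    """Derive batch size, total tokens, and max sequence length for THD layout.

    cu_seqlens holds cumulative sequence lengths, e.g. [0, 3, 8, 9] packs
    three sequences of lengths 3, 5, 1 into a single token dimension.
    """
    lens = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
    b = len(lens)          # number of sequences (what max_b would bound)
    t = cu_seqlens[-1]     # total packed tokens (what max_t would bound)
    max_seqlen = max(lens)
    return b, t, max_seqlen

print(thd_stats([0, 3, 8, 9]))  # → (3, 9, 5)
```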
Charlene Yang authored
* add THD MQA/GQA
* fix nvte_get_fused_attn_backend
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 24 Oct, 2024 3 commits

Paweł Gadziński authored
* update test numerics
* update test numerics
* update test numerics
* Update tests/pytorch/test_numerics.py
* tests fix
* Not passing CI fixes
* Not passing CI fixes
* Fix key
* fixes
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Hua Huang authored
[JAX] XLA Custom Calls with FFI for FusedAttnFwd, Quantize, Transpose, ActLuFP8, LayerNormForwardFP8FFI, and LayerNormBackwardFFI (#1263)
* Add TransposeFFI, test passed
* Add ActLuFP8FFI; fix TransposeFFI
* Add QuantizeFFI
* Add FusedAttnForwardFFI and some unit tests
* Minor fix
* Add LayerNormForwardFP8FFI & LayerNormBackwardFFI
* Revise FusedAttnForwardFFI()
* Add FFI_CudaGraph_Traits; all tests passed, ready for merge
* Bug fix for FFI data type mismatch; also add a safeguard on the entrance to the FFI function
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Michael Goldfarb authored
[JAX] Fix correctness of JAX fused attention with CP and improve numerics check in unit tests (#1282)
Fix correctness of JAX fused attention with CP.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
-
- 22 Oct, 2024 1 commit

Reese Wang authored
Add THD + GQA support for cuDNN >= 9.6
Signed-off-by: Reese Wang <rewang@nvidia.com>
-
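For reference, GQA shares each key/value head across a contiguous group of query heads, with MQA as the single-KV-head extreme and MHA as the one-to-one extreme. The head mapping in plain Python:

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the KV head it shares in MQA/GQA.

    Query heads are split into num_kv_heads contiguous groups; MQA is the
    num_kv_heads == 1 special case, MHA the num_kv_heads == num_q_heads case.
    """
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 16 query heads sharing 4 KV heads: heads 0-3 -> 0, heads 4-7 -> 1, ...
print([kv_head_for_query_head(h, 16, 4) for h in range(16)])
# → [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```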
- 18 Oct, 2024 1 commit

Tim Moon authored
* Reorganize PyTorch L1 tests
* Move ONNX tests to L1
* Move FA version test to L3
* Limit parallel build jobs in FA version test
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 16 Oct, 2024 2 commits

Charlene Yang authored
* WIP: make FA2 optional
* WIP: fix logic
* fix lint
* minor fixes
* minor tweak
* add L1 test to test all supported FA versions
* update version to 2.1.1 and trim L1 tests
* update onnxruntime version
* remove onnxruntime from L1 FA versions tests
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Build custom ORT ops before running ONNX tests
* Remove ONNX from context parallelism tests
* Export ONNX ops that do compute in FP32 (matches internal impl of TE kernels)
* Add build script for custom ORT ops
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 15 Oct, 2024 1 commit

Michael Goldfarb authored
Update test to check support for context parallel attention.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
-
- 12 Oct, 2024 1 commit

Xin Yao authored
* Let Fused RoPE support THD with CP
* add comment
Signed-off-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
-
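What makes fused RoPE non-trivial in the THD layout is that packed tokens need positions relative to their own sequence, not global token indices (context parallelism then further offsets each rank into its chunk). A sketch of the position recovery alone, with the CP offset omitted:

```python
def rope_positions_thd(cu_seqlens):
    """Position of each packed token within its own sequence (THD layout).

    Fused RoPE needs a per-token position; with sequences packed along the
    token dimension, positions restart at zero at every cu_seqlens boundary.
    """
    positions = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        positions.extend(range(end - start))
    return positions

# Two packed sequences of lengths 3 and 4.
print(rope_positions_thd([0, 3, 7]))  # → [0, 1, 2, 0, 1, 2, 3]
```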
- 10 Oct, 2024 1 commit

Hua Huang authored
* Expose JAX sliding window attn API
* No SWA in context parallel; fix RNG seed in test
* Handle SWA API discrepancy in cuDNN and Python
* Add SWA API for flax, all tests passed; will update tests/jax/test_praxis_layers.py next
* Update test_praxis_layers.py for SWA, test passed
* Use tuple window_size; update for PR #1212
* Add and adjust some pytest.skip
* Revised following Reese Wang's comments; still needed further debugging of four test_fused_attn.py::TestFusedAttn::test_backward AssertionError failures (NO_SWA, DROP_0.0, 4-128-256-16-16-64, BF16, CROSS, KV_PACKED/SEPARATE, NO_MASK, NO_BIAS/POST_SCALE_BIAS-1HSS) that did not exist in the previous commit
* Fix no-SWA test case errors in previous commit
* Add padding mask w/ sliding windows sanity tests
* Use float32 for the reference code softmax calculation
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: Reese Wang <rewang@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>
-
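One of the final fixes above runs the reference softmax in float32 rather than the test's half precision, so the comparison baseline stays accurate. The usual numerically-safe formulation, sketched in plain Python:

```python
import math

def stable_softmax(scores):
    """Reference softmax with the row max subtracted before exponentiating,
    so large attention scores cannot overflow; doing this in full precision
    keeps the reference baseline trustworthy."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A naive exp(1002.0) would overflow a double; the shifted form is fine.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
```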
- 09 Oct, 2024 2 commits

Charlene Yang authored
* add extra_state change description for different TE versions
* add FAQ page
* update FAQ page
* fix extra_state tests
* minor fixes
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* Handle Float8Tensor when casting module dtype: keep data in Float8Tensor and only change the nominal dtype; monkey-patch PyTorch module casting functions to handle Float8Tensor; add tests
* Respect autocast dtype in linear op
* Suppress linter warning
* Suppress linter warning
* Tweak comments (review suggestion from @ptrendx)
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 08 Oct, 2024 1 commit

Charlene Yang authored
* add qkv descales to FA3
* fix sbhd shapes
* force the same dtype when comparing FA3 and cuDNN FP8
* Revert "force the same dtype when comparing FA3 and cuDNN FP8" (reverts commit 19e7f877026a19a32d2f02c6c9de20df4ae2e064)
* force the same dtype when comparing FA3 and cuDNN FP8
* add try/except for FA3 when custom qkv descales are not supported
* replace FA3 installation warning with a debug logging message
* fix lint
* remove unused imports
* avoid varlen_func for FP8 and improve messaging
* add SWA support for FA3
* fix lint
* change preference reason for FP8 logic
* minor fixes
* minor fix
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
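The q/k/v "descales" wired into FA3 above are the inverses of the FP8 quantization scales: the kernel multiplies the FP8 payload by the descale to recover real values. A minimal sketch of the pair (448 is the E4M3 maximum; the per-tensor amax-based recipe shown here is the common convention, not necessarily TE's exact one):

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_scale_and_descale(amax):
    """The scale maps values into FP8 range; the matching descale is what a
    kernel such as FA3 needs to undo the quantization."""
    scale = FP8_E4M3_MAX / amax
    return scale, 1.0 / scale

scale, descale = fp8_scale_and_descale(amax=2.0)
print(scale)  # → 224.0
# A value survives the quantize/dequantize round trip up to rounding:
x = 1.5
assert abs((x * scale) * descale - x) < 1e-12
```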
- 07 Oct, 2024 3 commits

Charlene Yang authored
* adjust window size to (i-window_size_left, i] for cuDNN
* reduce the window to make any errors more pronounced
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
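The half-open interval in the entry above pins down exactly which keys a query may attend to under a left-bounded causal sliding window. Expressed as a predicate:

```python
def in_window(i, j, window_size_left):
    """Key position j is visible to query position i under a causal sliding
    window, using the half-open interval (i - window_size_left, i] that the
    commit aligns cuDNN to: the left edge is excluded, i itself is included."""
    return i - window_size_left < j <= i

# With window_size_left = 2, query 5 attends to keys 4 and 5 but not 3.
print([j for j in range(8) if in_window(5, j, 2)])  # → [4, 5]
```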
Xiaowei Ren authored
* change API for hierarchical CP
* move fp8 code before qkv reshape
* try to insert A2A for hierarchical CP
* make fwd work
* remove a redundant sync
* make bwd of hierarchical CP work
* fix dout a2a in bwd
* fix q_f16 with fp8
* assert hierarchical CP implementation does not support THD format
* bug fix
* assert hierarchical CP does not support attn bias
* add unit test for hierarchical CP
* fix cp_comm_type in unit test
* bug fix and code cleaning
* minor change
* an assert info change
* dout shape fix
* move function definitions to the front of the first call
* fix tensor view comments
* refine CP unit test
* typo fix
* typo fix
* save cp_size_a2a and rank_a2a in fwd
* add more explanations of cp_group in doc_string
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
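Hierarchical CP layers an inner all-to-all (A2A) exchange inside an outer point-to-point scheme, which is why the forward pass above saves cp_size_a2a and rank_a2a. An illustrative decomposition of a flat CP rank into the two levels (the actual group layout in TE may differ; this only shows the two-level idea):

```python
def split_cp_rank(cp_rank, cp_size_a2a):
    """Decompose a flat context-parallel rank into an inner all-to-all rank
    and an outer point-to-point rank, assuming inner A2A groups are formed
    from consecutive ranks. Purely illustrative of hierarchical CP."""
    rank_a2a = cp_rank % cp_size_a2a   # position inside the local A2A group
    rank_p2p = cp_rank // cp_size_a2a  # which A2A group, along the P2P ring
    return rank_a2a, rank_p2p

# 8 CP ranks with inner A2A groups of size 4 -> two groups on the P2P level.
print([split_cp_rank(r, 4) for r in range(8)])
```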
Paweł Gadziński authored
* Tests for distributed
* added the test to the qa script
* Changed qa
* fix to test_numerics file
* pr fixes
* fixes
* Update tests/pytorch/distributed/run_numerics.py
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 06 Oct, 2024 1 commit

Emmanuel Ferdman authored
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 04 Oct, 2024 1 commit

Tim Moon authored
* CPU perf optimization in linear autograd function: avoid enable_grad context when possible in cast function; cache distributed group properties
* CPU perf optimization in prepare_forward function: avoid torch.nn.Module impl of __setattr__
* Avoid module import in TE module forwards
* Use fast getter for params
* Reuse tensor dims in linear autograd func
* Apply optimizations to grouped linear
* Debug test failures
* Debug test failures
* Fix linter warnings
* Avoid deepcopy in tests
* Move _fast_setattr logic to __setattr__ method
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
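The __setattr__ optimization above exploits the fact that torch.nn.Module intercepts every attribute assignment to do parameter/buffer registration checks, which adds up on hot CPU paths. A toy illustration of bypassing such a hook when it is known to be unnecessary (the counter stands in for the expensive bookkeeping; this is the spirit of the change, not TE's code):

```python
class SlowSetattr:
    """Stand-in for torch.nn.Module, whose __setattr__ runs registry checks
    on every assignment; here we just count how often the slow path runs."""
    checks = 0

    def __setattr__(self, name, value):
        SlowSetattr.checks += 1          # pretend this is expensive bookkeeping
        super().__setattr__(name, value)

obj = SlowSetattr()
obj.state = "a"                          # goes through the slow path
object.__setattr__(obj, "state", "b")    # bypasses the hook entirely
print(SlowSetattr.checks)  # → 1
```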
- 27 Sep, 2024 1 commit

Xiaowei Ren authored
skip FP8 CP tests if hardware does not support FP8
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
-
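For context, FP8 kernels require Ada (sm89) or Hopper (sm90) and newer GPUs, which is what a test guard like the one above checks. A sketch of the capability test (the function name is illustrative; TE has its own detection helper):

```python
def device_supports_fp8(compute_capability):
    """FP8 needs CUDA compute capability 8.9 (Ada) or newer.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for A100.
    """
    return compute_capability >= (8, 9)

print(device_supports_fp8((8, 0)))  # → False  (A100)
print(device_supports_fp8((9, 0)))  # → True   (H100)
```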
- 18 Sep, 2024 1 commit

Tim Moon authored
Port optimizer tests to pytest
Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 17 Sep, 2024 1 commit

Michael Goldfarb authored
Implementation of context parallel fused attention using all-gather.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
-
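In the all-gather flavor of context parallelism referenced above, each rank holds a shard of the sequence, gathers the full key/value sequence from all ranks, and then runs attention locally on its own queries. A toy simulation of the gather step (plain lists stand in for the NCCL all-gather; names are illustrative):

```python
def all_gather_kv(local_kv_shards):
    """Simulate the all-gather step of context-parallel attention: every rank
    contributes its local K/V shard and receives the full sequence, after
    which attention is computed locally against the rank's own queries."""
    full = [tok for shard in local_kv_shards for tok in shard]
    return [list(full) for _ in local_kv_shards]  # one full copy per rank

shards = [["k0", "k1"], ["k2", "k3"]]   # 2 CP ranks, 2 sequence chunks each
gathered = all_gather_kv(shards)
print(gathered[0])  # → ['k0', 'k1', 'k2', 'k3']
```

The trade-off versus ring-style CP is memory for simplicity: every rank temporarily materializes the whole K/V sequence.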
- 16 Sep, 2024 1 commit

Michael Goldfarb authored
Modify unit tests to work around cuDNN 9.4 regression.
Signed-off-by: Michael Goldfarb <mgoldfarb@nvidia.com>
-
- 11 Sep, 2024 2 commits

Charlene Yang authored
* reduce atol/rtol for F16 tests
* relax the tols for Ampere
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Tim Moon authored
* Add base class for tensor proxies Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move tensor detaching logic to tensor proxy base class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use Python wrappers to PyTorch extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include transpose caching logic in proxy encode function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug dimension mismatch with amax history Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move dequantize logic to proxy_decode func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename to "QuantizedTensor" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename "proxy_detach" to "detach" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Include transpose cache in detach and clone funcs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update FP8 workspaces with QuantizedTensor functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move logic for FP8 transpose cache in FP8 workspaces to base class Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove cast-transpose logic from linear op Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unnecessary args for Float8Tensor when using FP8 attr dict Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove __torch_function__ to QuantizedTensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update tests/pytorch/test_float8tensor.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug FP8 transpose test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug cast functions Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
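The "QuantizedTensor" base class described in this entry wraps quantized storage behind a tensor-like interface with quantize/dequantize/detach hooks. A toy int8 illustration of the idea (not the TE `Float8Tensor`/`QuantizedTensor` API):

```python
import numpy as np

class QuantizedTensor:
    """Toy int8 tensor wrapper: stores a quantized payload plus a
    dequantization scale, dequantizing on demand (illustrative only)."""

    def __init__(self, data, scale):
        self.data = data          # int8 payload
        self.scale = scale        # dequantization scale (like scale_inv)

    @classmethod
    def quantize(cls, x):
        scale = max(np.abs(x).max() / 127.0, 1e-12)
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return cls(q, scale)

    def dequantize(self):
        return self.data.astype(np.float32) * self.scale

    def detach(self):
        # Share the payload, mirroring how detach can keep caches
        # (e.g. a transpose cache) attached to the same storage.
        return QuantizedTensor(self.data, self.scale)

x = np.array([0.5, -1.0, 0.25], dtype=np.float32)
qt = QuantizedTensor.quantize(x)
assert np.allclose(qt.dequantize(), x, atol=qt.scale)
assert qt.detach().data is qt.data
```

Centralizing detach/clone/dequantize in a base class is what lets FP8 workspaces and ops share one transpose-caching code path, per the commits above.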
-
- 09 Sep, 2024 2 commits
-
-
Xiaowei Ren authored
* clean code for CP function args Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a placeholder for Ulysses implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit code change to CP+A2A Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * finish the draft fwd implementation of Ulysses Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft bwd implementation of Ulysses Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make swa work with ulysses Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit FP8 code for Ulysses Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix qkv type in the bwd of FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix qkv_dtype of FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code refactoring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor code change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * config cp correction dtype of FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code style change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save chunk_ids Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try to make Ulysses A2A async Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make more a2a async Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a2a_outputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix chunk_ids generation for A2A Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * avoid code duplication of a2a before attn Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove code duplication of a2a after attn Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cp_stream in A2A implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix qkv of fp8_fwd + bf16_bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kernel order in cp a2a communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning for CP a2a Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix merging with main Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a2a communication order Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * adjust sequence chunk reordering for a2a Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add docstring for A2A implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change an assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add unit tests of A2A implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add more A2A unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add more cp unit tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix window size of no_mask Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fused attn does not support swa+no_mask Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change num_gqa_groups to 2 for A2A implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * function and variable renaming Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning for CP all-gather implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * some function renaming Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit code change for kv all-gather implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix all-gather implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a window size check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add unit test of all_gather+no_mask Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix all-gather cp implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FP8 with A2A implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add paper references to CP implementations with all-gather and all-to-all Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change pdf to abs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * elaborate cp_comm_type Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP docstring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
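The Ulysses-style all-to-all implementation added here swaps the sharding dimension around the attention call: before attention each rank holds all heads for a sequence chunk; after the A2A it holds a head subset for the full sequence, so ordinary (non-CP) attention kernels apply per rank. A single-process numpy sketch of the redistribution (illustrative names, not the TE API):

```python
import numpy as np

def a2a_seq_to_head(chunks, cp):
    """Simulated all-to-all: input is a list of cp per-rank tensors
    [seq/cp, heads, d]; output is per-rank [seq, heads/cp, d]."""
    h = chunks[0].shape[1] // cp
    out = []
    for r in range(cp):
        # Rank r receives head slice r from every rank's sequence chunk.
        out.append(np.concatenate([c[:, r*h:(r+1)*h] for c in chunks], axis=0))
    return out

rng = np.random.default_rng(0)
cp, seq, heads, d = 2, 8, 4, 3
x = rng.standard_normal((seq, heads, d))
shards = np.split(x, cp, axis=0)           # sequence-parallel layout
redist = a2a_seq_to_head(shards, cp)       # head-parallel layout

assert redist[0].shape == (seq, heads // cp, d)
# The inverse A2A (heads -> sequence) recovers the original layout.
assert np.allclose(np.concatenate(redist, axis=1), x)
```

This is why the commit changes `num_gqa_groups` for the A2A tests: the head count must be divisible by the CP size.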
-
Xin Yao authored
* propagate scale_inv modification to GroupedLinear Signed-off-by:
Xin Yao <xiny@nvidia.com> * optimization for separate scale_inv of weights and single output Signed-off-by:
Xin Yao <xiny@nvidia.com> * let grouped gemm support different input combinations Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix type Signed-off-by:
Xin Yao <xiny@nvidia.com> * add contiguous check Signed-off-by:
Xin Yao <xiny@nvidia.com> * use len() instead of isinstance Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ut Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
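The grouped GEMM behavior this entry extends amounts to one GEMM per group, each with its own weight (and, under FP8, its own `scale_inv`). A reference loop showing the contract (a sketch, not the fused TE kernel):

```python
import numpy as np

def grouped_gemm(inputs, weights):
    """Reference grouped GEMM: one matmul per group.
    inputs[i]: [m_i, k], weights[i]: [k, n] -> outputs[i]: [m_i, n]."""
    assert len(inputs) == len(weights)
    return [x @ w for x, w in zip(inputs, weights)]

rng = np.random.default_rng(0)
k, n = 4, 5
xs = [rng.standard_normal((m, k)) for m in (2, 3, 1)]
ws = [rng.standard_normal((k, n)) for _ in range(3)]
outs = grouped_gemm(xs, ws)
assert [o.shape for o in outs] == [(2, n), (3, n), (1, n)]
```

The "single output" optimization mentioned above corresponds to writing all group outputs into one contiguous buffer (rows 0..m_0, m_0..m_0+m_1, ...) instead of separate tensors, which is why contiguity checks were added.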
-
- 05 Sep, 2024 2 commits
-
-
Ruibin Cheung authored
* [TE/PyTorch][MoE] Add FP8 padding and unpadding modules 1. Add multi-tensor padding kernel for FP8 with padding size = 16. 2. Add Fp8Padding and Fp8Unpadding modules 3. Add padded GroupedLinear unit tests --------- Signed-off-by:
beinggod <zhangruibin@01.ai> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
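The padding described above rounds each expert group's token count up to a multiple of 16 so FP8 GEMMs see aligned shapes, then strips the padding afterwards. A minimal sketch of the two operations:

```python
import numpy as np

PAD = 16  # FP8 alignment from the commit message

def pad_group(x, pad=PAD):
    """Pad a group's token (row) dimension up to a multiple of `pad`."""
    m = x.shape[0]
    target = -(-m // pad) * pad            # ceil to multiple of pad
    return np.pad(x, ((0, target - m), (0, 0)))

def unpad_group(x, orig_rows):
    """Drop the padded rows again after the grouped GEMM."""
    return x[:orig_rows]

x = np.ones((13, 8), dtype=np.float32)
padded = pad_group(x)
assert padded.shape == (16, 8)
assert np.array_equal(unpad_group(padded, 13), x)
```

The multi-tensor kernel version does the same thing for all expert groups in one launch instead of one `pad` call per group.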
-
Xin Yao authored
* fp8 mha with rope Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid index select in cast ops Signed-off-by:
Xin Yao <xiny@nvidia.com> * avoid index select in fused_attn_fwd Signed-off-by:
Xin Yao <xiny@nvidia.com> * rename is_first_module_in_mha to fp8_output Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move transpose to backward for fp8 input Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ut Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * update argument list for CP Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix for FA3 Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unnecessary copy of scale_inv Signed-off-by:
Xin Yao <xiny@nvidia.com> * skip fp8 dpa/mha tests when fa3 is not available Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix a merge bug Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
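For reference, the RoPE that this FP8 MHA path now composes with rotates channel pairs of Q and K by a position-dependent angle. A minimal numpy sketch using the interleaved-pair convention (TE supports other layouts as well):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x: [seq, d] with even d.
    Rotates channel pairs (2i, 2i+1) by theta = pos * base**(-2i/d)."""
    seq, d = x.shape
    pos = np.arange(seq)[:, None]                   # [seq, 1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # [d/2]
    theta = pos * inv_freq                          # [seq, d/2]
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 6))
y = rope(x)
# Position 0 is unrotated; a rotation preserves pairwise norms.
assert np.allclose(y[0], x[0])
assert np.allclose(np.linalg.norm(y, axis=1), np.linalg.norm(x, axis=1))
```

The interaction with FP8 in the commit is about where the cast happens: applying RoPE before quantizing Q/K avoids an extra dequantize/requantize round trip.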
-
- 30 Aug, 2024 1 commit
-
-
Charlene Yang authored
* fix FP8 logic when FA3 is not installed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweak to make logic more explicit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * limit FA3 warning to Hopper and NVTE_FLASH_ATTN=1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * prefer fused attn for FP8 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
- 29 Aug, 2024 1 commit
-
-
Xin Yao authored
* remove dtype from args * update docs with permutation ops --------- Signed-off-by: Xin Yao <xiny@nvidia.com>
-
- 23 Aug, 2024 1 commit
-
-
Charlene Yang authored
* WIP: add fa3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: add benchmarks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * differentiate func/varlen_func Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix parsing keyword for FA3 and remove bshd->thd conversion for flash_attn_func Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add FP8 fwd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add FA3 FP8 fwd code and test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix assert for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FA3 FP8 logic and add tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA2 to <=2.6.3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak unit tests for base/mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * set constraints for FA3 for sm90 and causal_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert debug changes in benchmark script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
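The FA3 constraints accumulated in this entry (sm90 only, causal_bottom_right handling, FP8 support gated on FA3 availability) amount to a backend-selection policy. A hypothetical sketch of such a policy (the function and its inputs are illustrative, not TE's actual dispatch logic):

```python
def pick_attention_backend(dtype, sm_arch, fa3_installed):
    """Hypothetical policy mirroring the constraints above:
    FA3 requires Hopper (sm90); FP8 through flash attention
    requires FA3; otherwise fall back to FA2."""
    if dtype == "fp8":
        # FP8 flash attention is only available via FA3 on sm90.
        if fa3_installed and sm_arch == 90:
            return "flash_attn_3"
        return "fused_attn"
    if fa3_installed and sm_arch == 90:
        return "flash_attn_3"
    return "flash_attn_2"

assert pick_attention_backend("fp8", 90, True) == "flash_attn_3"
assert pick_attention_backend("fp8", 90, False) == "fused_attn"
assert pick_attention_backend("bf16", 80, True) == "flash_attn_2"
```

A later commit in this log ("prefer fused attn for FP8") adjusts exactly this kind of preference order.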
-
- 22 Aug, 2024 1 commit
-
-
NVJiangShao authored
* Add permutation functions * Add permutation ops * Remove the dependency on cutlass * Move permutation.py out of module dir * Rewrite the unit test and enable skipping if FP8 is unavailable * Rename exposed C++ API and reorder its parameters + take NVTETensor as inputs * Use Float8Tensor for FP8 input * Move dtype to ctx --------- Signed-off-by:
Jiang Shao <jiangs@nvidia.com> Co-authored-by:
Qi Zhang <qizhang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
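The permutation ops added here reorder MoE tokens so that tokens routed to the same expert are contiguous before the expert GEMMs, then restore the original order afterwards. The core operation can be sketched with a stable argsort:

```python
import numpy as np

def permute(tokens, expert_ids):
    """Sort tokens by assigned expert; return the sorted tokens and
    the row mapping needed to undo the sort."""
    order = np.argsort(expert_ids, kind="stable")
    return tokens[order], order

def unpermute(sorted_tokens, order):
    """Scatter rows back to their pre-permutation positions."""
    out = np.empty_like(sorted_tokens)
    out[order] = sorted_tokens
    return out

tokens = np.arange(8, dtype=np.float32).reshape(4, 2)
expert_ids = np.array([1, 0, 1, 0])
sorted_tokens, order = permute(tokens, expert_ids)
assert np.array_equal(expert_ids[order], np.sort(expert_ids))
assert np.array_equal(unpermute(sorted_tokens, order), tokens)
```

Moving `dtype` into `ctx` (per the commit) means the backward pass can re-materialize gradients in the right precision without threading an extra argument through the op's API.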
-
- 21 Aug, 2024 2 commits
-
-
Charlene Yang authored
* add support for padding in UnfusedDPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for padding_causal/_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix padding_causal/_bottom_right Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * need to test max512 backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix mask logic in unfused Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use actual_seqlen for alibi/causal_bottom_right padding Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes and convert causal to causal_bottom_right for inference Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use causal in kv cache inference test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify get_alibi logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * simplify the non-padding path for get_alibi Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * avoid batch_size loop in generating padding_causal/_bottom_right masks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
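The final optimization in this entry, building `padding_causal`/`padding_causal_bottom_right` masks without a batch-size loop, can be done with one broadcasted expression over row/column indices. A sketch (mask convention and names are illustrative, not TE's internals):

```python
import numpy as np

def padding_causal_bottom_right_mask(seqlens_q, seqlens_kv, sq, skv):
    """Build [b, sq, skv] boolean masks (True = masked out) in one
    vectorized expression, with the causal diagonal aligned to the
    bottom-right corner of each sample's actual lengths."""
    row = np.arange(sq)[None, :, None]           # [1, sq, 1]
    col = np.arange(skv)[None, None, :]          # [1, 1, skv]
    lq = np.asarray(seqlens_q)[:, None, None]    # [b, 1, 1]
    lkv = np.asarray(seqlens_kv)[:, None, None]  # [b, 1, 1]
    pad = (row >= lq) | (col >= lkv)             # beyond actual lengths
    causal = col - row > lkv - lq                # bottom-right alignment
    return pad | causal

m = padding_causal_bottom_right_mask([2], [3], sq=4, skv=4)
# Sample 0: 2 valid queries, 3 valid keys; query 1 sees keys 0..2.
assert m.shape == (1, 4, 4)
assert not m[0, 1, 2] and m[0, 1, 3] and m[0, 2, 0]
```

Broadcasting over the batch dimension is what removes the per-sample Python loop the commit mentions.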
-
Xiaowei Ren authored
* add window_size to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo for cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add seq_offsets_qkvo to AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets calculation of cudnn thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove a thd assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias for thd test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add thd test for cudnn FA with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * skip GQA/MQA test for cuDNN THD Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix seq_offsets inputs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove two comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn mask type for cudnn thd with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attn_mask_type for cudnn fa with thd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix a typo Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix out/dout in bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert cudnn+thd does not support attn bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if attn_mask_type has padding Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change cp test batch size to 2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix two assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert comments Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert swa+CP cannot work with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a new CP function for swa Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a missing dgrads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft fwd function for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * enable flash attention for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove an assert of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * call SWAFuncWithCP for swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use 2hd layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * change qkv_format check Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a code comment Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * tensor shape bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tensor shape fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add function to compute cu_seqlens of a cp rank Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cu_seqlens and cu_seqlens_padded to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix FlashAttention output sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix cu_seqlens_kv_per_step calculation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV for ending padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero dQKV tensors of FlashAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse correction Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove padded tokens of KV to save communication Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not need to zero dkv for FlashAttention anymore Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * zero out tensors Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of cp test with thd format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add simple code framework Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try not to have a separate CP function for SWA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * backup some code change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * back up code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * clean up fwd implementation of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * reduce kv chunk concat overheads Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP and SWAFuncWithCP have same API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a docstring Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * preliminary implementation of SWAFuncWithCP forward seems working Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix output shape of SWAFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code refactoring for FlashAttention and add a code placeholder for bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * use gather_along_first_dim Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * finish the preliminary implementation of bwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix assert condition Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft implementation of SWA+CP with FusedAttention Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix attention mask type of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add qkv_layout Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add missing window_size argument Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix kv shape of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug and typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dout shape Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in fwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save chunk_ids_to_kv_ag in fwd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add multi stream in bwd of swa+cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix to cp stream sync Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * rename AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * check if window size is None Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix docstring of AttnFuncWithCP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add env var for users to choose KV ag or KV p2p Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * update cp tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix window size in cp unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix pytest skip messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add cp_comm_type into API Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * code cleaning Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add deterministic knob in cuDNN fused attn backend Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * pass fp8 and fp8_meta to attn_func_with_cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert only Fused Attn can support FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add a fwd draft implementation of FP8 + CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * save fp8 and fp8_meta Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert sequence length divisible requirements Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove a redundant qkv_layout compute Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * if condition change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * some typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add support table of context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * typo and code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not print multiple disabling messages Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix aux_ctx_tensors of FP8 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix device in torch.arange and adjust code for the PR of MLA Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * commit code change for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit more code change for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * commit more fp8 code for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fixes Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * cast merged CP results from FP32 to BF16 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix softmax_lse Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix some bugs of FP8 dkv exchange Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add FP8 unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix typos and clean asserts Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix get_p2p_comm_info Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix dkv p2p exchange Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change FP8 dkv P2P to A2A Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add FP8+CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert amax reduction is needed for FP8+CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove duplicated code Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * destroy process group in CP unit test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove interval from fp8_recipe because it has been deprecated Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * try to fix the failed CP test with the latest CI pipeline Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove redundant f before string Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change META_O_CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xiaowei Ren <xren@cs-cw-dfw-login-01.cm.cluster>
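The "fix softmax_lse correction" steps above refer to merging per-step partial attention outputs using their per-row log-sum-exp statistics, the standard flash-attention combine. A numpy sketch that verifies the identity out = sum_i exp(lse_i - lse) * out_i, with lse = logsumexp over chunks:

```python
import numpy as np

def softmax(x):
    m = x.max(-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(-1, keepdims=True)

def merge_partial_attention(outs, lses):
    """Combine per-chunk attention outputs with their per-row
    log-sum-exp stats, computed stably via the running max."""
    lses = np.stack(lses)                       # [chunks, rows]
    outs = np.stack(outs)                       # [chunks, rows, d]
    lse_max = lses.max(axis=0)
    lse = lse_max + np.log(np.exp(lses - lse_max).sum(axis=0))
    w = np.exp(lses - lse)                      # per-chunk weights
    return (w[..., None] * outs).sum(axis=0)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 8))
s = q @ k.T
ref = softmax(s) @ v

# Split KV into two chunks (as CP steps do) and merge the partials.
outs, lses = [], []
for sl in (slice(0, 3), slice(3, 6)):
    sc = s[:, sl]
    m = sc.max(-1)
    lses.append(m + np.log(np.exp(sc - m[:, None]).sum(-1)))
    outs.append(softmax(sc) @ v[sl])
assert np.allclose(merge_partial_attention(outs, lses), ref)
```

Accumulating the merge in FP32 and casting the final result to BF16, as one of the commits above does, bounds the rounding error of repeated corrections.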
-