Add a context parallelism implementation with QKVO all-to-all (#1160)

* clean code for CP function args

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add a placeholder for Ulysses implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* commit code change to CP+A2A

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* finish the draft fwd implementation of Ulysses

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add draft bwd implementation of Ulysses

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make swa work with ulysses

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* commit FP8 code for Ulysses

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix qkv type in the bwd of FP8+CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix qkv_dtype of FP8+CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code refactoring

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor code change

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* config cp correction dtype of FP8+CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* code style change

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* save chunk_ids

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* try to make Ulysses A2A async

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* make more a2a async

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix a2a_outputs

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix chunk_ids generation for A2A

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* avoid code duplication of a2a before attn

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove code duplication of a2a after attn

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add cp_stream in A2A implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix qkv of fp8_fwd + bf16_bwd

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix kernel order in cp a2a communication

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code cleaning for CP a2a

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix merging with main

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix a2a communication order

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* adjust sequence chunk reordering for a2a

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add docstring for A2A implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change an assert info

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add unit tests of A2A implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add more A2A unit test

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix CP unit tests

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add more cp unit tests

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix window size of no_mask

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fused attn does not support swa+no_mask

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change num_gqa_groups to 2 for A2A implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* function and variable renaming

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code cleaning for CP all-gather implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* some function renaming

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove redundant code

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* commit code change for kv all-gather implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix all-gather implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add a window size check

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code cleaning

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add unit test of all_gather+no_mask

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix all-gather cp implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code cleaning

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* code format fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code format fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix FP8 with A2A implementation

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add paper references to CP implementations with all-gather and all-to-all

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* change pdf to abs

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* elaborate cp_comm_type

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix CP docstring

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>