Unverified commit 2a95efd3, authored by Xiaowei Ren and committed by GitHub

CP implementation refinement for BSHD/SBHD format (#1523)

* fix recompilation of out and lse correction in p2p+bshd/sbhd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix recompilation of get_seq_chunk_ids_for_reordering
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix recompilation of reorder_seq_chunks_for_a2a
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* recover a change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change to softmax_lse correction
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cache cu_seqlens for BSHD/SBHD format
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* no need to allocate out buffer for BSHD/SBHD
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code refactoring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* refactor init out correction
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix a docstring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code refactoring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix init out correction dtype
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to the DPA API
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to the API of MHA and transformer layer
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to the API of MHA and transformer layer
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent 2ad5da95
@@ -546,6 +546,7 @@ class TransformerLayer(torch.nn.Module):
         max_seqlen_q: Optional[int] = None,
         max_seqlen_kv: Optional[int] = None,
         fast_zero_fill: bool = True,
+        pad_between_seqs: Optional[bool] = None,
     ) -> torch.Tensor:
         """
         Transformer Layer: attention block and a feedforward network (MLP)
@@ -637,6 +638,9 @@ class TransformerLayer(torch.nn.Module):
         inference_params: InferenceParams, default = None
                          Inference parameters that are passed to the main model in order
                          to efficiently calculate and store the context during inference.
+        pad_between_seqs: Optional[bool], default = `None`
+                         If None, inferred from qkv_format, cu_seqlens and cu_seqlens_padded.
+                         If true, there are padding tokens between individual sequences in a packed batch.
         """
         if self_attn_mask_type is None:
@@ -697,6 +701,7 @@ class TransformerLayer(torch.nn.Module):
             max_seqlen_q=max_seqlen_q,
             max_seqlen_kv=max_seqlen_kv,
             fast_zero_fill=fast_zero_fill,
+            pad_between_seqs=pad_between_seqs,
         )
         if self.apply_residual_connection_post_layernorm and not self.output_layernorm:
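For reference, a minimal usage sketch of the new flag (not part of this diff): it assumes TransformerEngine's `DotProductAttention`, to which the commit log above says `pad_between_seqs` was also added, used with the packed THD format. The shapes, sequence lengths, mask type, and CUDA device below are illustrative assumptions, not values from the commit.

```python
# Hypothetical sketch of calling the refined API -- not code from this commit.
import torch
import transformer_engine.pytorch as te

num_heads, head_dim = 16, 64

# Two packed sequences of 5 and 7 tokens, each padded to 8 -> 16 total tokens.
# cu_seqlens counts real tokens; cu_seqlens_padded counts tokens incl. padding.
cu_seqlens = torch.tensor([0, 5, 12], dtype=torch.int32, device="cuda")
cu_seqlens_padded = torch.tensor([0, 8, 16], dtype=torch.int32, device="cuda")

q = torch.randn(16, num_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

dpa = te.DotProductAttention(
    num_heads, head_dim, qkv_format="thd", attn_mask_type="padding_causal"
)
out = dpa(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    cu_seqlens_q_padded=cu_seqlens_padded,
    cu_seqlens_kv_padded=cu_seqlens_padded,
    max_seqlen_q=8,
    max_seqlen_kv=8,
    # Padding tokens sit between the packed sequences, so state it explicitly.
    pad_between_seqs=True,
)
```

Leaving `pad_between_seqs` as `None` defers to the inference described in the new docstring, which derives the value from `qkv_format`, `cu_seqlens` and `cu_seqlens_padded`; passing it explicitly simply bypasses that check.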