Unverified Commit 2a95efd3 authored by Xiaowei Ren, committed by GitHub

CP implementation refinement for BSHD/SBHD format (#1523)



* fix recompilation of out and lse correction in p2p+bshd/sbhd
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix recompilation of get_seq_chunk_ids_for_reordering
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* fix recompilation of reorder_seq_chunks_for_a2a
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* recover a change
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* minor change to softmax_lse correction
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
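
For context, the out/lse correction referenced here and in the first bullet is the standard log-sum-exp merge of partial attention results used by ring/context-parallel attention. Below is a minimal PyTorch sketch of that general technique (an illustrative helper, not this PR's exact code):

```python
import torch

def merge_partial_attn(out1, lse1, out2, lse2):
    """Merge two partial attention results via log-sum-exp correction.

    Each step i produces a partial output out_i and the per-row softmax
    log-sum-exp lse_i over the key chunk it saw. Shapes are illustrative:
    out_i broadcasts against lse_i.unsqueeze(-1).
    """
    # Combined LSE over both key chunks.
    lse = torch.logaddexp(lse1, lse2)
    # Rescale each partial output by its share of the total softmax mass.
    out = (
        out1 * torch.exp(lse1 - lse).unsqueeze(-1)
        + out2 * torch.exp(lse2 - lse).unsqueeze(-1)
    )
    return out, lse
```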

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* cache cu_seqlens for BSHD/SBHD format
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
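
For BSHD/SBHD inputs every sequence in the batch shares the same padded length, so cu_seqlens depends only on batch size and sequence length and can be computed once and reused instead of being rebuilt every forward pass. A hedged sketch of such caching (hypothetical helper, not the PR's exact code):

```python
import torch

_cu_seqlens_cache = {}

def get_cu_seqlens(batch_size: int, max_seqlen: int, device: torch.device) -> torch.Tensor:
    """Return cached cumulative sequence lengths [0, s, 2s, ..., b*s]."""
    key = (batch_size, max_seqlen, device)
    if key not in _cu_seqlens_cache:
        _cu_seqlens_cache[key] = torch.arange(
            0,
            (batch_size + 1) * max_seqlen,
            max_seqlen,
            dtype=torch.int32,
            device=device,
        )
    return _cu_seqlens_cache[key]
```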

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* do not need to allocate out buffer for BSHD/SBHD
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code refactoring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci



* minor fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* refactor init out correction
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix a docstring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* typo fix
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* code refactoring
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix init out correction dtype
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to DPA API
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to the API of MHA and transformer layer
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add pad_between_seqs to the API of MHA and transformer layer
Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent 2ad5da95
@@ -546,6 +546,7 @@ class TransformerLayer(torch.nn.Module):
         max_seqlen_q: Optional[int] = None,
         max_seqlen_kv: Optional[int] = None,
         fast_zero_fill: bool = True,
+        pad_between_seqs: Optional[bool] = None,
     ) -> torch.Tensor:
         """
         Transformer Layer: attention block and a feedforward network (MLP)
@@ -637,6 +638,9 @@ class TransformerLayer(torch.nn.Module):
         inference_params: InferenceParams, default = None
                          Inference parameters that are passed to the main model in order
                          to efficiently calculate and store the context during inference.
+        pad_between_seqs: Optional[bool], default = `None`
+                         If `None`, inferred from qkv_format, cu_seqlens and cu_seqlens_padded.
+                         If `True`, there are padding tokens between individual sequences in a packed batch.
         """

         if self_attn_mask_type is None:
@@ -697,6 +701,7 @@ class TransformerLayer(torch.nn.Module):
             max_seqlen_q=max_seqlen_q,
             max_seqlen_kv=max_seqlen_kv,
             fast_zero_fill=fast_zero_fill,
+            pad_between_seqs=pad_between_seqs,
         )
         if self.apply_residual_connection_post_layernorm and not self.output_layernorm:
...
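
A usage sketch of the new forward argument shown in the diff above (sizes, the SBHD layout, and the config are illustrative assumptions; requires a Transformer Engine build that includes this change):

```python
import torch
import transformer_engine.pytorch as te

# Illustrative sizes; any valid TransformerLayer config works the same way.
layer = te.TransformerLayer(
    hidden_size=1024,
    ffn_hidden_size=4096,
    num_attention_heads=16,
).cuda()

x = torch.randn(128, 2, 1024, device="cuda")  # [s, b, h], the default SBHD layout

# pad_between_seqs is forwarded down through MHA to the attention module.
# None (the default) keeps the previous behavior of inferring it from
# qkv_format, cu_seqlens and cu_seqlens_padded, per the new docstring;
# pass an explicit bool to override the inference.
y = layer(x, pad_between_seqs=None)
```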