Unverified Commit f386c51a authored by liangjs, committed by GitHub

StableLM: Fix dropout argument type error (#29236)



* fix stablelm dropout argument type error

* fix docs of _flash_attention_forward

* fix all docs of _flash_attention_forward

* fix docs of _flash_attention_forward in starcoder2

---------
Co-authored-by: oliang <oliang@tencent.com>
parent 1ea3ad1a
...@@ -306,7 +306,7 @@ class BarkSelfFlashAttention2(BarkSelfAttention): ...@@ -306,7 +306,7 @@ class BarkSelfFlashAttention2(BarkSelfAttention):
attention_mask (`torch.Tensor`): attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens. position of padding tokens and 1 for the position of non-padding tokens.
dropout (`int`, *optional*): dropout (`float`):
Attention dropout Attention dropout
softmax_scale (`float`, *optional*): softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim) The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
......
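The docstring hunks in this commit all make the same correction: `dropout` is a probability consumed as a `float`, not an `int`. The sketch below is a hedged illustration of how that float typically reaches the flash-attention kernel; the keyword names come from the flash-attn package's `flash_attn_func`, and the surrounding plumbing is simplified rather than copied from this diff.

```python
# Hedged sketch (not the actual transformers implementation): the `dropout`
# documented above is forwarded to the flash-attn kernel as a float probability.
from typing import Optional

import torch
from flash_attn import flash_attn_func  # requires the flash-attn package


def flash_attention_forward_sketch(
    query_states: torch.Tensor,
    key_states: torch.Tensor,
    value_states: torch.Tensor,
    dropout: float = 0.0,                   # probability in [0, 1) -> `float`, not `int`
    softmax_scale: Optional[float] = None,  # kernel defaults to 1 / sqrt(head_dim)
    causal: bool = True,
) -> torch.Tensor:
    # flash_attn_func expects `dropout_p` as a plain float probability; passing a
    # non-numeric object (e.g. an nn.Dropout module, as in the StableLM bug this
    # commit fixes) raises an error inside the kernel wrapper.
    return flash_attn_func(
        query_states,
        key_states,
        value_states,
        dropout_p=dropout,
        softmax_scale=softmax_scale,
        causal=causal,
    )
```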
@@ -430,7 +430,7 @@ class BartFlashAttention2(BartAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -370,7 +370,7 @@ class DistilBertFlashAttention2(MultiHeadSelfAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -657,7 +657,7 @@ class FalconFlashAttention2(FalconAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -410,7 +410,7 @@ class GemmaFlashAttention2(GemmaAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -425,7 +425,7 @@ class GPTBigCodeFlashAttention2(GPTBigCodeAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -407,7 +407,7 @@ class GPTNeoFlashAttention2(GPTNeoSelfAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -439,7 +439,7 @@ class GPTNeoXFlashAttention2(GPTNeoXAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -518,7 +518,7 @@ class LlamaFlashAttention2(LlamaAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -420,7 +420,7 @@ class MBartFlashAttention2(MBartAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -496,7 +496,7 @@ class MistralFlashAttention2(MistralAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -574,7 +574,7 @@ class MixtralFlashAttention2(MixtralAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -394,7 +394,7 @@ class OptFlashAttention2(OPTAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -540,7 +540,7 @@ class PhiFlashAttention2(PhiAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -502,7 +502,7 @@ class Qwen2FlashAttention2(Qwen2Attention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -549,7 +549,7 @@ class StableLmFlashAttention2(StableLmAttention):
    key_states = key_states.transpose(1, 2)
    value_states = value_states.transpose(1, 2)
-   dropout_rate = self.attention_dropout if self.training else 0.0
+   dropout_rate = self.attention_dropout.p if self.training else 0.0
    attn_output = self._flash_attention_forward(
        query_states,
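The StableLm hunk above is the actual bug fix. Assuming `StableLmAttention` stores its dropout as a module, i.e. `self.attention_dropout = nn.Dropout(config.attention_dropout)` (most other architectures keep the raw float), the old code handed the module itself to a parameter that expects a float probability; `.p` reads that probability off the module. A minimal illustration:

```python
# Minimal illustration of the type mismatch fixed above. Assumes StableLmAttention
# holds `self.attention_dropout = nn.Dropout(config.attention_dropout)`.
import torch.nn as nn

attention_dropout = nn.Dropout(0.1)   # an nn.Module, not a number

broken = attention_dropout            # old code: passes the module to flash attention
fixed = attention_dropout.p           # new code: the underlying float probability, 0.1

assert isinstance(fixed, float)
assert not isinstance(broken, float)  # the module itself is not a valid dropout rate
```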
@@ -586,7 +586,7 @@ class StableLmFlashAttention2(StableLmAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -481,7 +481,7 @@ class Starcoder2FlashAttention2(Starcoder2Attention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
@@ -536,7 +536,7 @@ class WhisperFlashAttention2(WhisperAttention):
    attention_mask (`torch.Tensor`):
        The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
        position of padding tokens and 1 for the position of non-padding tokens.
-   dropout (`int`, *optional*):
+   dropout (`float`):
        Attention dropout
    softmax_scale (`float`, *optional*):
        The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)