"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "68c5d532645b9ca4edb986236ded0cc0af0d7761"
Unverified Commit a7d0b288 authored by Funtowicz Morgan, committed by GitHub

Remove the need for `einsum` in Albert's attention computation (#12394)

* debug albert einsum

* Fix matmul computation

* Let's use torch linear layer.

* Style.
parent 276bc149
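
Note (not part of the commit, a minimal sketch for context): for a square nn.Linear(hidden_size, hidden_size), the removed einsum plus hand-added bias is algebraically the same as merging the head and head-size dimensions of the context tensor and calling the layer directly, because torch.einsum("bfnd,ndh->bfh", x, weight.t().view(n, d, h)) + bias reduces to x.flatten(2) @ weight.T + bias. The snippet below checks this numerically; the sizes (batch, seq_len, num_heads, head_size) and the standalone dense layer are illustrative stand-ins for the module's self.dense.

import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, num_heads, head_size = 2, 5, 4, 8
hidden_size = num_heads * head_size          # ALBERT requires num_heads * head_size == hidden_size

dense = nn.Linear(hidden_size, hidden_size)  # stand-in for AlbertAttention.self.dense

# context_layer as produced by torch.matmul(attention_probs, value_layer):
# shape [batch, num_heads, seq_len, head_size]
context_layer = torch.randn(batch, num_heads, seq_len, head_size)

# Old path (removed): bring heads next to head_size, then einsum against the
# weight viewed as [num_heads, head_size, hidden_size] and add the bias by hand.
old = context_layer.permute(0, 2, 1, 3).contiguous()
w = dense.weight.t().view(num_heads, head_size, hidden_size)
old_out = torch.einsum("bfnd,ndh->bfh", old, w) + dense.bias

# New path: same permutation via transpose(2, 1), merge the last two dims,
# and let the linear layer do the matmul + bias.
new_out = dense(context_layer.transpose(2, 1).flatten(2))

print(torch.allclose(old_out, new_out, atol=1e-6))  # expected: True; both yield [batch, seq_len, hidden_size]

Besides being easier to read, the new form no longer rebuilds the reshaped weight view on every forward pass; the only per-call reshape is on the activations.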
@@ -360,18 +360,9 @@ class AlbertAttention(nn.Module):
             attention_probs = attention_probs * head_mask

         context_layer = torch.matmul(attention_probs, value_layer)
-        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
-
-        # Should find a better way to do this
-        w = (
-            self.dense.weight.t()
-            .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
-            .to(context_layer.dtype)
-        )
-        b = self.dense.bias.to(context_layer.dtype)
-
-        projected_context_layer = torch.einsum("bfnd,ndh->bfh", context_layer, w) + b
+        context_layer = context_layer.transpose(2, 1).flatten(2)
+
+        projected_context_layer = self.dense(context_layer)
         projected_context_layer_dropout = self.output_dropout(projected_context_layer)
         layernormed_context_layer = self.LayerNorm(hidden_states + projected_context_layer_dropout)
         return (layernormed_context_layer, attention_probs) if output_attentions else (layernormed_context_layer,)