Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head)) (#21627)

* Error in model, scaling only q matrix not qK.T dot product (qk.T/sqrt(dim_per_head))

  As per Vaswani et al, 2017 p.4

  Is torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head) not q / math.sqrt(dim_per_head)

  https://arxiv.org/pdf/1912.05372.pdf

  Error was in original FlauBERT repo and effectively scales queries but not values
  cf. https://github.com/getalp/Flaubert/pull/45/commits/6d176880ca3a1a8dfa2b76c97030bb51c5e917b8

* Update modeling_flaubert.py

  Update to https://github.com/huggingface/transformers/pull/21627
  make fixup
  make repo_consistency

* Update modeling_xlm.py

* Update modeling_flaubert.py

* Update modeling_xlm.py

Commit bad83008 (unverified), authored Feb 14, 2023 by Benoit, committed by GitHub Feb 14, 2023
Showing with 2 additions and 4 deletions

src/transformers/models/flaubert/modeling_flaubert.py +1 -2

src/transformers/models/xlm/modeling_xlm.py +1 -2

--- a/src/transformers/models/flaubert/modeling_flaubert.py
+++ b/src/transformers/models/flaubert/modeling_flaubert.py
@@ -172,8 +172,7 @@ class MultiHeadAttention(nn.Module):
                    k, v = cache[self.layer_id]
            cache[self.layer_id] = (k, v)

-        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)
-        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)
+        scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, klen)
        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)
        scores.masked_fill_(mask, torch.finfo(scores.dtype).min)  # (bs, n_heads, qlen, klen)


--- a/src/transformers/models/xlm/modeling_xlm.py
+++ b/src/transformers/models/xlm/modeling_xlm.py
@@ -176,8 +176,7 @@ class MultiHeadAttention(nn.Module):
                    k, v = cache[self.layer_id]
            cache[self.layer_id] = (k, v)

-        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)
-        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)
+        scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, klen)
        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)
        scores.masked_fill_(mask, torch.finfo(scores.dtype).min)  # (bs, n_heads, qlen, klen)
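
Note for readers of this diff: for a scalar c, (q / c) · k.T equals (q · k.T) / c algebraically, so the removed and added lines should produce the same scores up to floating-point rounding. The sketch below checks this on random data for a single head. It uses plain Python lists rather than torch tensors so that it runs without dependencies; the helper names and the sizes are hypothetical and are not taken from modeling_flaubert.py or modeling_xlm.py.

```python
import math
import random


def matmul_t(a, b):
    """Return a @ b.T for two row-major lists of lists with equal row length."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def scores_scale_q_first(q, k, dim_per_head):
    """Removed form: divide q by sqrt(dim_per_head), then multiply by k.T."""
    scale = math.sqrt(dim_per_head)
    q_scaled = [[x / scale for x in row] for row in q]
    return matmul_t(q_scaled, k)


def scores_scale_product(q, k, dim_per_head):
    """Added form: multiply q by k.T, then divide by sqrt(dim_per_head)."""
    scale = math.sqrt(dim_per_head)
    return [[s / scale for s in row] for row in matmul_t(q, k)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical sizes for one attention head (assumption, not from the source).
rng = random.Random(0)
qlen, klen, dim_per_head = 4, 5, 64
q = [[rng.gauss(0.0, 1.0) for _ in range(dim_per_head)] for _ in range(qlen)]
k = [[rng.gauss(0.0, 1.0) for _ in range(dim_per_head)] for _ in range(klen)]

old = scores_scale_q_first(q, k, dim_per_head)   # (qlen, klen)
new = scores_scale_product(q, k, dim_per_head)   # (qlen, klen)
diff = max_abs_diff(old, new)
print(f"scores shape: ({len(new)}, {len(new[0])}), max abs difference: {diff:.3e}")
```

Under these assumptions the change mainly moves where the scaling is applied, matching the formula as written in the paper. In reduced precision (for example float16) the two orderings can round differently, which this float64 sketch does not measure.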