Add doc about `attention_mask` on gpt2 (#16829)

* Add doc about `attention_mask` on gpt2

  Add a simple sentence describing how `attention_mask` needs to be constructed when `past_key_values` is used.

* Add doc about attention_mask on gpt2_tf
* clean up style
* remove empty line white spaces
* remove whitespace in empty line

Commit 74814574, authored Apr 19, 2022 by wiio12, committed by GitHub Apr 19, 2022
Showing with 8 additions and 0 deletions

src/transformers/models/gpt2/modeling_gpt2.py +4 -0

src/transformers/models/gpt2/modeling_tf_gpt2.py +4 -0

--- a/src/transformers/models/gpt2/modeling_gpt2.py
+++ b/src/transformers/models/gpt2/modeling_gpt2.py
@@ -565,6 +565,10 @@ GPT2_INPUTS_DOCSTRING = r"""
            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.
+            If `past_key_values` is used, `attention_mask` needs to contain the masking strategy that was used for
+            `past_key_values`. In other words, the `attention_mask` always has to have the length:
+            `len(past_key_values) + len(input_ids)`
            [What are attention masks?](../glossary#attention-mask)
        token_type_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,

--- a/src/transformers/models/gpt2/modeling_tf_gpt2.py
+++ b/src/transformers/models/gpt2/modeling_tf_gpt2.py
@@ -655,6 +655,10 @@ GPT2_INPUTS_DOCSTRING = r"""
            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.
+            If `past_key_values` is used, `attention_mask` needs to contain the masking strategy that was used for
+            `past_key_values`. In other words, the `attention_mask` always has to have the length:
+            `len(past_key_values) + len(input_ids)`
            [What are attention masks?](../glossary#attention-mask)
        token_type_ids (`tf.Tensor` or `Numpy array` of shape `(batch_size, sequence_length)`, *optional*):
            Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
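
The length rule in the added docstring can be illustrated with a small sketch. It reads `len(past_key_values)` as the number of cached token positions, which is an interpretation, since the docstring does not define it. The helper `extend_attention_mask` is hypothetical and not part of `transformers`; masks are plain nested lists so the example runs without any dependencies.

```python
from typing import List


def extend_attention_mask(
    past_mask: List[List[int]], new_mask: List[List[int]]
) -> List[List[int]]:
    """Hypothetical helper: append the mask for the new tokens to the mask
    that was used when the cached keys/values were computed."""
    if len(past_mask) != len(new_mask):
        raise ValueError("past_mask and new_mask must have the same batch size")
    # Row-wise concatenation along the sequence dimension.
    return [past_row + new_row for past_row, new_row in zip(past_mask, new_mask)]


# Batch of two sequences with 3 cached positions each.
# The second sequence was left-padded by one token in the first forward pass,
# so its first cached position stays masked (0) in every later step.
past_mask = [[1, 1, 1], [0, 1, 1]]
# One new token per sequence in the current step.
new_mask = [[1], [1]]

full_mask = extend_attention_mask(past_mask, new_mask)
print(full_mask)  # [[1, 1, 1, 1], [0, 1, 1, 1]]
```

Each row of `full_mask` has length 3 + 1, that is, cached positions plus new `input_ids`. With tensors, the same step would be a concatenation along the sequence dimension.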