Unverified Commit 9763f829 authored by Arthur's avatar Arthur Committed by GitHub

Fix whisper and speech to text doc (#20595)

* Fix whisper and speech to text doc
# What does this PR do?
Previously, the documentation was badly indented for both models and indicated that
> If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value of `inputs_embeds`.

which is only valid for the forward pass of `ForConditionalGeneration`, not for the model alone.

* other fixes
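The distinction the commit message draws can be sketched with toy classes (hypothetical names, not the actual transformers implementation): only the `ForConditionalGeneration`-style head falls back from decoder inputs to `inputs_embeds`, while the bare model requires one of the decoder inputs explicitly.

```python
# Toy sketch of the documented behavior; names are illustrative only.

class ToyModel:
    """Stands in for the bare encoder-decoder model: no fallback."""

    def forward(self, decoder_input_ids=None, decoder_inputs_embeds=None):
        if decoder_input_ids is None and decoder_inputs_embeds is None:
            raise ValueError(
                "You have to specify either decoder_input_ids or decoder_inputs_embeds"
            )
        return decoder_inputs_embeds if decoder_inputs_embeds is not None else decoder_input_ids


class ToyForConditionalGeneration:
    """Stands in for the generation head: falls back to inputs_embeds."""

    def __init__(self):
        self.model = ToyModel()

    def forward(self, inputs_embeds=None, decoder_input_ids=None, decoder_inputs_embeds=None):
        # The fallback removed from the base-model docstrings lives here:
        # if both decoder inputs are unset, reuse inputs_embeds.
        if decoder_input_ids is None and decoder_inputs_embeds is None:
            decoder_inputs_embeds = inputs_embeds
        return self.model.forward(
            decoder_input_ids=decoder_input_ids,
            decoder_inputs_embeds=decoder_inputs_embeds,
        )
```

Calling the head with only `inputs_embeds` succeeds, while the bare model raises, which is exactly why the sentence had to move out of the base-model docstrings.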
parent 4430b912
@@ -663,15 +663,12 @@ SPEECH_TO_TEXT_INPUTS_DOCSTRING = r"""
             If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
             don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
-            `decoder_input_ids` of shape `(batch_size, sequence_length)`. decoder_inputs_embeds (`torch.FloatTensor` of
-            shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing
-            `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is
-            used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`). This is
-            useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
-            of `inputs_embeds`.
+            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+        decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
+            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
+            input (see `past_key_values`). This is useful if you want more control over how to convert
+            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
         use_cache (`bool`, *optional*):
             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
             `past_key_values`).
...
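The `decoder_inputs_embeds` escape hatch the docstring describes can be illustrated with a toy embedding step (hypothetical names, not the transformers API): caller-supplied vectors bypass the model's embedding-matrix lookup entirely.

```python
# Stand-in for the model's internal embedding lookup matrix (toy values).
EMBED = {7: [0.1, 0.2], 42: [0.3, 0.4]}

def embed_decoder_inputs(decoder_input_ids=None, decoder_inputs_embeds=None):
    """Mimics the documented precedence: explicit embeddings win over ids."""
    if decoder_inputs_embeds is not None:
        # Caller controls the vectors directly; no lookup is performed.
        return decoder_inputs_embeds
    # Default path: convert ids into vectors via the embedding matrix.
    return [EMBED[i] for i in decoder_input_ids]
```

This is the "more control over how to convert `decoder_input_ids` indices into associated vectors" case: any vectors, even ones outside the embedding matrix, can be fed to the decoder.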
@@ -673,8 +673,9 @@ SPEECH_TO_TEXT_INPUTS_DOCSTRING = r"""
             [What are decoder input IDs?](../glossary#decoder-input-ids)
-            Bart uses the `eos_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values`
-            is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`).
+            SpeechToText uses the `eos_token_id` as the starting token for `decoder_input_ids` generation. If
+            `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
+            `past_key_values`).
             For translation and summarization training, `decoder_input_ids` should be provided. If no
             `decoder_input_ids` is provided, the model will create this tensor by shifting the `input_ids` to the right
@@ -707,6 +708,14 @@ SPEECH_TO_TEXT_INPUTS_DOCSTRING = r"""
             If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
             don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
             `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+        decoder_inputs_embeds (`tf.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
+            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
+            input (see `past_key_values`). This is useful if you want more control over how to convert
+            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
         output_attentions (`bool`, *optional*):
             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
             tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the
...
@@ -565,15 +565,12 @@ WHISPER_INPUTS_DOCSTRING = r"""
             If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
             don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
-            `decoder_input_ids` of shape `(batch_size, sequence_length)`. decoder_inputs_embeds (`tf.Tensor` of shape
-            `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing
-            `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is
-            used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`). This is
-            useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
-            of `inputs_embeds`.
+            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+        decoder_inputs_embeds (`tf.Tensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
+            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
+            input (see `past_key_values`). This is useful if you want more control over how to convert
+            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
         use_cache (`bool`, *optional*):
             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
             `past_key_values`).
...
@@ -546,15 +546,12 @@ WHISPER_INPUTS_DOCSTRING = r"""
             If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
             don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
-            `decoder_input_ids` of shape `(batch_size, sequence_length)`. decoder_inputs_embeds (`torch.FloatTensor` of
-            shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing
-            `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is
-            used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`). This is
-            useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
-            of `inputs_embeds`.
+            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+        decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
+            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
+            input (see `past_key_values`). This is useful if you want more control over how to convert
+            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
         use_cache (`bool`, *optional*):
             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
             `past_key_values`).
...
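The `past_key_values` pattern these docstrings keep referring to (feed only the last decoder token once a cache exists) can be sketched with a toy decoder whose "state" is just the list of processed ids; the function name is hypothetical, not the transformers API.

```python
def run_decoder(tokens, past=None, use_cache=True):
    """Toy decoder step: returns (full state, cache for the next call)."""
    past = list(past) if past is not None else []
    state = past + list(tokens)          # process only the newly supplied tokens
    cache = state if use_cache else None  # returned like past_key_values
    return state, cache

# Feeding the whole prefix at once...
full_state, _ = run_decoder([5, 6, 7], use_cache=False)

# ...is equivalent to caching a prefix and then passing only the last token.
_, cache = run_decoder([5, 6])
inc_state, _ = run_decoder([7], past=cache)
```

The equivalence of the two paths is why the docstrings say the user "can optionally input only the last `decoder_input_ids`" when `past_key_values` are supplied.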