Unverified Commit 5547b40b authored by Lysandre Debut, committed by GitHub

labels and decoder_input_ids to Glossary (#7906)



* labels and decoder_input_ids to Glossary

* Formatting fixes

* Update docs/source/glossary.rst
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* sam's comments
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
parent f3312515
@@ -218,6 +218,52 @@ positional embeddings.
Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
.. _labels:

Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
should be the expected prediction of the model: it will use a standard loss function to compute the loss between its
predictions and the expected values (the labels).

The expected labels differ according to the model head, for example (a short sketch follows this list):
- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
entire sequence.
- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token.
- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects
a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
individual token: the labels are the token IDs of the masked tokens, with values to be ignored for the rest (usually
-100).
- For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`,
:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequence associated with each
input sequence. During training, both `BART` and `T5` will create the appropriate `decoder_input_ids` and decoder
attention masks internally, so they usually do not need to be supplied. This does not apply to models leveraging the
Encoder-Decoder framework.
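
As a minimal sketch (the checkpoint name here is illustrative, not part of the original glossary entry), passing
:obj:`labels` to a sequence classification head makes the model return a loss:

.. code-block:: python

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    inputs = tokenizer("This movie was great!", return_tensors="pt")
    labels = torch.tensor([1])  # shape (batch_size,): one class index per sequence

    # Passing ``labels`` makes the model compute the loss itself
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss  # on older versions, the loss is the first element of the returned tuple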
See the documentation of each model for more information on its specific labels.

The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as they are the base transformer
models, simply outputting features.
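
For instance (again a minimal sketch with an illustrative checkpoint), a base model has no :obj:`labels` argument and
only returns hidden states:

.. code-block:: python

    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Hello world!", return_tensors="pt")
    outputs = model(**inputs)  # no ``labels`` argument: the base model only outputs features
    hidden_states = outputs.last_hidden_state  # shape (batch_size, seq_length, hidden_size)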
.. _decoder-input-ids:

Decoder input IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder.
These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually
built in a way specific to each model.
Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`.
In such models, passing the :obj:`labels` is the preferred way to handle training.
Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
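
As a sketch of this pattern (the checkpoint name is illustrative), passing :obj:`labels` to BART lets the model create
the :obj:`decoder_input_ids` internally:

.. code-block:: python

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    src = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="pt")
    tgt = tokenizer("My friends are cool.", return_tensors="pt")

    # No ``decoder_input_ids`` passed: the model builds them from ``labels``
    # by shifting the label sequence one position to the right
    outputs = model(input_ids=src.input_ids, labels=tgt.input_ids)
    loss = outputs.loss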
.. _feed-forward-chunking:

Feed Forward Chunking
...