Unverified Commit e2c935f5 authored by Sylvain Gugger, committed by GitHub

Cleanup documentation for BART, Marian, MBART and Pegasus (#7523)

* Cleanup documentation for BART, Marian, MBART and Pegasus

parent 5e941bec
BART
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Bart model was proposed in `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension <https://arxiv.org/abs/1910.13461>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It
matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
of up to 6 ROUGE.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer`
or :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
string you pass to :func:`fairseq.encode` starts with a space.
- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like
summarization; see the example in that docstring and the sketch after this list.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
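As a quick illustration of these notes, here is a minimal summarization sketch. It uses the
``facebook/bart-large-cnn`` checkpoint mentioned above; the generation parameters are only illustrative, not
values prescribed by the library.

.. code-block:: python

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    ARTICLE = "PG&E scheduled the blackouts in response to forecasts for high winds."
    # BartTokenizer adds the special tokens itself and produces no token_type_ids.
    inputs = tokenizer([ARTICLE], max_length=1024, truncation=True, return_tensors="pt")

    # generate() runs beam search; use_cache only matters for generation, not for a plain training forward pass.
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))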
BartConfig
@@ -59,6 +67,13 @@ BartModel
.. autofunction:: transformers.modeling_bart._prepare_bart_decoder_inputs
BartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BartForConditionalGeneration
:members: forward
BartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -71,5 +86,3 @@ BartForQuestionAnswering
.. autoclass:: transformers.BartForQuestionAnswering
:members: forward
MarianMT
-----------------------------------------------------------------------------------------------------------------------
**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.
Translations should be similar, but not identical to, output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Each model is about 298 MB on disk; there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by
`Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the
`Marian <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications (the sketch after this list shows how to inspect them on a pretrained checkpoint):
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
- a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
:obj:`</s>`).
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
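A quick way to check the flags mentioned above on a pretrained checkpoint (the checkpoint name is just an
example):

.. code-block:: python

    from transformers import MarianConfig

    # Load the config of one OPUS-MT checkpoint and inspect the Marian-specific flags.
    config = MarianConfig.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    print(config.static_position_embeddings, config.add_bias_logits, config.normalize_embedding)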
Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling
"language code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
looking at the model card, or the Group Members `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
- If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
prepending the desired output language to the :obj:`src_text`.
- You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``.
Example of translating English to many Romance languages, using language codes:
@@ -54,12 +69,20 @@ Example of translating english to many romance languages, using language codes:
# 'Isto deve ir para o português.',
# 'Y esto al español']
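The body of that example is collapsed above; a minimal sketch along the same lines, assuming the
``Helsinki-NLP/opus-mt-en-ROMANCE`` checkpoint and the ``prepare_seq2seq_batch`` helper of this version of the
library, would be:

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer

    # The >>lang<< prefix on each source sentence selects the output language.
    src_text = [
        ">>fr<< this is a sentence in english that we want to translate to french",
        ">>pt<< This should go to portuguese",
        ">>es<< And this to Spanish",
    ]

    model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)

    model = MarianMTModel.from_pretrained(model_name)
    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    # The last two decoded strings correspond to the comment lines shown above.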
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
codes.
There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), that
do not seem to change translations. I have not found these to provide different results than just using :obj:`>>es<<`.
For example:
- `Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special
language code like :obj:`>>de<<` to specify output language.
- `Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many Romance languages to English, no codes needed since there
is only one target language.
@@ -86,13 +109,6 @@ Code to see available pretrained models:
suffix = [x.split('/')[1] for x in model_ids]
multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
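The first lines of this snippet are collapsed in the diff; a self-contained sketch, assuming the
``transformers.hf_api.HfApi`` helper available in this version of the library, could look like:

.. code-block:: python

    from transformers.hf_api import HfApi

    model_list = HfApi().model_list()  # objects exposing a ``modelId`` attribute
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    # Multilingual checkpoints have upper-case language groups in their names.
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]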
MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -107,5 +123,7 @@ MarianTokenizer
:members: prepare_seq2seq_batch
MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianMTModel
MBart
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__.
@@ -18,10 +23,11 @@ Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.
As the model is multilingual, it expects the sequences in a different format. A special language id token
is added in both the source and target text. The source text format is :obj:`X [eos, src_lang_code]`
where :obj:`X` is the source text. The target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
the sequences for sequence-to-sequence fine-tuning.
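For instance, encoding one English/Romanian pair for fine-tuning might look like the sketch below (the keyword
arguments assume the ``prepare_seq2seq_batch`` signature of this version of the library):

.. code-block:: python

    from transformers import MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    src_text = "UN Chief Says There Is No Military Solution in Syria"
    tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

    # The tokenizer appends the language id tokens described above to both the
    # source and the target and returns a batch ready for fine-tuning.
    batch = tokenizer.prepare_seq2seq_batch(
        [src_text], src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=[tgt_text], return_tensors="pt"
    )
    print(batch.keys())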
- Supervised training
@@ -38,8 +44,8 @@ the sequences for seq-2-seq fine-tuning.
- Generation
While generating the target text, set the :obj:`decoder_start_token_id` to the target language id.
The following example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
.. code-block::
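    # The example body is collapsed in this diff; this is only a sketch of the idea,
    # assuming the facebook/mbart-large-en-ro checkpoint named above.
    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

    article = "UN Chief Says There Is No Military Solution in Syria"
    batch = tokenizer.prepare_seq2seq_batch([article], src_lang="en_XX", return_tensors="pt")

    # Start generation with the Romanian language id, as explained above.
    translated = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
    print(tokenizer.batch_decode(translated, skip_special_tokens=True))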
@@ -71,6 +77,4 @@ MBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForConditionalGeneration
:members: forward
Pegasus
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and
Peter J. Liu on Dec 18, 2019.
According to the abstract,
- Pegasus' pretraining task is intentionally similar to summarization: important sentences are removed/masked from an
input document and are generated together as one output sequence from the remaining sentences, similar to an
extractive summary.
- Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.
The Authors' code can be found `here <https://github.com/google-research/pegasus>`__.
Checkpoints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, besides
`pegasus-large`, from which the other checkpoints are fine-tuned:
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
- Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
- For XSUM, the paper reports ROUGE-1/ROUGE-2/ROUGE-L scores of 47.21/24.56/39.25. As of Aug 9, this port scores
46.91/24.34/39.1.
The gap is likely because of different alpha/length_penalty implementations in beam search.
@@ -32,14 +42,16 @@ Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix.
- more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
input size), :obj:`max_length` (the maximum number of tokens to generate) and :obj:`length_penalty`.
- The code to convert checkpoints trained in the author's `repo <https://github.com/google-research/pegasus>`_ can be
found in ``convert_pegasus_tf_to_pytorch.py``.
Usage Example
@@ -62,48 +74,12 @@ Usage Example
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
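The head of this example is collapsed in the diff; a self-contained sketch ending with the two lines shown above
would look roughly like the following (the short source text is a placeholder, so the exact summary in the assert
corresponds to the original, longer input):

.. code-block:: python

    import torch
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    model_name = "google/pegasus-xsum"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    src_text = [
        "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions."
    ]

    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

    batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)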
PegasusConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusConfig
PegasusTokenizer
@@ -114,4 +90,7 @@ warning: ``add_tokens`` does not work at the moment.
:members: __call__, prepare_seq2seq_batch
PegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusForConditionalGeneration
@@ -15,7 +15,6 @@
""" BART configuration """
from .configuration_utils import PretrainedConfig
from .file_utils import add_start_docstrings_to_callable
from .utils import logging
@@ -31,78 +30,83 @@ BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"yjernite/bart_eli5": "https://s3.amazonaws.com/models.huggingface.co/bert/yjernite/bart_eli5/config.json",
}
class BartConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to
instantiate a BART model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the BART model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BartModel`.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 6 are used for the `bart-base` model.
decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 6 are used for the `bart-base` model.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a learned bias to the final logits (used by Marian).
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a final layer norm after the last block of the encoder and decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by dividing by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`.
num_labels: (:obj:`int`, `optional`, defaults to 3):
The number of labels to use in :class:`~transformers.BartForSequenceClassification`.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``),
only :obj:`True` for `bart-large-cnn`.
"""
model_type = "bart"
@@ -23,4 +23,78 @@ PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MarianConfig(BartConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.MarianMTModel`. It is used to
instantiate a Marian model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 58101):
Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.MarianMTModel`.
d_model (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 6):
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 6):
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a learned bias to the final logits (used by Marian).
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`True`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a final layer norm after the last block of the encoder and decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by dividing by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``).
"""
model_type = "marian"
@@ -27,5 +27,79 @@ MBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MBartConfig(BartConfig):
"""
This is the configuration class to store the configuration of a
:class:`~transformers.MBartForConditionalGeneration`. It is used to
instantiate a BART model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 250027):
Vocabulary size of the MBART model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.MBartForConditionalGeneration`.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a learned bias to the final logits (used by Marian).
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add a final layer norm after the last block of the encoder and decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by dividing by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be equal to :obj:`pad_token_id+1`.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``).
"""
model_type = "mbart"
"""See real config values at https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/config.json."""
@@ -14,8 +14,7 @@
# limitations under the License.
""" PEGASUS model configuration """
from .configuration_bart import BartConfig
from .utils import logging
@@ -66,11 +65,81 @@ task_specific_params = {
}
class PegasusConfig(BartConfig):
r"""
This is the configuration class to store the configuration of a
:class:`~transformers.PegasusForConditionalGeneration`. It is used to
instantiate a Pegasus model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 96103):
Vocabulary size of the Pegasus model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.PegasusForConditionalGeneration`.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 16):
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 16):
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to add a learned bias to the final logits (used by Marian).
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`True`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add a final layer norm after the last block of the encoder and decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Scale embeddings by dividing by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper
<https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2):
How many extra learned positional embeddings to use. Should be pad_token_id+1 for bart.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``).
"""
model_type = "pegasus"
# The implementation of the config object is in BartConfig
@@ -64,8 +64,13 @@ BART_PRETRAINED_MODEL_ARCHIVE_LIST = [
BART_START_DOCSTRING = r"""
This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
usage and behavior.
Parameters:
config (:class:`~transformers.BartConfig`): Model configuration class with all the parameters of the model.
@@ -73,6 +78,7 @@ BART_START_DOCSTRING = r"""
Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
"""
BART_GENERATION_EXAMPLE = r"""
Summarization example::
@@ -94,39 +100,54 @@ BART_GENERATION_EXAMPLE = r"""
BART_INPUTS_DOCSTRING = r"""
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using :class:`~transformers.BartTokenizer`.
See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by
shifting the :obj:`input_ids` to the right, following the paper.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default.
If you want to change padding behavior, you should read :func:`modeling_bart._prepare_decoder_inputs` and
modify to your needs. See diagram 1 in `the paper <https://arxiv.org/abs/1910.13461>`__ for more
information on the default strategy.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
:obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a
sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of
the decoder.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last
``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
:obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`):
Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`):
Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
"""
@@ -23,11 +23,12 @@ from .modeling_bart import BartForConditionalGeneration
class MarianMTModel(BartForConditionalGeneration):
r"""
Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
Model API is identical to BartForConditionalGeneration.
Available models are listed `here <https://huggingface.co/models?search=Helsinki-NLP>`__.
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the
superclass for the appropriate documentation alongside usage examples.
Examples::
......@@ -45,6 +46,7 @@ class MarianMTModel(BartForConditionalGeneration):
>>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) # returns "Where is the bus stop ?"
"""
config_class = MarianConfig
def adjust_logits_during_generation(self, logits, cur_len, max_length):
logits[:, self.config.pad_token_id] = float("-inf") # never predict pad token.
from .configuration_mbart import MBartConfig
from .modeling_bart import BartForConditionalGeneration
@@ -12,23 +11,7 @@ MBART_PRETRAINED_MODEL_ARCHIVE_LIST = [
# See all multilingual BART models at https://huggingface.co/models?filter=mbart
]
class MBartForConditionalGeneration(BartForConditionalGeneration):
r"""
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the
@@ -22,18 +22,12 @@ from .modeling_bart import BART_START_DOCSTRING, BartForConditionalGeneration
@add_start_docstrings("The Pegasus Model for summarization ", BART_START_DOCSTRING)
class PegasusForConditionalGeneration(BartForConditionalGeneration):
r"""
Pytorch version of google's pegasus model for summarization.
Model API is identical to BartForConditionalGeneration.
Available models are listed `here <https://huggingface.co/models?search=pegasus>`__.
This class overrides :class:`~transformers.BartForConditionalGeneration`. Please check the
superclass for the appropriate documentation alongside usage examples.
Examples::
@@ -51,3 +45,11 @@ class PegasusForConditionalGeneration(BartForConditionalGeneration):
"""
# All the code is in src/transformers/modeling_bart.py
config_class = PegasusConfig
authorized_missing_keys = [
r"final_logits_bias",
r"encoder\.version",
r"decoder\.version",
r"model.encoder.embed_positions",
"model.decoder.embed_positions",
]
@@ -38,6 +38,15 @@ _all_bart_models = [
class BartTokenizer(RobertaTokenizer):
r"""
Construct a BART tokenizer.
:class:`~transformers.BartTokenizer` is identical to :class:`~transformers.RobertaTokenizer` and adds a new
:meth:`~transformers.BartTokenizer.prepare_seq2seq_batch` method.
Refer to superclass :class:`~transformers.RobertaTokenizer` for usage examples and documentation concerning
the initialization parameters and other methods.
"""
# merges and vocab same as Roberta
max_model_input_sizes = {m: 1024 for m in _all_bart_models}
pretrained_vocab_files_map = {
......
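The added ``prepare_seq2seq_batch`` is the new piece here; a hedged sketch from the editor (the key name for the encoded targets has varied between library versions)::

    from transformers import BartTokenizer

    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    batch = tok.prepare_seq2seq_batch(
        src_texts=["My friends are cool but they eat too many carbs."],
        tgt_texts=["My friends are cool."],
        return_tensors="pt",
    )
    # batch is a BatchEncoding: encoder inputs live under "input_ids"/"attention_mask";
    # the encoded targets sit under a separate key (e.g. "labels", depending on the version)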
......@@ -22,9 +22,34 @@ vocab_files_names = {
class MarianTokenizer(PreTrainedTokenizer):
"""Sentencepiece tokenizer for marian. Source and target languages have different SPM models.
The logic is use the relevant source_spm or target_spm to encode txt as pieces, then look up each piece in a
vocab dictionary.
r"""
Construct a Marian tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
source_spm (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
contains the vocabulary for the source language.
target_spm (:obj:`str`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
contains the vocabulary for the target language.
source_lang (:obj:`str`, `optional`):
A string representing the source language.
target_lang (:obj:`str`, `optional`):
A string representing the target language.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
The end of sequence token.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
model_max_length (:obj:`int`, `optional`, defaults to 512):
The maximum sentence length the model accepts.
additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<eop>", "<eod>"]`):
Additional special tokens used by the tokenizer.
Examples::
......@@ -165,7 +190,16 @@ class MarianTokenizer(PreTrainedTokenizer):
return len(self.encoder)
def save_vocabulary(self, save_directory: str) -> Tuple[str]:
"""save vocab file to json and copy spm files from their original path."""
"""
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
Args:
save_directory (:obj:`str`):
The directory in which to save the vocabulary.
Returns:
:obj:`Tuple(str)`: Paths to the files saved.
"""
save_dir = Path(save_directory)
assert save_dir.is_dir(), f"{save_directory} should be a directory"
save_json(self.encoder, save_dir / self.vocab_files_names["vocab"])
......
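A small sketch from the editor of the call documented above (the output directory name is hypothetical and, per the assert, the directory must already exist)::

    from pathlib import Path
    from transformers import MarianTokenizer

    tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    out_dir = Path("marian_vocab")                    # hypothetical directory name
    out_dir.mkdir(exist_ok=True)
    saved_files = tok.save_vocabulary(str(out_dir))   # JSON vocab + copied .spm files
    print(saved_files)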
......@@ -58,8 +58,19 @@ FAIRSEQ_LANGUAGE_CODES = [
class MBartTokenizer(XLMRobertaTokenizer):
"""
This inherits from XLMRobertaTokenizer. ``prepare_seq2seq_batch`` should be used to encode inputs.
Other tokenizer methods like ``encode`` do not work properly.
Construct an MBART tokenizer.
:class:`~transformers.MBartTokenizer` is a subclass of :class:`~transformers.XLMRobertaTokenizer` and adds a new
:meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` method.
Refer to superclass :class:`~transformers.XLMRobertaTokenizer` for usage examples and documentation concerning
the initialization parameters and other methods.
.. warning::
``prepare_seq2seq_batch`` should be used to encode inputs. Other tokenizer methods like ``encode`` do not work
properly.
The tokenization method is ``<tokens> <eos> <language code>`` for source language documents, and
``<language code> <tokens> <eos>`` for target language documents.
......@@ -102,16 +113,16 @@ class MBartTokenizer(XLMRobertaTokenizer):
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` methods.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of ids.
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Set to True if the token list is already formatted with special tokens for the model
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
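A sketch from the editor of what the mask looks like for a source-side sequence (which special tokens are marked with 1 depends on the language codes currently configured)::

    from transformers import MBartTokenizer

    tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    ids = tok.convert_tokens_to_ids(tok.tokenize("UN Chief Says There Is No Military Solution"))
    mask = tok.get_special_tokens_mask(ids)
    # ids carries no special tokens, so the 1s mark only the positions that
    # build_inputs_with_special_tokens would add (for a source sequence: the trailing eos + language code)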
......@@ -135,21 +146,23 @@ class MBartTokenizer(XLMRobertaTokenizer):
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
by concatenating and adding special tokens. The special tokens depend on calling set_lang.
by concatenating and adding special tokens.
An MBART sequence has the following format, where ``X`` represents the sequence:
- ``input_ids`` (for encoder): ``X [eos, src_lang_code]``
- ``decoder_input_ids`` (for decoder): ``[tgt_lang_code] X [eos]``
BOS is never used.
Pairs of sequences are not the expected use case, but they will be handled without a separator.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
......
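Given the format described above, a single source sequence simply gains the suffix tokens; a minimal check from the editor, mirroring the return statement shown above::

    from transformers import MBartTokenizer

    tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    ids = tok.convert_tokens_to_ids(tok.tokenize("UN Chief Says There Is No Military Solution"))
    # prefix_tokens + ids + suffix_tokens, exactly as in the method body
    assert tok.build_inputs_with_special_tokens(ids) == tok.prefix_tokens + ids + tok.suffix_tokens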
......@@ -20,6 +20,15 @@ from .tokenization_utils_base import PREPARE_SEQ2SEQ_BATCH_DOCSTRING, BatchEncod
class PegasusTokenizer(ReformerTokenizer):
r"""
Construct a Pegasus tokenizer.
:class:`~transformers.PegasusTokenizer` is identical to :class:`~transformers.ReformerTokenizer` and adds a new
:meth:`~transformers.PegasusTokenizer.prepare_seq2seq_batch` method.
Refer to superclass :class:`~transformers.ReformerTokenizer` for usage examples and documentation concerning
the initialization parameters and other methods.
"""
offset = 103 # entries 2-104 are only used for pretraining
vocab_files_names = {"vocab_file": "spiece.model"}
......@@ -85,18 +94,24 @@ class PegasusTokenizer(ReformerTokenizer):
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> List[int]:
"""
Build model inputs from a sequence by adding eos to the end. No bos token is added to the front.
Build model inputs from a sequence or a pair of sequences for sequence classification tasks
by concatenating and adding special tokens.
A Pegasus sequence has the following format, where ``X`` represents the sequence:
- single sequence: ``X </s>``
- pair of sequences: ``A B </s>`` (not intended use)
BOS is never used.
Pairs of sequences are not the expected use case, but they will be handled without a separator.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return token_ids_0 + [self.eos_token_id]
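As the return statement above shows, a single sequence only gains a trailing eos; a minimal check from the editor::

    from transformers import PegasusTokenizer

    tok = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    ids = tok.convert_tokens_to_ids(tok.tokenize("PG&E scheduled the blackouts."))
    assert tok.build_inputs_with_special_tokens(ids) == ids + [tok.eos_token_id]   # ``X </s>``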
......@@ -115,10 +130,6 @@ class PegasusTokenizer(ReformerTokenizer):
padding="longest",
**unused,
) -> BatchEncoding:
"""
Prepare model inputs for summarization or translation.
"""
if "" in src_texts:
raise ValueError(f"found empty string in src_texts: {src_texts}")
tokenizer_kwargs = dict(
......
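The guard above rejects empty source strings; a sketch from the editor of a normal call (the target text is placeholder only)::

    from transformers import PegasusTokenizer

    tok = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    # tok.prepare_seq2seq_batch([""]) would raise ValueError: found empty string in src_texts
    batch = tok.prepare_seq2seq_batch(
        src_texts=["PG&E stated it scheduled the blackouts in response to forecasts for high winds."],
        tgt_texts=["PG&E scheduled the blackouts."],   # illustrative target only
        return_tensors="pt",
    )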