Unverified Commit 2422cda0 authored by Weizhen, committed by GitHub

ProphetNet (#7157)



* add new model prophetnet

prophetnet modified

modify codes as suggested v1

add prophetnet test files

* still bugs, because of changed output formats of encoder and decoder

* move prophetnet into the latest version

* clean integration tests

* clean tokenizers

* add xlm config to init

* correct typo in init

* further refactoring

* continue refactor

* save parallel

* add decoder_attention_mask

* fix use_cache vs. past_key_values

* fix common tests

* change decoder output logits

* fix xlm tests

* make common tests pass

* change model architecture

* add tokenizer tests

* finalize model structure

* no weight mapping

* correct n-gram stream attention mask as discussed with qweizhen

* remove unused import

* fix index.rst

* fix tests

* delete unnecessary code

* add fast integration test

* rename weights

* final weight remapping

* save intermediate

* Descriptions for Prophetnet Config File

* finish all models

* finish new model outputs

* delete unnecessary files

* refactor encoder layer

* add dummy docs

* code quality

* fix tests

* add model pages to doctree

* further refactor

* more refactor, more tests

* finish code refactor and tests

* remove unnecessary files

* further clean up

* add docstring template

* finish tokenizer doc

* finish prophetnet

* fix copies

* fix typos

* fix tf tests

* fix fp16

* fix tf test 2nd try

* fix code quality

* add test for each model

* merge new tests to branch

* Update model_cards/microsoft/prophetnet-large-uncased-cnndm/README.md
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update model_cards/microsoft/prophetnet-large-uncased-cnndm/README.md
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update src/transformers/modeling_prophetnet.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* Update utils/check_repo.py
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>

* apply sams and sylvains comments

* make style

* remove unnecessary code

* Update README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update README.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/configuration_prophetnet.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* implement lysandres comments

* correct docs

* fix isort

* fix tokenizers

* fix copies
Co-authored-by: weizhen <weizhen@mail.ustc.edu.cn>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 8f8f8d99
...@@ -182,6 +182,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[ProphetNet](https://huggingface.co/transformers/master/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
...@@ -189,6 +190,7 @@ ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/
1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
1. **[XLM-ProphetNet](https://huggingface.co/transformers/master/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
1. **[XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
1. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
......
...@@ -129,32 +129,40 @@ conversion utilities for the following models:
22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
23. `ProphetNet <https://huggingface.co/transformers/master/model_doc/prophetnet.html>`__ (from Microsoft Research)
released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen,
Ruofei Zhang and Ming Zhou.
24. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
25. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. ultilingual BERT into `DistilmBERT
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German version of
DistilBERT.
26. `SqueezeBert <https://huggingface.co/transformers/master/model_doc/squeezebert.html>`__ released with the paper
`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
<https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
27. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
28. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
29. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
30. `XLM-ProphetNet <https://huggingface.co/transformers/master/model_doc/xlmprophetnet.html>`__ (from Microsoft
Research) released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
<https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen,
Ruofei Zhang and Ming Zhou.
31. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
32. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
33. `Other community models <https://huggingface.co/models>`__, contributed by the `community
<https://huggingface.co/users>`__.
.. toctree::
...@@ -245,6 +253,7 @@ conversion utilities for the following models:
model_doc/gpt
model_doc/gpt2
model_doc/pegasus
model_doc/prophetnet
model_doc/rag
model_doc/reformer
model_doc/retribert
...@@ -253,6 +262,7 @@ conversion utilities for the following models:
model_doc/t5
model_doc/transformerxl
model_doc/xlm
model_doc/xlmprophetnet
model_doc/xlmroberta
model_doc/xlnet
......
...@@ -27,4 +27,4 @@ EncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.EncoderDecoderModel
:members: forward, from_encoder_decoder_pretrained
ProphetNet
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@patrickvonplaten
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
ProphetNet is an encoder-decoder model that can predict n future tokens for "ngram" language modeling instead of just the next token.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
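A minimal usage sketch is shown below (assuming the ``microsoft/prophetnet-large-uncased`` checkpoint; the example text is illustrative, and the output attribute names are those of the ProphetNet-specific outputs documented further down):

.. code-block:: python

    from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

    tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
    model = ProphetNetForConditionalGeneration.from_pretrained("microsoft/prophetnet-large-uncased")

    inputs = tokenizer("ProphetNet predicts future n-grams.", return_tensors="pt")
    decoder_inputs = tokenizer("prophetnet predicts", return_tensors="pt")

    outputs = model(
        input_ids=inputs["input_ids"],
        decoder_input_ids=decoder_inputs["input_ids"],
        return_dict=True,
    )
    next_token_logits = outputs.logits   # main stream: standard next-token predictions
    ngram_logits = outputs.logits_ngram  # predicting stream(s): the additional future tokens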
ProphetNetConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetConfig
:members:
ProphetNetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetTokenizer
:members:
ProphetNet specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqLMOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqModelOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderModelOutput
:members:
.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderLMOutput
:members:
ProphetNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetModel
:members: forward
ProphetNetEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetEncoder
:members: forward
ProphetNetDecoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetDecoder
:members: forward
ProphetNetForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetForConditionalGeneration
:members: forward
ProphetNetForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.ProphetNetForCausalLM
:members: forward
XLM-ProphetNet
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@patrickvonplaten
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
XLM-ProphetNet is an encoder-decoder model that can predict n future tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multilingual "wiki100" Wikipedia dump.
The abstract from the paper is the following:
*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
The authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
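A minimal usage sketch (assuming the pretrained ``microsoft/xprophetnet-large-wiki100-cased`` checkpoint; the input sentence is illustrative, and the un-finetuned checkpoint is not expected to produce polished generations):

.. code-block:: python

    from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration

    tokenizer = XLMProphetNetTokenizer.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")
    model = XLMProphetNetForConditionalGeneration.from_pretrained("microsoft/xprophetnet-large-wiki100-cased")

    inputs = tokenizer(["Microsoft was founded by Bill Gates and Paul Allen."], return_tensors="pt")
    # beam-search generation, mirroring the fine-tuned xGLUE model cards in this PR
    generated = model.generate(inputs["input_ids"], num_beams=4, max_length=50, early_stopping=True)
    print(tokenizer.batch_decode(generated.tolist()))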
XLMProphetNetConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetConfig
:members:
XLMProphetNetTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetTokenizer
:members:
XLMProphetNetModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetModel
XLMProphetNetEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetEncoder
XLMProphetNetDecoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetDecoder
XLMProphetNetForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetForConditionalGeneration
XLMProphetNetForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.XLMProphetNetForCausalLM
## prophetnet-large-uncased-cnndm
Fine-tuned weights (converted from the [original fairseq version repo](https://github.com/microsoft/ProphetNet)) for [ProphetNet](https://arxiv.org/abs/2001.04063) on the CNN/DailyMail summarization task.
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.
ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the fairseq version at the [GitHub repo](https://github.com/microsoft/ProphetNet).
### Usage
```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/prophetnet-large-uncased-cnndm')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/prophetnet-large-uncased-cnndm')
ARTICLE_TO_SUMMARIZE = "USTC was founded in Beijing by the Chinese Academy of Sciences (CAS) in September 1958. The Director of CAS, Mr. Guo Moruo was appointed the first president of USTC. USTC's founding mission was to develop a high-level science and technology workforce, as deemed critical for development of China's economy, defense, and science and technology education. The establishment was hailed as \"A Major Event in the History of Chinese Education and Science.\" CAS has supported USTC by combining most of its institutes with the departments of the university. USTC is listed in the top 16 national key universities, becoming the youngest national key university.".lower()
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=100, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=512, early_stopping=True)
tokenizer.batch_decode(summary_ids.tolist())
```
Here, [X_SEP] is used as a special token to separate sentences.
### Citation
```bibtex
@article{yan2020prophetnet,
title={Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training},
author={Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
journal={arXiv preprint arXiv:2001.04063},
year={2020}
}
```
## prophetnet-large-uncased
Pretrained weights for [ProphetNet](https://arxiv.org/abs/2001.04063).
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.
ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the fairseq version at the [GitHub repo](https://github.com/microsoft/ProphetNet).
### Usage
Please see [the official repository](https://github.com/microsoft/ProphetNet) for details.
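With the `transformers` classes added in this PR, the pretrained weights can presumably also be loaded directly. A minimal sketch (the input text is illustrative; the un-finetuned checkpoint is mainly a starting point for fine-tuning on a downstream sequence-to-sequence task):

```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration

tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/prophetnet-large-uncased')
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/prophetnet-large-uncased')

# forward pass only; fine-tune on a summarization or generation task before relying on generate()
inputs = tokenizer("prophetnet predicts future n-grams .", return_tensors='pt')
outputs = model(input_ids=inputs['input_ids'], decoder_input_ids=inputs['input_ids'], return_dict=True)
print(outputs.logits.shape)
```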
### Citation
```bibtex
@article{yan2020prophetnet,
title={Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training},
author={Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
journal={arXiv preprint arXiv:2001.04063},
year={2020}
}
```
## xprophetnet-large-wiki100-cased-xglue-ntg
Cross-lingual version of [ProphetNet](https://arxiv.org/abs/2001.04063), pretrained on the [wiki100 xGLUE dataset](https://arxiv.org/abs/2004.01401) and fine-tuned on the xGLUE cross-lingual News Title Generation (NTG) task.
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.
ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the fairseq version at the [GitHub repo](https://github.com/microsoft/ProphetNet).
xProphetNet also serves as the baseline model for the xGLUE cross-lingual natural language generation tasks.
For the xGLUE cross-lingual NLG tasks, xProphetNet is fine-tuned on English data, but inference is run on both English and other languages in a zero-shot fashion.
### Usage
A quick usage example:
```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-ntg')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-ntg')
EN_SENTENCE = "Microsoft Corporation intends to officially end free support for the Windows 7 operating system after January 14, 2020, according to the official portal of the organization. From that day, users of this system will not be able to receive security updates, which could make their computers vulnerable to cyber attacks."
RU_SENTENCE = "Корпорация Microsoft намерена официально прекратить бесплатную поддержку операционной системы Windows 7 после 14 января 2020 года, сообщается на официальном портале организации . С указанного дня пользователи этой системы не смогут получать обновления безопасности, из-за чего их компьютеры могут стать уязвимыми к кибератакам."
ZH_SENTENCE = "根据该组织的官方门户网站,微软公司打算在2020年1月14日之后正式终止对Windows 7操作系统的免费支持。从那时起,该系统的用户将无法接收安全更新,这可能会使他们的计算机容易受到网络攻击。"
inputs = tokenizer([EN_SENTENCE, RU_SENTENCE, ZH_SENTENCE], padding=True, max_length=256, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
print([tokenizer.decode(g) for g in summary_ids])
```
### Citation
```bibtex
@article{yan2020prophetnet,
title={Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training},
author={Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
journal={arXiv preprint arXiv:2001.04063},
year={2020}
}
```
## xprophetnet-large-wiki100-cased-xglue-qg
Cross-lingual version of [ProphetNet](https://arxiv.org/abs/2001.04063), pretrained on the [wiki100 xGLUE dataset](https://arxiv.org/abs/2004.01401) and fine-tuned on the xGLUE cross-lingual Question Generation (QG) task.
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.
ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the fairseq version at the [GitHub repo](https://github.com/microsoft/ProphetNet).
xProphetNet also serves as the baseline model for the xGLUE cross-lingual natural language generation tasks.
For the xGLUE cross-lingual NLG tasks, xProphetNet is fine-tuned on English data, but inference is run on both English and other languages in a zero-shot fashion.
### Usage
A quick usage example:
```python
from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, ProphetNetConfig
model = ProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
tokenizer = ProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased-xglue-qg')
EN_SENTENCE = "Google left China in 2010"
ZH_SENTENCE = "Google在2010年离开中国"
inputs = tokenizer([EN_SENTENCE, ZH_SENTENCE], padding=True, max_length=256, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
print([tokenizer.decode(g) for g in summary_ids])
```
### Citation
```bibtex
@article{yan2020prophetnet,
title={Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training},
author={Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
journal={arXiv preprint arXiv:2001.04063},
year={2020}
}
```
## xprophetnet-large-wiki100-cased
Cross-lingual version of [ProphetNet](https://arxiv.org/abs/2001.04063), pretrained on the [wiki100 xGLUE dataset](https://arxiv.org/abs/2004.01401).
ProphetNet is a new pre-trained language model for sequence-to-sequence learning with a novel self-supervised objective called future n-gram prediction.
ProphetNet is able to predict more future tokens with an n-stream decoder. The original implementation is the fairseq version at the [GitHub repo](https://github.com/microsoft/ProphetNet).
xProphetNet also serves as the baseline model for the xGLUE cross-lingual natural language generation tasks.
For the xGLUE cross-lingual NLG tasks, xProphetNet is fine-tuned on English data, but inference is run on both English and other languages in a zero-shot fashion.
### Usage
Please see [the official repository](https://github.com/microsoft/ProphetNet/tree/master/xProphetNet) for details.
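With the XLM-ProphetNet classes added in this PR, the checkpoint can presumably also be loaded directly. A minimal sketch (the class names are the XLM variants from this PR; the input sentence is illustrative, and the pretrained-only weights should be fine-tuned, e.g. on an xGLUE NLG task, before expecting useful generations):

```python
from transformers import XLMProphetNetTokenizer, XLMProphetNetForConditionalGeneration

tokenizer = XLMProphetNetTokenizer.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')
model = XLMProphetNetForConditionalGeneration.from_pretrained('microsoft/xprophetnet-large-wiki100-cased')

# forward pass with the pretrained-only weights
inputs = tokenizer(["Google left China in 2010"], padding=True, return_tensors='pt')
outputs = model(input_ids=inputs['input_ids'], decoder_input_ids=inputs['input_ids'], return_dict=True)
print(outputs.logits.shape)
```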
### Citation
```bibtex
@article{yan2020prophetnet,
title={Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training},
author={Yan, Yu and Qi, Weizhen and Gong, Yeyun and Liu, Dayiheng and Duan, Nan and Chen, Jiusheng and Zhang, Ruofei and Zhou, Ming},
journal={arXiv preprint arXiv:2001.04063},
year={2020}
}
```
...@@ -54,6 +54,7 @@ from .configuration_mmbt import MMBTConfig
from .configuration_mobilebert import MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileBertConfig
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from .configuration_pegasus import PegasusConfig
from .configuration_prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig
from .configuration_rag import RagConfig
from .configuration_reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig
from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
...@@ -63,6 +64,7 @@ from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from .configuration_utils import PretrainedConfig
from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
from .configuration_xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
from .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
from .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
from .data import (
...@@ -183,6 +185,7 @@ from .tokenization_lxmert import LxmertTokenizer
from .tokenization_mobilebert import MobileBertTokenizer
from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_prophetnet import ProphetNetTokenizer
from .tokenization_rag import RagTokenizer
from .tokenization_retribert import RetriBertTokenizer
from .tokenization_roberta import RobertaTokenizer
...@@ -210,6 +213,7 @@ if is_sentencepiece_available():
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_t5 import T5Tokenizer
from .tokenization_xlm_prophetnet import XLMProphetNetTokenizer
from .tokenization_xlm_roberta import XLMRobertaTokenizer
from .tokenization_xlnet import XLNetTokenizer
else:
...@@ -251,7 +255,6 @@ if is_tokenizers_available():
else:
from .utils.dummy_tokenizers_objects import *
# Trainer
from .trainer_callback import (
DefaultFlowCallback,
...@@ -490,6 +493,15 @@ if is_torch_available():
load_tf_weights_in_openai_gpt,
)
from .modeling_pegasus import PegasusForConditionalGeneration
from .modeling_prophetnet import (
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
ProphetNetDecoder,
ProphetNetEncoder,
ProphetNetForCausalLM,
ProphetNetForConditionalGeneration,
ProphetNetModel,
ProphetNetPreTrainedModel,
)
from .modeling_rag import RagModel, RagSequenceForGeneration, RagTokenForGeneration
from .modeling_reformer import (
REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
...@@ -550,6 +562,14 @@ if is_torch_available():
XLMPreTrainedModel,
XLMWithLMHeadModel,
)
from .modeling_xlm_prophetnet import (
XLM_PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
XLMProphetNetDecoder,
XLMProphetNetEncoder,
XLMProphetNetForCausalLM,
XLMProphetNetForConditionalGeneration,
XLMProphetNetModel,
)
from .modeling_xlm_roberta import (
XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,
XLMRobertaForCausalLM,
......
...@@ -41,6 +41,7 @@ from .configuration_mbart import MBART_PRETRAINED_CONFIG_ARCHIVE_MAP, MBartConfi
from .configuration_mobilebert import MobileBertConfig
from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from .configuration_pegasus import PegasusConfig
from .configuration_prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig
from .configuration_rag import RagConfig
from .configuration_reformer import ReformerConfig
from .configuration_retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig
...@@ -50,6 +51,7 @@ from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from .configuration_utils import PretrainedConfig
from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
from .configuration_xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
from .configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
from .configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
...@@ -84,6 +86,8 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
]
for key, value, in pretrained_map.items()
)
...@@ -116,6 +120,8 @@ CONFIG_MAPPING = OrderedDict(
("gpt2", GPT2Config),
("transfo-xl", TransfoXLConfig),
("xlnet", XLNetConfig),
("xlm-prophetnet", XLMProphetNetConfig),
("prophetnet", ProphetNetConfig),
("xlm", XLMConfig), ("xlm", XLMConfig),
("ctrl", CTRLConfig), ("ctrl", CTRLConfig),
("electra", ElectraConfig), ("electra", ElectraConfig),
...@@ -164,6 +170,8 @@ MODEL_NAMES_MAPPING = OrderedDict(
("layoutlm", "LayoutLM"),
("dpr", "DPR"),
("rag", "RAG"),
("xlm-prophetnet", "XLMProphetNet"),
("prophetnet", "ProphetNet"),
]
)
......
...@@ -15,12 +15,11 @@
""" LXMERT model configuration """
import logging
from .configuration_utils import PretrainedConfig
from .utils import logging
logger = logging.get_logger(__name__)
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"unc-nlp/lxmert-base-uncased": "",
......
# coding=utf-8
# Copyright 2020 The Microsoft Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" ProphetNet model configuration """
from .configuration_utils import PretrainedConfig
from .utils import logging
logger = logging.get_logger(__name__)
PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/prophetnet-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/prophetnet-large-uncased/config.json",
}
class ProphetNetConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ProphetNetModel`. It is used to
instantiate a ProphetNet model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
activation_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for activations inside the fully connected layer.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ProphetNet model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.ProphetNetModel`.
hidden_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
num_encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers.
num_encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the ``intermediate`` (often named feed-forward) layer in decoder.
num_decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers.
num_decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_cross_attention (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether cross-attention layers should be added to the model.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
pad_token_id (:obj:`int`, `optional`, defaults to 0):
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 1):
Beginning of stream token id.
eos_token_id (:obj:`int`, `optional`, defaults to 2):
End of stream token id.
ngram (:obj:`int`, `optional`, defaults to 2):
Number of future tokens to predict.
Set to 1 to be the same as a traditional language model that predicts only the next token.
num_buckets (:obj:`int`, `optional`, defaults to 32):
The number of buckets to use for each attention layer.
This is for relative position calculation. See the `T5 paper
<https://arxiv.org/abs/1910.10683>`__ for more details.
relative_max_distance (:obj:`int`, `optional`, defaults to 128):
Relative distances greater than this number will be put into the last same bucket.
This is for relative position calculation. See the `T5 paper
<https://arxiv.org/abs/1910.10683>`__ for more details.
disable_ngram_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to disable the n-gram loss and train the model to predict only the next token.
eps (:obj:`float`, `optional`, defaults to 0.0):
Controls the ``epsilon`` parameter value for label
smoothing in the loss calculation. If set to 0, no label smoothing is performed.
"""
model_type = "prophetnet"
def __init__(
self,
activation_dropout=0.1,
activation_function="gelu",
vocab_size=30522,
hidden_size=1024,
encoder_ffn_dim=4096,
num_encoder_layers=12,
num_encoder_attention_heads=16,
decoder_ffn_dim=4096,
num_decoder_layers=12,
num_decoder_attention_heads=16,
attention_dropout=0.1,
dropout=0.1,
max_position_embeddings=512,
init_std=0.02,
is_encoder_decoder=True,
add_cross_attention=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
decoder_start_token_id=0,
ngram=2,
num_buckets=32,
relative_max_distance=128,
disable_ngram_loss=False,
eps=0.0,
**kwargs
):
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
add_cross_attention=add_cross_attention,
decoder_start_token_id=decoder_start_token_id,
**kwargs,
)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.encoder_ffn_dim = encoder_ffn_dim
self.num_encoder_layers = num_encoder_layers
self.num_encoder_attention_heads = num_encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.num_decoder_layers = num_decoder_layers
self.num_decoder_attention_heads = num_decoder_attention_heads
self.max_position_embeddings = max_position_embeddings
self.init_std = init_std # Normal(0, this parameter)
self.activation_function = activation_function
# parameters for prophetnet
self.ngram = ngram
self.num_buckets = num_buckets
self.relative_max_distance = relative_max_distance
self.disable_ngram_loss = disable_ngram_loss
self.eps = eps
# 3 Types of Dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.dropout = dropout
@property
def num_attention_heads(self) -> int:
return self.num_encoder_attention_heads
@property
def num_hidden_layers(self) -> int:
return self.num_encoder_layers + self.num_decoder_layers
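# Illustrative sketch (not part of the original file): exercise the derived
# properties above with a small, arbitrary configuration. Run via
# `python -m transformers.configuration_prophetnet` so the relative imports resolve.
if __name__ == "__main__":
    example_config = ProphetNetConfig(
        num_encoder_layers=2,
        num_decoder_layers=2,
        num_encoder_attention_heads=4,
        num_decoder_attention_heads=4,
        ngram=2,
    )
    print(example_config.num_attention_heads)  # 4 (mirrors the encoder attention heads)
    print(example_config.num_hidden_layers)  # 4 (encoder layers + decoder layers)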
# coding=utf-8
# Copyright 2020 The Microsoft Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" XLM-ProphetNet model configuration """
from .configuration_prophetnet import ProphetNetConfig
from .utils import logging
logger = logging.get_logger(__name__)
XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/xprophetnet-large-wiki100-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/xprophetnet-large-wiki100-cased/config.json",
}
class XLMProphetNetConfig(ProphetNetConfig):
"""
This class overrides :class:`~transformers.ProphetNetConfig`. Please check the
superclass for the appropriate documentation alongside usage examples.
"""
model_type = "xlm-prophetnet"
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert ProphetNet checkpoint."""
import argparse
import torch
from transformers import logging
from transformers.modeling_prophetnet import ProphetNetForConditionalGeneration
from transformers.modeling_xlm_prophetnet import XLMProphetNetForConditionalGeneration
# transformers_old should correspond to branch `save_old_prophetnet_model_structure` here
# original prophetnet_checkpoints are saved under `patrickvonplaten/..._old` respectively
from transformers_old.modeling_prophetnet import (
ProphetNetForConditionalGeneration as ProphetNetForConditionalGenerationOld,
)
from transformers_old.modeling_xlm_prophetnet import (
XLMProphetNetForConditionalGeneration as XLMProphetNetForConditionalGenerationOld,
)
logger = logging.get_logger(__name__)
logging.set_verbosity_info()
def convert_prophetnet_checkpoint_to_pytorch(prophetnet_checkpoint_path: str, pytorch_dump_folder_path: str):
"""
Copy/paste/tweak prohpetnet's weights to our prophetnet structure.
"""
if "xprophetnet" in prophetnet_checkpoint_path:
prophet_old = XLMProphetNetForConditionalGenerationOld.from_pretrained(prophetnet_checkpoint_path)
prophet, loading_info = XLMProphetNetForConditionalGeneration.from_pretrained(
prophetnet_checkpoint_path, output_loading_info=True
)
else:
prophet_old = ProphetNetForConditionalGenerationOld.from_pretrained(prophetnet_checkpoint_path)
prophet, loading_info = ProphetNetForConditionalGeneration.from_pretrained(
prophetnet_checkpoint_path, output_loading_info=True
)
special_keys = ["key_proj", "value_proj", "query_proj"]
mapping = {
"self_attn": "ngram_self_attn",
"cross_attn": "encoder_attn",
"cross_attn_layer_norm": "encoder_attn_layer_norm",
"feed_forward_layer_norm": "final_layer_norm",
"feed_forward": "",
"intermediate": "fc1",
"output": "fc2",
"key_proj": "k_proj",
"query_proj": "q_proj",
"value_proj": "v_proj",
"word_embeddings": "embed_tokens",
"embeddings_layer_norm": "emb_layer_norm",
"relative_pos_embeddings": "relative_linear",
"ngram_embeddings": "ngram_input_embed",
"position_embeddings": "embed_positions",
}
for key in loading_info["missing_keys"]:
attributes = key.split(".")
if attributes[0] == "lm_head":
model = prophet
old_model = prophet_old
else:
model = prophet.prophetnet
old_model = prophet_old.model
is_key_init = False
for attribute in attributes:
if attribute in mapping:
old_attribute = mapping[attribute]
if not hasattr(old_model, old_attribute) and len(old_attribute) > 0:
old_attribute = attribute
elif hasattr(old_model, attribute):
old_attribute = attribute
if attribute == "weight":
assert old_model.weight.shape == model.weight.shape, "Shapes have to match!"
model.weight = old_model.weight
logger.info(f"{attribute} is initialized.")
is_key_init = True
break
elif attribute == "bias":
assert old_model.bias.shape == model.bias.shape, "Shapes have to match!"
model.bias = old_model.bias
logger.info(f"{attribute} is initialized")
is_key_init = True
break
elif attribute in special_keys and hasattr(old_model, "in_proj_weight"):
embed_dim = old_model.in_proj_weight.shape[0] // 3
param = getattr(model, attribute)
assert param.weight.shape == old_model.in_proj_weight[:embed_dim, :].shape, "Shapes have to match"
assert param.bias.shape == old_model.in_proj_bias[:embed_dim].shape, "Shapes have to match"
if attribute == "query_proj":
model.query_proj.weight = torch.nn.Parameter(old_model.in_proj_weight[:embed_dim, :])
model.query_proj.bias = torch.nn.Parameter(old_model.in_proj_bias[:embed_dim])
elif attribute == "key_proj":
model.key_proj.weight = torch.nn.Parameter(old_model.in_proj_weight[embed_dim : 2 * embed_dim, :])
model.key_proj.bias = torch.nn.Parameter(old_model.in_proj_bias[embed_dim : 2 * embed_dim])
elif attribute == "value_proj":
model.value_proj.weight = torch.nn.Parameter(old_model.in_proj_weight[2 * embed_dim :, :])
model.value_proj.bias = torch.nn.Parameter(old_model.in_proj_bias[2 * embed_dim :])
is_key_init = True
break
elif attribute == "position_embeddings":
assert (
model.position_embeddings.weight.shape[-1] == old_model.embed_positions.weight.shape[-1]
), "Hidden size has to match"
assert model.position_embeddings.weight.shape[0] == 512, "We want 512 position_embeddings."
model.position_embeddings.weight = torch.nn.Parameter(old_model.embed_positions.weight[:512, :])
is_key_init = True
break
if attribute.isdigit():
model = model[int(attribute)]
old_model = old_model[int(old_attribute)]
else:
model = getattr(model, attribute)
if old_attribute == "":
old_model = old_model
else:
if not hasattr(old_model, old_attribute):
raise ValueError(f"{old_model} does not have {old_attribute}")
old_model = getattr(old_model, old_attribute)
if not is_key_init:
raise ValueError(f"{key} was not correctly initialized!")
print(f"Saving model to {pytorch_dump_folder_path}")
prophet.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--prophetnet_checkpoint_path", default=None, type=str, required=True, help="Path the official PyTorch dump."
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
)
args = parser.parse_args()
convert_prophetnet_checkpoint_to_pytorch(args.prophetnet_checkpoint_path, args.pytorch_dump_folder_path)
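# Example invocation (illustrative only; the script name and both paths below are placeholders):
#
#   python convert_prophetnet_checkpoint.py \
#       --prophetnet_checkpoint_path ./old_prophetnet_checkpoint \
#       --pytorch_dump_folder_path ./converted_prophetnet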
...@@ -43,6 +43,7 @@ from .configuration_auto import (
MobileBertConfig,
OpenAIGPTConfig,
PegasusConfig,
ProphetNetConfig,
ReformerConfig,
RetriBertConfig,
RobertaConfig,
...@@ -50,6 +51,7 @@ from .configuration_auto import (
T5Config,
TransfoXLConfig,
XLMConfig,
XLMProphetNetConfig,
XLMRobertaConfig,
XLNetConfig,
replace_list_option_in_docstrings,
...@@ -155,6 +157,7 @@ from .modeling_mobilebert import (
)
from .modeling_openai import OpenAIGPTForSequenceClassification, OpenAIGPTLMHeadModel, OpenAIGPTModel
from .modeling_pegasus import PegasusForConditionalGeneration
from .modeling_prophetnet import ProphetNetForCausalLM, ProphetNetForConditionalGeneration, ProphetNetModel
from .modeling_rag import ( # noqa: F401 - need to import all RagModels to be in globals() function
RagModel,
RagSequenceForGeneration,
...@@ -194,6 +197,11 @@ from .modeling_xlm import (
XLMModel,
XLMWithLMHeadModel,
)
from .modeling_xlm_prophetnet import (
XLMProphetNetForCausalLM,
XLMProphetNetForConditionalGeneration,
XLMProphetNetModel,
)
from .modeling_xlm_roberta import (
XLMRobertaForCausalLM,
XLMRobertaForMaskedLM,
...@@ -247,6 +255,8 @@ MODEL_MAPPING = OrderedDict(
(BertGenerationConfig, BertGenerationEncoder),
(DebertaConfig, DebertaModel),
(DPRConfig, DPRQuestionEncoder),
(XLMProphetNetConfig, XLMProphetNetModel),
(ProphetNetConfig, ProphetNetModel),
]
)
...@@ -325,6 +335,8 @@ MODEL_FOR_CAUSAL_LM_MAPPING = OrderedDict(
(CTRLConfig, CTRLLMHeadModel),
(ReformerConfig, ReformerModelWithLMHead),
(BertGenerationConfig, BertGenerationDecoder),
(XLMProphetNetConfig, XLMProphetNetForCausalLM),
(ProphetNetConfig, ProphetNetForCausalLM),
]
)
...@@ -359,6 +371,8 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
(BartConfig, BartForConditionalGeneration),
(FSMTConfig, FSMTForConditionalGeneration),
(EncoderDecoderConfig, EncoderDecoderModel),
(XLMProphetNetConfig, XLMProphetNetForConditionalGeneration),
(ProphetNetConfig, ProphetNetForConditionalGeneration),
]
)
......
...@@ -15,7 +15,6 @@
""" PyTorch LXMERT model. """
import logging
import math
import os
from dataclasses import dataclass
...@@ -35,9 +34,10 @@ from .file_utils import (
replace_return_docstrings,
)
from .modeling_utils import PreTrainedModel
from .utils import logging
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "LxmertConfig"
_TOKENIZER_FOR_DOC = "LxmertTokenizer"
......
This diff is collapsed.
...@@ -17,7 +17,6 @@
""" TF 2.0 LXMERT model. """
import logging
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
...@@ -34,9 +33,10 @@ from .file_utils import (
)
from .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list
from .tokenization_utils_base import BatchEncoding
from .utils import logging
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "LxmertConfig"
......