Unverified Commit c314b1fd authored by Sam Shleifer, committed by GitHub

[docs] improve bart/marian/mBART/pegasus docs (#8421)

parent 3213d3bf
@@ -34,6 +34,8 @@
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Implementation Notes
@@ -44,14 +46,31 @@ Implementation Notes
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
:func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
other modeling APIs.
- Model predictions are intended to be identical to the original implementation when
:obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
:func:`fairseq.encode` starts with a space.
- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like
summarization; see the example in that method's docstring and the sketch after this list.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
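For a quick illustration of the :meth:`~transformers.BartForConditionalGeneration.generate` point above, here is a
minimal summarization sketch. It assumes the :obj:`facebook/bart-large-cnn` checkpoint; the article text and the
generation parameters are illustrative placeholders, not tuned recommendations.

.. code-block:: python

    from transformers import BartForConditionalGeneration, BartTokenizer

    # made-up article snippet, used only for illustration
    ARTICLE = "PG&E scheduled blackouts in response to forecasts for high winds amid dry conditions."

    tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    inputs = tok([ARTICLE], max_length=1024, truncation=True, return_tensors="pt")
    # beam search with a short max_length keeps the summary concise
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
    print(tok.decode(summary_ids[0], skip_special_tokens=True))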
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.

.. code-block::

    from transformers import BartForConditionalGeneration, BartTokenizer

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", force_bos_token_to_be_generated=True)
    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    batch = tok(example_english_phrase, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -5,7 +5,7 @@ MarianMT
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @patrickvonplaten.
Translations should be similar to, but not identical to, the output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -35,32 +35,46 @@ Naming
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
group uses a combination of ISO-639-5 and ISO-639-2 codes.
Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- `Fine-tune on TPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>`__
- `Fine-tune on GPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>`__
- `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distil_marian_no_teacher.sh>`__
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If a model can output multiple languages, you should specify a language code by prepending the desired output
language to the :obj:`src_text`.
- You can see a model's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
<https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
codes are required.
New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
require 3 character language codes:
.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer

    src_text = [
        '>>fra<< this is a sentence in english that we want to translate to french',
        '>>por<< This should go to portuguese',
        '>>esp<< And this to Spanish'
    ]

    model_name = 'Helsinki-NLP/opus-mt-en-roa'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)
    model = MarianMTModel.from_pretrained(model_name)
@@ -70,25 +84,42 @@ Example of translating english to many romance languages, using language codes:
    # 'Isto deve ir para o português.',
    # 'Y esto al español']
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
codes.
There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), but
I have not found them to produce different translations than plain :obj:`>>es<<`.
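A quick, informal way to check this is to translate the same sentence with a bare code and a regional code and compare
the outputs. This is only a sketch; the exact translations depend on the checkpoint version.

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer

    # old-style multilingual checkpoint that accepts 2 character codes, including regional variants
    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    src_text = ['>>es<< how are you today?', '>>es_AR<< how are you today?']
    batch = tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt")
    translated = model.generate(**batch)
    print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
    # the two outputs are typically identical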
Code to see available pretrained models:

.. code-block:: python

    from transformers.hf_api import HfApi

    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
Old Style Multi-Lingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These are the old style multi-lingual models ported from the OPUS-MT-Train repo, along with the members of each
language group:
.. code-block:: python

    ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
     'Helsinki-NLP/opus-mt-ROMANCE-en',
     'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
     'Helsinki-NLP/opus-mt-de-ZH',
     'Helsinki-NLP/opus-mt-en-CELTIC',
     'Helsinki-NLP/opus-mt-en-ROMANCE',
     'Helsinki-NLP/opus-mt-es-NORWAY',
     'Helsinki-NLP/opus-mt-fi-NORWAY',
     'Helsinki-NLP/opus-mt-fi-ZH',
     'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
     'Helsinki-NLP/opus-mt-sv-NORWAY',
     'Helsinki-NLP/opus-mt-sv-ZH']
    GROUP_MEMBERS = {
        'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
        'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
@@ -99,16 +130,22 @@ For example:
        'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }
Example of translating English to many romance languages, using old-style 2 character language codes:

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer

    src_text = [
        '>>fr<< this is a sentence in english that we want to translate to french',
        '>>pt<< This should go to portuguese',
        '>>es<< And this to Spanish'
    ]

    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)
    model = MarianMTModel.from_pretrained(model_name)
    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text))
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    # ["c'est une phrase en anglais que nous voulons traduire en français",
    #  'Isto deve ir para o português.',
    #  'Y esto al español']
MarianConfig
@@ -19,6 +19,13 @@ on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning mBART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- Given the large embeddings table, mBART consumes a large amount of GPU RAM, especially for fine-tuning.
:class:`MarianMTModel` is usually a better choice for bilingual machine translation.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -38,11 +45,7 @@ the sequences for sequence-to-sequence fine-tuning.
example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
model(input_ids=batch['input_ids'], labels=batch['labels'])  # forward pass
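For reference, here is a more self-contained version of the same forward pass. It is a sketch that assumes the
:obj:`facebook/mbart-large-en-ro` checkpoint and PyTorch tensors (:obj:`return_tensors="pt"`); adapt the checkpoint and
language codes to your task.

.. code-block:: python

    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

    example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
    expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"

    batch = tokenizer.prepare_seq2seq_batch(
        example_english_phrase,
        src_lang="en_XX",
        tgt_lang="ro_RO",
        tgt_texts=expected_translation_romanian,
        return_tensors="pt",
    )
    outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])  # forward pass computes the loss
    loss = outputs[0]  # on newer versions: outputs.loss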
- Generation
@@ -31,10 +31,19 @@ All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
- Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
- Full replication results and correctly pre-processed data can be found in this `Issue
<https://github.com/huggingface/transformers/issues/6844#issue-689259666>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distill-pegasus>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Examples
_______________________________________________________________________________________________________________________
- `Script <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh>`__ to
fine-tune pegasus on the XSUM dataset. Data download instructions at `examples/seq2seq/
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
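Since the adafactor optimizer is recommended above, here is a minimal sketch of wiring it up for pegasus fine-tuning.
The checkpoint, the toy source/target pair and the optimizer settings below are illustrative assumptions, not the
configuration used by the linked fine-tuning script.

.. code-block:: python

    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    from transformers.optimization import Adafactor

    model_name = "google/pegasus-xsum"  # any pegasus checkpoint works the same way
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    # relative_step/warmup_init let Adafactor manage its own learning-rate schedule
    optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)

    # toy example, for illustration only
    batch = tokenizer.prepare_seq2seq_batch(
        ["PG&E scheduled blackouts in response to forecasts for high winds."],
        tgt_texts=["Blackouts scheduled over wind forecasts."],
        return_tensors="pt",
    )
    outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["labels"])
    loss = outputs[0]  # on newer versions: outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()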
Implementation Notes
@@ -45,7 +54,7 @@ Implementation Notes
- Some key configuration differences:

  - static, sinusoidal position embeddings
  - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
  - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
  - more beams are used (:obj:`num_beams=8`)

- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
@@ -476,9 +476,9 @@ class BartModelIntegrationTests(unittest.TestCase):
    @slow
    def test_bart_large_mask_filling(self):
        plarge = pipeline(task="fill-mask", model="facebook/bart-large")
        src_text = [" I went to the <mask>."]
        results = [x["token_str"] for x in plarge(src_text)]
        expected_results = ["Ġbathroom", "Ġgym", "Ġwrong", "Ġmovies", "Ġhospital"]
        self.assertListEqual(results, expected_results)