MarianMT
-----------------------------------------------------------------------------------------------------------------------

**Bugs:** If you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @sshleifer.

Translations should be similar, but not identical to, output in the test set linked to in each model card.

Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Each model is about 298 MB on disk; there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:

    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
    - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
    - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
      :obj:`<s/>`).
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
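
The static sinusoidal positional embeddings mentioned above follow the standard Transformer recipe. A minimal
pure-Python sketch of that table (the library's exact tensor layout may differ; this is illustrative only):

.. code-block:: python

    import math

    def sinusoid_positions(n_positions, dim):
        """Standard sinusoidal position table: sin on even indices, cos on odd ones."""
        table = []
        for pos in range(n_positions):
            row = []
            for i in range(dim):
                angle = pos / (10000 ** (2 * (i // 2) / dim))
                row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
            table.append(row)
        return table

    table = sinusoid_positions(4, 8)
    print(table[0][:4])  # position 0 alternates sin(0)=0.0 and cos(0)=1.0

Because the table is a fixed function of position, it is not learned and adds no parameters to the checkpoint.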

Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two-digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__; three-digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
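
The naming scheme above can be sketched at the string level (the helper function is illustrative, not part of the
library):

.. code-block:: python

    def marian_name(src, tgt):
        """Assemble a Marian checkpoint name from source/target language codes."""
        return f'Helsinki-NLP/opus-mt-{src}-{tgt}'

    print(marian_name('en', 'de'))       # Helsinki-NLP/opus-mt-en-de
    print(marian_name('en', 'ROMANCE'))  # multilingual groups use all-caps codes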


Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:

    - If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
      looking at the model card, or the Group Members `mapping
      <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
    - If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
      prepending the desired output language to the :obj:`src_text`.
    - You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``.

Example of translating English to many Romance languages, using language codes:

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer
    src_text = [
        '>>fr<< this is a sentence in english that we want to translate to french',
        '>>pt<< This should go to portuguese',
        '>>es<< And this to Spanish'
    ]

    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)
    model = MarianMTModel.from_pretrained(model_name)
    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text))
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    # ["c'est une phrase en anglais que nous voulons traduire en français",
    # 'Isto deve ir para o português.',
    # 'Y esto al español']

Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, ``_`` is used
as a separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These models still require
language codes.

There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina); in my
testing they did not produce results different from plain :obj:`>>es<<`.

For example:

    - :obj:`Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping
      <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special
      language code like :obj:`>>de<<` to specify the output language.
    - :obj:`Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many Romance languages to English; no codes are needed
      since there is only one target language.
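
The target-code convention can be sketched at the string level, without downloading a model (the helper name here is
made up for illustration):

.. code-block:: python

    def add_tgt_code(code, text):
        """Prepend a Marian target-language token, e.g. '>>de<<', to source text."""
        return f'>>{code}<< {text}'

    src_text = [add_tgt_code('de', 'This goes to German.'),
                add_tgt_code('sv', 'This goes to Swedish.')]
    print(src_text[0])  # >>de<< This goes to German.

The resulting strings are what you would pass to the tokenizer for a multi-target model.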



The group names map to member language codes as follows:

.. code-block:: python

    GROUP_MEMBERS = {
     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
     'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
     'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }
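
Given that mapping, finding which groups contain a language code is a simple dictionary scan. A sketch using an
abbreviated copy of the dict above:

.. code-block:: python

    GROUPS = {  # abbreviated copy of GROUP_MEMBERS above
        'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
        'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv'],
    }

    def groups_for(code):
        """Return the multilingual group names whose members include `code`."""
        return [name for name, members in GROUPS.items() if code in members]

    print(groups_for('da'))  # ['SCANDINAVIA']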

Code to see available pretrained models:

.. code-block:: python

    from transformers.hf_api import HfApi
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]


MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianConfig
    :members:


MarianTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianTokenizer
    :members: prepare_seq2seq_batch


MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianMTModel