.. 
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

MarianMT
-----------------------------------------------------------------------------------------------------------------------

**Bugs:** If you see something strange, file a `GitHub Issue
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @patrickvonplaten.

Translations should be similar, but not identical, to the output in the test set linked to in each model card.

Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Each model is about 298 MB on disk; there are more than 1,000 models.
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
- The 80 OPUS models that require BPE preprocessing are not supported.
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:

    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
    - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
      :obj:`</s>`); see the sketch after this list

- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
- This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__.
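
The decoder-start detail above can be checked directly from a checkpoint's configuration. This is a minimal sketch,
reusing the ``Helsinki-NLP/opus-mt-en-de`` checkpoint that appears in the examples below:

.. code-block:: python

    >>> from transformers import MarianConfig

    >>> config = MarianConfig.from_pretrained('Helsinki-NLP/opus-mt-en-de')
    >>> # Marian starts decoding from the pad token, unlike Bart which uses </s>
    >>> config.decoder_start_token_id == config.pad_token_id
    True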

Tips
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- In Flax, it is highly advised to pass :obj:`early_stopping=True` to :obj:`generate`. *E.g.*:

    ::

          >>> from transformers import MarianTokenizer, FlaxMarianMTModel

          >>> model = FlaxMarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-de')
          >>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')

          >>> text = "My friends are cool but they eat too many carbs."
          >>> input_ids = tokenizer(text, max_length=64, return_tensors='jax').input_ids

          >>> # Marian has to make use of early_stopping=True
          >>> sequences = model.generate(input_ids, early_stopping=True, max_length=64, num_beams=2).sequences


Naming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two-digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__; three-digit codes require googling "language
  code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1,000 models use ISO-639-2 codes to identify languages; the
  second group uses a combination of ISO-639-5 and ISO-639-2 codes.
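
For example, a model name can be assembled from a source and a target code and loaded directly. A minimal sketch
(``src`` and ``tgt`` here are just illustrative variables; ``Helsinki-NLP/opus-mt-en-de`` is one of the supported
pairs):

.. code-block:: python

    >>> from transformers import MarianMTModel, MarianTokenizer

    >>> src, tgt = 'en', 'de'  # two-digit (ISO 639-1) language codes
    >>> model_name = f'Helsinki-NLP/opus-mt-{src}-{tgt}'
    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
    >>> model = MarianMTModel.from_pretrained(model_name)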


Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Since Marian models are smaller than many other translation models available in the library, they can be useful for
  fine-tuning experiments and integration tests (see the quick pipeline sketch after the links below).
- `Fine-tune on GPU
  <https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh>`__
- `Fine-tune on GPU with pytorch-lightning
  <https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh>`__
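
For a quick smoke test, a Marian checkpoint can also be run through the translation pipeline. A minimal sketch (the
task name and checkpoint below are one possible combination, not the only one):

.. code-block:: python

    >>> from transformers import pipeline

    >>> translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de')
    >>> translator("My friends are cool but they eat too many carbs.")  # returns a list of dicts with 'translation_text'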

Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
- If a model can output multiple languages, you should specify a language code by prepending the desired output
  language's code to the :obj:`src_text`.
- You can see a model's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
  <https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
  codes are required.

New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
require 3-character language codes:

.. code-block:: python

    >>> from transformers import MarianMTModel, MarianTokenizer
    >>> src_text = [
    ...     '>>fra<< this is a sentence in english that we want to translate to french',
    ...     '>>por<< This should go to portuguese',
    ...     '>>esp<< And this to Spanish'
    ... ]

    >>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
    >>> print(tokenizer.supported_language_codes)
    ['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']

    >>> model = MarianMTModel.from_pretrained(model_name)
    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    ["c'est une phrase en anglais que nous voulons traduire en français",
     'Isto deve ir para o português.',
     'Y esto al español']




Here is the code to see all available pretrained models on the hub:

.. code-block:: python

    from transformers.hf_api import HfApi

    # List every model on the hub and keep the ones under the Helsinki-NLP organization
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    # Old-style multilingual models are the ones whose names contain uppercase language-group codes
    old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]



Old Style Multi-Lingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are the old-style multi-lingual models ported from the OPUS-MT-Train repo, along with the members of each
language group:

.. code-block:: python

    ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
     'Helsinki-NLP/opus-mt-ROMANCE-en',
     'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
     'Helsinki-NLP/opus-mt-de-ZH',
     'Helsinki-NLP/opus-mt-en-CELTIC',
     'Helsinki-NLP/opus-mt-en-ROMANCE',
     'Helsinki-NLP/opus-mt-es-NORWAY',
     'Helsinki-NLP/opus-mt-fi-NORWAY',
     'Helsinki-NLP/opus-mt-fi-ZH',
     'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
     'Helsinki-NLP/opus-mt-sv-NORWAY',
     'Helsinki-NLP/opus-mt-sv-ZH']
    GROUP_MEMBERS = {
     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
     'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
     'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }




Example of translating English to many Romance languages, using old-style two-character language codes:


.. code-block:: python

    >>> from transformers import MarianMTModel, MarianTokenizer
    >>> src_text = [
    ...     '>>fr<< this is a sentence in english that we want to translate to french',
    ...     '>>pt<< This should go to portuguese',
    ...     '>>es<< And this to Spanish'
    ... ]

    >>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)

    >>> model = MarianMTModel.from_pretrained(model_name)
    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    >>> tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    ["c'est une phrase en anglais que nous voulons traduire en français",
     'Isto deve ir para o português.',
     'Y esto al español']



MarianConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianConfig
    :members:


MarianTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianTokenizer
    :members: as_target_tokenizer
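
A minimal sketch of preparing a training batch, using :obj:`as_target_tokenizer` so the labels are tokenized with the
target-language SentencePiece model (the checkpoint and the German target sentences here are just illustrative):

.. code-block:: python

    >>> from transformers import MarianTokenizer

    >>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
    >>> src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
    >>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]
    >>> inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
    >>> with tokenizer.as_target_tokenizer():  # switch to the target tokenizer for the labels
    ...     labels = tokenizer(tgt_texts, return_tensors="pt", padding=True)
    >>> inputs["labels"] = labels["input_ids"]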


MarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianModel
    :members: forward


MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianMTModel
    :members: forward


MarianForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MarianForCausalLM
    :members: forward


TFMarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFMarianModel
    :members: call


TFMarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFMarianMTModel
    :members: call


FlaxMarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxMarianModel
    :members: __call__


FlaxMarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxMarianMTModel
    :members: __call__