chenpangpang / transformers
Commit c314b1fd (Unverified)
Authored Nov 10, 2020 by Sam Shleifer. Committed by GitHub on Nov 10, 2020.
[docs] improve bart/marian/mBART/pegasus docs (#8421)
Parent: 3213d3bf
Showing 5 changed files with 114 additions and 46 deletions.
docs/source/model_doc/bart.rst (+21 -2)
docs/source/model_doc/marian.rst (+70 -33)
docs/source/model_doc/mbart.rst (+8 -5)
docs/source/model_doc/pegasus.rst (+13 -4)
tests/test_modeling_bart.py (+2 -2)
docs/source/model_doc/bart.rst
...
...
@@ -34,6 +34,8 @@ ________________________________________________________________________________
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Implementation Notes
...
...
@@ -44,14 +46,31 @@ Implementation Notes
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
  :func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some
  other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
  string you pass to :func:`fairseq.encode` starts with a space.
- Model predictions are intended to be identical to the original implementation when
  :obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
  :func:`fairseq.encode` starts with a space.
- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like
  summarization, see the example in that docstrings.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
  mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
.. code-block::

    from transformers import BartForConditionalGeneration, BartTokenizer
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", force_bos_token_to_be_generated=True)
    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    batch = tok(example_english_phrase, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']

BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
docs/source/model_doc/marian.rst
...
...
@@ -5,7 +5,7 @@ MarianMT
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
and assign @patrickvonplaten.
Translations should be similar, but not identical to, output in the test set linked to in each model card.
Translations should be similar, but not identical to output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -35,32 +35,46 @@ Naming
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
group use a combination of ISO-639-5 codes and ISO-639-2 codes.
Multilingual Models
Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- `Fine-tune on TPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>`__
- `Fine-tune on GPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>`__
- `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distil_marian_no_teacher.sh>`__
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
looking at the model card, or the Group Members `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
- If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
prepending the desired output language to the :obj:`src_text`.
- You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If a model can output multiple languages, and you should specify a language code by prepending the desired output
  language to the :obj:`src_text`.
- You can see a models's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
  <https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
  codes are required.
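
The language-code convention described in the bullets above can be sketched in plain Python, without loading a model. The helper name ``add_target_code`` is hypothetical and not part of transformers; only the ``>>code<<`` prefix format is taken from the docs in this diff.

```python
def add_target_code(src_texts, lang_code):
    """Prepend a Marian-style target-language token such as '>>fra<<' to each sentence.

    Hypothetical helper for illustration only; it is not a transformers API.
    """
    prefix = f">>{lang_code}<<"
    return [f"{prefix} {text}" for text in src_texts]


# Build inputs for a model that can output several target languages.
batch = add_target_code(["This should go to portuguese"], "por")
print(batch)
```

The resulting strings are what would be passed to the tokenizer as :obj:`src_text` for a model with a multilingual target side.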
Example of translating english to many romance languages, using language codes:
New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
require 3 character language codes:

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer
    src_text = [
        '>>fr<< this is a sentence in english that we want to translate to french',
        '>>pt<< This should go to portuguese',
        '>>es<< And this to Spanish'
        '>>fra<< this is a sentence in english that we want to translate to french',
        '>>por<< This should go to portuguese',
        '>>esp<< And this to Spanish'
    ]
    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    model_name = 'Helsinki-NLP/opus-mt-en-roa'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)
    model = MarianMTModel.from_pretrained(model_name)
...
...
@@ -70,25 +84,42 @@ Example of translating english to many romance languages, using language codes:
    # 'Isto deve ir para o português.',
    # 'Y esto al español']
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
codes. There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<`
(Argentina), that do not seem to change translations. I have not found these to provide different results than just
using :obj:`>>es<<`.
For example:
- `Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping
  <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special
  language code like :obj:`>>de<<` to specify output language.
- `Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many romance languages to english, no codes needed since there
  is only one target language.
Code to see available pretrained models:

.. code-block:: python

    from transformers.hf_api import HfApi
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
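
The ``s != s.lower()`` filter in the snippet above can be tried offline. The sketch below applies the same test to a short hard-coded list instead of the Hub listing; the function name ``is_old_style_multilingual`` is an assumption made for illustration, and the model ids are the ones mentioned elsewhere in this diff.

```python
def is_old_style_multilingual(model_id):
    """Return True when the name after the org contains upper-case characters.

    This mirrors the `s != s.lower()` check; the function name is hypothetical.
    """
    suffix = model_id.split("/")[1]
    return suffix != suffix.lower()


model_ids = [
    "Helsinki-NLP/opus-mt-en-ROMANCE",  # old style: upper-case group name
    "Helsinki-NLP/opus-mt-en-roa",      # new style: lower-case 3 character code
    "Helsinki-NLP/opus-mt-roa-en",
]
old_style = [m for m in model_ids if is_old_style_multilingual(m)]
print(old_style)
```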
Old Style Multi-Lingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These are the old style multi-lingual models ported from the OPUS-MT-Train repo: and the members of each language
group:

.. code-block:: python

    ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
     'Helsinki-NLP/opus-mt-ROMANCE-en',
     'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
     'Helsinki-NLP/opus-mt-de-ZH',
     'Helsinki-NLP/opus-mt-en-CELTIC',
     'Helsinki-NLP/opus-mt-en-ROMANCE',
     'Helsinki-NLP/opus-mt-es-NORWAY',
     'Helsinki-NLP/opus-mt-fi-NORWAY',
     'Helsinki-NLP/opus-mt-fi-ZH',
     'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
     'Helsinki-NLP/opus-mt-sv-NORWAY',
     'Helsinki-NLP/opus-mt-sv-ZH']
    GROUP_MEMBERS = {
     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
...
...
@@ -99,16 +130,22 @@ For example:
     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }

Code to see available pretrained models:

.. code-block:: python

    from transformers.hf_api import HfApi
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]

Example of translating english to many romance languages, using old-style 2 character language codes

.. code-block:: python

    from transformers import MarianMTModel, MarianTokenizer
    src_text = [
        '>>fr<< this is a sentence in english that we want to translate to french',
        '>>pt<< This should go to portuguese',
        '>>es<< And this to Spanish'
    ]
    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)
    model = MarianMTModel.from_pretrained(model_name)
    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text))
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    # ["c'est une phrase en anglais que nous voulons traduire en français", 'Isto deve ir para o português.', 'Y esto al español']

MarianConfig
...
...
docs/source/model_doc/mbart.rst
...
...
@@ -19,6 +19,13 @@ on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning mBART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- Given the large embeddings table, mBART consumes a large amount of GPU RAM, especially for fine-tuning.
:class:`MarianMTModel` is usually a better choice for bilingual machine translation.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -38,11 +45,7 @@ the sequences for sequence-to-sequence fine-tuning.
example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels) #forward
model(input_ids=batch['input_ids'], labels=batch['labels']) # forward pass
- Generation
...
...
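
The mBART training hunk above replaces a manual split of the target ids with a single call that passes only ``labels``, presumably leaving the decoder-input shift to the model. As a minimal sketch of what the older lines computed, the same slicing is shown here on plain Python lists; the function name ``split_targets`` and the token ids are made up for illustration and are not real mBART vocabulary ids.

```python
def split_targets(target_ids):
    """Mimic the older lines: drop the last token for decoder inputs, the first for labels.

    Hypothetical helper; the original code used tensor slicing
    (`target_ids[:, :-1]` and `target_ids[:, 1:]`) instead of lists.
    """
    decoder_input_ids = [seq[:-1] for seq in target_ids]
    labels = [seq[1:] for seq in target_ids]
    return decoder_input_ids, labels


# Made-up token ids for one target sentence.
target_ids = [[101, 7, 8, 9, 2]]
decoder_input_ids, labels = split_targets(target_ids)
print(decoder_input_ids)  # [[101, 7, 8, 9]]
print(labels)             # [[7, 8, 9, 2]]
```

At each position the decoder input holds the previous target token and the label holds the token to predict next.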
docs/source/model_doc/pegasus.rst
...
...
@@ -31,10 +31,19 @@ All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
- Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
- For XSUM, The paper reports rouge1, rouge2, rougeL of paper: 47.21/24.56/39.25. As of Aug 9, this port scores
  46.91/24.34/39.1.
- Full replication results and correctly pre-processed data can be found in this `Issue
  <https://github.com/huggingface/transformers/issues/6844#issue-689259666>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distill-pegasus>`__ are described in this `paper
  <https://arxiv.org/abs/2010.13002>`__.
The gap is likely because of different alpha/length_penalty implementations in beam search.
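
To make the size of that gap concrete, the short sketch below subtracts the port's XSUM scores from the paper's, using only the numbers quoted in the bullet above. The variable and function names are chosen here for illustration.

```python
# Scores quoted in the diff: the paper versus this port (as of Aug 9).
paper_scores = {"rouge1": 47.21, "rouge2": 24.56, "rougeL": 39.25}
port_scores = {"rouge1": 46.91, "rouge2": 24.34, "rougeL": 39.1}


def rouge_gap(paper, port):
    """Return paper minus port for each metric, rounded to two decimals."""
    return {name: round(paper[name] - port[name], 2) for name in paper}


gaps = rouge_gap(paper_scores, port_scores)
print(gaps)  # {'rouge1': 0.3, 'rouge2': 0.22, 'rougeL': 0.15}
```

All three differences are under half a ROUGE point.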
Examples
_______________________________________________________________________________________________________________________
- `Script <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh>`__ to
  fine-tune pegasus on the XSUM dataset. Data download instructions at `examples/seq2seq/
  <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning.
Implementation Notes
...
...
@@ -45,7 +54,7 @@ Implementation Notes
- Some key configuration differences:
    - static, sinusoidal position embeddings
    - no :obj:`layernorm_embedding` (:obj`PegasusConfig.normalize_embedding=False`)
    - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
    - more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
...
...
tests/test_modeling_bart.py
...
...
@@ -476,9 +476,9 @@ class BartModelIntegrationTests(unittest.TestCase):
    @slow
    def test_bart_large_mask_filling(self):
        pbase = pipeline(task="fill-mask", model="facebook/bart-large")
        plarge = pipeline(task="fill-mask", model="facebook/bart-large")
        src_text = [" I went to the <mask>."]
        results = [x["token_str"] for x in pbase(src_text)]
        results = [x["token_str"] for x in plarge(src_text)]
        expected_results = ["Ġbathroom", "Ġgym", "Ġwrong", "Ġmovies", "Ġhospital"]
        self.assertListEqual(results, expected_results)
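
The expected tokens in this test start with ``Ġ``, the character that byte-level BPE vocabularies (as used by GPT-2, RoBERTa and BART) use to encode a leading space. A minimal sketch for turning such token strings back into readable text follows; the helper name ``readable_token`` is hypothetical and not part of transformers.

```python
def readable_token(token_str):
    """Replace the byte-level BPE space marker 'Ġ' with a real space.

    Hypothetical helper for illustration only.
    """
    return token_str.replace("Ġ", " ")


expected_results = ["Ġbathroom", "Ġgym", "Ġwrong", "Ġmovies", "Ġhospital"]
print([readable_token(t) for t in expected_results])
```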
...
...