chenpangpang / transformers · Commits · c314b1fd

Unverified commit c314b1fd, authored Nov 10, 2020 by Sam Shleifer, committed by GitHub on Nov 10, 2020.

[docs] improve bart/marian/mBART/pegasus docs (#8421)
parent 3213d3bf

Showing 5 changed files with 114 additions and 46 deletions (+114, -46)
docs/source/model_doc/bart.rst (+21, -2)
docs/source/model_doc/marian.rst (+70, -33)
docs/source/model_doc/mbart.rst (+8, -5)
docs/source/model_doc/pegasus.rst (+13, -4)
tests/test_modeling_bart.py (+2, -2)
docs/source/model_doc/bart.rst
...
...
@@ -34,6 +34,8 @@ ________________________________________________________________________________
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
object can be found in this `forum discussion
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
<https://arxiv.org/abs/2010.13002>`__.
Implementation Notes
...
...
@@ -44,14 +46,31 @@ Implementation Notes
- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function :func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the
string you pass to :func:`fairseq.encode` starts with a space.
- Model predictions are intended to be identical to the original implementation when :obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to :func:`fairseq.encode` starts with a space.
- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like summarization; see the example in that docstring and the sketch after these notes.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
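A minimal summarization sketch tying these notes together; the input text and generation settings below are illustrative assumptions, not taken from the original docs:

.. code-block:: python

    from transformers import BartForConditionalGeneration, BartTokenizer

    # facebook/bart-large-cnn is the summarization checkpoint mentioned in the notes above.
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

    article = "The tower is 324 metres tall, about the same height as an 81-storey building."  # illustrative input
    batch = tok(article, return_tensors="pt")

    # generate() runs beam search here, so use_cache stays at its default (True).
    summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=60, early_stopping=True)
    print(tok.batch_decode(summary_ids, skip_special_tokens=True)[0])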
Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
.. code-block::

    from transformers import BartForConditionalGeneration, BartTokenizer

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", force_bos_token_to_be_generated=True)
    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    batch = tok(example_english_phrase, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
docs/source/model_doc/marian.rst
...
...
@@ -5,7 +5,7 @@ MarianMT
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign @patrickvonplaten.
Translations should be similar, but not identical to, output in the test set linked to in each model card.
Translations should be similar, but not identical to output in the test set linked to in each model card.
Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -35,32 +35,46 @@ Naming
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
code {code}".
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
group use a combination of ISO-639-5 codes and ISO-639-2 codes.
Multilingual Models
Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- `Fine-tune on TPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>`__
- `Fine-tune on GPU
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>`__
- `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/seq2seq/distil_marian_no_teacher.sh>`__
Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- If :obj:`src` is in all caps, the model supports multiple input languages, you can figure out which ones by
looking at the model card, or the Group Members `mapping
<https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
- If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
prepending the desired output language to the :obj:`src_text`.
- You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If a model can output multiple languages, you should specify a language code by prepending the desired output language to the :obj:`src_text`.
- You can see a model's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa <https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language codes are required.
Example of translating english to many romance languages, using language codes:

New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__ require 3 character language codes:
.. code-block:: python
    from transformers import MarianMTModel, MarianTokenizer
    src_text = [
        '>>fra<< this is a sentence in english that we want to translate to french',
        '>>por<< This should go to portuguese',
        '>>esp<< And this to Spanish'
    ]

    model_name = 'Helsinki-NLP/opus-mt-en-roa'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)

    model = MarianMTModel.from_pretrained(model_name)
...
...
@@ -70,25 +84,42 @@ Example of translating english to many romance languages, using language codes:
    # 'Isto deve ir para o português.',
    #  'Y esto al español']
Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language codes.
There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), that do not seem to change translations. I have not found these to provide different results than just using :obj:`>>es<<`. For example:
- `Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like :obj:`>>de<<` to specify output language.
- `Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many romance languages to english, no codes needed since there is only one target language.
Code to see available pretrained models:

.. code-block:: python
    from transformers.hf_api import HfApi
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
Old Style Multi-Lingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are the old style multi-lingual models ported from the OPUS-MT-Train repo: and the members of each language group:
.. code-block:: python
    ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
     'Helsinki-NLP/opus-mt-ROMANCE-en',
     'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
     'Helsinki-NLP/opus-mt-de-ZH',
     'Helsinki-NLP/opus-mt-en-CELTIC',
     'Helsinki-NLP/opus-mt-en-ROMANCE',
     'Helsinki-NLP/opus-mt-es-NORWAY',
     'Helsinki-NLP/opus-mt-fi-NORWAY',
     'Helsinki-NLP/opus-mt-fi-ZH',
     'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
     'Helsinki-NLP/opus-mt-sv-NORWAY',
     'Helsinki-NLP/opus-mt-sv-ZH']

    GROUP_MEMBERS = {
        'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
        'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
...
...
@@ -99,16 +130,22 @@ For example:
        'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }
Code to see available pretrained models:

.. code-block:: python
    from transformers.hf_api import HfApi
    model_list = HfApi().model_list()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
Example of translating english to many romance languages, using old-style 2 character language codes

.. code-block:: python
    from transformers import MarianMTModel, MarianTokenizer
    src_text = [
        '>>fr<< this is a sentence in english that we want to translate to french',
        '>>pt<< This should go to portuguese',
        '>>es<< And this to Spanish'
    ]

    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    print(tokenizer.supported_language_codes)

    model = MarianMTModel.from_pretrained(model_name)
    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text))
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    # ["c'est une phrase en anglais que nous voulons traduire en français",
    #  'Isto deve ir para o português.',
    #  'Y esto al español']
MarianConfig
...
...
docs/source/model_doc/mbart.rst
...
...
@@ -19,6 +19,13 @@ on the encoder, decoder, or reconstructing parts of the text.
The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__
Examples
_______________________________________________________________________________________________________________________
- Examples and scripts for fine-tuning mBART and other models for sequence to sequence tasks can be found in
`examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- Given the large embeddings table, mBART consumes a large amount of GPU RAM, especially for fine-tuning.
:class:`MarianMTModel` is usually a better choice for bilingual machine translation.
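A rough sketch of the size gap behind that note; the two checkpoint names below are assumptions for illustration, not figures from this diff:

.. code-block:: python

    from transformers import AutoModelForSeq2SeqLM

    # Assumed checkpoints: a full mBART model vs. a bilingual Marian model.
    mbart = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-cc25")
    marian = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ro")

    print(f"mBART parameters:  {mbart.num_parameters():,}")   # much larger, driven by the 250k-token embedding table
    print(f"Marian parameters: {marian.num_parameters():,}")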
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -38,11 +45,7 @@ the sequences for sequence-to-sequence fine-tuning.
example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"]
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels) #forward
model(input_ids=batch['input_ids'], labels=batch['labels']) # forward pass
- Generation
...
...
docs/source/model_doc/pegasus.rst
...
...
@@ -31,10 +31,19 @@ All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-
- Each checkpoint is 2.2 GB on disk and 568M parameters.
- FP16 is not supported (help/ideas on this appreciated!).
- Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU (see the usage sketch after these notes).
- For XSUM, the paper reports rouge1, rouge2, rougeL of 47.21/24.56/39.25. As of Aug 9, this port scores 46.91/24.34/39.1.
- Full replication results and correctly pre-processed data can be found in this `Issue <https://github.com/huggingface/transformers/issues/6844#issue-689259666>`__.
- `Distilled checkpoints <https://huggingface.co/models?search=distill-pegasus>`__ are described in this `paper <https://arxiv.org/abs/2010.13002>`__.
The gap is likely because of different alpha/length_penalty implementations in beam search.
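A minimal fp32 summarization sketch matching the timing note above; the :obj:`google/pegasus-xsum` checkpoint name and the input sentence are assumptions for illustration:

.. code-block:: python

    import torch
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    model_name = "google/pegasus-xsum"  # assumed xsum checkpoint
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)  # fp32, per the note above

    src_text = ["PG&E stated it scheduled the blackouts in response to forecasts for high winds."]
    batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")

    with torch.no_grad():
        summary_ids = model.generate(**batch)  # default generation parameters, as in the timing note
    print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))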
Examples
_______________________________________________________________________________________________________________________
- `Script <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune_pegasus_xsum.sh>`__ to fine-tune pegasus on the XSUM dataset. Data download instructions at `examples/seq2seq/ <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/README.md>`__.
- FP16 is not supported (help/ideas on this appreciated!).
- The adafactor optimizer is recommended for pegasus fine-tuning, as sketched below.
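A minimal sketch of wiring up Adafactor for pegasus fine-tuning; the checkpoint name and optimizer settings are assumptions, not values from this diff:

.. code-block:: python

    from transformers import PegasusForConditionalGeneration
    from transformers.optimization import Adafactor

    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")  # assumed checkpoint

    # With relative_step/warmup_init, Adafactor manages its own learning-rate schedule (an assumed recipe).
    optimizer = Adafactor(
        model.parameters(),
        lr=None,
        scale_parameter=True,
        relative_step=True,
        warmup_init=True,
    )

    # Inside a training loop: loss.backward(); optimizer.step(); optimizer.zero_grad()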
Implementation Notes
...
...
@@ -45,7 +54,7 @@ Implementation Notes
- Some key configuration differences:
- static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
- more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
...
...
tests/test_modeling_bart.py
...
...
@@ -476,9 +476,9 @@ class BartModelIntegrationTests(unittest.TestCase):
    @slow
    def test_bart_large_mask_filling(self):
        pbase = pipeline(task="fill-mask", model="facebook/bart-large")
        plarge = pipeline(task="fill-mask", model="facebook/bart-large")
        src_text = [" I went to the <mask>."]
        results = [x["token_str"] for x in pbase(src_text)]
        results = [x["token_str"] for x in plarge(src_text)]
        expected_results = ["Ġbathroom", "Ġgym", "Ġwrong", "Ġmovies", "Ġhospital"]
        self.assertListEqual(results, expected_results)
...
...