Commit a55dc157 (unverified), authored Jun 15, 2021 by Sylvain Gugger, committed by GitHub Jun 15, 2021

Add video links to the documentation (#12162)

parent 04028317
Showing 7 changed files with 167 additions and 26 deletions:

- docs/source/glossary.rst (+26, -6)
- docs/source/model_sharing.rst (+18, -0)
- docs/source/model_summary.rst (+26, -2)
- docs/source/preprocessing.rst (+12, -0)
- docs/source/quicktour.rst (+18, -4)
- docs/source/tokenizer_summary.rst (+43, -14)
- docs/source/training.rst (+24, -0)
docs/source/glossary.rst
...
@@ -55,6 +55,12 @@ Input IDs
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model*.
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
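The concrete snippet is elided in this view; as a minimal sketch of the pattern (the checkpoint and sentence are
illustrative, not part of this diff):

.. code-block:: python

    >>> from transformers import BertTokenizer

    >>> # Load the WordPiece vocabulary that ships with the checkpoint
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    >>> # ``tokenize`` splits the text into tokens; calling the tokenizer also maps them to input ids
    >>> tokenizer.tokenize("A Titan RTX has 24GB of VRAM")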
...
@@ -120,8 +126,15 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The attention mask is an optional argument used when batching sequences together.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/M6adb1j2jPI" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

This argument indicates to the model which tokens should be attended to, and which should not.

For example, consider these two sequences:
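The two example sequences are elided here; a hedged sketch of the padding pattern the section describes (the sequences
are illustrative):

.. code-block:: python

    >>> from transformers import BertTokenizer

    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    >>> # Padding the shorter sequence adds pad tokens; the attention mask marks them with 0
    >>> batch = tokenizer(
    ...     ["This is a short sequence.", "This is a rather long sequence, longer than the first one."], padding=True
    ... )
    >>> batch["attention_mask"]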
...
@@ -175,10 +188,17 @@ in the dictionary returned by the tokenizer under the key "attention_mask":
Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some models' purpose is to do classification on pairs of sentences or question answering.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT
model builds its two sequence input as such:
.. code-block::
...
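The elided block shows the ``[CLS] ... [SEP] ... [SEP]`` pattern; as an illustrative sketch (the sentences are made
up):

.. code-block:: python

    >>> from transformers import BertTokenizer

    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    >>> # Passing two sequences makes the tokenizer join them as [CLS] A [SEP] B [SEP]
    >>> encoded = tokenizer("HuggingFace is based in NYC", "Where is HuggingFace based?")
    >>> encoded["token_type_ids"]  # 0 for the first sequence, 1 for the second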
docs/source/model_sharing.rst
...
@@ -16,6 +16,12 @@ Model sharing and uploading
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the `model hub <https://huggingface.co/models>`__.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/XvSGPZFEjDY" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

.. note::

    You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
...
@@ -77,6 +83,12 @@ token that you can just copy.
Directly push your model to the hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Z1-XMy-GNLQ" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
fine-tuned model you saved in :obj:`save_directory` by calling:
...
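The elided call is, in sketch form (the repository name is hypothetical, and ``finetuned_model`` is assumed from the
previous steps):

.. code-block:: python

    >>> # Pushes the saved weights and config to a repository under your account
    >>> finetuned_model.push_to_hub("my-awesome-model")
    >>> tokenizer.push_to_hub("my-awesome-model")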
@@ -152,6 +164,12 @@ or
Use your terminal and git
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/rkCly_cbMBk" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

...
docs/source/model_summary.rst
...
@@ -28,6 +28,12 @@ Each one of the models in the library falls into one of the following categories
* :ref:`multimodal-models`
* :ref:`retrieval-based-models`

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
...
@@ -54,12 +60,18 @@ Multimodal models mix text inputs with other kinds (e.g. images) and are more sp
.. _autoregressive-models:

Decoders or autoregressive models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
that at each position, the model can only look at the tokens before the attention heads.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/d_ixlCubqQw" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

Original GPT
-----------------------------------------------------------------------------------------------------------------------
...
@@ -215,13 +227,19 @@ multiple choice classification and question answering.
.. _autoencoding-models:

Encoders or autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/MUqNwgPjJvQ" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
BERT
-----------------------------------------------------------------------------------------------------------------------
...
@@ -526,6 +544,12 @@ Sequence-to-sequence models
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/0_4KEb08xrE" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
BART
-----------------------------------------------------------------------------------------------------------------------

...
docs/source/preprocessing.rst
...
@@ -39,6 +39,12 @@ To automatically download the vocab used during pretraining or fine-tuning a giv
Base use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
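A hedged sketch of that call (the checkpoint and sentence are illustrative):

.. code-block:: python

    >>> from transformers import AutoTokenizer

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    >>> # The __call__ method returns the input ids, plus the attention mask and token type ids for BERT
    >>> encoded_input = tokenizer("Hello, I'm a single sentence!")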
...
@@ -138,6 +144,12 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
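A hedged sketch of how a pair is passed (the sentences are illustrative, and the tokenizer is the one loaded above):

.. code-block:: python

    >>> # Pass the two sentences as two arguments (not a list) so they are joined with the special tokens
    >>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
    >>> encoded_input["token_type_ids"]  # lets BERT models tell the two sentences apart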
...
docs/source/quicktour.rst
...
@@ -28,8 +28,15 @@ will dig a little bit more and see how the library gives you access to those mod
Getting started on a task with a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
...
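The rest of the task list is elided here; as a quick illustrative sketch of running one of these pipelines (the input
sentence is made up):

.. code-block:: python

    >>> from transformers import pipeline

    >>> # Downloads a default pretrained model for the task and runs it on the input
    >>> classifier = pipeline("sentiment-analysis")
    >>> classifier("We are very happy to show you the 🤗 Transformers library.")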
@@ -137,8 +144,15 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
Under the hood: pretrained models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's now see what happens beneath the hood when using those pipelines.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/AhChOFRegn4" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

As we saw, the model and tokenizer are created using the :obj:`from_pretrained` method:
.. code-block::

...
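The actual snippet is elided; a sketch of the pattern (the checkpoint name is illustrative):

.. code-block:: python

    >>> from transformers import AutoModelForSequenceClassification, AutoTokenizer

    >>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    >>> # Downloads (or reads from cache) the pretrained weights and the matching tokenizer
    >>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
    >>> tokenizer = AutoTokenizer.from_pretrained(model_name)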
docs/source/tokenizer_summary.rst
...
@@ -13,12 +13,20 @@
Summary of the tokenizers
-----------------------------------------------------------------------------------------------------------------------
On this page, we will have a closer look at tokenization.

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

As we saw in :doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or
subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is
straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding
(BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`, and :ref:`SentencePiece <sentencepiece>`, and show examples
of which tokenizer type is used by which model.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
...
@@ -28,8 +36,15 @@ Introduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."``

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/nhJxYji1aho" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

A simple way of tokenizing this text is to split it by spaces, which would give:

.. code-block::
...
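The elided result is just Python's whitespace split; for illustration:

.. code-block:: python

    >>> "Don't you love 🤗 Transformers? We sure do.".split()
    ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']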
@@ -69,16 +84,30 @@ Such a big vocabulary size forces the model to have an enormous embedding matrix
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
greater than 50,000, especially if they are pretrained only on a single language.
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/ssLq_EK2jLE" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

While character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder
for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
representation for the letter ``"t"`` is much harder than learning a context-independent representation for the word
``"today"``. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/zHvTiHr506c" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
...
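To see subword splits in practice, a short sketch (the exact splits depend on the learned vocabulary; the ``##`` prefix
marks a continuation piece):

.. code-block:: python

    >>> from transformers import BertTokenizer

    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> # Words outside the vocabulary are decomposed into known subwords
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']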
docs/source/training.rst
...
@@ -27,6 +27,12 @@ negative. For examples of other tasks, refer to the :ref:`additional-resources`
Preparing the datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/_BZearw7f0w" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

We will use the `🤗 Datasets <https://github.com/huggingface/datasets/>`__ library to download and preprocess the IMDB
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
to the 🤗 Datasets `documentation <https://huggingface.co/docs/datasets/>`__ or the :doc:`preprocessing` tutorial for
...
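The elided loading step boils down to one call; as a sketch (the variable name is illustrative):

.. code-block:: python

    >>> from datasets import load_dataset

    >>> # Downloads the IMDB dataset and returns its train/test splits
    >>> raw_datasets = load_dataset("imdb")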
@@ -95,6 +101,12 @@ them by their `full` equivalent to train or evaluate on the full dataset.
Fine-tuning in PyTorch with the Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/nvBXf7s7vTI" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
logging, gradient accumulation, and mixed precision.
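A hedged sketch of the Trainer pattern (assumes ``model`` and the tokenized datasets from the elided steps;
``"test_trainer"`` is just an output directory name):

.. code-block:: python

    >>> from transformers import Trainer, TrainingArguments

    >>> training_args = TrainingArguments("test_trainer")  # output directory, default hyper-parameters
    >>> trainer = Trainer(
    ...     model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
    ... )
    >>> trainer.train()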
...
@@ -200,6 +212,12 @@ See the documentation of :class:`~transformers.TrainingArguments` for more optio
Fine-tuning with Keras
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/rnTGBy2ax1c" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:

.. code-block:: python
...
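The elided definition follows this pattern (a sketch; checkpoint, label count, and hyper-parameters are illustrative):

.. code-block:: python

    >>> import tensorflow as tf
    >>> from transformers import TFAutoModelForSequenceClassification

    >>> # Load pretrained weights with a fresh classification head, then compile as any Keras model
    >>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    >>> model.compile(
    ...     optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    ...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    ... )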
@@ -257,6 +275,12 @@ as a PyTorch model (or vice-versa):
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Dh9CL8fyG80" title="YouTube video player"
    frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
    picture-in-picture" allowfullscreen></iframe>

You might need to restart your notebook at this stage to free some memory, or execute the following code:
.. code-block:: python
...
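The elided snippet frees memory along these lines (object names depend on the previous steps):

.. code-block:: python

    >>> import torch

    >>> # Drop references to the previous objects and release cached GPU memory
    >>> del model
    >>> del trainer
    >>> torch.cuda.empty_cache()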