chenpangpang/transformers · Commit fa5423b1 (Unverified)
Authored Jul 08, 2020 by Stas Bekman, committed by GitHub on Jul 08, 2020

doc fixes (#5613)

Parent: 7d0ef004
Showing 1 changed file with 8 additions and 8 deletions

docs/source/tokenizer_summary.rst (+8 / -8)
@@ -52,7 +52,7 @@ size of 267,735!
 A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
 TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
-transformers model rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
+transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
 language.
 So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
@@ -69,7 +69,7 @@ decomposed as "annoying" and "ly". This is especially useful in agglutinative la
 form (almost) arbitrarily long complex words by stringing together some subwords.
 This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
-subwords. This also gives the ability to the model to process words it has never seen before, by decomposing them into
+subwords. This also enables the model to process words it has never seen before, by decomposing them into
 subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
 this:
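For readers following along outside the file, the tokenization that sentence leads into can be reproduced in a few lines. The ``bert-base-uncased`` checkpoint is an assumption here (it is the usual vocabulary for this class), and the split shown in the comment is indicative:

    from transformers import BertTokenizer

    # "bert-base-uncased" is assumed; any BERT WordPiece vocabulary behaves similarly.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.tokenize("I have a new GPU!"))
    # Typically prints: ['i', 'have', 'a', 'new', 'gp', '##u', '!']
    # "gpu" is not in the vocabulary, so it is decomposed into the known subwords 'gp' and '##u'.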
@@ -110,7 +110,7 @@ splitting the training data into words, which can be a simple space tokenization
 (:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
 (:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
-:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy) and, counts the frequency of each word in the training corpus.
+:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
 It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
 vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
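The loop this hunk describes (count symbol pairs, merge the most frequent pair into a new token, repeat until the vocabulary reaches the chosen size) can be sketched in plain Python. The toy word counts and the target size of 10 below are illustrative assumptions, not values from the library:

    from collections import Counter

    # Made-up word frequencies in the spirit of the document's running example.
    word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
    splits = {w: list(w) for w in word_freqs}          # start from single characters
    vocab = sorted({c for w in word_freqs for c in w}) # base vocabulary: every character seen
    target_size = 10                                   # hyperparameter: desired vocabulary size

    while len(vocab) < target_size:
        # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
        pair_freqs = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)     # BPE merges the most frequent pair
        vocab.append("".join(best))
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged

    print(vocab)

On these made-up counts the first merges come out as 'ug', 'un' and 'hug', giving a vocabulary of characters plus a handful of learned subwords.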
@@ -178,7 +178,7 @@ WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/b
 `this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
 on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
 progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
 frequent but the one that will maximize the likelihood on the corpus once merged.
 What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
 having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
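As a hedged sketch of the selection rule this hunk describes, the only change relative to the BPE loop above is how the best pair is picked: instead of raw pair frequency, the score is the likelihood gain, which for a pair (a, b) amounts to freq(ab) / (freq(a) * freq(b)). The helper below is hypothetical, written to drop into the toy BPE sketch, not a function from the library:

    from collections import Counter

    def wordpiece_best_pair(splits, word_freqs):
        """Pick the merge maximizing freq(ab) / (freq(a) * freq(b)), i.e. the likelihood gain."""
        pair_freqs, symbol_freqs = Counter(), Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for s in symbols:
                symbol_freqs[s] += freq
            for a, b in zip(symbols, symbols[1:]):
                pair_freqs[(a, b)] += freq
        return max(pair_freqs,
                   key=lambda p: pair_freqs[p] / (symbol_freqs[p[0]] * symbol_freqs[p[1]]))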
@@ -217,7 +217,7 @@ training corpus. You can then give a probability to each tokenization (which is
 tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
 of the tokenization according to their probabilities).
-Those probabilities are what are used to define the loss that trains the tokenizer: if our corpus consists of the
+Those probabilities define the loss that trains the tokenizer: if our corpus consists of the
 words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
 tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
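The formula that sentence leads into sits just outside the hunk; written out, the corpus negative log-likelihood under the unigram model is something along the lines of

.. math::
    \mathcal{L} = -\sum_{i=1}^{N} \log \left( \sum_{x \in S(x_{i})} p(x) \right)

where :math:`p(x)` is the probability the unigram model assigns to a tokenization :math:`x`.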
@@ -229,15 +229,15 @@ tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is d
 SentencePiece
 =============
-All the methods we have been looking at so far required some from of pretrokenization, which has a central problem: not
+All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
 all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
 pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
 SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
 includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
 That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
-some '▁' characters, that represent spaces. Decoding a tokenized text is then super easy: we just have to concatenate
-all of them together and replace those '▁' by spaces.
+the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate
+all of them together and replace '▁' with space.
 All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
 :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
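A minimal sketch of the '▁' convention and the concatenate-then-replace decoding this hunk describes; the ``xlnet-base-cased`` checkpoint and the token split shown in the comment are assumptions for illustration:

    from transformers import XLNetTokenizer

    # "xlnet-base-cased" is assumed; any SentencePiece-based tokenizer shows the same '▁' markers.
    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    tokens = tokenizer.tokenize("I have a new GPU!")
    print(tokens)  # something like ['▁I', '▁have', '▁a', '▁new', '▁GPU', '!'] (exact split may vary)

    # Decoding really is just concatenation plus replacing '▁' with a space.
    print("".join(tokens).replace("▁", " ").strip())  # 'I have a new GPU!'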