Unverified Commit cdfa56af authored by Patrick von Platen, committed by GitHub

[Tokenizer Doc] Improve tokenizer summary (#8622)

* improve summary

* small fixes

* cleaned line length

* correct "" formatting

* apply sylvains suggestions
parent 2f9d49b3
Summary of the tokenizers
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
that the model uses :ref:`WordPiece <wordpiece>`.

Introduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
this text is to split it by spaces, which would give:

.. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

This is a sensible first step, but if we look at the tokens ``"Transformers?"`` and ``"do."``, we notice that the
punctuation is attached to the words ``"Transformers"`` and ``"do"``, which is suboptimal. We should take the
punctuation into account so that a model does not have to learn a different representation of a word and every possible
punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
Taking punctuation into account, tokenizing our exemplary text would give:

.. code-block::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

Better. One thing that is still suboptimal, however, is how the tokenization dealt with the word ``"Don't"``.
``"Don't"`` stands for ``"do not"``, so it would be better tokenized as ``["Do", "n't"]``. This is where things start
getting complicated, and part of the reason each model has its own tokenizer type. Depending on the rules we apply for
tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs
properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.

`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. Applying them to our example, *spaCy* and *Moses* would output something like:

.. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

As can be seen, space and punctuation tokenization, as well as rule-based tokenization, are used here. Space and
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, :doc:`Transformer XL
<model_doc/transformerxl>` uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
increases both memory and time complexity. In general, transformers models rarely have a vocabulary size greater than
50,000, especially if they are pretrained only on a single language.

So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for
the model to learn meaningful input representations. *E.g.*, learning a meaningful context-independent representation
for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.

Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance, ``"annoyingly"`` might be
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
stand-alone subwords would appear more frequently, while at the same time the meaning of ``"annoyingly"`` is kept by
the composite meaning of ``"annoying"`` and ``"ly"``. This is especially useful in agglutinative languages such as
Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
context-independent representations. In addition, subword tokenization enables the model to process words it has never
seen before, by decomposing them into known subwords. For instance, the :class:`~transformers.BertTokenizer` tokenizes
``"I have a new GPU!"`` as follows:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.tokenize("I have a new GPU!")
    ["i", "have", "a", "new", "gp", "##u", "!"]

Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ``["i",
"have", "a", "new"]`` are present in the tokenizer's vocabulary, but the word ``"gpu"`` is not. Consequently, the
tokenizer splits ``"gpu"`` into known subwords: ``["gp", "##u"]``. ``"##"`` means that the rest of the token should be
attached to the previous one, without space (for decoding or reversal of the tokenization).

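Reattaching the ``"##"`` tokens is what the tokenizer's ``convert_tokens_to_string`` method does when decoding, as this
small sketch (reusing the tokenizer instantiated above) shows:

.. code-block::

    >>> tokenizer.convert_tokens_to_string(["i", "have", "a", "new", "gp", "##u", "!"])
    "i have a new gpu !"
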
As another example, :class:`~transformers.XLNetTokenizer` tokenizes our exemplary text from earlier as follows:

.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]

We'll get back to the meaning of those ``"▁"`` characters when we look at :ref:`SentencePiece <sentencepiece>`. As one
can see, the rare word ``"Transformers"`` has been split into the more frequent subwords ``"Transform"`` and ``"ers"``.

Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained
on.

.. _byte-pair-encoding:

Byte-Pair Encoding (BPE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding (BPE) was introduced in `Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015) <https://arxiv.org/abs/1508.07909>`__. BPE relies on a pre-tokenizer that splits the training data into
words. Pre-tokenization can be as simple as space tokenization, e.g. :doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta
<model_doc/roberta>`. More advanced pre-tokenization includes rule-based tokenization, e.g. :doc:`XLM <model_doc/xlm>`
and :doc:`FlauBERT <model_doc/flaubert>`, which use Moses for most languages, or :doc:`GPT <model_doc/gpt>`, which uses
Spacy and ftfy to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
define before training the tokenizer.

As an example, let's assume that after pre-tokenization, the following set of words including their frequencies has
been determined:

.. code-block::

    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Consequently, the base vocabulary is ``["b", "g", "h", "n", "p", "s", "u"]``. Splitting all words into symbols of the
base vocabulary, we obtain:

.. code-block::

    ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
the example above, ``"h"`` followed by ``"u"`` is present `10 + 5 = 15` times (10 times in the 10 occurrences of
``"hug"``, 5 times in the 5 occurrences of ``"hugs"``). However, the most frequent symbol pair is ``"u"`` followed by
``"g"``, occurring `10 + 5 + 5 = 20` times in total. Thus, the first merge rule the tokenizer learns is to group all
``"u"`` symbols followed by a ``"g"`` symbol together. Next, ``"ug"`` is added to the vocabulary. The set of words then
becomes

.. code-block::

    ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

BPE then identifies the next most common symbol pair. It's ``"u"`` followed by ``"n"``, which occurs 16 times. ``"u"``,
``"n"`` is merged to ``"un"`` and added to the vocabulary. The next most frequent symbol pair is ``"h"`` followed by
``"ug"``, occurring 15 times. Again the pair is merged and ``"hug"`` can be added to the vocabulary.

At this stage, the vocabulary is ``["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`` and our set of unique words
is represented as

.. code-block::

    ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)

Assuming that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied
to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
the word ``"bug"`` would be tokenized to ``["b", "ug"]`` but ``"mug"`` would be tokenized as ``["<unk>", "ug"]`` since
the symbol ``"m"`` is not in the base vocabulary. In general, single letters such as ``"m"`` are not replaced by the
``"<unk>"`` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
to happen for very special characters like emojis.

As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
to choose. For instance, :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop training after 40,000 merges.

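To make the merge loop concrete, here is a minimal, illustrative BPE training sketch on the toy word frequencies from
above. It is not how 🤗 Transformers tokenizers are trained in practice (that is handled by the separate ``tokenizers``
library), and all function names are made up for this example:

.. code-block::

    import collections

    def get_pair_counts(corpus):
        """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
        pair_counts = collections.Counter()
        for symbols, freq in corpus:
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        return pair_counts

    def apply_merge(corpus, pair):
        """Replace every occurrence of `pair` by the new merged symbol."""
        merged_corpus = []
        for symbols, freq in corpus:
            new_symbols, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            merged_corpus.append((new_symbols, freq))
        return merged_corpus

    # the toy corpus from above, with each word split into base-vocabulary symbols
    corpus = [(list(word), freq)
              for word, freq in [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]]
    vocab = sorted({symbol for symbols, _ in corpus for symbol in symbols})  # base vocabulary

    num_merges = 3  # desired vocabulary size minus the base vocabulary size
    for _ in range(num_merges):
        pair_counts = get_pair_counts(corpus)
        best_pair = max(pair_counts, key=pair_counts.get)
        corpus = apply_merge(corpus, best_pair)
        vocab.append(best_pair[0] + best_pair[1])

    print(vocab)  # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
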
Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are
considered as base characters. To have a better base vocabulary, `GPT-2
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ uses bytes
as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2's
tokenizer can tokenize every text without the need for the ``"<unk>"`` symbol. :doc:`GPT-2 <model_doc/gpt2>` has a
vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols
learned with 50,000 merges.

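As a quick sketch (assuming the pretrained ``gpt2`` checkpoint can be downloaded), the size of the byte-level
vocabulary can be checked directly, and text containing emojis round-trips without ever producing an unknown token:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    >>> len(tokenizer)  # 256 byte tokens + 50,000 merges + 1 end-of-text token
    50257
    >>> tokens = tokenizer.tokenize("I love 🤗 Transformers!")
    >>> tokenizer.convert_tokens_to_string(tokens)
    "I love 🤗 Transformers!"
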
.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
Voice Search (Schuster et al., 2012)
<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
equivalent to finding the symbol pair whose probability divided by the probabilities of its first symbol followed by
its second symbol is the greatest among all symbol pairs. *E.g.*, ``"u"`` followed by ``"g"`` would only have been
merged if the probability of ``"ug"`` divided by the probabilities of ``"u"`` and ``"g"`` had been greater than for any
other symbol pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it `loses` by
merging two symbols to ensure it's `worth it`.

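As an illustrative sketch of this criterion on the toy word frequencies from the BPE section (the ``score`` function
below is a common simplification of the WordPiece objective, not the original implementation, which was never
released):

.. code-block::

    import collections

    # the toy corpus from the BPE section, with each word split into its current symbols
    corpus = [(["h", "u", "g"], 10), (["p", "u", "g"], 5), (["p", "u", "n"], 12),
              (["b", "u", "n"], 4), (["h", "u", "g", "s"], 5)]

    symbol_counts, pair_counts = collections.Counter(), collections.Counter()
    for symbols, freq in corpus:
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq

    def score(pair):
        """Pair frequency normalized by the frequencies of its parts."""
        first, second = pair
        return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

    best_pair = max(pair_counts, key=score)
    print(best_pair, score(best_pair))
    # prints ('g', 's') 0.05; BPE's raw-frequency criterion would instead pick ('u', 'g')
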
.. _unigram:

Unigram
=======================================================================================================================

Unigram is a subword tokenization algorithm introduced in `Subword Regularization: Improving Neural Network Translation
Models with Multiple Subword Candidates (Kudo, 2018) <https://arxiv.org/pdf/1804.10959.pdf>`__. In contrast to BPE or
WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims it down to
obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and the most
common substrings. Unigram is not used directly for any of the models in 🤗 Transformers, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.

At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then
removes the p percent of symbols whose loss increase is the lowest (with p usually being 10% or 20%), *i.e.* those
symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.

Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:

.. code-block::

    ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],

``"hugs"`` could be tokenized as ``["hug", "s"]``, ``["h", "ug", "s"]`` or ``["h", "u", "g", "s"]``. So which one to
choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that the
probability of each possible tokenization can be computed after training. In practice, the algorithm simply picks the
most likely tokenization, but it also offers the possibility to sample one of the possible tokenizations according to
its probability.

Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
the words :math:`x_{1}, \dots, x_{N}` and that the set of all possible tokenizations for a word :math:`x_{i}` is
defined as :math:`S(x_{i})`, then the overall loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )

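As an illustrative sketch with made-up token probabilities, the probability of each possible tokenization of
``"hugs"`` under the vocabulary above, the tokenization Unigram would pick, and the contribution of ``"hugs"`` to the
loss can be computed as follows:

.. code-block::

    import math

    # hypothetical unigram probabilities saved alongside the vocabulary
    token_probs = {"h": 0.05, "u": 0.04, "g": 0.03, "s": 0.06, "ug": 0.08, "hug": 0.1}

    # all tokenizations of "hugs" that only use tokens from the vocabulary above
    candidates = [["hug", "s"], ["h", "ug", "s"], ["h", "u", "g", "s"]]

    # the probability of a tokenization is the product of its token probabilities
    probs = {tuple(c): math.prod(token_probs[t] for t in c) for c in candidates}
    # probs is approximately {('hug', 's'): 6.0e-03, ('h', 'ug', 's'): 2.4e-04, ('h', 'u', 'g', 's'): 3.6e-06}

    best = max(probs, key=probs.get)             # the tokenization picked in practice
    loss_term = -math.log(sum(probs.values()))   # the contribution of "hugs" to the loss above
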
.. _sentencepiece:

SentencePiece
=======================================================================================================================

All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use language
specific pre-tokenizers, *e.g.* :doc:`XLM <model_doc/xlm>` uses a specific Chinese, Japanese, and Thai pre-tokenizer.
To solve this problem more generally, `SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al., 2018) <https://arxiv.org/pdf/1808.06226.pdf>`__ treats the input
as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
algorithm to construct the appropriate vocabulary.

The :class:`~transformers.XLNetTokenizer` uses SentencePiece, for example, which is also why in the example earlier the
``"▁"`` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be
concatenated and ``"▁"`` is replaced by a space.

All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>`, :doc:`Marian
<model_doc/marian>`, and :doc:`T5 <model_doc/t5>`.

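As a small sketch of the decoding step, reusing the XLNet tokenizer from the earlier example:

.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    >>> tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    >>> tokenizer.convert_tokens_to_string(tokens)  # concatenate and replace "▁" by a space
    "Don't you love 🤗 Transformers? We sure do."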