[Tokenizer Doc] Improve tokenizer summary (#8622)

* improve summary * small fixes * cleaned line length * correct "" formatting * apply sylvains suggestions

[Tokenizer Doc] Improve tokenizer summary (#8622)
* improve summary * small fixes * cleaned line length * correct "" formatting * apply sylvains suggestions
cdfa56af · Patrick von Platen · GitHub · 2f9d49b3 · cdfa56af
Unverified Commit cdfa56af authored Nov 18, 2020 by Patrick von Platen Committed by GitHub Nov 18, 2020
Hide whitespace changes
Inline Side-by-side

Showing with 162 additions and 139 deletions

docs/source/tokenizer_summary.rst docs/source/tokenizer_summary.rst +162 -139

No files found.
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
-Tokenizer summary
+Summary of the tokenizers
 -----------------------------------------------------------------------------------------------------------------------

-In this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
-<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids. The second
-part is pretty straightforward, here we will focus on the first part. More specifically, we will look at the three main
-different kinds of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`,
-:ref:`WordPiece <wordpiece>` and :ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of
-those.
+On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
+<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
+look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
+text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
+tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
+and :ref:`SentencePiece <sentencepiece>`, and show exemplary which tokenizer type is used by which model.

-Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
-algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
-using :ref:`WordPiece <wordpiece>`.
+Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
+type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
+that the model uses :ref:`WordPiece <wordpiece>`.

-Introduction to tokenization
+Introduction
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
-instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing this
-text is just to split it by spaces, which would give:
+Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
+For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
+this text is to split it by spaces, which would give:

 .. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

-This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
-will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
-into account. This would give:
+This is a sensible first step, but if we look at the tokens ``"Transformers?"`` and ``"do."``, we notice that the
+punctuation is attached to the words ``"Transformer"`` and ``"do"``, which is suboptimal. We should take the
+punctuation into account so that a model does not have to learn a different representation of a word and every possible
+punctuation symbol that could follow it, which would explode the number of representations the model has to learn.
+Taking punctuation into account, tokenizing our exemplary text would give:

 .. code-block::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

-which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
-it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
-part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
-into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
-perform properly if you don't use the exact same rules as the persons who pretrained it.
+Better. However, it is disadvantageous, how the tokenization dealt with the word ``"Don't"``. ``"Don't"`` stands for
+``"do not"``, so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting complicated, and
+part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a
+different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an
+input that was tokenized with the same rules that were used to tokenize its training data.

 `spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
-rule-based tokenizers. On the text above, they'd output something like:
+rule-based tokenizers. Applying them on our example, *spaCy* and *Moses* would output something like:

 .. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

-Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
-sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
-you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used). :doc:`Transformer
-XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary size of 267,735!
+As can be seen space and punctuation tokenization, as well as rule-based tokenization, is used here. Space and
+punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
+as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
+tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
+usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, :doc:`Transformer XL
+<model_doc/transformerxl>` uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

-A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
-TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
-transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
-language.
+Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
+causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
+greater than 50,000, especially if they are pretrained only on a single language.

-So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
-While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
-as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
-all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
+So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
+character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
+the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
+for the letter ``"t"`` is much harder as learning a context-independent representation for the word ``"today"``.
+Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
+transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.

 Subword tokenization
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
-should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
-decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
-form (almost) arbitrarily long complex words by stringing together some subwords.
+Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
+subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
+considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
+stand-alone subwords would appear more frequently while at the same time the meaning of ``"annoyingly"`` is kept by the
+composite meaning of ``"annoying"`` and ``"ly"``. This is especially useful in agglutinative languages such as Turkish,
+where you can form (almost) arbitrarily long complex words by stringing together subwords.

-This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
-subwords. This also enables the model to process words it has never seen before, by decomposing them into subwords it
-knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like this:
+Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful
+context-independent representations. In addition, subword tokenization enables the model to process words it has never
+seen before, by decomposing them into known subwords. For instance, the :class:`~transformers.BertTokenizer` tokenizes
+``"I have a new GPU!"`` as follows:

 .. code-block::

    >>> from transformers import BertTokenizer
-    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> tokenizer.tokenize("I have a new GPU!")
-    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
+    ["i", "have", "a", "new", "gp", "##u", "!"]

-Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
-vocabulary of the tokenizer, except for "gpu", so the tokenizer splits it in subwords it knows: "gp" and "##u". The
-"##" means that the rest of the token should be attached to the previous one, without space (for when we need to decode
-predictions and reverse the tokenization).
+Because we are considering the uncased model, the sentence was lowercased first. We can see that the words ``["i",
+"have", "a", "new"]`` are present in the tokenizer's vocabulary, but the word ``"gpu"`` is not. Consequently, the
+tokenizer splits ``"gpu"`` into known subwords: ``["gp" and "##u"]``. ``"##"`` means that the rest of the token should
+be attached to the previous one, without space (for decoding or reversal of the tokenization).

-Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
+As another example, :class:`~transformers.XLNetTokenizer` tokenizes our previously exemplary text as follows:

 .. code-block::

    >>> from transformers import XLNetTokenizer
-    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
+    >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
-    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
+    ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]

-We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
-Transformers has been split into "Transform" and "ers".
+We'll get back to the meaning of those ``"▁"`` when we look at :ref:`SentencePiece <sentencepiece>`. As one can see,
+the rare word ``"Transformers"`` has been split into the more frequent subwords ``"Transform"`` and ``"ers"``.

-Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
-training which is usually done on the corpus the corresponding model will be trained on.
+Let's now look at how the different subword tokenization algorithms work. Note that all of those tokenization
+algorithms rely on some form of training which is usually done on the corpus the corresponding model will be trained
+on.

 .. _byte-pair-encoding:

-Byte-Pair Encoding
+Byte-Pair Encoding (BPE)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
-splitting the training data into words, which can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
-:doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` use
-Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
+Byte-Pair Encoding (BPE) was introduced in `Neural Machine Translation of Rare Words with Subword Units (Sennrich et
+al., 2015) <https://arxiv.org/abs/1508.07909>`__. BPE relies on a pre-tokenizer that splits the training data into
+words. Pretokenization can be as simple as space tokenization, e.g. :doc:`GPT-2 <model_doc/gpt2>`, :doc:`Roberta
+<model_doc/roberta>`. More advanced pre-tokenization include rule-based tokenization, e.g. :doc:`XLM <model_doc/xlm>`,
+:doc:`FlauBERT <model_doc/flaubert>` which uses Moses for most languages, or :doc:`GPT <model_doc/gpt>` which uses
+Spacy and ftfy, to count the frequency of each word in the training corpus.

-:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
+After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the
+training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
+of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until
+the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to
+define before training the tokenizer.

-It then begins from the list of all characters and will learn merge rules to form a new token from two symbols in the
-vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
-
-Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
-word):
+As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been
+determined:

 .. code-block::

-    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
+    ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

-Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
+Consequently, the base vocabulary is ``["b", "g", "h", "n", "p", "s", "u"]``. Splitting all words into symbols of the
+base vocabulary, we obtain:

 .. code-block::

-    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
+    ("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

-We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
-times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
-`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
-then it adds 'ug' to the vocabulary. Our corpus then becomes
+BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In
+the example above ``"h"`` followed by ``"u"`` is present `10 + 5 = 15` times (10 times in the 10 occurrences of
+``"hug"``, 5 times in the 5 occurrences of "hugs"). However, the most frequent symbol pair is ``"u"`` followed by "g",
+occurring `10 + 5 + 5 = 20` times in total. Thus, the first merge rule the tokenizer learns is to group all ``"u"``
+symbols followed by a ``"g"`` symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes

 .. code-block::

-    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
+    ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

-and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
-and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
-to the vocabulary.
+BPE then identifies the next most common symbol pair. It's ``"u"`` followed by ``"n"``, which occurs 16 times. ``"u"``,
+``"n"`` is merged to ``"un"`` and added to the vocabulary. The next most frequent symbol pair is ``"h"`` followed by
+``"ug"``, occurring 15 times. Again the pair is merged and ``"hug"`` can be added to the vocabulary.

-At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
-represented as
+At this stage, the vocabulary is ``["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`` and our set of unique words
+is represented as

 .. code-block::

-    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
+    ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)

-If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters
-that were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be
-tokenized as ``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general
-(since the base corpus uses all of them), but to special characters like emojis.
+Assuming, that the Byte-Pair Encoding training would stop at this point, the learned merge rules would then be applied
+to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance,
+the word ``"bug"`` would be tokenized to ``["b", "ug"]`` but ``"mug"`` would be tokenized as ``["<unk>", "ug"]`` since
+the symbol ``"m"`` is not in the base vocabulary. In general, single letters such as ``"m"`` are not replaced by the
+``"<unk>"`` symbol because the training data usually includes at least one occurrence of each letter, but it is likely
+to happen for very special characters like emojis.

-As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
+As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
 to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
-and chose to stop the training of the tokenizer at 40,000 merges.
+and chose to stop training after 40,000 merges.

 Byte-level BPE
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
-all unicode characters, the `GPT-2 paper
-<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ introduces a
-clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some additional rules to
-deal with punctuation, this manages to be able to tokenize every text without needing an unknown token. For instance,
-the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens,
-a special end-of-text token and the symbols learned with 50,000 merges.
+A base vocabulary that includes all possible base characters can be quite large if *e.g.* all unicode characters are
+considered as base characters. To have a better base vocabulary, `GPT-2
+<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__ uses bytes
+as the base vocabulary, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that
+every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT2's
+tokenizer can tokenize every text without the need for the <unk> symbol. :doc:`GPT-2 <model_doc/gpt>` has a vocabulary
+size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned
+with 50,000 merges.

 .. _wordpiece:

 WordPiece
 =======================================================================================================================

-WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as :doc:`DistilBERT
-<model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in `this paper
-<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies on the same
-base as BPE, which is to initialize the vocabulary to every character present in the corpus and progressively learn a
-given number of merge rules, the difference is that it doesn't choose the pair that is the most frequent but the one
-that will maximize the likelihood on the corpus once merged.
-
-What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
-having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
-subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
-sure it's `worth it`.
+WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>`, :doc:`DistilBERT
+<model_doc/distilbert>`, and :doc:`Electra <model_doc/electra>`. The algorithm was outlined in `Japanese and Korean
+Voice Seach (Schuster et al., 2012)
+<https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__ and is very similar to
+BPE. WordPiece first initializes the vocabulary to include every character present in the training data and
+progressively learn a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent
+symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
+
+So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is
+equivalent to finding the symbol pair, whose probability divided by the probabilities of its first symbol followed by
+its second symbol is the greatest among all symbol pairs. *E.g.* ``"u"``, followed by ``"g"`` would have only been
+merged if the probability of ``"ug"`` divided by ``"u"``, ``"g"`` would have been greater than for any other symbol
+pair. Intuitively, WordPiece is slightly different to BPE in that it evaluates what it `loses` by merging two symbols
+to make ensure it's `worth it`.

 .. _unigram:

 Unigram
 =======================================================================================================================

-Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
-Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
-from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
-progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
-with :ref:`SentencePiece <sentencepiece>`.
+Unigram is a subword tokenization algorithm introduced in `Subword Regularization: Improving Neural Network Translation
+Models with Multiple Subword Candidates (Kudo, 2018) <https://arxiv.org/pdf/1804.10959.pdf>`__. In contrast to BPE or
+WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each
+symbol to obtain a smaller vocabulary. The base vocabulary could for instance correspond to all pre-tokenized words and
+the most common substrings. Unigram is not used directly for any of the models in the transformers, but it's used in
+conjunction with :ref:`SentencePiece <sentencepiece>`.

-More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
-for each subword, evaluate how much the loss would increase if the subword was removed from the vocabulary. It then
-sorts the subwords by this quantity (that represents how much worse the loss becomes if the token is removed) and
-removes all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary
-has reached the desired size, always keeping the base characters (to be able to tokenize any word written with them,
-like BPE or WordPiece).
+At each training step, the Unigram algorithm defines a loss (often defined as the log-likelihood) over the training
+data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm
+computes how much the overall loss would increase if the symbol was to be removed from the vocabulary. Unigram then
+removes p (with p usually being 10% or 20%) percent of the symbols whose loss increase is the lowest, *i.e.* those
+symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has
+reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized.

-Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
-tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
-vocabulary
+Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of
+tokenizing new text after training. As an example, if a trained Unigram tokenizer exhibits the vocabulary:

 .. code-block::

-    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
+    ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],

-we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
-one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
-training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
-tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
-of the tokenization according to their probabilities).
+``"hugs"`` could be tokenized both as ``["hug", "s"]``, ``["h", "ug", "s"]`` or ``["h", "u", "g", "s"]``. So which one
+to choose? Unigram saves the probability of each token in the training corpus on top of saving the vocabulary so that
+the probability of each possible tokenization can be computed after training. The algorithm simply picks the most
+likely tokenization in practice, but also offers the possibility to sample a possible tokenization according to their
+probabilities.

-Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words :math:`x_{1}, \dots,
-x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible tokenizations of
-:math:`x_{i}` (with the current vocabulary), then the loss is defined as
+Those probabilities are defined by the loss the tokenizer is trained on. Assuming that the training data consists of
+the words :math:`x_{1}, \dots, x_{N}` and that the set of all possible tokenizations for a word :math:`x_{i}` is
+defined as :math:`S(x_{i})`, then the overall loss is defined as

 .. math::
    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
@@ -227,15 +247,18 @@ x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all
 SentencePiece
 =======================================================================================================================

-All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
-all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
-pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
-SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
-includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
-
-That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
-the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate all
-of them together and replace '▁' with space.
-
-All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
-:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
+All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
+separate words. However, not all languages use spaces to separate words. One possible solution is to use language
+specific pre-tokenizers, *e.g.* :doc:`XLM <model_doc/xlm>` uses a specific Chinese, Japanese, and Thai pre-tokenizer).
+To solve this problem more generally, `SentencePiece: A simple and language independent subword tokenizer and
+detokenizer for Neural Text Processing (Kudo et al., 2018) <https://arxiv.org/pdf/1808.06226.pdf>`__ treats the input
+as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
+algorithm to construct the appropriate vocabulary.
+
+The :class:`~transformers.XLNetTokenizer` uses SentencePiece for example, which is also why in the example earlier the
+``"▁"`` character was included in the vocabulary. Decoding with SentencePiece is very easy since all tokens can just be
+concatenated and ``"▁"`` is replaced by a space.
+
+All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
+using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>`, :doc:`Marian
+<model_doc/marian>`, and :doc:`T5 <model_doc/t5>`.