Unverified Commit c73ee133 authored by Aaron Jimenez, committed by GitHub

[docs] Spanish translation of tokenizer_summary.md (#31154)

* add tokenizer_summary to es/_toctree.yml

* add tokenizer_summary to es/

* fix link to Transformer XL in en/

* translate until Subword tokenization section

* fix GPT link in en/

* fix other GPT link in en/

* fix typo in en/

* translate the doc

* run make fixup

* Remove .md in Transformer XL link

* fix some link issues in es/

* fix typo
parent 8a1a23ae
@@ -73,7 +73,7 @@ As can be seen space and punctuation tokenization, as well as rule-based tokeniz
punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined
as splitting sentences into words. While it's the most intuitive way to split texts into smaller chunks, this
tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization
-usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transformerxl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
+usually generates a very big vocabulary (the set of all unique words and tokens used). *E.g.*, [Transformer XL](model_doc/transfo-xl) uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
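To make the hunk above concrete, here is a minimal sketch of why word-level (space and punctuation) tokenization inflates the vocabulary: every distinct surface form gets its own entry. The two example sentences and the regex are illustrative only, not taken from the docs.

```python
import re
from collections import Counter

corpus = [
    "Don't you love 🤗 Transformers?",
    "We sure do. Tokenizing text is a core task of NLP.",
]

# Naive word-level tokenization: split on whitespace, then peel punctuation off.
word_counts = Counter(
    tok for line in corpus for tok in re.findall(r"\w+|[^\w\s]", line)
)

# Each distinct surface form ("Tokenizing", "tokenized", "tokenizer", ...) would
# become its own vocabulary entry; on a large corpus this is how word-level
# models such as Transformer XL end up with 267,735 entries.
print(len(word_counts), "unique word-level tokens")
```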
@@ -142,7 +142,7 @@ on.
Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et
al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into
words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm),
-[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
+[FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/openai-gpt) which uses
spaCy and ftfy, to count the frequency of each word in the training corpus.
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
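The pre-tokenization and frequency-counting step described in this hunk feeds directly into BPE's merge loop. Below is a rough, self-contained sketch of that loop with toy word frequencies; it is not the actual implementation used by the 🤗 libraries.

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from character-level symbols for every pre-tokenized word.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_freqs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] = freq
        words = new_words
    return merges

# Word frequencies as produced by space pre-tokenization of a toy corpus.
freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(bpe_train(freqs, num_merges=3))
# [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```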
@@ -195,7 +195,7 @@ the symbol `"m"` is not in the base vocabulary. In general, single letters such
to happen for very special characters like emojis.
As mentioned earlier, the vocabulary size, *i.e.* the base vocabulary size + the number of merges, is a hyperparameter
-to choose. For instance [GPT](model_doc/gpt) has a vocabulary size of 40,478 since they have 478 base characters
+to choose. For instance [GPT](model_doc/openai-gpt) has a vocabulary size of 40,478 since they have 478 base characters
and chose to stop training after 40,000 merges.
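If 🤗 Transformers is installed and the checkpoint can be downloaded, the 40,478 figure from the hunk above can be checked directly; `vocab_size` here reflects the base symbols plus the learned merges.

```python
from transformers import AutoTokenizer

# GPT's BPE vocabulary: 478 base characters + 40,000 merges = 40,478 entries.
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
print(tokenizer.vocab_size)  # 40478
```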
#### Byte-level BPE
@@ -268,7 +268,7 @@ $$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
All tokenization algorithms described so far have the same problem: It is assumed that the input text uses spaces to
separate words. However, not all languages use spaces to separate words. One possible solution is to use language
-specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer).
+specific pre-tokenizers, *e.g.* [XLM](model_doc/xlm) uses a specific Chinese, Japanese, and Thai pre-tokenizer.
To solve this problem more generally, [SentencePiece: A simple and language independent subword tokenizer and
detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) treats the input
as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or unigram
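As a quick illustration of the raw-stream behaviour this hunk describes, a SentencePiece-based tokenizer such as XLNet's keeps the space as part of the token (shown as "▁"), so the original text can be reconstructed exactly. A minimal sketch, assuming 🤗 Transformers and sentencepiece are installed; the exact pieces depend on the pretrained vocabulary.

```python
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")

# Spaces survive tokenization as the "▁" symbol, so decoding simply concatenates
# the pieces and replaces "▁" with a space.
print(tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do."))
# ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transformers', '?', '▁We', '▁sure', '▁do', '.']
```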
...
@@ -92,6 +92,8 @@
  title: Lo que 🤗 Transformers puede hacer
- local: tasks_explained
  title: Como los 🤗 Transformers resuelven tareas
+- local: tokenizer_summary
+  title: Descripción general de los tokenizadores
- local: attention
  title: Mecanismos de atención
- local: pad_truncation
...