Unverified Commit 815ea8e8 authored by Aaron Jimenez, committed by GitHub

[Doc] Spanish translation of glossary.md (#27958)

* Add glossary to es/_toctree.yml

* Add glossary.md to es/

* A section translated

* B and C section translated

* Fix typo in en/glossary.md C section

* D section translated | Add an extra line in en/glossary.md

* E and F section translated | Fix typo in en/glossary.md

* Fix the word "preentrenado"

* H and I section translated | Fix typo in en/glossary.md

* L section translated

* M and N section translated

* P section translated

* R section translated

* S section translated

* T section translated

* U and Z section translated | Fix TensorParallel link in both files

* Fix word
parent 93766251
en/glossary.md

@@ -100,7 +100,7 @@ reading the whole sentence but using a mask inside the model to hide the future
### channel
-Color images are made up of some combination of values in three channels - red, green, and blue (RGB) - and grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
+Color images are made up of some combination of values in three channels: red, green, and blue (RGB) and grayscale images only have one channel. In 🤗 Transformers, the channel can be the first or last dimension of an image's tensor: [`n_channels`, `height`, `width`] or [`height`, `width`, `n_channels`].
### connectionist temporal classification (CTC)
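The channel-ordering convention in the hunk above is easy to check numerically. Below is a minimal sketch, assuming PyTorch; the image size is made up for illustration, showing the same data laid out channels-first and channels-last.

```python
import torch

# A toy RGB image stored channels-first: [n_channels, height, width]
image_chw = torch.rand(3, 224, 224)

# The same image rearranged channels-last: [height, width, n_channels]
image_hwc = image_chw.permute(1, 2, 0)

print(image_chw.shape)  # torch.Size([3, 224, 224])
print(image_hwc.shape)  # torch.Size([224, 224, 3])
```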
@@ -116,6 +116,7 @@ A type of layer in a neural network where the input matrix is multiplied element
Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.
Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
### decoder input IDs
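To make the DataParallel description in this hunk concrete, here is a minimal sketch assuming PyTorch and a toy linear model with placeholder sizes: `nn.DataParallel` replicates the module on every visible GPU, splits the batch across the replicas, and gathers the outputs back, mirroring the synchronization described above.

```python
import torch
from torch import nn

# Toy model; sizes are placeholders chosen for illustration.
model = nn.Linear(16, 4)
if torch.cuda.device_count() > 1:
    # Replicate the same setup on each GPU; each replica gets a distinct slice of the batch.
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

batch = torch.randn(8, 16, device=next(model.parameters()).device)
outputs = model(batch)  # replicas run in parallel, results are gathered here
print(outputs.shape)    # torch.Size([8, 4])
```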
@@ -165,8 +166,7 @@ embeddings `[batch_size, sequence_length, config.intermediate_size]` can account
use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
-individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n =
-sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
+individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
**equivalent** result.
For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output
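The equivalence described in this hunk can be verified with a short sketch. Everything below is illustrative: the layer sizes and chunk size are made up, and the import path for `apply_chunking_to_forward` is assumed to be `transformers.pytorch_utils` (it has moved between modules across releases).

```python
import torch
from torch import nn
from transformers.pytorch_utils import apply_chunking_to_forward

hidden = torch.randn(2, 128, 64)   # [batch_size, sequence_length, hidden_size]
dense_in = nn.Linear(64, 256)      # expand to the intermediate size
dense_out = nn.Linear(256, 64)     # project back to the hidden size

def feed_forward(chunk):
    # Position-wise computation: independent along the sequence dimension,
    # so it can run on smaller chunks and be concatenated afterward.
    return dense_out(torch.relu(dense_in(chunk)))

# chunk_size=32 processes the 128 positions in 4 slices along dim=1,
# trading a bit of extra time for a smaller peak memory footprint.
chunked = apply_chunking_to_forward(feed_forward, 32, 1, hidden)
full = feed_forward(hidden)
print(torch.allclose(chunked, full))  # True: the results are mathematically equivalent
```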
@@ -187,7 +187,7 @@ The model head refers to the last layer of a neural network that accepts the raw
* [`GPT2ForSequenceClassification`] is a sequence classification head - a linear layer - on top of the base [`GPT2Model`].
* [`ViTForImageClassification`] is an image classification head - a linear layer on top of the final hidden state of the `CLS` token - on top of the base [`ViTModel`].
-* [`Wav2Vec2ForCTC`] ia a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
+* [`Wav2Vec2ForCTC`] is a language modeling head with [CTC](#connectionist-temporal-classification-(CTC)) on top of the base [`Wav2Vec2Model`].
## I
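The base-model-plus-head split that this hunk describes can be seen directly by loading the same checkpoint with and without a task head; the checkpoint name below is just an example.

```python
from transformers import AutoModel, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # example checkpoint

base = AutoModel.from_pretrained(checkpoint)  # transformer body only: returns hidden states
with_head = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # adds a freshly initialized classification head on top
)

print(type(base).__name__)       # DistilBertModel
print(type(with_head).__name__)  # DistilBertForSequenceClassification
```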
@@ -232,9 +232,7 @@ is added for "RA" and "M":
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
```
-These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
-the sentence to the tokenizer, which leverages the Rust implementation of [🤗
-Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
+These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding the sentence to the tokenizer, which leverages the Rust implementation of [🤗 Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
```python
>>> inputs = tokenizer(sequence)
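# --- Illustrative continuation, not part of the commit diff ---
# Assuming `tokenizer` and `sequence` are the BERT tokenizer and example sentence
# defined earlier in the glossary, the returned `input_ids` are plain integers
# and can be decoded back to text (special tokens included).
>>> print(inputs["input_ids"])
>>> print(tokenizer.decode(inputs["input_ids"]))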
@@ -383,7 +381,7 @@ self-supervised objective, which can be reading the text and trying to predict t
modeling](#causal-language-modeling)) or masking some words and trying to predict them (see [masked language
modeling](#masked-language-modeling-mlm)).
Speech and vision models have their own pretraining objectives. For example, Wav2Vec2 is a speech model pretrained on a contrastive task which requires the model to identify the "true" speech representation from a set of "false" speech representations. On the other hand, BEiT is a vision model pretrained on a masked image modeling task which masks some of the image patches and requires the model to predict the masked patches (similar to the masked language modeling objective).
## R
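As a quick, hedged illustration of the masked language modeling objective referenced in this hunk, a fill-mask pipeline shows a pretrained BERT-style model recovering a hidden token; the checkpoint and the example sentence are arbitrary choices.

```python
from transformers import pipeline

# A model pretrained with masked language modeling can fill in a [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("A Titan RTX has 24GB of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```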
@@ -518,7 +516,7 @@ A form of model training in which data provided to the model is not labeled. Uns
### Zero Redundancy Optimizer (ZeRO)
-Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensorparallel--tp-),
+Parallelism technique which performs sharding of the tensors somewhat similar to [TensorParallel](#tensor-parallelism-tp),
except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
\ No newline at end of file
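For readers who want to see what the ZeRO sharding and offloading described above look like in practice, here is a minimal sketch of passing a DeepSpeed ZeRO stage-2 configuration to `TrainingArguments`; it assumes DeepSpeed is installed, and the batch size and output directory are placeholders.

```python
from transformers import TrainingArguments

# ZeRO stage 2 shards optimizer states and gradients across GPUs;
# offloading the optimizer to CPU compensates for limited GPU memory.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=8,  # placeholder
    deepspeed=ds_config,            # Trainer will launch training through DeepSpeed
)
```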
es/_toctree.yml

@@ -75,6 +75,8 @@
- sections:
  - local: philosophy
    title: Filosofía
+  - local: glossary
+    title: Glosario
  - local: pad_truncation
    title: Relleno y truncamiento
  - local: bertology
The diff for the new es/glossary.md file is collapsed.