Unverified Commit 721ee783 authored by Klaus Hipp, committed by GitHub

[Docs] Fix spelling and grammar mistakes (#28825)

* Fix typos and grammar mistakes in docs and examples

* Fix typos in docstrings and comments

* Fix spelling of `tokenizer` in model tests

* Remove erroneous spaces in decorators

* Remove extra spaces in Markdown link texts
parent 2418c64a
@@ -67,7 +67,7 @@ python run_qa.py \
 <Tip warning={true}>
-For PyTorch >= 1.14.0, JIT-mode could benefit any model for prediction and evaluaion since the dict input is supported in `jit.trace`.
+For PyTorch >= 1.14.0, JIT-mode could benefit any model for prediction and evaluation since the dict input is supported in `jit.trace`.
 For PyTorch < 1.14.0, JIT-mode could benefit a model if its forward parameter order matches the tuple input order in `jit.trace`, such as a question-answering model. If the forward parameter order does not match the tuple input order in `jit.trace`, like a text classification model, `jit.trace` will fail and we are capturing this with the exception here to make it fallback. Logging is used to notify users.
...
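The fallback this doc describes boils down to a try/except around `jit.trace`; a minimal sketch (the function name and setup are illustrative, not the example script's actual code):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def trace_with_fallback(model, example_inputs):
    """Try torch.jit.trace; fall back to eager mode if tracing fails."""
    try:
        return torch.jit.trace(model, example_inputs, strict=False)
    except Exception as err:
        # e.g. the forward parameter order does not match the tuple input order
        logger.warning("jit.trace failed, falling back to eager mode: %s", err)
        return model
```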
@@ -166,7 +166,7 @@ Note that instead of applying this to a whole class, you can apply it to the rel
 # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
 ```
-Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` insted of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the follwoing syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment:
+Sometimes the copy is exactly the same except for names: for instance in `RobertaAttention`, we use `RobertaSelfAttention` insted of `BertSelfAttention` but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment:
 ```py
 # Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta
...
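As a hedged illustration of how the replacement syntax is used in a model file (treat this as a sketch of the pattern, not the exact file contents):

```py
from torch import nn


# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->Roberta
class RobertaSelfOutput(nn.Module):
    # The body must stay identical to BertSelfOutput with every "Bert"
    # replaced by "Roberta"; `make fix-copies` re-syncs it when the
    # original changes.
    ...
```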
@@ -36,7 +36,7 @@ Try AWQ quantization with this [notebook](https://colab.research.google.com/driv
 [Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation.
-There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the processs is similar for llm-awq quantized models.
+There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
 Make sure you have autoawq installed:
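Beyond the lines shown, loading an autoawq-quantized checkpoint is a plain `from_pretrained` call; a minimal sketch (the model id is one public AWQ checkpoint, used purely as an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"  # example AWQ checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config saved with the checkpoint is detected automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```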
@@ -214,7 +214,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="aut
 <Tip warning={true}>
-Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [faceboook/opt-350m]() model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.
+Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m]() model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.
 </Tip>
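Quantizing from scratch, as the tip describes, looks roughly like this (a sketch using the documented `GPTQConfig` API; the dataset and bit width are typical choices, not requirements):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration on the "c4" dataset runs during loading; this is the slow part.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```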
@@ -583,7 +583,7 @@ The speed and throughput of fused and unfused modules were also tested with the
 <div class="flex gap-4">
   <div>
     <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="generate throughput per batch size" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">foward peak memory/batch size</figcaption>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory/batch size</figcaption>
   </div>
   <div>
     <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="forward latency per batch size" />
...
@@ -42,7 +42,7 @@ In this guide, you'll learn how to:
 - [Prompted image captioning](#prompted-image-captioning)
 - [Few-shot prompting](#few-shot-prompting)
 - [Visual question answering](#visual-question-answering)
-- [Image classificaiton](#image-classification)
+- [Image classification](#image-classification)
 - [Image-guided text generation](#image-guided-text-generation)
 - [Run inference in batch mode](#running-inference-in-batch-mode)
 - [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use)
...
@@ -108,8 +108,7 @@ For masked language modeling, the next step is to load a DistilRoBERTa tokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
 ```
-You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to e
-xtract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process#flatten) method:
+You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process#flatten) method:
 ```py
 >>> eli5 = eli5.flatten()
...
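What `flatten` does here, as a sketch (assuming the `eli5` dataset still loads the way that guide loads it; the slice size is arbitrary):

```python
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:100]")
# Nested columns become dotted top-level columns after flattening:
# example["answers"]["text"]  ->  example["answers.text"]
eli5 = eli5.flatten()
print(eli5[0]["answers.text"])
```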
@@ -976,7 +976,7 @@ Some decorators like `@parameterized` rewrite test names, therefore `@slow` and
 `@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage:
 ```python no-style
-@parameteriz ed.expand(...)
+@parameterized.expand(...)
 @slow
 def test_integration_foo():
 ```
...
@@ -143,7 +143,7 @@ Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare W
 al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into
 words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm),
 [FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses
-Spacy and ftfy, to count the frequency of each word in the training corpus.
+spaCy and ftfy, to count the frequency of each word in the training corpus.
 After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the
 training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set
...
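A toy sketch of these two steps, space pre-tokenization with frequency counting followed by building the symbol-level base vocabulary (the corpus is invented for illustration):

```python
from collections import Counter

corpus = "hug pug pun bun hug hug pug pun"  # toy training data
# Pre-tokenization: plain space tokenization, then count word frequencies
word_freqs = Counter(corpus.split())
# Base vocabulary: every symbol that occurs in the set of unique words
base_vocab = sorted({ch for word in word_freqs for ch in word})
print(word_freqs)  # Counter({'hug': 3, 'pug': 2, 'pun': 2, 'bun': 1})
print(base_vocab)  # ['b', 'g', 'h', 'n', 'p', 'u']
```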
@@ -17,7 +17,7 @@ This page gathers the resources developed by the community about 🤗 Tran
 | Notebook | Description | Author | |
 |:----------|:-------------|:-------------|------:|
 | [Fine-tuning a pre-trained Transformer to generate song lyrics](https://github.com/AlekseyKorshuk/huggingartists) | How to generate song lyrics in the style of your favorite artist by fine-tuning a GPT-2 model. | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) |
-| [Training T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook shows how to solve the "Question Answering" task using Tensorflow 2 and SQUAD. | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
+| [Training T5 in Tensorflow 2](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook shows how to solve the "Question Answering" task using Tensorflow 2 and SQUAD. | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
 | [Training T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and NLP. | [Suraj Patil](https://github.com/patil-suraj) |[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
 | [Fine-tuning T5 for classification and multiple choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for multiple-choice classification tasks, using a text-to-text format, with PyTorch Lightning. | [Suraj Patil](https://github.com/patil-suraj) | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
 | [Fine-tuning DialoGPT on new datasets and languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune a DialoGPT model on a new dataset for open-dialog conversational chatbots. | [Nathan Cooper](https://github.com/ncoop57) | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
...
@@ -430,7 +430,7 @@ def _init_weights(self, module):
 ```py
 def _init_weights(self, module):
     """Initialize the weights"""
-    if isinstnace(module, Wav2Vec2ForPreTraining):
+    if isinstance(module, Wav2Vec2ForPreTraining):
         module.project_hid.reset_parameters()
         module.project_q.reset_parameters()
         module.project_hid._is_hf_initialized = True
...
@@ -2135,7 +2135,7 @@ train_batch_size = 1 * world_size
 # - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
 # - which params should remain on gpus - the larger the value the smaller the offload size
 #
-# For indepth info on Deepspeed config see
+# For in-depth info on Deepspeed config see
 # https://huggingface.co/docs/transformers/main/main_classes/deepspeed
 # keeping the same format as json for consistency, except it uses lower case for true/false
...
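These comments precede a ZeRO-3 config dict; a minimal sketch of the relevant keys (the values are illustrative, the keys are standard DeepSpeed ZeRO-3 options):

```python
world_size = 1  # set from the launcher in the real script
train_batch_size = 1 * world_size

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Parameters smaller than this (in number of elements) stay on GPU,
        # so a larger value means a smaller offload size.
        "stage3_param_persistence_threshold": 1_000_000,
    },
    "train_batch_size": train_batch_size,
}
```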
@@ -904,7 +904,7 @@ RUN_SLOW=1 pytest tests
 ```python no-style
-@parameteriz ed.expand(...)
+@parameterized.expand(...)
 @slow
 def test_integration_foo():
 ```
...
@@ -369,7 +369,7 @@ def _init_weights(self, module):
 ```py
 def _init_weights(self, module):
     """Initialize the weights"""
-    if isinstnace(module, Wav2Vec2ForPreTraining):
+    if isinstance(module, Wav2Vec2ForPreTraining):
         module.project_hid.reset_parameters()
         module.project_q.reset_parameters()
         module.project_hid._is_hf_initialized = True
...
@@ -1982,7 +1982,7 @@ train_batch_size = 1 * world_size
 # - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
 # - which params should remain on gpus - the larger the value the smaller the offload size
 #
-# For indepth info on Deepspeed config see
+# For in-depth info on Deepspeed config see
 # https://huggingface.co/docs/transformers/main/main_classes/deepspeed
 # keeping the same format as json for consistency, except it uses lower case for true/false
...
@@ -449,7 +449,7 @@ are 8 TPU cores on 4 chips (each chips has 2 cores), while "8 GPU" are 8 GPU chi
 For comparison one can run the same pre-training with PyTorch/XLA on TPU. To set up PyTorch/XLA on Cloud TPU VMs, please
 refer to [this](https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm) guide.
-Having created the tokenzier and configuration in `norwegian-roberta-base`, we create the following symbolic links:
+Having created the tokenizer and configuration in `norwegian-roberta-base`, we create the following symbolic links:
 ```bash
 ln -s ~/transformers/examples/pytorch/language-modeling/run_mlm.py ./
@@ -499,7 +499,7 @@ python3 xla_spawn.py --num_cores ${NUM_TPUS} run_mlm.py --output_dir="./runs" \
 For comparison you can run the same pre-training with PyTorch on GPU. Note that we have to make use of `gradient_accumulation`
 because the maximum batch size that fits on a single V100 GPU is 32 instead of 128.
-Having created the tokenzier and configuration in `norwegian-roberta-base`, we create the following symbolic links:
+Having created the tokenizer and configuration in `norwegian-roberta-base`, we create the following symbolic links:
 ```bash
 ln -s ~/transformers/examples/pytorch/language-modeling/run_mlm.py ./
...
@@ -674,7 +674,7 @@ def main():
             raise ValueError("--do_train requires a train dataset")
         train_dataset = raw_datasets["train"]
         if data_args.max_train_samples is not None:
-            # We will select sample from whole data if agument is specified
+            # We will select sample from whole data if argument is specified
             max_train_samples = min(len(train_dataset), data_args.max_train_samples)
             train_dataset = train_dataset.select(range(max_train_samples))
         # Create train feature from dataset
...
@@ -62,7 +62,7 @@ from transformers.utils.versions import require_version
 # Will error if the minimal version of Transformers is not installed. Remove at your own risk.
 check_min_version("4.38.0.dev0")
-require_version("datasets>=2.14.0", "To fix: pip install -r examples/flax/speech-recogintion/requirements.txt")
+require_version("datasets>=2.14.0", "To fix: pip install -r examples/flax/speech-recognition/requirements.txt")
 logger = logging.getLogger(__name__)
...
@@ -330,7 +330,7 @@ def main():
     # Initialize datasets and pre-processing transforms
     # We use torchvision here for faster pre-processing
-    # Note that here we are using some default pre-processing, for maximum accuray
+    # Note that here we are using some default pre-processing, for maximum accuracy
     # one should tune this part and carefully select what transformations to use.
     normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
     train_dataset = torchvision.datasets.ImageFolder(
...
@@ -148,7 +148,7 @@ def train(args, train_dataset, model, tokenizer):
     # Check if continuing training from a checkpoint
     if os.path.exists(args.model_name_or_path):
         try:
-            # set global_step to gobal_step of last saved checkpoint from model path
+            # set global_step to global_step of last saved checkpoint from model path
             checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
             global_step = int(checkpoint_suffix)
             epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
@@ -166,7 +166,7 @@ def train(args, train_dataset, model, tokenizer):
     train_iterator = trange(
         epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
     )
-    # Added here for reproductibility
+    # Added here for reproducibility
     set_seed(args)
     for _ in train_iterator:
@@ -705,7 +705,7 @@ def main():
     if args.local_rank == -1 or args.no_cuda:
         device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
         args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
-    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
+    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
         torch.cuda.set_device(args.local_rank)
         device = torch.device("cuda", args.local_rank)
         torch.distributed.init_process_group(backend="nccl")
...
@@ -338,7 +338,7 @@ def train(args, train_dataset, model, tokenizer):
     tr_loss, logging_loss = 0.0, 0.0
     model.zero_grad()
     train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
-    set_seed(args)  # Added here for reproductibility
+    set_seed(args)  # Added here for reproducibility
     for _ in train_iterator:
         epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
         for step, batch in enumerate(epoch_iterator):
@@ -538,7 +538,7 @@ def main():
         default=1,
         help="Number of updates steps to accumulate before performing a backward/update pass.",
     )
-    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
+    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
     parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
     parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
     parser.add_argument(
@@ -612,7 +612,7 @@ def main():
     if args.local_rank == -1 or args.no_cuda:
         device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
         args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
-    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
+    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
         torch.cuda.set_device(args.local_rank)
         device = torch.device("cuda", args.local_rank)
         torch.distributed.init_process_group(backend="nccl")
...
@@ -321,7 +321,7 @@ For example,
 ./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro
 ./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
 ```
-splits `wmt_en_ro/train` into 11,197 uneven lengthed batches and can finish 1 epoch in 8 minutes on a v100.
+splits `wmt_en_ro/train` into 11,197 uneven length batches and can finish 1 epoch in 8 minutes on a v100.
 For comparison,
 ```bash
...