Unverified commit f5e8c9bd authored by Nathan Cooper, committed by GitHub

Update readme with how to train offline and fix BPE command (#15897)



* Update readme with how to train offline and fix BPE command

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
@@ -58,8 +58,8 @@ During preprocessing the dataset is downloaded and stored locally as well as cac
## Tokenizer
Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command:
```bash
python scripts/bpe_training.py \
--base_tokenizer gpt2 \
--dataset_name lvwerra/codeparrot-clean-train
```
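For reference, the heart of that step is a `train_new_from_iterator` call on the GPT-2 tokenizer. Below is a minimal sketch of the idea, assuming the dataset exposes the raw code under a `content` column and using an illustrative vocabulary size; the authoritative logic lives in `scripts/bpe_training.py`.
```python
# Minimal sketch of BPE tokenizer training, not the exact script.
# Assumptions: the dataset has a `content` column with raw source code,
# and the vocab_size value is illustrative only.
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the cleaned training split so nothing is fully loaded into memory.
dataset = load_dataset("lvwerra/codeparrot-clean-train", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    """Yield batches of code strings for the tokenizer trainer."""
    batch = []
    for example in dataset:
        batch.append(example["content"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Start from the GPT-2 tokenizer and learn a new BPE vocabulary on code.
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32768)
new_tokenizer.save_pretrained("codeparrot-tokenizer")
```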
@@ -113,6 +113,32 @@ Recall that you can see the full set of possible options with descriptions (for
python scripts/codeparrot_training.py --help
```
Instead of streaming the dataset from the hub you can also stream it from disk. This can be helpful for long training runs where the connection might occasionally be interrupted. To stream locally, simply clone the dataset repositories and replace the dataset names with their paths. In this example we store the data in a folder called `data`:
```bash
git lfs install
mkdir data
git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-train
git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid
```
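You can sanity-check a local clone by pointing `datasets.load_dataset` at its path instead of the Hub name. This is a sketch of the idea; the exact loading call used by `scripts/codeparrot_training.py` may differ.
```python
from datasets import load_dataset

# Load from the local clone rather than the Hub; streaming keeps memory usage low.
# Assumes the clone at ./data/codeparrot-clean-train contains the dataset's data files.
train_data = load_dataset("./data/codeparrot-clean-train", split="train", streaming=True)

# Inspect one example to confirm the data is readable.
print(next(iter(train_data)))
```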
And then pass the paths to the datasets when we run the training script:
```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt lvwerra/codeparrot-small \
--dataset_name_train ./data/codeparrot-clean-train \
--dataset_name_valid ./data/codeparrot-clean-valid \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000
```
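Note that `accelerate launch` uses whatever hardware setup you configured for Accelerate; if you have not set this up yet, running `accelerate config` once beforehand lets you pick the number of GPUs and mixed-precision settings.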
## Evaluation
For evaluating the language modeling loss on the validation set or any other dataset you can use the following command:
```bash
@@ -158,4 +184,4 @@ Give the model a shot yourself! There are two demos to interact with CodeParrot
## Further Resources
A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).
This example was provided by [Leandro von Werra](www.github.com/lvwerra).
\ No newline at end of file