Unverified commit f5e8c9bd authored by Nathan Cooper, committed by GitHub

Update readme with how to train offline and fix BPE command (#15897)



* Update readme with how to train offline and fix BPE command

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
@@ -58,8 +58,8 @@ During preprocessing the dataset is downloaded and stored locally as well as cac
## Tokenizer
Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command:
```bash
python scripts/bpe_training.py \
--base_tokenizer gpt2 \
--dataset_name lvwerra/codeparrot-clean-train
```
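For reference, the heart of that step is a `train_new_from_iterator` call on the GPT-2 tokenizer. Below is a minimal sketch of the idea, assuming the dataset exposes the raw code under a `content` column and using an illustrative vocabulary size; the authoritative logic lives in `scripts/bpe_training.py`.
```python
# Minimal sketch of BPE tokenizer training, not the exact script.
# Assumptions: the dataset has a `content` column with raw source code,
# and the vocab_size value is illustrative only.
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the cleaned training split so nothing is fully loaded into memory.
dataset = load_dataset("lvwerra/codeparrot-clean-train", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    """Yield batches of code strings for the tokenizer trainer."""
    batch = []
    for example in dataset:
        batch.append(example["content"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Start from the GPT-2 tokenizer and learn a new BPE vocabulary on code.
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32768)
new_tokenizer.save_pretrained("codeparrot-tokenizer")
```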
@@ -113,6 +113,32 @@ Recall that you can see the full set of possible options with descriptions (for
python scripts/codeparrot_training.py --help
```
Instead of streaming the dataset from the hub you can also stream it from disk. This can be helpful for long training runs where the connection might occasionally be interrupted. To stream locally, simply clone the dataset repositories and replace the dataset names with their paths. In this example we store the data in a folder called `data`:
```bash
git lfs install
mkdir data
git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-train
git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid
```
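You can sanity-check a local clone by pointing `datasets.load_dataset` at its path instead of the Hub name. This is a sketch of the idea; the exact loading call used by `scripts/codeparrot_training.py` may differ.
```python
from datasets import load_dataset

# Load from the local clone rather than the Hub; streaming keeps memory usage low.
# Assumes the clone at ./data/codeparrot-clean-train contains the dataset's data files.
train_data = load_dataset("./data/codeparrot-clean-train", split="train", streaming=True)

# Inspect one example to confirm the data is readable.
print(next(iter(train_data)))
```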
And then pass the paths to the datasets when we run the training script:
```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt lvwerra/codeparrot-small \
--dataset_name_train ./data/codeparrot-clean-train \
--dataset_name_valid ./data/codeparrot-clean-valid \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000
```
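Note that `accelerate launch` uses whatever hardware setup you configured for Accelerate; if you have not set this up yet, running `accelerate config` once beforehand lets you pick the number of GPUs and mixed-precision settings.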
## Evaluation
For evaluating the language modeling loss on the validation set or any other dataset you can use the following command:
```bash
@@ -158,4 +184,4 @@ Give the model a shot yourself! There are two demos to interact with CodeParrot
## Further Resources
A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).
This example was provided by [Leandro von Werra](www.github.com/lvwerra).
\ No newline at end of file