@@ -22,8 +22,7 @@ ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tu
loss. XLNet uses permutation language modeling (PLM); you can find more information about the differences between those
objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).
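To make the difference between these objectives concrete, here is a minimal Python sketch (not taken from the scripts themselves; the checkpoint and sentence are arbitrary placeholders) of the labels each objective produces with the 🤗 Transformers data collator:
```python
# Minimal sketch (not part of the example scripts) of the labels each objective
# produces; the checkpoint and sentence below are arbitrary placeholders.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
features = [tokenizer("Language modeling predicts missing or upcoming tokens.")]

# Masked LM (BERT-style): ~15% of tokens are replaced by the mask token, and the
# labels are -100 everywhere except at the masked positions, so only those are scored.
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
print(mlm_collator(features)["labels"])

# Causal LM (GPT-style): the labels are the input ids themselves, and the model is
# scored on predicting every token from the tokens before it.
clm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
print(clm_collator(features)["labels"])
```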
There are two sets of scripts provided. The first set leverages the Trainer API. The second set, with `no_trainer` in the suffix, uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
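For reference, the `no_trainer` scripts follow the standard 🤗 Accelerate pattern. The sketch below is only illustrative (it assumes `model`, `optimizer` and `train_dataloader` have already been built) and is not a substitute for the full scripts:
```python
# Illustrative Accelerate-based loop; `model`, `optimizer` and `train_dataloader`
# are assumed to already exist and are not defined here.
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # instead of loss.backward(), so the loop runs on CPU/GPU/TPU alike
    optimizer.step()
    optimizer.zero_grad()
```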
**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).
...
...
@@ -60,6 +59,15 @@ python run_clm.py \
--output_dir /tmp/test-clm
```
This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_clm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:
```bash
python run_clm_no_trainer.py \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--model_name_or_path gpt2 \
--output_dir /tmp/test-clm
```
### RoBERTa/BERT/DistilBERT and masked language modeling
...
...
@@ -95,6 +103,16 @@ python run_mlm.py \
If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).
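For example, assuming a local text file with one sample per line (the file path below is just a placeholder), the flag can be combined with `--train_file`:
```bash
python run_mlm.py \
--model_name_or_path roberta-base \
--train_file path_to_your_file.txt \
--line_by_line \
--do_train \
--output_dir /tmp/test-mlm
```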
`run_mlm.py` uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_mlm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:
```bash
python run_mlm_no_trainer.py \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--model_name_or_path roberta-base \
--output_dir /tmp/test-mlm
```
**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make