Unverified Commit 6d67837f authored by Tanay Mehta's avatar Tanay Mehta Committed by GitHub

Add Fill-in-the-middle training objective example - PyTorch (#27464)

* add: initial script to train clm fim

* fix: if training model from scratch, new tokens will be added and embeddings resized

* fix: fixed attention_mask errors when generating FIM data

* fix: file formatted using black

* add: run_fim_no_trainer.py and fixed some comments in run_fim.py

* add: added fim examples to the README.md and ran code fixup

* fix: little bug in both fim training scripts

* fix: remove comment from notebook and added a note on fim related params

* fix: minor typo in README

* add: suggested minor changes to README and run_fim.py

* add: gradient_accumulation_steps and gradient_checkpointing args

* add: improved model embedding resizing

* add: pad_to_multiple_of and attn_implementation params

* add: requested minor changes

* add: deepspeed zero compatibility

* add: resize embeddings layer with zero3 support for fim model initialization
parent d80c9a34
@@ -73,6 +73,57 @@ python run_clm_no_trainer.py \
--output_dir /tmp/test-clm
```
### GPT-2/GPT and causal language modeling with fill-in-the-middle objective
The following example fine-tunes GPT-2 on WikiText-2, using the Fill-in-the-Middle (FIM) training objective. The FIM objective was proposed in [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255), which showed that autoregressive language models can learn to infill text after a straightforward transformation is applied to the dataset: a span of text is simply moved from the middle of a document to its end.
We're using the raw WikiText-2 (no tokens were replaced before tokenization). The loss here is that of causal language modeling.
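Conceptually, the transformation that produces a FIM training example is simple. Below is a minimal Python sketch of the idea; the sentinel-token strings and the helper name are illustrative, not the exact ones the script uses:
```python
import random

rng = random.Random(0)

# Illustrative sentinel tokens; the scripts add their own special tokens
# (and resize the embeddings) when these are not already in the tokenizer.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

def fim_transform(text: str, fim_rate: float = 0.5) -> str:
    """With probability `fim_rate`, move a random middle span to the end."""
    if rng.random() > fim_rate:
        return text  # leave the document as ordinary left-to-right text
    # Two random cut points split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # PSM mode: the model sees prefix and suffix, then learns to generate
    # the middle with an ordinary causal language modeling loss.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```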
```bash
python run_fim.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--fim_rate 0.5 \
--fim_spm_rate 0.2 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm
```
To run on your own training and validation files, use the following command:
```bash
python run_fim.py \
--model_name_or_path gpt2 \
--train_file path_to_train_file \
--validation_file path_to_validation_file \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--fim_rate 0.5 \
--fim_spm_rate 0.2 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm
```
This uses the built-in Hugging Face `Trainer` for training. If you want a custom training loop instead, you can utilize or adapt the `run_fim_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example invocation is shown below, followed by a minimal sketch of the kind of training loop the script builds:
```bash
python run_fim_no_trainer.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--fim_rate 0.5 \
--fim_spm_rate 0.2 \
--output_dir /tmp/test-clm
```
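For orientation, here is a heavily stripped-down sketch of the Accelerate-based loop that `run_fim_no_trainer.py` implements; the `model`, `train_dataset` and `data_collator` objects are assumed to exist already, and this is not the script's exact code:
```python
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator

# Assumed to be defined elsewhere: `model`, `train_dataset`, `data_collator`.
accelerator = Accelerator(gradient_accumulation_steps=8)
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dataloader = DataLoader(train_dataset, batch_size=8, collate_fn=data_collator)

# `prepare` moves everything to the right device(s) and wraps the dataloader
# for distributed training.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    with accelerator.accumulate(model):   # handles gradient accumulation
        loss = model(**batch).loss
        accelerator.backward(loss)        # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```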
**Note**: Setting `--fim_rate` to `0.5` means the FIM transformation is applied to an example with a probability of 50%. Setting `--fim_spm_rate` to `0.2` means that 20% of the transformed examples use the SPM (Suffix-Prefix-Middle) ordering, while the remaining 80% use PSM (Prefix-Suffix-Middle). The sketch below illustrates the two orderings.
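To make the two modes concrete, here is how an already-split example would be ordered in each. The sentinel strings are again illustrative, and the exact SPM layout varies slightly between implementations:
```python
import random

rng = random.Random(0)

def fim_permute(prefix: str, middle: str, suffix: str, fim_spm_rate: float = 0.2) -> str:
    """Order an already-split document in SPM or PSM mode."""
    if rng.random() < fim_spm_rate:
        # SPM (Suffix-Prefix-Middle): the suffix is presented first, in the
        # "joint" layout where both sentinels lead the sequence.
        return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"
    # PSM (Prefix-Suffix-Middle): prefix, then suffix, then the middle target.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```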
### RoBERTa/BERT/DistilBERT and masked language modeling
The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their pre-training: masked language modeling.
@@ -176,11 +227,11 @@ sure all your batches have the same length.
## Streaming
To use the streaming dataset mode, which can be very useful for large datasets, add `--streaming` to the command line. This is supported by `run_mlm.py`, `run_clm.py` and `run_fim.py`. Make sure to adapt the other scripts to your use case by taking inspiration from them.
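Under the hood this corresponds to the `streaming=True` flag of `datasets.load_dataset`. A minimal sketch:
```python
from itertools import islice

from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are fetched lazily,
# so the full corpus never has to fit on disk or in memory.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", streaming=True)
for example in islice(dataset, 3):
    print(example["text"][:80])
```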
## Low CPU Memory Usage
To use the low CPU memory mode, which can be very useful for LLMs, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`, `run_mlm.py`, `run_plm.py`, `run_fim.py`, `run_mlm_no_trainer.py`, `run_clm_no_trainer.py` and `run_fim_no_trainer.py`.
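The flag maps to the `low_cpu_mem_usage` argument of `from_pretrained`, which loads checkpoint weights directly instead of first materializing a full randomly initialized copy of the model in CPU RAM:
```python
from transformers import AutoModelForCausalLM

# Loads the checkpoint without first allocating a random-weight copy in RAM,
# roughly halving peak host memory for large models.
model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)
```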
## Creating a model on the fly
@@ -192,4 +243,4 @@ python run_clm.py --model_type openai-community/gpt2 --tokenizer_name openai-com
[...]
```
This feature is only available in `run_clm.py`, `run_plm.py`, `run_mlm.py` and `run_fim.py`.
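Programmatically, this corresponds to building a config and instantiating a freshly initialized model from it, roughly like the following sketch (the override values are just examples):
```python
from transformers import AutoConfig, AutoModelForCausalLM

# Extra keyword arguments override fields of the base config, mirroring
# what --config_overrides does on the command line.
config = AutoConfig.from_pretrained("openai-community/gpt2", n_embd=1024, n_head=16, n_layer=48)
model = AutoModelForCausalLM.from_config(config)  # random init, no pretrained weights
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```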