Several downstream tasks are described for both GPT-2 and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.

<aid="gpt-2-text-generation"></a>
<aid="gpt-2-text-generation"></a>
## GPT-2 Text Generation

...

We include example scripts for GPT-2 evaluation on WikiText perplexity evaluation and LAMBADA cloze accuracy.

### WikiText Perplexity Evaluation
For even comparison with prior works, we evaluate perplexity on the word-level [WikiText-103 test dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), and appropriately compute perplexity given the change in tokens when using our subword tokenizer.

We use the following command to run WikiText-103 evaluation on a 345M parameter model:
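
A minimal sketch of that command is given below. The architecture flags are assumed to match the 345M checkpoint produced by the training scripts (24 layers, hidden size 1024, 16 attention heads), and `<wikitext path>` is a placeholder for the preprocessed test set.

<pre>
TASK="WIKITEXT103"
VALID_DATA=&lt;wikitext path&gt;.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

# Architecture flags below are assumed to match the evaluated checkpoint.
python tasks/main.py \
       --task $TASK \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --fp16 \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng
</pre>
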
### LAMBADA Cloze Accuracy
To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).

We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Ensure that `lambada` is part of the file path.

<pre>
TASK="LAMBADA"
TASK="LAMBADA"
VALID_DATA=&lt;lambada path&gt;.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
...
</pre>
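
The elided remainder of the block launches the evaluation itself. A minimal sketch of that invocation, reusing the variables defined above, with the same architecture flags assumed as in the WikiText-103 sketch and `--strict-lambada` appended:

<pre>
# Architecture flags are assumed to match the evaluated 345M checkpoint.
python tasks/main.py \
       --task $TASK \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --fp16 \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --no-load-optim \
       --no-load-rng \
       --strict-lambada
</pre>
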
Further command line arguments are described in the source file [`main.py`](./tasks/main.py).

## BERT Task Evaluation

<aid="race-evaluation"></a>
<aid="race-evaluation"></a>
### RACE Evaluation
The following script finetunes the BERT model for evaluation on the [RACE dataset](http://www.cs.cmu.edu/~glai1/data/race/). The `TRAIN_DATA` and `VALID_DATA` directories contain the RACE dataset as separate `.txt` files.
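
A minimal sketch of such a finetuning command follows. The dataset paths are illustrative, and the architecture flags are assumed to match the pretrained 345M BERT checkpoint.

<pre>
TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
            data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race

# Architecture flags are assumed to match the pretrained 345M BERT checkpoint.
python tasks/main.py \
       --task RACE \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --fp16 \
       --train-data $TRAIN_DATA \
       --valid-data $VALID_DATA \
       --tokenizer-type BertWordPieceLowerCase \
       --vocab-file $VOCAB_FILE \
       --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
       --save $CHECKPOINT_PATH \
       --epochs 3 \
       --micro-batch-size 4 \
       --lr 1.0e-5 \
       --lr-warmup-fraction 0.06
</pre>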