Links to vocab in Data Preprocessing; added "WordPiece" to description

Commit 8e4cb2a6 authored Apr 20, 2020 by Steven Steinke

Showing with 1 addition and 1 deletion

README.md +1 -1

--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ python tools/preprocess_data.py \

 Here the output files are named `my-gpt2_text_document.bin` and `my-gpt2_text_document.idx`. As before, in GPT-2 training, use the longer name without the extension as `--data-path`.

-The BERT uncased vocabulary file can be extracted from Google's [pretrained BERT models](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip). The GPT-2 [vocab file](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json) and [merge table](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt) can be downloaded directly.
+The BERT uncased WordPiece vocabulary file can be extracted from Google's [pretrained BERT models](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip). The GPT-2 [vocab file](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json) and [merge table](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt) can be downloaded directly.

 Further command line arguments are described in the source file [`preprocess_data.py`](./tools/preprocess_data.py).
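
The README says the BERT vocabulary "can be extracted" from the pretrained-model zip but does not show how. The following is a minimal sketch of one way to do it, not part of the repository. The helper name `extract_vocab` is hypothetical, and it assumes the archive holds a single member whose name ends in `vocab.txt`.

```python
import zipfile
from pathlib import Path


def extract_vocab(zip_path, out_dir, suffix="vocab.txt"):
    """Copy the first archive member whose name ends with `suffix` into out_dir.

    Hypothetical helper: the member name is an assumption about the archive
    layout, so the search matches on the suffix instead of a fixed path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                # Drop the archive's directory prefix and keep only the file name.
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                return target
    raise FileNotFoundError(f"no member ending in {suffix!r} in {zip_path}")
```

After downloading the zip linked above (for example with `urllib.request.urlretrieve`), calling `extract_vocab("uncased_L-12_H-768_A-12.zip", "vocab")` should leave the WordPiece vocabulary at `vocab/vocab.txt`, provided the layout assumption holds. The GPT-2 vocab file and merge table need no extraction step, since the README links to them directly.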