"git@developer.sourcefind.cn:zhaoyu6/sglang.git" did not exist on "dd650e0e21bbe07d131dd861aa475b0b9fc89ead"
Commit cfbba16f authored by A. Unique TensorFlower

Merge pull request #9560 from kyleziegler:pretraining_updates

PiperOrigin-RevId: 348278830
parents d57ba596 ec29c2f4
@@ -129,6 +129,23 @@
which is essentially branched from [BERT research repo](https://github.com/google-research/bert)
to get processed pre-training data, and adapts it to TF2 symbols and Python 3
compatibility.

Running `create_pretraining_data.py` requires input and output file locations, as well as a vocab file. Note that `max_seq_length` will need to match the sequence length you specify when you run pre-training.
Example shell script to call `create_pretraining_data.py`:
```
export WORKING_DIR='local disk or cloud location'  # holds input.txt and receives the output TFRecord
export BERT_DIR='local disk or cloud location'     # directory containing the BERT vocab file
python models/official/nlp/data/create_pretraining_data.py \
--input_file=$WORKING_DIR/input/input.txt \
--output_file=$WORKING_DIR/output/tf_examples.tfrecord \
--vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
```
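
Before launching a long pre-training job, it can be worth parsing one record from the output and confirming the feature lengths agree with the flags above. A minimal sketch, assuming TF2 is installed and using the feature schema written by `create_pretraining_data.py` (the TFRecord path is a placeholder for the `--output_file` used above):

```
import tensorflow as tf

MAX_SEQ_LENGTH = 512   # must match --max_seq_length
MAX_PREDICTIONS = 76   # must match --max_predictions_per_seq

# Feature schema produced by create_pretraining_data.py.
feature_spec = {
    "input_ids": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "input_mask": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "segment_ids": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "masked_lm_positions": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.int64),
    "masked_lm_ids": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.int64),
    "masked_lm_weights": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.float32),
    "next_sentence_labels": tf.io.FixedLenFeature([1], tf.int64),
}

# Parse the first record and print each feature's shape.
dataset = tf.data.TFRecordDataset("tf_examples.tfrecord")
for raw_record in dataset.take(1):
    example = tf.io.parse_single_example(raw_record, feature_spec)
    print({name: value.shape for name, value in example.items()})
```

With the flags above, the token-level features should have length 512 and the masked-LM features length 76. Also note that `--dupe_factor=5` duplicates the input data five times with different masking patterns, so the output will contain several records per input sequence.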
### Fine-tuning