"git@developer.sourcefind.cn:zhaoyu6/sglang.git" did not exist on "dd650e0e21bbe07d131dd861aa475b0b9fc89ead"
Commit cfbba16f authored by A. Unique TensorFlower

Merge pull request #9560 from kyleziegler:pretraining_updates

PiperOrigin-RevId: 348278830
parents d57ba596 ec29c2f4
@@ -129,6 +129,23 @@
which is essentially branched from [BERT research repo](https://github.com/google-research/bert)
to get processed pre-training data, and adapts it to TF2 symbols and Python 3
compatibility.

Running `create_pretraining_data.py` requires input and output file locations, as well as a vocab file. Note that `max_seq_length` will need to match the sequence length you specify when you run pre-training.
Example shell script to call `create_pretraining_data.py`:
```
export WORKING_DIR='local disk or cloud location'  # holds input.txt and receives the output TFRecord
export BERT_DIR='local disk or cloud location'     # directory containing the BERT vocab file
python models/official/nlp/data/create_pretraining_data.py \
--input_file=$WORKING_DIR/input/input.txt \
--output_file=$WORKING_DIR/output/tf_examples.tfrecord \
--vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
--do_lower_case=True \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
```
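
Before launching a long pre-training job, it can be worth parsing one record from the output and confirming the feature lengths agree with the flags above. A minimal sketch, assuming TF2 is installed and using the feature schema written by `create_pretraining_data.py` (the TFRecord path is a placeholder for the `--output_file` used above):

```
import tensorflow as tf

MAX_SEQ_LENGTH = 512   # must match --max_seq_length
MAX_PREDICTIONS = 76   # must match --max_predictions_per_seq

# Feature schema produced by create_pretraining_data.py.
feature_spec = {
    "input_ids": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "input_mask": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "segment_ids": tf.io.FixedLenFeature([MAX_SEQ_LENGTH], tf.int64),
    "masked_lm_positions": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.int64),
    "masked_lm_ids": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.int64),
    "masked_lm_weights": tf.io.FixedLenFeature([MAX_PREDICTIONS], tf.float32),
    "next_sentence_labels": tf.io.FixedLenFeature([1], tf.int64),
}

# Parse the first record and print each feature's shape.
dataset = tf.data.TFRecordDataset("tf_examples.tfrecord")
for raw_record in dataset.take(1):
    example = tf.io.parse_single_example(raw_record, feature_spec)
    print({name: value.shape for name, value in example.items()})
```

With the flags above, the token-level features should have length 512 and the masked-LM features length 76. Also note that `--dupe_factor=5` duplicates the input data five times with different masking patterns, so the output will contain several records per input sequence.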
### Fine-tuning