# Pretrain for wiki-en

## 1. Prepare datasets

* Download the data and decompress it:

```
BERT_PREP_WORKING_DIR=./data_tf python3 ./bertPrep.py --action download --dataset wikicorpus_en
bzip2 -dk enwiki-20170201-pages-articles-multistream.xml.bz2
BERT_PREP_WORKING_DIR=./data_tf_mlperf python3 ./bertPrep.py --action text_formatting --dataset wikicorpus_en
```

* Shard the data:

```
BERT_PREP_WORKING_DIR=./data_tf_mlperf python3 ./bertPrep.py --action sharding --dataset wikicorpus_en
```

* Generate phase 1 TFRecord data (sequence length 128):

```
python3 bertPrep.py --action create_tfrecord_files --dataset ${DATASET} --max_seq_length 128 --max_predictions_per_seq 20 --vocab_file ~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt
```

* Generate phase 2 TFRecord data (sequence length 512):

```
python3 bertPrep.py --action create_tfrecord_files --dataset ${DATASET} --max_seq_length 512 --max_predictions_per_seq 80 --vocab_file ~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt
```

## 2. Install newer apex

```
wget https://github.com/ROCmSoftwarePlatform/apex/archive/v0.3.tar.gz
tar -zxf v0.3.tar.gz
cd apex-0.3
python3 setup.py install --cuda_ext --cpp_ext
```

## 3. Train

See the [README](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/dev_xuan/PyTorch/NLP/BERT/README.md) for details.
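The two TFRecord-generation steps above differ only in sequence length and masked-prediction count (128/20 for phase 1, 512/80 for phase 2). A minimal sketch of driving both phases from one loop, assuming `${DATASET}` is `wikicorpus_en` as in the earlier steps; the loop prints the commands (dry run) — drop the leading `echo` to actually execute them:

```shell
# Dry-run sketch: print the bertPrep.py invocation for each pretraining phase.
# DATASET and the vocab path are assumptions taken from the steps above.
DATASET=wikicorpus_en
VOCAB=~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt

CMDS=""
# phase 1: seq_len=128, 20 masked predictions; phase 2: seq_len=512, 80
for cfg in "128 20" "512 80"; do
  set -- $cfg   # $1 = max_seq_length, $2 = max_predictions_per_seq
  CMD="python3 bertPrep.py --action create_tfrecord_files --dataset $DATASET --max_seq_length $1 --max_predictions_per_seq $2 --vocab_file $VOCAB"
  echo "$CMD"   # dry run: print instead of executing
  CMDS="$CMDS $CMD"
done
```

Keeping the phase parameters in one place makes it harder for the two invocations to drift apart when paths or dataset names change.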
# Fine-tune training and testing for SQuAD 1.1

## 1. Download the dataset

https://rajpurkar.github.io/SQuAD-explorer/

## 2. Download the pretrained model

https://github.com/google-research/bert#fine-tuning-with-bert

## 3. Convert the TF checkpoint

```
python3 convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

* You can also download the converted model from:

```
Link: https://pan.baidu.com/s/1V8kFpgsLQe8tOAeft-5UpQ
Extraction code: vs8d
```

## 4. Run

See the [README](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/dev_xuan/PyTorch/NLP/BERT/README.md) for details.

Detailed records are kept at: http://wiki.sugon.com/display/~%E6%9D%A8%E7%92%87/BERT
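Step 1 above only links the SQuAD explorer page. A minimal download sketch, assuming the commonly used direct `dataset/` links for the v1.1 train and dev files (an assumption — verify against the page if the links have moved); the loop prints the `wget` commands (dry run) so you can review them before running:

```shell
# Dry-run sketch: print wget commands for the SQuAD 1.1 train/dev JSON files.
# BASE and the file names are assumptions based on the linked explorer page.
SQUAD_DIR=./squad_v1.1
BASE=https://rajpurkar.github.io/SQuAD-explorer/dataset
mkdir -p "$SQUAD_DIR"

DL_CMDS=""
for f in train-v1.1.json dev-v1.1.json; do
  CMD="wget -nc -P $SQUAD_DIR $BASE/$f"   # -nc: skip files already downloaded
  echo "$CMD"   # dry run: print instead of executing
  DL_CMDS="$DL_CMDS $CMD"
done
```

Drop the leading `echo` to actually fetch the files into `./squad_v1.1`.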