# Pretrain for wiki-en

## 1. Prepare datasets

* Download the data and decompress it:

```
BERT_PREP_WORKING_DIR=./data_tf python3 ./bertPrep.py --action download --dataset wikicorpus_en
bzip2 -dk enwiki-20170201-pages-articles-multistream.xml.bz2
BERT_PREP_WORKING_DIR=./data_tf_mlperf python3 ./bertPrep.py --action text_formatting --dataset wikicorpus_en
```

* Shard the data:

```
BERT_PREP_WORKING_DIR=./data_tf_mlperf python3 ./bertPrep.py --action sharding --dataset wikicorpus_en
```

* Generate phase 1 TFRecord data (sequence length 128):

```
python3 bertPrep.py --action create_tfrecord_files --dataset ${DATASET} --max_seq_length 128 --max_predictions_per_seq 20 --vocab_file ~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt
```

* Generate phase 2 TFRecord data (sequence length 512):

```
python3 bertPrep.py --action create_tfrecord_files --dataset ${DATASET} --max_seq_length 512 --max_predictions_per_seq 80 --vocab_file ~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt
```

## 2. Install newer apex

```
wget https://github.com/ROCmSoftwarePlatform/apex/archive/v0.3.tar.gz
tar -zxf v0.3.tar.gz
cd apex-0.3
python3 setup.py install --cuda_ext --cpp_ext
```

## 3. Train

See the [README](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/dev_xuan/PyTorch/NLP/BERT/README.md) for details.
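The two TFRecord-generation steps above differ only in sequence length and masked-prediction count (128/20 for phase 1, 512/80 for phase 2). A minimal sketch of driving both phases from one loop, assuming `${DATASET}` is `wikicorpus_en` as in the earlier steps; the loop prints the commands (dry run) — drop the leading `echo` to actually execute them:

```shell
# Dry-run sketch: print the bertPrep.py invocation for each pretraining phase.
# DATASET and the vocab path are assumptions taken from the steps above.
DATASET=wikicorpus_en
VOCAB=~/NLP-0904/uncased_L-12_H-768_A-12/vocab.txt

CMDS=""
# phase 1: seq_len=128, 20 masked predictions; phase 2: seq_len=512, 80
for cfg in "128 20" "512 80"; do
  set -- $cfg   # $1 = max_seq_length, $2 = max_predictions_per_seq
  CMD="python3 bertPrep.py --action create_tfrecord_files --dataset $DATASET --max_seq_length $1 --max_predictions_per_seq $2 --vocab_file $VOCAB"
  echo "$CMD"   # dry run: print instead of executing
  CMDS="$CMDS $CMD"
done
```

Keeping the phase parameters in one place makes it harder for the two invocations to drift apart when paths or dataset names change.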
# Fine-tune training and testing for SQuAD 1.1

## 1. Download the dataset

https://rajpurkar.github.io/SQuAD-explorer/

## 2. Download the pretrained model

https://github.com/google-research/bert#fine-tuning-with-bert

## 3. Convert the TF checkpoint

```
python3 convert_tf_checkpoint.py --tf_checkpoint ~/NLP/cks/bs64k_32k_ckpt/model.ckpt-28252 --bert_config_path ~/NLP/cks/bs64k_32k_ckpt/bert_config.json --output_checkpoint model.ckpt-28252.pt
```

* You can also download the converted model from:

```
Link: https://pan.baidu.com/s/1V8kFpgsLQe8tOAeft-5UpQ
Extraction code: vs8d
```

## 4. Run

See the [README](http://10.0.100.3/dcutoolkit/deeplearing/dlexamples/-/blob/dev_xuan/PyTorch/NLP/BERT/README.md) for details.

Detailed records are kept at: http://wiki.sugon.com/display/~%E6%9D%A8%E7%92%87/BERT
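Step 1 above only links the SQuAD explorer page. A minimal download sketch, assuming the commonly used direct `dataset/` links for the v1.1 train and dev files (an assumption — verify against the page if the links have moved); the loop prints the `wget` commands (dry run) so you can review them before running:

```shell
# Dry-run sketch: print wget commands for the SQuAD 1.1 train/dev JSON files.
# BASE and the file names are assumptions based on the linked explorer page.
SQUAD_DIR=./squad_v1.1
BASE=https://rajpurkar.github.io/SQuAD-explorer/dataset
mkdir -p "$SQUAD_DIR"

DL_CMDS=""
for f in train-v1.1.json dev-v1.1.json; do
  CMD="wget -nc -P $SQUAD_DIR $BASE/$f"   # -nc: skip files already downloaded
  echo "$CMD"   # dry run: print instead of executing
  DL_CMDS="$DL_CMDS $CMD"
done
```

Drop the leading `echo` to actually fetch the files into `./squad_v1.1`.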