## End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Below we present the steps to run unsupervised and supervised training and evaluation of the retriever for [open-domain question answering](https://arxiv.org/abs/2101.00408).

### Retriever Training

##### Unsupervised pretraining by ICT
1. Use [`tools/preprocess_data.py`](../../tools/preprocess_data.py) to preprocess the dataset for the Inverse Cloze Task (ICT), which we call unsupervised pretraining. This script takes a corpus in loose JSON format as input and creates fixed-size blocks of text as the fundamental units of data; for a corpus like Wikipedia, this means multiple sentences per block and multiple blocks per document. Pass the `--split-sentences` argument so that sentences are the basic unit. We construct two indexed datasets, one from the title of every document and another from the body:

<pre>
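# Build two sentence-split indexed datasets, one from the "text" field and one from the "title" field.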
python tools/preprocess_data.py \
    --input /path/to/corpus.json \
    --json-keys text title \
    --split-sentences \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /path/to/vocab.txt \
    --output-prefix corpus_indexed \
    --workers 10
</pre>

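Assuming the default output naming of `tools/preprocess_data.py` (`<output-prefix>_<json-key>_sentence`), the run above should produce one `.bin`/`.idx` pair per JSON key; these prefixes are what the training scripts below consume:

<pre>
corpus_indexed_text_sentence.bin    corpus_indexed_text_sentence.idx
corpus_indexed_title_sentence.bin   corpus_indexed_title_sentence.idx
</pre>
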
2. The [`examples/pretrain_ict.sh`](../../examples/pretrain_ict.sh) script runs a single-GPU, 217M-parameter biencoder model for ICT retriever training. Single-GPU training is intended primarily for debugging, as the code is developed for distributed training. The script uses a pretrained BERT model, and we use a total batch size of 4096 for ICT training.

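For reference, the retrieval-specific plumbing in that script looks roughly like the following; this is a minimal sketch with placeholder paths, and the model-size, batch-size, and optimizer arguments are omitted (the example script itself is the authoritative source):

<pre>
python pretrain_ict.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /path/to/vocab.txt \
    --bert-load /path/to/pretrained_bert_checkpoint \
    --data-path corpus_indexed_text_sentence \
    --titles-data-path corpus_indexed_title_sentence \
    --save /path/to/ict_checkpoints
</pre>

Here `--data-path` and `--titles-data-path` point to the text and title dataset prefixes produced in step 1, and `--bert-load` points to the pretrained BERT checkpoint used to initialize the biencoder.
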
3. Evaluate the pretrained ICT model using [`examples/evaluate_retriever_nq.sh`](../../examples/evaluate_retriever_nq.sh) on [Google's Natural Questions Open dataset](https://arxiv.org/pdf/1906.00300.pdf).

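In practice this amounts to pointing the retriever checkpoint and Natural Questions evaluation data paths used by the script at your own files (the exact variable names may differ between releases) and launching it:

<pre>
bash examples/evaluate_retriever_nq.sh
</pre>
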
##### Supervised finetuning

1. Finetune the pretrained ICT model above on [Google's Natural Questions Open dataset](https://github.com/google-research/language/tree/master/language/orqa). The script [`examples/finetune_retriever_distributed.sh`](../../examples/finetune_retriever_distributed.sh) shows how to perform this training. Our finetuning adds retriever score scaling and longer training (80 epochs) on top of [DPR training](https://arxiv.org/abs/2004.04906).

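A typical run only requires editing the paths the script expects (the pretrained ICT checkpoint to start from, the Natural Questions train/dev files, and the vocabulary) and launching it on a multi-GPU node:

<pre>
bash examples/finetune_retriever_distributed.sh
</pre>
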
2. Evaluate the finetuned model using the same evaluation script as mentioned above for the unsupervised model.

More details on the retriever are available in [our paper](https://arxiv.org/abs/2101.00408).

### Reader Training

The reader component will be available soon.