Below we present the steps to run unsupervised and supervised training and evaluation of the retriever for [open domain question answering](https://arxiv.org/abs/2101.00408).
## Retriever Training
#### Unsupervised pretraining
1. Use [`tools/preprocess_data.py`](../../tools/preprocess_data.py) to preprocess the dataset for the Inverse Cloze Task (ICT), which we call unsupervised pretraining. This script takes as input a corpus in loose JSON format and creates fixed-size blocks of text as the fundamental units of data. For a corpus like Wikipedia, this means multiple sentences per block and multiple blocks per document. Run the script with the `--split-sentences` argument to construct one or more indexed datasets in which sentences are the basic unit. We construct two datasets, one with the title of every document and another with the body (see the first sketch after this list).
2. The [`examples/pretrain_ict.sh`](../../examples/pretrain_ict.sh) script runs a single-GPU 217M-parameter biencoder model for ICT retriever training. Single-GPU training is primarily intended for debugging purposes, as the code is developed for distributed training. The script uses a pretrained BERT model and a total batch size of 4096 for ICT training, hence the need for a data-parallel world size of 32 (see the second sketch after this list).
3. Evaluate the pretrained ICT model on [Google's Natural Questions Open dataset](https://arxiv.org/pdf/1906.00300.pdf) using [`examples/evaluate_retriever_nq.sh`](../../examples/evaluate_retriever_nq.sh) (see the third sketch after this list).
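The following is a minimal sketch of the preprocessing invocation for step 1. The input corpus, vocab file, output prefix, and worker count are placeholders, and exact flag names can vary between Megatron-LM versions, so check `python tools/preprocess_data.py --help` before running.

```bash
# Build sentence-split indexed datasets (one for document titles, one for
# document bodies) from a loose-JSON Wikipedia dump. Paths and the worker
# count below are placeholders; adjust them for your setup.
python tools/preprocess_data.py \
    --input /path/to/wikipedia.json \
    --json-keys text title \
    --split-sentences \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /path/to/bert-vocab.txt \
    --output-prefix wiki-ict \
    --workers 16
```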
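A sketch of launching step 2. The pretrained BERT checkpoint, vocab file, and ICT dataset paths are set via variables inside the script, so edit those before running; reaching the full batch size of 4096 assumes a distributed launch across 32 data-parallel GPUs rather than the single-GPU debug run shown here.

```bash
# Single-GPU debugging run of ICT pretraining; edit the BERT checkpoint,
# vocab, and ICT dataset paths inside the script before launching.
bash examples/pretrain_ict.sh
```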
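A sketch of step 3. The Natural Questions evaluation data and the ICT checkpoint to evaluate are assumed to be configured via variables inside the script; point them at your paths before running.

```bash
# Evaluate retrieval accuracy on Natural Questions Open with the pretrained
# ICT checkpoint; data and checkpoint paths are configured inside the script.
bash examples/evaluate_retriever_nq.sh
```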
#### Supervised finetuning
1. Use the above pretrained ICT model to finetune on [Google's Natural Questions Open dataset](https://github.com/google-research/language/tree/master/language/orqa). The script [`examples/finetune_retriever_distributed.sh`](../../examples/finetune_retriever_distributed.sh) provides an example of how to perform the training. Our finetuning process includes retriever score scaling, longer training (80 epochs), and hard negative examples on top of [DPR training](https://arxiv.org/abs/2004.04906) (see the sketch after this list).
2. Evaluate the finetuned model using the same evaluation script as mentioned above for the unsupervised model.
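A sketch of launching supervised finetuning (step 1) and re-running evaluation (step 2). The pretrained ICT checkpoint, NQ training data, and output paths are assumed to be configured inside the scripts; the evaluation script is the same one used for the unsupervised model, pointed at the finetuned checkpoint instead.

```bash
# Distributed finetuning on Natural Questions Open; edit the ICT checkpoint
# and NQ data paths inside the script before launching.
bash examples/finetune_retriever_distributed.sh

# Re-run retrieval evaluation, with the script's checkpoint path now
# pointing at the finetuned model.
bash examples/evaluate_retriever_nq.sh
```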
More details on the retriever are available in [our paper](https://arxiv.org/abs/2101.00408).