# Intro

Authors: @patrickvonplaten and @lhoestq

Aimed at tackling knowledge-intensive NLP tasks (think tasks a human wouldn't be expected to solve without access to external knowledge sources), RAG models are seq2seq models with access to a retrieval mechanism that provides relevant context documents at training and evaluation time.

A RAG model encapsulates two core components: a question encoder and a generator.
During a forward pass, we encode the input with the question encoder and pass it
to the retriever to extract relevant context documents. The documents are then prepended to the input.
Such contextualized inputs are passed to the generator.

Read more about RAG at https://arxiv.org/abs/2005.11401.
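
To make this concrete, here is a minimal inference sketch using the pretrained `facebook/rag-sequence-nq` checkpoint with the `transformers` API (illustrative usage, not one of this project's scripts; `use_dummy_dataset=True` downloads a small dummy index instead of the full `wiki_dpr` index):

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

# Load the tokenizer, the retriever (which supplies context documents), and the model.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

# Encode a question; retrieval happens inside generate(), and the generator produces the answer.
inputs = tokenizer("who sings does he love me with reba", return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```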

# Note

⚠️ This project should be run with `pytorch-lightning==1.3.1`, which has a potential security vulnerability

# Finetuning

Our finetuning logic is based on scripts from [`examples/seq2seq`](https://github.com/huggingface/transformers/tree/main/examples/seq2seq). We accept training data in the same format as specified there: we expect a directory consisting of 6 text files, where each line of a `.source` file pairs with the same line of the corresponding `.target` file:
```bash
train.source
train.target
val.source
val.target
test.source
test.target
```
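
For illustration, a hedged sketch that writes a toy dataset in this line-aligned layout (the directory name and the question/answer pairs are made up):

```python
from pathlib import Path

# Hypothetical toy data: line i of *.source pairs with line i of *.target.
questions = [
    "who sings does he love me with reba",
    "who is the owner of reading football club",
]
answers = ["Linda Davis", "Dai Yongge"]

data_dir = Path("toy_rag_data")  # hypothetical directory, later passed as --data_dir
data_dir.mkdir(exist_ok=True)
for split in ("train", "val", "test"):
    (data_dir / f"{split}.source").write_text("\n".join(questions) + "\n")
    (data_dir / f"{split}.target").write_text("\n".join(answers) + "\n")
```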

A sample finetuning command (run `./examples/research_projects/rag/finetune_rag.py --help` to list all available options):

```bash
python examples/research_projects/rag/finetune_rag.py \
    --data_dir $DATA_DIR \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --model_type rag_sequence \
    --fp16 \
    --gpus 8
```
We publish two `base` models which can serve as a starting point for finetuning on downstream tasks (use them as `model_name_or_path`):
- [`facebook/rag-sequence-base`](https://huggingface.co/facebook/rag-sequence-base) - a base for finetuning `RagSequenceForGeneration` models,
- [`facebook/rag-token-base`](https://huggingface.co/facebook/rag-token-base) - a base for finetuning `RagTokenForGeneration` models.

The `base` models initialize the question encoder with [`facebook/dpr-question_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) and the generator with [`facebook/bart-large`](https://huggingface.co/facebook/bart-large).

If you would like to initialize finetuning with a base model using different question encoder and generator architectures, you can build it with a consolidation script, e.g.:
```bash
python examples/research_projects/rag/consolidate_rag_checkpoint.py \
    --model_type rag_sequence \
    --generator_name_or_path facebook/bart-large-cnn \
    --question_encoder_name_or_path facebook/dpr-question_encoder-single-nq-base \
    --dest path/to/checkpoint
```
You will then be able to pass `path/to/checkpoint` as `model_name_or_path` to the `finetune_rag.py` script.

## Document Retrieval
When running distributed fine-tuning, each training worker needs to retrieve contextual documents for its input by querying an index loaded into memory. RAG provides two implementations for document retrieval, one based on the [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) communication package and the other on [`Ray`](https://docs.ray.io/en/master/).

This option can be configured with the `--distributed_retriever` flag, which can be set to either `pytorch` or `ray`.
By default, this flag is set to `pytorch`.

For the PyTorch implementation, only training worker 0 loads the index into CPU memory, and a gather/scatter pattern is used
to collect the inputs from the other training workers and send back the corresponding document embeddings.
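
Conceptually, that pattern looks like the following sketch in raw `torch.distributed` terms (a simplified illustration, not the project's actual retriever code: `index.search` is a stand-in for the real FAISS lookup, and the tensor shapes are assumptions):

```python
import torch
import torch.distributed as dist

def distributed_retrieve(question_hidden_states: torch.Tensor, index, n_docs: int = 5) -> torch.Tensor:
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # 1) Rank 0 gathers the query embeddings from every training worker.
    gathered = [torch.empty_like(question_hidden_states) for _ in range(world_size)] if rank == 0 else None
    dist.gather(question_hidden_states, gather_list=gathered, dst=0)

    # 2) Only rank 0 holds the index in CPU memory and searches it.
    doc_embeds = None
    if rank == 0:
        # Hypothetical search API returning (batch, n_docs, dim) document embeddings per worker.
        doc_embeds = [index.search(queries, n_docs) for queries in gathered]

    # 3) Rank 0 scatters each worker's document embeddings back to it.
    out = question_hidden_states.new_empty(question_hidden_states.size(0), n_docs, question_hidden_states.size(1))
    dist.scatter(out, scatter_list=doc_embeds, src=0)
    return out
```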

For the Ray implementation, the index is loaded in *separate* process(es). The training workers randomly select which
retriever worker to query. To use Ray for distributed retrieval, you have to set the `--distributed_retriever` arg to `ray`.
To configure the number of retrieval workers (the number of processes that load the index), you can set the `--num_retrieval_workers` flag.
Also make sure to start the Ray cluster before running fine-tuning.

```bash
# Start a single-node Ray cluster.
ray start --head

python examples/research_projects/rag/finetune_rag.py \
    --data_dir $DATA_DIR \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --model_type rag_sequence \
    --fp16 \
    --gpus 8 \
    --distributed_retriever ray \
    --num_retrieval_workers 4

# Stop the ray cluster once fine-tuning has finished.
ray stop
```

Using Ray can lead to retrieval speedups in multi-GPU settings, since multiple processes load the index rather than
just the rank 0 training worker. Using Ray also allows you to load the index on GPU, since the index is loaded in a separate
process from the model, whereas with PyTorch distributed retrieval both are loaded in the same process, potentially leading to GPU OOM.

# Evaluation
Our evaluation script enables two modes of evaluation (controlled by the `eval_mode` argument): `e2e`, which performs end-to-end evaluation and returns EM (exact match) and F1 scores calculated for the downstream task, and `retrieval`, which returns precision@k of the documents retrieved for the provided inputs.

The evaluation script expects paths to two files:
- `evaluation_set` - a path to a file specifying the evaluation dataset, a single input per line.
- `gold_data_path` - a path to a file containing ground truth answers for datapoints from the `evaluation_set`, a single output per line. Check below for the expected formats of the gold data files.


## Retrieval evaluation
For `retrieval` evaluation, we expect a gold data file where each line consists of a tab-separated list of document titles constituting positive contexts for the respective datapoint from the `evaluation_set`. E.g. given the question `who sings does he love me with reba` in the `evaluation_set`, a respective ground truth line could look as follows:
```
Does He Love You	Does He Love You	Red Sandy Spika dress of Reba McEntire	Greatest Hits Volume Two (Reba McEntire album)	Shoot for the Moon (album)
```
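
In `retrieval` mode, the script stores its predictions as tab-separated retrieved document titles, one line per input, under `predictions_path` (see step 3 below). Precision@k can then be computed roughly as follows (a hedged sketch, not the script's exact implementation; file names are illustrative):

```python
def precision_at_k(preds_path: str, gold_path: str, k: int = 1) -> float:
    """Average overlap between the top-k retrieved titles and the gold titles."""
    scores = []
    with open(preds_path) as preds, open(gold_path) as gold:
        for pred_line, gold_line in zip(preds, gold):
            retrieved = pred_line.rstrip("\n").split("\t")[:k]
            positives = set(gold_line.rstrip("\n").split("\t"))
            scores.append(len(set(retrieved) & positives) / k)
    return sum(scores) / len(scores)

print(precision_at_k("output/retrieval_preds.tsv", "output/biencoder-nq-dev.pages", k=1))
```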

We demonstrate how to evaluate retrieval against DPR evaluation data. You can download the respective files from the links listed [here](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py#L39-L45).

1. Download and unzip the gold data file. We use the `biencoder-nq-dev` from https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz.
    ```bash
    wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz && gzip -d biencoder-nq-dev.json.gz
    ```

2. Parse the unzipped file using `parse_dpr_relevance_data.py`:
    ```bash
    mkdir output # or wherever you want to save this
    python examples/research_projects/rag/parse_dpr_relevance_data.py \
        --src_path biencoder-nq-dev.json \
        --evaluation_set output/biencoder-nq-dev.questions \
        --gold_data_path output/biencoder-nq-dev.pages
    ```
3. Run evaluation:
    ```bash
    python examples/research_projects/rag/eval_rag.py \
        --model_name_or_path facebook/rag-sequence-nq \
        --model_type rag_sequence \
        --evaluation_set output/biencoder-nq-dev.questions \
        --gold_data_path output/biencoder-nq-dev.pages \
        --predictions_path output/retrieval_preds.tsv  \
        --eval_mode retrieval \
        --k 1
    ```
    ```bash
    # EXPLANATION of the flags used above:
    #   --model_name_or_path  model name or path of the model we're evaluating
    #   --model_type          RAG model type (rag_token or rag_sequence)
    #   --evaluation_set      an input dataset for evaluation
    #   --gold_data_path      a dataset containing ground truth answers for samples from the evaluation_set
    #   --predictions_path    name of the file where predictions will be stored
    #   --eval_mode           indicates whether we're performing retrieval evaluation or e2e evaluation
    #   --k                   parameter k for the precision@k metric
    ```
## End-to-end evaluation

We support two formats of the gold data file (controlled by the `gold_data_mode` parameter):
- `qa` - where a single line has the following format: `input [tab] output_list`, e.g.:
```
who is the owner of reading football club	['Xiu Li Dai', 'Dai Yongge', 'Dai Xiuli', 'Yongge Dai']
```
- `ans` - where a single line contains a single expected answer, e.g.:
```
Xiu Li Dai
```
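
In the `qa` format, the answer list is written as a Python-style literal, so a line can be recovered with `ast.literal_eval` (a minimal sketch, assuming the file follows the example above):

```python
import ast

line = "who is the owner of reading football club\t['Xiu Li Dai', 'Dai Yongge', 'Dai Xiuli', 'Yongge Dai']"
question, raw_answers = line.split("\t")
answers = ast.literal_eval(raw_answers)  # -> list of acceptable answer strings
print(question, answers)
```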

Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter.
If this path already exists, the script will use saved predictions to calculate metrics.
Add the `--recalculate` parameter to force the script to perform inference from scratch.

An example e2e evaluation run could look as follows:
```bash
python examples/research_projects/rag/eval_rag.py \
    --model_name_or_path facebook/rag-sequence-nq \
    --model_type rag_sequence \
    --evaluation_set path/to/test.source \
    --gold_data_path path/to/gold_data \
    --predictions_path path/to/e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode qa \
    --n_docs 5 \
    --print_predictions \
    --recalculate
# --n_docs: you can experiment with retrieving a different number of documents at evaluation time
# --recalculate: forces recalculating predictions even if predictions_path already exists
```

# Use your own knowledge source

By default, RAG uses the English Wikipedia as a knowledge source, loaded as the `wiki_dpr` dataset.
With `use_own_knowledge_dataset.py` you can build your own knowledge source for RAG.

For instance, if documents are serialized as tab-separated CSV files with the columns "title" and "text", one can use `use_own_knowledge_dataset.py` as follows:
```bash
python examples/research_projects/rag/use_own_knowledge_dataset.py \
    --csv_path path/to/my_csv \
    --output_dir path/to/my_knowledge_dataset
```
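
For reference, a hedged sketch of producing such a tab-separated CSV (the path and the two toy passages are made up):

```python
import csv

# Toy passages with the expected "title" and "text" columns.
rows = [
    ("Aaron", "Aaron is a prophet, high priest, and the elder brother of Moses."),
    ("Reading F.C.", "Reading Football Club is an English association football club."),
]

with open("path/to/my_csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["title", "text"])
    writer.writerows(rows)
```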

The created outputs in `path/to/my_knowledge_dataset` can then be used to finetune RAG as follows:
```bash
python examples/research_projects/rag/finetune_rag.py \
    --data_dir $DATA_DIR \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --model_type rag_sequence \
    --fp16 \
    --gpus 8 \
    --index_name custom \
    --passages_path path/to/my_knowledge_dataset \
    --index_path path/to/my_knowledge_dataset_hnsw_index.faiss
```
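
To sanity-check the generated artifacts before fine-tuning, they can be loaded back with the `datasets` library (a hedged sketch; the paths mirror the commands above, and the embedding column is assumed to be named `embeddings`, as in `wiki_dpr`):

```python
from datasets import load_from_disk

# Load the passages dataset written by use_own_knowledge_dataset.py ...
dataset = load_from_disk("path/to/my_knowledge_dataset")
# ... and attach the FAISS index built over its embedding column.
dataset.load_faiss_index("embeddings", "path/to/my_knowledge_dataset_hnsw_index.faiss")
print(dataset)
```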