# End-to-End finetuning of RAG (including DPR retriever) for Question Answering.

This finetuning script is actively maintained by [Shamane Siri](https://github.com/shamanez). Feel free to ask questions on the [Forum](https://discuss.huggingface.co/) or post an issue on [GitHub](https://github.com/huggingface/transformers/issues/new/choose) and tag @shamanez.

Others that helped out: Patrick von Platen (@patrickvonplaten), Quentin Lhoest (@lhoestq), and Rivindu Weerasekera (@rivinduw)

The original RAG implementation is able to train the question encoder and generator end-to-end.
This extension enables complete end-to-end training of RAG including the context encoder in the retriever component.
Please read the [accompanying blog post](https://shamanesiri.medium.com/how-to-finetune-the-entire-rag-architecture-including-dpr-retriever-4b4385322552) for details on this implementation.

The original RAG code has also been modified to work with the latest versions of PyTorch Lightning (version 1.2.10) and Ray (version 1.3.0). All other implementation details remain the same as in the [original RAG code](https://github.com/huggingface/transformers/tree/main/examples/research_projects/rag).
Read more about RAG at https://arxiv.org/abs/2005.11401.

This code can be modified to experiment with other retrieval-augmented models that also train the retriever (e.g. [REALM](https://arxiv.org/abs/2002.08909) and [MARGE](https://arxiv.org/abs/2006.15020)).

To start training, use the bash script (`finetune_rag_ray_end2end.sh`) in this folder. The script also describes each command-line argument it uses.

# Latest Update

⚠️ Updated rag-end2end-retriever to be compatible with pytorch-lightning==1.6.4 and ray==1.13.0 (the latest versions as of 2022-June-11).
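
Before launching a run, it can help to confirm that the installed packages match the versions named in the update note above. The following is a minimal sketch, not part of this project: the helper name `check_versions` is hypothetical, and the package names `pytorch-lightning` and `ray` are assumed to be the distribution names used at install time.

```python
from importlib import metadata

# Versions taken from the update note above; adjust if you pin differently.
EXPECTED_VERSIONS = {"pytorch-lightning": "1.6.4", "ray": "1.13.0"}


def check_versions(expected):
    """Return {package: (installed_or_None, expected, matches)}.

    Hypothetical helper for illustration; it only compares version strings.
    """
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, wanted, ok) in check_versions(EXPECTED_VERSIONS).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={installed} expected={wanted} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only flags that the environment differs from the one the update was tested against.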

# Note

⚠️ This project should be run with pytorch-lightning==1.3.1, which has a potential security vulnerability.

# Testing

The following two bash scripts can be used to quickly test the implementation.
1. sh ./test_run/test_finetune.sh
    - Tests the full end-to-end fine-tuning pipeline with a dummy knowledge-base and a dummy training dataset (see the test_dir directory).
    - Users can replace the dummy dataset and knowledge-base with their own to do their own finetuning.
    - Please read the comments in the test_finetune.sh file.
2. sh ./test_run/test_rag_new_features.sh
    - Tests the newly added functions (`set_context_encoder` and `set_context_encoder_tokenizer`) in the RAG modeling code.
    - This is sufficient to check that the model can use these set functions correctly.



# Comparison of end2end RAG (including DPR finetuning) vs. original-RAG

We conducted a simple experiment to investigate the effectiveness of this end2end training extension using the SQuAD dataset. Please execute the following steps to reproduce the results.

-   Create a knowledge-base using all the context passages in the SQuAD dataset with their respective titles.
-   Use the question-answer pairs as training data.
-   Train the system for 10 epochs.
-   Evaluate the Exact Match (EM) score on the SQuAD dataset's validation set.
-   The training dataset, the knowledge-base, and the hyperparameters used in the experiments can be accessed from [here](https://drive.google.com/drive/folders/1qyzV-PaEARWvaU_jjpnU_NUS3U_dSjtG?usp=sharing).
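
The first two steps above (building a knowledge-base from the SQuAD context passages and titles, and extracting question-answer pairs) can be sketched as follows. This is an illustration under stated assumptions, not the script used for the experiment: the function names are hypothetical, the tab-separated `title<TAB>text` layout is an assumption about what the knowledge-base loader expects, and `SAMPLE_SQUAD` is a made-up stand-in for a real SQuAD v1.1 JSON file.

```python
import csv
import io

# Tiny in-memory stand-in for a SQuAD v1.1 JSON file (hypothetical content).
SAMPLE_SQUAD = {
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "RAG combines a retriever\nwith a generator.",
                    "qas": [
                        {
                            "question": "What does RAG combine?",
                            "answers": [{"text": "a retriever with a generator"}],
                        }
                    ],
                }
            ],
        }
    ]
}


def build_knowledge_base(squad):
    """Collect unique (title, passage) rows from every SQuAD paragraph."""
    seen, rows = set(), []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            # Collapse whitespace so each passage stays on one TSV line.
            text = " ".join(paragraph["context"].split())
            key = (article["title"], text)
            if key not in seen:
                seen.add(key)
                rows.append(key)
    return rows


def build_qa_pairs(squad):
    """Pair each question with its first reference answer."""
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    answer = qa["answers"][0]["text"].strip()
                    pairs.append((qa["question"].strip(), answer))
    return pairs


def to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


kb_rows = build_knowledge_base(SAMPLE_SQUAD)
qa_pairs = build_qa_pairs(SAMPLE_SQUAD)
print(to_tsv(kb_rows), end="")
print(qa_pairs)
```

To use real data, load the SQuAD JSON with `json.load` in place of `SAMPLE_SQUAD` and write the results to the files your training script reads; the comments in the bash scripts describe the paths they expect.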

# Results

- We trained both models for 10 epochs.

| Model Type           | EM-Score |
| -------------------- | -------- |
| RAG-original         | 28.12    |
| RAG-end2end with DPR | 40.02    |
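
For readers unfamiliar with the metric, the sketch below shows one common way to compute an Exact Match score, using SQuAD-style answer normalization (lowercasing, then removing punctuation and English articles, then collapsing whitespace). It is an assumption that the evaluation behind the table above applies exactly these rules; the function names are illustrative.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def em_score(predictions, references):
    """Percentage of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


predictions = ["The Eiffel Tower.", "Paris"]
references = ["eiffel tower", "London"]
print(em_score(predictions, references))
```

Here the first prediction matches its reference after normalization and the second does not, so half of the answers count as exact matches.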