# Retro and InstructRetro

Retro [(Borgeaud et al., 2022)](https://arxiv.org/abs/2112.04426) is an autoregressive decoder-only language model (LM)
pretrained with retrieval-augmentation.
Retro features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of
tokens.
Pretraining with retrieval provides a more efficient storage mechanism for factual knowledge than storing it
implicitly within the network's parameters, substantially reducing model parameters while achieving lower perplexity
than standard GPT.
Retro also provides the flexibility to update the
knowledge stored in LMs [(Wang et al., 2023a)](https://arxiv.org/abs/2304.06762)
by updating the retrieval database without training LMs again.
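
At a high level, Retro operates on fixed-size chunks: the input sequence is split into 64-token chunks, nearest neighbors of each chunk are fetched from the retrieval database, and the decoder cross-attends to the retrieved neighbors through a small encoder. The toy sketch below illustrates only the chunking and per-chunk lookup (names are hypothetical, and a brute-force search stands in for the trillion-token approximate Faiss index):

```python
import numpy as np

CHUNK_SIZE = 64  # Retro retrieves per 64-token chunk, not per token

def chunk_sequence(tokens, chunk_size=CHUNK_SIZE):
    """Split a token sequence into fixed-size chunks (remainder dropped)."""
    n = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]

def nearest_neighbors(chunk_embedding, db_embeddings, k=2):
    """Brute-force inner-product search; Retro uses an approximate
    Faiss index instead, so that lookup scales to trillions of tokens."""
    scores = db_embeddings @ chunk_embedding
    return np.argsort(-scores)[:k]

tokens = list(range(256))        # toy 256-token sequence
chunks = chunk_sequence(tokens)  # 4 chunks of 64 tokens

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 32))  # toy database of chunk embeddings
query = rng.standard_normal(32)       # embedding of one input chunk
top2 = nearest_neighbors(query, db, k=2)
```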

InstructRetro [(Wang et al., 2023b)](https://arxiv.org/abs/2310.07713) further scales up the size of Retro to 48B,
featuring the largest LLM pretrained with retrieval (as of December 2023).
The obtained foundation model, Retro 48B, largely outperforms the GPT counterpart in terms of perplexity.
With instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on
downstream tasks in the zero-shot setting. Specifically, the average improvement of InstructRetro is 7% over its GPT
counterpart across 8 short-form QA tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across
3 summarization tasks. We also find that the encoder can be ablated from the InstructRetro architecture, and the
InstructRetro decoder backbone can be used directly as GPT while achieving comparable results.

This README provides an end-to-end tutorial to reproduce Retro and InstructRetro.

# Contents

* [Checkpoints](#checkpoints)
* [End-to-end Reproduction Guide](#end-to-end-reproduction-guide)
    * [Step 0: Prepare the environment](#step-0-prepare-the-environment)
        * [Docker image](#docker-image)
        * [Install dependencies](#install-dependencies)
    * [Step 1: Build retrieval database](#step-1-build-retrieval-database)
    * [Step 2: Pretraining](#step-2-pretraining)
    * [Step 3: Perplexity evaluation](#step-3-perplexity-evaluation)
    * [Step 4: Instruction tuning](#step-4-instruction-tuning)
    * [Step 5: Downstream task evaluation](#step-5-downstream-task-evaluation)
* [Citations](#citations)

# Checkpoints

We provide the pretrained checkpoints of Retro and InstructRetro in the following table. The checkpoints are available
for download via the following links:

| Model                   | Size | Instruction Tuning | Download Link 1                                                    | Download Link 2                                                                | Download Link 3                                                                                      |
|-------------------------|------|--------------------|--------------------------------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `retro-8b-base-4k`      | 8b   |                    | [Huggingface](https://huggingface.co/nvidia/retro-8b-base-4k)      | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-8b-base-4k)      | [Google Drive](https://drive.google.com/drive/folders/1uSQ5DAsuvx_8XcbtnVfs_MGvEOcx0uK_?usp=sharing) |
| `retro-8b-instruct-4k`  | 8b   | ✅                  | [Huggingface](https://huggingface.co/nvidia/retro-8b-instruct-4k)  | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-8b-instruct-4k)  | [Google Drive](https://drive.google.com/drive/folders/1v5dKaSN0cm2lwyAWpFaJtlTrLhtMZXsI?usp=sharing) |
| `retro-48b-base-4k`     | 48b  |                    | [Huggingface](https://huggingface.co/nvidia/retro-48b-base-4k)     | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-48b-base-4k)     | [Google Drive](https://drive.google.com/drive/folders/1rtNpf0CiLElSHQcr3aLI3zgfI3teGTP5?usp=sharing) |
| `retro-48b-instruct-4k` | 48b  | ✅                  | [Huggingface](https://huggingface.co/nvidia/retro-48b-instruct-4k) | [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/models/retro-48b-instruct-4k) | [Google Drive](https://drive.google.com/drive/folders/1qdb0AQjSsAPGlWaIu3wgHPjf_nwLeY5h?usp=sharing) |

# End-to-end Reproduction Guide

In this README, we provide an end-to-end reproduction guide for InstructRetro, covering everything from large-scale
retrieval database construction, through pretraining, perplexity evaluation, and instruction tuning, to downstream task
evaluation.

If you are only interested in evaluation, we have also [open-sourced our checkpoints](#checkpoints), and you can go
directly to [Step 5](#step-5-downstream-task-evaluation) to evaluate them on downstream tasks.

## Step 0: Prepare the environment

We recommend using a Docker environment to run the code.

### Docker image

We provide a Docker build file in [tools/retro/examples/Dockerfile](examples/Dockerfile) for the reproduction. The
Docker image is based on the [NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) `nvcr.io/nvidia/pytorch:23.09-py3`.
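
A typical build-and-run flow looks like the following (the image tag and mount paths are illustrative; adjust them to your setup):

```bash
# Build the image from the provided Dockerfile
docker build -t megatron-retro -f tools/retro/examples/Dockerfile .

# Launch an interactive container with GPU access and the repo mounted
docker run --gpus all -it --rm \
  -v $(pwd):/workspace/megatron \
  -v /path/to/data:/workspace/data \
  megatron-retro
```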

### Install dependencies

Clone the Megatron repo:

```bash
git clone --branch InstructRetro https://github.com/NVIDIA/Megatron-LM.git
```

If docker is not available, we recommend starting from a clean conda environment with the following runtime
dependencies:

- Python 3.10
- NVIDIA CUDA® 12.2.1
- NVIDIA cuBLAS 12.2.5.6
- NVIDIA cuDNN 8.9.5
- NVIDIA NCCL 2.18.5
- PyTorch 2.1.0a0+32f93b1

Then install Retro-specific dependencies, including:

```bash
pip install -U faiss-gpu
pip install -U transformers
pip install -U sentencepiece
pip install -U h5py
pip install -U nltk
pip install -U einops
```

## Step 1: Build retrieval database

In this step, we build a large-scale retrieval database for InstructRetro
using [Faiss](https://github.com/facebookresearch/faiss), enabling retrieval from trillions of tokens, and preprocess
(and save) the retrieval neighbors for the pretraining step.

Please refer to [tools/retro/build_db.md](build_db.md) for more details.
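
Conceptually, the artifact produced by this step is a mapping from every training chunk to the token ids of its retrieved neighbors, saved to disk so that pretraining never has to query the index online. A minimal sketch of that precompute-and-save pattern (the array names, shapes, and `.npy` format here are illustrative, not Megatron's actual on-disk layout):

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
num_chunks, k, neighbor_len = 10, 2, 128  # toy sizes; real runs have billions of chunks

# Pretend these are the retrieved neighbor token ids for each training chunk
# (each neighbor = the retrieved chunk plus its continuation).
neighbor_tokens = rng.integers(0, 32000, size=(num_chunks, k, neighbor_len))

# Save once during preprocessing ...
path = os.path.join(tempfile.mkdtemp(), "neighbors.npy")
np.save(path, neighbor_tokens)

# ... then memory-map at pretraining time so neighbors are loaded lazily.
loaded = np.load(path, mmap_mode="r")
```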

## Step 2: Pretraining

*Please strictly follow Step 1 to build the retrieval database before pretraining to make sure the preprocessed
retrieval neighbors match the pretraining corpus.*

In the pretraining step, we support both pretraining from scratch and continued pretraining from a pretrained GPT model.

We provide a template pretraining script to pretrain an 843M Retro from scratch. Prepare your own arguments and update
our template in [tools/retro/examples/pretrain_model.sh](examples/pretrain_model.sh). Note that the data path must
exactly match the one used in Step 1 so that the preprocessed retrieval neighbors match the pretraining corpus.

[//]: # (Take the example of the Wikipedia corpus)

```bash
bash tools/retro/examples/pretrain_model.sh
```

After pretraining, the model checkpoints will be saved in the directory specified by the `--save` argument
in `pretrain_model.sh`.

To continue pretraining with retrieval from a pretrained GPT model, specify `--load` in `pretrain_model.sh` to load the
pretrained GPT checkpoint. The GPT architecture (hidden size, number of layers, activation functions, etc.) must
exactly match the one used for Retro. For the first job, also specify `--no-load-optim --finetune` so that the
optimizer state is not loaded from the pretrained GPT model and continued pretraining with retrieval starts from a
clean optimizer state. In follow-up jobs, you will continue pretraining with retrieval from your last checkpoint:
launch them without the flags `--no-load-optim --finetune` so that the optimizer state is correctly loaded from the
previous job.
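
The flag handling above can be summarized as follows (the `ARGS` variable and checkpoint paths are placeholders for illustration; all other arguments stay as in `pretrain_model.sh`):

```bash
# First job: warm-start from GPT weights, but start the optimizer from scratch.
ARGS="--load <path/to/pretrained/gpt> --save <path/to/retro/checkpoints> \
      --no-load-optim --finetune"

# Follow-up jobs: drop --no-load-optim --finetune so the optimizer state
# resumes from the last Retro checkpoint.
ARGS="--load <path/to/retro/checkpoints> --save <path/to/retro/checkpoints>"
```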

## Step 3: Perplexity evaluation

During pretraining, model perplexity on the specified validation corpus is evaluated automatically
every `--eval-interval` steps. The validation corpus must be exactly the same as the one used in Step 1 so that the
preprocessed retrieval neighbors match the corpus.

To evaluate the perplexity of a pretrained model, add `--skip-train` in `pretrain_model.sh`; this skips training and
only evaluates the perplexity of the model specified by `--load` on the validation corpus. Then run the same command
as above:

```bash
bash tools/retro/examples/pretrain_model.sh
```
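
For reference, the reported perplexity is simply the exponentiated mean per-token negative log-likelihood over the validation corpus:

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model assigning probability 1/2 to every token has perplexity 2.
ppl = perplexity([math.log(2)] * 4)
```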

## Step 4: Instruction tuning

In this step, we fine-tune the pretrained model on instruction-following data. We provide a template
instruction tuning script to fine-tune the 843M Retro.

We also provide an open-source blend of instruction tuning datasets, available for download
[here](https://drive.google.com/file/d/1nzKwwYf8lYb9gN3P4YO8pFNU_B2nMYe1/view?usp=sharing). The blend
consists of the following open-source instruction tuning datasets:

### Instruction Tuning Dataset Breakdown

| Dataset                                                    | Samples | Epochs | Sampling Prob |
|------------------------------------------------------------|--------:|-------:|--------------:|
| [soda](https://arxiv.org/abs/2212.10465)                   |    2560 |  0.005 |         0.020 |
| [eli5](https://arxiv.org/abs/1907.09190)                   |    2561 |  0.055 |         0.020 |
| [self_instruct_short](https://arxiv.org/abs/2212.10560)    |    1280 |  0.043 |         0.010 |
| [self_instruct_long](https://arxiv.org/abs/2212.10560)     |    2560 |  0.333 |         0.020 |
| [unnatural-instructions](https://arxiv.org/abs/2212.09689) |    2560 |  0.024 |         0.020 |
| [flan_cot](https://arxiv.org/abs/2210.11416)               |    1280 |  0.093 |         0.010 |
| [dolly](https://arxiv.org/abs/2305.13735)                  |    6400 |  0.938 |         0.050 |
| [oasst-skip-noncode](https://open-assistant.io/)           |  104558 |  1.839 |         0.817 |
| [oasst-skip-code](https://open-assistant.io/)              |    4243 |  1.839 |         0.033 |
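
To make the blending concrete: at each draw, a dataset is chosen according to its sampling probability, and a training example is taken from it. A minimal sketch of that sampling scheme (the probabilities mirror the table above; the actual Megatron blendable-dataset implementation differs):

```python
import random

# Sampling probabilities from the table above (they sum to 1.0).
blend = {
    "soda": 0.020, "eli5": 0.020, "self_instruct_short": 0.010,
    "self_instruct_long": 0.020, "unnatural-instructions": 0.020,
    "flan_cot": 0.010, "dolly": 0.050,
    "oasst-skip-noncode": 0.817, "oasst-skip-code": 0.033,
}

def sample_dataset(rng):
    """Pick the dataset to draw the next training example from."""
    names, probs = zip(*blend.items())
    return rng.choices(names, weights=probs, k=1)[0]

rng = random.Random(0)
draws = [sample_dataset(rng) for _ in range(1000)]
```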

Refer to the paper links above for more details about each instruction tuning dataset.

*Note that the provided instruction tuning data comes entirely from open-source datasets. It differs slightly from
the blend used in [InstructRetro](https://arxiv.org/abs/2310.07713), which also contains private and proprietary
datasets, so a 1-2% accuracy difference on downstream tasks may be expected.*

### Instruction tuning script

Download
the [blended instruction tuning dataset](https://drive.google.com/file/d/1nzKwwYf8lYb9gN3P4YO8pFNU_B2nMYe1/view?usp=sharing)
to your data home directory `$DATA_HOME` and update our template
in [tools/retro/sft/sft_retro_lm.sh](sft/sft_retro_lm.sh).

An example command to run instruction tuning on 843M Retro is as follows:

```bash
#                                    [blend-dataset-name] [model-size] [batch-size] [lr]  [checkpoints]
bash tools/retro/sft/sft_retro_lm.sh open_inst            843m         128          5e-6  <path/to/pretrained/retro>
```

The `blend_dataset_name` argument blends all the datasets within `$DATA_HOME` according to the weights and
configurations specified in `${blend_dataset_name}.sh` ([open_inst.sh](sft/open_inst.sh) in the example above).
The checkpoints will be saved in the `--save` directory; for example,
`<SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6`.

## Step 5: Downstream task evaluation

In this step, we demonstrate how to run InstructRetro for zero-shot evaluation on downstream question answering (QA)
tasks. We provide pre-processed open-source evaluation datasets in a unified format across tasks. The
evaluation datasets used in our paper are available for download
[here](https://drive.google.com/drive/folders/1xw-N0LJR_lIWnH6BKzHIb49quVCS_V72?usp=sharing). Please stick to
the same retro workdir used in Steps 0-4 so that the preprocessed retrieval neighbors match the pretraining corpus.
If you are coming directly to Step 5, an example retro workdir with `args.json` for the 843M Retro is
provided [here](https://drive.google.com/file/d/121GqAdMvf8bJEBZRt-SD4uhW-SRWgI3s/view?usp=sharing). Note that the args
in the JSON can be overridden via the command line.

We present an example command to run Retro generation given the InstructRetro checkpoints on the Natural Questions (NQ)
task. The example command is for the 843M InstructRetro obtained in Step 4; please specify the directory for the NQ
dataset and update the command accordingly for other checkpoints.

```bash
bash tools/retro/text_generation/retro_generate.sh nq 843m greedy test  0 20000 1000 5 pp1 <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6 2
```

The generated responses will be saved in the corresponding checkpoint directory. For example, for the 843m
InstructRetro, it will be saved to
`<SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6/retro-generate-nq_5_2_843m_test_greedy_0_20000_1000.txt`.

To evaluate the F1 / Exact Match (EM) scores of the generated responses, we provide an example script to run the
evaluation on the NQ dataset. Please specify the directory for the NQ dataset and update the command accordingly for
other checkpoints and downstream tasks.
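
For context, QA evaluation of this kind normalizes both strings (lowercasing, dropping articles and punctuation) and then computes exact match plus token-level F1. A simplified sketch of those metrics (a stand-in for illustration, not the exact logic of `evaluate.py`):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```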

```bash
python3 tools/retro/text_generation/evaluate.py
```

# Citations

See our papers for more details:

[Shall we Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study.](https://arxiv.org/abs/2304.06762)

_Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei
Xiao, Anima Anandkumar, Bryan Catanzaro._ (EMNLP 2023)

[InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.](https://arxiv.org/abs/2310.07713)

_Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro._

Please cite the papers as follows if you use the data or code from this repo:

```bibtex
@inproceedings{wang2023shall,
    title     = {Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study},
    author    = {Boxin Wang and Wei Ping and Peng Xu and Lawrence McAfee and Zihan Liu and Mohammad Shoeybi and Yi Dong and Oleksii Kuchaiev and Bo Li and Chaowei Xiao and Anima Anandkumar and Bryan Catanzaro},
    booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
    year      = {2023}
}

@article{wang2023instructretro,
    title   = {InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining},
    author  = {Boxin Wang and Wei Ping and Lawrence McAfee and Peng Xu and Bo Li and Mohammad Shoeybi and Bryan Catanzaro},
    year    = {2023},
    journal = {arXiv preprint arXiv:2310.07713}
}
```