# Fine-tuning

## Environment
It is recommended that you create a new environment:
```bash
cd FlagEmbedding/llm_embedder

conda env create -f environment.yaml --name llm-embedder
conda activate llm-embedder
```

To use BM25, you must download **java11** and **anserini**, then add java to your `PATH`:
```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz

cd /data
tar -xzvf java11.tar.gz
tar -xzvf anserini.tar.gz

# the lines below only set JAVA_HOME for the current shell; it is RECOMMENDED that you add them to ~/.bashrc
export JAVA_HOME=/data/jdk-11.0.2
export PATH=$JAVA_HOME/bin:$PATH
```

## Data
Download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`, which results in a folder `/data/llm-embedder`:
```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz

cd /data
tar -xzvf llm-embedder.tar.gz
```

The QReCC corpus for conversational search is too large (54M passages) to bundle, so we upload it separately to the Hugging Face dataset [namespace-Pt/qrecc-corpus](https://huggingface.co/datasets/namespace-Pt/qrecc-corpus). To evaluate conversational search, load it and save it as a JSON file in the `qrecc` folder:
```python
import datasets
# load dataset
qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train")
# save to jsonline format in YOUR data folder
qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records")
```

The data formats for training and evaluation are as follows:

```python
# training
{
  "query": str,
  "pos": List[str],
  "neg": List[str],
  "pos_index": Optional[List[int]],         # Indices of the positives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
  "neg_index": Optional[List[int]],         # Indices of the negatives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
  "teacher_scores": Optional[List[float]],  # Scores from an LM or a reranker, used for distillation.
  "answers": Optional[List[str]],           # List of answers for the query, used for LM scoring.
}

# evaluation
{
  "query": str,
  "pos_index": Optional[List[int]],         # Indices of the positives w.r.t. corpus. When there is no positives pre-defined (e.g. NQ), just ignore this field.
  "answers": Optional[List[str]],           # List of answers for computing NQ metrics.
  "key": Optional[List[str]],               # Retrieval results of the query. Usually used for RAG or reranking.
  "key_index": Optional[List[int]],         # Key indices w.r.t. the corpus.
}
```
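
For reference, here is a minimal sketch (not part of the repo) of how a custom training file in this format could be written. The record content and output path are illustrative; per the format above, only `query`, `pos`, and `neg` are required, and the optional fields can be added when a global corpus or teacher scores are available.

```python
# Illustration only: write training records as JSON lines in the format above.
import json

example = {
    "query": "who wrote the declaration of independence",
    "pos": ["Thomas Jefferson was the principal author of the Declaration of Independence."],
    "neg": ["The Constitution was drafted at the Philadelphia Convention in 1787."],
    # pos_index / neg_index / teacher_scores / answers are optional (see above).
}

with open("my_train.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")  # one JSON object per line
```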

## Retriever
Below are several important arguments for training. The meaning and usage of other arguments can be found in the [code](../src/retrieval/args.py) or by running `python run_dense.py --help` from the command line.
- `train_data`: required, one json file or a list of json files in the training format described above.
- `eval_data`: optional, one json file in the evaluation format described above. If `eval_data` is specified, the trainer automatically evaluates on it during training.
- `corpus`: optional, the global corpus that the `pos_index`/`neg_index` fields refer to; it also serves as the retrieval pool during evaluation.

**IMPORTANT NOTE**
- For any path specified for `train_data`, `eval_data`, and `corpus`: if it is prefixed with `llm-embedder`, it will be resolved relative to [`data_root`](../src/retrieval/args.py). *Note that you can modify the default value of `data_root` so that you don't need to type it for each command.*
- During fine-tuning, we save the output model in the `huggingface transformers`🤗 format. To use it from `sentence_transformers`, convert it to a `sentence_transformers` checkpoint first:
  ```bash
  python scripts/ours2st.py --encoder data/outputs/your-output-dir/encoder
  ```
  Then everything is the same as described in the [README](../README.md); a minimal loading sketch follows below.
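
  Once converted, the checkpoint loads like any other `sentence_transformers` model. The snippet below is only a usage sketch: the path is a placeholder, so point it at wherever `scripts/ours2st.py` writes the converted checkpoint.
  ```python
  # Usage sketch; the checkpoint path below is a placeholder.
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("path/to/converted/encoder")  # adjust to your converted checkpoint
  embeddings = model.encode(["an example query"], normalize_embeddings=True)
  print(embeddings.shape)
  ```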

### LLM-Embedder (Multi-Task Fine-Tune)
```bash
# Remember to set data_root in the script to your own data root :)
bash scripts/llm-embedder.sh
```

### Single Task Fine-Tune
Below we provide commands to fine-tune a retriever on a single task.

#### QA
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/nq \
--train_data llm-embedder:qa/nq/train.json \
--eval_data llm-embedder:qa/nq/test.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics nq \
--key_max_length 128 \
--query_max_length 32 \
--contrastive_weight 0 \
--stable_distill \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
```

#### In-Context Learning
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/icl \
--train_data llm-embedder:icl/icl/train.json \
--select_positive random \
--contrastive_weight 0 \
--stable_distill \
--save_steps 6000 \
--max_steps 6000 \
--data_root /data/llm-embedder
```

#### Long-Range Language Modeling
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/lrlm \
--train_data llm-embedder:lrlm/books3/train.json llm-embedder:lrlm/arxiv/train.json llm-embedder:lrlm/codeparrot/train.json \
--select_positive teacher \
--teacher_scores_margin 0.1 \
--contrastive_weight 0 \
--teacher_temperature 0.1 \
--save_steps 4000 \
--max_steps 4000 \
--data_root /data/llm-embedder
```

#### Long Chat
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/msc \
--train_data llm-embedder:chat/msc/train.json \
--select_positive teacher \
--select_negative random \
--contrastive_weight 0 \
--teacher_temperature 0.1 \
--save_steps 4000 \
--max_steps 4000 \
--data_root /data/llm-embedder
```

#### Tool
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/tool \
--train_data llm-embedder:tool/toolbench/train.json \
--eval_data llm-embedder:tool/toolbench/test.json \
--corpus llm-embedder:tool/toolbench/corpus.json \
--key_template '{text}' \
--metrics ndcg \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
```

#### Conversational Search
```bash
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/qrecc \
--train_data llm-embedder:conversation/qrecc/train.concat.json \
--eval_data llm-embedder:conversation/qrecc/test.concat.json \
--corpus llm-embedder:conversation/qrecc/corpus.json \
--key_template '{text}' \
--metrics mrr ndcg \
--cutoffs 3 10 100 \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
```

### Mine Negatives
```bash
# BGE (the result will be saved at llm-embedder:qa/nq/train.neg.bge.json)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
--eval_data llm-embedder:qa/nq/train.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics mrr recall collate_neg \
--save_name bge \
--data_root /data/llm-embedder

# BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untarred anserini.tar.gz)
torchrun --nproc_per_node 8 -m evaluation.eval_retrieval \
--anserini_dir /data/anserini \
--retrieval_method bm25 \
--eval_data llm-embedder:qa/nq/train.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics mrr recall collate_neg \
--save_name bm25 \
--data_root /data/llm-embedder
```

## LM Scoring
Score positives and negatives in `eval_data` with $p(o|q,k)$, where $o$ is the desired output (i.e. the `answers` field), $q$ is the query, and $k$ is a key (either a positive or a negative).
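
The quantity being scored is (the log of) the probability the LM assigns to the answer tokens conditioned on the key and the query. The actual prompt template and batching are defined in `run_lm_score.py`; the sketch below is an illustration only, and its prompt layout is an assumption.

```python
# Illustration only: how log p(o | q, k) can be computed with a causal LM.
# The prompt layout is an assumption; run_lm_score.py defines the real template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

def lm_score(query: str, key: str, answer: str) -> float:
    prompt = f"{key}\n{query}\n"                                   # assumed layout
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full).logits
    # Token at position i is predicted from positions < i, hence the shift.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sum over the answer tokens only (prompt_len is approximate when tokenization
    # merges across the prompt/answer boundary).
    return token_log_probs[0, prompt_len - 1:].sum().item()
```

The actual scoring command is: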

```bash
torchrun --nproc_per_node=8 run_lm_score.py \
--eval_data llm-embedder:qa/msmarco/train.json \
--data_root /data/llm-embedder \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--save_name llama2-7b-chat
```
Results will be saved at `/data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json`.
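
The scored file presumably supplies the `teacher_scores` used for distillation (see the training format above). A quick way to inspect it, assuming it keeps the JSON-lines layout of the training data:

```python
# Print the fields of the first scored record to see what was added.
import json

path = "/data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json"
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))
print(sorted(first.keys()))  # expect the training fields plus LM scores
```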


## Known Issues
- `transformers==4.30.0` raises an error when using a DeepSpeed scheduler config
  - modify line `1750` in `trainer.py` as follows:
  ```python
    if use_accelerator_prepare:
        # NOTE: fix bug in transformers 4.30.0
        # model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
        self.model.train()
        if hasattr(self.lr_scheduler, "step"):
            if self.use_apex:
                model = self.accelerator.prepare(self.model)
            else:
                model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
        else:
            # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
            model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
                self.model, self.optimizer, self.lr_scheduler
            )
  ```