# Fine-tune Cross-Encoder
In this document, we will show how to fine-tune the cross-encoder reranker with your own dataset.

## Environment Setup
Refer to [Environment Setup](../../README.md#环境配置).
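If you only need the library itself, the setup typically comes down to installing FlagEmbedding from PyPI (see the linked README for the authoritative steps):

```
pip install -U FlagEmbedding
```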

## Data Format
The data format for the reranker is the same as for [embedding fine-tune](../../examples/finetune/README.md#数据集).

In addition, we strongly recommend following [mine hard negatives](../../examples/finetune/README.md) to mine hard negatives for fine-tuning the reranker.
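For reference, each line of the training file is a JSON dict with a query string, a list of positive passages, and a list of negative passages (the exact field names are described in the linked embedding fine-tune docs); a minimal example looks like this:

```json
{"query": "Five women walk along a beach wearing flip-flops.", "pos": ["Some women with flip-flops on, are walking along the beach"], "neg": ["The 4 women are sitting on the beach.", "There was a reform in 1996."]}
```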

## Train
```
torchrun --nproc_per_node {number of gpus} \
    -m FlagEmbedding.reranker.run \
    --output_dir {path to save model} \
    --model_name_or_path BAAI/bge-reranker-base \
    --train_data ./toy_finetune_data.jsonl \
    --learning_rate 6e-5 \
    --fp16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size {batch size; set 1 for toy data} \
    --gradient_accumulation_steps 4 \
    --dataloader_drop_last True \
    --train_group_size 16 \
    --max_len 512 \
    --weight_decay 0.01 \
    --logging_steps 10
```

**Some important arguments**:
- `per_device_train_batch_size`: batch size per device during training.
- `train_group_size`: the number of positive and negative passages per query during training. There is always one positive, so this argument controls the number of negatives (#negatives = train_group_size - 1). Note that the number of negatives must not exceed the number of passages in the data's `"neg": List[str]`; a small sanity-check sketch is given below. Besides the negatives in each group, in-batch negatives are also used during fine-tuning.

More argument descriptions can be found in [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
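As a quick way to verify the `train_group_size` constraint before launching training, a minimal sketch (assuming the data file is the `toy_finetune_data.jsonl` used in the command above) could be:

```python
import json

TRAIN_GROUP_SIZE = 16  # must match --train_group_size

# Check that every example provides at least train_group_size - 1 negatives,
# since one slot in each group is always taken by the positive passage.
with open("toy_finetune_data.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        if len(example["neg"]) < TRAIN_GROUP_SIZE - 1:
            print(f"line {i}: only {len(example['neg'])} negatives, need {TRAIN_GROUP_SIZE - 1}")
```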

### Model merging via [LM-Cocktail](../../LM_Cocktail) [optional]
For more details, please refer to [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).

Fine-tuning the base model can improve its performance on the target task, but it may also lead to severe degradation of the model's general capability beyond the target domain (e.g., lower performance on C-MTEB tasks).
By merging the fine-tuned model with the base model, LM-Cocktail can significantly improve performance on downstream tasks while maintaining performance on other unrelated tasks.

```python
from LM_Cocktail.LM_Cocktail import mix_models, mix_models_with_data

# Mix fine-tuned model and base model; then save it to output_path: ./mixed_model_1
model = mix_models(
    model_names_or_paths=["BAAI/bge-reranker-base", "your_fine-tuned_model"],
    model_type='reranker',
    weights=[0.5, 0.5],  # you can change the weights to get a better trade-off.
    output_path='./mixed_model_1')
```
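The merged reranker can then be loaded like any other checkpoint, for example with `FlagReranker` as shown in the next section (assuming the merge above was run and the result was saved to `./mixed_model_1`):

```python
from FlagEmbedding import FlagReranker

# Load the merged model from the output_path used above.
reranker = FlagReranker('./mixed_model_1', use_fp16=True)
print(reranker.compute_score(['what is panda?', 'The giant panda is a bear species endemic to China.']))
```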

### Load your model
#### Using FlagEmbedding
```python
from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)  # use_fp16 speeds up computation; replace the model name with the path of your fine-tuned model to load it instead

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```
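`compute_score` returns one relevance score per (query, passage) pair; a higher score indicates that the passage is more relevant to the query.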

#### Using Huggingface transformers
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
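The logits above are raw relevance scores. If you want scores bounded in (0, 1), one option is to map them through a sigmoid as a post-processing step, continuing from the snippet above:

```python
# Optional: squash raw logits into (0, 1) for easier thresholding.
probabilities = torch.sigmoid(scores)
print(probabilities)
```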