# Fine-tune Cross-Encoder
In this document, we show how to fine-tune the cross-encoder reranker with your own data.

## Installation
Refer to [installation](../../README.md#环境配置).

## Data Format
The data format for the reranker is the same as for [embedding fine-tuning](../../examples/finetune/README.md#数据集).
In addition, we strongly recommend [mining hard negatives](../../examples/finetune/README.md) to fine-tune the reranker.
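As a quick orientation, each line of the training file is one JSON object with a query, its positive passages, and its negative passages (the format described in the embedding fine-tune README linked above). A minimal sketch of writing and validating one such line; the concrete passages here are toy data, not from any real dataset:

```python
import json

# A toy training sample in the shared embedding/reranker fine-tune format:
# one query, a list of positive passages, and a list of negative passages.
sample = {
    "query": "what is panda?",
    "pos": ["The giant panda is a bear species endemic to China."],
    "neg": ["hi", "Pandas is a Python data-analysis library."],
}

# Each line of the .jsonl training file is one such JSON object.
with open("toy_finetune_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Sanity-check: read the line back and verify the required keys are present.
with open("toy_finetune_data.jsonl", encoding="utf-8") as f:
    loaded = json.loads(f.readline())
assert set(loaded) == {"query", "pos", "neg"}
```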

## Training
```shell
torchrun --nproc_per_node {number of gpus} \
    -m FlagEmbedding.reranker.run \
    --output_dir {path to save model} \
    --model_name_or_path BAAI/bge-reranker-base \
    --train_data ./toy_finetune_data.jsonl \
    --learning_rate 6e-5 \
    --fp16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size {batch size; set 1 for toy data} \
    --gradient_accumulation_steps 4 \
    --dataloader_drop_last True \
    --train_group_size 16 \
    --max_len 512 \
    --weight_decay 0.01 \
    --logging_steps 10
```

**Some important arguments**:
- `per_device_train_batch_size`: batch size per device during training.
- `train_group_size`: the number of positive and negative samples per query during training. There is always exactly one positive, so this argument controls the number of negatives (#negatives = train_group_size - 1). Note that the number of negatives must not exceed the length of `"neg": List[str]` in your data. Besides the negatives in this group, in-batch negatives are also used in fine-tuning.
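The effect of `train_group_size` can be sketched as follows. This is a simplified illustration, not the library's actual sampler; `build_group` is a hypothetical helper written for this example:

```python
import random

def build_group(sample: dict, train_group_size: int = 16, seed: int = 0) -> list:
    """Sketch: one positive + (train_group_size - 1) negatives per query."""
    rng = random.Random(seed)
    pos = rng.choice(sample["pos"])     # always exactly one positive
    num_neg = train_group_size - 1      # #negatives = train_group_size - 1
    # The sample's "neg" list must contain at least num_neg passages;
    # random.sample raises ValueError if there are not enough to draw from.
    negs = rng.sample(sample["neg"], num_neg)
    return [pos] + negs

sample = {"query": "q", "pos": ["p"], "neg": [f"n{i}" for i in range(15)]}
group = build_group(sample, train_group_size=16)
assert len(group) == 16
```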

For explanations of the other arguments, see [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

### Model merging via [LM-Cocktail](../../LM_Cocktail) [optional]
For more details, please refer to [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).

Fine-tuning a base model can improve its performance on the target task, but it can also cause severe degradation of the model's general capabilities outside the target domain (e.g., lower performance on C-MTEB tasks).
By merging the fine-tuned model with the base model, LM-Cocktail can significantly improve performance on downstream tasks while maintaining performance on other unrelated tasks.

```python
from LM_Cocktail import mix_models, mix_models_with_data

# Mix fine-tuned model and base model; then save it to output_path: ./mixed_model_1
model = mix_models(
    model_names_or_paths=["BAAI/bge-reranker-base", "your_fine-tuned_model"],
    model_type='reranker',
    weights=[0.5, 0.5],  # you can change the weights to get a better trade-off.
    output_path='./mixed_model_1')
```

### Load your model
#### Using FlagEmbedding
```python
from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)  # use fp16 to speed up computation

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```

#### Using Huggingface transformers
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
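The scores printed above are raw logits and can be any real number; higher means more relevant. If you want values in the (0, 1) range, a common convention (not something the model enforces) is to pass them through a sigmoid. A minimal sketch with stdlib math; the example logit values are made up for illustration:

```python
import math

def normalize(score: float) -> float:
    """Map an unbounded reranker logit to the (0, 1) range with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-score))

# Higher logits map to values closer to 1 (more relevant).
print(normalize(-2.3), normalize(5.1))
```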