# Fine-tune Cross-Encoder

In this document, we show how to fine-tune the cross-encoder reranker with your own dataset.

## Environment Setup

Refer to [Environment Setup](../../README.md#环境配置).

## Data Format

The data format for the reranker is the same as for [embedding fine-tune](../../examples/finetune/README.md#数据集).
In addition, we strongly recommend following the hard-negative mining procedure described in [mine hard negatives](../../examples/finetune/README.md) when fine-tuning the reranker. A sample training record is shown at the end of this document.

## Train

```
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.reranker.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-reranker-base \
--train_data ./toy_finetune_data.jsonl \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {batch size; set 1 for toy data} \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--train_group_size 16 \
--max_len 512 \
--weight_decay 0.01 \
--logging_steps 10
```

**Some important arguments**:

- `per_device_train_batch_size`: batch size used during training.
- `train_group_size`: the number of positive and negative samples per query during training. There is always one positive sample, so this argument controls the number of negative samples (#negatives = train_group_size - 1). Note that the number of negatives must not exceed the number of entries in the data's `"neg": List[str]` field. Besides the negatives in this group, in-batch negatives are also used during fine-tuning.

For explanations of the other arguments, refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

### Merge models via [LM-Cocktail](../../LM_Cocktail) [optional]

For more details, refer to [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail).

Fine-tuning a base model can improve its performance on the target task, but it may also severely degrade the model's general capability beyond the target domain (e.g., lower performance on C-MTEB tasks). By merging the fine-tuned model with the base model, LM-Cocktail can significantly improve performance on the downstream task while preserving performance on other, unrelated tasks.

```python
from LM_Cocktail.LM_Cocktail import mix_models, mix_models_with_data

# Mix the fine-tuned model and the base model; then save it to output_path: ./mixed_model_1
model = mix_models(
    model_names_or_paths=["BAAI/bge-reranker-base", "your_fine-tuned_model"], 
    model_type='reranker', 
    weights=[0.5, 0.5],  # you can change the weights to get a better trade-off.
    output_path='./mixed_model_1')
```

### Load your model

#### Using FlagEmbedding

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)  # setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)
```

#### Using Huggingface transformers

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
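Note that with both loading methods above, the reranker returns raw relevance logits: unbounded real numbers where higher means more relevant. If you need scores in a fixed (0, 1) range, e.g. to filter passages against a threshold, a common post-processing step is to apply a sigmoid to the logits. A minimal sketch follows; the logit values in it are made up for illustration:

```python
import torch

# Hypothetical raw logits for two (query, passage) pairs, as returned by the
# reranker; the actual values depend on the model and the inputs.
scores = torch.tensor([-8.2, 5.3])

# A sigmoid maps the unbounded logits to (0, 1), which makes it easy to apply
# a fixed relevance threshold across different queries.
probabilities = torch.sigmoid(scores)
print(probabilities)  # approximately tensor([2.7e-04, 9.95e-01])
```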
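Finally, as referenced in the Data Format section above: each line of the JSONL training file (e.g. `./toy_finetune_data.jsonl`) follows the embedding fine-tune format, with one query, a list of positive passages, and a list of negative passages. The record below is a made-up illustration of that shape:

```
{"query": "what is panda?", "pos": ["The giant panda (Ailuropoda melanoleuca) is a bear species endemic to China."], "neg": ["hi", "The llama is a domesticated South American camelid."]}
```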