# Embedding Model
Unlike most embedding models, which use mean pooling, BGE uses the representation of the `[CLS]` token as the sentence embedding: `sentence_embeddings = model_output[0][:, 0]`.
If you use mean pooling instead, quality degrades significantly, so be sure to extract sentence embeddings the correct way. You can refer to the usage examples below.
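
A minimal sketch of the difference, assuming `model_output` is the output of a Hugging Face transformer model (its first element is the last hidden state, of shape `[batch, seq_len, hidden]`):

```python
cls_embeddings = model_output[0][:, 0]          # [CLS] pooling -- what BGE expects
mean_embeddings = model_output[0].mean(dim=1)   # mean pooling -- significantly worse for BGE
```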

## Inference
```python
from FlagEmbedding import FlagModel
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For s2p (short query to long passage) retrieval tasks, use encode_queries(),
# which automatically prepends the instruction to each query.
# Passages can still be encoded with encode() or encode_corpus(), since they need no instruction.
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```
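
By default, `FlagModel` L2-normalizes the embeddings it returns, so the matrix products above are cosine similarities.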

For the value of the `query_instruction_for_retrieval` parameter, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).

By default, `FlagModel` uses all available GPUs when encoding. To restrict it to specific GPUs, set the `os.environ["CUDA_VISIBLE_DEVICES"]` environment variable; setting `os.environ["CUDA_VISIBLE_DEVICES"]=""` makes all GPUs unavailable.
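
For example, a minimal sketch that pins encoding to a single GPU (the variable must be set before the model is created):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only GPU 0; "" disables all GPUs

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh-v1.5', use_fp16=True)
```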


### Using Sentence-Transformers
You can also use the `bge` models with [sentence-transformers](https://www.SBERT.net):

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
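# normalize the embeddings so the dot products below are cosine similarities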
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

For s2p (short query to long passage) retrieval tasks, each short query should be prefixed with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instruction to use). No instruction is needed for the long passages.
```python
from sentence_transformers import SentenceTransformer
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```

### Using LangChain
You can use `bge` in LangChain as follows:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = "BAAI/bge-large-zh-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True} # normalize embeddings so similarity scores are cosine similarities
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="为这个句子生成表示以用于检索相关文章:"
)
```
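
A brief usage sketch: LangChain embedding objects expose `embed_query`, which applies `query_instruction` to the text, and `embed_documents`, which embeds passages without it:

```python
q_embedding = model.embed_query("query_1")                         # returns List[float]
p_embeddings = model.embed_documents(["样例文档-1", "样例文档-2"])  # returns List[List[float]]
```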

### Using HuggingFace Transformers
With the `transformers` package, you can use the model as follows: pass your inputs through the transformer model, then take the last hidden state of the first token (the [CLS] token) as the sentence embedding.

```python
from transformers import AutoTokenizer, AutoModel
import torch
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add the instruction to each query (passages need no instruction):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```
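
Since the resulting embeddings are L2-normalized, similarity between two batches of sentences is again just a matrix product, e.g. `embeddings_1 @ embeddings_2.T`.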

## Evaluation
The `baai-general-embedding` models achieve **state-of-the-art performance on both the MTEB and C-MTEB leaderboards!**
For more evaluation details and scripts, see [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md).
If you want to evaluate open-source models (or your own model) on **your own data**, see [here](../../examples/finetune).