# Embedding Model

Unlike other embedding models that use mean pooling, BGE uses the representation of the `[CLS]` token as the sentence embedding: `sentence_embeddings = model_output[0][:, 0]`. If you use mean pooling instead, performance will degrade significantly. Therefore, make sure to obtain sentence embeddings the correct way; you can refer to the usage examples we provide.

## Inference

```python
from FlagEmbedding import FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True)  # Setting use_fp16 to True speeds up computation at the cost of a slight performance degradation
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For s2p (short query to long passage) retrieval tasks, we suggest using encode_queries(),
# which automatically adds the instruction to each query.
# The corpus can still be encoded with encode() or encode_corpus(), since passages do not need the instruction.
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```

For the value of the `query_instruction_for_retrieval` parameter, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
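The contrast between `[CLS]` pooling and mean pooling can be sketched in plain PyTorch. This is an illustrative sketch only: a random tensor stands in for the model's `last_hidden_state`, and the made-up shapes exist just to show the two pooling operations.

```python
import torch

# Stand-in for model_output[0] (last_hidden_state): [batch, seq_len, hidden]
batch, seq_len, hidden = 2, 8, 16
last_hidden_state = torch.randn(batch, seq_len, hidden)
attention_mask = torch.ones(batch, seq_len)
attention_mask[1, 5:] = 0  # pretend the second sentence is shorter (padding)

# CLS pooling (what BGE expects): the first token's hidden state
cls_embeddings = last_hidden_state[:, 0]

# Mean pooling (used by many other models): mask-aware average over tokens
mask = attention_mask.unsqueeze(-1)  # [batch, seq_len, 1]
mean_embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Both yield one vector per sentence, but the values differ;
# BGE was trained with CLS pooling, so mean pooling degrades its quality.
print(cls_embeddings.shape, mean_embeddings.shape)
```

Both tensors have shape `[batch, hidden]`; the point is that they are different vectors, so swapping in mean pooling silently changes what the model returns.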
By default, FlagModel uses all available GPUs during encoding. You can specify particular GPUs by setting the `os.environ["CUDA_VISIBLE_DEVICES"]` environment variable; likewise, setting `os.environ["CUDA_VISIBLE_DEVICES"]=""` makes all GPUs unavailable.

### Using Sentence-Transformers

You can also use the `bge` models with [sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

For s2p (short query to long passage) retrieval tasks, each short query should be prefixed with a specific instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions). The instruction is not required for long passages.

```python
from sentence_transformers import SentenceTransformer

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```

### Using LangChain

You can use the `bge` models in LangChain as follows:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="为这个句子生成表示以用于检索相关文章:"
)
model.query_instruction = "为这个句子生成表示以用于检索相关文章:"
```

### Using HuggingFace Transformers

With the `transformers` package, you can use the model as follows: first pass the input through the transformer model, then take the last hidden state of the first token (i.e., the `[CLS]` token) as the sentence embedding.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add an instruction to each query
# (but not to the passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```

## Evaluation

The `baai-general-embedding` models achieve **state-of-the-art results on both the MTEB and C-MTEB leaderboards!** For more evaluation details and scripts, see [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md).

If you want to evaluate the open-source models (or your own model) on **your own data**, you can refer to [this guide](../../examples/finetune).