# Semantic Search
Semantic search seeks to improve search accuracy by understanding the semantic meaning of the search query and the corpus to search over. Semantic search can also perform well given synonyms, abbreviations, and misspellings, unlike keyword search engines that can only find documents based on lexical matches.

## Background
The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic similarity with the query.
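
As a minimal illustration of this idea, here is a sketch with hand-made 3-dimensional vectors standing in for real model embeddings (in practice you would compute them with an embedding model):

```python
import math

# Toy stand-in for a real embedding model: these hand-made 3-d vectors
# are purely illustrative, not actual model outputs.
corpus = {
    "A man is eating food.":      [0.9, 0.1, 0.0],
    "A man is riding a horse.":   [0.1, 0.9, 0.1],
    "A cheetah is chasing prey.": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embedding of the query "A man is eating pasta."
query_embedding = [0.8, 0.2, 0.1]

# Rank all corpus entries by cosine similarity to the query.
ranked = sorted(corpus, key=lambda s: cosine(corpus[s], query_embedding), reverse=True)
print(ranked[0])  # → "A man is eating food."
```

The entry whose embedding lies closest to the query embedding in the vector space is returned as the best match.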

![SemanticSearch](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png) 

## Symmetric vs. Asymmetric Semantic Search

A **critical distinction** for your setup is *symmetric* vs. *asymmetric semantic search*:
- For **symmetric semantic search**, your query and the entries in your corpus are of about the same length and carry the same amount of content. An example would be searching for similar questions: your query could be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus.
    - Related training example: [Quora Duplicate Questions](../../training/quora_duplicate_questions/README.md).
    - Suitable models: [Pre-Trained Sentence Embedding Models](../../../docs/sentence_transformer/pretrained_models#sentence-embedding-models)
- For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like *"What is Python"* and you want to find the paragraph *"Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy ..."*. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.
    - Related training example: [MS MARCO](../../training/ms_marco/README.md)
    - Suitable models: [Pre-Trained MS MARCO Models](../../../docs/pretrained-models/msmarco-v3.md)

It is critical **that you choose the right model** for your type of task.

## Manual Implementation

For small corpora (up to about 1 million entries), we can perform semantic search with a manual implementation by computing the embeddings for the corpus as well as for our query, and then calculating the [semantic textual similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) using [<code>SentenceTransformer.similarity</code>](../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity).
For a simple example, see [semantic_search.py](semantic_search.py):

```eval_rst

.. sidebar:: Output

   .. code-block:: txt

        Query: A man is eating pasta.
        Top 5 most similar sentences in corpus:
        A man is eating food. (Score: 0.7035)
        A man is eating a piece of bread. (Score: 0.5272)
        A man is riding a horse. (Score: 0.1889)
        A man is riding a white horse on an enclosed ground. (Score: 0.1047)
        A cheetah is running behind its prey. (Score: 0.0980)

        Query: Someone in a gorilla costume is playing a set of drums.
        Top 5 most similar sentences in corpus:
        A monkey is playing drums. (Score: 0.6433)
        A woman is playing violin. (Score: 0.2564)
        A man is riding a horse. (Score: 0.1389)
        A man is riding a white horse on an enclosed ground. (Score: 0.1191)
        A cheetah is running behind its prey. (Score: 0.1080)

        Query: A cheetah chases prey on across a field.
        Top 5 most similar sentences in corpus:
        A cheetah is running behind its prey. (Score: 0.8253)
        A man is eating food. (Score: 0.1399)
        A monkey is playing drums. (Score: 0.1292)
        A man is riding a white horse on an enclosed ground. (Score: 0.1097)
        A man is riding a horse. (Score: 0.0650)

.. literalinclude:: semantic_search.py
```


## Optimized Implementation

Instead of implementing semantic search by yourself, you can use the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) function.

The function accepts the following parameters:

```eval_rst
.. autofunction:: sentence_transformers.util.semantic_search
```

By default, up to 100 queries are processed in parallel, and the corpus is chunked into sets of up to 500k entries. You can increase ``query_chunk_size`` and ``corpus_chunk_size`` to speed up search on large corpora, at the cost of higher memory usage.
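
The chunking idea can be sketched in plain Python. This is an illustrative re-implementation, not the library's actual code: only one (query chunk × corpus chunk) block of scores is considered at a time, and a small per-query heap keeps the running top-k, so memory stays bounded regardless of corpus size:

```python
import heapq

def chunked_semantic_search(query_embs, corpus_embs, score, top_k=10,
                            query_chunk_size=100, corpus_chunk_size=500_000):
    """Illustrative sketch of chunked top-k search. `score` is any
    similarity function (e.g. dot product on normalized embeddings)."""
    results = [[] for _ in query_embs]  # one min-heap of (score, idx) per query
    for q_start in range(0, len(query_embs), query_chunk_size):
        for c_start in range(0, len(corpus_embs), corpus_chunk_size):
            for qi in range(q_start, min(q_start + query_chunk_size, len(query_embs))):
                for ci in range(c_start, min(c_start + corpus_chunk_size, len(corpus_embs))):
                    s = score(query_embs[qi], corpus_embs[ci])
                    heap = results[qi]
                    if len(heap) < top_k:
                        heapq.heappush(heap, (s, ci))
                    elif s > heap[0][0]:
                        heapq.heapreplace(heap, (s, ci))
    # Sort each query's hits by descending score.
    return [sorted(h, reverse=True) for h in results]

# Tiny usage example with a dot-product scorer:
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
hits = chunked_semantic_search([[1.0, 0.0]], [[0.9, 0.1], [0.0, 1.0]], dot, top_k=1)
print(hits[0])  # → [(0.9, 0)]
```

The real implementation vectorizes the inner loops on the GPU; the chunk sizes play the same role there of trading memory for speed.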

## Speed Optimization
To get optimal speed with the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU device; this significantly boosts performance. Further, we can normalize the corpus embeddings so that each corpus embedding has length 1, in which case we can use the dot product for computing scores:

```python
corpus_embeddings = corpus_embeddings.to("cuda")
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)

query_embeddings = query_embeddings.to("cuda")
query_embeddings = util.normalize_embeddings(query_embeddings)
hits = util.semantic_search(query_embeddings, corpus_embeddings, score_function=util.dot_score)
```
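
Why this works: for unit-length vectors, the dot product equals cosine similarity, so the cheaper dot product can replace the full cosine computation. A small self-contained check:

```python
import math

def normalize(v):
    # Scale a vector to length 1.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]

# Full cosine similarity: dot product divided by both norms.
cos_sim = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Dot product of pre-normalized vectors: the norms are already 1.
dot_normed = dot(normalize(a), normalize(b))

print(math.isclose(cos_sim, dot_normed))  # → True
```

Since the corpus embeddings are normalized once up front, every subsequent query skips the per-entry norm computations.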

## Elasticsearch
[Elasticsearch](https://www.elastic.co/elasticsearch/) has the possibility to [index dense vectors](https://www.elastic.co/what-is/vector-search) and to use them for document scoring. We can easily index embedding vectors, store other data alongside our vectors and, most importantly, efficiently retrieve relevant entries using [approximate nearest neighbor search](https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0) (HNSW, see also below) on the embeddings.

For further details, see [semantic_search_quora_elasticsearch.py](semantic_search_quora_elasticsearch.py).
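
For illustration, here is a sketch of what an index mapping and kNN query body might look like with the Elasticsearch 8.x `dense_vector` type. The index/field names and the dimensionality (384) are assumptions for this example, not taken from the script:

```python
# Hypothetical Elasticsearch 8.x request bodies; field names ("question",
# "question_vector") and dims=384 are illustrative assumptions.
mapping = {
    "properties": {
        "question": {"type": "text"},
        "question_vector": {
            "type": "dense_vector",
            "dims": 384,             # must match the embedding model's dimension
            "index": True,
            "similarity": "cosine",  # backed by approximate (HNSW) kNN
        },
    }
}

knn_query = {
    "knn": {
        "field": "question_vector",
        "query_vector": [0.0] * 384,  # replace with the encoded query
        "k": 10,
        "num_candidates": 100,        # recall-speed trade-off
    }
}
```

These dicts would be passed to the Elasticsearch client when creating the index and searching, respectively; see the linked example script for a complete, working version.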


## Approximate Nearest Neighbor
Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search)).

In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed.

For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases.
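
The partitioning idea can be illustrated with a toy inverted-file (IVF-style) index in plain Python. Real libraries learn the centroids (e.g. with k-means) and let you tune how many partitions to probe, which is the recall-speed knob described above; here the centroids are hand-picked and only one partition is scanned:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hand-picked partition centroids (real libraries learn these).
centroids = [[1.0, 0.0], [0.0, 1.0]]
corpus = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]]

def nearest_centroid(v):
    return max(range(len(centroids)), key=lambda i: dot(v, centroids[i]))

# Build the index: assign each corpus vector to its nearest centroid.
partitions = {i: [] for i in range(len(centroids))}
for idx, v in enumerate(corpus):
    partitions[nearest_centroid(v)].append(idx)

def ann_search(query):
    # Scan only the query's closest partition; a true nearest neighbor
    # that happens to sit in another partition would be missed.
    bucket = partitions[nearest_centroid(query)]
    return max(bucket, key=lambda idx: dot(corpus[idx], query))

print(ann_search([1.0, 0.2]))  # → 0 (scans 2 of the 4 vectors)
```

Probing more partitions raises recall at the cost of scanning more vectors, which is exactly the trade-off the tuning parameters of Annoy, FAISS, and hnswlib control.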

Three popular libraries for approximate nearest neighbor are [Annoy](https://github.com/spotify/annoy), [FAISS](https://github.com/facebookresearch/faiss), and [hnswlib](https://github.com/nmslib/hnswlib/).

Examples:

- [semantic_search_quora_hnswlib.py](semantic_search_quora_hnswlib.py)
- [semantic_search_quora_annoy.py](semantic_search_quora_annoy.py)
- [semantic_search_quora_faiss.py](semantic_search_quora_faiss.py)

## Retrieve & Re-Rank
For complex semantic search scenarios, a two-stage retrieve & re-rank pipeline is advisable:
![InformationRetrieval](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/InformationRetrieval.png)

For further details, see [Retrieve & Re-rank](../retrieve_rerank/README.md).
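
As a toy illustration of the two stages (with simple stand-ins for the learned bi-encoder and cross-encoder scorers): a cheap vector scorer shortlists candidates, and a more expensive scorer then re-ranks only that shortlist:

```python
# Illustrative data and scorers; real pipelines use a bi-encoder for
# stage 1 and a cross-encoder for stage 2.
corpus = ["python programming language", "python snake species",
          "java programming language"]
corpus_embs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.2]]

def retrieve(query_emb, top_k=2):
    # Stage 1: fast dot-product retrieval over precomputed embeddings.
    scored = sorted(range(len(corpus)),
                    key=lambda i: sum(q * c for q, c in zip(query_emb, corpus_embs[i])),
                    reverse=True)
    return scored[:top_k]

def cross_score(query, doc):
    # Stage 2 stand-in for a cross-encoder: simple token overlap.
    return len(set(query.split()) & set(doc.split()))

def search(query, query_emb):
    candidates = retrieve(query_emb)  # cheap, over the whole corpus
    return max(candidates, key=lambda i: cross_score(query, corpus[i]))  # expensive, on few

print(corpus[search("what is the python language", [0.8, 0.2])])
# → "python programming language"
```

The expensive scorer sees each (query, document) pair jointly and is therefore more accurate, but it only needs to run on the handful of candidates the first stage returns.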

## Examples

We list a handful of common use cases:

### Similar Questions Retrieval
[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method. As the model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus.

### Similar Publication Retrieval
[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows how to find similar scientific publications. As corpus, we use all publications that were presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications in our corpus. We use the [SPECTER](https://huggingface.co/sentence-transformers/allenai-specter) model. This is a **symmetric search task**, as the papers in the corpus consist of title & abstract, and we also search with title & abstract.

### Question & Answer Retrieval
[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://huggingface.co/datasets/sentence-transformers/natural-questions). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory.

[retrieve_rerank_simple_wikipedia.ipynb](../retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the [Retrieve & Re-rank](../retrieve_rerank/README.md) strategy and is an example for an **asymmetric search task**. We split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine-similarity are retrieved. Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user. We use models that were trained on the [MS Marco Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset, a dataset with about 500k real queries from Bing search.