Ensure that you have [transformers](https://pypi.org/project/transformers/) installed.
SentenceTransformers provides a wrapper for the [OpenAI CLIP Model](https://github.com/openai/CLIP), which was trained on a variety of (image, text)-pairs.
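For illustration, here is a minimal sketch of using the wrapper to compare an image with candidate captions; the image path is a placeholder and `clip-ViT-B-32` is one available checkpoint:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load the CLIP wrapper model
model = SentenceTransformer("clip-ViT-B-32")

# Encode an image and candidate captions into the same vector space
img_emb = model.encode(Image.open("two_dogs_in_snow.jpg"))  # placeholder image path
text_emb = model.encode([
    "Two dogs in the snow",
    "A cat on a table",
    "A picture of London at night",
])

# Cosine similarity between the image and each caption
print(util.cos_sim(img_emb, text_emb))
```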
This is an example sentence. Dies ist ein Beispielsatz.
Usually you apply this method to large corpora; for example, to find all translated sentences in the English Wikipedia and the Chinese Wikipedia.
## Margin Based Mining
We follow the setup from [Artetxe and Schwenk, Section 4.3](https://arxiv.org/pdf/1812.10464.pdf) to find translated sentences in two datasets:
1) First, we encode all sentences to their respective embeddings. As shown in [our paper](https://arxiv.org/abs/2004.09813), [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) is currently the best method for bitext mining. The model is integrated in Sentence Transformers.
2) Once we have all embeddings, we find the *k* nearest neighbor sentences for all sentences in both directions. Typical choices for k are between 4 and 16.
3) Then, we score all possible sentence combinations using the formula mentioned in Section 4.3.
4) The pairs with the highest scores are most likely translated sentences. Note that the score can be larger than 1. Usually you have to find a cut-off threshold and ignore pairs below it. For high quality, a threshold of about 1.2 - 1.3 works quite well.
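Below is a minimal sketch of these four steps on a toy example. It assumes LaBSE via Sentence Transformers; with a real corpus you would use an ANN library (e.g. FAISS) for the k-nearest-neighbor search instead of a full similarity matrix:

```python
import torch
from sentence_transformers import SentenceTransformer

# Toy parallel data; in practice these are two large monolingual corpora
src_sentences = ["This is an example sentence.", "The weather is nice today."]
tgt_sentences = ["Das Wetter ist heute schön.", "Dies ist ein Beispielsatz."]

# Step 1: Encode all sentences (normalized embeddings, so dot product == cosine similarity)
model = SentenceTransformer("sentence-transformers/LaBSE")
src_emb = model.encode(src_sentences, convert_to_tensor=True, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True, normalize_embeddings=True)

# Step 2: k nearest neighbors in both directions (toy k; use 4-16 for real corpora)
k = 2
cos = src_emb @ tgt_emb.T                                # cosine similarity matrix
nn_src = torch.topk(cos, k, dim=1).values.mean(dim=1)    # mean sim of each source to its k NNs
nn_tgt = torch.topk(cos.T, k, dim=1).values.mean(dim=1)  # mean sim of each target to its k NNs

# Step 3: Margin score from Artetxe & Schwenk, Section 4.3:
# margin(x, y) = cos(x, y) / ((mean_NN(x) + mean_NN(y)) / 2)
margin = cos / ((nn_src.unsqueeze(1) + nn_tgt.unsqueeze(0)) / 2)

# Step 4: Keep the best-scoring pairs; in practice, filter by a threshold of ~1.2 - 1.3
for i, j in enumerate(margin.argmax(dim=1).tolist()):
    print(f"{src_sentences[i]} <-> {tgt_sentences[j]} (margin: {margin[i, j].item():.2f})")
```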
...
...
## Examples
- **[bucc2018.py](bucc2018.py)** - This script contains an example for the [BUCC 2018 shared task](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) on finding parallel sentences. This dataset can be used to evaluate different strategies, as we know which sentences are parallel in the two corpora. The script mines for parallel sentences and then prints the optimal threshold that leads to the highest F1-score.
- **[bitext_mining.py](bitext_mining.py)** - This script reads in two text files (with a single sentence in each line) and outputs parallel sentences to *parallel-sentences-out.tsv.gz*.
- **[In-domain Data Selection for MT](https://www.clinjournal.org/clinj/article/view/137)** - This paper also employed Sentence Transformers to generate/select in-domain parallel data for machine translation systems, using monolingual texts.
Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences. In [Semantic Textual Similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) we saw a simplified version of finding paraphrases in a list of sentences. The approach presented there used a brute-force approach to score and rank all pairs.
```eval_rst
However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences. For larger collections, the :func:`~sentence_transformers.util.paraphrase_mining` function can be used::
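
    # A minimal usage sketch; "all-MiniLM-L6-v2" is one common model choice
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import paraphrase_mining

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # A single list of sentences; in practice this can contain tens of thousands
    sentences = [
        "The cat sits outside",
        "A man is playing guitar",
        "The new movie is awesome",
        "The cat plays in the garden",
        "The new movie is so great",
    ]

    paraphrases = paraphrase_mining(model, sentences)
    for score, i, j in paraphrases[0:10]:
        print(f"{sentences[i]} \t {sentences[j]} \t Score: {score:.4f}")
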
The :func:`~sentence_transformers.util.paraphrase_mining` function accepts a number of parameters that control the trade-off between speed, memory, and output size:
To optimize memory and computation time, paraphrase mining is performed in chunks, as specified by ``query_chunk_size`` and ``corpus_chunk_size``.
To be specific, only ``query_chunk_size * corpus_chunk_size`` pairs will be compared at a time, rather than ``len(sentences) * len(sentences)``. This is more time- and memory-efficient. Additionally, :func:`~sentence_transformers.util.paraphrase_mining` only considers the ``top_k`` best scores per sentence per chunk. You can experiment with this value as an efficiency-performance trade-off.
For example, with ``top_k=1``, you will get only the single most relevant other sentence for each sentence.
The final key parameter is ``max_pairs``, which determines the maximum number of paraphrase pairs that the function returns. Usually, you get fewer pairs returned because the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned.
.. note::

    If B is the most similar sentence for A, A is not necessarily the most similar sentence for B. So it can happen that the returned list contains entries like (A, B) and (B, C).
```
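As a rough sketch of how these efficiency parameters fit together (the values below are illustrative, not recommendations):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
sentences = ["The cat sits outside", "A man is playing guitar", "The feline rests outdoors"]

paraphrases = paraphrase_mining(
    model,
    sentences,
    query_chunk_size=1000,    # search paraphrases for 1,000 sentences at a time
    corpus_chunk_size=10000,  # against chunks of 10,000 corpus sentences
    top_k=100,                # keep at most 100 candidates per sentence per chunk
    max_pairs=500000,         # return at most 500k pairs before deduplication
)
print(paraphrases)
```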
In [Semantic Search](../semantic-search/README.md) we have shown how to use SentenceTransformer to compute embeddings for queries, sentences, and paragraphs and how to use this for semantic search.
For complex search tasks, for example question answering retrieval, the search can be significantly improved by using **Retrieve & Re-Rank**.
## Retrieve & Re-Rank Pipeline
The following pipeline for Information Retrieval / Question Answering Retrieval works very well. All components are provided and explained in this article:
Given a search query, we first use a **retrieval system** that retrieves a large list of e.g. 100 possible hits which are potentially relevant for the query. For the retrieval, we can use either lexical search, e.g. with Elasticsearch, or dense retrieval with a bi-encoder. However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevancy of all candidates for the given search query. The output will be a ranked list of hits we can present to the user.
## Retrieval: Bi-Encoder
For the retrieval of the candidate set, we can either use lexical search (e.g. [Elasticsearch](https://www.elastic.co/elasticsearch/)), or we can use a bi-encoder which is implemented in Sentence Transformers.
Lexical search looks for literal matches of the query words in your document collection. It will not recognize synonyms, acronyms or spelling variations. In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.
Semantic search overcomes the shortcomings of lexical search and can recognize synonyms and acronyms. Have a look at the [semantic search article](../semantic-search/README.md) for different options to implement semantic search.
## Re-Ranker: Cross-Encoder
The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates. A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a possible document are passed simultaneously to a transformer network, which then outputs a single score between 0 and 1 indicating how relevant the document is for the given query.
The advantage of Cross-Encoders is the higher performance, as they perform attention across the query and the document. Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.
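For illustration, here is a minimal sketch of the two-stage pipeline on a toy corpus. The model names below are common pre-trained choices and the corpus is made up; a real application would retrieve e.g. 100 candidates instead of 2:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "Python is an interpreted, high-level, general-purpose programming language.",
    "Java is a class-based, object-oriented programming language.",
    "The Eiffel Tower is located in Paris, France.",
]

# Stage 1: Retrieval with a bi-encoder
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "What is Python?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

# Stage 2: Re-rank the candidates with a cross-encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, corpus[hit["corpus_id"]]) for hit in hits])

# Present the hits sorted by the cross-encoder score
for score, hit in sorted(zip(cross_scores, hits), key=lambda pair: pair[0], reverse=True):
    print(f"{score:.2f}\t{corpus[hit['corpus_id']]}")
```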
## Example Scripts
...
...
The bi-encoder produces embeddings independently for your paragraphs and for your search queries.
```python
# As document, we take the first two sections of the Wikipedia article about Europe
document = """Europe is a continent located entirely in the Northern Hemisphere and mostly in the Eastern Hemisphere. It comprises the westernmost part of Eurasia and is bordered by the Arctic Ocean to the north, the Atlantic Ocean to the west, the Mediterranean Sea to the south, and Asia to the east. Europe is commonly considered to be separated from Asia by the watershed of the Ural Mountains, the Ural River, the Caspian Sea, the Greater Caucasus, the Black Sea, and the waterways of the Turkish Straits. Although some of this border is over land, Europe is generally accorded the status of a full continent because of its great physical size and the weight of history and tradition."""
```
Semantic search seeks to improve search accuracy by understanding the semantic meaning of the search query and the corpus to search over. Semantic search can also perform well given synonyms, abbreviations, and misspellings, unlike keyword search engines that can only find documents based on lexical matches.
## Background
The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic similarity with the query.
A **critical distinction** for your setup is *symmetric* vs. *asymmetric semantic search*:
- For **symmetric semantic search**, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be *"How to learn Python online?"* and you want to find an entry like *"How to learn Python on the web?"*. For symmetric tasks, you could potentially flip the query and the entries in your corpus.
- Related training example: [Quora Duplicate Questions](../../training/quora_duplicate_questions/README.md).
- For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like *"What is Python"* and you want to find the paragraph *"Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy ..."*. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.
  - Related training example: [MS MARCO](../../training/ms_marco/README.md)
It is critical **that you choose the right model** for your type of task.
Suitable models for **symmetric semantic search**: [Pre-Trained Sentence Embedding Models](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)
Suitable models for **asymmetric semantic search**: [Pre-Trained MS MARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html)
## Manual Implementation
For small corpora (up to about 1 million entries), we can perform semantic search with a manual implementation by computing the embeddings for the corpus as well as for our query, and then calculating the [semantic textual similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) using [<code>SentenceTransformer.similarity</code>](../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.similarity).
For a simple example, see [semantic_search.py](semantic_search.py):
```eval_rst
.. sidebar:: Output

    .. code-block:: text

        Query: A man is eating pasta.
        Top 5 most similar sentences in corpus:
        A man is eating food. (Score: 0.7035)
        A man is eating a piece of bread. (Score: 0.5272)
        A man is riding a horse. (Score: 0.1889)
        A man is riding a white horse on an enclosed ground. (Score: 0.1047)
        A cheetah is running behind its prey. (Score: 0.0980)

        Query: Someone in a gorilla costume is playing a set of drums.
        Top 5 most similar sentences in corpus:
        A monkey is playing drums. (Score: 0.6433)
        A woman is playing violin. (Score: 0.2564)
        A man is riding a horse. (Score: 0.1389)
        A man is riding a white horse on an enclosed ground. (Score: 0.1191)
        A cheetah is running behind its prey. (Score: 0.1080)

        Query: A cheetah chases prey on across a field.
        Top 5 most similar sentences in corpus:
        A cheetah is running behind its prey. (Score: 0.8253)
        A man is eating food. (Score: 0.1399)
        A monkey is playing drums. (Score: 0.1292)
        A man is riding a white horse on an enclosed ground. (Score: 0.1097)
        A man is riding a horse. (Score: 0.0650)

.. literalinclude:: semantic_search.py
```
## Optimized Implementation
Instead of implementing semantic search by yourself, you can use the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) function.
The function accepts the following parameters:
...
...
By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into sets of up to 500k entries. You can increase ``query_chunk_size`` and ``corpus_chunk_size``, which leads to increased speed for large corpora, but also increases the memory requirement.
## Speed Optimization
To get the optimal speed for the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method, it is advisable to have the `query_embeddings` as well as the `corpus_embeddings` on the same GPU device. This significantly boosts performance. Further, we can normalize the corpus embeddings so that each corpus embedding has length 1. In that case, we can use the dot product to compute scores.
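For illustration, here is a small sketch combining both optimizations; the model name is an example choice, and on a machine with a GPU both embedding tensors end up on the same device as the model:

```python
import torch
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

corpus = ["A man is eating food.", "A monkey is playing drums.", "A cheetah is running behind its prey."]

# normalize_embeddings=True gives unit-length vectors, so the cheaper
# dot product can replace cosine similarity without changing the ranking
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_embeddings = model.encode(["A man is eating pasta."], convert_to_tensor=True, normalize_embeddings=True)

# Both tensors live on the same device, and dot_score is used for scoring
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=3, score_function=util.dot_score)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]} (Score: {hit['score']:.4f})")
```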
[Elasticsearch](https://www.elastic.co/elasticsearch/) can [index dense vectors](https://www.elastic.co/what-is/vector-search) and use them for document scoring. We can easily index embedding vectors, store other data alongside our vectors and, most importantly, efficiently retrieve relevant entries using [approximate nearest neighbor search](https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0) (HNSW, see also below) on the embeddings.
...
...
For further details, see [semantic_search_quora_elasticsearch.py](semantic_search_quora_elasticsearch.py).
## Approximate Nearest Neighbor
Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (like it is used by [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search)).
In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors. However, the results are not necessarily exact. It is possible that some vectors with high similarity will be missed.
For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases.
Three popular libraries for approximate nearest neighbor are [Annoy](https://github.com/spotify/annoy), [FAISS](https://github.com/facebookresearch/faiss), and [hnswlib](https://github.com/nmslib/hnswlib/).
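As a rough sketch, building and querying an ANN index with hnswlib might look as follows (assuming `pip install hnswlib`; the parameter values are illustrative):

```python
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["A man is eating food.", "A monkey is playing drums.", "A cheetah is running behind its prey."]
corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

# Build the HNSW index; ef_construction and M control the recall-speed trade-off
index = hnswlib.Index(space="cosine", dim=corpus_embeddings.shape[1])
index.init_index(max_elements=len(corpus), ef_construction=200, M=16)
index.add_items(corpus_embeddings, np.arange(len(corpus)))
index.set_ef(50)  # higher ef -> better recall, slower queries

# Query the index: returns the ids and cosine distances of the k nearest neighbors
query_embedding = model.encode("A cheetah chases prey on across a field.", normalize_embeddings=True)
ids, distances = index.knn_query(query_embedding, k=2)
for idx, dist in zip(ids[0], distances[0]):
    print(f"{corpus[idx]} (cosine score: {1 - dist:.4f})")
```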
For further details, see [Retrieve & Re-rank](../retrieve_rerank/README.md).
## Examples
We list a handful of common use cases:
### Similar Questions Retrieval
[semantic_search_quora_pytorch.py](semantic_search_quora_pytorch.py) [ [Colab version](https://colab.research.google.com/drive/12cn5Oo0v3HfQQ8Tv6-ukgxXSmT3zl35A?usp=sharing) ] shows an example based on the [Quora duplicate questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the [<code>util.semantic_search</code>](../../../docs/package_reference/util.html#sentence_transformers.util.semantic_search) method. As model, we use [distilbert-multilingual-nli-stsb-quora-ranking](https://huggingface.co/sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking), which was trained to identify similar questions and supports 50+ languages. Hence, the user can input the question in any of the 50+ languages. This is a **symmetric search task**, as the search queries have the same length and content as the questions in the corpus.
### Similar Publication Retrieval
[semantic_search_publications.py](semantic_search_publications.py) [ [Colab version](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06?usp=sharing) ] shows an example of how to find similar scientific publications. As corpus, we use all publications that have been presented at the EMNLP 2016 - 2018 conferences. As search query, we input the title and abstract of more recent publications and find related publications from our corpus. We use the [SPECTER](https://huggingface.co/sentence-transformers/allenai-specter) model. This is a **symmetric search task**, as the papers in the corpus consist of title & abstract and we search for title & abstract.
### Question & Answer Retrieval
[semantic_search_wikipedia_qa.py](semantic_search_wikipedia_qa.py) [ [Colab Version](https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing) ]: This example uses a model that was trained on the [Natural Questions dataset](https://huggingface.co/datasets/sentence-transformers/natural-questions). It consists of about 100k real Google search queries, together with an annotated passage from Wikipedia that provides the answer. It is an example of an **asymmetric search task**. As corpus, we use the smaller [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page) so that it fits easily into memory.
[retrieve_rerank_simple_wikipedia.ipynb](../retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) [ [Colab Version](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb) ]: This script uses the [Retrieve & Re-rank](../retrieve_rerank/README.md) strategy and is an example for an **asymmetric search task**. We split all Wikipedia articles into paragraphs and encode them with a bi-encoder. If a new query / question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine similarity are retrieved. Next, the retrieved candidates are scored by a Cross-Encoder re-ranker and the 5 passages with the highest score from the Cross-Encoder are presented to the user. We use models that were trained on the [MS MARCO Passage Reranking](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset, a dataset with about 500k real queries from Bing search.