# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus that was created based on real user search queries issued to the Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.
The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.
## Usage
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("msmarco-distilroberta-base-v3")
query_embedding = model.encode("How big is London")
passage_embedding = model.encode("London has 9,787,426 inhabitants at the 2011 census")
print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
```
For more details on the usage, see [Applications - Information Retrieval](../../examples/applications/retrieve_rerank/README.md)
## Performance
Performance is evaluated on [TREC-DL 2019](https://microsoft.github.io/TREC-2019-Deep-Learning/), a query-passage retrieval task where passages have been annotated with their relevance with respect to the given queries. Further, we evaluate on the [MS Marco Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset.
As a baseline, we show the results for lexical search with BM25 using Elasticsearch.
| Approach | NDCG@10 (TREC DL 19 Reranking) | MRR@10 (MS Marco Dev) | Queries (GPU / CPU) | Docs (GPU / CPU)
| ------------- |:-------------: | :---: | :---: | :---: |
| **Models tuned for cosine-similarity** | |
| msmarco-MiniLM-L-6-v3 | 67.46 | 32.27 | 18,000 / 750 | 2,800 / 180
| msmarco-MiniLM-L-12-v3 | 65.14 | 32.75 | 11,000 / 400 | 1,500 / 90
| msmarco-distilbert-base-v3| 69.02 | 33.13 | 7,000 / 350 | 1,100 / 70
| msmarco-distilbert-base-v4 | **70.24** | **33.79**| 7,000 / 350 | 1,100 / 70
| msmarco-roberta-base-v3 | 69.08 | 33.01 | 4,000 / 170 | 540 / 30
| **Models tuned for dot-product** | |
| msmarco-distilbert-base-dot-prod-v3 | 68.42 | 33.04 | 7,000 / 350 | 1100 / 70
| [msmarco-roberta-base-ance-firstp](https://github.com/microsoft/ANCE) | 67.84 | 33.01 | 4,000 / 170 | 540 / 30
| [msmarco-distilbert-base-tas-b](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco) | **71.04** | **34.43** | 7,000 / 350 | 1100 / 70
| **Previous approaches** | | |
| BM25 (Elasticsearch) | 45.46 | 17.29 |
| msmarco-distilroberta-base-v2 | 65.65 | 28.55 |
| msmarco-roberta-base-v2 | 67.18 | 29.17 |
| msmarco-distilbert-base-v2 | 68.35 | 30.77 |
**Notes:**
- We provide two types of models: one tuned for **cosine-similarity**, the other for **dot-product**. Make sure to use the right method to compute the similarity between query and passages.
- Models tuned for **cosine-similarity** will prefer the retrieval of shorter passages, while models for **dot-product** will prefer the retrieval of longer passages. Depending on your task, you might prefer the one or the other type of model.
- **msmarco-roberta-base-ance-firstp** is the MSMARCO Dev Passage Retrieval ANCE(FirstP) 600K model from [ANCE](https://github.com/microsoft/ANCE). This model should be used with dot-product instead of cosine similarity.
- **msmarco-distilbert-base-tas-b** uses the model from [sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco). See the linked documentation / paper for more details.
- Encoding speeds are queries / docs per second, measured on a V100 GPU and an 8-core Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
## Changes in v3
The v2 models were used to find similar passages for all training queries. An [MS MARCO Cross-Encoder](ce-msmarco.md) based on the electra-base model was then used to classify whether these retrieved passages answer the question.
If they received a low score from the cross-encoder, we saved them as hard negatives: they got a high score from the bi-encoder, but a low score from the (better) cross-encoder.
We then trained the v2 models with these new hard negatives.
## Version History
As we work on the topic, we will publish updated (and improved) models.
- [Version 2](msmarco-v2.md)
- [Version 1](msmarco-v1.md)
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus that was created based on real user search queries issued to the Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.
The training data consists of over 500k examples, while the complete corpus consists of over 8.8 million passages.
## Usage
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("msmarco-distilbert-dot-v5")
query_embedding = model.encode("How big is London")
passage_embedding = model.encode([
    "London has 9,787,426 inhabitants at the 2011 census",
    "London is known for its financial district",
])
print("Similarity:", util.dot_score(query_embedding, passage_embedding))
```
For more details on the usage, see [Applications - Information Retrieval](../../examples/applications/retrieve_rerank/README.md)
## Performance
Performance is evaluated on [TREC-DL 2019](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019) and [TREC-DL 2020](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020), which are query-passage retrieval tasks where passages have been annotated with their relevance with respect to the given queries. Further, we evaluate on the [MS Marco Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking/) dataset.
| Approach | MRR@10 (MS Marco Dev) | NDCG@10 (TREC DL 19 Reranking) | NDCG@10 (TREC DL 20 Reranking) | Queries (GPU / CPU) | Docs (GPU / CPU)
| ------------- | :-------------: | :-------------: | :---: | :---: | :---: |
| **Models tuned with normalized embeddings** | |
| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 67.46 | 64.73 | 18,000 / 750 | 2,800 / 180
| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 65.14 | 67.48 | 11,000 / 400 | 1,500 / 90
| [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5) | 33.79 | 70.24 | 66.24 | 7,000 / 350 | 1,100 / 70
| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | | 65.55 | 64.66 | 18,000 / 750 | 2,800 / 180
| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | | 67.59 | 66.46 | 7,000 / 350 | 1,100 / 70
| [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) | | 67.78 | 69.87 | 4,000 / 170 | 540 / 30
| **Models tuned for dot-product** | |
| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 71.04 | 69.78 | 7,000 / 350 | 1100 / 70
| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 70.14 | 71.08 | 7,000 / 350 | 1100 / 70
| [msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5) | 38.08 | 70.51 | 73.45 | 4,000 / 170 | 540 / 30
| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | | 66.70 | 65.98 | 18,000 / 750 | 2,800 / 180
| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | | 68.05 | 70.49 | 7,000 / 350 | 1,100 / 70
| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | | 70.66 | 71.18 | 4,000 / 170 | 540 / 30
**Notes:**
- We provide two types of models: one produces **normalized embeddings** and can be used with dot-product, cosine-similarity, or Euclidean distance (all three scoring functions will produce the same results; see the short check below these notes). The models tuned for **dot-product** produce embeddings of varying magnitude and must be used with dot-product to find close items in a vector space.
- Models with normalized embeddings will prefer the retrieval of shorter passages, while models tuned for **dot-product** will prefer the retrieval of longer passages. Depending on your task, you might prefer one or the other type of model.
- Encoding speeds are queries / docs per second, measured on a V100 GPU and an 8-core Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
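As a quick check of the first point, the following sketch (illustrative only) encodes a query and a passage with one of the normalized-embedding models from the table and shows that cosine-similarity and dot-product agree:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-cos-v5")

query_embedding = model.encode("How big is London")
passage_embedding = model.encode("London has 9,787,426 inhabitants at the 2011 census")

# Because the embeddings are normalized to unit length, both scoring functions agree
print("Cosine:     ", util.cos_sim(query_embedding, passage_embedding))
print("Dot-product:", util.dot_score(query_embedding, passage_embedding))
```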
## Changes in v5
- Models with normalized embeddings were added: these are the v3 cosine-similarity models, but with an additional normalization layer on top.
- New models trained with MarginMSE loss: msmarco-distilbert-dot-v5 and msmarco-bert-base-dot-v5
## Changes in v4
- Just one new model was trained with better hard negatives, leading to a small improvement compared to v3
## Changes in v3
The v2 models were used to find similar passages for all training queries. An [MS MARCO Cross-Encoder](ce-msmarco.md) based on the electra-base model was then used to classify whether these retrieved passages answer the question.
If they received a low score from the cross-encoder, we saved them as hard negatives: they got a high score from the bi-encoder, but a low score from the (better) cross-encoder.
We then trained the v2 models with these new hard negatives.
## Version History
As we work on the topic, we will publish updated (and improved) models.
- [Version 3](msmarco-v3.md)
- [Version 2](msmarco-v2.md)
- [Version 1](msmarco-v1.md)
# NLI Models
Conneau et al., 2017, show in the InferSent-Paper ([Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364)) that training on Natural Language Inference (NLI) data can produce universal sentence embeddings.
The datasets provide sentence pairs annotated with the labels *entail*, *contradict*, and *neutral*. For both sentences, we compute a sentence embedding. These two embeddings are concatenated and passed to a softmax classifier to derive the final label.
As shown, this produces sentence embeddings that can be used for various use cases like clustering or semantic search.
# Datasets
We train the models on the [SNLI](https://nlp.stanford.edu/projects/snli/) and on the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset. We call the combination of the two datasets AllNLI.
For a training example, see [examples/training_nli_bert.py](../../examples/training_nli_bert.py).
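As a rough illustration of this setup, the sketch below builds a small model and trains it with `SoftmaxLoss` on two made-up AllNLI-style pairs; the integer label encoding and the classic `model.fit` API are assumptions that may differ between library versions, so see the linked training script for the canonical version.
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a fresh model: transformer + mean pooling
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Two AllNLI-style examples; the label encoding (0/1/2) is an assumption
train_examples = [
    InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=0),  # contradiction
    InputExample(texts=["A soccer game with multiple males playing.", "Some men are playing a sport."], label=1),  # entailment
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# SoftmaxLoss concatenates the two sentence embeddings and feeds them to a classifier
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```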
# Pretrained models
We provide various pre-trained models. The performance was evaluated on the test set of the [STS benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation.
[» Full List of NLI & STS Models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0)
# Performance Comparison
Here are the performances on the STS benchmark for other sentence embedding methods, also computed using cosine-similarity and Spearman rank correlation:
- Avg. GloVe embeddings: 58.02
- BERT-as-a-service avg. embeddings: 46.35
- BERT-as-a-service CLS-vector: 16.50
- InferSent - GloVe: 68.03
- Universal Sentence Encoder: 74.92
# Applications
These models work well for assessing the coarse-grained similarity between sentences. For application examples, see [semantic_textual_similarity](../usage/semantic_textual_similarity.md) and [semantic search](../usage/semantic_search.md).
# Natural Questions Models
[Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions) consists of about 100k real search queries from Google together with the relevant passages from Wikipedia. Models trained on this dataset work well for question-answer retrieval.
## Usage
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("nq-distilbert-base-v1")
query_embedding = model.encode("How many people live in London?")
# The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode(
[["London", "London has 9,787,426 inhabitants at the 2011 census."]]
)
print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
```
Note: For the passage, we have to encode the Wikipedia article title together with a text paragraph from that article.
## Performance
The models are evaluated on the Natural Questions development dataset using MRR@10.
| Approach | MRR@10 (NQ dev set small) |
| ------------- |:-------------: |
| nq-distilbert-base-v1 | 72.36 |
| *Other models* | |
| [DPR](https://huggingface.co/transformers/model_doc/dpr.html) | 58.96 |
# STS Models
The models were first trained on [NLI data](nli-models.md), then we fine-tuned them on the [STS benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark). This generates sentence embeddings that are especially suitable for measuring the semantic similarity between sentence pairs.
# Datasets
We use the training file from the [STS benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark).
For a training example, see:
- [examples/training_stsbenchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py) - Train directly on STS data
- [examples/training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) - First train on NLI, then train on STS data.
# Pre-trained models
We provide the following pre-trained models:
[» Full List of STS Models](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0)
# Performance Comparison
Here are the performances on the STS benchmark for other sentence embedding methods, also computed using cosine-similarity and Spearman rank correlation. Note: these models were not fine-tuned on the STS benchmark.
- Avg. GloVe embeddings: 58.02
- BERT-as-a-service avg. embeddings: 46.35
- BERT-as-a-service CLS-vector: 16.50
- InferSent - GloVe: 68.03
- Universal Sentence Encoder: 74.92
# Wikipedia Sections Models
The `wikipedia-sections-models` implement the idea from Ein Dor et al., 2018, [Learning Thematic Similarity Metric Using Triplet Networks](https://aclweb.org/anthology/P18-2009).
It was trained with a triplet loss: the anchor and the positive example were sentences from the same section of a Wikipedia article, for example, from the History section of the London article. The negative example came from a different section of the same article, for example, from the Education section of the London article.
# Dataset
We use the dataset from Ein Dor et al., 2018, [Learning Thematic Similarity Metric Using Triplet Networks](https://aclweb.org/anthology/P18-2009).
See [examples/training_wikipedia_sections.py](../../examples/training_wikipedia_sections.py) for how to train on this dataset.
# Pre-trained models
We provide the following pre-trained models:
- **bert-base-wikipedia-sections-mean-tokens**: 80.42% accuracy on test set.
You can use them in the following way:
```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('pretrained-model-name')
```
# Performance Comparison
Performance (accuracy) reported by Ein Dor et al.:
- mean-vectors: 0.65
- skip-thoughts-CS: 0.615
- skip-thoughts-SICK: 0.547
- triplet-sen: 0.74
# Applications
The models achieve a rather low performance on the STS benchmark dataset. The reason for this is the training objective: an anchor, a positive and a negative example are presented, and the network only has to learn to tell the positive from the negative example by ensuring that the negative example is further away from the anchor than the positive example.
However, it does not matter how far away the negative example is; it can be slightly or very far away. This makes the model rather unsuited for deciding whether a pair is somewhat similar: it only learns to recognize similar pairs (high scores) and dissimilar pairs (low scores).
However, this model works well for **fine-grained clustering**.
For an example, see:
[examples/application_clustering_wikipedia_sections.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering_wikipedia_sections.py)
# Pretrained Cross-Encoders
This page lists available **pretrained Cross-Encoders**. Cross-Encoders require the input of a text pair and output a score 0...1. They do not work for individual sentences and they don't compute embeddings for individual texts.
![BiEncoder](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/Bi_vs_Cross-Encoder.png)
## MS MARCO
[MS MARCO Passage Retrieval](https://github.com/microsoft/MSMARCO-Passage-Ranking) is a large dataset with real user queries from the Bing search engine with annotated relevant text passages.
These models can be used like this:
```python
from sentence_transformers import CrossEncoder
model = CrossEncoder("model_name", max_length=512)
scores = model.predict([("Query1", "Paragraph1"), ("Query1", "Paragraph2")])
# For Example
scores = model.predict([
("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
("How many people live in Berlin?", "Berlin is well known for its museums."),
])
```
- **cross-encoder/ms-marco-TinyBERT-L-2-v2** - MRR@10 on MS Marco Dev Set: 32.56
- **cross-encoder/ms-marco-MiniLM-L-2-v2** - MRR@10 on MS Marco Dev Set: 34.85
- **cross-encoder/ms-marco-MiniLM-L-4-v2** - MRR@10 on MS Marco Dev Set: 37.70
- **cross-encoder/ms-marco-MiniLM-L-6-v2** - MRR@10 on MS Marco Dev Set: 39.01
- **cross-encoder/ms-marco-MiniLM-L-12-v2** - MRR@10 on MS Marco Dev Set: 39.02
For details on the usage, see [Applications - Information Retrieval](../examples/applications/retrieve_rerank/README.md)
[MS MARCO Cross-Encoders - More details](pretrained-models/ce-msmarco.md)
## SQuAD (QNLI)
QNLI is based on the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) and was introduced by the [GLUE Benchmark](https://arxiv.org/abs/1804.07461). Given a passage from Wikipedia, annotators created questions that are answerable by that passage.
- **cross-encoder/qnli-distilroberta-base** - Accuracy on QNLI dev set: 90.96
- **cross-encoder/qnli-electra-base** - Accuracy on QNLI dev set: 93.21
## STSbenchmark
The following models can be used like this:
```python
from sentence_transformers import CrossEncoder
model = CrossEncoder("model_name")
scores = model.predict([("Sent A1", "Sent B1"), ("Sent A2", "Sent B2")])
```
They return a score 0...1 indicating the semantic similarity of the given sentence pair.
- **cross-encoder/stsb-TinyBERT-L-4** - STSbenchmark test performance: 85.50
- **cross-encoder/stsb-distilroberta-base** - STSbenchmark test performance: 87.92
- **cross-encoder/stsb-roberta-base** - STSbenchmark test performance: 90.17
- **cross-encoder/stsb-roberta-large** - STSbenchmark test performance: 91.47
## Quora Duplicate Questions
These models have been trained on the [Quora duplicate questions dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs). They can be used like the STSb models and give a score 0...1 indicating the probability that two questions are duplicates.
- **cross-encoder/quora-distilroberta-base** - Average Precision dev set: 87.48
- **cross-encoder/quora-roberta-base** - Average Precision dev set: 87.80
- **cross-encoder/quora-roberta-large** - Average Precision dev set: 87.91
Note: These models don't work for question similarity. The questions *How to learn Java* and *How to learn Python* will get a low score, as these questions are not duplicates. For question similarity, the respective bi-encoder trained on the Quora dataset yields much more meaningful results.
## NLI
Given two sentences, do they contradict each other, does one entail the other, or are they neutral? The following models were trained on the [SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) datasets.
- **cross-encoder/nli-deberta-v3-base** - Accuracy on MNLI mismatched set: 90.04
- **cross-encoder/nli-deberta-base** - Accuracy on MNLI mismatched set: 88.08
- **cross-encoder/nli-deberta-v3-xsmall** - Accuracy on MNLI mismatched set: 87.77
- **cross-encoder/nli-deberta-v3-small** - Accuracy on MNLI mismatched set: 87.55
- **cross-encoder/nli-roberta-base** - Accuracy on MNLI mismatched set: 87.47
- **cross-encoder/nli-MiniLM2-L6-H768** - Accuracy on MNLI mismatched set: 86.89
- **cross-encoder/nli-distilroberta-base** - Accuracy on MNLI mismatched set: 83.98
```python
from sentence_transformers import CrossEncoder
model = CrossEncoder("model_name")
scores = model.predict([
("A man is eating pizza", "A man eats something"),
("A black race car starts up in front of a crowd of people.", "A man is driving down a lonely road."),
])
# Convert scores to labels
label_mapping = ["contradiction", "entailment", "neutral"]
labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
```
# Pretrained Models
We provide various pre-trained models. Using these models is easy:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("model_name")
```
All models are hosted on the [HuggingFace Model Hub](https://huggingface.co/sentence-transformers).
## Model Overview
The following table provides an overview of (selected) models. They have been extensively evaluated for their quality at embedding sentences (Performance Sentence Embeddings) and at embedding search queries & paragraphs (Performance Semantic Search).
The **all-*** models were trained on all available training data (more than 1 billion training pairs) and are designed as **general purpose** models. The **all-mpnet-base-v2** model provides the best quality, while **all-MiniLM-L6-v2** is 5 times faster and still offers good quality. Toggle *All models* to see all evaluated models or visit [HuggingFace Model Hub](https://huggingface.co/models?library=sentence-transformers) to view all existing sentence-transformers models.
<iframe src="../_static/html/models_en_sentence_embeddings.html" height="600" style="width:100%; border:none;" title="Iframe Example"></iframe>
---
## Semantic Search
The following models have been specifically trained for **Semantic Search**: Given a question / search query, these models are able to find relevant text passages. For more details, see [Usage - Semantic Search](../examples/applications/semantic-search/README.md).
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
query_embedding = model.encode("How big is London")
passage_embedding = model.encode([
    "London has 9,787,426 inhabitants at the 2011 census",
    "London is known for its financial district",
])
print("Similarity:", util.dot_score(query_embedding, passage_embedding))
```
### Multi-QA Models
The following models have been trained on [215M question-answer pairs](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1#training) from various sources and domains, including StackExchange, Yahoo Answers, Google & Bing search queries and many more. These models perform well across many search tasks and domains.
These models were tuned to be used with dot-product:
| Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
| --- | :---: | :---: |
| [multi-qa-MiniLM-L6-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-dot-v1) | 49.19 | 18,000 / 750 |
| [multi-qa-distilbert-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-dot-v1) | 52.51 | 7,000 / 350 |
| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | 57.60 | 4,000 / 170 |
These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
| Model | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
| --- | :---: | :---: |
| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | 51.83 | 18,000 / 750 |
| [multi-qa-distilbert-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-distilbert-cos-v1) | 52.83 | 7,000 / 350 |
| [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) | 57.46 | 4,000 / 170 |
### MSMARCO Passage Models
The [MSMARCO Passage Ranking Dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking) contains 500k real queries from Bing search together with the relevant passages from various web sources. Given the diversity of the MSMARCO dataset, models also perform well on other domains.
Models tuned to be used with dot-product:
| Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
| --- | :---: | :---: | :---: |
| [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 49.25 | 7,000 / 350 |
| [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 49.47 | 7,000 / 350 |
| [msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5) | 38.08 | 52.11 | 4,000 / 170 |
These models produce normalized vectors of length 1, which can be used with dot-product, cosine-similarity and Euclidean distance:
| Model | MSMARCO MRR@10 dev set | Performance Semantic Search (6 Datasets) | Queries (GPU / CPU) per sec. |
| --- | :---: | :---: | :---: |
| [msmarco-MiniLM-L6-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5) | 32.27 | 42.16 | 18,000 / 750 |
| [msmarco-MiniLM-L12-cos-v5](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) | 32.75 | 43.89 | 11,000 / 400 |
| [msmarco-distilbert-cos-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5) | 33.79 | 44.98 | 7,000 / 350 |
[MSMARCO Models - More details](pretrained-models/msmarco-v5.md)
---
## Multi-Lingual Models
The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language. Details are in our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813). We used the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
**Semantic Similarity**
These models find semantically similar sentences within one language or across languages:
- **distiluse-base-multilingual-cased-v1**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
- **distiluse-base-multilingual-cased-v2**: Multilingual knowledge distilled version of [multilingual Universal Sentence Encoder](https://arxiv.org/abs/1907.04307). This version supports 50+ languages, but performs a bit weaker than the v1 model.
- **paraphrase-multilingual-MiniLM-L12-v2** - Multilingual version of *paraphrase-MiniLM-L12-v2*, trained on parallel data for 50+ languages.
- **paraphrase-multilingual-mpnet-base-v2** - Multilingual version of *paraphrase-mpnet-base-v2*, trained on parallel data for 50+ languages.
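To illustrate the aligned vector space, here is a small sketch using one of the models above on arbitrary example sentences in three languages:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "This is an example sentence.",    # English
    "Dies ist ein Beispielsatz.",      # German
    "Ceci est une phrase d'exemple.",  # French
]
embeddings = model.encode(sentences)

# Translations of the same sentence land close together in the shared vector space
print(util.cos_sim(embeddings, embeddings))
```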
**Bitext Mining**
Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use-case, the following model gives the best performance:
- **LaBSE** - [LaBSE](https://arxiv.org/abs/2007.01852) Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed [here](https://arxiv.org/abs/2004.09813), LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.
Extending a model to new languages is easy by following [the description here](https://www.sbert.net/examples/training/multilingual/README.html).
----
## Image & Text-Models
The following models can embed images and text into a joint vector space. See [Image Search](../examples/applications/image-search/README.md) for more details on how to use them for text2image search, image2image search, image clustering, and zero-shot image classification.
The following models are available with their respective Top 1 accuracy on the zero-shot ImageNet validation set.
| Model | Top 1 Performance |
| --- | :---: |
| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 |
| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 |
| [clip-ViT-L-14](https://huggingface.co/sentence-transformers/clip-ViT-L-14) | 75.4 |
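A minimal text-to-image matching sketch with clip-ViT-B-32; the image file name is a placeholder:
```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Encode an image (placeholder path) and a few candidate captions
img_emb = model.encode(Image.open("two_dogs_in_snow.jpg"))
text_emb = model.encode([
    "Two dogs playing in the snow",
    "A cat sitting on a table",
    "London at night",
])

# Cosine similarities between the image and each caption
print(util.cos_sim(img_emb, text_emb))
```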
We further provide this multilingual text-image model:
- **clip-ViT-B-32-multilingual-v1** - Multilingual text encoder for the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model using [Multilingual Knowledge Distillation](https://arxiv.org/abs/2004.09813). This model can encode text in 50+ languages to match the image vectors from the [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) model.
---
## Other Models
### INSTRUCTOR models
Some INSTRUCTOR models, such as [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large), are natively supported in Sentence Transformers. These models are special, as they are trained with instructions in mind. Notably, the primary difference between normal Sentence Transformer models and Instructor models is that the latter do not include the instructions themselves in the pooling step.
The following models work out of the box:
* [hkunlp/instructor-base](https://huggingface.co/hkunlp/instructor-base)
* [hkunlp/instructor-large](https://huggingface.co/hkunlp/instructor-large)
* [hkunlp/instructor-xl](https://huggingface.co/hkunlp/instructor-xl)
You can use these models like so:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("hkunlp/instructor-large")
embeddings = model.encode(
[
"Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity",
"Comparison of Atmospheric Neutrino Flux Calculations at Low Energies",
"Fermion Bags in the Massive Gross-Neveu Model",
"QCD corrections to Associated t-tbar-H production at the Tevatron",
],
prompt="Represent the Medicine sentence for clustering: ",
)
print(embeddings.shape)
# => (4, 768)
```
For example, for information retrieval:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("hkunlp/instructor-large")
query = "where is the food stored in a yam plant"
query_instruction = (
"Represent the Wikipedia question for retrieving supporting documents: "
)
corpus = [
'Yams are perennial herbaceous vines native to Africa, Asia, and the Americas and cultivated for the consumption of their starchy tubers in many temperate and tropical regions. The tubers themselves, also called "yams", come in a variety of forms owing to numerous cultivars and related species.',
"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession",
"Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.",
]
corpus_instruction = "Represent the Wikipedia document for retrieval: "
query_embedding = model.encode(query, prompt=query_instruction)
corpus_embeddings = model.encode(corpus, prompt=corpus_instruction)
similarities = cos_sim(query_embedding, corpus_embeddings)
print(similarities)
# => tensor([[0.8835, 0.7037, 0.6970]])
```
All other Instructor models either 1) will not load as they refer to `InstructorEmbedding` in their `modules.json` or 2) require calling `model.set_pooling_include_prompt(include_prompt=False)` after loading.
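For the second case, the call looks as follows (the model name below is a placeholder for such an Instructor-style model):
```python
from sentence_transformers import SentenceTransformer

# "my-org/other-instructor-model" is a placeholder, not a real model name
model = SentenceTransformer("my-org/other-instructor-model")
# Exclude the instruction tokens from the pooling step, as Instructor models expect
model.set_pooling_include_prompt(include_prompt=False)
```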
### Scientific Publications
[SPECTER](https://arxiv.org/abs/2004.07180) is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers.
- **allenai-specter** - [Semantic Search Python Example](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_publications.py) / [Semantic Search Colab Example](https://colab.research.google.com/drive/12hfBveGHRsxhPIUMmJYrll2lFU4fOX06)
### Natural Questions (NQ) Dataset Models
The following models were trained on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions), a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia.
- **nq-distilbert-base-v1**: MRR10: 72.36 on NQ dev set (small)
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("nq-distilbert-base-v1")
query_embedding = model.encode("How many people live in London?")
# The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode(
[["London", "London has 9,787,426 inhabitants at the 2011 census."]]
)
print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
```
You can index the passages as shown [here](../examples/applications/semantic-search/README.md).
**Note:** The NQ model doesn't perform well. Use the above-mentioned Multi-QA models to achieve optimal performance.
[More details](pretrained-models/nq-v1.md)
### DPR-Models
In [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) Karpukhin et al. trained models based on [Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions):
- **facebook-dpr-ctx_encoder-single-nq-base**
- **facebook-dpr-question_encoder-single-nq-base**
They also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC.
- **facebook-dpr-ctx_encoder-multiset-base**
- **facebook-dpr-question_encoder-multiset-base**
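A rough usage sketch for the DPR models (two separate encoders, scored with dot-product); see the linked details page below for the exact passage format expected by the context encoder:
```python
from sentence_transformers import SentenceTransformer, util

passage_encoder = SentenceTransformer("facebook-dpr-ctx_encoder-single-nq-base")
query_encoder = SentenceTransformer("facebook-dpr-question_encoder-single-nq-base")

passages = [
    "London is the capital and largest city of England and the United Kingdom.",
    "Berlin is the capital of Germany.",
]
passage_embeddings = passage_encoder.encode(passages)
query_embedding = query_encoder.encode("What is the capital of England?")

# DPR embeddings are meant to be compared with dot-product
print(util.dot_score(query_embedding, passage_embeddings))
```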
**Note:** The DPR models perform comparably poorly. Use the above-mentioned Multi-QA models to achieve optimal performance.
[More details & usage of the DPR models](pretrained-models/dpr.md)
### Average Word Embeddings Models
The following models compute the average word embedding for some well-known word embedding methods. Their computation speed is much higher than that of the transformer-based models, but the quality of the embeddings is worse.
- **average_word_embeddings_glove.6B.300d**
- **average_word_embeddings_komninos**
- **average_word_embeddings_levy_dependency**
- **average_word_embeddings_glove.840B.300d**
# Publications
If you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "http://arxiv.org/abs/1908.10084",
}
```
If you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):
```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2004.09813",
}
```
If you use the code for [data augmentation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation), feel free to cite our publication [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240):
```bibtex
@inproceedings{thakur-2020-AugSBERT,
title = "Augmented {SBERT}: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = "6",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2010.08240",
pages = "296--310",
}
```
If you use the models for [MS MARCO](pretrained-models/msmarco-v2.md), feel free to cite the paper: [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210)
```bibtex
@inproceedings{reimers-2020-Curse_Dense_Retrieval,
title = "The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = "8",
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2012.14210",
pages = "605--611",
}
```
When you use the unsupervised learning example, please have a look at: [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979):
```bibtex
@inproceedings{wang-2021-TSDAE,
title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning",
author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
pages = "671--688",
url = "https://arxiv.org/abs/2104.06979",
}
```
When you use the GenQ learning example, please have a look at: [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663):
```bibtex
@inproceedings{thakur-2021-BEIR,
title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
author = {Thakur, Nandan and Reimers, Nils and R{\"{u}}ckl{\'{e}}, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) - Datasets and Benchmarks Track (Round 2)},
month = "4",
year = "2021",
url = "https://arxiv.org/abs/2104.08663",
}
```
When you use GPL, please have a look at: [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577):
```bibtex
@inproceedings{wang-2021-GPL,
title = "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval",
author = "Wang, Kexin and Thakur, Nandan and Reimers, Nils and Gurevych, Iryna",
journal= "arXiv preprint arXiv:2112.07577",
month = "12",
year = "2021",
url = "https://arxiv.org/abs/2112.07577",
}
```
**Repositories using SentenceTransformers**
- **[haystack](https://github.com/deepset-ai/haystack)** - Neural Search / Q&A
- **[Top2Vec](https://github.com/ddangelov/Top2Vec)** - Topic modeling
- **[txtai](https://github.com/neuml/txtai)** - AI-powered search engine
- **[BERTTopic](https://github.com/MaartenGr/BERTopic)** - Topic model using SBERT embeddings
- **[KeyBERT](https://github.com/MaartenGr/KeyBERT)** - Key phrase extraction using SBERT
- **[contextualized-topic-models](https://github.com/MilaNLProc/contextualized-topic-models)** - Cross-Lingual Topic Modeling
- **[covid-papers-browser](https://github.com/gsarti/covid-papers-browser)** - Semantic Search for Covid-19 papers
- **[backprop](https://github.com/backprop-ai/backprop)** - Natural Language Engine that makes using state-of-the-art language models easy, accessible and scalable.
**SentenceTransformers in Articles**
In the following you find a (selective) list of articles / applications using SentenceTransformers to do amazing stuff. Feel free to contact me (info@nils-reimers.de) to add your application here.
- **December 2021 - [Sentence Transformer Fine-Tuning (SetFit): Outperforming GPT-3 on few-shot Text-Classification while being 1600 times smaller](https://towardsdatascience.com/sentence-transformer-fine-tuning-setfit-outperforms-gpt-3-on-few-shot-text-classification-while-d9a3788f0b4e?gi=4bdbaff416e3)**
- **October 2021: [Natural Language Processing (NLP) for Semantic Search](https://www.pinecone.io/learn/nlp)**
- **January 2021 - [Advance BERT model via transferring knowledge from Cross-Encoders to Bi-Encoders](https://towardsdatascience.com/advance-nlp-model-via-transferring-knowledge-from-cross-encoders-to-bi-encoders-3e0fc564f554)**
- **November 2020 - [How to Build a Semantic Search Engine With Transformers and Faiss](https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8)**
- **October 2020 - [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6)**
- **September 2020 - [Elastic Transformers - Making BERT stretchy - Scalable Semantic Search on a Jupyter Notebook](https://medium.com/@mihail.dungarov/elastic-transformers-ae011e8f5b88)**
- **July 2020 - [Simple Sentence Similarity Search with SentenceBERT](https://laptrinhx.com/simple-sentence-similarity-search-with-sentencebert-800684405/?fbclid=IwAR0rxdYS2DBGuHhijIRO_lsXqGc9BbjtDA-dDQM5Ng_StahT9xrHdRZuP9M)**
- **May 2020 - [HN Time Machine: finally some Hacker News history!](https://peltarion.com/blog/applied-ai/hacker-news-time-machine)**
- **May 2020 - [A complete guide to transfer learning from English to other Languages using Sentence Embeddings BERT Models](https://towardsdatascience.com/a-complete-guide-to-transfer-learning-from-english-to-other-languages-using-sentence-embeddings-8c427f8804a9)**
- **March 2020 - [Building a k-NN Similarity Search Engine using Amazon Elasticsearch and SageMaker](https://towardsdatascience.com/building-a-k-nn-similarity-search-engine-using-amazon-elasticsearch-and-sagemaker-98df18d883bd)**
- **February 2020 - [Semantic Search Engine with Sentence BERT](https://medium.com/@evergreenllc2020/semantic-search-engine-with-s-abbfb3cd9377)**
**SentenceTransformers used in Research**
SentenceTransformers is used in hundreds of research projects. For a list of publications, see [Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=de&cites=12599223809118664426) or [Semantic Scholar](https://www.semanticscholar.org/paper/Sentence-BERT%3A-Sentence-Embeddings-using-Siamese-Reimers-Gurevych/93d63ec754f29fa22572615320afe0521f7ec66d).
# Quickstart
Once you have [installed](installation.md) Sentence Transformers, the usage is simple:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Our sentences we like to encode
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of strings.",
    "The quick brown fox jumps over the lazy dog.",
]
# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)
# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
```
With `SentenceTransformer('all-MiniLM-L6-v2')` we define which sentence transformer model we would like to load. In this example, we load [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is a MiniLM model fine-tuned on a large dataset of over 1 billion training pairs.
BERT (and other transformer networks) output an embedding for each token in our input text. In order to create a fixed-sized sentence embedding out of this, the model applies mean pooling, i.e., the output embeddings for all tokens are averaged to yield a fixed-sized vector.
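To illustrate what the pooling step does, here is a rough sketch of mean pooling done by hand with the underlying Hugging Face transformers model; `SentenceTransformer.encode` performs the equivalent steps internally, so this is for illustration only:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
transformer = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(
    ["This framework generates embeddings for each input sentence"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = transformer(**encoded).last_hidden_state  # one vector per token

# Mean pooling: average the token embeddings, ignoring padding tokens via the attention mask
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # a fixed-size vector, e.g. torch.Size([1, 384]) for this model
```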
## Comparing Sentence Similarities
The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure the similarity in vector space is to use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). For two sentences, this can be done like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)
```
If you have a list with more sentences, you can use the following code example:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin.",
"Two men pushed carts through the woods.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"Someone in a gorilla costume is playing a set of drums.",
]
# Encode all sentences
embeddings = model.encode(sentences)
# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)
# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])
# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)
print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))
```
See the *Usage* sections on the left for more examples of how to use SentenceTransformers.
## Pre-Trained Models
Various pre-trained models optimized for many tasks exist. For a full list, see **[Pretrained Models](pretrained_models.md)**.
## Training your own Embeddings
Training your own sentence embedding models for all types of use cases is easy and often requires only minimal coding effort. For a comprehensive tutorial, see [Training/Overview](training/overview.md).
You can also easily extend existing sentence embedding models to **further languages**. For details, see [Multi-Lingual Training](../examples/training/multilingual/README).
# Must use Python 3.8!
sphinx<4
Jinja2<3.1
sphinx_markdown_tables
recommonmark
-e ..
# Loss Overview
Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no "one size fits all" loss function. Ideally, this overview should help narrow down your choice of loss function(s) by matching them to your data formats.
**Note**: you can often convert one training data format into another, allowing more loss functions to be viable for your scenario. For example, `(sentence_A, sentence_B) pairs` with `class` labels can be converted into `(anchor, positive, negative) triplets` by sampling sentences with the same or different classes.
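As a sketch of such a conversion (the data is made up), NLI-style labeled pairs can be regrouped into triplets by combining, for each first sentence, an entailed second sentence as positive and a contradicting one as negative:
```python
from collections import defaultdict

# Hypothetical (sentence_A, sentence_B) pairs with class labels
pairs = [
    ("A man is eating food.", "A man is eating something.", "entailment"),
    ("A man is eating food.", "A man is sleeping.", "contradiction"),
    ("A woman is playing violin.", "A woman is playing an instrument.", "entailment"),
    ("A woman is playing violin.", "Nobody is making music.", "contradiction"),
]

# Group by sentence_A and combine an entailed sentence_B (positive) with a contradicting one (negative)
by_anchor = defaultdict(dict)
for sent_a, sent_b, label in pairs:
    by_anchor[sent_a][label] = sent_b

triplets = [
    (anchor, grouped["entailment"], grouped["contradiction"])
    for anchor, grouped in by_anchor.items()
    if "entailment" in grouped and "contradiction" in grouped
]
print(triplets[0])  # ('A man is eating food.', 'A man is eating something.', 'A man is sleeping.')
```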
| Texts | Labels | Appropriate Loss Functions |
|-----------------------------------------------|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `single sentences` | `class` | <a href="../package_reference/losses.html#batchalltripletloss">`BatchAllTripletLoss`</a><br><a href="../package_reference/losses.html#batchhardsoftmargintripletloss">`BatchHardSoftMarginTripletLoss`</a><br><a href="../package_reference/losses.html#batchhardtripletloss">`BatchHardTripletLoss`</a><br><a href="../package_reference/losses.html#batchsemihardtripletloss">`BatchSemiHardTripletLoss`</a> |
| `single sentences` | `none` | <a href="../package_reference/losses.html#contrastivetensionloss">`ContrastiveTensionLoss`</a><br><a href="../package_reference/losses.html#denoisingautoencoderloss">`DenoisingAutoEncoderLoss`</a> |
| `(anchor, anchor) pairs` | `none` | <a href="../package_reference/losses.html#contrastivetensionlossinbatchnegatives">`ContrastiveTensionLossInBatchNegatives`</a> |
| `(damaged_sentence, original_sentence) pairs` | `none` | <a href="../package_reference/losses.html#denoisingautoencoderloss">`DenoisingAutoEncoderLoss`</a> |
| `(sentence_A, sentence_B) pairs` | `class` | <a href="../package_reference/losses.html#softmaxloss">`SoftmaxLoss`</a> |
| `(anchor, positive) pairs` | `none` | <a href="../package_reference/losses.html#cachedmultiplenegativesrankingloss">`CachedMultipleNegativesRankingLoss`</a><br><a href="../package_reference/losses.html#multiplenegativesrankingloss">`MultipleNegativesRankingLoss`</a><br><a href="../package_reference/losses.html#multiplenegativessymmetricrankingloss">`MultipleNegativesSymmetricRankingLoss`</a><br><a href="../package_reference/losses.html#megabatchmarginloss">`MegaBatchMarginLoss`</a><br><a href="../package_reference/losses.html#gistembedloss">`GISTEmbedLoss`</a> |
| `(anchor, positive/negative) pairs` | `1 if positive, 0 if negative` | <a href="../package_reference/losses.html#contrastiveloss">`ContrastiveLoss`</a><br><a href="../package_reference/losses.html#onlinecontrastiveloss">`OnlineContrastiveLoss`</a> |
| `(sentence_A, sentence_B) pairs` | `float similarity score` | <a href="../package_reference/losses.html#cosentloss">`CoSENTLoss`</a><br><a href="../package_reference/losses.html#angleloss">`AnglELoss`</a><br><a href="../package_reference/losses.html#cosinesimilarityloss">`CosineSimilarityLoss`</a> |
| `(anchor, positive, negative) triplets` | `none` | <a href="../package_reference/losses.html#cachedmultiplenegativesrankingloss">`CachedMultipleNegativesRankingLoss`</a><br><a href="../package_reference/losses.html#multiplenegativesrankingloss">`MultipleNegativesRankingLoss`</a><br><a href="../package_reference/losses.html#tripletloss">`TripletLoss`</a><br><a href="../package_reference/losses.html#gistembedloss">`GISTEmbedLoss`</a> |
## Loss modifiers
These loss functions can be seen as *loss modifiers*: they work on top of standard loss functions, but apply those loss functions in different ways to try and instil useful properties into the trained embedding model.
For example, models trained with <a href="../package_reference/losses.html#matryoshkaloss">`MatryoshkaLoss`</a> produce embeddings whose size can be truncated without notable losses in performance, and models trained with <a href="../package_reference/losses.html#adaptivelayerloss">`AdaptiveLayerLoss`</a> still perform well when you remove model layers for faster inference.
| Texts | Labels | Appropriate Loss Functions |
|-------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `any` | `any` | <a href="../package_reference/losses.html#matryoshkaloss">`MatryoshkaLoss`</a><br><a href="../package_reference/losses.html#adaptivelayerloss">`AdaptiveLayerLoss`</a><br><a href="../package_reference/losses.html#matryoshka2dloss">`Matryoshka2dLoss`</a> |
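A brief sketch of how a loss modifier from the table above wraps a base loss (the constructor arguments shown are assumptions and may vary between library versions):
```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Wrap a standard loss so that truncated embeddings (e.g. the first 128 or 64 dimensions) also work well
base_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[384, 256, 128, 64])
```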
## Distillation
These loss functions are specifically designed to be used when distilling the knowledge from one model into another.
For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.
| Texts | Labels | Appropriate Loss Functions |
|----------------------------------------------|---------------------------------------------------------------|------------------------------------------------------------------------------|
| `single sentences` | `model sentence embeddings` | <a href="../package_reference/losses.html#mseloss">`MSELoss`</a> |
| `(query, passage_one, passage_two) triplets` | `gold_sim(query, passage_one) - gold_sim(query, passage_two)` | <a href="../package_reference/losses.html#marginmseloss">`MarginMSELoss`</a> |
## Commonly used Loss Functions
In practice, not all loss functions get used equally often. The most common scenarios are:
* `(anchor, positive) pairs` without any labels: <a href="../package_reference/losses.html#multiplenegativesrankingloss"><code>MultipleNegativesRankingLoss</code></a> is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. <a href="../package_reference/losses.html#cachedmultiplenegativesrankingloss"><code>CachedMultipleNegativesRankingLoss</code></a> is often used to increase the batch size, resulting in superior performance. A minimal training sketch follows below this list.
* `(sentence_A, sentence_B) pairs` with a `float similarity score`: <a href="../package_reference/losses.html#cosinesimilarityloss"><code>CosineSimilarityLoss</code></a> is traditionally used a lot, though more recently <a href="../package_reference/losses.html#cosentloss"><code>CoSENTLoss</code></a> and <a href="../package_reference/losses.html#angleloss"><code>AnglELoss</code></a> are used as drop-in replacements with superior performance.
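For the first scenario, a minimal training sketch with `(anchor, positive)` pairs might look like this (the data is made up, and the classic `model.fit` API shown here may differ in newer library versions):
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Unlabeled (anchor, positive) pairs; other positives in the same batch act as negatives
train_examples = [
    InputExample(texts=["How big is London?", "London has 9,787,426 inhabitants at the 2011 census."]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital and most populous city of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```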
# Training Overview
Each task is unique, and having sentence / text embeddings tuned for that specific task greatly improves the performance.
SentenceTransformers was designed in such a way that fine-tuning your own sentence / text embedding models is easy. It provides most of the building blocks that you can stick together to tune embeddings for your specific task.
Sadly there is no single training strategy that works for all use-cases. Instead, which training strategy to use greatly depends on your available data and on your target task.
In the **Training** section, I will discuss the fundamentals of training your own embedding models with SentenceTransformers. In the **Training Examples** section, I will provide examples of how to tune embedding models for common real-world applications.
## Network Architecture
For sentence / text embeddings, we want to map a variable-length input text to a fixed-sized dense vector. The most basic network architecture we can use is the following:
![SBERT Network Architecture](../img/SBERT_Architecture.png "SBERT Siamese Architecture")
We feed the input sentence or text into a transformer network like BERT. BERT produces contextualized word embeddings for all input tokens in our text. As we want a fixed-sized output representation (vector u), we need a pooling layer. Different pooling options are available; the most basic one is mean pooling: we simply average all contextualized word embeddings BERT gives us. This gives us a fixed 768-dimensional output vector, independent of how long our input text is.
The depicted architecture, consisting of a BERT layer and a pooling layer, is one complete SentenceTransformer model.
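To make mean pooling concrete, here is a minimal sketch with random tensors standing in for BERT's token embeddings; it averages the token vectors while ignoring padding:
```python
import torch

# Stand-ins for BERT output: (batch, tokens, hidden) token embeddings and the attention mask
token_embeddings = torch.randn(1, 12, 768)
attention_mask = torch.tensor([[1] * 10 + [0] * 2])  # 1 = real token, 0 = padding

mask = attention_mask.unsqueeze(-1).float()  # (batch, tokens, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```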
## Creating Networks from Scratch
In the quick start & usage examples, we used pre-trained SentenceTransformer models that already come with a BERT layer and a pooling layer.
But we can also create the network architecture from scratch by defining the individual layers. For example, the following code would create the depicted network architecture:
```python
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
First, we define our individual layers: in this case, 'bert-base-uncased' as the *word_embedding_model*. We limit that layer to a maximum sequence length of 256; texts longer than that will be truncated. Further, we create a (mean) pooling layer. We then create a new *SentenceTransformer* model by calling `SentenceTransformer(modules=[word_embedding_model, pooling_model])`. For the *modules* parameter, we pass a list of layers that are executed consecutively. Input texts are first passed to the first entry (*word_embedding_model*). The output is then passed to the second entry (*pooling_model*), which returns our sentence embedding.
We can also construct more complex models:
```python
from sentence_transformers import SentenceTransformer, models
from torch import nn
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
in_features=pooling_model.get_sentence_embedding_dimension(),
out_features=256,
activation_function=nn.Tanh(),
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```
Here, we add a fully connected dense layer with Tanh activation on top of the pooling layer, which performs a down-projection to 256 dimensions. Hence, embeddings produced by this model will only have 256 instead of 768 dimensions.
Additionally, we can also create SentenceTransformer models from scratch for image search by loading any CLIP model from the Hugging Face Hub or a local path:
```py
from sentence_transformers import SentenceTransformer, models
image_embedding_model = models.CLIPModel("openai/clip-vit-base-patch32")
model = SentenceTransformer(modules=[image_embedding_model])
```
For all available building blocks see [» Models Package Reference](../package_reference/models.md)
## Training Data
To train a SentenceTransformer model, you need to inform it somehow that two sentences have a certain degree of similarity. Therefore, each example in the data requires a label or structure that allows the model to understand whether two sentences are similar or different.
Unfortunately, there is no single way to prepare your data for training a Sentence Transformers model. It largely depends on your goals and the structure of your data. If you don't have an explicit label, which is the most likely scenario, you can derive one from the design of the documents where you obtained the sentences. For example, two sentences from the same report should be more similar than two sentences from different reports, and neighboring sentences might be more similar than non-neighboring ones (a sketch follows at the end of this section).
For more information on available datasets for training SentenceTransformers models see [» Datasets Reference](../examples/training/datasets/README.md).
To represent our training data, we use the `InputExample` class to store training examples. As parameters, it accepts `texts`, a list of strings representing our pair (or triplet). Further, we can also pass a label (either a float or an int). The following shows a simple example, where we pass text pairs to `InputExample` together with a label indicating the semantic similarity.
```python
from sentence_transformers import SentenceTransformer, InputExample
from torch.utils.data import DataLoader
model = SentenceTransformer("distilbert-base-nli-mean-tokens")
train_examples = [
InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```
We wrap our `train_examples` in a standard PyTorch `DataLoader`, which shuffles our data and produces batches of the specified size.
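If you have no explicit labels, you can, as mentioned above, derive training pairs from the document structure. The following is only a hypothetical sketch that builds unlabeled pairs from neighboring sentences; the document and sentences are made up:
```python
from sentence_transformers import InputExample

# Hypothetical document: neighboring sentences are assumed to be related
report = [
    "The company grew strongly in 2020.",
    "Revenue increased by 20 percent compared to the previous year.",
    "Most of the growth came from the European market.",
]
train_examples = [
    InputExample(texts=[report[i], report[i + 1]]) for i in range(len(report) - 1)
]
# Such unlabeled pairs can be used, e.g., with MultipleNegativesRankingLoss
```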
## Loss Functions
The loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task.
Sadly, there is no "one size fits all" loss function. Which loss function is suitable depends on the available training data and on the target task.
To fine-tune our network, we somehow need to tell it which sentence pairs are similar and should be close in vector space, and which pairs are dissimilar and should be far apart in vector space.
The simplest way is to have sentence pairs annotated with a score indicating their similarity, e.g., on a scale from 0 to 1. We can then train the network with a siamese network architecture (for details, see [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)).
![SBERT Siamese Network Architecture](../img/SBERT_Siamese_Network.png "SBERT Siamese Architecture")
For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity, and the result is compared to the gold similarity score. This allows our network to be fine-tuned to recognize the similarity of sentences.
A minimal example with `CosineSimilarityLoss` is the following:
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("distilbert-base-nli-mean-tokens")
# Define your train examples. You need more than just two examples...
train_examples = [
InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
]
# Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```
We tune the model by calling `model.fit()`. We pass a list of `train_objectives`, which consists of `(dataloader, loss_function)` tuples. We can pass more than one tuple to perform multi-task learning on several datasets with different loss functions (see the Multitask Training section below).
The `fit` method accepts the following parameters:
```eval_rst
.. autoclass:: sentence_transformers.SentenceTransformer
:members: fit
```
## Evaluators
During training, we usually want to measure the model's performance to see whether it improves. For this, the *[sentence_transformers.evaluation](../package_reference/evaluation)* package exists. It contains various evaluators that we can pass to the `fit` method. These evaluators are run periodically during training. They return a score, and only the model with the highest score is stored on disk.
The usage is simple:
```python
from sentence_transformers import evaluation
sentences1 = [
"This list contains the first column",
"With your sentences",
"You want your model to evaluate on",
]
sentences2 = [
"Sentences contains the other column",
"The evaluator matches sentences1[i] with sentences2[i]",
"Compute the cosine similarity and compares it to scores[i]",
]
scores = [0.3, 0.6, 0.2]
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
# ... Your other code to load training data
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
warmup_steps=100,
evaluator=evaluator,
evaluation_steps=500,
)
```
### Continue Training on Other Data
[training_stsbenchmark_continue_training.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py) shows an example of continuing the training of an already fine-tuned model. In that example, we load a sentence transformer model that was first fine-tuned on the NLI dataset and continue training it on the training data of the STS benchmark.
First, we load a pre-trained model from the server:
```python
model = SentenceTransformer("bert-base-nli-mean-tokens")
```
The next steps are as before. We specify training and dev data:
```python
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
sts_reader.get_examples("sts-dev.csv")
)
```
In that example, we use CosineSimilarityLoss, which computes the cosine similarity between two sentences and compares this score with a provided gold similarity score.
Then we can train as before:
```python
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
epochs=num_epochs,
evaluation_steps=1000,
warmup_steps=warmup_steps,
output_path=model_save_path,
)
```
## Loading Custom SentenceTransformer Models
Loading trained models is easy. You can specify a path:
```python
model = SentenceTransformer("./my/path/to/model/")
```
Note: It is important that the path contains a `/` or `\`; otherwise, it is not recognized as a path.
You can also host the training output on a server and download it:
```python
model = SentenceTransformer('http://www.server.com/path/to/model/my_model.zip')
```
With the first call, the model is downloaded and stored in the local Hugging Face cache folder (`~/.cache/huggingface`). For this to work, you must zip all files and subfolders of your model.
## Multitask Training
The library supports multi-task learning with training data from different datasets and with different loss functions. For an example, see [training_multi-task.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/other/training_multi-task.py).
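A minimal sketch of multi-task training by passing several `(dataloader, loss)` tuples to `model.fit()`; the example data below is made up:
```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Task 1: pairs with a similarity score -> CosineSimilarityLoss
sts_examples = [InputExample(texts=["My first sentence", "My second sentence"], label=0.8)]
sts_dataloader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model)

# Task 2: unlabeled (anchor, positive) pairs -> MultipleNegativesRankingLoss
pair_examples = [InputExample(texts=["How big is London?", "London has 9,787,426 inhabitants"])]
pair_dataloader = DataLoader(pair_examples, shuffle=True, batch_size=16)
pair_loss = losses.MultipleNegativesRankingLoss(model)

# Each training step draws one batch from every objective
model.fit(
    train_objectives=[(sts_dataloader, sts_loss), (pair_dataloader, pair_loss)],
    epochs=1,
    warmup_steps=10,
)
```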
## Adding Special Tokens
Depending on the task, you might want to add special tokens to the tokenizer and the Transformer model. You can use the following code-snippet to achieve this:
```python
from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.Transformer("bert-base-uncased")
tokens = ["[DOC]", "[QRY]"]
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```
If you want to extend the vocabulary of an existing SentenceTransformer model, you can use the following code:
```python
from sentence_transformers import SentenceTransformer, models
model = SentenceTransformer("all-MiniLM-L6-v2")
word_embedding_model = model._first_module()
tokens = ["[DOC]", "[QRY]"]
word_embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
```
In the above example, the two new tokens `[DOC]` and `[QRY]` are added to the model. Their respective word embeddings are initialized randomly. It is advisable to then fine-tune the model on your downstream task.
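A hypothetical usage sketch after fine-tuning: prefix queries and documents with the newly added tokens so the model can treat them differently. The texts below are just examples:
```python
from sentence_transformers import util

query_embedding = model.encode("[QRY] How big is London?")
passage_embedding = model.encode("[DOC] London has 9,787,426 inhabitants at the 2011 census.")
print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
```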
## Best Transformer Model
The quality of your text embedding model depends on which transformer model you choose. Sadly, better performance on benchmarks such as GLUE or SuperGLUE does not imply that a model will also yield better sentence representations.
To test the suitability of transformer models, I use the [training_nli_v2.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py) script and train on 560k (anchor, positive, negative)-triplets for 1 epoch with batch size 64. I then evaluate on 14 diverse text similarity tasks (clustering, semantic search, duplicate detection, etc.) from various domains.
The following table shows the performance of different models on this benchmark:
| Model | Performance (14 sentence similarity tasks) |
| --- | :---: |
| [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) | 60.99 |
| [nghuyong/ernie-2.0-en](https://huggingface.co/nghuyong/ernie-2.0-en) | 60.73 |
| [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) | 60.21 |
| [roberta-base](https://huggingface.co/roberta-base) | 59.63 |
| [t5-base](https://huggingface.co/t5-base) | 59.21 |
| [bert-base-uncased](https://huggingface.co/bert-base-uncased) | 59.17 |
| [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) | 59.03 |
| [nreimers/TinyBERT_L-6_H-768_v2](https://huggingface.co/nreimers/TinyBERT_L-6_H-768_v2) | 58.27 |
| [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) | 57.63 |
| [nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large](https://huggingface.co/nreimers/MiniLMv2-L6-H768-distilled-from-BERT-Large) | 57.31 |
| [albert-base-v2](https://huggingface.co/albert-base-v2) | 57.14 |
| [microsoft/MiniLM-L12-H384-uncased](https://huggingface.co/microsoft/MiniLM-L12-H384-uncased) | 56.79 |
| [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) | 54.46 |
# Semantic Textual Similarity
Once you have [sentence embeddings computed](../../examples/applications/computing-embeddings/README.md), you usually want to compare them to each other. Here, I show you how you can compute the cosine similarity between embeddings, for example, to measure the semantic similarity of two texts.
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Two lists of sentences
sentences1 = [
"The cat sits outside",
"A man is playing guitar",
"The new movie is awesome",
]
sentences2 = [
"The dog plays in the garden",
"A woman watches TV",
"The new movie is so great",
]
# Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)
# Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
# Output the pairs with their score
for i in range(len(sentences1)):
print("{} \t\t {} \t\t Score: {:.4f}".format(
sentences1[i], sentences2[i], cosine_scores[i][i]
))
```
We pass the `convert_to_tensor=True` parameter to the encode function. This returns a PyTorch tensor containing our embeddings. We can then call `util.cos_sim(A, B)`, which computes the cosine similarity between all vectors in *A* and all vectors in *B*.
In the above example, this returns a 3×3 matrix with the cosine similarity scores for all possible pairs between *embeddings1* and *embeddings2*.
You can also use this function to find the pairs with the highest cosine similarity scores:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
# Single list of sentences
sentences = [
"The cat sits outside",
"A man is playing guitar",
"I love pasta",
"The new movie is awesome",
"The cat plays in the garden",
"A woman watches TV",
"The new movie is so great",
"Do you like pizza?",
]
# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
# Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)
# Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores) - 1):
    for j in range(i + 1, len(cosine_scores)):  # skip self-pairs and mirrored duplicates
pairs.append({"index": [i, j], "score": cosine_scores[i][j]})
# Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x["score"], reverse=True)
for pair in pairs[0:10]:
i, j = pair["index"]
print("{} \t\t {} \t\t Score: {:.4f}".format(
sentences[i], sentences[j], pair["score"]
))
```
Note that the above is a brute-force approach with quadratic complexity. For long lists of sentences, this might be infeasible. If you want to find the highest-scoring pairs in a long list of sentences, have a look at [Paraphrase Mining](../../examples/applications/paraphrase-mining/README.md).
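A minimal sketch using `util.paraphrase_mining`, which chunks the similarity computation so it also scales to long sentence lists; the sentences are made up:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sits outside",
    "A man is playing guitar",
    "The new movie is awesome",
    "The new movie is so great",
]
# Returns a list of [score, i, j] entries, sorted by decreasing score
paraphrases = util.paraphrase_mining(model, sentences, top_k=5)
for score, i, j in paraphrases[0:10]:
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))
```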
# Examples
This folder contains various examples of how to use SentenceTransformers.
## Applications
The [applications](applications/) folder contains examples of how to use SentenceTransformers for tasks like clustering or semantic search.
## Evaluation
The [evaluation](evaluation/) folder contains some examples of how to evaluate SentenceTransformer models for common tasks.
## Training
The [training](training/) folder contains examples of how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embeddings. For documentation on how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html).
## Unsupervised Learning
The [unsupervised_learning](unsupervised_learning/) folder contains examples of how to train sentence embedding models without labeled data.
# Applications
SentenceTransformers can be used for various use cases. In these folders, you find several example scripts that showcase how SentenceTransformers can be used.
## Computing Embeddings
The [computing-embeddings](computing-embeddings/) folder contains examples of how to compute sentence embeddings using SentenceTransformers.
## Clustering
The [clustering](clustering/) folder shows how SentenceTransformers can be used for text clustering, i.e., grouping sentences together based on their similarity.
## Cross-Encoder
SentenceTransformers also supports training and inference of [Cross-Encoders](cross-encoder/). There, two sentences are presented simultaneously to the transformer network, which outputs a score (0...1) indicating their similarity, or a label.
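A minimal cross-encoder inference sketch; the checkpoint name is just one example of a publicly available cross-encoder:
```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([
    ("How big is London", "London has 9,787,426 inhabitants at the 2011 census"),
    ("How big is London", "A man is playing guitar"),
])
print(scores)  # one relevance score per (sentence_a, sentence_b) pair
```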
## Parallel Sentence Mining
The [parallel-sentence-mining](parallel-sentence-mining/) folder contains examples of how parallel (translated) sentences can be found in two corpora of different languages. For example, given the English and the Spanish Wikipedia, the script finds and returns all translated English-Spanish sentence pairs.
## Paraphrase Mining
The [paraphrase-mining](paraphrase-mining/) folder contains examples for finding all paraphrases in a large set of sentences. This can be used, e.g., to find duplicate questions or duplicate sentences in a set of millions of questions / sentences.
## Semantic Search
The [semantic-search](semantic-search/) folder shows examples of semantic search: given a sentence, find semantically similar sentences in a large collection.
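A minimal sketch using `util.semantic_search`; the corpus and query below are made up:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "London has 9,787,426 inhabitants at the 2011 census",
    "A man is playing guitar",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("How big is London", convert_to_tensor=True)

# hits[0] holds the results for the first (and only) query, sorted by score
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], "(Score: {:.4f})".format(hit["score"]))
```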
## Retrieve & Rerank
The [retrieve_rerank](retrieve_rerank/) folder shows how to combine a bi-encoder for semantic search retrieval and a more powerful re-ranking stage with a cross-encoder.
## Image Search
The [image-search](image-search/) folder shows how to use the image & text models, which map images and text into the same vector space. This enables image search for a given user query.
## Text Summarization
The [text-summarization](text-summarization/) folder shows how SentenceTransformers can be used for extractive summarization: given a long document, find the k sentences that provide a short, good summary of its content.
# Clustering
Sentence-Transformers can be used in different ways to perform clustering of small or large sets of sentences.
## k-Means
[kmeans.py](kmeans.py) contains an example of using [K-means Clustering Algorithm](https://scikit-learn.org/stable/modules/clustering.html#k-means). K-Means requires that the number of clusters is specified beforehand. The sentences are clustered in groups of about equal size.
## Agglomerative Clustering
[agglomerative.py](agglomerative.py) shows an example of using [Hierarchical clustering](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) with the [Agglomerative Clustering Algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering). In contrast to k-means, we can specify a distance threshold for the clustering: clusters whose distance is below that threshold are merged. This algorithm can be useful if the number of clusters is unknown. With the threshold, we can control whether we want many small, fine-grained clusters or a few coarse-grained clusters.
## Fast Clustering
Agglomerative clustering is quite slow for larger datasets, so it is only applicable to maybe a few thousand sentences.
In [fast_clustering.py](fast_clustering.py) we present a clustering algorithm that is tuned for large datasets (50k sentences in less than 5 seconds). In a large list of sentences it searches for local communities: A local community is a set of highly similar sentences.
You can configure the cosine-similarity threshold above which two sentences are considered similar. You can also specify the minimum size of a local community. This allows you to get either large coarse-grained clusters or small fine-grained clusters (a minimal sketch of the underlying call follows below).
We apply it on the [Quora Duplicate Questions](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) dataset and the output looks something like this:
```
Cluster 1, #83 Elements
What should I do to improve my English ?
What should I do to improve my spoken English?
Can I improve my English?
...
Cluster 2, #79 Elements
How can I earn money online?
How do I earn money online?
Can I earn money online?
...
...
Cluster 47, #25 Elements
What are some mind-blowing Mobile gadgets that exist that most people don't know about?
What are some mind-blowing gadgets and technologies that exist that most people don't know about?
What are some mind-blowing mobile technology tools that exist that most people don't know about?
...
```
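The core of the script is `util.community_detection`; a minimal sketch with made-up sentences and a deliberately small `min_community_size`:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How can I learn Python?",
    "How do I learn Python?",
    "What is the capital of France?",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Each cluster is a list of sentence indices, largest community first
clusters = util.community_detection(embeddings, min_community_size=2, threshold=0.75)
print(clusters)
```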
## Topic Modeling
Topic modeling is the process of discovering topics in a collection of documents.
The following picture shows the topics identified in the 20 newsgroups dataset:
![20news](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/20news_semantic.png)
For each topic, you want to extract the words that describe this topic:
![20news](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/20news_top2vec.png)
Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the repositories [Top2Vec](https://github.com/ddangelov/Top2Vec) and [BERTopic](https://github.com/MaartenGr/BERTopic).
Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470)
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Corpus with example sentences
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"The baby is carried by the woman",
"A man is riding a horse.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"Someone in a gorilla costume is playing a set of drums.",
"A cheetah is running behind its prey.",
"A cheetah chases prey on across a field.",
]
corpus_embeddings = embedder.encode(corpus)
# Some models don't automatically normalize the embeddings, in which case you should normalize the embeddings:
# corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
# Perform agglomerative clustering
clustering_model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.5
)  # alternative settings: affinity="cosine", linkage="average", distance_threshold=0.4
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
if cluster_id not in clustered_sentences:
clustered_sentences[cluster_id] = []
clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in clustered_sentences.items():
print("Cluster ", i + 1)
print(cluster)
print("")
"""
This is a more complex example on performing clustering on large scale dataset.
This example finds local communities in a large set of sentences, i.e., groups of sentences that are highly
similar. You can freely configure the threshold for what is considered similar. A high threshold will
only find extremely similar sentences; a lower threshold will find more sentences that are less similar.
A second parameter is 'min_community_size': Only communities with at least a certain number of sentences will be returned.
The method for finding the communities is extremely fast; clustering 50k sentences requires only about 5 seconds (plus embedding computation).
In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
# Model for computing sentence embeddings. We use one trained for similar questions detection
model = SentenceTransformer("all-MiniLM-L6-v2")
# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000 # We limit our corpus to only the first 50k questions
# Check if the dataset exists. If not, download it
if not os.path.exists(dataset_path):
print("Download dataset")
util.http_get(url, dataset_path)
# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding="utf8") as fIn:
reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
for row in reader:
corpus_sentences.add(row["question1"])
corpus_sentences.add(row["question2"])
if len(corpus_sentences) >= max_corpus_size:
break
corpus_sentences = list(corpus_sentences)
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)
print("Start clustering")
start_time = time.time()
# Two parameters to tune:
# min_community_size: Only consider clusters that have at least 25 elements
# threshold: Consider sentence pairs with a cosine-similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)
print("Clustering done after {:.2f} sec".format(time.time() - start_time))
# Print for all clusters the top 3 and bottom 3 elements
for i, cluster in enumerate(clusters):
print("\nCluster {}, #{} Elements ".format(i + 1, len(cluster)))
for sentence_id in cluster[0:3]:
print("\t", corpus_sentences[sentence_id])
print("\t", "...")
for sentence_id in cluster[-3:]:
print("\t", corpus_sentences[sentence_id])
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then k-means clustering is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Corpus with example sentences
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"The baby is carried by the woman",
"A man is riding a horse.",
"A man is riding a white horse on an enclosed ground.",
"A monkey is playing drums.",
"Someone in a gorilla costume is playing a set of drums.",
"A cheetah is running behind its prey.",
"A cheetah chases prey on across a field.",
]
corpus_embeddings = embedder.encode(corpus)
# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
clustered_sentences[cluster_id].append(corpus[sentence_id])
for i, cluster in enumerate(clustered_sentences):
print("Cluster ", i + 1)
print(cluster)
print("")