# Semantic Textual Similarity

Semantic Textual Similarity (STS) assigns a score to the similarity of two texts. In this example, we use the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset as training data to fine-tune our model. See the following example scripts for how to fine-tune a SentenceTransformer model on STS data:

- **[training_stsbenchmark.py](training_stsbenchmark.py)** - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model (e.g. [`distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased)) together with a pooling layer.
- **[training_stsbenchmark_continue_training.py](training_stsbenchmark_continue_training.py)** - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model (e.g. [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)).

## Training data
```eval_rst
In STS, we have sentence pairs annotated together with a score indicating the similarity. In the original STSbenchmark dataset, the scores range from 0 to 5. We have normalized these scores to range between 0 and 1 in `stsb <https://huggingface.co/datasets/sentence-transformers/stsb>`_, as that is required for :class:`~sentence_transformers.losses.CosineSimilarityLoss`, as you can see in the `Loss Overview <../../../docs/sentence_transformer/loss_overview.html>`_.
```
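
Since the original scores run from 0 to 5, the normalization is simply a division by the maximum score. This is a minimal sketch for illustration, not the code that was used to build the dataset:

```python
def normalize_sts_score(raw_score: float, max_score: float = 5.0) -> float:
    """Map an original STSbenchmark score in [0, 5] to [0, 1]."""
    return raw_score / max_score

print(normalize_sts_score(5.0))  # 1.0
print(normalize_sts_score(2.5))  # 0.5
```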

Here is a simplified version of our training data:

```python
from datasets import Dataset

sentence1_list = ["My first sentence", "Another pair"]
sentence2_list = ["My second sentence", "Unrelated sentence"]
labels_list = [0.8, 0.3]
train_dataset = Dataset.from_dict({
    "sentence1": sentence1_list,
    "sentence2": sentence2_list,
    "label": labels_list,
})
# => Dataset({
#     features: ['sentence1', 'sentence2', 'label'],
#     num_rows: 2
# })
print(train_dataset[0])
# => {'sentence1': 'My first sentence', 'sentence2': 'My second sentence', 'label': 0.8}
print(train_dataset[1])
# => {'sentence1': 'Another pair', 'sentence2': 'Unrelated sentence', 'label': 0.3}
```

In the aforementioned scripts, we directly load the [stsb](https://huggingface.co/datasets/sentence-transformers/stsb) dataset:

```python
from datasets import load_dataset

train_dataset = load_dataset("sentence-transformers/stsb", split="train")
# => Dataset({
#     features: ['sentence1', 'sentence2', 'score'],
#     num_rows: 5749
# })
```

## Loss Function
```eval_rst
We use :class:`~sentence_transformers.losses.CosineSimilarityLoss` as our loss function.
```

<img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SBERT_Siamese_Network.png" alt="SBERT Siamese Network Architecture" width="250"/>

For each sentence pair, we pass sentence A and sentence B through the BERT-based model, which yields the embeddings *u* and *v*. The similarity of these embeddings is computed using cosine similarity, and the result is compared to the gold similarity score. Note that the two sentences are fed through the same model rather than through two separate models. In particular, the cosine similarity for similar texts is maximized and the cosine similarity for dissimilar texts is minimized. This allows our model to be fine-tuned to recognize the similarity of sentences.
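
The comparison described above can be sketched in plain Python, with toy vectors standing in for the embeddings *u* and *v* (illustrative values only, not real model outputs):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [0.5, 0.8, 0.1]  # toy embedding of sentence A
v = [0.4, 0.9, 0.0]  # toy embedding of sentence B
predicted = cosine_similarity(u, v)
gold = 0.8
# During training, the gap between the predicted cosine and the
# gold similarity score is what the loss pushes down.
error = (predicted - gold) ** 2
```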

For more details, see [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).

```eval_rst
:class:`~sentence_transformers.losses.CoSENTLoss` and :class:`~sentence_transformers.losses.AnglELoss` are more modern variants of :class:`~sentence_transformers.losses.CosineSimilarityLoss` that accept the same data format of a sentence pair with a similarity score ranging from 0.0 to 1.0. Informal experiments indicate that these two produce stronger models than :class:`~sentence_transformers.losses.CosineSimilarityLoss`.
```
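
For intuition only, here is a toy sketch of the ranking term behind CoSENT-style losses, following the published CoSENT formulation: pairs with a higher gold score should also receive a higher cosine similarity, and violations of that ordering are penalized exponentially. The `cosent_loss` helper below is hypothetical; the real implementation is `losses.CoSENTLoss`.

```python
import math

def cosent_loss(cosines, scores, scale=20.0):
    """Toy CoSENT-style loss: log(1 + sum of exp(scale * (cos_j - cos_i)))
    over all pairs (i, j) where the gold score of i exceeds that of j."""
    total = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if scores[i] > scores[j]:
                total += math.exp(scale * (cosines[j] - cosines[i]))
    return math.log1p(total)

# Cosines that agree with the gold ordering give a small loss ...
good = cosent_loss(cosines=[0.9, 0.2], scores=[1.0, 0.0])
# ... while an inverted ordering is penalized heavily.
bad = cosent_loss(cosines=[0.2, 0.9], scores=[1.0, 0.0])
```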