# Paraphrase Mining

Paraphrase mining is the task of finding paraphrases (texts with identical or similar meaning) in a large corpus of sentences. In [Semantic Textual Similarity](../../../docs/sentence_transformer/usage/semantic_textual_similarity.rst) we saw a simplified version of finding paraphrases in a list of sentences. The method presented there used a brute-force approach to score and rank all pairs.

```eval_rst
However, because this approach has quadratic runtime, it does not scale to large collections (10,000 sentences or more). For larger collections, the :func:`~sentence_transformers.util.paraphrase_mining` function can be used::

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import paraphrase_mining

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Single list of sentences, possibly tens of thousands of them
    sentences = [
        "The cat sits outside",
        "A man is playing guitar",
        "I love pasta",
        "The new movie is awesome",
        "The cat plays in the garden",
        "A woman watches TV",
        "The new movie is so great",
        "Do you like pizza?",
    ]

    paraphrases = paraphrase_mining(model, sentences)

    for paraphrase in paraphrases[0:10]:
        score, i, j = paraphrase
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

The :func:`~sentence_transformers.util.paraphrase_mining` function accepts the following parameters:

.. autofunction:: sentence_transformers.util.paraphrase_mining

To optimize memory and computation time, paraphrase mining is performed in chunks, as specified by ``query_chunk_size`` and ``corpus_chunk_size``.
To be specific, only ``query_chunk_size * corpus_chunk_size`` pairs are compared at a time, rather than all ``len(sentences) * len(sentences)`` pairs at once. This is more time- and memory-efficient. Additionally, :func:`~sentence_transformers.util.paraphrase_mining` only considers the ``top_k`` best scores per sentence per chunk. You can experiment with this value as a trade-off between efficiency and result quality.

For example, with the following call you get only the single most similar sentence for each sentence.

::

    paraphrases = paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1)

The final key parameter is ``max_pairs``, which determines the maximum number of paraphrase pairs that the function returns. Usually, fewer pairs are returned because duplicates are removed from the list: if it contains both (A, B) and (B, A), only one of them is returned.

.. note::
    
    If B is the most similar sentence for A, A is not necessarily the most similar sentence for B. The returned list can therefore contain entries like (A, B) and (B, C).
```
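
To make the roles of `query_chunk_size`, `corpus_chunk_size`, `top_k` and `max_pairs` more concrete, here is a simplified pure-Python sketch of chunked pair mining. It is only an illustration under stated assumptions, not the library's implementation: `cosine` and `mine_pairs_sketch` are hypothetical helper names, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs_sketch(embeddings, query_chunk_size=2, corpus_chunk_size=2,
                      top_k=1, max_pairs=10):
    """Hypothetical helper: chunked scoring with top_k and de-duplication."""
    n = len(embeddings)
    best = {}  # maps (i, j) with i < j to the similarity score
    for q_start in range(0, n, query_chunk_size):
        q_idx = range(q_start, min(q_start + query_chunk_size, n))
        for c_start in range(0, n, corpus_chunk_size):
            c_idx = range(c_start, min(c_start + corpus_chunk_size, n))
            # At most len(q_idx) * len(c_idx) pairs are scored per chunk pair.
            for i in q_idx:
                scored = [(cosine(embeddings[i], embeddings[j]), j)
                          for j in c_idx if j != i]
                # Keep only the top_k best matches for this sentence in this chunk.
                for score, j in heapq.nlargest(top_k, scored):
                    # Sorting the indices makes (A, B) and (B, A) the same key.
                    best[(min(i, j), max(i, j))] = score
    ranked = sorted(((s, i, j) for (i, j), s in best.items()), reverse=True)
    return ranked[:max_pairs]


# Toy embeddings (an assumption for illustration): 0 and 1 are close, 2 and 3 are close.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for score, i, j in mine_pairs_sketch(toy_embeddings):
    print(f"{i} <-> {j}  score: {score:.4f}")
```

Storing each pair under a sorted index key is one simple way to collapse (A, B) and (B, A) into a single entry, which is why the number of returned pairs is usually below `max_pairs`.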