# Paraphrase Data

```eval_rst
In our paper `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation <https://arxiv.org/abs/2004.09813>`_, we showed that paraphrase data together with :class:`~sentence_transformers.losses.MultipleNegativesRankingLoss` is a powerful combination to learn sentence embeddings models. Read `NLI > MultipleNegativesRankingLoss <../nli/README.html#multiplenegativesrankingloss>`_ for more information on this loss function.
```
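The intuition behind this loss can be illustrated with a minimal sketch (toy 2-d vectors and a plain-Python cosine, not the library implementation): each anchor's paraphrase is the correct "class", and the other positives in the batch serve as in-batch negatives for a cross-entropy over scaled cosine similarities.

```python
import math

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

def mnrl(anchors, positives, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss: for each anchor i, score it
    against every positive in the batch; positives[i] is the true pair,
    all other positives act as in-batch negatives."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        scores = [scale * cos(anchors[i], positives[j]) for j in range(n)]
        log_z = math.log(sum(math.exp(s) for s in scores))
        loss += log_z - scores[i]  # negative log-softmax of the true pair
    return loss / n

# Toy embeddings: each anchor is closest to its own paraphrase,
# so the loss is near zero; swapping the positives makes it large.
anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
print(mnrl(anchors, positives))
```

Larger batches add more in-batch negatives for free, which is why this loss tends to benefit from large batch sizes.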

The [training.py](training.py) script loads various datasets from the [Dataset Overview](../../../docs/sentence_transformer/dataset_overview.html#pre-existing-datasets). We construct batches by sampling examples from the respective dataset. Examples are not mixed between datasets, i.e., each batch consists only of examples from a single dataset.

As the datasets differ considerably in size, we perform [round-robin sampling](../../../docs/package_reference/sentence_transformer/training_args.html#sentence_transformers.training_args.MultiDatasetBatchSamplers) so that training uses the same number of batches from each dataset.
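The sampling scheme can be sketched as follows (a standalone illustration with hypothetical names, not the library's `MultiDatasetBatchSamplers` implementation): we cycle through one batch iterator per dataset, so small datasets repeat while every dataset contributes equally many batches.

```python
from itertools import cycle

def round_robin_batches(dataset_batches, steps_per_dataset):
    """Alternate between per-dataset batch iterators so each dataset
    contributes the same number of batches, regardless of its size.
    Batches stay homogeneous: each comes from exactly one dataset."""
    iters = [cycle(batches) for batches in dataset_batches]  # small datasets repeat
    schedule = []
    for _ in range(steps_per_dataset):
        for it in iters:
            schedule.append(next(it))
    return schedule

# Toy "batches" labeled by dataset; sizes differ, the sampling does not.
big   = ["big-0", "big-1", "big-2", "big-3"]
small = ["small-0"]
print(round_robin_batches([big, small], steps_per_dataset=2))
# → ['big-0', 'small-0', 'big-1', 'small-0']
```

In the library itself this behavior is selected via a training argument rather than written by hand, so the sketch above is only meant to show what "same amount of batches from each dataset" implies for small datasets.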

## Pre-Trained Models
Have a look at [pre-trained models](../../../docs/sentence_transformer/pretrained_models.md) to view all models that were trained on these paraphrase datasets.

- [paraphrase-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L12-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
- [paraphrase-distilroberta-base-v2](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v2) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
- [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
- [paraphrase-xlm-r-multilingual-v1](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: [paraphrase-distilroberta-base-v1](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1), Student: [xlm-r-base](https://huggingface.co/FacebookAI/xlm-roberta-base))