@@ -7,13 +7,13 @@ Domain adaptation is still an active research field and there exists no perfect
## Domain Adaptation vs. Unsupervised Learning
There exist methods for [unsupervised text embedding learning](../unsupervised_learning/README.md); however, they generally perform rather badly because they are not really able to learn domain-specific concepts.
A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data)
A much better approach is domain adaptation: Here you have an unlabeled corpus from your specific domain together with an existing labeled corpus. You can find many suitable labeled training datasets here: [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552)
## Adaptive Pre-Training
When using adaptive pre-training, you first pre-train on your target corpus using e.g. [Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data)).
When using adaptive pre-training, you first pre-train on your target corpus using e.g. [Masked Language Modeling](../unsupervised_learning/MLM/README.md) or [TSDAE](../unsupervised_learning/TSDAE/README.md) and then you fine-tune on an existing training dataset (see [Embedding Model Datasets Collection](https://huggingface.co/collections/sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552)).
In our paper [TSDAE](https://arxiv.org/abs/2104.06979) we evaluated several methods for domain adaptation on 4 domain specific sentence embedding tasks:
...
...
@@ -44,9 +44,9 @@ A big **disadvantage of adaptive pre-training** is the high computational overhe
## GPL: Generative Pseudo-Labeling
[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on top of a fine-tuned model. Hence, you can use one of the [pre-trained models](https://www.sbert.net/docs/pretrained_models.html) and adapt it to your specific domain:
[GPL](https://arxiv.org/abs/2112.07577) overcomes the aforementioned issue: It can be applied on top of a fine-tuned model. Hence, you can use one of the [pre-trained models](../../docs/sentence_transformer/pretrained_models.md) and adapt it to your specific domain:
The longer you train, the better your model gets. In our experiments, we trained the models for about 1 day on a V100 GPU. GPL can be combined with adaptive pre-training, which can give another performance boost.
...
...
@@ -58,15 +58,16 @@ The longer you train, the better your model gets. In our experiments, we were tr
- **Query Generation**: For a given text from our domain, we first use a T5 model that generates a possible query for the given text. E.g. when your text is *"Python is a high-level general-purpose programming language"*, the model might generate a query like *"What is Python"*. You can find various query generators on our [doc2query-hub](https://huggingface.co/doc2query).
- **Negative Mining**: Next, for the generate query *"What is Python"* we mine negative passages from our corpus, i.e. passages that are similar to the query but don't which a user would not consider relevant. Such a negative passage could be *"Java is a high-level, class-based, object-oriented programming language."*. We do this mining using dense retrieval, i.e. we use one of the existing text embedding models and retrieve relevant paragraphs for the given query.
- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) to score all (query, passage)-pairs.
- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](https://www.sbert.net/docs/package_reference/losses.html#marginmseloss).
- **Negative Mining**: Next, for the generated query *"What is Python"* we mine negative passages from our corpus, i.e. passages that are similar to the query but which a user would not consider relevant. Such a negative passage could be *"Java is a high-level, class-based, object-oriented programming language."*. We do this mining using dense retrieval, i.e. we use one of the existing text embedding models and retrieve relevant paragraphs for the given query.
- **Pseudo Labeling**: It might be that in the negative mining step we retrieve a passage that is actually relevant for the query (like another definition for *"What is Python"*). To overcome this issue, we use a [Cross-Encoder](../applications/cross-encoder/README.html) to score all (query, passage)-pairs.
- **Training**: Once we have the triplets *(generated query, positive passage, mined negative passage)* and the Cross-Encoder scores for *(query, positive)* and *(query, negative)* we can start training the text embedding model using [MarginMSELoss](../../docs/package_reference/sentence_transformer/losses.html#marginmseloss).
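
The training objective in the last step can be illustrated with a minimal, library-free sketch. It assumes dot-product similarity between embeddings and takes the Cross-Encoder scores as given; the function name `margin_mse_loss` and all example values are hypothetical and only serve to show the idea, not the library's actual implementation.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_mse_loss(query_embs, pos_embs, neg_embs, ce_pos_scores, ce_neg_scores):
    """Mean squared error between the student margin and the teacher margin.

    Student margin: sim(query, positive) - sim(query, negative), computed from
    the embeddings of the model being trained (dot product assumed here).
    Teacher margin: Cross-Encoder score(query, positive) - score(query, negative).
    """
    squared_errors = []
    for q, p, n, s_pos, s_neg in zip(
        query_embs, pos_embs, neg_embs, ce_pos_scores, ce_neg_scores
    ):
        student_margin = dot(q, p) - dot(q, n)
        teacher_margin = s_pos - s_neg
        squared_errors.append((student_margin - teacher_margin) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only: one triplet with made-up 2-d embeddings.
loss = margin_mse_loss(
    query_embs=[[1.0, 0.0]],
    pos_embs=[[2.0, 0.0]],
    neg_embs=[[0.5, 0.0]],
    ce_pos_scores=[9.0],  # hypothetical Cross-Encoder score for (query, positive)
    ce_neg_scores=[7.5],  # hypothetical Cross-Encoder score for (query, negative)
)
print(loss)
```

Because the target is a margin rather than a hard 0/1 label, a mined "negative" that the Cross-Encoder scores almost as high as the positive yields a small target margin, so the embedding model is not pushed to separate the two.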
The **pseudo labeling** step is quite important and is what leads to the increased performance compared to the previous method QGen, which treated passages as either positive (1) or negative (0). As we see in the following picture, for a generated query (*"what is futures contract"*), the negative mining step retrieves passages that are partly or highly relevant to the generated query. Using MarginMSELoss and the Cross-Encoder, we can identify these passages and teach the text embedding model that these passages are also relevant for the given query.
Given a tab separated file (.tsv) with parallel sentences, where the second column is the translation of the sentence in the first column, for example, in the format:
src1 trg1
src2 trg2
...
where trg_i is the translation of src_i.
Given src_i, the TranslationEvaluator checks which trg_j has the highest similarity using cosine similarity. If i == j, we assume
a match, i.e., the correct translation has been found for src_i out of all possible target sentences.
Given a dataset with parallel sentences, one "english" column and one "non_english" column, this script evaluates a model on the translation task.
Given a sentence in the "english" column, the model should find the correct translation in the "non_english" column, based on just the embeddings.
It then computes the accuracy over all possible source sentences src_i. It also computes the accuracy for the other direction.
A high accuracy score indicates that the model is able to find the correct translation out of a large pool of sentences.
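
The matching procedure described above can be sketched in plain Python as follows. This is an illustrative reimplementation under the assumption that the embeddings are already computed; the function name `translation_accuracy` and the toy vectors are hypothetical and not part of the evaluator's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_accuracy(src_embs, trg_embs):
    """Return (src-to-trg accuracy, trg-to-src accuracy).

    src_embs[i] and trg_embs[i] are assumed to embed a sentence and its
    translation. A source i counts as correct if the most similar target
    is target i, and likewise for the other direction.
    """
    n = len(src_embs)
    sims = [[cosine(s, t) for t in trg_embs] for s in src_embs]

    src2trg_hits = 0
    for i in range(n):
        best_j = max(range(n), key=lambda j: sims[i][j])
        src2trg_hits += best_j == i

    trg2src_hits = 0
    for j in range(n):
        best_i = max(range(n), key=lambda i: sims[i][j])
        trg2src_hits += best_i == j

    return src2trg_hits / n, trg2src_hits / n


# Toy 2-d embeddings, for illustration only.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trg = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(translation_accuracy(src, trg))
```

The two directions can disagree: a target may be the nearest neighbor of its source while a different source is the nearest neighbor of that target, which is why both accuracies are reported.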
@@ -4,14 +4,22 @@ This folder contains various examples to fine-tune `SentenceTransformers` for sp
To get started, I recommend having a look at the Semantic Textual Similarity ([STS](sts/)) or the Natural Language Inference ([NLI](nli/)) examples.
For documentation on how to train your own models, see [Training Overview](http://www.sbert.net/docs/training/overview.html).
For documentation on how to train your own models, see [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html).
## Training Examples
- [adaptive_layer](adaptive_layer/) - Examples to train models whose layers can be removed on the fly for faster inference.
- [avg_word_embeddings](avg_word_embeddings/) - This folder contains examples to train models based on classical word embeddings like GloVe. These models are extremely fast, but are less accurate than transformer-based models.
- [clip](clip/) - Examples to train CLIP image models.
- [cross-encoder](cross-encoder/) - Examples to train [CrossEncoder](http://www.sbert.net/docs/cross_encoder/usage/usage.html) models.
- [data_augmentation](data_augmentation/) - Examples of how to apply data augmentation strategies to improve embedding models.
- [distillation](distillation/) - Examples to make models smaller, faster and lighter.
- [hpo](hpo/) - Examples with hyperparameter search to find the best hyperparameters for your task.
- [matryoshka](matryoshka/) - Examples of training embedding models whose embeddings can be truncated (allowing for faster search) with minimal performance loss.
- [ms_marco](ms_marco/) - Example training scripts for training on the MS MARCO information retrieval dataset.
- [multilingual](multilingual/) - Existing monolingual models can be extended to various languages ([paper](https://arxiv.org/abs/2004.09813)). This folder contains a step-by-step guide to extend existing models to new languages.
- [nli](nli/) - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sentence embeddings.
- [other](other/) - Various tiny examples for show-casing one specific training case.
- [paraphrases](paraphrases/) - Examples for training models capable of recognizing paraphrases, i.e. understanding when texts have the same meaning despite using different words.
- [quora_duplicate_questions](quora_duplicate_questions/) - Quora Duplicate Questions is a large corpus of duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
- [sts](sts/) - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we have a sentence pair and a score indicating the semantic similarity.
- [other](other/) - Various tiny examples for show-casing one specific training case.
Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776) (2DMSE) revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs.
Embedding models are often encoder models with numerous layers, such as 12 (e.g. [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) or 6 (e.g. [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)). To get embeddings, every single one of these layers must be traversed. The [2D Matryoshka Sentence Embeddings](https://arxiv.org/abs/2402.14776v1) (2DMSE) preprint revisits this concept by proposing an approach to train embedding models that will perform well when only using a selection of all layers. This results in faster inference speeds at relatively low performance costs.
```eval_rst
.. note::
    The 2DMSE preprint was later updated and renamed to `ESE: Espresso Sentence Embeddings <https://arxiv.org/abs/2402.14776>`_. The Sentence Transformers implementations of Adaptive Layers and Matryoshka2d (Adaptive Layer + Matryoshka Embeddings) are based on the initial preprint, and we accept contributions that implement the updated ESE paper.
```
## Use Cases
...
...
@@ -36,7 +41,7 @@ model = SentenceTransformer("microsoft/mpnet-base")
Then we can run inference with it using <a href="../../../docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode"><code>SentenceTransformer.encode</code></a>.
Then we can run inference with it using <a href="../../../docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode"><code>SentenceTransformer.encode</code></a>.
# compared to tensor([[ 0.7547, -0.0162]]) for the full model
```
...
...
@@ -143,11 +147,11 @@ As you can see, the similarity between the related sentences is much higher than
## Code Examples
See the following scripts as examples of how to apply the <a href="../../../docs/package_reference/losses.html#adaptivelayerloss"><code>AdaptiveLayerLoss</code></a> in practice:
See the following scripts as examples of how to apply the <a href="../../../docs/package_reference/sentence_transformer/losses.html#adaptivelayerloss"><code>AdaptiveLayerLoss</code></a> in practice:
* **[adaptive_layer_nli.py](adaptive_layer_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `AdaptiveLayerLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
* **[adaptive_layer_sts.py](adaptive_layer_sts.py)**: This example uses the `CoSENTLoss` with `AdaptiveLayerLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation.
And the following scripts to see how to apply <a href="../../../docs/package_reference/losses.html#matryoshka2dloss"><code>Matryoshka2dLoss</code></a>:
And the following scripts to see how to apply <a href="../../../docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss"><code>Matryoshka2dLoss</code></a>:
* **[2d_matryoshka_nli.py](../matryoshka/2d_matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `Matryoshka2dLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
* **[2d_matryoshka_sts.py](../matryoshka/2d_matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `Matryoshka2dLoss` to train an embedding model on the training set of the STSBenchmark dataset. It is an adaptation of the [STS](../sts/README) documentation.