Sentence Transformers implements two forms of distributed training: Data Parallel (DP) and Distributed Data Parallel (DDP). Read the `Data Parallelism documentation <https://huggingface.co/docs/transformers/en/perf_train_gpu_many#data-parallelism>`_ on Hugging Face for more details on these strategies. Some of the key differences include:
1. DDP is generally faster than DP because it has to communicate less data.
2. With DP, GPU 0 does the bulk of the work, while with DDP, the work is distributed more evenly across all GPUs.
3. DDP allows for training across multiple machines, while DP is limited to a single machine.
In short, **DDP is generally recommended**. You can use DDP by running your normal training scripts with ``torchrun`` or ``accelerate``. For example, if you have a script called ``train_script.py``, you can run it with DDP using one of the following commands::
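
    torchrun --nproc_per_node=4 train_script.py
    # or, equivalently, with Hugging Face Accelerate:
    accelerate launch --num_processes 4 train_script.py

Here, ``4`` is the number of GPUs (i.e. processes) to launch; adjust it to match your hardware.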
When performing distributed training, you have to wrap your code in a ``main`` function and call it with ``if __name__ == "__main__":``. This is because each process will run the entire script, so you don't want to run the same code multiple times. Here is an example of how to do this::
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments, SentenceTransformerTrainer
    # Other imports here


    def main():
        # Your training code here
        ...


    if __name__ == "__main__":
        main()
.. note::

    When using an `Evaluator <../training_overview.html#evaluator>`_, the evaluator only runs on the first device, unlike the training and evaluation datasets, which are shared across all devices.
Comparison
----------
The following table shows the speedup of DDP over DP and over no parallelism, measured with the following setup:
- Hardware: a ``p3.8xlarge`` AWS instance, i.e. 4x V100 GPUs
- Model being trained: `microsoft/mpnet-base <https://huggingface.co/microsoft/mpnet-base>`_ (133M parameters)
- Maximum sequence length: 384 (following `all-mpnet-base-v2 <https://huggingface.co/sentence-transformers/all-mpnet-base-v2>`_)
- Training datasets: MultiNLI, SNLI and STSB (note: these have short texts)
- Losses: :class:`~sentence_transformers.losses.SoftmaxLoss` for MultiNLI and SNLI, :class:`~sentence_transformers.losses.CosineSimilarityLoss` for STSB
.. list-table::
   :header-rows: 1

   * - Strategy
     - Launch command
     - Samples per second
   * - Data Parallel (DP)
     - ``python train_script.py`` (DP is used by default when launching a script with ``python``)
     - 3675 (1.349x speedup)
   * - **Distributed Data Parallel (DDP)**
     - ``torchrun --nproc_per_node=4 train_script.py`` or ``accelerate launch --num_processes 4 train_script.py``
     - **6980 (2.562x speedup)**
FSDP
----
Fully Sharded Data Parallelism (FSDP) is another distributed training strategy that is not fully supported by Sentence Transformers. It is a more advanced version of DDP that shards the model across GPUs, which only pays off for very large models. In the comparison above, FSDP reaches 5782 samples per second (2.122x speedup), i.e. **worse than DDP**. If you want to use FSDP with Sentence Transformers, you have to be aware of the following limitations:
- You can't use the ``evaluator`` functionality with FSDP.
- You have to save the trained model with ``trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")`` followed by ``trainer.save_model("output")``.
- You have to set ``fsdp=["full_shard", "auto_wrap"]`` and ``fsdp_config={"transformer_layer_cls_to_wrap": "BertLayer"}`` in your ``SentenceTransformerTrainingArguments``, where the class to wrap is the repeated encoder layer that houses the multi-head attention and feed-forward sublayers, e.g. ``BertLayer`` or ``MPNetLayer``.
Read the `FSDP documentation <https://huggingface.co/docs/accelerate/en/usage_guides/fsdp>`_ by Accelerate for more details.
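For illustration, here is a minimal sketch of a training script that applies these settings. The model, dataset, and loss are placeholders rather than a recommended setup, and the script still has to be launched with ``torchrun`` or ``accelerate launch``::

    from datasets import Dataset

    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss


    def main():
        model = SentenceTransformer("microsoft/mpnet-base")

        # Tiny placeholder dataset; use your own (anchor, positive) pairs instead
        train_dataset = Dataset.from_dict({
            "anchor": ["It's nice weather outside today.", "He drove to work."],
            "positive": ["It's so sunny.", "He took the car to the office."],
        })
        loss = MultipleNegativesRankingLoss(model)

        args = SentenceTransformerTrainingArguments(
            output_dir="output",
            fsdp=["full_shard", "auto_wrap"],
            # MPNetLayer is the repeated encoder layer of microsoft/mpnet-base
            fsdp_config={"transformer_layer_cls_to_wrap": "MPNetLayer"},
        )

        trainer = SentenceTransformerTrainer(
            model=model,
            args=args,
            train_dataset=train_dataset,
            loss=loss,
        )
        trainer.train()

        # Gather the full (unsharded) state dict before saving
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
        trainer.save_model("output")


    if __name__ == "__main__":
        main()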
The [applications](applications/) folder contains examples of how to use SentenceTransformers.
The [evaluation](evaluation/) folder contains some examples of how to evaluate SentenceTransformer models for common tasks.
## Training
The [training](training/) folder contains examples of how to fine-tune transformer models like BERT, RoBERTa, or XLM-RoBERTa for generating sentence embeddings. For documentation on how to train your own models, see the [Training Overview](http://www.sbert.net/docs/sentence_transformer/training_overview.html).
In [fast_clustering.py](fast_clustering.py) we present a clustering algorithm that is tuned for large datasets.
You can configure the cosine-similarity threshold above which two sentences are considered similar, as well as the minimum size of a local community. This allows you to get either large coarse-grained clusters or small fine-grained clusters.

We apply it on the [Quora Duplicate Questions](https://huggingface.co/datasets/sentence-transformers/quora-duplicates) dataset and the output looks something like this:
```
Cluster 1, #83 Elements
...
...
```
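For reference, here is a rough sketch of the same idea using `util.community_detection` directly. The model name and corpus below are illustrative; `fast_clustering.py` itself runs on the full Quora dataset:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "How do I learn Python?",
    "What is the best way to learn Python?",
    "How do I bake bread at home?",
    "What is a simple bread recipe?",
]
embeddings = model.encode(corpus, convert_to_tensor=True)

# threshold: minimum cosine similarity for two sentences to be considered similar
# min_community_size: minimum number of sentences required to form a cluster
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=2)

for i, cluster in enumerate(clusters):
    print(f"Cluster {i + 1}, #{len(cluster)} Elements")
    for sentence_id in cluster:
        print("\t", corpus[sentence_id])
```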
## Topic Modeling

Topic modeling is the process of discovering topics in a collection of documents. For each topic, you want to extract the words that describe this topic.
Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents. For an excellent tutorial, see [Topic Modeling with BERT](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) as well as the [BERTopic](https://github.com/MaartenGr/BERTopic) and [Top2Vec](https://github.com/ddangelov/Top2Vec) repositories.
Image source: [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470)
Prompt Templates
----------------

Some models require using specific text *prompts* to achieve optimal performance. For example, with `intfloat/multilingual-e5-large <https://huggingface.co/intfloat/multilingual-e5-large>`_ you should prefix all queries with ``"query: "`` and all passages with ``"passage: "``. Another example is `BAAI/bge-large-en-v1.5 <https://huggingface.co/BAAI/bge-large-en-v1.5>`_, which performs best for retrieval when the input texts are prefixed with ``"Represent this sentence for searching relevant passages: "``.
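For instance, a small sketch of what this looks like in practice (the query text here is illustrative). Prefixing the text yourself and passing the prefix via the ``prompt`` argument of ``SentenceTransformer.encode`` result in the same text being embedded::

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large")

    # Both calls embed the text "query: How many people live in London?"
    manual = model.encode("query: How many people live in London?")
    with_prompt = model.encode("How many people live in London?", prompt="query: ")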
"retrieval":"Retrieve semantically similar text: ",
"clustering":"Identify the topic or theme based on the text: ",
},
default_prompt_name="retrieval",
)
#or
model.default_prompt_name="retrieval"
Both of these parameters can also be specified in the ``config_sentence_transformers.json`` file of a saved model. That way, you won't have to specify these options manually when loading. When you save a Sentence Transformer model, these options will be automatically saved as well.
During inference, prompts can be applied in a few different ways. All of these scenarios result in identical texts being embedded:
1. Explicitly using the ``prompt`` option in ``SentenceTransformer.encode``::

       embeddings = model.encode("How to bake a strawberry cake", prompt="Retrieve semantically similar text: ")

2. Explicitly using the ``prompt_name`` option in ``SentenceTransformer.encode`` by relying on the prompts loaded from a) initialization or b) the model config::

       embeddings = model.encode("How to bake a strawberry cake", prompt_name="retrieval")

3. If neither ``prompt`` nor ``prompt_name`` is specified in ``SentenceTransformer.encode``, then the prompt specified by ``default_prompt_name`` will be applied. If it is ``None``, then no prompt will be applied::

       embeddings = model.encode("How to bake a strawberry cake")
Input Sequence Length
---------------------
For transformer models like BERT, RoBERTa, DistilBERT etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of a certain length. A common value for BERT-based models is 512 tokens, which corresponds to about 300-400 words (for English).
Each model has a maximum sequence length under ``model.max_seq_length``, which is the maximal number of tokens that can be processed. Longer texts will be truncated to the first ``model.max_seq_length`` tokens::

    from sentence_transformers import SentenceTransformer

    # Example model; every Sentence Transformer model exposes max_seq_length
    model = SentenceTransformer("all-MiniLM-L6-v2")
    print("Max Sequence Length:", model.max_seq_length)
You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be as good.
Multi-Process / Multi-GPU Encoding
----------------------------------
You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). For an example, see: `computing_embeddings_multi_gpu.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/computing-embeddings/computing_embeddings_multi_gpu.py>`_.
The relevant method is :meth:`~sentence_transformers.SentenceTransformer.start_multi_process_pool`, which starts multiple processes that are used for encoding.
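For illustration, a minimal sketch of the workflow (the model name and sentences are placeholders)::

    from sentence_transformers import SentenceTransformer


    def main():
        model = SentenceTransformer("all-MiniLM-L6-v2")
        sentences = ["This is sentence number {}".format(i) for i in range(10000)]

        # Start one process per available GPU (or several CPU processes)
        pool = model.start_multi_process_pool()

        # The sentences are chunked and distributed across the processes in the pool
        embeddings = model.encode_multi_process(sentences, pool)
        print("Embeddings computed. Shape:", embeddings.shape)

        model.stop_multi_process_pool(pool)


    if __name__ == "__main__":
        main()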
Quantizing an embedding with a dimensionality of 1024 to binary would result in 1024 bits, which can be packed into 128 bytes (8 bits per byte).
As a result, in practice, quantizing a `float32` embedding with a dimensionality of 1024 yields an `int8` or `uint8` embedding with a dimensionality of 128. See below for two approaches to producing quantized embeddings with Sentence Transformers:
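For example, a rough sketch of what these two approaches can look like (the model name and sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
sentences = ["It's nice weather outside today.", "He drove to work."]

# Approach 1: quantize directly while encoding
binary_embeddings = model.encode(sentences, precision="binary")

# Approach 2: encode to float32 first, then quantize the embeddings afterwards
float_embeddings = model.encode(sentences)
binary_embeddings = quantize_embeddings(float_embeddings, precision="binary")
```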