@@ -120,7 +120,7 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/clas
### `LLM.score`
The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
It is designed for embedding models and crossencoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.
!!! note
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
@@ -311,6 +311,8 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.
<style>
th {
white-space: nowrap;
...
...
@@ -419,7 +421,9 @@ See [this page](./pooling_models.md) for more information on how to use pooling
Since some model architectures support both generative and pooling tasks,
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding
#### Embedding
These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) API.
@@ -491,7 +477,10 @@ If your model is not in the above list, we will try to automatically convert the
If your model is not in the above list, we will try to automatically convert the model using
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring
#### Cross-encoder / Reranker
Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
@@ -579,6 +591,8 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.