(pooling-models)=

# Pooling Models

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them.

```{note}
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features do not apply to
pooling models because they only work on the generation or decode stage, so pooling models may see smaller performance gains than generative models.
```

For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:

```{list-table}
:widths: 50 25 25 25
:header-rows: 1

* - Task
  - Pooling Type
  - Normalization
  - Softmax
* - Embedding (`embed`)
  - `LAST`
  - ✅︎
  - ✗
* - Classification (`classify`)
  - `LAST`
  - ✗
  - ✅︎
* - Sentence Pair Scoring (`score`)
  - \*
  - \*
  - \*
* - Reward Modeling (`reward`)
  - `ALL`
  - ✗
  - ✗
```

\*The default pooler is always defined by the model.

```{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
```
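
To make the table concrete, the following is a minimal pure-Python sketch of what each default does to a sequence of per-token hidden states. It is not vLLM's implementation: the helper names (`pool_last`, `pool_all`, `l2_normalize`, `softmax`) and the toy values are assumptions for illustration, and any model-specific head is ignored.

```python
import math


def pool_last(hidden_states):
    """LAST pooling: keep only the final token's hidden state."""
    return hidden_states[-1]


def pool_all(hidden_states):
    """ALL pooling: keep the hidden state of every token."""
    return list(hidden_states)


def l2_normalize(vector):
    """Scale a vector to unit length (the Normalization column)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def softmax(vector):
    """Turn raw scores into probabilities (the Softmax column)."""
    peak = max(vector)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


# Toy hidden states: 3 tokens, hidden size 2 (made-up values).
hidden_states = [[1.0, 2.0], [0.5, -1.0], [3.0, 4.0]]

embed_output = l2_normalize(pool_last(hidden_states))  # `embed` row
classify_output = softmax(pool_last(hidden_states))    # `classify` row
reward_output = pool_all(hidden_states)                # `reward` row
```

In this sketch, the `embed` result is a unit-length vector, the `classify` result sums to 1, and the `reward` result keeps one vector per token.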

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).

```{tip}
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
```
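
As a rough illustration of the tip above, the sketch below assembles a `vllm serve` command line that passes a JSON value to `--override-pooler-config`. The JSON keys (`pooling_type`, `normalize`) and the helper `build_serve_command` are assumptions made for this example; check the engine arguments reference for the fields your vLLM version accepts.

```python
import json
import shlex


def build_serve_command(model, task, pooler_overrides):
    """Return a shell-safe `vllm serve` command with a pooler override.

    `pooler_overrides` is serialized to JSON; its keys are assumed here,
    not taken from this page.
    """
    override_json = json.dumps(pooler_overrides)
    args = [
        "vllm", "serve", model,
        "--task", task,
        "--override-pooler-config", override_json,
    ]
    # Quote each argument so the JSON survives the shell intact.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_serve_command(
    "intfloat/e5-mistral-7b-instruct",
    "embed",
    {"pooling_type": "MEAN", "normalize": False},  # assumed field names
)
print(command)
```

Building the value with `json.dumps` and quoting it with `shlex.quote` avoids hand-escaping braces and quotes on the command line.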

## Offline Inference

The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model.

### `LLM.encode`

The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.

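```python
from vllm import LLM

# Load a reward model and request the `reward` task.
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")

# `data` holds the extracted hidden states.
data = output.outputs.data
print(f"Data: {data!r}")
```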

### `LLM.embed`

The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

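```python
from vllm import LLM

# Load an embedding model and request the `embed` task.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")

# `embedding` is the embedding vector for the prompt.
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```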

A code example can be found here: <gh-file:examples/offline_inference/offline_inference_embedding.py>
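
Embedding vectors are typically compared with cosine similarity. The sketch below shows one way to do that in plain Python; `cosine_similarity` is a hypothetical helper, and the two short vectors stand in for values such as `output.outputs.embedding` from two separate prompts.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional embeddings; real ones are much longer.
query_embedding = [0.6, 0.8]
doc_embedding = [0.8, 0.6]
similarity = cosine_similarity(query_embedding, doc_embedding)
```

When both vectors are already normalized to unit length, the dot product alone gives the same result.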

### `LLM.classify`

The {class}`~vllm.LLM.classify` method outputs a probability vector for each prompt.
It is primarily designed for classification models.

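```python
from vllm import LLM

# Load a classification model and request the `classify` task.
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")

# `probs` is the probability vector over classes.
probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```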

A code example can be found here: <gh-file:examples/offline_inference/offline_inference_classification.py>
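
A probability vector is usually reduced to a single predicted class by taking the index with the highest probability. The following sketch assumes you supply the label names yourself; `predict_label` and the labels shown are hypothetical and not part of vLLM.

```python
def predict_label(probs, labels):
    """Return the most probable label and its probability."""
    if len(probs) != len(labels):
        raise ValueError("need exactly one label per probability")
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return labels[best_index], probs[best_index]


# Made-up probabilities standing in for `output.outputs.probs`.
example_probs = [0.1, 0.9]
example_labels = ["negative", "positive"]  # hypothetical label names
label, confidence = predict_label(example_probs, example_labels)
```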

### `LLM.score`

The {class}`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
These types of models serve as rerankers between candidate query-document pairs in RAG systems.

```{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
```

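```python
from vllm import LLM

# Load a cross-encoder reranker and request the `score` task.
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

# `score` is the similarity score for the sentence pair.
score = output.outputs.score
print(f"Score: {score}")
```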

A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
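
In a reranking setup, each candidate document is scored against the query and the candidates are then sorted by score. The sketch below covers only the sorting step, using made-up scores in place of real `output.outputs.score` values; `rerank` is a hypothetical helper.

```python
def rerank(documents, scores, top_k=None):
    """Pair documents with scores and sort them from best to worst."""
    if len(documents) != len(scores):
        raise ValueError("need exactly one score per document")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
]
scores = [0.02, 0.98]  # made-up values, one per query-document pair
best = rerank(documents, scores, top_k=1)
```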

## Online Serving

Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.
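
As an illustration of the Embeddings API, the sketch below builds (but does not send) an OpenAI-style embeddings request using only the standard library. The base URL `http://localhost:8000`, the `/v1/embeddings` path, and the helper `build_embeddings_request` are assumptions for this example; adjust them to match your deployment.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build a POST request for an OpenAI-compatible embeddings endpoint."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embeddings_request(
    "http://localhost:8000",  # assumed server address
    "intfloat/e5-mistral-7b-instruct",
    ["Hello, my name is"],
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before a server is available.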