# Pooling Models

vLLM also supports pooling models, such as embedding, classification and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
    We currently support pooling models primarily as a matter of convenience. They are not guaranteed to perform better than using HF Transformers or Sentence Transformers directly.

    We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases as vLLM can automatically
    detect the model runner to use via `--runner auto`.

### Model Conversion

vLLM can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture                                    | `--convert` | Supported pooling tasks       |
|-------------------------------------------------|-------------|-------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `encode`, `embed`             |
| `*For*Classification`, `*ClassificationModel`   | `classify`  | `encode`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel`            | `reward`    | `encode`                      |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.
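
The wildcard patterns in the table can be read as glob-style matches against the model's architecture name. The sketch below illustrates that reading with the standard library only. It is not vLLM's actual resolution code: `CONVERT_PATTERNS` and `guess_convert_type` are hypothetical names, and the matching order (specific suffixes before the catch-all `*Model`) is an assumption made here so that names such as `Qwen2ForRewardModel` are not swallowed by `*Model`.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the table above. The ordering is an assumption of this
# sketch: specific suffixes are tried before the catch-all `*Model`.
CONVERT_PATTERNS = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("reward", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]


def guess_convert_type(architecture: str) -> Optional[str]:
    """Return the `--convert` value whose pattern matches, or None."""
    for convert_type, patterns in CONVERT_PATTERNS:
        if any(fnmatchcase(architecture, pattern) for pattern in patterns):
            return convert_type
    return None


for name in ["BertModel", "Qwen2ForSequenceClassification", "Qwen2ForRewardModel"]:
    print(name, "->", guess_convert_type(name))
```

If the guess made for a given architecture is not the one you want, passing `--convert <type>` explicitly sidesteps the pattern matching altogether.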

### Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

| Task       | APIs                                 |
|------------|--------------------------------------|
| `encode`   | `LLM.reward(...)`                    |
| `embed`    | `LLM.embed(...)`, `LLM.score(...)`\* |
| `classify` | `LLM.classify(...)`                  |
| `score`    | `LLM.score(...)`                     |

\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.

### Pooler Configuration

#### Predefined models

If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task       | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `reward`   | `ALL`        | ❌            | ❌     |
| `embed`    | `LAST`       | ✅︎            | ❌      |
| `classify` | `LAST`       | ❌            | ✅︎      |
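
To make the three columns concrete, here is a minimal pure-Python sketch of what each setting does to a sequence of per-token vectors. It is an illustration only: the real pooler operates on tensors inside vLLM, and the helper names `pool`, `l2_normalize`, and `softmax` are hypothetical. For the `classify` case, the input vectors stand in for per-class logits produced by the model's classification head.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(vec):
    """Numerically stable softmax over one vector."""
    shift = max(vec)
    exps = [math.exp(x - shift) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


def pool(hidden_states, pooling_type, normalize=False, use_softmax=False):
    """hidden_states holds one vector per token of a single prompt."""
    if pooling_type == "LAST":
        rows = [hidden_states[-1]]  # keep only the final token
    elif pooling_type == "ALL":
        rows = [list(row) for row in hidden_states]  # keep every token
    else:
        raise ValueError(f"unsupported pooling type: {pooling_type}")
    if normalize:
        rows = [l2_normalize(row) for row in rows]
    if use_softmax:
        rows = [softmax(row) for row in rows]
    return rows if pooling_type == "ALL" else rows[0]


tokens = [[1.0, 0.0], [3.0, 4.0]]
print(pool(tokens, "ALL"))                     # `reward` row of the table
print(pool(tokens, "LAST", normalize=True))    # `embed` row of the table
print(pool(tokens, "LAST", use_softmax=True))  # `classify` row of the table
```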

When loading a [Sentence Transformers](https://huggingface.co/sentence-transformers) model,
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--pooler-config` option,
which takes priority over both the model's defaults and the Sentence Transformers defaults.

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration](../api/README.md#configuration) for a list of options when initializing the model.

### `LLM.embed`

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>

### `LLM.classify`

The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.
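
For the embedding-model case, the score is a cosine similarity between two embedding vectors. The following standard-library sketch shows that computation on plain lists. The function name `cosine_similarity` is illustrative, and returning `0.0` for a zero vector is a convention chosen for this sketch, not something taken from vLLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

When both embeddings are already L2-normalized, the cosine similarity reduces to a plain dot product.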

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>

### `LLM.reward`

The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.
It returns the extracted hidden states directly.

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
(output,) = llm.reward("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py>

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly.

!!! note
    Please use one of the more specific methods or set the task directly when using `LLM.encode`:

    - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
    - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
    - For rewards, use `LLM.reward(...)` or `pooling_task="reward"`.
    - For similarity scores, use `LLM.score(...)`.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
(output,) = llm.encode("Hello, my name is", pooling_task="embed")

data = output.outputs.data
print(f"Data: {data!r}")
```

## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.

## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost.
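
Conceptually, a Matryoshka-trained model packs the most useful information into the leading dimensions, so a shorter embedding is obtained by keeping a prefix of the full vector. The sketch below shows that idea on a plain list. The name `truncate_embedding` is hypothetical, and the re-normalization step is an assumption of this sketch (it keeps cosine similarity meaningful), not a statement about vLLM's internals.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = list(embedding[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0]
print(truncate_embedding(full, 2))  # a shorter vector that still has unit length
```

A shorter vector costs less to store and compare, at the price of some retrieval quality.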

!!! warning
    Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

    For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model results in the following error.

    ```json
    {"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
    ```
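
Client code may want to surface this error instead of silently treating the body as an embedding response. The sketch below checks the `object` field seen in the sample error above. `raise_for_error` is a hypothetical helper, and the assumption that every error body carries `"object": "error"` is based only on that one sample.

```python
import json


def raise_for_error(body: str) -> dict:
    """Parse a response body and raise if it has the error shape shown above."""
    payload = json.loads(body)
    if payload.get("object") == "error":
        raise RuntimeError(
            f"{payload.get('type')} ({payload.get('code')}): {payload.get('message')}"
        )
    return payload


# Error body with the same fields as the sample above, built with json.dumps
# to avoid hand-escaping the quotes in the message.
error_body = json.dumps({
    "object": "error",
    "message": 'Model "BAAI/bge-m3" does not support matryoshka representation, '
               "changing output dimensions will lead to poor results.",
    "type": "BadRequestError",
    "param": None,
    "code": 400,
})

try:
    raise_for_error(error_body)
except RuntimeError as exc:
    print(f"Request rejected: {exc}")
```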

### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, the output can be changed to arbitrary dimensions. Use `matryoshka_dimensions` to control the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized as such by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]}` (offline), or `--hf-overrides '{"is_matryoshka": true}'` or `--hf-overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}'` (online).
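
One way to read the two settings is sketched below in plain Python: an explicit `matryoshka_dimensions` list restricts requests to the listed sizes, while `is_matryoshka` alone permits any positive size. This is an interpretation of the paragraph above, not vLLM's validation code. The helper names are hypothetical, and the sketch does not model the upper bound imposed by the model's native embedding size.

```python
import json
from typing import Optional, Sequence


def is_dimension_allowed(
    dimensions: int,
    is_matryoshka: bool = False,
    matryoshka_dimensions: Optional[Sequence[int]] = None,
) -> bool:
    """Assumed rule: an explicit list wins; otherwise the flag allows any size > 0."""
    if matryoshka_dimensions is not None:
        return dimensions in matryoshka_dimensions
    return is_matryoshka and dimensions > 0


def hf_overrides_arg(overrides: dict) -> str:
    """Serialize overrides into the JSON string passed to `--hf-overrides`."""
    return json.dumps(overrides, separators=(",", ":"))


print(is_dimension_allowed(32, matryoshka_dimensions=[256]))   # False
print(is_dimension_allowed(256, matryoshka_dimensions=[256]))  # True
print(hf_overrides_arg({"is_matryoshka": True}))               # {"is_matryoshka":true}
```

Building the JSON with `json.dumps` avoids the common mistake of writing Python's `True` instead of JSON's `true` on the command line.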

Here is an example of serving a model with Matryoshka Embeddings enabled.

```bash
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf-overrides '{"matryoshka_dimensions":[256]}'
```

### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```

A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py>

### Online Inference

Use the following command to start the vLLM server.

```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in the request body.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py>
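
For readers who prefer not to depend on a client library, the `curl` request above can be approximated with Python's standard library. This is a minimal sketch that mirrors the URL, headers, and JSON fields of the `curl` command. The helper names `build_embedding_request` and `fetch_embedding` are hypothetical, and `fetch_embedding` assumes a server is listening at the given address and returns the response shape shown in the expected output.

```python
import json
import urllib.request


def build_embedding_request(text, model, dimensions, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request equivalent to the curl command."""
    payload = {
        "input": text,
        "model": model,
        "encoding_format": "float",
        "dimensions": dimensions,
    }
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(request):
    """Send the request and return the first embedding (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]


request = build_embedding_request(
    "Follow the white rabbit.", "jinaai/jina-embeddings-v3", dimensions=32
)
print(request.get_method(), request.full_url)
```

With the server from this section running, `len(fetch_embedding(request))` should equal the requested `dimensions` value.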