(pooling-models)=

# Pooling Models

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them.

:::{note}
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features do not apply to
pooling models because they only affect the generation (decode) stage, so pooling models may benefit less from vLLM's performance optimizations.
:::

For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:

:::{list-table}
:widths: 50 25 25 25
:header-rows: 1

- * Task
  * Pooling Type
  * Normalization
  * Softmax
- * Embedding (`embed`)
  * `LAST`
  * ✅︎
  *
- * Classification (`classify`)
  * `LAST`
  *
  * ✅︎
- * Sentence Pair Scoring (`score`)
  * \*
  * \*
  * \*
- * Reward Modeling (`reward`)
  * `ALL`
  *
  *
:::

\*The default pooler is always defined by the model.

:::{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
:::
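
To make the table concrete, here is a minimal pure-Python sketch of what each default does to a sequence of per-token hidden states. The helper names (`pool_last`, `pool_all`, `l2_normalize`, `softmax`) are illustrative and are not vLLM APIs; the real `Pooler` works on batched tensors, and a real classification model produces class logits through its own head, whereas this toy example simply treats the pooled vector as logits.

```python
import math

# Toy hidden states: one vector per input token (3 tokens, 2 dimensions).
hidden_states = [[1.0, 2.0], [0.5, -1.0], [3.0, 4.0]]


def pool_last(states):
    """LAST pooling: keep only the final token's hidden state."""
    return states[-1]


def pool_all(states):
    """ALL pooling: keep the hidden state of every token."""
    return states


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (the Normalization column)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(vec):
    """Turn raw scores into probabilities (the Softmax column)."""
    shift = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


embed_out = l2_normalize(pool_last(hidden_states))  # embed: LAST + normalization
classify_out = softmax(pool_last(hidden_states))    # classify: LAST + softmax
reward_out = pool_all(hidden_states)                # reward: ALL, no post-processing
```

The three results differ in shape as well as scale: `embed_out` and `classify_out` are single vectors, while `reward_out` keeps one vector per token.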

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on the model's Sentence Transformers configuration file (`modules.json`).

:::{tip}
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
:::
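
As a rough illustration, the option takes a JSON object, so the command line can be assembled programmatically. The field names `pooling_type` and `normalize` and the value `MEAN` are assumptions about the pooler configuration in your vLLM version and should be checked against its documentation before use.

```python
import json
import shlex

# Assumed field names; verify them against the pooler config of your vLLM version.
pooler_override = {"pooling_type": "MEAN", "normalize": False}

# shlex.join quotes the JSON so the shell passes it as a single argument.
command = shlex.join([
    "vllm", "serve", "intfloat/e5-mistral-7b-instruct",
    "--task", "embed",
    "--override-pooler-config", json.dumps(pooler_override),
])
print(command)
```

Building the argument with `json.dumps` avoids hand-written quoting mistakes such as Python-style `False` in place of JSON `false`.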

## Offline Inference

The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model.

### `LLM.encode`

The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

### `LLM.embed`

The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>

### `LLM.classify`

The {class}`~vllm.LLM.classify` method outputs a probability vector for each prompt.
It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>

### `LLM.score`

The {class}`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

:::{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
:::

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
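
For embedding models, the score is a cosine similarity between the two embedding vectors. The sketch below shows that computation on toy vectors in plain Python; `cosine_similarity` is an illustrative helper, not a vLLM function, and the vectors stand in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the output of an embedding model.
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

score_close = cosine_similarity(query, doc_close)
score_far = cosine_similarity(query, doc_far)
```

Because cosine similarity ignores vector length, `doc_close` scores close to 1 even though its magnitude differs from the query's, while the orthogonal `doc_far` scores 0.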

## Online Serving

Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.

## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost by choosing a smaller output dimension.
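
The idea can be sketched in a few lines of plain Python, assuming the usual MRL recipe of keeping the leading components of an embedding and re-normalizing the result. `truncate_embedding` is a hypothetical helper for illustration; whether vLLM performs exactly these steps internally is an assumption.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` components and re-normalize to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]          # toy 4-dimensional unit vector
small = truncate_embedding(full, 2)  # half the storage and comparison cost
```

A shorter vector is cheaper to store and compare, at the price of some accuracy; models trained with MRL are designed so that this loss stays small.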

:::{warning}
Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model results in the following error.

```json
{"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
```

:::

### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, the output can be changed to arbitrary dimensions. Use `matryoshka_dimensions` to restrict the allowed output dimensions.

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]}` (offline), or `--hf_overrides '{"is_matryoshka": true}'` or `--hf_overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}'` (online).
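
The rule described above can be pictured with the following sketch. `check_dimensions` is a hypothetical helper that mirrors the prose, not vLLM's actual validation code, and it assumes that setting `matryoshka_dimensions` alone is enough to enable Matryoshka support.

```python
from typing import Optional


def check_dimensions(hf_config: dict, dimensions: Optional[int]) -> None:
    """Raise ValueError if `dimensions` is not allowed for this config."""
    if dimensions is None:
        return  # the caller keeps the model's native output size
    allowed = hf_config.get("matryoshka_dimensions")
    # Assumption: listing allowed sizes implies Matryoshka support.
    is_matryoshka = hf_config.get("is_matryoshka", False) or allowed is not None
    if not is_matryoshka:
        raise ValueError("model does not support matryoshka representation")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    if allowed is not None and dimensions not in allowed:
        raise ValueError(f"dimensions must be one of {allowed}")


check_dimensions({"is_matryoshka": True}, 32)            # any positive size passes
check_dimensions({"matryoshka_dimensions": [256]}, 256)  # a listed size passes
```

With an empty config, or with a size missing from `matryoshka_dimensions`, the helper raises `ValueError`, which corresponds to the `BadRequestError` response shown earlier.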

Here is an example of serving a model with Matryoshka Embeddings enabled.

```text
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf_overrides '{"matryoshka_dimensions":[256]}'
```

### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in {class}`~vllm.PoolingParams`.

```python
from vllm import LLM, PoolingParams

model = LLM(model="jinaai/jina-embeddings-v3", 
            task="embed", 
            trust_remote_code=True)
outputs = model.embed(["Follow the white rabbit."], 
                      pooling_params=PoolingParams(dimensions=32))
print(outputs[0].outputs)
```

A code example can be found here: <gh-file:examples/offline_inference/embed_matryoshka_fy.py>

### Online Inference

Use the following command to start the vLLM server.

```text
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by setting the `dimensions` parameter in the request body.

```text
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```
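
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above; `build_embedding_request` and `fetch_embedding` are illustrative helper names, and the server address is assumed to be the default from the previous command.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, text, dimensions):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    payload = {
        "input": text,
        "model": model,
        "encoding_format": "float",
        "dimensions": dimensions,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(request):
    """Send the request and return the first embedding vector."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]


request = build_embedding_request(
    "http://127.0.0.1:8000",
    "jinaai/jina-embeddings-v3",
    "Follow the white rabbit.",
    dimensions=32,
)
# With the server running, fetch_embedding(request) should return 32 floats.
```

Separating request construction from sending makes it easy to inspect the payload without a running server.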

An OpenAI client example can be found here: <gh-file:examples/online_serving/openai_embedding_matryoshka_fy.py>