# Pooling Models

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
    We currently support pooling models primarily as a matter of convenience.
    As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
    pooling models because those features only apply to the generation (decode) stage, so performance may not improve as much.

If the model doesn't implement this interface, you can set `--task`, which tells vLLM
to convert the model into a pooling model.

| `--task`   | Model type           | Supported pooling tasks       |
|------------|----------------------|-------------------------------|
| `embed`    | Embedding model      | `encode`, `embed`             |
| `classify` | Classification model | `encode`, `classify`, `score` |
| `reward`   | Reward model         | `encode`                      |

## Pooling Tasks

In vLLM, we define the following pooling tasks and corresponding APIs:

| Task       | APIs               |
|------------|--------------------|
| `encode`   | `encode`           |
| `embed`    | `embed`, `score`\* |
| `classify` | `classify`         |
| `score`    | `score`            |

\*The `score` API falls back to the `embed` task if the model does not support the `score` task.
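
The fallback rule in the footnote can be sketched as a small lookup. This is only an illustration of the table above: `API_TO_TASKS` and `resolve_task` are hypothetical names, not part of vLLM, and the real dispatch logic may differ.

```python
# Hypothetical helper illustrating the task/API table; not vLLM code.
# Each API lists the tasks it can run on, in order of preference.
API_TO_TASKS = {
    "encode": ["encode"],
    "embed": ["embed"],
    "classify": ["classify"],
    "score": ["score", "embed"],  # `score` falls back to the `embed` task
}


def resolve_task(api: str, supported_tasks: set[str]) -> str:
    """Return the first task in the API's preference list that the model supports."""
    for task in API_TO_TASKS[api]:
        if task in supported_tasks:
            return task
    raise ValueError(
        f"API {api!r} is not available for tasks {sorted(supported_tasks)}"
    )


# A cross-encoder exposes `score` directly; an embedding model falls back to `embed`.
print(resolve_task("score", {"encode", "classify", "score"}))  # score
print(resolve_task("score", {"encode", "embed"}))              # embed
```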

Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].

By default, the pooler assigned to each task has the following attributes:

| Task       | Pooling Type   | Normalization | Softmax |
|------------|----------------|---------------|---------|
| `encode`   | `ALL`          | ❌            | ❌      |
| `embed`    | `LAST`         | ✅︎            | ❌      |
| `classify` | `LAST`         | ❌            | ✅︎      |

These defaults may be overridden by the model's implementation in vLLM.
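
To make the table concrete, the sketch below applies the three default behaviours to a toy list of per-token hidden states. It is plain Python for illustration, not vLLM's `Pooler` implementation: the function names are hypothetical, "normalization" is assumed to mean L2 normalization, and the last token's vector stands in for the classifier's logits in the `classify` case.

```python
import math

# Toy per-token hidden states: 3 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]


def pool_all(states):
    """`ALL` pooling: keep one vector per token."""
    return [list(vec) for vec in states]


def pool_last(states):
    """`LAST` pooling: keep only the final token's vector."""
    return list(states[-1])


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumed meaning of 'Normalization')."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(vec):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


encode_out = pool_all(hidden_states)                # `encode`: ALL, no post-processing
embed_out = l2_normalize(pool_last(hidden_states))  # `embed`: LAST + normalization
classify_out = softmax(pool_last(hidden_states))    # `classify`: LAST + softmax

print(len(encode_out))  # 3
print(embed_out)        # [0.6, 0.8]
```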

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the defaults based on the model's Sentence Transformers configuration file (`modules.json`),
which takes priority over the model's defaults.

You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
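
The precedence described above (model defaults, then the Sentence Transformers configuration, then `--override-pooler-config`) can be pictured as a layered merge. The following is a minimal sketch under that reading: `resolve_pooler_config` and the dictionary keys are illustrative assumptions, not vLLM's actual resolution code.

```python
# Hypothetical illustration of the precedence order; not vLLM code.
def resolve_pooler_config(model_defaults, st_config=None, user_override=None):
    """Merge pooler settings; later sources win over earlier ones."""
    resolved = dict(model_defaults)
    for source in (st_config, user_override):
        if source:
            # Only keys that a source actually sets override earlier values.
            resolved.update({k: v for k, v in source.items() if v is not None})
    return resolved


# Illustrative values only.
model_defaults = {"pooling_type": "LAST", "normalize": True}
st_config = {"pooling_type": "MEAN"}   # e.g. derived from `modules.json`
user_override = {"normalize": False}   # e.g. from `--override-pooler-config`

print(resolve_pooler_config(model_defaults, st_config, user_override))
# {'pooling_type': 'MEAN', 'normalize': False}
```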

!!! note
    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
    that is not based on [PoolerConfig][vllm.config.PoolerConfig].

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration][configuration] for a list of options when initializing the model.

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")
```

### `LLM.embed`

The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>

### `LLM.classify`

The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
It is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>

### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
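
For embedding models, the score is the cosine similarity between the two embeddings. As a reminder of what that computes, here is a minimal standalone sketch in plain Python; the `cosine_similarity` helper and the toy vectors are illustrative and are not taken from vLLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for real model outputs.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because the result depends only on direction, scaling an embedding does not change its score.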

## Online Serving

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.

## Matryoshka Embeddings

[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost.
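
The trade-off comes from keeping only a prefix of the embedding. The sketch below shows the common MRL convention of truncating to the first `dimensions` components and re-normalizing. It is an illustration under that assumption, using a hypothetical `truncate_embedding` helper; this page does not specify exactly how vLLM applies the truncation.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError(f"dimensions must be in 1..{len(embedding)}")
    prefix = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]


# Toy 4-dimensional embedding reduced to 2 dimensions.
full = [3.0, 4.0, 12.0, 84.0]
small = truncate_embedding(full, 2)
print(small)  # [0.6, 0.8]
```

A shorter vector is cheaper to store and compare, at the cost of some retrieval quality.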

!!! warning
    Not all embedding models are trained using Matryoshka Representation Learning. To avoid misuse of the `dimensions` parameter, vLLM returns an error for requests that attempt to change the output dimension of models that do not support Matryoshka Embeddings.

    For example, setting the `dimensions` parameter while using the `BAAI/bge-m3` model results in the following error:

    ```json
    {"object":"error","message":"Model \"BAAI/bge-m3\" does not support matryoshka representation, changing output dimensions will lead to poor results.","type":"BadRequestError","param":null,"code":400}
    ```

### Manually enable Matryoshka Embeddings

There is currently no official interface for specifying support for Matryoshka Embeddings. In vLLM, if `is_matryoshka` is `True` in `config.json`, the output can be changed to arbitrary dimensions. Use `matryoshka_dimensions` to restrict the allowed output dimensions.
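
Read literally, these two settings imply a simple acceptance rule for a requested output size. The sketch below encodes that reading; `is_dimension_allowed` is a hypothetical helper, and giving `matryoshka_dimensions` precedence when both keys are present is an assumption, not confirmed vLLM behaviour.

```python
# Hypothetical check mirroring the rule described above; not vLLM code.
def is_dimension_allowed(hf_config: dict, dimensions: int) -> bool:
    """Decide whether a requested output dimension would be accepted."""
    allowed = hf_config.get("matryoshka_dimensions")
    if allowed is not None:
        # An explicit list restricts the choice to those sizes.
        return dimensions in allowed
    if hf_config.get("is_matryoshka", False):
        # Without a list, any positive size is treated as acceptable here.
        return dimensions > 0
    # Models without either setting do not support changing the dimension.
    return False


print(is_dimension_allowed({"is_matryoshka": True}, 32))             # True
print(is_dimension_allowed({"matryoshka_dimensions": [256]}, 32))    # False
print(is_dimension_allowed({"matryoshka_dimensions": [256]}, 256))   # True
print(is_dimension_allowed({}, 32))                                  # False
```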

For models that support Matryoshka Embeddings but are not recognized by vLLM, manually override the config using `hf_overrides={"is_matryoshka": True}` or `hf_overrides={"matryoshka_dimensions": [<allowed output dimensions>]}` (offline), or `--hf_overrides '{"is_matryoshka": true}'` or `--hf_overrides '{"matryoshka_dimensions": [<allowed output dimensions>]}'` (online).

Here is an example to serve a model with Matryoshka Embeddings enabled.

```text
vllm serve Snowflake/snowflake-arctic-embed-m-v1.5 --hf_overrides '{"matryoshka_dimensions":[256]}'
```

### Offline Inference

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in [PoolingParams][vllm.PoolingParams].

```python
from vllm import LLM, PoolingParams

model = LLM(model="jinaai/jina-embeddings-v3",
            task="embed",
            trust_remote_code=True)
outputs = model.embed(["Follow the white rabbit."],
                      pooling_params=PoolingParams(dimensions=32))
print(outputs[0].outputs)
```

A code example can be found here: <gh-file:examples/offline_inference/embed_matryoshka_fy.py>

### Online Inference

Use the following command to start the vLLM server.

```text
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings by using the `dimensions` parameter in the request.

```text
curl http://127.0.0.1:8000/v1/embeddings \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Follow the white rabbit.",
    "model": "jinaai/jina-embeddings-v3",
    "encoding_format": "float",
    "dimensions": 32
  }'
```

Expected output:

```json
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```

An OpenAI client example can be found here: <gh-file:examples/online_serving/openai_embedding_matryoshka_fy.py>
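
For a dependency-free alternative to `curl`, the same request can be built with the Python standard library. This is a minimal sketch: the helper names are hypothetical, the server address is assumed to match the `curl` command above, and `sample` is an abbreviated stand-in for the real response, so the block runs without a server.

```python
import json
import urllib.request


def build_embedding_request(text, model, dimensions=None):
    """Build the JSON body for the `/v1/embeddings` endpoint."""
    payload = {"input": text, "model": model, "encoding_format": "float"}
    if dimensions is not None:
        payload["dimensions"] = dimensions
    return payload


def extract_embeddings(response):
    """Pull the embedding vectors out of a parsed response body."""
    return [item["embedding"] for item in response["data"]]


def post_embeddings(payload, base_url="http://127.0.0.1:8000"):
    """Send the request to a running server (not called in this sketch)."""
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


payload = build_embedding_request(
    "Follow the white rabbit.", "jinaai/jina-embeddings-v3", dimensions=32
)

# Abbreviated response in the shape shown under "Expected output".
sample = {
    "object": "list",
    "data": [{"index": 0, "object": "embedding",
              "embedding": [-0.3828125, -0.1357421875]}],
}

print(payload["dimensions"])               # 32
print(len(extract_embeddings(sample)[0]))  # 2
```

With a server running, `extract_embeddings(post_embeddings(payload))` would return the vectors for the request.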