.. _pooling_models:

Pooling Models
==============

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a :class:`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them.

.. note::

    We currently support pooling models primarily as a matter of convenience.
    As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features target the
    generation (decode) stage and therefore do not apply to pooling models, so pooling models may see smaller performance gains.

Offline Inference
-----------------

The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.

For pooling models, we support the following :code:`task` options:

- Embedding (:code:`"embed"` / :code:`"embedding"`)
- Classification (:code:`"classify"`)
- Sentence Pair Scoring (:code:`"score"`)
- Reward Modeling (:code:`"reward"`)

The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used:

- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.
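
To make the defaults above concrete, here is a minimal, framework-free sketch of the post-processing each task applies.
It is not vLLM's implementation: the helper names (``last_token_pool``, ``l2_normalize``, ``softmax``) and the toy hidden states are assumptions made up for illustration.

```python
import math


def last_token_pool(hidden_states):
    """Return the hidden state of the last token."""
    return hidden_states[-1]


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def softmax(logits):
    """Convert raw values into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy hidden states: 3 tokens, hidden size 2 (hypothetical values).
hidden_states = [[0.1, 0.2], [0.5, -0.5], [3.0, 4.0]]

embedding = l2_normalize(last_token_pool(hidden_states))  # embedding
probs = softmax(last_token_pool(hidden_states))  # classification / scoring
reward_output = hidden_states  # reward modeling: all hidden states, unchanged
```

In this sketch only the reward case keeps one vector per token; the other tasks reduce the prompt to a single vector.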

When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (:code:`modules.json`).

You can customize the model's pooling method via the :code:`override_pooler_config` option,
which takes priority over both the model's and Sentence Transformers' defaults.

``LLM.encode``
^^^^^^^^^^^^^^

The :meth:`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.

.. code-block:: python

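    from vllm import LLM

    # task="reward" selects the reward-modeling pooler.
    llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
    # One output is returned per prompt; unpack the single result.
    (output,) = llm.encode("Hello, my name is")

    data = output.outputs.data
    print(f"Data: {data!r}")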

``LLM.embed``
^^^^^^^^^^^^^

The :meth:`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
    # One output is returned per prompt; unpack the single result.
    (output,) = llm.embed("Hello, my name is")

    embeds = output.outputs.embedding
    print(f"Embeddings: {embeds!r} (size={len(embeds)})")

A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_.
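
A common next step is to compare embedding vectors.
The sketch below uses made-up vectors in place of ``output.outputs.embedding`` so that it runs without loading a model; ``cosine_similarity`` is a hypothetical helper, not part of vLLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings standing in for real model outputs.
query = [0.6, 0.8]
doc_a = [0.8, 0.6]
doc_b = [-0.6, -0.8]

sim_a = cosine_similarity(query, doc_a)  # similar direction
sim_b = cosine_similarity(query, doc_b)  # opposite direction
```

If the vectors are already normalized to unit length, as with the default embedding pooler, the dot product alone gives the same value.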

``LLM.classify``
^^^^^^^^^^^^^^^^

The :meth:`~vllm.LLM.classify` method outputs a probability vector for each prompt.
It is primarily designed for classification models.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
    # One output is returned per prompt; unpack the single result.
    (output,) = llm.classify("Hello, my name is")

    probs = output.outputs.probs
    print(f"Class Probabilities: {probs!r} (size={len(probs)})")

A code example can be found in `examples/offline_inference_classification.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_classification.py>`_.
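
To turn a probability vector into a prediction, pick the index with the highest probability.
The following sketch uses a made-up vector in place of ``output.outputs.probs``; the label names and the ``predict_label`` helper are hypothetical, since the mapping from indices to labels depends on the model.

```python
def predict_label(probs, labels):
    """Return the (label, probability) pair with the highest probability."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index], probs[best_index]


probs = [0.12, 0.88]  # hypothetical model output
labels = ["class_0", "class_1"]  # hypothetical label names

label, confidence = predict_label(probs, labels)
```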

``LLM.score``
^^^^^^^^^^^^^

The :meth:`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__.
These types of models serve as rerankers between candidate query-document pairs in RAG systems.

.. note::

    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
    # Score a single (query, document) pair.
    (output,) = llm.score("What is the capital of France?",
                          "The capital of Brazil is Brasilia.")

    score = output.outputs.score
    print(f"Score: {score}")

A code example can be found in `examples/offline_inference_scoring.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_scoring.py>`_.
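
In a reranking setup, the scores for several query-document pairs are used to order the candidate documents.
Here is a minimal sketch of that step, assuming one score per document has already been collected; the scores are made up and ``rerank`` is a hypothetical helper, not part of vLLM.

```python
def rerank(documents, scores, top_k=None):
    """Sort documents by descending score and optionally keep the top_k."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "France is a country in Europe.",
]
scores = [0.02, 0.97, 0.10]  # hypothetical scores for one query

top = rerank(documents, scores, top_k=2)
```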

Online Inference
----------------

Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.

Embeddings API
^^^^^^^^^^^^^^

Our Embeddings API is similar to ``LLM.embed``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.

The text-only API is compatible with the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__,
so you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_.
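
For readers who prefer to see the raw request, the sketch below builds a text-only embeddings request using only the standard library, without sending it.
The server address ``http://localhost:8000`` and the ``build_embeddings_request`` helper are assumptions for illustration; the ``/v1/embeddings`` path and the ``model`` and ``input`` fields follow the OpenAI Embeddings API.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, inputs):
    """Build (but do not send) a POST request for the text-only Embeddings API."""
    payload = {"model": model, "input": inputs}
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embeddings_request(
    "http://localhost:8000",  # assumed address of a running server
    "intfloat/e5-mistral-7b-instruct",
    ["Hello, my name is"],
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     embedding = body["data"][0]["embedding"]
```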

The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
that incorporates `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__,
so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it.

Score API
^^^^^^^^^

Our Score API is similar to ``LLM.score``.
Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it.