Unverified Commit 9744b699 authored by wang.yuqi, committed by GitHub

[Deprecate] Deprecate LLM.reward offline api, use LLM.encode instead. (#40688)


Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
parent c662b435
@@ -78,7 +78,7 @@ The scoring models is designed to compute similarity scores between two input pr
 |-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
 | `classify` (see note) | Sequence-wise | reranker score for each sequence | `cross-encoder` | linear classifier |
 | `embed` | Sequence-wise | vector representations for each sequence | `bi-encoder` | cosine similarity |
-| `token_classify` | Token-wise | probability vector of classes for each token | nan | nan |
+| `token_classify` | Token-wise | probability vector of classes for each token | N/A | N/A |
 | `token_embed` | Token-wise | vector representations for each token | `late-interaction` | late interaction(MaxSim) |
 !!! note
@@ -87,13 +87,14 @@ The scoring models is designed to compute similarity scores between two input pr
 ### Pooling Usages
 | Pooling Usages | Description |
-|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Classification Usages | Predicting which predefined category, class, or label best corresponds to a given input. |
 | Embedding Usages | Converts unstructured data (text, images, audio, etc.) into structured numerical vectors (embeddings). |
 | Token Classification Usages | Token-wise classification |
 | Token Embedding Usages | Token-wise embedding |
-| Scoring Usages | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
 | Reward Usages | Evaluates the quality of outputs generated by a language model, acting as a proxy for human preferences. |
+| Scoring Usages | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
+| Plugins Usages | Allow users to customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md). |
 We also have some special models that support multiple pooling tasks, or have specific usage scenarios, or support special inputs and outputs.
@@ -101,9 +102,9 @@ For more detailed information, please refer to the link below.
 - [Classification Usages](classify.md)
 - [Embedding Usages](embed.md)
-- [Reward Usages](reward.md)
 - [Token Classification Usages](token_classify.md)
 - [Token Embedding Usages](token_embed.md)
+- [Reward Usages](reward.md)
 - [Scoring Usages](scoring.md)
 - [Specific Model Examples](specific_models.md)
@@ -113,15 +114,17 @@ Each pooling model in vLLM supports one or more of these tasks according to
 [Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
 enabling the corresponding APIs.
-### Offline APIs corresponding to pooling tasks
+### Offline APIs corresponding to pooling usages
-| Task | APIs |
-|------------------|---------------------------------------------------------------------------------------|
-| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)`(see note) |
-| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`, `LLM.score(...)` |
-| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
-| `token_embed` | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)` |
-| `plugin` | `LLM.encode(..., pooling_task="plugin")` |
+| Pooling Usages | Dedicated API | Pooling task for `LLM.encode` API | Score Types | scoring function |
+|-----------------------------|---------------------|-----------------------------------|----------------------------|--------------------------|
+| Classification Usages | `LLM.classify(...)` | `classify` | `cross-encoder` (see note) | linear classifier |
+| Embedding Usages | `LLM.embed(...)` | `embed` | `bi-encoder` | cosine similarity |
+| Token Classification Usages | N/A | `token_classify` | N/A | N/A |
+| Token Embedding Usages | N/A | `token_embed` | `late-interaction` | late interaction(MaxSim) |
+| Reward Usages | N/A | `classify` & `token_classify` | N/A | N/A |
+| Scoring Usages | `LLM.score(...)` | N/A | N/A | N/A |
+| Plugins Usages | N/A | `plugin` | N/A | N/A |
 !!! note
     Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
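Review note: the new table can be read as a lookup from a pooling usage to the `pooling_task` value(s) passed to `LLM.encode`. The following is a minimal sketch in plain Python. `ENCODE_TASKS` and `tasks_for` are hypothetical names used only for illustration and are not part of vLLM; the values are transcribed from the new table in this commit.

```python
# Hypothetical lookup table transcribed from the "Pooling task for
# `LLM.encode` API" column. Neither ENCODE_TASKS nor tasks_for exists in vLLM.
ENCODE_TASKS: dict[str, tuple[str, ...]] = {
    "classification": ("classify",),
    "embedding": ("embed",),
    "token classification": ("token_classify",),
    "token embedding": ("token_embed",),
    "reward": ("classify", "token_classify"),
    "scoring": (),  # N/A in the table: LLM.score(...) is the dedicated API
    "plugins": ("plugin",),
}


def tasks_for(usage: str) -> tuple[str, ...]:
    """Return the pooling_task values the table lists for a pooling usage."""
    key = usage.lower().removesuffix(" usages").strip()
    if key not in ENCODE_TASKS:
        raise KeyError(f"unknown pooling usage: {usage!r}")
    return ENCODE_TASKS[key]


print(tasks_for("Reward Usages"))
```

Reward usages are the only row with two tasks, which is why the deprecation message in this commit names both `classify` and `token_classify`.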
@@ -147,7 +150,7 @@ It is primarily designed for [score models](scoring.md).
 The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
-Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
 ### Examples
@@ -183,9 +186,12 @@ Our Pooling API (`/pooling`) is similar to `LLM.encode`, being applicable to all
 The input format is the same as [Embeddings API](embed.md#openai-compatible-embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
-Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
-Code example: [examples/pooling/pooling/pooling_online.py](../../../examples/pooling/pooling/pooling_online.py)
+Code examples:
+- [Online example](../../../examples/pooling/reward/token_reward_online.py)
+- [Offline example](../../../examples/pooling/reward/token_reward_offline.py)
 ### Examples
......
@@ -134,3 +134,13 @@ print(f"Data: {data!r}")
 ## Online Serving
 Please refer to the [pooling API](README.md#pooling-api). Pooling task corresponding to reward model types refer to the [table above](#summary).
+## More examples
+More examples can be found here: [examples/pooling/reward](../../../examples/pooling/reward)
+## Deprecated Features
+### `LLM.reward`
+`llm.reward` api is deprecated and will be removed in v0.23. Please use `LLM.encode` with `pooling_task="classify"` or `pooling_task="token_classify"` instead.
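Review note: migrating a caller away from `llm.reward(prompts)` is a one-line change. The sketch below uses a hypothetical `FakeLLM` stand-in so that it runs without vLLM or a model; only the `encode(prompts, pooling_task=...)` call shape and the two task names come from this commit. `get_rewards` is likewise an illustrative helper, not a vLLM function.

```python
class FakeLLM:
    """Hypothetical stand-in for vllm.LLM: records calls, runs no model."""

    def __init__(self) -> None:
        self.calls: list[tuple[tuple[str, ...], str]] = []

    def encode(self, prompts, pooling_task=None):
        # Mirrors the behaviour shown in this diff: the task is mandatory.
        if pooling_task is None:
            raise ValueError("pooling_task required for `LLM.encode`")
        self.calls.append((tuple(prompts), pooling_task))
        return [f"<output for {p!r}>" for p in prompts]


def get_rewards(llm, prompts, token_level: bool):
    """Illustrative replacement for the deprecated llm.reward(prompts)."""
    # Token reward models use "token_classify"; sequence reward models
    # use "classify".
    task = "token_classify" if token_level else "classify"
    return llm.encode(prompts, pooling_task=task)


llm = FakeLLM()
get_rewards(llm, ["Hello, my name is"], token_level=True)
get_rewards(llm, ["Hello, my name is"], token_level=False)
print(llm.calls)
```

With a real model, `llm` would be a `vllm.LLM` created with `runner="pooling"`, as in the example files in this commit.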
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example offline usage of sequence reward models.

The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""

from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.print_utils import print_embeddings


def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="Skywork/Skywork-Reward-V2-Qwen3-0.6B",
        runner="pooling",
        enforce_eager=True,
        max_model_len=1024,
        trust_remote_code=True,
    )
    return parser.parse_args()


def main(args: Namespace):
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create an LLM.
    # You should pass runner="pooling" for reward models
    llm = LLM(**vars(args))

    # Generate rewards. The output is a list of PoolingRequestOutput.
    # Use pooling_task="classify" for sequence reward models.
    outputs = llm.encode(prompts, pooling_task="classify")

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for prompt, output in zip(prompts, outputs):
        rewards = output.outputs.data
        print(f"Prompt: {prompt!r}")
        print_embeddings(rewards.tolist(), prefix="Reward")
        print("-" * 60)


if __name__ == "__main__":
    args = parse_args()
    main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example online usage of sequence reward models.

Run `vllm serve <model> --runner pooling`
to start up the server in vLLM. e.g.

vllm serve Skywork/Skywork-Reward-V2-Qwen3-0.6B

The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""

import argparse
import pprint

import requests


def post_http_request(prompt: dict, api_url: str) -> requests.Response:
    headers = {"User-Agent": "Test Client"}
    response = requests.post(api_url, headers=headers, json=prompt)
    return response


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    return parser.parse_args()


def main(args):
    base_url = f"http://{args.host}:{args.port}"
    models_url = base_url + "/v1/models"
    pooling_url = base_url + "/pooling"

    response = requests.get(models_url)
    model = response.json()["data"][0]["id"]

    # Input like Completions API
    prompt = {"model": model, "input": "vLLM is great!"}
    pooling_response = post_http_request(prompt=prompt, api_url=pooling_url)
    print("-" * 50)
    print("Pooling Response:")
    pprint.pprint(pooling_response.json())
    print("-" * 50)

    # Input like Chat API
    prompt = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "vLLM is great!"}],
            }
        ],
    }
    pooling_response = post_http_request(prompt=prompt, api_url=pooling_url)
    print("Pooling Response:")
    pprint.pprint(pooling_response.json())
    print("-" * 50)


if __name__ == "__main__":
    args = parse_args()
    main(args)
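Review note: the online example above sends two request shapes to `/pooling`, one like the Completions API and one like the Chat API. A minimal sketch of the two payloads as pure functions, so they can be inspected without a running server. `completion_style_payload` and `chat_style_payload` are hypothetical helper names; the field names are taken from the example itself.

```python
import json


def completion_style_payload(model: str, text: str) -> dict:
    """Payload shaped like the 'Input like Completions API' request."""
    return {"model": model, "input": text}


def chat_style_payload(model: str, text: str) -> dict:
    """Payload shaped like the 'Input like Chat API' request."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}],
            }
        ],
    }


# The model name below is the one used in the example's docstring.
model = "Skywork/Skywork-Reward-V2-Qwen3-0.6B"
print(json.dumps(completion_style_payload(model, "vLLM is great!"), indent=2))
print(json.dumps(chat_style_payload(model, "vLLM is great!"), indent=2))
```

Either dictionary can be passed as the `json=` argument of `requests.post`, as `post_http_request` does in the example.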
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+Example offline usage of token reward models.
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
+"""
 from argparse import Namespace
 from vllm import LLM, EngineArgs
@@ -36,14 +45,14 @@ def main(args: Namespace):
     llm = LLM(**vars(args))
     # Generate rewards. The output is a list of PoolingRequestOutput.
-    outputs = llm.reward(prompts)
+    outputs = llm.encode(prompts, pooling_task="token_classify")
     # Print the outputs.
     print("\nGenerated Outputs:\n" + "-" * 60)
     for prompt, output in zip(prompts, outputs):
         rewards = output.outputs.data
         print(f"Prompt: {prompt!r}")
-        print_embeddings(rewards, prefix="Reward")
+        print_embeddings(rewards.tolist(), prefix="Reward")
         print("-" * 60)
......
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-Example online usage of Pooling API.
+Example online usage of token reward models.
 Run `vllm serve <model> --runner pooling`
 to start up the server in vLLM. e.g.
 vllm serve internlm/internlm2-1_8b-reward --trust-remote-code
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
 """
 import argparse
......
@@ -1183,7 +1183,7 @@ class VllmRunner:
         return [req_output.outputs.data for req_output in req_outputs]
     def reward(self, prompts: list[str]) -> list[list[float]]:
-        req_outputs = self.llm.reward(prompts)
+        req_outputs = self.llm.encode(prompts, pooling_task="token_classify")
         return [req_output.outputs.data for req_output in req_outputs]
     def score(
......
@@ -1166,19 +1166,16 @@ class LLM:
         if pooling_task is None:
             raise ValueError(
-                "pooling_task required for `LLM.encode`\n"
-                "Please use one of the more specific methods or set the "
-                "pooling_task when using `LLM.encode`:\n"
-                " - For embeddings, use `LLM.embed(...)` "
-                'or `pooling_task="embed"`.\n'
-                " - For classification logits, use `LLM.classify(...)` "
-                'or `pooling_task="classify"`.\n'
-                " - For similarity scores, use `LLM.score(...)`.\n"
-                " - For rewards, use `LLM.reward(...)` "
-                'or `pooling_task="token_classify"`\n'
-                " - For token classification, "
-                'use `pooling_task="token_classify"`\n'
-                ' - For multi-vector retrieval, use `pooling_task="token_embed"`'
+                """
+                pooling_task required for `LLM.encode`.
+                Please use one of the more specific methods or set the pooling_task when using `LLM.encode`:
+                - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
+                - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
+                - For similarity scores, use `LLM.score(...)`.
+                - For rewards, `pooling_task="classify"` or `pooling_task="token_classify"`.
+                - For token classification, use `pooling_task="token_classify"`.
+                - For multi-vector retrieval, use `pooling_task="token_embed"`.
+                """  # noqa: E501
             )
         if (
@@ -1340,6 +1337,11 @@ class LLM:
             A list of `PoolingRequestOutput` objects containing the
             pooled hidden states in the same order as the input prompts.
         """
+        logger.warning_once(
+            "`llm.reward` api is deprecated and will be removed in v0.23. "
+            'Please use `LLM.encode` with `pooling_task="classify"` or '
+            '`pooling_task="token_classify"` instead.'
+        )
         return self.encode(
             prompts,
             use_tqdm=use_tqdm,
......
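Review note: the deprecation path calls `logger.warning_once`, whose implementation is not part of this diff. One common way to get once-per-message behaviour is to cache on the message string. The sketch below assumes a hypothetical `warn_once` helper built on `functools.lru_cache` and the standard `logging` module; it is not vLLM's implementation.

```python
import functools
import logging

logger = logging.getLogger("deprecation_sketch")


class ListHandler(logging.Handler):
    """Collects emitted messages so the behaviour can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Hypothetical helper: repeated calls with the same message hit the
    cache, so the warning is logged only on the first call."""
    logger.warning(message)


MESSAGE = (
    "`llm.reward` api is deprecated and will be removed in v0.23. "
    'Please use `LLM.encode` with `pooling_task="classify"` or '
    '`pooling_task="token_classify"` instead.'
)

for _ in range(3):
    warn_once(MESSAGE)

print(len(handler.messages))
```

The trade-off of this approach is that the cache is keyed on the exact message text, so messages that interpolate per-call values would be logged once per distinct value.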