Unverified Commit 9744b699 authored by wang.yuqi, committed by GitHub

[Deprecate] Deprecate LLM.reward offline api, use LLM.encode instead. (#40688)


Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
parent c662b435
......@@ -78,7 +78,7 @@ The scoring models is designed to compute similarity scores between two input pr
|-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
| `classify` (see note) | Sequence-wise | reranker score for each sequence | `cross-encoder` | linear classifier |
| `embed` | Sequence-wise | vector representations for each sequence | `bi-encoder` | cosine similarity |
| `token_classify` | Token-wise | probability vector of classes for each token | nan | nan |
| `token_classify` | Token-wise | probability vector of classes for each token | N/A | N/A |
| `token_embed` | Token-wise | vector representations for each token | `late-interaction` | late interaction(MaxSim) |
!!! note
......@@ -86,14 +86,15 @@ The scoring models is designed to compute similarity scores between two input pr
### Pooling Usages
| Pooling Usages | Description |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| Classification Usages | Predicting which predefined category, class, or label best corresponds to a given input. |
| Embedding Usages | Converts unstructured data (text, images, audio, etc.) into structured numerical vectors (embeddings). |
| Token Classification Usages | Token-wise classification |
| Token Embedding Usages | Token-wise embedding |
| Scoring Usages | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
| Reward Usages | Evaluates the quality of outputs generated by a language model, acting as a proxy for human preferences. |
| Pooling Usages | Description |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Classification Usages | Predicting which predefined category, class, or label best corresponds to a given input. |
| Embedding Usages | Converts unstructured data (text, images, audio, etc.) into structured numerical vectors (embeddings). |
| Token Classification Usages | Token-wise classification |
| Token Embedding Usages | Token-wise embedding |
| Reward Usages | Evaluates the quality of outputs generated by a language model, acting as a proxy for human preferences. |
| Scoring Usages | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
| Plugins Usages | Allow users to customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md). |
We also have some special models that support multiple pooling tasks, or have specific usage scenarios, or support special inputs and outputs.
......@@ -101,9 +102,9 @@ For more detailed information, please refer to the link below.
- [Classification Usages](classify.md)
- [Embedding Usages](embed.md)
- [Reward Usages](reward.md)
- [Token Classification Usages](token_classify.md)
- [Token Embedding Usages](token_embed.md)
- [Reward Usages](reward.md)
- [Scoring Usages](scoring.md)
- [Specific Model Examples](specific_models.md)
......@@ -113,15 +114,17 @@ Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs.
### Offline APIs corresponding to pooling tasks
### Offline APIs corresponding to pooling usages
| Task | APIs |
|------------------|---------------------------------------------------------------------------------------|
| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)`(see note) |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`, `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)` |
| `plugin` | `LLM.encode(..., pooling_task="plugin")` |
| Pooling Usages | Dedicated API | Pooling task for `LLM.encode` API | Score Types | scoring function |
|-----------------------------|---------------------|-----------------------------------|----------------------------|--------------------------|
| Classification Usages | `LLM.classify(...)` | `classify` | `cross-encoder` (see note) | linear classifier |
| Embedding Usages | `LLM.embed(...)` | `embed` | `bi-encoder` | cosine similarity |
| Token Classification Usages | N/A | `token_classify` | N/A | N/A |
| Token Embedding Usages | N/A | `token_embed` | `late-interaction` | late interaction(MaxSim) |
| Reward Usages | N/A | `classify` & `token_classify` | N/A | N/A |
| Scoring Usages | `LLM.score(...)` | N/A | N/A | N/A |
| Plugins Usages | N/A | `plugin` | N/A | N/A |
!!! note
Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
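As a rough illustration of the usage-to-task table above, the mapping can be written down as a plain lookup. This is only a sketch: `ENCODE_TASKS_BY_USAGE` and `encode_tasks_for` are hypothetical names invented here, not part of vLLM, and the entries are simply transcribed from the "Pooling task for `LLM.encode` API" column.

```python
# Hypothetical lookup transcribed from the table above; vLLM does not ship
# this dictionary or the helper below.
ENCODE_TASKS_BY_USAGE: dict[str, tuple[str, ...]] = {
    "classification": ("classify",),
    "embedding": ("embed",),
    "token_classification": ("token_classify",),
    "token_embedding": ("token_embed",),
    "reward": ("classify", "token_classify"),
    "scoring": (),  # the table lists only LLM.score(...), no LLM.encode task
    "plugins": ("plugin",),
}


def encode_tasks_for(usage: str) -> tuple[str, ...]:
    """Return the pooling_task values the table lists for a pooling usage."""
    try:
        return ENCODE_TASKS_BY_USAGE[usage]
    except KeyError:
        raise ValueError(f"unknown pooling usage: {usage!r}") from None
```

For example, `encode_tasks_for("reward")` yields both `classify` (sequence reward models) and `token_classify` (token reward models), matching the two tasks named in the deprecation notice for `LLM.reward`.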
......@@ -147,7 +150,7 @@ It is primarily designed for [score models](scoring.md).
The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
### Examples
......@@ -183,9 +186,12 @@ Our Pooling API (`/pooling`) is similar to `LLM.encode`, being applicable to all
The input format is the same as [Embeddings API](embed.md#openai-compatible-embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
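The two request shapes accepted by the Pooling API can be sketched as follows. The payload layouts are taken from the online example added in this commit; the helper names (`build_input_payload`, `build_chat_payload`, `post_pooling`) are hypothetical, and the sketch uses the standard-library `urllib` rather than `requests` so that it has no third-party dependency.

```python
import json
import urllib.request


def build_input_payload(model: str, text: str) -> dict:
    """Completions-style body: a plain `input` field."""
    return {"model": model, "input": text}


def build_chat_payload(model: str, text: str) -> dict:
    """Chat-style body: a `messages` list with a single user turn."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": text}]},
        ],
    }


def post_pooling(base_url: str, payload: dict) -> dict:
    """POST a payload to `<base_url>/pooling` and decode the JSON reply.

    Assumes a vLLM server is already running at `base_url`.
    """
    request = urllib.request.Request(
        base_url + "/pooling",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server started via `vllm serve <model> --runner pooling`, a call such as `post_pooling("http://localhost:8000", build_input_payload(model, "vLLM is great!"))` would return the pooled output as JSON; the exact response fields are not shown here because they are not specified in this diff.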
Code examples:
Code example: [examples/pooling/pooling/pooling_online.py](../../../examples/pooling/pooling/pooling_online.py)
- [Online example](../../../examples/pooling/reward/token_reward_online.py)
- [Offline example](../../../examples/pooling/reward/token_reward_offline.py)
### Examples
......
......@@ -134,3 +134,13 @@ print(f"Data: {data!r}")
## Online Serving
Please refer to the [pooling API](README.md#pooling-api). For the pooling task that corresponds to each reward model type, refer to the [table above](#summary).
## More examples
More examples can be found here: [examples/pooling/reward](../../../examples/pooling/reward)
## Deprecated Features
### `LLM.reward`
The `LLM.reward` API is deprecated and will be removed in v0.23. Please use `LLM.encode` with `pooling_task="classify"` or `pooling_task="token_classify"` instead.
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example offline usage of sequence reward models.
The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""

from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.print_utils import print_embeddings


def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="Skywork/Skywork-Reward-V2-Qwen3-0.6B",
        runner="pooling",
        enforce_eager=True,
        max_model_len=1024,
        trust_remote_code=True,
    )
    return parser.parse_args()


def main(args: Namespace):
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create an LLM.
    # You should pass runner="pooling" for reward models
    llm = LLM(**vars(args))

    # Generate rewards. The output is a list of PoolingRequestOutput.
    # Use pooling_task="classify" for sequence reward models.
    outputs = llm.encode(prompts, pooling_task="classify")

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for prompt, output in zip(prompts, outputs):
        rewards = output.outputs.data
        print(f"Prompt: {prompt!r}")
        print_embeddings(rewards.tolist(), prefix="Reward")
        print("-" * 60)


if __name__ == "__main__":
    args = parse_args()
    main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example online usage of sequence reward models.
Run `vllm serve <model> --runner pooling`
to start up the server in vLLM. e.g.
vllm serve Skywork/Skywork-Reward-V2-Qwen3-0.6B
The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""

import argparse
import pprint

import requests


def post_http_request(prompt: dict, api_url: str) -> requests.Response:
    headers = {"User-Agent": "Test Client"}
    response = requests.post(api_url, headers=headers, json=prompt)
    return response


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    return parser.parse_args()


def main(args):
    base_url = f"http://{args.host}:{args.port}"
    models_url = base_url + "/v1/models"
    pooling_url = base_url + "/pooling"

    # Use the first model the server reports.
    response = requests.get(models_url)
    model = response.json()["data"][0]["id"]

    # Input like Completions API
    prompt = {"model": model, "input": "vLLM is great!"}
    pooling_response = post_http_request(prompt=prompt, api_url=pooling_url)
    print("-" * 50)
    print("Pooling Response:")
    pprint.pprint(pooling_response.json())
    print("-" * 50)

    # Input like Chat API
    prompt = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "vLLM is great!"}],
            }
        ],
    }
    pooling_response = post_http_request(prompt=prompt, api_url=pooling_url)
    print("Pooling Response:")
    pprint.pprint(pooling_response.json())
    print("-" * 50)


if __name__ == "__main__":
    args = parse_args()
    main(args)
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example offline usage of token reward models.
The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""
from argparse import Namespace
from vllm import LLM, EngineArgs
......@@ -36,14 +45,14 @@ def main(args: Namespace):
llm = LLM(**vars(args))
# Generate rewards. The output is a list of PoolingRequestOutput.
outputs = llm.reward(prompts)
outputs = llm.encode(prompts, pooling_task="token_classify")
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for prompt, output in zip(prompts, outputs):
rewards = output.outputs.data
print(f"Prompt: {prompt!r}")
print_embeddings(rewards, prefix="Reward")
print_embeddings(rewards.tolist(), prefix="Reward")
print("-" * 60)
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Example online usage of Pooling API.
Example online usage of token reward models.
Run `vllm serve <model> --runner pooling`
to start up the server in vLLM. e.g.
vllm serve internlm/internlm2-1_8b-reward --trust-remote-code
The key distinction between sequence classification and token classification
lies in their output granularity: sequence classification produces a single
result for an entire input sequence, whereas token classification yields a
result for each individual token within the sequence.
"""
import argparse
......
......@@ -1183,7 +1183,7 @@ class VllmRunner:
return [req_output.outputs.data for req_output in req_outputs]
def reward(self, prompts: list[str]) -> list[list[float]]:
req_outputs = self.llm.reward(prompts)
req_outputs = self.llm.encode(prompts, pooling_task="token_classify")
return [req_output.outputs.data for req_output in req_outputs]
def score(
......
......@@ -1166,19 +1166,16 @@ class LLM:
if pooling_task is None:
raise ValueError(
"pooling_task required for `LLM.encode`\n"
"Please use one of the more specific methods or set the "
"pooling_task when using `LLM.encode`:\n"
" - For embeddings, use `LLM.embed(...)` "
'or `pooling_task="embed"`.\n'
" - For classification logits, use `LLM.classify(...)` "
'or `pooling_task="classify"`.\n'
" - For similarity scores, use `LLM.score(...)`.\n"
" - For rewards, use `LLM.reward(...)` "
'or `pooling_task="token_classify"`\n'
" - For token classification, "
'use `pooling_task="token_classify"`\n'
' - For multi-vector retrieval, use `pooling_task="token_embed"`'
"""
pooling_task required for `LLM.encode`.
Please use one of the more specific methods or set the pooling_task when using `LLM.encode`:
- For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
- For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
- For similarity scores, use `LLM.score(...)`.
- For rewards, `pooling_task="classify"` or `pooling_task="token_classify"`.
- For token classification, use `pooling_task="token_classify"`.
- For multi-vector retrieval, use `pooling_task="token_embed"`.
""" # noqa: E501
)
if (
......@@ -1340,6 +1337,11 @@ class LLM:
A list of `PoolingRequestOutput` objects containing the
pooled hidden states in the same order as the input prompts.
"""
logger.warning_once(
"`llm.reward` api is deprecated and will be removed in v0.23. "
'Please use `LLM.encode` with `pooling_task="classify"` or '
'`pooling_task="token_classify"` instead.'
)
return self.encode(
prompts,
use_tqdm=use_tqdm,
......