[Deprecate] Deprecate LLM.reward offline api, use LLM.encode instead. (#40688)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Unverified Commit 9744b699 authored Apr 24, 2026 by wang.yuqi Committed by GitHub Apr 24, 2026
11 changed files
--- a/docs/models/pooling_models/README.md
+++ b/docs/models/pooling_models/README.md
@@ -78,7 +78,7 @@ The scoring models is designed to compute similarity scores between two input pr
 |-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
 | `classify` (see note) | Sequence-wise | reranker score for each sequence             | `cross-encoder`    | linear classifier        |
 | `embed`               | Sequence-wise | vector representations for each sequence     | `bi-encoder`       | cosine similarity        |
-| `token_classify`      | Token-wise    | probability vector of classes for each token | nan                | nan                      |
+| `token_classify`      | Token-wise    | probability vector of classes for each token | N/A                | N/A                      |
 | `token_embed`         | Token-wise    | vector representations for each token        | `late-interaction` | late interaction(MaxSim) |
 !!! note
@@ -87,13 +87,14 @@ The scoring models is designed to compute similarity scores between two input pr
 ### Pooling Usages
 | Pooling Usages              | Description                                                                                                                                               |
-|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Classification Usages       | Predicting which predefined category, class, or label best corresponds to a given input.                                                                  |
 | Embedding Usages            | Converts unstructured data (text, images, audio, etc.) into structured numerical vectors (embeddings).                                                    |
 | Token Classification Usages | Token-wise classification                                                                                                                                 |
 | Token Embedding Usages      | Token-wise embedding                                                                                                                                      |
-| Scoring Usages              | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
 | Reward Usages               | Evaluates the quality of outputs generated by a language model, acting as a proxy for human preferences.                                                  |
+| Scoring Usages              | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.   |
+| Plugins Usages              | Allow users to customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md). |
 We also have some special models that support multiple pooling tasks, or have specific usage scenarios, or support special inputs and outputs.
@@ -101,9 +102,9 @@ For more detailed information, please refer to the link below.
 - [Classification Usages](classify.md)
 - [Embedding Usages](embed.md)
-- [Reward Usages](reward.md)
 - [Token Classification Usages](token_classify.md)
 - [Token Embedding Usages](token_embed.md)
+- [Reward Usages](reward.md)
 - [Scoring Usages](scoring.md)
 - [Specific Model Examples](specific_models.md)
@@ -113,15 +114,17 @@ Each pooling model in vLLM supports one or more of these tasks according to
 [Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
 enabling the corresponding APIs.
-### Offline APIs corresponding to pooling tasks
+### Offline APIs corresponding to pooling usages
-| Task             | APIs                                                                                  |
+| Pooling Usages              | Dedicated API       | Pooling task for `LLM.encode` API | Score Types                | scoring function         |
-|------------------|---------------------------------------------------------------------------------------|
+|-----------------------------|---------------------|-----------------------------------|----------------------------|--------------------------|
-| `embed`          | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)`(see note) |
+| Classification Usages       | `LLM.classify(...)` | `classify`                        | `cross-encoder` (see note) | linear classifier        |
-| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`, `LLM.score(...)`     |
+| Embedding Usages            | `LLM.embed(...)`    | `embed`                           | `bi-encoder`               | cosine similarity        |
-| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`                   |
+| Token Classification Usages | N/A                 | `token_classify`                  | N/A                        | N/A                      |
-| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)`                       |
+| Token Embedding Usages      | N/A                 | `token_embed`                     | `late-interaction`         | late interaction(MaxSim) |
-| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                              |
+| Reward Usages               | N/A                 | `classify` & `token_classify`     | N/A                        | N/A                      |
+| Scoring Usages              | `LLM.score(...)`    | N/A                               | N/A                        | N/A                      |
+| Plugins Usages              | N/A                 | `plugin`                          | N/A                        | N/A                      |
 !!! note
    Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
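The new table can be read as a lookup from a pooling usage to the `pooling_task` value(s) accepted by `LLM.encode`. The following is a minimal sketch of that reading, not part of vLLM: the names `POOLING_TASKS_BY_USAGE` and `encode_tasks_for` and the lowercase usage keys are hypothetical and chosen only for illustration.

```python
# Hypothetical lookup mirroring the "Pooling task for `LLM.encode` API" column
# of the table in this diff. Names and keys are illustrative, not vLLM APIs.
POOLING_TASKS_BY_USAGE: dict[str, tuple[str, ...]] = {
    "classification": ("classify",),
    "embedding": ("embed",),
    "token_classification": ("token_classify",),
    "token_embedding": ("token_embed",),
    # Reward models map to either task, depending on output granularity.
    "reward": ("classify", "token_classify"),
    # Scoring has no pooling task in the table; it uses LLM.score(...).
    "scoring": (),
    "plugins": ("plugin",),
}


def encode_tasks_for(usage: str) -> tuple[str, ...]:
    """Return the pooling_task values the table lists for a usage."""
    try:
        return POOLING_TASKS_BY_USAGE[usage]
    except KeyError:
        raise ValueError(f"unknown pooling usage: {usage!r}") from None


if __name__ == "__main__":
    for usage in POOLING_TASKS_BY_USAGE:
        print(usage, "->", encode_tasks_for(usage))
```

An empty tuple signals that the usage is reached through a dedicated API rather than through `LLM.encode`.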
@@ -147,7 +150,7 @@ It is primarily designed for [score models](scoring.md).
 The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
-Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+Please use one of the more specific methods or set the task directly when using `LLM.encode`, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
 ### Examples
@@ -183,9 +186,12 @@ Our Pooling API (`/pooling`) is similar to `LLM.encode`, being applicable to all
 The input format is the same as [Embeddings API](embed.md#openai-compatible-embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
-Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+Please use one of the more specific APIs or set the task directly when using the Pooling API, refer to the [table above](#offline-apis-corresponding-to-pooling-usages).
+Code examples:
-Code example: [examples/pooling/pooling/pooling_online.py](../../../examples/pooling/pooling/pooling_online.py)
+- [Online example](../../../examples/pooling/reward/token_reward_online.py)
+- [Offline example](../../../examples/pooling/reward/token_reward_offline.py)
 ### Examples

--- a/docs/models/pooling_models/reward.md
+++ b/docs/models/pooling_models/reward.md
@@ -134,3 +134,13 @@ print(f"Data: {data!r}")
 ## Online Serving
 Please refer to the [pooling API](README.md#pooling-api). Pooling task corresponding to reward model types refer to the [table above](#summary).
+## More examples
+More examples can be found here: [examples/pooling/reward](../../../examples/pooling/reward)
+## Deprecated Features
+### `LLM.reward`
+The `LLM.reward` API is deprecated and will be removed in v0.23. Please use `LLM.encode` with `pooling_task="classify"` or `pooling_task="token_classify"` instead.
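For callers migrating off the deprecated method, the change amounts to choosing a pooling task explicitly. The sketch below is one possible shape for that migration, assuming only that the engine object exposes `encode(prompts, pooling_task=...)` as used elsewhere in this diff. The helper `compute_rewards`, the `SupportsEncode` protocol, and the `FakeLLM` stand-in are hypothetical and exist only so the sketch runs without a model.

```python
from typing import Any, Protocol, Sequence


class SupportsEncode(Protocol):
    """Anything exposing encode(prompts, pooling_task=...), e.g. vllm.LLM."""

    def encode(self, prompts: Sequence[str], *, pooling_task: str) -> Any: ...


def compute_rewards(
    llm: SupportsEncode, prompts: Sequence[str], *, per_token: bool
) -> Any:
    """Hypothetical replacement for `llm.reward(prompts)`.

    per_token=True  -> one result per token ("token_classify")
    per_token=False -> one result per sequence ("classify")
    """
    pooling_task = "token_classify" if per_token else "classify"
    return llm.encode(prompts, pooling_task=pooling_task)


class FakeLLM:
    """Stand-in for vllm.LLM that only records which task was requested."""

    def __init__(self) -> None:
        self.tasks: list[str] = []

    def encode(self, prompts: Sequence[str], *, pooling_task: str) -> list[str]:
        self.tasks.append(pooling_task)
        return [f"{pooling_task}:{prompt}" for prompt in prompts]


if __name__ == "__main__":
    fake = FakeLLM()
    print(compute_rewards(fake, ["Hello, my name is"], per_token=True))
    print(compute_rewards(fake, ["Hello, my name is"], per_token=False))
```

Which task applies depends on the reward model: the examples in this commit use `classify` for the sequence reward model and `token_classify` for the token reward model.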
--- a/examples/pooling/reward/sequence_reward_offline.py
+++ b/examples/pooling/reward/sequence_reward_offline.py
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+Example offline usage of sequence reward models.
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
+"""
+from argparse import Namespace
+from vllm import LLM, EngineArgs
+from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm.utils.print_utils import print_embeddings
+def parse_args():
+    parser = FlexibleArgumentParser()
+    parser = EngineArgs.add_cli_args(parser)
+    # Set example specific arguments
+    parser.set_defaults(
+        model="Skywork/Skywork-Reward-V2-Qwen3-0.6B",
+        runner="pooling",
+        enforce_eager=True,
+        max_model_len=1024,
+        trust_remote_code=True,
+    )
+    return parser.parse_args()
+def main(args: Namespace):
+    # Sample prompts.
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ]
+    # Create an LLM.
+    # You should pass runner="pooling" for reward models
+    llm = LLM(**vars(args))
+    # Generate rewards. The output is a list of PoolingRequestOutput.
+    # Use pooling_task="classify" for sequence reward models.
+    outputs = llm.encode(prompts, pooling_task="classify")
+    # Print the outputs.
+    print("\nGenerated Outputs:\n" + "-" * 60)
+    for prompt, output in zip(prompts, outputs):
+        rewards = output.outputs.data
+        print(f"Prompt: {prompt!r}")
+        print_embeddings(rewards.tolist(), prefix="Reward")
+        print("-" * 60)
+if __name__ == "__main__":
+    args = parse_args()
+    main(args)
--- a/examples/pooling/reward/sequence_reward_online.py
+++ b/examples/pooling/reward/sequence_reward_online.py
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+Example online usage of sequence reward models.
+Run `vllm serve <model> --runner pooling`
+to start up the server in vLLM. e.g.
+vllm serve Skywork/Skywork-Reward-V2-Qwen3-0.6B
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
+"""
+import argparse
+import pprint
+import requests
+def post_http_request(prompt: dict, api_url: str) -> requests.Response:
+    headers = {"User-Agent": "Test Client"}
+    response = requests.post(api_url, headers=headers, json=prompt)
+    return response
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", type=str, default="localhost")
+    parser.add_argument("--port", type=int, default=8000)
+    return parser.parse_args()
+def main(args):
+    base_url = f"http://{args.host}:{args.port}"
+    models_url = base_url + "/v1/models"
+    pooing_url = base_url + "/pooling"
+    response = requests.get(models_url)
+    model = response.json()["data"][0]["id"]
+    # Input like Completions API
+    prompt = {"model": model, "input": "vLLM is great!"}
+    pooling_response = post_http_request(prompt=prompt, api_url=pooing_url)
+    print("-" * 50)
+    print("Pooling Response:")
+    pprint.pprint(pooling_response.json())
+    print("-" * 50)
+    # Input like Chat API
+    prompt = {
+        "model": model,
+        "messages": [
+            {
+                "role": "user",
+                "content": [{"type": "text", "text": "vLLM is great!"}],
+            }
+        ],
+    }
+    pooling_response = post_http_request(prompt=prompt, api_url=pooing_url)
+    print("Pooling Response:")
+    pprint.pprint(pooling_response.json())
+    print("-" * 50)
+if __name__ == "__main__":
+    args = parse_args()
+    main(args)
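The online example above sends two request shapes to `/pooling`: a completions-style body with an `input` field and a chat-style body with a `messages` list. As a rough illustration, the two bodies can be built by small helpers like the ones below. The function names and the `my-model` placeholder are hypothetical; only the field layout is taken from the example in this diff.

```python
import json


def completion_style_payload(model: str, text: str) -> dict:
    """Build a body shaped like the "Input like Completions API" request."""
    return {"model": model, "input": text}


def chat_style_payload(model: str, text: str) -> dict:
    """Build a body shaped like the "Input like Chat API" request."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}],
            }
        ],
    }


if __name__ == "__main__":
    # "my-model" is a placeholder; the example queries /v1/models for the id.
    for payload in (
        completion_style_payload("my-model", "vLLM is great!"),
        chat_style_payload("my-model", "vLLM is great!"),
    ):
        print(json.dumps(payload, indent=2))
```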
--- a/examples/basic/offline_inference/reward.py
+++ b/examples/basic/offline_inference/reward.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+Example offline usage of token reward models.
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
+"""
 from argparse import Namespace
 from vllm import LLM, EngineArgs
@@ -36,14 +45,14 @@ def main(args: Namespace):
    llm = LLM(**vars(args))
    # Generate rewards. The output is a list of PoolingRequestOutput.
-    outputs = llm.reward(prompts)
+    outputs = llm.encode(prompts, pooling_task="token_classify")
    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for prompt, output in zip(prompts, outputs):
        rewards = output.outputs.data
        print(f"Prompt: {prompt!r}")
-        print_embeddings(rewards, prefix="Reward")
+        print_embeddings(rewards.tolist(), prefix="Reward")
        print("-" * 60)

--- a/examples/pooling/pooling/pooling_online.py
+++ b/examples/pooling/pooling/pooling_online.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-Example online usage of Pooling API.
+Example online usage of token reward models.
 Run `vllm serve <model> --runner pooling`
 to start up the server in vLLM. e.g.
 vllm serve internlm/internlm2-1_8b-reward --trust-remote-code
+The key distinction between sequence classification and token classification
+lies in their output granularity: sequence classification produces a single
+result for an entire input sequence, whereas token classification yields a
+result for each individual token within the sequence.
 """
 import argparse

--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -1183,7 +1183,7 @@ class VllmRunner:
        return [req_output.outputs.data for req_output in req_outputs]
    def reward(self, prompts: list[str]) -> list[list[float]]:
-        req_outputs = self.llm.reward(prompts)
+        req_outputs = self.llm.encode(prompts, pooling_task="token_classify")
        return [req_output.outputs.data for req_output in req_outputs]
    def score(

--- a/tests/entrypoints/pooling/pooling/__init__.py
+++ b/tests/entrypoints/pooling/pooling/__init__.py
--- a/tests/entrypoints/pooling/reward/test_offline.py
+++ b/tests/entrypoints/pooling/reward/test_offline.py
--- a/tests/entrypoints/pooling/pooling/test_online.py
+++ b/tests/entrypoints/pooling/pooling/test_online.py
--- a/vllm/entrypoints/llm.py
+++ b/vllm/entrypoints/llm.py
@@ -1166,19 +1166,16 @@ class LLM:
        if pooling_task is None:
            raise ValueError(
-                "pooling_task required for `LLM.encode`\n"
+                """
-                "Please use one of the more specific methods or set the "
+                pooling_task required for `LLM.encode`.
-                "pooling_task when using `LLM.encode`:\n"
+                Please use one of the more specific methods or set the pooling_task when using `LLM.encode`:
-                "  - For embeddings, use `LLM.embed(...)` "
+                  - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
-                'or `pooling_task="embed"`.\n'
+                  - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
-                "  - For classification logits, use `LLM.classify(...)` "
+                  - For similarity scores, use `LLM.score(...)`.
-                'or `pooling_task="classify"`.\n'
+                  - For rewards, `pooling_task="classify"` or `pooling_task="token_classify"`.
-                "  - For similarity scores, use `LLM.score(...)`.\n"
+                  - For token classification, use `pooling_task="token_classify"`.
-                "  - For rewards, use `LLM.reward(...)` "
+                  - For multi-vector retrieval, use `pooling_task="token_embed"`.
-                'or `pooling_task="token_classify"`\n'
+                """  # noqa: E501
-                "  - For token classification, "
-                'use `pooling_task="token_classify"`\n'
-                '  - For multi-vector retrieval, use `pooling_task="token_embed"`'
            )
        if (
@@ -1340,6 +1337,11 @@ class LLM:
            A list of `PoolingRequestOutput` objects containing the
            pooled hidden states in the same order as the input prompts.
        """
+        logger.warning_once(
+            "`llm.reward` api is deprecated and will be removed in v0.23. "
+            'Please use `LLM.encode` with `pooling_task="classify"` or '
+            '`pooling_task="token_classify"` instead.'
+        )
        return self.encode(
            prompts,
            use_tqdm=use_tqdm,