Merge tag 'v0.18.1rc0' into v0.18.1rc0-ori

0da93439 · zhuwenwen · 25f2f756 · 298e5108 · 0da93439 · 0da93439
Commit 0da93439 authored Mar 26, 2026 by zhuwenwen
20 changed files
--- a/docker/Dockerfile.xpu
+++ b/docker/Dockerfile.xpu
@@ -76,19 +76,22 @@ ENV UV_LINK_MODE="copy"
 RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,src=requirements/common.txt,target=/workspace/vllm/requirements/common.txt \
    --mount=type=bind,src=requirements/xpu.txt,target=/workspace/vllm/requirements/xpu.txt \
+    --mount=type=bind,src=requirements/xpu-test.in,target=/workspace/vllm/requirements/xpu-test.in \
    uv pip install --upgrade pip && \
-    uv pip install -r requirements/xpu.txt
-
- # used for suffix method speculative decoding
- # build deps for proto + nanobind-based extensions to set up the build environment
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install grpcio-tools protobuf nanobind
- # arctic-inference is built from source which needs torch-xpu properly installed first
-RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install -r requirements/xpu.txt && \
+    uv pip compile /workspace/vllm/requirements/xpu-test.in \
+        -o /workspace/vllm/requirements/xpu-test.txt \
+        -c /workspace/vllm/requirements/xpu.txt \
+        --index-strategy unsafe-best-match \
+        --extra-index-url ${PIP_EXTRA_INDEX_URL} \
+        --python-version ${PYTHON_VERSION} && \
+    uv pip install grpcio-tools protobuf nanobind && \
    source /opt/intel/oneapi/setvars.sh --force && \
    source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force && \
-    export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH}" && \
-    uv pip install --no-build-isolation arctic-inference==0.1.1
+    export CMAKE_PREFIX_PATH="$(python3 -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH}" && \
+    uv pip install --no-build-isolation -r /workspace/vllm/requirements/xpu-test.txt
+
+

 ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"


--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -25,7 +25,7 @@ nav:
    - Models:
      - models/supported_models.md
      - models/generative_models.md
-      - models/pooling_models.md
+      - Pooling Models: models/pooling_models
      - models/extensions
      - Hardware Supported Models:
        - models/hardware_supported_models/*

--- a/docs/contributing/model/tests.md
+++ b/docs/contributing/model/tests.md
@@ -37,7 +37,7 @@ For [generative models](../../models/generative_models.md), there are two levels

 #### Pooling models

-For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
+For [pooling models](../../models/pooling_models/README.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).

 ### Multi-modal processing


--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -3,6 +3,10 @@
 !!! warning
    Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.

+!!! tip "Choosing a profiler"
+    - Use **Nsight Systems** for low-overhead, performance-critical profiling.
+    - Use **PyTorch Profiler** for medium-overhead profiling with richer debugging information (e.g., stack traces, memory, shapes). Note that enabling these features adds overhead and is not recommended for benchmarking.
+
 ## Profile with PyTorch Profiler

 We support tracing vLLM workers using different profilers. You can enable profiling by setting the `--profiler-config` flag when launching the server.

--- a/docs/design/attention_backends.md
+++ b/docs/design/attention_backends.md
@@ -127,8 +127,8 @@ Priority is **1 = highest** (tried first).
 | 3 | `FLASH_ATTN_MLA` |
 | 4 | `FLASHMLA` |
 | 5 | `TRITON_MLA` |
-| 6 | `FLASHMLA_SPARSE` |
-| 7 | `FLASHINFER_MLA_SPARSE` |
+| 6 | `FLASHINFER_MLA_SPARSE`**\*** |
+| 7 | `FLASHMLA_SPARSE` |

 **Ampere/Hopper (SM 8.x-9.x):**

@@ -140,6 +140,8 @@ Priority is **1 = highest** (tried first).
 | 4 | `TRITON_MLA` |
 | 5 | `FLASHMLA_SPARSE` |

+> **\*** For sparse MLA, FP8 KV cache always prefers `FLASHINFER_MLA_SPARSE`. With BF16 KV cache, `FLASHINFER_MLA_SPARSE` is preferred for low query-head counts (<= 16), while `FLASHMLA_SPARSE` is preferred otherwise.
+>
 > **Note:** ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.

 ## Legend
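
Review note on the sparse MLA footnote added to `docs/design/attention_backends.md`: the preference it describes can be read as a small decision rule. The sketch below is only an illustration of that rule as worded in the footnote. The function name and its two parameters are hypothetical and are not taken from vLLM's selector code.

```python
def preferred_sparse_mla_backend(kv_cache_is_fp8: bool, num_query_heads: int) -> str:
    """Illustrative reading of the footnote; not vLLM's actual selection code."""
    # FP8 KV cache: the footnote says FLASHINFER_MLA_SPARSE is always preferred.
    if kv_cache_is_fp8:
        return "FLASHINFER_MLA_SPARSE"
    # BF16 KV cache: low query-head counts (<= 16) still prefer FlashInfer.
    if num_query_heads <= 16:
        return "FLASHINFER_MLA_SPARSE"
    # BF16 KV cache with more than 16 query heads prefers FlashMLA.
    return "FLASHMLA_SPARSE"
```

Under this reading, a BF16 KV cache with 128 query heads would select `FLASHMLA_SPARSE`, while the same model with an FP8 KV cache would select `FLASHINFER_MLA_SPARSE`.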

--- a/docs/design/custom_op.md
+++ b/docs/design/custom_op.md
@@ -51,11 +51,8 @@ For example:
 **1. Attention:**

 ```python
--8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn"
-
 --8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention"

--8<-- "vllm/model_executor/models/deepencoder.py:rel_pos_attention"
 ```

 **2. Activation:**
@@ -170,6 +167,16 @@ For example:
 --8<-- "vllm/model_executor/layers/rotary_embedding/common.py:apply_rotary_emb"
 ```

+**12. Encoder:**
+
+```python
+--8<-- "vllm/model_executor/models/deepencoder2.py:qwen2_decoder"
+
+--8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn"
+
+--8<-- "vllm/model_executor/models/deepencoder.py:rel_pos_attention"
+```
+
 ## Guidelines for Implementing a New CustomOp

 ### Implement a New CustomOp in vLLM

--- a/docs/design/moe_kernel_features.md
+++ b/docs/design/moe_kernel_features.md
@@ -88,8 +88,8 @@ To be used with a particular `FusedMoEPrepareAndFinalizeModular` subclass, MoE k
 | flashinfer | standard | nvfp4,</br>fp8 | T | <sup>5</sup> | N | Y | [`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts] |
 | gpt oss triton | standard | N/A | N/A | <sup>5</sup> | Y | Y | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],</br>[`OAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.OAITritonExperts] |
 | marlin | standard,</br>batched | <sup>3</sup> / N/A | <sup>3</sup> / N/A | silu,</br>swigluoai | Y | Y | [`fused_marlin_moe`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.fused_marlin_moe],</br>[`MarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.MarlinExperts],</br>[`BatchedMarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.BatchedMarlinExperts] |
-| trtllm | standard | mxfp4,</br>nvfp4 | G(16),G(32) | <sup>5</sup> | N | Y | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts] |
-| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_experts] |
+| trtllm | standard | mxfp4,</br>nvfp4 | G(16),G(32) | <sup>5</sup> | N | Y | [`TrtLlmMxfp4ExpertsMonolithic`][vllm.model_executor.layers.fused_moe.experts.trtllm_mxfp4_moe.TrtLlmMxfp4ExpertsMonolithic],</br>[`TrtLlmMxfp4ExpertsModular`][vllm.model_executor.layers.fused_moe.experts.trtllm_mxfp4_moe.TrtLlmMxfp4ExpertsModular],</br>[`TrtLlmNvFp4ExpertsMonolithic`][vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe.TrtLlmNvFp4ExpertsMonolithic],</br>[`TrtLlmNvFp4ExpertsModular`][vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe.TrtLlmNvFp4ExpertsModular] |
+| rocm aiter moe | standard | mxfp4,</br>fp8 | G(32),G(128),A,T | silu, gelu,</br>swigluoai | Y | N | `rocm_aiter_fused_experts`,</br>`AiterExperts` |
 | cpu_fused_moe | standard | N/A | N/A | silu | N | N | [`CPUFusedMOE`][vllm.model_executor.layers.fused_moe.cpu_fused_moe.CPUFusedMOE] |
 | naive batched<sup>4</sup> | batched | int8,</br>fp8 | G,A,T | silu, gelu | <sup>6</sup> | Y | [`NaiveBatchedExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.NaiveBatchedExperts] |

@@ -103,7 +103,7 @@ To be used with a particular `FusedMoEPrepareAndFinalizeModular` subclass, MoE k

 ## Modular Kernel "families"

-The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.
+The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts.

 | backend | `FusedMoEPrepareAndFinalizeModular` subclasses | `FusedMoEExpertsModular` subclasses |
 | ------- | ---------------------------------------------- | ----------------------------------- |

--- a/docs/design/torch_compile_multimodal.md
+++ b/docs/design/torch_compile_multimodal.md
@@ -29,10 +29,9 @@ To compile a multimodal component such as an encoder, we follow the same mechani
 1. The `@support_torch_compile` decorator should include `enable_if=should_torch_compile_mm_encoder`. This will gate the compilation behind our
 `compile_mm_encoder` configuration

-2. `with set_model_tag("<component_name>", is_encoder=True)` context manager should be used around the nn.Module's instantiation. Since torch.compile
-relies on caching artifacts to reduce start time, we must properly propagate the `<component_name>` information to the cache in order to avoid collisions
-with the LLM text-backbone, or other instances of the same artifact (as is the case with vision block). `is_encoder=True` is also needed for encoder
-components (see Compile Range Integration).
+2. The `@support_torch_compile` decorator should include `is_encoder=True` for encoder components. This is needed for compile range integration
+(see Compile Range Integration). The decorator automatically uses the class name as the cache directory prefix, avoiding collisions between
+independently compiled sub-modules (e.g. vision encoder components vs the text backbone).

 ### CompilationConfig

@@ -57,8 +56,8 @@ tradeoff
 ### Compile ranges

 The torch.compile integration will try to rely on max_batch_size to infer compilation ranges for dynamic shapes; however, for modules used in the encoder, this
-shape can be difficult to infer due to the unspecified range of shapes the encoder may see as input. Therefore, we rely on `is_encoder=True` in the `set_model_tag`
-to alert torch.compile to the fact that this range cannot be inferred, and we default to the range (1, MAX_INT).
+shape can be difficult to infer due to the unspecified range of shapes the encoder may see as input. Therefore, we rely on `is_encoder=True` in the
+`@support_torch_compile` decorator to alert torch.compile to the fact that this range cannot be inferred, and we default to the range (1, MAX_INT).

 !!! note
    We may seek to tighten this range for better performance in the future

--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -36,14 +36,14 @@ th:not(:first-child) {
 }
 </style>

-| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](speculative_decoding/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
+| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](speculative_decoding/README.md) | CUDA graph | [pooling](../models/pooling_models/README.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
 | [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
 | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
 | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
 | [SD](speculative_decoding/README.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
 | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
-| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
+| [pooling](../models/pooling_models/README.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
 | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [❌](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
 | <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | |
 | <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | |
@@ -66,7 +66,7 @@ th:not(:first-child) {
 | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | [SD](speculative_decoding/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
 | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
-| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [pooling](../models/pooling_models/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
 | [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ |

--- a/docs/features/lora.md
+++ b/docs/features/lora.md
@@ -388,4 +388,19 @@ vllm serve model --enable-lora --max-lora-rank 64

 # Bad: unnecessarily high, wastes memory
 vllm serve model --enable-lora --max-lora-rank 256
-```
\ No newline at end of file
+
+```
+
+### Restricting LoRA to Specific Modules
+
+The `--lora-target-modules` parameter allows you to restrict which model modules have LoRA applied at deployment time. This is useful for performance tuning when you only need LoRA on specific layers:
+
+```bash
+# Apply LoRA only to output projection layers
+vllm serve model --enable-lora --lora-target-modules o_proj
+
+# Apply LoRA to multiple specific modules
+vllm serve model --enable-lora --lora-target-modules o_proj qkv_proj down_proj
+```
+
+When `--lora-target-modules` is not specified, LoRA will be applied to all supported modules in the model. This parameter accepts module suffixes (the last component of the module name), such as `o_proj`, `qkv_proj`, `gate_proj`, etc.
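
Review note on the `--lora-target-modules` text added to `docs/features/lora.md`: the "module suffix" matching it describes can be sketched as follows. This is a minimal illustration of matching on the last component of a dotted module name. The helper name `applies_lora` and the example module paths are hypothetical, and vLLM's real matching logic may differ.

```python
from typing import Optional, Sequence


def applies_lora(module_name: str, target_modules: Optional[Sequence[str]]) -> bool:
    """Return True if LoRA would apply to `module_name` under suffix matching."""
    # No restriction given: every supported module is eligible.
    if target_modules is None:
        return True
    # Compare only the last dotted component, e.g. "o_proj".
    suffix = module_name.rsplit(".", 1)[-1]
    return suffix in target_modules


# Hypothetical module names, for illustration only.
names = [
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.self_attn.qkv_proj",
    "model.layers.0.mlp.gate_proj",
]
selected = [n for n in names if applies_lora(n, ["o_proj", "qkv_proj"])]
```

Here `selected` keeps the two attention projections and drops `gate_proj`, which mirrors the second `vllm serve` example minus `down_proj`.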
--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -5,7 +5,7 @@ vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.
 Reasoning models return an additional `reasoning` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.

 !!! warning
-    `reasoning` used to be called `reasoning_content`. For now, `reasoning_content` will continue to work. However, we encourage you to migrate to `reasoning` in case `reasoning_content` is removed in future.
+    `reasoning` used to be called `reasoning_content`. To migrate, directly replace `reasoning_content` with `reasoning`.

 ## Supported Models


--- a/docs/features/tool_calling.md
+++ b/docs/features/tool_calling.md
@@ -107,6 +107,27 @@ vLLM supports the `tool_choice='none'` option in the chat completion API. When t
 !!! note
    When tools are specified in the request, vLLM includes tool definitions in the prompt by default, regardless of the `tool_choice` setting. To exclude tool definitions when `tool_choice='none'`, use the `--exclude-tools-when-tool-choice-none` option.

+## Constrained Decoding Behavior
+
+Whether vLLM enforces the tool parameter schema during generation depends on the `tool_choice` mode:
+
+| `tool_choice` value | Schema-constrained decoding | Behavior |
+| --- | --- | --- |
+| Named function | Yes (via structured outputs backend) | Arguments are guaranteed to be valid JSON conforming to the function's parameter schema. |
+| `"required"` | Yes (via structured outputs backend) | Same as named function. The model must produce at least one tool call. |
+| `"auto"` | No | The model generates freely. A tool-call parser extracts tool calls from the raw text. Arguments may be malformed or not match the schema. |
+| `"none"` | N/A | No tool calls are produced. |
+
+When schema conformance matters, prefer `tool_choice="required"` or named function calling over `"auto"`.
+
+### Strict Mode (`strict` parameter)
+
+The [OpenAI API](https://platform.openai.com/docs/guides/function-calling#strict-mode) supports a `strict` field on function definitions. When set to `true`, OpenAI uses constrained decoding to guarantee that tool-call arguments match the function schema, even in `tool_choice="auto"` mode.
+
+vLLM **does not implement** `strict` mode today. The `strict` field is accepted in requests (to avoid breaking clients that set it), but it has no effect on decoding behavior. In auto mode, argument validity depends entirely on the model's output quality and the parser's extraction logic.
+
+Tracking issues: [#15526](https://github.com/vllm-project/vllm/issues/15526), [#16313](https://github.com/vllm-project/vllm/issues/16313).
+
 ## Automatic Function Calling

 To enable this feature, you should set the following flags:
@@ -124,6 +145,9 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso

 If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!

+!!! note
+    With `tool_choice="auto"`, tool-call arguments are extracted from the model's raw text output by the selected parser. No schema-level constraint is applied during decoding, so arguments may occasionally be malformed or violate the function's parameter schema. See [Constrained Decoding Behavior](#constrained-decoding-behavior) for details.
+
 ### Hermes Models (`hermes`)

 All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
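
Review note on the "Constrained Decoding Behavior" section added to `docs/features/tool_calling.md`: because `tool_choice="auto"` applies no schema constraint, a client may want to check extracted arguments before acting on them. The sketch below uses only the standard library and covers just JSON validity, `required` keys, and top-level primitive types. It is not a full JSON Schema validator, and the helper name `check_tool_arguments` is hypothetical.

```python
import json

# Mapping from JSON Schema primitive type names to Python types.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_tool_arguments(raw_arguments: str, parameters: dict) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    for name in parameters.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")

    properties = parameters.get("properties", {})
    for name, value in args.items():
        expected = properties.get(name, {}).get("type")
        py_type = _JSON_TYPES.get(expected)
        if py_type is None:
            continue  # unknown or unspecified type: skip the check
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(value, py_type):
            problems.append(f"argument {name!r} should be of type {expected}")
    return problems
```

A caller could run this on `tool_call.function.arguments` and retry or fall back when the returned list is not empty. With a named function or `tool_choice="required"`, the section states that the arguments already conform to the schema.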

--- a/docs/mkdocs/hooks/generate_examples.py
+++ b/docs/mkdocs/hooks/generate_examples.py
@@ -23,15 +23,18 @@ def title(text: str) -> str:
    # Custom substitutions
    subs = {
        "io": "IO",
-        "api": "API",
+        "rl": "RL",
+        "api(s?)": r"API\1",
        "cli": "CLI",
        "cpu": "CPU",
+        "ipc": "IPC",
        "llm": "LLM",
        "mae": "MAE",
        "ner": "NER",
        "tpu": "TPU",
        "gguf": "GGUF",
        "lora": "LoRA",
+        "nccl": "NCCL",
        "rlhf": "RLHF",
        "vllm": "vLLM",
        "openai": "OpenAI",
@@ -196,6 +199,11 @@ class Example:


 def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+    # Monkey-patch dirname_to_title in awesome-nav so that sub-directory names are
+    # title-cased (e.g. "Offline Inference" instead of "Offline inference").
+    import mkdocs_awesome_nav.nav.directory as _nav_dir
+
+    _nav_dir.dirname_to_title = title
    logger.info("Generating example documentation")
    logger.debug("Root directory: %s", ROOT_DIR.resolve())
    logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
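
Review note on the `subs` change in `docs/mkdocs/hooks/generate_examples.py`: the new `"api(s?)": r"API\1"` entry only makes sense if the keys are treated as regular expressions, so that the optional plural `s` is captured and re-emitted. The body of `title` is not part of this diff, so the sketch below is an assumption about how such a table could be applied (whole-word, case-insensitive matching after title-casing), not the hook's actual code.

```python
import re

# A subset of the substitutions shown in the diff.
subs = {
    "io": "IO",
    "rl": "RL",
    "api(s?)": r"API\1",
    "ipc": "IPC",
    "nccl": "NCCL",
    "vllm": "vLLM",
}


def title(text: str) -> str:
    """Title-case a directory or file name, then fix known acronyms."""
    text = text.replace("_", " ").replace("-", " ").title()
    for pattern, repl in subs.items():
        # \b keeps "rl" from matching inside longer words such as "rlhf".
        text = re.sub(rf"\b{pattern}\b", repl, text, flags=re.IGNORECASE)
    return text
```

With this approach `"apis"` becomes `"APIs"` because the captured `s` is restored by `\1`, while a plain `"api": "API"` entry would leave the plural form untouched under whole-word matching.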

--- a/docs/mkdocs/hooks/url_schemes.py
+++ b/docs/mkdocs/hooks/url_schemes.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-MkDocs hook to enable the following links to render correctly:
+MkDocs hook + markdown extension to enable the following links to render correctly,
+including inside content included via pymdownx.snippets:

 - Relative file links outside of the `docs/` directory, e.g.:
    - [Text](../some_file.py)
@@ -12,13 +13,17 @@ MkDocs hook to enable the following links to render correctly:
        e.g. <...pull/123> -> [Pull Request #123](.../pull/123)
    - Works for external repos too by including the `owner/repo` in the link title

-The goal is to simplify cross-referencing common GitHub resources
-in project docs.
+The link replacement runs as a markdown preprocessor (priority 25) so that it executes
+after pymdownx.snippets (priority 32) has expanded all included content.
+The on_page_markdown hook passes the current page context to the preprocessor before
+each page is converted.
 """

 from pathlib import Path

 import regex as re
+from markdown import Extension
+from markdown.preprocessors import Preprocessor
 from mkdocs.config.defaults import MkDocsConfig
 from mkdocs.structure.files import Files
 from mkdocs.structure.pages import Page
@@ -26,7 +31,6 @@ from mkdocs.structure.pages import Page
 ROOT_DIR = Path(__file__).parent.parent.parent.parent.resolve()
 DOC_DIR = ROOT_DIR / "docs"

-
 gh_icon = ":octicons-mark-github-16:"

 # Regex pieces
@@ -48,46 +52,90 @@ github_link = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")
 relative_link = re.compile(rf"\[{TITLE}\]\({RELATIVE}\)")


+class UrlSchemesPreprocessor(Preprocessor):
+    """Preprocessor that runs after pymdownx.snippets to process all links."""
+
+    def __init__(self, md, ext):
+        super().__init__(md)
+        self.ext = ext
+
+    def run(self, lines):
+        page = self.ext.page
+        if page is None or getattr(page.file, "abs_src_path", None) is None:
+            return lines
+
+        def replace_relative_link(match: re.Match) -> str:
+            """
+            Replace relative file links with URLs if they point outside the docs dir.
+            """
+            title = match.group("title")
+            path = match.group("path")
+            path = (Path(page.file.abs_src_path).parent / path).resolve()
+            fragment = match.group("fragment") or ""
+
+            # Check if the path exists and is outside the docs dir
+            if not path.exists() or path.is_relative_to(DOC_DIR):
+                return match.group(0)
+
+            # Files and directories have different URL schemes on GitHub
+            slug = "tree/main" if path.is_dir() else "blob/main"
+
+            path = path.relative_to(ROOT_DIR)
+            url = f"https://github.com/vllm-project/vllm/{slug}/{path}{fragment}"
+            return f"[{gh_icon} {title}]({url})"
+
+        def replace_github_link(match: re.Match) -> str:
+            """
+            Replace GitHub issue, PR, and project links with enhanced Markdown links.
+            """
+            repo = match.group("repo")
+            type = match.group("type")
+            number = match.group("number")
+            # Title and fragment could be None
+            title = match.group("title") or ""
+            fragment = match.group("fragment") or ""
+
+            # Use default titles for raw links
+            if not title:
+                title = TITLES[type]
+                if "vllm-project" not in repo:
+                    title += repo
+                title += f"#{number}"
+
+            url = f"https://github.com/{repo}/{type}/{number}{fragment}"
+            return f"[{gh_icon} {title}]({url})"
+
+        markdown = "\n".join(lines)
+        markdown = relative_link.sub(replace_relative_link, markdown)
+        markdown = github_link.sub(replace_github_link, markdown)
+        return markdown.split("\n")
+
+
+class UrlSchemesExtension(Extension):
+    """Markdown extension that registers the URL schemes preprocessor."""
+
+    def __init__(self, **kwargs):
+        self.page = None
+        super().__init__(**kwargs)
+
+    def extendMarkdown(self, md):
+        # Priority 25 runs after pymdownx.snippets (priority 32)
+        md.preprocessors.register(UrlSchemesPreprocessor(md, self), "url_schemes", 25)
+
+
+# Singleton extension instance shared between the hook and the preprocessor.
+_ext = UrlSchemesExtension()
+
+
+def on_config(config: MkDocsConfig) -> MkDocsConfig:
+    """Register the URL schemes markdown extension."""
+    config["markdown_extensions"].append(_ext)
+    return config
+
+
 def on_page_markdown(
    markdown: str, *, page: Page, config: MkDocsConfig, files: Files
 ) -> str:
-    def replace_relative_link(match: re.Match) -> str:
-        """Replace relative file links with URLs if they point outside the docs dir."""
-        title = match.group("title")
-        path = match.group("path")
-        path = (Path(page.file.abs_src_path).parent / path).resolve()
-        fragment = match.group("fragment") or ""
-
-        # Check if the path exists and is outside the docs dir
-        if not path.exists() or path.is_relative_to(DOC_DIR):
-            return match.group(0)
-
-        # Files and directories have different URL schemes on GitHub
-        slug = "tree/main" if path.is_dir() else "blob/main"
-
-        path = path.relative_to(ROOT_DIR)
-        url = f"https://github.com/vllm-project/vllm/{slug}/{path}{fragment}"
-        return f"[{gh_icon} {title}]({url})"
-
-    def replace_github_link(match: re.Match) -> str:
-        """Replace GitHub issue, PR, and project links with enhanced Markdown links."""
-        repo = match.group("repo")
-        type = match.group("type")
-        number = match.group("number")
-        # Title and fragment could be None
-        title = match.group("title") or ""
-        fragment = match.group("fragment") or ""
-
-        # Use default titles for raw links
-        if not title:
-            title = TITLES[type]
-            if "vllm-project" not in repo:
-                title += repo
-            title += f"#{number}"
-
-        url = f"https://github.com/{repo}/{type}/{number}{fragment}"
-        return f"[{gh_icon} {title}]({url})"
-
-    markdown = relative_link.sub(replace_relative_link, markdown)
-    markdown = github_link.sub(replace_github_link, markdown)
+    """Pass the current page context to the preprocessor."""
+    _ext.page = page
    return markdown
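
Review note on the refactor of `docs/mkdocs/hooks/url_schemes.py`: a preprocessor receives the page as a list of lines, which is why `run` joins them, applies the substitutions, and splits again. The standard-library sketch below reproduces only that join, substitute, and split flow for bare GitHub autolinks. The `TITLES` values and the simplified `bare_link` pattern are hypothetical stand-ins, since the real definitions are outside this diff.

```python
import re

# Hypothetical default titles; the real TITLES mapping is not shown in the diff.
TITLES = {"issues": "Issue ", "pull": "Pull Request "}

# Simplified pattern covering bare autolinks only,
# e.g. <https://github.com/owner/repo/pull/123>.
bare_link = re.compile(
    r"<https://github\.com/(?P<repo>[^/\s]+/[^/\s]+)"
    r"/(?P<type>issues|pull)/(?P<number>\d+)>"
)


def replace_github_link(match: re.Match) -> str:
    repo, kind, number = match.group("repo", "type", "number")
    title = TITLES[kind]
    # External repositories get the owner/repo added to the title.
    if "vllm-project" not in repo:
        title += repo
    title += f"#{number}"
    return f"[{title}](https://github.com/{repo}/{kind}/{number})"


def run(lines: list) -> list:
    """Mimic Preprocessor.run: take a list of lines, return a list of lines."""
    markdown = "\n".join(lines)
    markdown = bare_link.sub(replace_github_link, markdown)
    return markdown.split("\n")
```

Because the substitution operates on the joined text, links inside content pulled in by an earlier preprocessor are rewritten as well, which is the stated motivation for moving the logic out of `on_page_markdown`.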
--- a/docs/models/pooling_models/README.md
+++ b/docs/models/pooling_models/README.md
+# Pooling Models
+
+!!! note
+    We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.
+
+    We plan to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
+
+## What are pooling models?
+
+Natural Language Processing (NLP) can be primarily divided into the following two types of tasks:
+
+- Natural Language Understanding (NLU)
+- Natural Language Generation (NLG)
+
+The generative models supported by vLLM cover a variety of task types, such as the large language models (LLMs) we are familiar with, multimodal models (VLM) that handle multimodal inputs like images, videos, and audio, speech-to-text transcription models, and real-time models that support streaming input. Their common feature is the ability to generate text. Taking it a step further, vLLM-Omni supports the generation of multimodal content, including images, videos, and audio.
+
+As the capabilities of generative models continue to improve, the boundaries of these models are also constantly expanding. However, certain application scenarios still require specialized small language models to efficiently complete specific tasks. These models typically have the following characteristics:
+
+- They do not require content generation.
+- They only need to perform very limited functions, without requiring strong generalization, creativity, or high intelligence.
+- They demand extremely low latency and may operate on cost-constrained hardware.
+- Text-only models typically have fewer than 1 billion parameters, while multimodal models generally have fewer than 10 billion parameters.
+
+Although these models are relatively small, they are still based on the Transformer architecture and are similar or even identical to today's most advanced large language models. Many recently released pooling models are also fine-tuned from large language models, so they benefit from continuing improvements in large models. This architectural similarity lets them reuse much of vLLM's infrastructure, and where they are compatible, we are happy to help them take advantage of vLLM's latest features as well.
+
+### Sequence-wise Task and Token-wise Task
+
+The key distinction between sequence-wise and token-wise tasks lies in their output granularity: a sequence-wise task produces a single result for the entire input sequence, whereas a token-wise task yields a result for each individual token in the sequence.
+
+There are also "plugin" tasks that let users customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).
+
+### Pooling Tasks
+
+| Pooling Tasks         | Granularity   | Outputs                                         |
+|-----------------------|---------------|-------------------------------------------------|
+| `classify` (see note) | Sequence-wise | probability vector of classes for each sequence |
+| `embed`               | Sequence-wise | vector representations for each sequence        |
+| `token_classify`      | Token-wise    | probability vector of classes for each token    |
+| `token_embed`         | Token-wise    | vector representations for each token           |
+
+!!! note
+    Within classification tasks, there is a specialized subcategory: cross-encoder (aka reranker) models. These are classification models that accept two prompts as input and have `num_labels` equal to 1.
+
+### Score Types
+
+Scoring models compute similarity scores between two input prompts. Three model types (aka `score_type`) are supported: `cross-encoder`, `late-interaction`, and `bi-encoder`.
+
+| Pooling Tasks         | Granularity   | Outputs                                      | Score Types        | scoring function         |
+|-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
+| `classify` (see note) | Sequence-wise | reranker score for each sequence             | `cross-encoder`    | linear classifier        |
+| `embed`               | Sequence-wise | vector representations for each sequence     | `bi-encoder`       | cosine similarity        |
+| `token_classify`      | Token-wise    | probability vector of classes for each token | N/A                | N/A                      |
+| `token_embed`         | Token-wise    | vector representations for each token        | `late-interaction` | late interaction(MaxSim) |
+
+!!! note
+    A classification model can be used as a scoring model, with its scoring API enabled, only when its `num_labels` is equal to 1.
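
Review note on the Score Types table: the two non-classifier scoring functions it names can be written out in a few lines. The sketch below is a plain-Python illustration using toy vectors, not vLLM code. It assumes token embeddings for MaxSim are already L2-normalised so that a dot product acts as the similarity, which is a common convention but an assumption here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """bi-encoder scoring: one embedding per sequence."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def maxsim(query_tokens, doc_tokens):
    """late-interaction scoring: one embedding per token.

    For each query token take its best-matching document token,
    then sum those maxima over the query.
    """
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)
```

The difference in inputs mirrors the granularity column: `cosine_similarity` consumes the sequence-wise output of `embed`, while `maxsim` consumes the token-wise output of `token_embed`.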
+
+### Pooling Usages
+
+| Pooling Usages              | Description                                                                                                                                             |
+|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Classification Usages       | Predicting which predefined category, class, or label best corresponds to a given input.                                                                |
+| Embedding Usages            | Converts unstructured data (text, images, audio, etc.) into structured numerical vectors (embeddings).                                                  |
+| Token Classification Usages | Token-wise classification                                                                                                                               |
+| Token Embedding Usages      | Token-wise embedding                                                                                                                                    |
+| Scoring Usages              | Computes similarity scores between two inputs. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`. |
+| Reward Usages               | Evaluates the quality of outputs generated by a language model, acting as a proxy for human preferences.                                                |
+
+We also have some special models that support multiple pooling tasks, or have specific usage scenarios, or support special inputs and outputs.
+
+For more detailed information, please refer to the links below.
+
+- [Classification Usages](classify.md)
+- [Embedding Usages](embed.md)
+- [Reward Usages](reward.md)
+- [Token Classification Usages](token_classify.md)
+- [Token Embedding Usages](token_embed.md)
+- [Scoring Usages](scoring.md)
+- [Specific Model Examples](specific_models.md)
+
+## Offline Inference
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs.
+
+### Offline APIs corresponding to pooling tasks
+
+| Task             | APIs                                                                                         |
+|------------------|----------------------------------------------------------------------------------------------|
+| `embed`          | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)`                  |
+| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`, `LLM.score(...)` (see note) |
+| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`                          |
+| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)`                              |
+| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                                     |
+
+!!! note
+    A classification model can be used as a scoring model, with the scoring API enabled, only when its `num_labels` is 1.
+
+### `LLM.classify`
+
+The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
+It is primarily designed for [classification models](classify.md).
+For more information about `LLM.classify`, see [this page](classify.md#offline-inference).
+
+### `LLM.embed`
+
+The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
+It is primarily designed for [embedding models](embed.md).
+For more information about `LLM.embed`, see [this page](embed.md#offline-inference).
+
+### `LLM.score`
+
+The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
+It is primarily designed for [score models](scoring.md).
+
+### `LLM.encode`
+
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
+
+When using `LLM.encode`, either set the task explicitly or use one of the more specific methods; refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+
+### Examples
+
+```python
+from vllm import LLM
+
+llm = LLM(model="intfloat/e5-small", runner="pooling")
+(output,) = llm.encode("Hello, my name is", pooling_task="embed")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+## Online Serving
+
+The online server provides endpoints that correspond to the offline APIs:
+
+- Corresponding to `LLM.embed`:
+    - [Cohere Embed API](embed.md#cohere-embed-api) (`/v2/embed`)
+    - [OpenAI-compatible Embeddings API](embed.md#openai-compatible-embeddings-api) (`/v1/embeddings`)
+- Corresponding to `LLM.classify`:
+    - [Classification API](classify.md#online-serving) (`/classify`)
+- Corresponding to `LLM.score`:
+    - [Score API](scoring.md#score-api) (`/score`)
+    - [Rerank API](scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
+- Pooling API (`/pooling`) is similar to `LLM.encode`, being applicable to all types of pooling models.
+
+The rest of this section introduces the Pooling API. For other APIs, please refer to the links above.
+
+### Pooling API
+
+Our Pooling API (`/pooling`) is similar to `LLM.encode`, being applicable to all types of pooling models.
+
+The input format is the same as [Embeddings API](embed.md#openai-compatible-embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
+
+When using the Pooling API, either set the task explicitly or use one of the more specific APIs; refer to the [table above](#offline-apis-corresponding-to-pooling-tasks).
+
+Code example: [examples/pooling/pooling/pooling_online.py](../../../examples/pooling/pooling/pooling_online.py)
+
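Because the Pooling API output may be nested to different depths, it can help to inspect its shape before using it. The sketch below is only an illustration: `nested_shape` is a hypothetical helper, the sample values are made up, and it assumes rectangular nesting (it looks only at the first element of each level).

```python
def nested_shape(data):
    """Return the lengths along each nesting level of a list of lists."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return tuple(shape)


# Hypothetical `data` fields, shaped like a sequence-wise and a token-wise result.
sequence_wise = [0.1, 0.2, 0.3]
token_wise = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

print(nested_shape(sequence_wise))
print(nested_shape(token_wise))
```

A one-element shape corresponds to a flat list of floats, as returned by the Embeddings API, while a two-element shape corresponds to one vector per token.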
+### Examples
+
+```python
+# start a supported embeddings model server with `vllm serve`, e.g.
+# vllm serve intfloat/e5-small
+import requests
+
+host = "localhost"
+port = "8000"
+model_name = "intfloat/e5-small"
+
+api_url = f"http://{host}:{port}/pooling"
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+prompt = {"model": model_name, "input": prompts, "task": "embed"}
+
+response = requests.post(api_url, json=prompt)
+
+for output in response.json()["data"]:
+    data = output["data"]
+    print(f"Data: {data!r} (size={len(data)})")
+```
+
+## Configuration
+
+In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
+before returning them.
+
+### Model Runner
+
+Run a model in pooling mode via the option `--runner pooling`.
+
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the appropriate model runner via `--runner auto`.
+
+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture                                    | `--convert` | Supported pooling tasks      |
+|-------------------------------------------------|-------------|------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `token_embed`, `embed`       |
+| `*ForRewardModeling`, `*RewardModel`            | `embed`     | `token_embed`, `embed`       |
+| `*For*Classification`, `*ClassificationModel`   | `classify`  | `token_classify`, `classify` |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.
+
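The architecture patterns in the table above behave like shell-style wildcards, which can be sketched with Python's standard `fnmatch` module. This is not vLLM's resolution logic: `guess_convert_type` is a hypothetical helper, and the order in which patterns are tried is an assumption of this sketch.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table above. Checking the classification patterns
# first is an assumption of this sketch, needed because `*ClassificationModel`
# would otherwise also match the broader `*Model` pattern.
CONVERT_PATTERNS = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("embed", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]


def guess_convert_type(architecture):
    """Return the `--convert` value the table suggests, or None if no row matches."""
    for convert_type, patterns in CONVERT_PATTERNS:
        if any(fnmatchcase(architecture, p) for p in patterns):
            return convert_type
    return None


# Names used purely as pattern-matching inputs.
for arch in ["Qwen2ForSequenceClassification", "Qwen2ForRewardModel", "LlamaModel"]:
    print(arch, "->", guess_convert_type(arch))
```

A `None` result only means that no row of the table matches the name; it says nothing about whether vLLM can convert that model.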
+### Pooler Configuration
+
+#### Predefined models
+
+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--pooler-config` option.
+
+#### Converted models
+
+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:
+
+| Task       | Pooling Type | Normalization | Softmax |
+| ---------- | ------------ | ------------- | ------- |
+| `embed`    | `LAST`       | ✅︎            | ❌      |
+| `classify` | `LAST`       | ❌            | ✅︎      |
+
+When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
+their Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
+
+You can further customize this via the `--pooler-config` option,
+which takes priority over both the model's and Sentence Transformers' defaults.
+
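The default attributes for converted models can be pictured with a small plain-Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, the values are made up, and for `classify` the last token's vector is treated directly as class logits, whereas a real model applies its classification head first.

```python
import math


def last_pool(token_states):
    """LAST pooling: keep only the vector of the final token."""
    return token_states[-1]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(vec):
    shift = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


# Toy per-token vectors for a three-token prompt (values are made up).
token_states = [[0.5, 1.0], [2.0, -1.0], [3.0, 4.0]]

embed_output = l2_normalize(last_pool(token_states))  # `embed` defaults
classify_output = softmax(last_pool(token_states))    # `classify` defaults

print(embed_output)
print(classify_output)
```

The `embed` result has unit length and the `classify` result sums to 1, which is the practical difference between the Normalization and Softmax columns.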
+## Removed Features
+
+### Encode task
+
+We have split the `encode` task into two more specific token-wise tasks: `token_embed` and `token_classify`:
+
+- `token_embed` is the same as `embed`, using normalization as the activation.
+- `token_classify` is the same as `classify`, by default using softmax as the activation.
+
+Pooling models now support all pooling tasks by default, so you can use them without any extra settings.
+
+- For extracting hidden states, prefer the `token_embed` task.
+- For Named Entity Recognition (NER) and reward models, prefer the `token_classify` task.
+
+### Score task
+
+The `score` task is deprecated and will be removed in v0.20. Please use `classify` instead. A classification model can be used as a scoring model, with the scoring API enabled, only when its `num_labels` is 1.
--- a/docs/models/pooling_models/classify.md
+++ b/docs/models/pooling_models/classify.md
+# Classification Usages
+
+Classification involves predicting which predefined category, class, or label best corresponds to a given input.
+
+## Summary
+
+- Model Usage: (sequence) classification
+- Pooling Task: `classify`
+- Offline APIs:
+    - `LLM.classify(...)`
+    - `LLM.encode(..., pooling_task="classify")`
+- Online APIs:
+    - [Classification API](classify.md#online-serving) (`/classify`)
+    - Pooling API (`/pooling`)
+
+The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
+
+Many classification models support both (sequence) classification and token classification. For further details on token classification, please refer to [this page](token_classify.md).
+
+A classification model can be used as a scoring model, with the scoring API enabled, only when its `num_labels` is 1. For details, please refer to [this page](scoring.md).
+
+## Typical Use Cases
+
+### Classification
+
+The most fundamental application of classification models is to categorize input data into predefined classes.
+
+## Supported Models
+
+### Text-only Models
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
+| `ErnieForSequenceClassification` | BERT-like Chinese ERNIE | `Forrest20231206/ernie-3.0-base-zh-cls` | | |
+| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | |
+| `Qwen2ForSequenceClassification`<sup>C</sup> | Qwen2-based | `jason9693/Qwen2.5-1.5B-apeach` | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
+
+### Multimodal Models
+
+!!! note
+    For more information about multimodal models inputs, see [this page](../supported_models.md#list-of-multimodal-language-models).
+
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
+| `Qwen2_5_VLForSequenceClassification`<sup>C</sup> | Qwen2_5_VL-based | T + I<sup>E+</sup> + V<sup>E+</sup> | `muziyongshixin/Qwen2.5-VL-7B-for-VideoCls` | | |
+| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
+\* Feature support is the same as that of the original model.
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+
+### Cross-encoder Models
+
+Cross-encoder (aka reranker) models are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1. Most classification models can also be used as [cross-encoder models](scoring.md#cross-encoder-models). For more information on cross-encoder models, please refer to [this page](scoring.md).
+
+--8<-- "docs/models/pooling_models/scoring.md:supported-cross-encoder-models"
+
+### Reward Models
+
+(Sequence) classification models can be used as reward models. For more information, see [Reward Models](reward.md).
+
+--8<-- "docs/models/pooling_models/reward.md:supported-sequence-reward-models"
+
+## Offline Inference
+
+### Pooling Parameters
+
+The following [pooling parameters][vllm.PoolingParams] are supported.
+
+```python
+--8<-- "vllm/pooling_params.py:common-pooling-params"
+--8<-- "vllm/pooling_params.py:classify-pooling-params"
+```
+
+### `LLM.classify`
+
+The [classify][vllm.LLM.classify] method outputs a probability vector for each prompt.
+
+```python
+from vllm import LLM
+
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
+(output,) = llm.classify("Hello, my name is")
+
+probs = output.outputs.probs
+print(f"Class Probabilities: {probs!r} (size={len(probs)})")
+```
+
+A code example can be found here: [examples/offline_inference/basic/classify.py](../../../examples/basic/offline_inference/classify.py)
+
+### `LLM.encode`
+
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
+
+Set `pooling_task="classify"` when using `LLM.encode` for classification models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
+(output,) = llm.encode("Hello, my name is", pooling_task="classify")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+## Online Serving
+
+### Classification API
+
+The online `/classify` API is similar to `LLM.classify`.
+
+#### Completion Parameters
+
+The following Classification API parameters are supported:
+
+??? code
+
+    ```python
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-params"
+    ```
+
+The following extra parameters are supported:
+
+??? code
+
+    ```python
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:completion-extra-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
+    ```
+
+#### Chat Parameters
+
+For chat-like input (i.e. if `messages` is passed), the following parameters are supported:
+
+??? code
+
+    ```python
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-params"
+    ```
+
+These extra parameters are supported instead:
+
+??? code
+
+    ```python
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:pooling-common-extra-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:chat-extra-params"
+    --8<-- "vllm/entrypoints/pooling/base/protocol.py:classify-extra-params"
+    ```
+
+#### Example Requests
+
+Code example: [examples/pooling/classify/classification_online.py](../../../examples/pooling/classify/classification_online.py)
+
+You can classify multiple texts by passing an array of strings:
+
+```bash
+curl -v "http://127.0.0.1:8000/classify" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "jason9693/Qwen2.5-1.5B-apeach",
+    "input": [
+      "Loved the new café—coffee was great.",
+      "This update broke everything. Frustrating."
+    ]
+  }'
+```
+
+??? console "Response"
+
+    ```json
+    {
+      "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
+      "object": "list",
+      "created": 1745383065,
+      "model": "jason9693/Qwen2.5-1.5B-apeach",
+      "data": [
+        {
+          "index": 0,
+          "label": "Default",
+          "probs": [
+            0.565970778465271,
+            0.4340292513370514
+          ],
+          "num_classes": 2
+        },
+        {
+          "index": 1,
+          "label": "Spoiled",
+          "probs": [
+            0.26448777318000793,
+            0.7355121970176697
+          ],
+          "num_classes": 2
+        }
+      ],
+      "usage": {
+        "prompt_tokens": 20,
+        "total_tokens": 20,
+        "completion_tokens": 0,
+        "prompt_tokens_details": null
+      }
+    }
+    ```
+
+You can also pass a string directly to the `input` field:
+
+```bash
+curl -v "http://127.0.0.1:8000/classify" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "jason9693/Qwen2.5-1.5B-apeach",
+    "input": "Loved the new café—coffee was great."
+  }'
+```
+
+??? console "Response"
+
+    ```json
+    {
+      "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
+      "object": "list",
+      "created": 1745383213,
+      "model": "jason9693/Qwen2.5-1.5B-apeach",
+      "data": [
+        {
+          "index": 0,
+          "label": "Default",
+          "probs": [
+            0.565970778465271,
+            0.4340292513370514
+          ],
+          "num_classes": 2
+        }
+      ],
+      "usage": {
+        "prompt_tokens": 10,
+        "total_tokens": 10,
+        "completion_tokens": 0,
+        "prompt_tokens_details": null
+      }
+    }
+    ```
+
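A response like the ones above can be reduced to one row per input. The sketch below uses only the `index`, `label`, and `probs` fields copied from the first sample response; `summarize` is a hypothetical helper. In this sample the returned label lines up with the highest-probability class, but the sketch does not assume that holds in general.

```python
import json

# Response body copied from the first example above (only the fields used here).
response_text = """
{
  "data": [
    {"index": 0, "label": "Default",
     "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
    {"index": 1, "label": "Spoiled",
     "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2}
  ]
}
"""


def summarize(response):
    """Return (index, label, top_class_id, top_probability) for each input."""
    rows = []
    for item in response["data"]:
        probs = item["probs"]
        # Position of the largest probability in the vector.
        top_class = max(range(len(probs)), key=probs.__getitem__)
        rows.append((item["index"], item["label"], top_class, probs[top_class]))
    return rows


for row in summarize(json.loads(response_text)):
    print(row)
```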
+## More examples
+
+More examples can be found here: [examples/pooling/classify](../../../examples/pooling/classify)
+
+## Supported Features
+
+### Enable/disable activation
+
+You can enable or disable activation via `use_activation`.
+
+### Problem type (e.g. `multi_label_classification`)
+
+You can set the problem type via the `problem_type` field in the Hugging Face config. The supported problem types are: `single_label_classification`, `multi_label_classification`, and `regression`.
+
+The implementation is aligned with the Transformers [ForSequenceClassificationLoss](https://github.com/huggingface/transformers/blob/57bb6db6ee4cfaccc45b8d474dfad5a17811ca60/src/transformers/loss/loss_utils.py#L92).
+
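The three problem types can be pictured as three different output activations. The pairing below is an assumption of this sketch: it mirrors the losses used by the linked Transformers function (cross-entropy for single-label, binary cross-entropy with logits for multi-label, mean squared error for regression) and is not taken from vLLM's code.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def identity(logits):
    return list(logits)


# Assumed pairing, mirroring the losses in the linked Transformers function.
ACTIVATION_BY_PROBLEM_TYPE = {
    "single_label_classification": softmax,
    "multi_label_classification": sigmoid,
    "regression": identity,
}

logits = [2.0, 0.0, -2.0]
for problem_type, activation in ACTIVATION_BY_PROBLEM_TYPE.items():
    print(problem_type, activation(logits))
```

Only the softmax output sums to 1. The sigmoid output scores each label independently, and the identity output leaves the raw values untouched.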
+### Logit bias
+
+You can modify the `logit_bias` (aka `sigmoid_normalize`) through the `logit_bias` parameter in `vllm.config.PoolerConfig`.
+
+## Removed Features
+
+### Remove softmax from PoolingParams
+
+We have already removed `softmax` and `activation` from PoolingParams. Instead, use `use_activation`, since we allow `classify` and `token_classify` to use any activation function.
--- a/docs/models/pooling_models/embed.md
+++ b/docs/models/pooling_models/embed.md
--- a/docs/models/pooling_models/reward.md
+++ b/docs/models/pooling_models/reward.md
+# Reward Usages
+
+A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.
+
+## Summary
+
+- Model Usage: reward
+- Pooling Task:
+
+| Model Types                        | Pooling Tasks  |
+|------------------------------------|----------------|
+| (sequence) (outcome) reward models | classify       |
+| token (outcome) reward models      | token_classify |
+| process reward models              | token_classify |
+
+- Offline APIs:
+    - `LLM.encode(..., pooling_task="...")`
+- Online APIs:
+    - Pooling API (`/pooling`)
+
+## Supported Models
+
+### Reward Models
+
+When sequence classification models are used as (sequence) (outcome) reward models, the usage and supported features are the same as for normal [classification models](classify.md).
+
+--8<-- [start:supported-sequence-reward-models]
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
+| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
+| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
+| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+
+--8<-- [end:supported-sequence-reward-models]
+
+### Token Reward Models
+
+The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
+
+When token classification models are used as token (outcome) reward models, the usage and supported features are the same as for normal [token classification models](token_classify.md).
+
+--8<-- [start:supported-token-reward-models]
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
+| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
+| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model].
+
+--8<-- [end:supported-token-reward-models]
+
+### Process Reward Models
+
+Process reward models evaluate the intermediate steps that are crucial to achieving the desired outcome.
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
+| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
+| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |
+
+!!! important
+    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
+    e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
+
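One way to read the `STEP` settings above is: keep only the positions whose token ID equals `step_tag_id`, and at those positions keep only the logits of the IDs listed in `returned_token_ids`. That interpretation, the function name `step_pool`, and the toy IDs below are all assumptions of this sketch rather than a description of vLLM's pooler.

```python
import json


def step_pool(token_ids, token_logits, step_tag_id, returned_token_ids):
    """Keep rows at step-tag positions, and only the requested vocabulary columns."""
    pooled = []
    for token_id, logits in zip(token_ids, token_logits):
        if token_id == step_tag_id:
            pooled.append([logits[i] for i in returned_token_ids])
    return pooled


# Toy prompt with a 3-entry vocabulary; token ID 2 plays the role of the step tag.
token_ids = [0, 2, 1, 2]
token_logits = [
    [0.1, 0.2, 0.3],
    [1.0, 2.0, 3.0],
    [0.4, 0.5, 0.6],
    [4.0, 5.0, 6.0],
]

print(step_pool(token_ids, token_logits, step_tag_id=2, returned_token_ids=[0, 1]))

# The same toy settings serialized as JSON, the format `--pooler-config` takes.
pooler_config = {"pooling_type": "STEP", "step_tag_id": 2, "returned_token_ids": [0, 1]}
print(json.dumps(pooler_config))
```

With real models, the step tag and returned token IDs depend on the tokenizer, so the values `123`, `456`, and `789` in the note above should be read as placeholders.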
+## Offline Inference
+
+### Pooling Parameters
+
+The following [pooling parameters][vllm.PoolingParams] are supported.
+
+```python
+--8<-- "vllm/pooling_params.py:common-pooling-params"
+--8<-- "vllm/pooling_params.py:classify-pooling-params"
+```
+
+### `LLM.encode`
+
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
+
+- Reward Models
+
+Set `pooling_task="classify"` when using `LLM.encode` for (sequence) (outcome) reward models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
+(output,) = llm.encode("Hello, my name is", pooling_task="classify")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+- Token Reward Models
+
+Set `pooling_task="token_classify"` when using `LLM.encode` for token (outcome) reward models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
+(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+- Process Reward Models
+
+Set `pooling_task="token_classify"` when using `LLM.encode` for process reward models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
+(output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+## Online Serving
+
+Please refer to the [pooling API](README.md#pooling-api). For the pooling task that corresponds to each reward model type, refer to the [table above](#summary).
--- a/docs/models/pooling_models/scoring.md
+++ b/docs/models/pooling_models/scoring.md
--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md