Merge tag 'v0.18.1rc0' into v0.18.1rc0-ori

Commit 0da93439 authored Mar 26, 2026 by zhuwenwen
20 changed files
--- a/docs/models/pooling_models/token_classify.md
+++ b/docs/models/pooling_models/token_classify.md
+# Token Classification Usages
+
+## Summary
+
+- Model Usage: token classification
+- Pooling Tasks: `token_classify`
+- Offline APIs:
+    - `LLM.encode(..., pooling_task="token_classify")`
+- Online APIs:
+    - Pooling API (`/pooling`)
+
+The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
+
+Many classification models support both (sequence) classification and token classification. For further details on (sequence) classification, please refer to [this page](classify.md).
+
+## Typical Use Cases
+
+### Named Entity Recognition (NER)
+
+For implementation examples, see:
+
+Offline: [examples/pooling/token_classify/ner_offline.py](../../../examples/pooling/token_classify/ner_offline.py)
+
+Online: [examples/pooling/token_classify/ner_online.py](../../../examples/pooling/token_classify/ner_online.py)
+
+### Sparse retrieval (lexical matching)
+
+The BAAI/bge-m3 model leverages token classification for sparse retrieval. For more information, see [this page](specific_models.md#baaibge-m3).
+
+## Supported Models
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | --------------------------- | --------------------------------------- |
+| `BertForTokenClassification` | bert-based | `boltuix/NeuroBERT-NER` (see note), etc. | | |
+| `ErnieForTokenClassification` | BERT-like Chinese ERNIE | `gyr66/Ernie-3.0-base-chinese-finetuned-ner` | | |
+| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
+| `Qwen3ForTokenClassification`<sup>C</sup> | Qwen3-based | `bd2lcco/Qwen3-0.6B-finetuned` | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))  
+\* Feature support is the same as that of the original model.
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+
+### As Reward Models
+
+Token classification models can also be used as reward models. For details on reward models, see [Reward Models](reward.md).
+
+--8<-- "docs/models/pooling_models/reward.md:supported-token-reward-models"
+
+## Offline Inference
+
+### Pooling Parameters
+
+The following [pooling parameters][vllm.PoolingParams] are supported.
+
+```python
+--8<-- "vllm/pooling_params.py:common-pooling-params"
+--8<-- "vllm/pooling_params.py:classify-pooling-params"
+```
+
+### `LLM.encode`
+
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
+
+Set `pooling_task="token_classify"` when using `LLM.encode` for token classification models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="boltuix/NeuroBERT-NER", runner="pooling")
+(output,) = llm.encode("Hello, my name is", pooling_task="token_classify")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
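
The per-token output can be turned into labels by taking an argmax over the class dimension. The following is a minimal sketch, assuming the scores are available as one row per token and one column per class, and that `id2label` mirrors the label mapping in the model's Hugging Face config. The helper name `decode_token_labels` and the sample values are illustrative and are not part of vLLM.

```python
def decode_token_labels(scores, id2label):
    """Return the highest-scoring label for each token.

    scores: one row per token, one column per class (illustrative data below).
    id2label: class index -> label name (assumed to match the model config).
    """
    labels = []
    for row in scores:
        # Index of the largest score in this token's row.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(id2label[best])
    return labels


# Illustrative scores for three tokens and three classes.
sample_scores = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
sample_id2label = {0: "O", 1: "B-PER", 2: "I-PER"}
print(decode_token_labels(sample_scores, sample_id2label))
```

If `data` is a tensor, `data.tolist()` should yield the nested-list form this helper expects.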
+
+## Online Serving
+
+Please refer to the [pooling API](README.md#pooling-api) and use `"task":"token_classify"`.
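
As a rough illustration, a request body for the Pooling API might be assembled as below. Only the `task` value comes from this page; the `model` and `input` field names, the helper `build_pooling_request`, and the server address in the comment are assumptions to check against the Pooling API reference.

```python
import json


def build_pooling_request(model, text, task):
    """Build a JSON body for `/pooling` (field names other than `task` are assumed)."""
    return {"model": model, "input": text, "task": task}


body = build_pooling_request(
    model="boltuix/NeuroBERT-NER",
    text="Hello, my name is",
    task="token_classify",
)
print(json.dumps(body, indent=2))

# Sending it, assuming a server is listening on localhost:8000:
#   import requests
#   requests.post("http://localhost:8000/pooling", json=body)
```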
+
+## More examples
+
+More examples can be found here: [examples/pooling/token_classify](../../../examples/pooling/token_classify)
+
+## Supported Features
+
+Token classification features should be consistent with (sequence) classification. For more information, see [this page](classify.md#supported-features).
--- a/docs/models/pooling_models/token_embed.md
+++ b/docs/models/pooling_models/token_embed.md
+# Token Embedding Usages
+
+## Summary
+
+- Model Usage: token embedding
+- Pooling Tasks: `token_embed`
+- Offline APIs:
+    - `LLM.encode(..., pooling_task="token_embed")`
+- Online APIs:
+    - Pooling API (`/pooling`)
+
+The difference between the (sequence) embedding task and the token embedding task is that (sequence) embedding outputs one embedding for each sequence, while token embedding outputs an embedding for each token.
+
+Many embedding models support both (sequence) embedding and token embedding. For further details on (sequence) embedding, please refer to [this page](embed.md).
+
+## Typical Use Cases
+
+### Multi-Vector Retrieval
+
+For implementation examples, see:
+
+Offline: [examples/pooling/token_embed/multi_vector_retrieval_offline.py](../../../examples/pooling/token_embed/multi_vector_retrieval_offline.py)
+
+Online: [examples/pooling/token_embed/multi_vector_retrieval_online.py](../../../examples/pooling/token_embed/multi_vector_retrieval_online.py)
+
+### Late interaction
+
+Similarity scores can be computed using late interaction between two input prompts via the score API. For more information, see [Score API](scoring.md).
+
+### Extract last hidden states
+
+Models of any architecture can be converted into embedding models using `--convert embed`. Token embedding can then be used to extract the last hidden states from these models.
+
+## Supported Models
+
+--8<-- [start:supported-token-embed-models]
+
+### Text-only Models
+
+| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
+| `ColBERTLfm2Model` | LFM2 | `LiquidAI/LFM2-ColBERT-350M` | | |
+| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` | | |
+| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` | | |
+| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1`, `colbert-ir/colbertv2.0` | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
+
+### Multimodal Models
+
+!!! note
+    For more information about multimodal model inputs, see [this page](../supported_models.md#list-of-multimodal-language-models).
+
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
+| ------------ | ------ | ----- | ----------------- | ------------------------------ | ------------------------------------------ |
+| `ColModernVBertForRetrieval` | ColModernVBERT | T / I | `ModernVBERT/colmodernvbert-merged` | | |
+| `ColPaliForRetrieval` | ColPali | T / I | `vidore/colpali-v1.3-hf` | | |
+| `ColQwen3` | Qwen3-VL | T / I | `TomoroAI/tomoro-colqwen3-embed-4b`, `TomoroAI/tomoro-colqwen3-embed-8b` | | |
+| `ColQwen3_5` | ColQwen3.5 | T + I + V | `athrael-soju/colqwen3.5-4.5B-v3` | | |
+| `OpsColQwen3Model` | Qwen3-VL | T / I | `OpenSearch-AI/Ops-Colqwen3-4B`, `OpenSearch-AI/Ops-Colqwen3-8B` | | |
+| `Qwen3VLNemotronEmbedModel` | Qwen3-VL | T / I | `nvidia/nemotron-colembed-vl-4b-v2`, `nvidia/nemotron-colembed-vl-8b-v2` | ✅︎ | ✅︎ |
+| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* |
+
+<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./README.md#model-conversion))  
+\* Feature support is the same as that of the original model.
+
+If your model is not in the above list, we will try to automatically convert the model using [as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model].
+
+--8<-- [end:supported-token-embed-models]
+
+## Offline Inference
+
+### Pooling Parameters
+
+The following [pooling parameters][vllm.PoolingParams] are supported.
+
+```python
+--8<-- "vllm/pooling_params.py:common-pooling-params"
+--8<-- "vllm/pooling_params.py:embed-pooling-params"
+```
+
+### `LLM.encode`
+
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
+
+Set `pooling_task="token_embed"` when using `LLM.encode` for token embedding models:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="answerdotai/answerai-colbert-small-v1", runner="pooling")
+(output,) = llm.encode("Hello, my name is", pooling_task="token_embed")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+### `LLM.score`
+
+The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
+
+All models that support the token embedding task also support using the score API to compute similarity scores by calculating the late interaction of two input prompts.
+
+```python
+from vllm import LLM
+
+llm = LLM(model="answerdotai/answerai-colbert-small-v1", runner="pooling")
+(output,) = llm.score(
+    "What is the capital of France?",
+    "The capital of Brazil is Brasilia.",
+)
+
+score = output.outputs.score
+print(f"Score: {score}")
+```
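
To make "late interaction" concrete, the sketch below computes a ColBERT-style MaxSim score from two sets of token embeddings: for each query token, take the maximum dot product against all document tokens, then sum those maxima. This is the usual formulation of late interaction. Whether vLLM normalizes vectors or computes the score in exactly this way is not stated here, so treat it as an illustration rather than the library's implementation. The function name and the toy vectors are made up.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the max dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        # Best-matching document token for this query token.
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total


# Toy 2-dimensional token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]
print(maxsim_score(query, doc))
```

With real models, the rows would be the per-token vectors returned by `LLM.encode(..., pooling_task="token_embed")` for each prompt.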
+
+## Online Serving
+
+Please refer to the [pooling API](README.md#pooling-api) and use `"task":"token_embed"`.
+
+## More examples
+
+More examples can be found here: [examples/pooling/token_embed](../../../examples/pooling/token_embed)
+
+## Supported Features
+
+Token embedding features should be consistent with (sequence) embedding. For more information, see [this page](embed.md#supported-features).
--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
--- a/docs/serving/expert_parallel_deployment.md
+++ b/docs/serving/expert_parallel_deployment.md
@@ -23,7 +23,6 @@ vLLM provides multiple communication backends for EP. Use `--all2all-backend` to
 | `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout, optimized for decode | Decode-dominated workloads, low-latency scenarios |
 | `flashinfer_nvlink_one_sided` | MNNVL systems | FlashInfer's one-sided A2A strategy for multi-node NVLink | High-throughput workloads |
 | `flashinfer_nvlink_two_sided` | MNNVL systems | FlashInfer's two-sided A2A strategy for multi-node NVLink | Systems with NVLink across nodes |
-| `naive` | Testing/debugging | Simple broadcast-based implementation | Debugging, not recommended for production |

 ## Single Node Deployment


--- a/docs/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
@@ -16,7 +16,7 @@ After initializing the `LLM` instance, use the available APIs to perform model i
 The available APIs depend on the model type:

 - [Generative models](../models/generative_models.md) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
+- [Pooling models](../models/pooling_models/README.md) output their hidden states directly.

 !!! info
    [API Reference](../api/README.md#offline-inference)

--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
--- a/docs/training/async_rl.md
+++ b/docs/training/async_rl.md
+# Async Reinforcement Learning
+
+## Overview
+
+In a standard RL training loop, generation and training happen sequentially: the policy generates rollouts, then training runs on those rollouts, and the cycle repeats. During generation the training accelerators sit idle, and vice versa.
+
+The **one-off pipelining** approach separates the generation and training phases into two parallel coroutines, allowing the model to generate new samples while simultaneously training on previously generated data. This can lead to better GPU utilization and greater training throughput.
+
+However, this overlap introduces a complication: weights must be updated in the inference engine mid-flight, while requests may still be in progress.
+
+## The Pause and Resume API
+
+To safely update weights while the inference engine is running, vLLM provides `pause_generation` and `resume_generation` methods. These let the trainer coordinate a clean window for weight synchronization without losing in-flight work.
+
+### pause_generation
+
+```python
+await engine.pause_generation(mode="keep", clear_cache=True)
+```
+
+The `mode` parameter controls how in-flight requests are handled:
+
+| Mode | Behavior |
+| ---- | -------- |
+| `"abort"` | Abort all in-flight requests immediately and return partial results (default) |
+| `"wait"` | Wait for all in-flight requests to finish before pausing |
+| `"keep"` | Freeze requests in the queue; they resume when `resume_generation` is called |
+
+The `clear_cache` parameter controls whether to clear the KV cache and prefix cache after pausing.
+
+### resume_generation
+
+```python
+await engine.resume_generation()
+```
+
+Resumes the scheduler after a pause. Any requests frozen with `mode="keep"` will continue generating.
+
+### HTTP Endpoints
+
+When using the vLLM HTTP server, the same functionality is available via:
+
+- `POST /pause?mode=keep` - Pause generation
+- `POST /resume` - Resume generation
+
+!!! note "Data Parallelism"
+    When using data parallelism with vLLM's **internal load balancer** (i.e. `data_parallel_backend="ray"`), pause and resume are handled automatically across all DP ranks -- a single call is sufficient. When using an **external load balancer** (i.e. multiple independent vLLM instances behind a proxy), you must send pause and resume requests to **every** engine instance individually before and after the weight update.
+
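For the external load balancer case, the per-instance requests can be generated from a list of base URLs. This is a small sketch that only builds the URLs for the `/pause` and `/resume` endpoints listed above. The helper `control_urls` and the instance addresses are hypothetical, and actually issuing the `POST` requests is left to whichever HTTP client you use.

```python
from urllib.parse import urlencode


def control_urls(base_urls, action, **params):
    """Return one control URL per engine instance, e.g. <base>/pause?mode=keep."""
    query = f"?{urlencode(params)}" if params else ""
    return [f"{base.rstrip('/')}/{action}{query}" for base in base_urls]


# Hypothetical addresses of two independent vLLM servers behind a proxy.
instances = ["http://10.0.0.1:8000", "http://10.0.0.2:8000/"]

# Pause every instance before the weight update ...
for url in control_urls(instances, "pause", mode="keep"):
    print("POST", url)

# ... and resume every instance afterwards.
for url in control_urls(instances, "resume"):
    print("POST", url)
```
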
+## Typical Async RL Flow
+
+A typical async RL loop with weight syncing looks like this:
+
+1. Start generating rollouts from the current policy
+2. Once the trainer has new weights to apply, pause generation with `mode="keep"`
+3. Sync the updated weights from the trainer to the inference engine (see [Weight Transfer](weight_transfer/README.md))
+4. Resume generation -- in-flight requests continue with the new weights
+5. Repeat
+
+The key insight is that requests paused with `mode="keep"` will produce tokens from the **old** weights before the pause and tokens from the **new** weights after resume. The `clear_cache` parameter controls whether the KV cache is invalidated during the pause. When `clear_cache=True`, previously cached key-value entries are discarded, so all tokens generated after resume will be computed entirely with the new weights. When `clear_cache=False`, existing KV cache entries are retained, meaning some tokens in context may still reflect the old weights (stale KV cache).
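
The pause, sync, and resume steps can be sketched with a stand-in engine. `StubEngine` below is not a vLLM class: it only records calls so the example runs without a GPU, and its `clear_cache` default is an assumption. The `pause_generation` and `resume_generation` names and the `mode` values come from this page; `sync_weights` and `fake_transfer` are hypothetical helpers.

```python
import asyncio


class StubEngine:
    """Stand-in for an inference engine; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    async def pause_generation(self, mode="abort", clear_cache=True):
        self.calls.append(("pause", mode, clear_cache))

    async def resume_generation(self):
        self.calls.append(("resume",))


async def sync_weights(engine, transfer_weights):
    """Pause with mode="keep", run the weight transfer, then always resume."""
    await engine.pause_generation(mode="keep", clear_cache=True)
    try:
        await transfer_weights()
    finally:
        # Resume even if the transfer fails, so frozen requests are not stuck.
        await engine.resume_generation()


async def main():
    engine = StubEngine()

    async def fake_transfer():
        # Placeholder for the real weight transfer step.
        engine.calls.append(("transfer",))

    await sync_weights(engine, fake_transfer)
    return engine.calls


calls = asyncio.run(main())
print(calls)
```

Resuming inside `finally` is a design choice for this sketch: it keeps frozen requests from waiting forever after a failed transfer, at the cost of continuing with whatever weights are currently loaded.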
+
+## Example
+
+The [async RLHF example](../examples/rl/rlhf_async_new_apis.md) demonstrates this pattern with `vllm.AsyncLLMEngine`, NCCL weight transfer, and mid-flight pause/resume with validation.
--- a/docs/training/rlhf.md
+++ b/docs/training/rlhf.md
@@ -16,11 +16,9 @@ The following open-source RL libraries use vLLM for fast rollouts (sorted alphab
 - [Unsloth](https://github.com/unslothai/unsloth)
 - [verl](https://github.com/volcengine/verl)

-See the following basic examples to get started if you don't want to use an existing library:
+For weight synchronization between training and inference, see the [Weight Transfer](weight_transfer/README.md) documentation, which covers the pluggable backend system with [NCCL](weight_transfer/nccl.md) (multi-GPU) and [IPC](weight_transfer/ipc.md) (same-GPU) engines.

- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
+For pipelining generation and training to improve GPU utilization and throughput, see the [Async Reinforcement Learning](async_rl.md) guide, which covers the pause/resume API for safely updating weights mid-flight.

 See the following notebooks showing how to use vLLM for GRPO:


--- a/docs/training/weight_transfer/README.md
+++ b/docs/training/weight_transfer/README.md
+# Weight Transfer
+
+vLLM provides a pluggable weight transfer system for synchronizing model weights from a training process to the inference engine during reinforcement learning (RL) workflows. This is essential for RLHF, GRPO, and other online RL methods where the policy model is iteratively updated during training and the updated weights must be reflected in the inference engine for rollout generation.
+
+## Architecture
+
+The weight transfer system follows a **two-phase protocol** with a pluggable backend design:
+
+1. **Initialization** (`init_weight_transfer_engine`): Establishes the communication channel between the trainer and inference workers. Called once before the training loop begins.
+2. **Weight Update** (`update_weights`): Transfers updated weights from the trainer to the inference engine. Called after each training step (or batch of steps).
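
The two phases can be pictured with a toy, in-process backend. The class below is purely illustrative: it reuses the method names from the list above, but its signatures are simplified assumptions, it copies plain Python values rather than tensors, and it does not reflect how the NCCL or IPC engines are implemented.

```python
class ToyTransferBackend:
    """Toy in-process backend: phase 1 opens a channel, phase 2 copies weights."""

    def __init__(self):
        self.initialized = False
        self.inference_weights = {}

    def init_weight_transfer_engine(self):
        # Phase 1: called once, before the training loop begins.
        self.initialized = True

    def update_weights(self, named_weights):
        # Phase 2: called after each training step (or batch of steps).
        if not self.initialized:
            raise RuntimeError("init_weight_transfer_engine must be called first")
        self.inference_weights.update(named_weights)


backend = ToyTransferBackend()
backend.init_weight_transfer_engine()
for step in range(3):
    trainer_weights = {"layer.weight": float(step)}  # placeholder for real tensors
    backend.update_weights(trainer_weights)
print(backend.inference_weights)
```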
+
+## Available Backends
+
+| Backend | Transport | Use Case |
+| ------- | --------- | -------- |
+| [NCCL](nccl.md) | NCCL broadcast | Separate GPUs for training and inference |
+| [IPC](ipc.md) | CUDA IPC handles | Colocated training and inference on same GPU |
+
+## Configuration
+
+Specify the weight transfer backend through `WeightTransferConfig`. The backend determines which engine handles the weight synchronization.
+
+### Programmatic (Offline Inference)
+
+```python
+from vllm import LLM
+from vllm.config import WeightTransferConfig
+
+llm = LLM(
+    model="my-model",
+    weight_transfer_config=WeightTransferConfig(backend="nccl"),  # or "ipc"
+)
+```
+
+### CLI (Online Serving)
+
+```bash
+vllm serve my-model \
+    --weight-transfer-config '{"backend": "nccl"}'
+```
+
+The `backend` field accepts `"nccl"` (default) or `"ipc"`.
+
+## API Endpoints
+
+When running vLLM as an HTTP server, the following endpoints are available for weight transfer:
+
+| Endpoint | Method | Description |
+| -------- | ------ | ----------- |
+| `/init_weight_transfer_engine` | POST | Initialize the weight transfer engine with backend-specific info |
+| `/update_weights` | POST | Trigger a weight update with backend-specific metadata |
+| `/pause` | POST | Pause generation before weight sync to handle inflight requests |
+| `/resume` | POST | Resume generation after weight sync |
+| `/get_world_size` | GET | Get the number of inference workers (useful for NCCL world size calculation) |
+
+!!! note
+    The HTTP weight transfer endpoints require `VLLM_SERVER_DEV_MODE=1` to be set.
+
+## Trainer-Side API
+
+Both backends provide static methods that the trainer calls to send weights. The general pattern is:
+
+```python
+# 1. Initialize the transfer engine (backend-specific)
+EngineClass.trainer_init(init_info)
+
+# 2. Send weights to inference workers
+EngineClass.trainer_send_weights(
+    iterator=model.named_parameters(),
+    trainer_args=backend_specific_args,
+)
+```
+
+See the [NCCL](nccl.md) and [IPC](ipc.md) pages for backend-specific trainer APIs and full examples.
+
+## Extending the System
+
+The weight transfer system is designed to be extensible. You can implement custom backends by subclassing `WeightTransferEngine` and registering them with the factory. See the [Base Class](base.md) page for details.
--- a/docs/training/weight_transfer/base.md
+++ b/docs/training/weight_transfer/base.md
--- a/docs/training/weight_transfer/ipc.md
+++ b/docs/training/weight_transfer/ipc.md
--- a/docs/training/weight_transfer/nccl.md
+++ b/docs/training/weight_transfer/nccl.md
--- a/examples/offline_inference/audio_language.py
+++ b/examples/offline_inference/audio_language.py
--- a/examples/offline_inference/rlhf.py
+++ b/examples/offline_inference/rlhf.py
--- a/examples/offline_inference/rlhf_colocate.py
+++ b/examples/offline_inference/rlhf_colocate.py
--- a/examples/offline_inference/rlhf_online_quant.py
+++ b/examples/offline_inference/rlhf_online_quant.py
--- a/examples/offline_inference/rlhf_utils.py
+++ b/examples/offline_inference/rlhf_utils.py
--- a/examples/online_serving/openai_chat_completion_client_for_multimodal.py
+++ b/examples/online_serving/openai_chat_completion_client_for_multimodal.py
@@ -20,9 +20,9 @@ run the script with
 python openai_chat_completion_client_for_multimodal.py --chat-type audio
 """

-import base64
 import os

+import pybase64 as base64
 import requests
 from openai import OpenAI
 from utils import get_first_model

--- a/examples/online_serving/openai_realtime_client.py
+++ b/examples/online_serving/openai_realtime_client.py
--- a/examples/online_serving/openai_realtime_microphone_client.py
+++ b/examples/online_serving/openai_realtime_microphone_client.py