Batch invariance is currently in beta. Some features are still under active development.
Track progress and planned improvements at <https://github.com/vllm-project/vllm/issues/27433>
This document shows how to enable batch invariance in vLLM. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch.
## Motivation
Batch invariance is crucial for several use cases:
- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.
## Hardware Requirements
Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:
- **H-series**: H100, H200
- **B-series**: B100, B200
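
A launch script may want to check this requirement before enabling the feature. The following is a minimal sketch rather than anything vLLM provides: `supports_batch_invariance` and `MIN_CAPABILITY` are hypothetical names, and the `(major, minor)` tuple is assumed to come from a call such as `torch.cuda.get_device_capability()`.

```python
# Hypothetical helper (not a vLLM API): checks the compute capability
# requirement stated above.
MIN_CAPABILITY = (9, 0)


def supports_batch_invariance(capability: tuple[int, int]) -> bool:
    """Return True if the (major, minor) capability is at least 9.0."""
    # Tuples compare element by element, so (10, 0) >= (9, 0) and (8, 9) < (9, 0).
    return tuple(capability) >= MIN_CAPABILITY


if __name__ == "__main__":
    print(supports_batch_invariance((9, 0)))  # H100-class GPU
    print(supports_batch_invariance((8, 0)))  # A100-class GPU
```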
## Enabling Batch Invariance
Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:
```bash
export VLLM_BATCH_INVARIANT=1
```
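
When vLLM is started from a Python script, the variable can be set for the child process only instead of being exported in the shell. This is a sketch under that assumption; `batch_invariant_env` is a hypothetical helper, not part of vLLM.

```python
import os
from typing import Mapping, Optional


def batch_invariant_env(base_env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with VLLM_BATCH_INVARIANT set to "1"."""
    # Copy so that the caller's mapping (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_BATCH_INVARIANT"] = "1"
    return env
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` or `subprocess.Popen` when launching the process.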
### Online Inference (Server Mode)
To start a vLLM server with batch invariance enabled:
Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).
## Implementation Details
When batch invariance is enabled, vLLM:
1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism (such as custom all-reduce operations in tensor parallel mode)
!!! note
    Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.
## Future Improvements
The batch invariance feature is under active development. Planned improvements include:
- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation
For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).
@@ -4,6 +4,9 @@ You can use vLLM *custom arguments* to pass in arguments which are not part of t
...
@@ -4,6 +4,9 @@ You can use vLLM *custom arguments* to pass in arguments which are not part of t
Custom arguments can be useful if, for example, you want to use a [custom logits processor](./custom_logitsprocs.md) without modifying the vLLM source code.
Custom arguments can be useful if, for example, you want to use a [custom logits processor](./custom_logitsprocs.md) without modifying the vLLM source code.
!!! note
    Make sure your custom logits processor implements `validate_params` for custom arguments. Otherwise, invalid custom arguments can cause unexpected behaviour.
## Offline Custom Arguments
## Offline Custom Arguments
Custom arguments passed to `SamplingParams.extra_args` as a `dict` will be visible to any code which has access to `SamplingParams`:
Custom arguments passed to `SamplingParams.extra_args` as a `dict` will be visible to any code which has access to `SamplingParams`:
* Raise `ValueError` if `SamplingParams` has invalid arguments (especially custom arguments) used by the logits processor.
* When a request is sent to the entrypoint, `validate_params()` validates `SamplingParams` and rejects requests with invalid arguments.
* **Note:** it is important to implement `validate_params()` so that invalid parameters for a custom logits processor are rejected. Otherwise, requests with invalid parameters can cause unexpected behaviour in the custom logits processor.
* `vllm_config`: engine configuration data structure
* `vllm_config`: engine configuration data structure
* `device`: hardware accelerator device info
* `device`: hardware accelerator device info
...
@@ -66,7 +71,7 @@ Logits processor `update_state()` implementations should assume the following mo
...
@@ -66,7 +71,7 @@ Logits processor `update_state()` implementations should assume the following mo
* **"Condense" the batch to be contiguous:** starting with the lowest-index empty slot (which was caused by a Remove), apply a Unidirectional Move from the current highest non-empty slot in the batch to fill the empty slot. Proceed with additional Unidirectional Move operations in order of increasing empty slot destination index and decreasing non-empty slot source index until the batch is contiguous
* **"Condense" the batch to be contiguous:** starting with the lowest-index empty slot (which was caused by a Remove), apply a Unidirectional Move from the current highest non-empty slot in the batch to fill the empty slot. Proceed with additional Unidirectional Move operations in order of increasing empty slot destination index and decreasing non-empty slot source index until the batch is contiguous
* **Shrink the batch:** a side-effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
* **Shrink the batch:** a side-effect of condensing the batch is that empty slots resulting from Remove operations are grouped in a contiguous block at the end of the batch array. Thus, after condensing, update `BatchUpdate.batch_size` to reflect the number of non-empty slots
5. Reorder the batch for improved efficiency. Depending on the attention backend implementation and the current characteristics of the batch, zero or more Swap Move operations may be applied to reorder the batch
5. Reorder the batch for improved efficiency. Depending on the attention backend implementation and the current characteristics of the batch, zero or more Swap Move operations may be applied to reorder the batch
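
The condense and shrink steps described above can be sketched in plain Python. This is an illustrative model only, not vLLM's implementation: the batch is assumed to be a list in which `None` marks an empty slot, and `condense` and the `(source, destination)` move tuples are hypothetical names.

```python
from typing import Optional


def condense(slots: list[Optional[str]]) -> tuple[list[tuple[int, int]], int]:
    """Fill empty slots in place and return (moves, new_batch_size).

    Each move is a (source_index, destination_index) pair standing in for
    one Unidirectional Move.
    """
    moves: list[tuple[int, int]] = []
    dst = 0                # candidate lowest-index empty slot
    src = len(slots) - 1   # candidate highest-index non-empty slot
    while True:
        while dst < len(slots) and slots[dst] is not None:
            dst += 1
        while src >= 0 and slots[src] is None:
            src -= 1
        if dst >= src:
            break  # every non-empty slot now sits before every empty slot
        slots[dst], slots[src] = slots[src], None
        moves.append((src, dst))
    # After condensing, src is the last non-empty index (or -1 if none remain).
    return moves, src + 1


if __name__ == "__main__":
    batch = ["a", None, "c", None, "e"]
    print(condense(batch))  # ([(4, 1)], 3)
    print(batch)            # ['a', 'e', 'c', None, None]
```

Destination indices increase and source indices decrease from one move to the next, which matches the ordering the text describes.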
...
@@ -93,7 +98,6 @@ The contrived example below implements a custom logits processor which consumes
...
@@ -93,7 +98,6 @@ The contrived example below implements a custom logits processor which consumes
While request-level logits processors are explicitly *not* supported in the vLLM engine, vLLM *does* provide a convenient process to wrap an existing `Callable` request-level logits processor and create a batch-level logits processor that is compatible with vLLM. The `Callable` must conform to the type annotation above; if your request-level logits processor has a different interface, then in order to wrap it, you may need to modify it or implement an additional wrapper layer to comply with the interface specification above.
While request-level logits processors are explicitly *not* supported in the vLLM engine, vLLM *does* provide a convenient process to wrap an existing `Callable` request-level logits processor and create a batch-level logits processor that is compatible with vLLM. The `Callable` must conform to the type annotation above; if your request-level logits processor has a different interface, then in order to wrap it, you may need to modify it or implement an additional wrapper layer to comply with the interface specification above.
You can wrap the request-level logits processor by subclassing `AdapterLogitsProcessor` as shown in the example below (in this example, `DummyPerReqLogitsProcessor` is a stand-in for your request-level logits processor which needs to be wrapped.) Override `AdapterLogitsProcessor.is_argmax_invariant(self)` to accurately reflect whether your request-level logits processor may impact which token has the highest-value logit. Override `AdapterLogitsProcessor.new_req_logits_processor(self,params)` to create a new request-level logits processor instance from a `SamplingParams` instance:
You can wrap the request-level logits processor by subclassing `AdapterLogitsProcessor` as shown in the example below (in this example, `DummyPerReqLogitsProcessor` is a stand-in for your request-level logits processor which needs to be wrapped.):
* Override `AdapterLogitsProcessor.validate_params(cls,params)` to validate a request's sampling parameters.
* Override `AdapterLogitsProcessor.is_argmax_invariant(self)` to accurately reflect whether your request-level logits processor may impact which token has the highest-value logit.
* Override `AdapterLogitsProcessor.new_req_logits_processor(self,params)` to create a new request-level logits processor instance from a `SamplingParams` instance:
??? code "Example of Wrapping a Request-Level Logits Processor"
??? code "Example of Wrapping a Request-Level Logits Processor"
...
@@ -221,6 +241,16 @@ You can wrap the request-level logits processor by subclassing `AdapterLogitsPro
...
@@ -221,6 +241,16 @@ You can wrap the request-level logits processor by subclassing `AdapterLogitsPro
"""Example of wrapping a fake request-level logit processor to create a
"""Example of wrapping a fake request-level logit processor to create a
batch-level logits processor"""
batch-level logits processor"""
@classmethod
def validate_params(cls, params: SamplingParams):
    target_token: Any | None = params.extra_args and params.extra_args.get(
        "target_token"
    )
    if target_token is not None and not isinstance(target_token, int):
        raise ValueError(
            f"target_token value {target_token} is not int"
        )
def is_argmax_invariant(self) -> bool:
def is_argmax_invariant(self) -> bool:
return False
return False
...
@@ -241,18 +271,11 @@ You can wrap the request-level logits processor by subclassing `AdapterLogitsPro
...
@@ -241,18 +271,11 @@ You can wrap the request-level logits processor by subclassing `AdapterLogitsPro
Returns:
Returns:
`Callable` request logits processor, or None
`Callable` request logits processor, or None
"""
"""
target_token: Optional[Any] = params.extra_args and params.extra_args.get(
target_token: Any | None = params.extra_args and params.extra_args.get(
"target_token"
"target_token"
)
)
if target_token is None:
if target_token is None:
return None
return None
if not isinstance(target_token, int):
logger.warning(
"target_token value %s is not int; not applying logits"
" processor to request.",
target_token,
)
return None
return DummyPerReqLogitsProcessor(target_token)
return DummyPerReqLogitsProcessor(target_token)
```
```
...
@@ -263,7 +286,7 @@ Once you have created a custom subclass (like `WrappedPerReqLogitsProcessor`) wh
...
@@ -263,7 +286,7 @@ Once you have created a custom subclass (like `WrappedPerReqLogitsProcessor`) wh
## Ways to Load Your Custom Logits Processor in vLLM
## Ways to Load Your Custom Logits Processor in vLLM
Logits processors are loaded at initialization. Critically, the set of loaded logits processors cannot be modified after the vLLM engine finishes loading, and new logits logits processors cannot be loaded on-demand for individual requests.
Logits processors are loaded at initialization. Critically, the set of loaded logits processors cannot be modified after the vLLM engine finishes loading, and new logits processors cannot be loaded on-demand for individual requests.
This section details different ways of making your logits processor visible to vLLM and triggering vLLM to load your logits processor.
This section details different ways of making your logits processor visible to vLLM and triggering vLLM to load your logits processor.
...
@@ -415,7 +438,7 @@ The examples below show how a user would pass a custom argument (`target_token`)
...
@@ -415,7 +438,7 @@ The examples below show how a user would pass a custom argument (`target_token`)
## Best Practices for Writing Custom Logits Processors
## Best Practices for Writing Custom Logits Processors
Once vLLM loads a logits processor during initialization, then vLLM will invoke `update_state()` and `apply()` against that logits processor in every engine step. Both methods operate on all requests which currently reside in the vLLM persistent batch. Thus it is important to implement these methods efficiently.
Once vLLM loads a logits processor during initialization, then vLLM will invoke `update_state()` and `apply()` against that logits processor in every engine step. Both methods operate on all requests which currently reside in the vLLM persistent batch. Thus, it is important to implement these methods efficiently.
* Write efficient `apply()` and `update_state()` implementations in light of the fact that logits processors operate at batch granularity
* Write efficient `apply()` and `update_state()` implementations in light of the fact that logits processors operate at batch granularity
* For example, you may be able to use efficient vectorized operations to implement `apply()` or update internal state vectors in `update_state()`
* For example, you may be able to use efficient vectorized operations to implement `apply()` or update internal state vectors in `update_state()`
...
@@ -442,4 +465,4 @@ Once vLLM loads a logits processor during initialization, then vLLM will invoke
...
@@ -442,4 +465,4 @@ Once vLLM loads a logits processor during initialization, then vLLM will invoke
* **Note:** for wrapped per-request logits processors, the `AdapterLogitsProcessor` base-class handles this by default
* **Note:** for wrapped per-request logits processors, the `AdapterLogitsProcessor` base-class handles this by default
* `is_argmax_invariant()` can be hard-coded to `True` or `False` if the logits processor has consistent behavior. However the argmax invariance may also be determined programmatically (i.e. if your logits processor is user-customizable in some way that impacts whether the logits processor is argmax invariant). For this reason, `is_argmax_invariant()` is not a class method
* `is_argmax_invariant()` can be hard-coded to `True` or `False` if the logits processor has consistent behavior. However, the argmax invariance may also be determined programmatically (i.e. if your logits processor is user-customizable in some way that impacts whether the logits processor is argmax invariant). For this reason, `is_argmax_invariant()` is not a class method
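
To illustrate why argmax invariance may depend on per-instance configuration, here is a minimal, self-contained sketch. `ToyBiasLogitsProcessor` is a hypothetical stand-in that does not subclass any vLLM class, and it operates on a plain list of floats rather than a logits tensor.

```python
class ToyBiasLogitsProcessor:
    """Toy stand-in (not a vLLM class): adds a per-token bias to the logits."""

    def __init__(self, bias: dict[int, float]):
        # Keep only the entries that actually change a logit.
        self.bias = {tok: b for tok, b in bias.items() if b != 0.0}

    def is_argmax_invariant(self) -> bool:
        # With no non-zero bias the logits are returned unchanged, so the
        # highest-logit token cannot change. Any non-zero bias may change it.
        return not self.bias

    def apply(self, logits: list[float]) -> list[float]:
        out = list(logits)
        for tok, b in self.bias.items():
            out[tok] += b
        return out
```

Because the answer depends on the `bias` passed to the constructor, it can only be computed on an instance, which is the situation the last bullet describes.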
A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:
1. **Independent, fine-grained scaling**
2. **Lower time-to-first-token (TTFT)**
3. **Cross-process reuse and caching of encoder outputs**
Please refer to the directories `tests/v1/ec_connector`
## 4 Development
Disaggregated encoding is implemented by running two parts:
* **Encoder instance** – a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
    * PD can run either as a single normal instance with `disagg_encoder_example.sh` (E->PD) or as disaggregated instances with `disagg_epd_example.sh` (E->P->D)
A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.
### Key abstractions
* **ECConnector** – interface for retrieving EC caches produced by the encoder.
    * *Scheduler role* – checks cache existence and schedules loads.
    * *Worker role* – loads the embeddings into memory.
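
The split between the two roles can be pictured with a toy model. This is a sketch under stated assumptions, not the actual `ECConnector` interface: `ToyECConnector`, its method names, and the use of a shared `dict` as the encoder-cache store are all hypothetical.

```python
class ToyECConnector:
    """Toy stand-in for an EC connector; the real interface may differ."""

    def __init__(self, store: dict[str, list[float]]):
        self._store = store            # cache entries written by the encoder instance
        self._pending: list[str] = []  # loads scheduled but not yet performed
        self.loaded: dict[str, list[float]] = {}

    # Scheduler role: check cache existence and schedule loads.
    def has_cache(self, key: str) -> bool:
        return key in self._store

    def schedule_load(self, key: str) -> bool:
        if not self.has_cache(key):
            return False
        self._pending.append(key)
        return True

    # Worker role: load the scheduled embeddings into memory.
    def load_pending(self) -> None:
        for key in self._pending:
            self.loaded[key] = self._store[key]
        self._pending.clear()
```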
Here is a figure illustrating the disaggregated encoder flow:
For the PD disaggregation part, the Prefill instance receives the encoder cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 token output) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens only after the execution of the PD instance.
`docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0).
The example setup uses the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and follows `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D.
Please refer to <gh-file:examples/online_serving/disaggregated_prefill.sh> for the example usage of disaggregated prefilling.
Please refer to [examples/online_serving/disaggregated_prefill.sh](../../examples/online_serving/disaggregated_prefill.sh) for the example usage of disaggregated prefilling.
Now supports 5 types of connectors:
Now supports 5 types of connectors:
- **SharedStorageConnector**: refer to <gh-file:examples/offline_inference/disaggregated-prefill-v1/run.sh> for the example usage of SharedStorageConnector disaggregated prefilling.
- **SharedStorageConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of SharedStorageConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to <gh-file:examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh> for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **LMCacheConnectorV1**: refer to [examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to <gh-file:examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh> for the example usage of P2pNcclConnector disaggregated prefilling.
- **P2pNcclConnector**: refer to [examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as:
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as:
```bash
```bash
...
@@ -45,7 +45,7 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
...
@@ -45,7 +45,7 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
## Benchmarks
## Benchmarks
Please refer to <gh-file:benchmarks/disagg_benchmarks> for disaggregated prefilling benchmarks.
Please refer to [benchmarks/disagg_benchmarks](../../benchmarks/disagg_benchmarks) for disaggregated prefilling benchmarks.
## Development
## Development
...
@@ -91,6 +91,6 @@ Disaggregated prefilling is highly related to infrastructure, so vLLM relies on
...
@@ -91,6 +91,6 @@ Disaggregated prefilling is highly related to infrastructure, so vLLM relies on
We recommend three implementation approaches:
We recommend three implementation approaches:
- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and much more (like editing vLLM's model input to perform customized prefilling, etc). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and much more (like editing vLLM's model input to perform customized prefilling, etc.). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`.
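
As a rough picture of the third option, the sketch below models a pipe with an in-process queue. Only the method names `send_tensor` and `recv_tensor` come from the text above; the class name `InMemoryPipe`, the signatures, the `timeout` parameter, and the use of plain Python lists instead of tensors are assumptions made for illustration.

```python
import queue
from typing import Optional

Payload = Optional[list[float]]  # stand-in for a tensor; None is allowed as a payload


class InMemoryPipe:
    """Toy pipe (not vLLM's base class): FIFO delivery inside one process."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[Payload]" = queue.Queue()

    def send_tensor(self, tensor: Payload) -> None:
        self._queue.put(tensor)

    def recv_tensor(self, timeout: float = 1.0) -> Payload:
        # Raises queue.Empty if nothing arrives within `timeout` seconds.
        return self._queue.get(timeout=timeout)
```

A real implementation would move data between processes or machines; the queue only stands in for that transport so that the send and receive ordering is easy to see.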
Interleaved thinking allows models to reason between tool calls, enabling more sophisticated decision-making after receiving tool results. This feature helps models chain multiple tool calls with reasoning steps in between and make nuanced decisions based on intermediate results.
Important: Interleaved thinking increases token usage and response latency. Consider your budget and performance requirements when enabling this feature.
## How Interleaved Thinking Works
With interleaved thinking, the model can:
- Reason about the results of a tool call before deciding what to do next
- Chain multiple tool calls with reasoning steps in between
- Make more nuanced decisions based on intermediate results
- Provide transparent reasoning for its tool selection process
## Supported Models
vLLM currently supports the following interleaved thinking models:
| Model Series | Reasoning Parser Name |
|--------------|-----------------------|
| moonshotai/Kimi-K2-Thinking | kimi_k2 |
| MiniMaxAI/MiniMax-M2 | minimax_m2 |
## Example Usage
To use interleaved thinking with tool calls, specify a model that supports this feature and enable tool calls in your chat completion request. Here's an example:
This example demonstrates how to set up interleaved thinking with tool calls using a weather retrieval function. The model reasons about the tool results before generating the final response.
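
A request of the kind described above might be assembled as follows. This is a hedged sketch that only builds the JSON payload in the OpenAI-compatible Chat Completions format and does not contact a server. `get_weather`, `build_weather_request`, and `append_tool_result` are hypothetical names, and the model name is taken from the table above.

```python
import json


def build_weather_request(model: str, question: str) -> dict:
    """Build a chat completion payload that offers one weather tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Get the current weather for a location.",
                    "parameters": {
                        "type": "object",
                        "properties": {"location": {"type": "string"}},
                        "required": ["location"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def append_tool_result(messages: list[dict], tool_call_id: str, result: dict) -> None:
    """Append the tool output so the model can reason about it in the next turn."""
    messages.append(
        {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}
    )


if __name__ == "__main__":
    request = build_weather_request("MiniMaxAI/MiniMax-M2", "What's the weather in Paris?")
    print(json.dumps(request, indent=2))
```

In a full tool-calling loop, the assistant message containing the tool call is appended to `messages` first, then the tool result, and the updated list is sent in the next request.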
Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
Check out [examples/offline_inference/multilora_inference.py](../../examples/offline_inference/multilora_inference.py) for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters
## Serving LoRA Adapters
...
@@ -197,7 +197,7 @@ Alternatively, follow these example steps to implement your own plugin:
...
@@ -197,7 +197,7 @@ Alternatively, follow these example steps to implement your own plugin:
lora_request = LoRARequest(
lora_request = LoRARequest(
lora_name=lora_name,
lora_name=lora_name,
lora_path=local_path,
lora_path=local_path,
lora_int_id=abs(hash(lora_name))
lora_int_id=abs(hash(lora_name)),
)
)
return lora_request
return lora_request
```
```
...
@@ -296,10 +296,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au
...
@@ -296,10 +296,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-multimodal-language-models) in vLLM.
!!! note
!!! note
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
!!! tip
!!! tip
...
@@ -129,7 +129,7 @@ You can pass a single image to the `'image'` field of the multi-modal dictionary
...
@@ -129,7 +129,7 @@ You can pass a single image to the `'image'` field of the multi-modal dictionary
print(generated_text)
print(generated_text)
```
```
Full example: <gh-file:examples/offline_inference/vision_language.py>
Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
...
@@ -154,9 +154,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
...
@@ -154,9 +154,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
outputs = llm.generate({
outputs = llm.generate({
"prompt": prompt,
"prompt": prompt,
"multi_modal_data": {
"multi_modal_data": {"image": [image1, image2]},
"image": [image1, image2]
},
})
})
for o in outputs:
for o in outputs:
...
@@ -164,7 +162,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
...
@@ -164,7 +162,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
print(generated_text)
print(generated_text)
```
```
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
Full example: [examples/offline_inference/vision_language_multi_image.py](../../examples/offline_inference/vision_language_multi_image.py)
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
...
@@ -183,21 +181,24 @@ conversation = [
...
@@ -183,21 +181,24 @@ conversation = [
{"role":"assistant","content":"Hello! How can I assist you today?"},
{"role":"assistant","content":"Hello! How can I assist you today?"},
{
{
"role":"user",
"role":"user",
"content":[{
"content":[
"type":"image_url",
{
"image_url":{
"type":"image_url",
"url":image_url
"image_url":{"url":image_url},
}
},
},{
{
"type":"image_pil",
"type":"image_pil",
"image_pil":image_pil
"image_pil":image_pil,
},{
},
"type":"image_embeds",
{
"image_embeds":image_embeds
"type":"image_embeds",
},{
"image_embeds":image_embeds,
"type":"text",
},
"text":"What's in these images?"
{
}],
"type":"text",
"text":"What's in these images?",
},
],
},
},
]
]
...
@@ -224,7 +225,10 @@ Multi-image input can be extended to perform video captioning. We show this with
...
@@ -224,7 +225,10 @@ Multi-image input can be extended to perform video captioning. We show this with
message = {
message = {
"role": "user",
"role": "user",
"content": [
"content": [
{"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
{
"type": "text",
"text": "Describe this set of frames. Consider the frames to be a part of the same video.",
},
],
],
}
}
for i in range(len(video_frames)):
for i in range(len(video_frames)):
...
@@ -255,13 +259,13 @@ When loading RGBA images (images with transparency), vLLM converts them to RGB f
...
@@ -255,13 +259,13 @@ When loading RGBA images (images with transparency), vLLM converts them to RGB f
@@ -427,11 +449,11 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
...
@@ -427,11 +449,11 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
A chat template is **required** to use the Chat Completions API.
A chat template is **required** to use the Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
If no default chat template is available, we will first look for a built-in fallback in [vllm/transformers_utils/chat_templates/registry.py](../../vllm/transformers_utils/chat_templates/registry.py).
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:examples>.
For certain models, we provide alternative chat templates inside [examples](../../examples).
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
For example, VLM2Vec uses [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja) which is different from the default one for Phi-3-Vision.
### Image Inputs
### Image Inputs
...
@@ -465,55 +487,59 @@ Then, you can use the OpenAI client as follows:
...
@@ -465,55 +487,59 @@ Then, you can use the OpenAI client as follows:
chat_response = client.chat.completions.create(
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
messages=[
"role": "user",
{
"content": [
"role": "user",
# NOTE: The prompt formatting with the image token `<image>` is not needed
"content": [
# since the prompt will be processed automatically by the API server.
# NOTE: The prompt formatting with the image token `<image>` is not needed
{"type": "text", "text": "What’s in this image?"},
# since the prompt will be processed automatically by the API server.
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! tip
!!! tip
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
...
@@ -560,23 +586,22 @@ Then, you can use the OpenAI client as follows:
...
@@ -560,23 +586,22 @@ Then, you can use the OpenAI client as follows:
@@ -585,7 +610,7 @@ Then, you can use the OpenAI client as follows:
...
@@ -585,7 +610,7 @@ Then, you can use the OpenAI client as follows:
print("Chat completion output from image url:", result)
print("Chat completion output from image url:", result)
```
```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note
!!! note
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
...
@@ -652,23 +677,25 @@ Then, you can use the OpenAI client as follows:
...
@@ -652,23 +677,25 @@ Then, you can use the OpenAI client as follows:
@@ -707,7 +734,7 @@ Alternatively, you can pass `audio_url`, which is the audio counterpart of `imag
print("Chat completion output from audio url:", result)
```
```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note
By default, the timeout for fetching audio through an HTTP URL is `10` seconds.
...
@@ -720,7 +747,13 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
### Embedding Inputs
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
You must enable this feature via the `--enable-mm-embeds` flag in `vllm serve`.
!!! warning
The vLLM engine may crash if embeddings with an incorrect shape are passed.
Only enable this flag for trusted users!
#### Image Embedding Inputs
...
@@ -747,43 +780,48 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
# Basic usage - this is equivalent to the LLaVA example for offline inference
model = "llava-hf/llava-1.5-7b-hf"
embeds = {
"type": "image_embeds",
"image_embeds": f"{base64_image_embedding}",
"uuid": image_url # Optional
"uuid": image_url, # Optional
}
# Pass additional parameters (available to Qwen2-VL and MiniCPM-V)
@@ -9,7 +9,13 @@ NixlConnector is a high-performance KV cache transfer connector for vLLM's disag
Install the NIXL library: `uv pip install nixl`, as a quick start.
- Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
- The specified required NIXL version can be found in [requirements/kv_connectors.txt](gh-file:requirements/kv_connectors.txt) and other relevant config files
- The specified required NIXL version can be found in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files
For non-CUDA platforms, install NIXL with UCX built from source, as described below.
- **Required for both prefiller and decoder instances**
- Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
- For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank (e.g., with `--tensor-parallel-size=4` and base_port=5600, tp_rank 0..3 use ports 5600, 5601, 5602, 5603 on that node).
- For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank (e.g., with `--data-parallel-size=2` and base_port=5600, dp_rank 0..1 use port 5600, 5601 on that node).
- Used for the initial NIXL handshake between the prefiller and the decoder
- `VLLM_NIXL_SIDE_CHANNEL_HOST`: Host for side channel communication
- Connection info is passed via KVTransferParams from prefiller to decoder for handshake
- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
- Default: 120
- Default: 480
- If a request is aborted and the decoder has not yet read the KV-cache blocks through the nixl channel, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
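
The two port formulas quoted in the list above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical and are not part of vLLM or NIXL, and the code simply evaluates both formulas as written, without asserting which one a given vLLM version uses.

```python
# Illustrative only: these helpers are hypothetical and not part of vLLM.

def port_per_tp_worker(base_port: int, dp_rank: int, tp_size: int, tp_rank: int) -> int:
    """Formula: base_port + dp_rank * tp_size + tp_rank."""
    return base_port + dp_rank * tp_size + tp_rank


def port_per_dp_rank(base_port: int, dp_rank: int) -> int:
    """Formula: base_port + dp_rank."""
    return base_port + dp_rank


# Example from the text: --tensor-parallel-size=4, base_port=5600, dp_rank=0.
tp_ports = [port_per_tp_worker(5600, 0, 4, tp_rank) for tp_rank in range(4)]

# Example from the text: --data-parallel-size=2, base_port=5600.
dp_ports = [port_per_dp_rank(5600, dp_rank) for dp_rank in range(2)]

print(tp_ports)  # [5600, 5601, 5602, 5603]
print(dp_ports)  # [5600, 5601]
```

Either way, the resulting ports must be unique per worker on a host, so it can help to compute the full range up front and confirm that none of the ports is already in use.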
NixlConnector currently does not distinguish `kv_role`; the actual prefiller/decoder roles are determined by the upper-level proxy (e.g., `toy_proxy_server.py` using `--prefiller-hosts` and `--decoder-hosts`).
Therefore, `kv_role` in `--kv-transfer-config` is effectively a placeholder and does not affect NixlConnector's behavior.
## Experimental Feature
### Heterogeneous KV Layout support
Supported use case: prefill with the 'HND' layout and decode with the 'NHD' layout, using an experimental configuration
@@ -16,16 +16,20 @@ To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package.
Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package and are enabled by the `--enable-prompt-embeds` flag in `vllm serve`.
When a mixture of `'prompt_embeds'` and `'prompt'` inputs are provided in a single request, the prompt embeds are always returned first.
Prompt embeddings are passed in as base64 encoded torch tensors.
!!! warning
The vLLM engine may crash if embeddings with an incorrect shape are passed.
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.
The `AutoAWQ` library is deprecated. This functionality has been adopted by the vLLM project in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/tree/main/examples/awq).
For the recommended quantization workflow, please see the AWQ examples in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/tree/main/examples/awq). For more details on the deprecation, refer to the original [AutoAWQ repository](https://github.com/casper-hansen/AutoAWQ).
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
The main benefits are lower latency and memory usage.
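
As a rough back-of-the-envelope illustration of the footprint reduction, the sketch below estimates weight memory at 16 and 4 bits per weight. The 7-billion-parameter count is a hypothetical example, and the estimate deliberately ignores quantization metadata (scales and zero points), activations, and the KV cache, so real savings will be somewhat smaller than the ideal ratio.

```python
# Rough estimate of weight memory only; the parameter count is a
# hypothetical example, and per-group scales/zero points are ignored.

def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


num_params = 7e9  # hypothetical 7B-parameter model

fp16_gib = weight_memory_gib(num_params, 16)  # BF16/FP16: 2 bytes per weight
int4_gib = weight_memory_gib(num_params, 4)   # INT4: 0.5 bytes per weight

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"INT4 weights: {int4_gib:.1f} GiB")
print(f"Ideal reduction: {fp16_gib / int4_gib:.0f}x")
```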
...
@@ -18,13 +22,15 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the