If it is `TransformersForCausalLM` then it means it's based on Transformers!
!!! tip
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][openai-compatible-server].
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server].
!!! note
vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
...
...
@@ -53,8 +53,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:
If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][openai-compatible-server].
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][openai-compatible-server].
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server].
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
...
...
@@ -70,7 +70,10 @@ To make your model compatible with the Transformers backend, it needs:
2.`MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
3.`MyModel` must contain `_supports_attention_backend = True`.
```python title="modeling_my_model.py"
<details>
<summary>modeling_my_model.py</summary>
```python
fromtransformersimportPreTrainedModel
fromtorchimportnn
...
...
@@ -93,6 +96,8 @@ class MyModel(PreTrainedModel):
_supports_attention_backend=True
```
</details>
Here is what happens in the background when this model is loaded:
1. The config is loaded.
...
...
@@ -103,7 +108,10 @@ That's it!
For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
```python title="configuration_my_model.py"
<details>
<summary>configuration_my_model.py</summary>
```python
fromtransformersimportPretrainedConfig
...
...
@@ -123,6 +131,8 @@ class MyConfig(PretrainedConfig):
}
```
</details>
-`base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
-`base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages
...
...
@@ -168,7 +178,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
<details>
<summary>Commands</summary>
```console
#The `delete-cache`command requires extra dependencies to work with the TUI.
#Please run `pip install huggingface_hub[cli]` to install them.
...
...
@@ -224,6 +237,8 @@ Start deletion.
Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
```
</details>
#### Using a proxy
Here are some tips for loading/downloading models from Hugging Face using a proxy:
...
...
@@ -299,87 +314,97 @@ See [this page][generative-models] for more information on how to use generative
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
!!! note
Only text inputs are currently supported for `Gemma3nForConditionalGeneration`. To use this model, please upgrade Hugging Face Transformers to version 4.53.0.
### Pooling Models
See [this page](./pooling_models.md) for more information on how to use pooling models.
!!! warning
!!! important
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
...
...
@@ -387,18 +412,19 @@ See [this page](./pooling_models.md) for more information on how to use pooling
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
...
...
@@ -409,7 +435,7 @@ Specified using `--task embed`.
See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).
!!! note
`jinaai/jina-embeddings-v3` supports multiple tasks through lora, while vllm temporarily only supports text-matching tasks by merging lora weights.
`jinaai/jina-embeddings-v3` supports multiple tasks through LoRA, while vllm temporarily only supports text-matching tasks by merging LoRA weights.
!!! note
The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.
...
...
@@ -422,16 +448,17 @@ of the whole prompt are extracted from the normalized hidden state corresponding
If your model is not in the above list, we will try to automatically convert the model using
[as_classification_model][vllm.model_executor.models.adapters.as_classification_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/qwen3_reranker.py>.
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
...
...
@@ -579,9 +626,6 @@ Specified using `--task generate`.
!!! note
Only `InternVLChatModel` with Qwen2.5 text backbone (`OpenGVLab/InternVL3-2B`, `OpenGVLab/InternVL2.5-1B` etc) has video inputs support currently.
!!! note
`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support head size 80.
!!! note
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
...
...
@@ -590,27 +634,29 @@ Specified using `--task generate`.
For the best results, we recommend using the following dependency versions (tested on A10 and L40):
```text
# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
torch==2.5.1
torchvision==0.20.1
transformers==4.48.1
tokenizers==0.21.0
tiktoken==0.7.0
vllm==0.7.0
# Optional but recommended for improved performance and stability
triton==3.1.0
xformers==0.0.28.post3
uvloop==0.21.0
protobuf==5.29.3
openai==1.60.2
opencv-python-headless==4.11.0.86
pillow==10.4.0
# Installed FlashAttention (for float16 only)
flash-attn>=2.5.6 # Not used in float32, but should be documented
```
??? Dependency versions
```text
# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
torch==2.5.1
torchvision==0.20.1
transformers==4.48.1
tokenizers==0.21.0
tiktoken==0.7.0
vllm==0.7.0
# Optional but recommended for improved performance and stability
triton==3.1.0
xformers==0.0.28.post3
uvloop==0.21.0
protobuf==5.29.3
openai==1.60.2
opencv-python-headless==4.11.0.86
pillow==10.4.0
# Installed FlashAttention (for float16 only)
flash-attn>=2.5.6 # Not used in float32, but should be documented
```
**Note:** Make sure you understand the security implications of using outdated packages.
...
...
@@ -622,17 +668,24 @@ Specified using `--task generate`.
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
!!! note
To use Qwen2.5-Omni, you have to install Hugging Face Transformers library from source via
@@ -34,15 +34,15 @@ output = llm.generate("San Francisco is a")
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
```console
vllm serve facebook/opt-13b \
```bash
vllm serve facebook/opt-13b \
--tensor-parallel-size 4
```
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console
vllm serve gpt2 \
```bash
vllm serve gpt2 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
```
...
...
@@ -55,7 +55,7 @@ The first step, is to start containers and organize them into a cluster. We have
Pick a node as the head node, and run the following command:
```console
```bash
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
...
...
@@ -66,7 +66,7 @@ bash run_cluster.sh \
On the rest of the worker nodes, run the following command:
```console
```bash
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
...
...
@@ -87,7 +87,7 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as you have all the GPUs on one node: vLLM will be able to leverage GPU resources of all nodes in the Ray cluster, and therefore, only run the `vllm` command on this node but not other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
...
...
@@ -95,12 +95,55 @@ After that, on any node, use `docker exec -it node /bin/bash` to enter the conta
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
```console
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 16
```
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the InfiniBand is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses InfiniBand with GPUDirect RDMA, which is efficient.
### GPUDirect RDMA
To enable GPUDirect RDMA with vLLM, specific configuration tweaks are needed. This setup ensures:
-`IPC_LOCK` Security Context: Add the `IPC_LOCK` capability to the container’s security context to lock memory pages and prevent swapping to disk.
- Shared Memory with `/dev/shm`: Mount `/dev/shm` in the pod spec to provide shared memory for IPC.
When using Docker, you can set up the container as follows:
```bash
docker run --gpus all \
--ipc=host \
--shm-size=16G \
-v /dev/shm:/dev/shm \
vllm/vllm-openai
```
When using Kubernetes, you can set up the pod spec as follows:
```yaml
...
spec:
containers:
-name:vllm
image:vllm/vllm-openai
securityContext:
capabilities:
add:["IPC_LOCK"]
volumeMounts:
-mountPath:/dev/shm
name:dshm
resources:
limits:
nvidia.com/gpu:8
requests:
nvidia.com/gpu:8
volumes:
-name:dshm
emptyDir:
medium:Memory
...
```
!!! warning
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
```python
fromopenaiimportOpenAI
client=OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-abc123",
)
??? Code
completion=client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role":"user","content":"Hello!"}
]
)
```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="token-abc123",
)
print(completion.choices[0].message)
```
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
```
!!! tip
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
!!! warning
!!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
...
...
@@ -55,6 +57,8 @@ We currently support the following OpenAI APIs:
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
@@ -60,79 +60,84 @@ To identify the particular CUDA operation that causes the error, you can add `--
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
If the script runs successfully, you should see the message `sanity check is successful!`.
...
...
@@ -165,25 +170,27 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
or an error from Python that looks like this:
```console
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
??? Logs
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
```console
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
if __name__ == '__main__':
freeze_support()
...
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
if __name__ == '__main__':
freeze_support()
...
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html
```
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html
```
then you must update your Python code to guard usage of `vllm` behind a `if
__name__ == '__main__':` block. For example, instead of this:
...
...
@@ -207,20 +214,22 @@ if __name__ == '__main__':
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
```python
importtorch
@torch.compile
deff(x):
# a simple function to test torch.compile
x=x+1
x=x*2
x=x.sin()
returnx
x=torch.randn(4,4).cuda()
print(f(x))
```
??? Code
```python
import torch
@torch.compile
def f(x):
# a simple function to test torch.compile
x = x + 1
x = x * 2
x = x.sin()
return x
x = torch.randn(4, 4).cuda()
print(f(x))
```
If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.
...
...
@@ -264,6 +273,27 @@ But you are sure that the model is in the [list of supported models][supported-m
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
## NCCL error: unhandled system error during `ncclCommInitRank`
If your serving workload uses GPUDirect RDMA for distributed serving across multiple nodes and encounters an error during `ncclCommInitRank`, with no clear error message even with `NCCL_DEBUG=INFO` set, it might look like this:
```text
Error executing method 'init_device'. This might cause deadlock in distributed execution.
Traceback (most recent call last):
...
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
...
```
This indicates vLLM failed to initialize the NCCL communicator, possibly due to a missing `IPC_LOCK` linux capability or an unmounted `/dev/shm`. Refer to [Distributed Inference and Serving](../serving/distributed_serving.md#running-vllm-on-multiple-nodes) for guidance on properly configuring the environment for distributed serving.
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
**We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.**
!!! announcement
We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
...
...
@@ -32,53 +34,110 @@ Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to bring V1 as the default engine, therefore this guide will be updated constantly as more features get supported on vLLM V1.
### Supports Overview
#### Hardware
## Current Status
For each item, our progress towards V1 support falls into one of the following states:
-**🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
-**🟢 Functional**: Fully operational, with ongoing optimizations.
-**🚧 WIP**: Under active development.
-**🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
-**🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
-**🔴 Deprecated**: Not planned for V1 unless there is strong demand.
!!! note
vLLM V1’s unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.
The V1 scheduler supports multiple scheduling policies, including First-Come,
First-Served (FCFS) and priority-based scheduling (where requests are processed
based on assigned priority, with FCFS as a tie-breaker), configurable via the
-**🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
-**🟢 Functional**: Fully operational, with ongoing optimizations.
-**🚧 WIP**: Under active development.
-**🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
-**🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
-**🔴 Deprecated**: Not planned for V1 unless there is strong demand.
**Note**: vLLM V1’s unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.
!!! note
### Semantic Changes and Deprecated Features
vLLM V1’s unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.
#### Logprobs
#### Semantic Changes to Logprobs
vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:
...
...
@@ -115,46 +174,4 @@ to handle request preemptions.
**Structured Output features**
-**Request-level Structured Output Backend**: Deprecated, alternative backends
(outlines, guidance) with fallbacks is WIP.
### Feature & Model Support in Progress
Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.
#### Features to Be Optimized
These features are already supported in vLLM V1, but their optimization is still
in progress.
-**Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize the support for Eagle, MTP compared to draft model based spec decode.
-**Multimodal Models**: V1 is almost fully compatible with V0 except that interleaved modality input is not supported yet.
See [here](https://github.com/orgs/vllm-project/projects/8) for the status of upcoming features and optimizations.
#### Features to Be Supported
-**Structured Output Alternative Backends**: Structured output alternative backends (outlines, guidance) support is planned. V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
and the majority fall into the following categories. V1 support for these models will be added eventually.
**Embedding Models**
The initial support will be provided by [PR #16188](https://github.com/vllm-project/vllm/pull/16188).
Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249), which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360) to enable simultaneous generation and embedding using the same engine instance in V1.
**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention)
are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).
**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
-**Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now.
You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`
```console
$cat results.jsonl
```bash
cat results.jsonl
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
@@ -79,7 +79,7 @@ The batch runner supports remote input and output urls that are accessible via h
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl`, you can run
@@ -4,7 +4,7 @@ This script is used to profile the TPU performance of vLLM for specific prefill
Note: an actual running server is a mix of both prefill of many shapes and decode of many shapes.
We assume you are on a TPU already (this was tested on TPU v6e) and have installed vLLM according to the [installation guide](https://docs.vllm.ai/en/latest/getting_started/installation/ai_accelerator/index.html).
We assume you are on a TPU already (this was tested on TPU v6e) and have installed vLLM according to the [Google TPU installation guide](https://docs.vllm.ai/en/latest/getting_started/installation/google_tpu.html).
> In all examples below, we run several warmups before (so `--enforce-eager` is okay)
...
...
@@ -57,7 +57,10 @@ Once you have collected your profiles with this script, you can visualize them u
Here are most likely the dependencies you need to install:
# Why do we need hf_overrides for the official original version:
# vllm converts it to Qwen3ForSequenceClassification when loaded for
# better performance.
# - Firstly, we need using `"architectures": ["Qwen3ForSequenceClassification"],`
# to manually route to Qwen3ForSequenceClassification.
# - Then, we will extract the vector corresponding to classifier_from_token
# from lm_head using `"classifier_from_token": ["no", "yes"]`.
# - Third, we will convert these two vectors into one vector. The use of
# conversion logic is controlled by `using "is_original_qwen3_reranker": True`.
# Please use the query_template and document_template to format the query and
# document for better reranker results.
prefix='<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
"Given a web search query, retrieve relevant passages that answer the query"
)
queries=[
"What is the capital of China?",
"Explain gravity",
]
documents=[
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",