The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.
We recommend installing the library with:
```console
pip install nvidia-modelopt
```
## Quantizing HuggingFace Models with PTQ
You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
Below is an example showing how to quantize a model using modelopt's PTQ API:
export_dir,# The directory where the exported files will be stored.
)
```
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
```python
fromvllmimportLLM,SamplingParams
defmain():
model_id="nvidia/Llama-3.1-8B-Instruct-FP8"
# Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
- IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
:::{note}
IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
:::
## Quickstart
To use reasoning models, you need to specify the `--enable-reasoning` and `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
@@ -83,7 +88,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
}
```
OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client support extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
```python
fromopenaiimportOpenAI
...
...
@@ -102,6 +107,8 @@ model = models.data[0].id
messages=[{"role":"user","content":"9.11 and 9.8, which is greater?"}]
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
@@ -139,11 +146,10 @@ Remember to check whether the `reasoning_content` exists in the response before
The reasoning content is also available in the structured output. The structured output engine like `xgrammar` will use the reasoning content to generate structured output. It is only supported in v0 engine now.
@@ -287,7 +292,7 @@ class ExampleParser(ReasoningParser):
"""
```
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in `vllm/model_executor/guided_decoding/reasoner/deepseek_reasoner.py`.
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
```python
@dataclass
...
...
@@ -313,11 +318,10 @@ class DeepSeekReasoner(Reasoner):
...
```
The structured output engine like `xgrammar` will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
The structured output engine like [xgrammar](https://github.com/mlc-ai/xgrammar) will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
Finally, you can enable reasoning for the model by using the `--enable-reasoning` and `--reasoning-parser` flags.
Finally, you can enable reasoning for the model by using the `--reasoning-parser` flags.
`examples/tool_chat_template_granite.jinja`: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
* `ibm-granite/granite-3.1-8b-instruct`
...
...
@@ -203,7 +203,7 @@ The chat template from Huggingface can be used directly. Parallel function calls
`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`)
...
...
@@ -236,6 +236,13 @@ For Qwen2.5, the chat template in tokenizer_config.json has already included sup
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
...
...
@@ -253,12 +260,12 @@ Limitations:
Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with `examples/tool_chat_template_llama4_pythonic.jinja`)
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
@@ -270,7 +277,7 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.
## How to write a tool parser plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.
vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
vLLM contains pre-compiled C++ and CUDA (12.6) binaries.
## Requirements
...
...
@@ -23,12 +23,12 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
You can install vLLM using either `pip` or `uv pip`:
```console
#Install vLLM with CUDA 12.4.
#Install vLLM with CUDA 12.6.
pip install vllm #If you are using pip.
uv pip install vllm #If you are using uv.
```
As of now, vLLM's binaries are compiled with CUDA 12.4 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.1, 11.8, and public PyTorch release versions:
As of now, vLLM's binaries are compiled with CUDA 12.6 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.8, 11.8, and public PyTorch release versions:
- Finally, due to a known issue of conflict dependency(oneapi related) in torch-xpu 2.6 and ipex-xpu 2.6, we install ipex here. This will be fixed in the ipex-xpu 2.7.
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
There are some new features coming with ipex-xpu 2.6, e.g. **chunked prefill**, **V1 engine support**, **lora**, **MoE**, etc.
@@ -19,8 +19,8 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
@@ -54,7 +54,7 @@ For a model to be compatible with the Transformers backend for vLLM it must:
If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for <project:#offline-inference> or `--trust-remode-code` for the <project:#openai-compatible-server>.
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for <project:#offline-inference> or `--trust-remote-code` for the <project:#openai-compatible-server>.
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for <project:#offline-inference> or `vllm serve <MODEL_DIR>` for the <project:#openai-compatible-server>.
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
...
...
@@ -168,6 +168,66 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
#### Download a model
If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
```console
#The `delete-cache`command requires extra dependencies to work with the TUI.
#Please run `pip install huggingface_hub[cli]` to install them.
#Launch the interactive TUI to select models to delete
$huggingface-cli delete-cache
? Select revisions to delete: 1 revisions selected counting for 438.9M.
○ None of the following (if selected, nothing will be deleted).
Model BAAI/bge-base-en-v1.5 (438.9M, used 1 week ago)
❯ ◉ a5beb1e3: main #modified 1 week ago
Model BAAI/bge-large-en-v1.5 (1.3G, used 1 week ago)
○ d4aa6901: main #modified 1 week ago
Model BAAI/bge-reranker-base (1.1G, used 4 weeks ago)
○ 2cfc18c9: main #modified 4 weeks ago
Press <space>to select, <enter> to validate and <ctrl+c> to quit without modification.
#Need to confirm after selected
? Select revisions to delete: 1 revision(s) selected.
@@ -472,7 +539,7 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
-*`OLMo2ForCausalLM`
* OLMo2
*`allenai/OLMo2-7B-1124`, etc.
*`allenai/OLMo-2-0425-1B`, etc.
*
* ✅︎
-*`OLMoEForCausalLM`
...
...
@@ -542,8 +609,8 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
-*`Qwen3MoeForCausalLM`
* Qwen3MoE
*`Qwen/Qwen3-MoE-15B-A2B`, etc.
* ✅︎
*`Qwen/Qwen3-30B-A3B`, etc.
*
* ✅︎
-*`StableLmForCausalLM`
* StableLM
...
...
@@ -585,6 +652,11 @@ See [this page](#generative-models) for more information on how to use generativ
*`Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.
*
*
-*`MiMoForCausalLM`
* MiMo
*`XiaomiMiMo/MiMo-7B-RL`, etc.
*
*
:::
:::{note}
...
...
@@ -600,7 +672,9 @@ Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
:::
#### Text Embedding (`--task embed`)
#### Text Embedding
Specified using `--task embed`.
:::{list-table}
:widths: 25 25 50 5 5
...
...
@@ -613,7 +687,7 @@ you should explicitly specify the task type to ensure that the model is used in
*[PP](#distributed-serving)
-*`BertModel`
* BERT-based
*`BAAI/bge-base-en-v1.5`, etc.
*`BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc.
*
*
-*`Gemma2Model`
...
...
@@ -626,6 +700,26 @@ you should explicitly specify the task type to ensure that the model is used in
*`parasail-ai/GritLM-7B-vllm`.
* ✅︎
* ✅︎
-*`GteModel`
* Arctic-Embed-2.0-M
*`Snowflake/snowflake-arctic-embed-m-v2.0`.
*
* ︎
-*`GteNewModel`
* mGTE-TRM (see note)
*`Alibaba-NLP/gte-multilingual-base`, etc.
* ︎
* ︎
-*`ModernBertModel`
* ModernBERT-based
*`Alibaba-NLP/gte-modernbert-base`, etc.
* ︎
* ︎
-*`NomicBertModel`
* Nomic BERT
*`nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc.
* ︎
* ︎
-*`LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.
* Llama-based
*`intfloat/e5-mistral-7b-instruct`, etc.
...
...
@@ -638,12 +732,12 @@ you should explicitly specify the task type to ensure that the model is used in
* ✅︎
-*`RobertaModel`, `RobertaForMaskedLM`
* RoBERTa-based
*`sentence-transformers/all-roberta-large-v1`, `sentence-transformers/all-roberta-large-v1`, etc.
*`sentence-transformers/all-roberta-large-v1`, etc.
*
*
-*`XLMRobertaModel`
* XLM-RoBERTa-based
*`intfloat/multilingual-e5-large`, `jinaai/jina-reranker-v2-base-multilingual`, etc.
*`intfloat/multilingual-e5-large`, `jinaai/jina-reranker-v2-base-multilingual`, `Snowflake/snowflake-arctic-embed-l-v2.0`, `jinaai/jina-embeddings-v3`(see note), etc.
*
*
:::
...
...
@@ -661,11 +755,21 @@ For both the 1.5B and 7B variants, you also need to enable `--trust-remote-code`
See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).
:::
:::{note}
`jinaai/jina-embeddings-v3` supports multiple tasks through lora, while vllm temporarily only supports text-matching tasks by merging lora weights.
:::
:::{note}
The second-generation GTE model (mGTE-TRM) is named `NewModel`. The name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.
:::
If your model is not in the above list, we will try to automatically convert the model using
{func}`~vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
#### Reward Modeling (`--task reward`)
#### Reward Modeling
Specified using `--task reward`.
:::{list-table}
:widths: 25 25 50 5 5
...
...
@@ -706,7 +810,9 @@ For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b
If your model is not in the above list, we will try to automatically convert the model using
{func}`~vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring (`--task score`)
#### Sentence Pair Scoring
Specified using `--task score`.
:::{list-table}
:widths: 25 25 50 5 5
...
...
@@ -819,7 +927,9 @@ vLLM currently only supports adding LoRA to the language backbone of multimodal
See [this page](#generative-models) for more information on how to use generative models.
#### Text Generation (`--task generate`)
#### Text Generation
Specified using `--task generate`.
:::{list-table}
:widths: 25 25 15 20 5 5 5
...
...
@@ -986,11 +1096,18 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
* ✅︎
-*`MiniMaxVL01ForConditionalGeneration`
* MiniMax-VL
* T + I<sup>E+</sup>
*`MiniMaxAI/MiniMax-VL-01`, etc.
*
* ✅︎
* ✅︎
-*`Mistral3ForConditionalGeneration`
* Mistral3
* T + I<sup>+</sup>
*`mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc.
*
* ✅︎
* ✅︎
* ✅︎
-*`MllamaForConditionalGeneration`
...
...
@@ -1014,6 +1131,13 @@ See [this page](#generative-models) for more information on how to use generativ
*
* ✅︎
* ✅︎
-*`Ovis`
* Ovis2, Ovis1.6
* T + I<sup>+</sup>
*`AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc.
*
*
* ✅︎
-*`PaliGemmaForConditionalGeneration`
* PaliGemma, PaliGemma 2
* T + I<sup>E</sup>
...
...
@@ -1106,11 +1230,6 @@ See [this page](#generative-models) for more information on how to use generativ
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{important}
Pan-and-scan image pre-processing is currently supported on V0 (but not V1).
You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": true}'`.
:::
:::{warning}
Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
However, there are differences in how they handle text + image inputs:
...
...
@@ -1130,7 +1249,7 @@ This limitation exists because the model's mixed attention pattern (bidirectiona
:::
:::{note}
`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support head size 80.
:::
:::{note}
...
...
@@ -1193,7 +1312,9 @@ Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
:::
#### Text Embedding (`--task embed`)
#### Text Embedding
Specified using `--task embed`.
Any text generation model can be converted into an embedding model by passing `--task embed`.
...
...
@@ -1233,7 +1354,9 @@ The following table lists those that are tested in vLLM.
* ✅︎
:::
#### Transcription (`--task transcription`)
#### Transcription
Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition.
This guide covers optimization strategies and performance tuning for vLLM V1.
## Preemption
Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:
In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, you may see the following warning:
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
If you frequently encounter preemptions from the vLLM engine, consider the following actions:
If you frequently encounter preemptions, consider the following actions:
- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
- Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
- Increase `pipeline_parallel_size`. This approach distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, which indirectly leaves more memory available for KV cache.
You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
(chunked-prefill)=
## Chunked Prefill
vLLM supports an experimental feature chunked prefill. Chunked prefill allows to chunk large prefills into smaller chunks and batch them together with decode requests.
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
This policy has two benefits:
- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
### Performance Tuning with Chunked Prefill
You can tune the performance by adjusting `max_num_batched_tokens`:
- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096` especially for smaller models on large GPUs.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
This policy optimizes the TTFT (time to the first token), but incurs slower ITL (inter token latency) and inefficient GPU utilization.
See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
Once chunked prefill is enabled, the policy is changed to prioritize decode requests.
It batches all pending decode requests to the batch before scheduling any prefill.
When there are available token_budget (`max_num_batched_tokens`), it schedules pending prefills.
If a last pending prefill request cannot fit into `max_num_batched_tokens`, it chunks it.
## Parallelism Strategies
This policy has two benefits:
vLLM supports multiple parallelism strategies that can be combined to optimize performance across different hardware configurations.
- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
### Tensor Parallelism (TP)
You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large model inference within a single node.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
**When to use:**
We recommend you set `max_num_batched_tokens > 2048` for throughput.
- When the model is too large to fit on a single GPU
- When you need to reduce memory pressure per GPU to allow more KV cache space for higher throughput
See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
For models that are too large to fit on a single GPU (like 70B parameter models), tensor parallelism is essential.
### Pipeline Parallelism (PP)
Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence.
**When to use:**
- When you've already maxed out efficient tensor parallelism but need to distribute the model further, or across nodes
- For very deep and narrow models where layer distribution is more efficient than tensor sharding
Pipeline parallelism can be combined with tensor parallelism for very large models:
```python
fromvllmimportLLM
# Combine pipeline and tensor parallelism
llm=LLM(
model="meta-llama/Llama-3.3-70B-Instruct,
tensor_parallel_size=4,
pipeline_parallel_size=2
)
```
### Expert Parallelism (EP)
Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.
**When to use:**
Please try out this feature and let us know your feedback via GitHub issues!
- Specifically for MoE models (like DeepSeekV3, Qwen3MoE, Llama-4)
- When you want to balance the expert computation load across GPUs
Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
It will use the same degree of parallelism as what you have set for tensor parallelism.
### Data Parallelism (DP)
Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
**When to use:**
- When you have enough GPUs to replicate the entire model
- When you need to scale throughput rather than model size
- In multi-user environments where isolation between request batches is beneficial
Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
## Reducing Memory Usage
If you encounter out-of-memory issues, consider these strategies:
### Context Length and Batch Size
You can reduce memory usage by limiting the context length and batch size:
```python
fromvllmimportLLM
llm=LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
max_model_len=2048,# Limit context window
max_num_seqs=4# Limit batch size
)
```
### Adjust CUDA Graph Compilation
CUDA graph compilation in V1 uses more memory than in V0. You can reduce memory usage by adjusting the compilation level:
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#vllm-serve) command. (You can also use our [Docker](#deployment-docker) image.)
In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#serve-args) command. (You can also use our [Docker](#deployment-docker) image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
...
...
@@ -61,6 +61,8 @@ In addition, we have the following custom APIs:
- Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
We automatically wrap any other transformer via `as_classification_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.