Commit 41199996 authored by zhuwenwen's avatar zhuwenwen

Merge tag 'v0.12.0' into v0.12.0-dev

parents 31021d81 4fd9d6a8
...@@ -5,4 +5,4 @@ nav: ...@@ -5,4 +5,4 @@ nav:
- complete.md - complete.md
- run-batch.md - run-batch.md
- vllm bench: - vllm bench:
- bench/*.md - bench/**/*.md
...@@ -4,6 +4,6 @@ ...@@ -4,6 +4,6 @@
--8<-- "docs/cli/json_tip.inc.md" --8<-- "docs/cli/json_tip.inc.md"
## Options ## Arguments
--8<-- "docs/argparse/bench_latency.md" --8<-- "docs/argparse/bench_latency.inc.md"
...@@ -4,6 +4,6 @@ ...@@ -4,6 +4,6 @@
--8<-- "docs/cli/json_tip.inc.md" --8<-- "docs/cli/json_tip.inc.md"
## Options ## Arguments
--8<-- "docs/argparse/bench_serve.md" --8<-- "docs/argparse/bench_serve.inc.md"
# vllm bench sweep plot
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/argparse/bench_sweep_plot.inc.md"
# vllm bench sweep plot_pareto
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"
# vllm bench sweep serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/argparse/bench_sweep_serve.inc.md"
# vllm bench sweep serve_sla
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"
...@@ -4,6 +4,6 @@ ...@@ -4,6 +4,6 @@
--8<-- "docs/cli/json_tip.inc.md" --8<-- "docs/cli/json_tip.inc.md"
## Options ## Arguments
--8<-- "docs/argparse/bench_throughput.md" --8<-- "docs/argparse/bench_throughput.inc.md"
# vllm chat # vllm chat
## Options ## Arguments
--8<-- "docs/argparse/chat.md" --8<-- "docs/argparse/chat.inc.md"
# vllm complete # vllm complete
## Options ## Arguments
--8<-- "docs/argparse/complete.md" --8<-- "docs/argparse/complete.inc.md"
...@@ -4,6 +4,6 @@ ...@@ -4,6 +4,6 @@
--8<-- "docs/cli/json_tip.inc.md" --8<-- "docs/cli/json_tip.inc.md"
## Options ## Arguments
--8<-- "docs/argparse/run-batch.md" --8<-- "docs/argparse/run-batch.inc.md"
...@@ -4,6 +4,6 @@ ...@@ -4,6 +4,6 @@
--8<-- "docs/cli/json_tip.inc.md" --8<-- "docs/cli/json_tip.inc.md"
## Options ## Arguments
--8<-- "docs/argparse/serve.md" --8<-- "docs/argparse/serve.inc.md"
# Meetups # Meetups
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups around the world. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights.
## Upcoming Meetups
Stay tuned for upcoming meetups! Follow us on [Twitter/X](https://x.com/vllm_project), join our [Slack](https://slack.vllm.ai), and follow vLLM on [Luma](https://luma.com/vLLM-Meetups) to get notified about new events.
## Past Meetups
Below you'll find slides and recordings from our previous meetups:
- [vLLM Bangkok Meetup](https://luma.com/v0f647nv), November 21st 2025. [[Slides]](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing)
- [vLLM Zurich Meetup](https://luma.com/0gls27kb), November 6th 2025. [[Slides]](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) [[Recording]](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w), November 1st 2025. [[Slides]](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link)
- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg), October 25th 2025. [[Slides]](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6)
- [vLLM Toronto Meetup](https://luma.com/e80e0ymm), September 25th 2025. [[Slides]](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing)
- [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA) - [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA)
- [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing) - [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing)
- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH) - [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH)
...@@ -22,4 +35,12 @@ We host regular meetups in San Francisco Bay Area every 2 months. We will share ...@@ -22,4 +35,12 @@ We host regular meetups in San Francisco Bay Area every 2 months. We will share
- [The second vLLM meetup](https://lu.ma/ygxbpzhl), with IBM Research, January 31st 2024. [[Slides]](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing) [[Video (vLLM Update)]](https://youtu.be/Y0C-DUvEnZQ) [[Video (IBM Research & torch.compile)]](https://youtu.be/m0dMtFLI-dg) - [The second vLLM meetup](https://lu.ma/ygxbpzhl), with IBM Research, January 31st 2024. [[Slides]](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing) [[Video (vLLM Update)]](https://youtu.be/Y0C-DUvEnZQ) [[Video (IBM Research & torch.compile)]](https://youtu.be/m0dMtFLI-dg)
- [The first vLLM meetup](https://lu.ma/first-vllm-meetup), with a16z, October 5th 2023. [[Slides]](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing) - [The first vLLM meetup](https://lu.ma/first-vllm-meetup), with a16z, October 5th 2023. [[Slides]](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing)
We are always looking for speakers and sponsors at San Francisco Bay Area and potentially other locations. If you are interested in speaking or sponsoring, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu). ## Get Involved
**Want to host or speak at a vLLM meetup?** We're always looking for speakers and sponsors for our meetups. Whether you want to:
- Share your vLLM feature, use case, project extension, or deployment experience
- Host a meetup in your city
- Sponsor an event
Please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu).
...@@ -34,6 +34,7 @@ Compute Resources: ...@@ -34,6 +34,7 @@ Compute Resources:
- Trainy - Trainy
- UC Berkeley - UC Berkeley
- UC San Diego - UC San Diego
- Volcengine
Slack Sponsor: Anyscale Slack Sponsor: Anyscale
......
...@@ -4,6 +4,6 @@ This section lists the most common options for running vLLM. ...@@ -4,6 +4,6 @@ This section lists the most common options for running vLLM.
There are three main levels of configuration, from highest priority to lowest priority: There are three main levels of configuration, from highest priority to lowest priority:
- [Request parameters][completions-api] and [input arguments][sampling-params] - [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
- [Engine arguments](./engine_args.md) - [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md) - [Environment variables](./env_vars.md)
...@@ -11,8 +11,7 @@ The following code splits the model across 2 GPUs. ...@@ -11,8 +11,7 @@ The following code splits the model across 2 GPUs.
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
tensor_parallel_size=2)
``` ```
!!! warning !!! warning
...@@ -24,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", ...@@ -24,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
!!! note !!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Quantization ## Quantization
...@@ -43,30 +42,25 @@ and the maximum batch size (`max_num_seqs` option). ...@@ -43,30 +42,25 @@ and the maximum batch size (`max_num_seqs` option).
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="adept/fuyu-8b", llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)
max_model_len=2048,
max_num_seqs=2)
``` ```
## Reduce CUDA Graphs ## Reduce CUDA Graphs
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU. By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
!!! warning
CUDA graph capture takes up more memory in V1 than in V0.
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
??? code ??? code
```python ```python
from vllm import LLM from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel from vllm.config import CompilationConfig, CompilationMode
llm = LLM( llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct", model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig( compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE, mode=CompilationMode.VLLM_COMPILE,
# By default, it goes up to max_num_seqs # By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16], cudagraph_capture_sizes=[1, 2, 4, 8, 16],
), ),
...@@ -78,8 +72,7 @@ You can disable graph capturing completely via the `enforce_eager` flag: ...@@ -78,8 +72,7 @@ You can disable graph capturing completely via the `enforce_eager` flag:
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
enforce_eager=True)
``` ```
## Adjust cache size ## Adjust cache size
...@@ -97,8 +90,10 @@ You can allow a smaller number of multi-modal items per prompt to reduce the mem ...@@ -97,8 +90,10 @@ You can allow a smaller number of multi-modal items per prompt to reduce the mem
from vllm import LLM from vllm import LLM
# Accept up to 3 images and 1 video per prompt # Accept up to 3 images and 1 video per prompt
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
limit_mm_per_prompt={"image": 3, "video": 1}) model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={"image": 3, "video": 1},
)
``` ```
You can go a step further and disable unused modalities completely by setting its limit to zero. You can go a step further and disable unused modalities completely by setting its limit to zero.
...@@ -108,8 +103,10 @@ For example, if your application only accepts image input, there is no need to a ...@@ -108,8 +103,10 @@ For example, if your application only accepts image input, there is no need to a
from vllm import LLM from vllm import LLM
# Accept any number of images but no videos # Accept any number of images but no videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
limit_mm_per_prompt={"video": 0}) model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={"video": 0},
)
``` ```
You can even run a multi-modal model for text-only inference: You can even run a multi-modal model for text-only inference:
...@@ -118,10 +115,52 @@ You can even run a multi-modal model for text-only inference: ...@@ -118,10 +115,52 @@ You can even run a multi-modal model for text-only inference:
from vllm import LLM from vllm import LLM
# Don't accept images. Just text. # Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it", llm = LLM(
limit_mm_per_prompt={"image": 0}) model="google/gemma-3-27b-it",
limit_mm_per_prompt={"image": 0},
)
``` ```
### Configurable options
`limit_mm_per_prompt` also accepts configurable options per modality. In the configurable form, you still specify `count`, and you may optionally provide size hints that control how vLLM profiles and reserves memory for your multi‑modal inputs. This helps you tune memory for the actual media you expect, instead of the model’s absolute maxima.
Configurable options by modality:
- `image`: `{"count": int, "width": int, "height": int}`
- `video`: `{"count": int, "num_frames": int, "width": int, "height": int}`
- `audio`: `{"count": int, "length": int}`
Details can be found in [`ImageDummyOptions`][vllm.config.multimodal.ImageDummyOptions], [`VideoDummyOptions`][vllm.config.multimodal.VideoDummyOptions], and [`AudioDummyOptions`][vllm.config.multimodal.AudioDummyOptions].
Examples:
```python
from vllm import LLM
# Up to 5 images per prompt, profile with 512x512.
# Up to 1 video per prompt, profile with 32 frames at 640x640.
llm = LLM(
model="Qwen/Qwen2.5-VL-3B-Instruct",
limit_mm_per_prompt={
"image": {"count": 5, "width": 512, "height": 512},
"video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
},
)
```
For backward compatibility, passing an integer works as before and is interpreted as `{"count": <int>}`. For example:
- `limit_mm_per_prompt={"image": 5}` is equivalent to `limit_mm_per_prompt={"image": {"count": 5}}`
- You can mix formats: `limit_mm_per_prompt={"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}`
!!! note
- The size hints affect memory profiling only. They shape the dummy inputs used to compute reserved activation sizes. They do not change how inputs are actually processed at inference time.
- If a hint exceeds what the model can accept, vLLM clamps it to the model's effective maximum and may log a warning.
!!! warning
These size hints currently only affect activation memory profiling. Encoder cache size is determined by the actual inputs at runtime and is not limited by these hints.
## Multi-modal processor arguments ## Multi-modal processor arguments
For certain models, you can adjust the multi-modal processor arguments to For certain models, you can adjust the multi-modal processor arguments to
...@@ -133,14 +172,14 @@ Here are some examples: ...@@ -133,14 +172,14 @@ Here are some examples:
from vllm import LLM from vllm import LLM
# Available for Qwen2-VL series models # Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
mm_processor_kwargs={ model="Qwen/Qwen2.5-VL-3B-Instruct",
"max_pixels": 768 * 768, # Default is 1280 * 28 * 28 mm_processor_kwargs={"max_pixels": 768 * 768}, # Default is 1280 * 28 * 28
}) )
# Available for InternVL series models # Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B", llm = LLM(
mm_processor_kwargs={ model="OpenGVLab/InternVL2-2B",
"max_dynamic_patch": 4, # Default is 12 mm_processor_kwargs={"max_dynamic_patch": 4}, # Default is 12
}) )
``` ```
...@@ -7,8 +7,6 @@ vLLM uses the following environment variables to configure the system: ...@@ -7,8 +7,6 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
??? code ```python
--8<-- "vllm/envs.py:env-vars-definition"
```python ```
--8<-- "vllm/envs.py:env-vars-definition"
```
...@@ -27,15 +27,11 @@ You can monitor the number of preemption requests through Prometheus metrics exp ...@@ -27,15 +27,11 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture. In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
[](){ #chunked-prefill }
## Chunked Prefill ## Chunked Prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations. Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics. In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
This policy has two benefits: This policy has two benefits:
...@@ -100,7 +96,7 @@ from vllm import LLM ...@@ -100,7 +96,7 @@ from vllm import LLM
llm = LLM( llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct", model="meta-llama/Llama-3.3-70B-Instruct",
tensor_parallel_size=4, tensor_parallel_size=4,
pipeline_parallel_size=2 pipeline_parallel_size=2,
) )
``` ```
...@@ -174,14 +170,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u ...@@ -174,14 +170,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
Known supported models (with corresponding benchmarks): Known supported models (with corresponding benchmarks):
- dots_ocr (<gh-pr:25466>) - dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
- GLM-4.1V or above (<gh-pr:23168>) - GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
- InternVL (<gh-pr:23909>) - InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
- Kimi-VL (<gh-pr:23817>) - Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
- Llama4 (<gh-pr:18368>) - Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>) - MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>) - Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
- Step3 (<gh-pr:22697>) - Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
## Input Processing ## Input Processing
...@@ -257,18 +253,24 @@ Examples: ...@@ -257,18 +253,24 @@ Examples:
```python ```python
# Use a larger cache # Use a larger cache
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
mm_processor_cache_gb=8) model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_cache_gb=8,
)
# Use a shared-memory based IPC cache # Use a shared-memory based IPC cache
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
tensor_parallel_size=2, model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_cache_type="shm", tensor_parallel_size=2,
mm_processor_cache_gb=8) mm_processor_cache_type="shm",
mm_processor_cache_gb=8,
)
# Disable the cache # Disable the cache
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", llm = LLM(
mm_processor_cache_gb=0) model="Qwen/Qwen2.5-VL-3B-Instruct",
mm_processor_cache_gb=0,
)
``` ```
### Cache Placement ### Cache Placement
......
...@@ -5,7 +5,7 @@ The `vllm serve` command is used to launch the OpenAI-compatible server. ...@@ -5,7 +5,7 @@ The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments ## CLI Arguments
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
To see the available options, take a look at the [CLI Reference](../cli/README.md#options)! To see the available options, take a look at the [CLI Reference](../cli/README.md)!
## Configuration file ## Configuration file
......
# TPU Optimization Tips
This doc is a collection of tips for optimizing vLLM workloads on TPU.
## Get started
Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
### TPU workload sizing
When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed.
The following colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you:
- KV cache size requirement per token and per request
- TPU/GPU memory consumed by the model weights
- TPU/GPU memory allocated for the KV cache
- Approximate maximum number of concurrent requests you can set (`--max-num-seqs`)
This approach serves as a general rule of thumb.
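The sizing logic can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not the linked calculator's implementation: the function names, the 0.9 utilization factor, and the example model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) are all illustrative assumptions.

```python
GIB = 1024**3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed for one token (hypothetical helper)."""
    # The leading 2 accounts for storing both a key and a value per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def approx_max_requests(hbm_bytes, weight_bytes, per_token_bytes,
                        avg_context_len, utilization=0.9):
    """Rough upper bound on concurrent requests (assumed formula)."""
    # Memory left for the KV cache after the weights are loaded.
    kv_budget = hbm_bytes * utilization - weight_bytes
    if kv_budget <= 0:
        return 0
    return int(kv_budget // (per_token_bytes * avg_context_len))


# Illustrative numbers only: 32 GiB of HBM, 16 GiB of weights, 2k-token contexts.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
estimate = approx_max_requests(32 * GIB, 16 * GIB, per_token, avg_context_len=2048)
print(per_token, estimate)
```

Treat the result as a starting point for `--max-num-seqs`, then adjust it based on measured latency and throughput.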
#### Latency-throughput tradeoff
As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency.
`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request.
Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload.
Similarly, `--max-num-batched-tokens` can be lowered to improve latency or raised to improve throughput.
#### Compilation and Caching
Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.
To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). The duration varies widely, from a few minutes to an hour, depending on the size of the model and the context length used.
Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs.
Use the `VLLM_XLA_CACHE_PATH` environment variable to write the cache to shareable storage so that nodes deployed later (for example, when autoscaling) can reuse it.
#### Reducing compilation time
This initial compilation time varies widely and is affected by many of the arguments discussed in this doc. Factors that influence compile time include model size and `--max-num-batched-tokens`. You can also tune settings such as `VLLM_TPU_MOST_MODEL_LEN`.
### Optimize based on your data
#### max-model-len vs. most-model-len
![most_model_len](../assets/design/tpu/most_model_len.png)
If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most-model-len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
For example, suppose 1% of requests are 32k tokens long and 99% are 2k tokens long. You can set `--max-model-len 32768` and use `VLLM_TPU_MOST_MODEL_LEN=2048`.
Requests are divided into max-model-len and most-model-len categories. For the latter category, you can gain better performance since the server can process more requests at a time.
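As a minimal sketch of the example above, the launch command and environment can be assembled like this. The helper `build_launch` and the model name are hypothetical placeholders; only `vllm serve`, `--max-model-len`, and `VLLM_TPU_MOST_MODEL_LEN` come from this doc.

```python
import os
import shlex


def build_launch(model, max_model_len, most_model_len):
    """Return (command, environment) for a `vllm serve` launch (hypothetical helper)."""
    env = dict(os.environ)
    # Length that covers the bulk of the requests.
    env["VLLM_TPU_MOST_MODEL_LEN"] = str(most_model_len)
    # Upper bound that still admits the occasional long request.
    cmd = ["vllm", "serve", model, "--max-model-len", str(max_model_len)]
    return cmd, env


# "your-org/your-model" is a placeholder, not a real model name.
cmd, env = build_launch("your-org/your-model", max_model_len=32768, most_model_len=2048)
print(f"VLLM_TPU_MOST_MODEL_LEN={env['VLLM_TPU_MOST_MODEL_LEN']} {shlex.join(cmd)}")
# To actually start the server: subprocess.run(cmd, env=env, check=True)
```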
#### Padding
For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128 (e.g., 128, 256, etc.)
The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are 2 ways to pad the requests:
1. the default exponential padding (pad up to the next power of 2)
2. bucket padding (pad to the nearest linearly increasing bucket).
When using bucket padding, the buckets start at 16, double until they reach `VLLM_TPU_BUCKET_PADDING_GAP`, and then increase by that gap until they reach max_model_len.
For example, with max_model_len=512 and padding_gap=64, the buckets are [16, 32, 64, 128, 192, 256, 320, 384, 448, 512].
The fewer tokens you pad, the less unnecessary computation the TPU does and the better the performance. For example, with num_tokens=300, exponential padding pads to 512, while the bucket padding above pads to 320.
However, choose the padding gap carefully. A gap that is too small produces a large number of buckets, which increases warmup (precompile) time and the memory needed to store the compiled graphs. Too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement over the default exponential padding.
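The two padding schemes can be compared with a small sketch. The functions below are illustrative helpers written to reproduce the example in this section; they are not vLLM's own implementation, and the doubling phase below the gap is inferred from the example bucket list.

```python
from bisect import bisect_left


def exponential_paddings(max_model_len, min_size=16):
    """Powers of two from min_size up to the first value >= max_model_len."""
    buckets, size = [], min_size
    while True:
        buckets.append(size)
        if size >= max_model_len:
            return buckets
        size *= 2


def bucket_paddings(max_model_len, padding_gap, min_size=16):
    """Double from min_size up to the gap, then step by the gap."""
    if padding_gap < min_size:
        raise ValueError("padding_gap must be at least min_size")
    buckets, size = [], min_size
    while size <= padding_gap:
        buckets.append(size)
        size *= 2
    size //= 2  # step back to the last bucket that was appended
    while size < max_model_len:
        size += padding_gap
        buckets.append(size)
    return buckets


def padded_len(num_tokens, buckets):
    """Smallest bucket that can hold num_tokens."""
    idx = bisect_left(buckets, num_tokens)
    if idx == len(buckets):
        raise ValueError("num_tokens exceeds the largest bucket")
    return buckets[idx]


linear = bucket_paddings(max_model_len=512, padding_gap=64)
exponential = exponential_paddings(max_model_len=512)
print(linear)
print(padded_len(300, linear), padded_len(300, exponential))
```

Counting `len(bucket_paddings(...))` for a candidate gap gives a quick sense of how many shapes would need to be precompiled during warmup.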
#### Quantization
If possible, use the precision that matches the chip’s hardware acceleration:
- v5e has int4/int8 hardware acceleration in the MXU
- v6e has int4/int8 hardware acceleration in the MXU
Supported quantized formats and features in vLLM on TPU [Jul '25]:
- INT8 W8A8
- INT8 W8A16
- FP8 KV cache
- [WIP] FP8 W8A8
- [WIP] AWQ
- [WIP] FP4 W4A8
#### Parallelization
Don't set TP to be less than the number of chips on a single-host deployment.
Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types).
### Tune your workloads
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
### Future Topics We'll Cover
#### Profiling
The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.
#### SPMD
More details to come.
**Want us to cover something that isn't listed here? Please open an issue and cite this doc. We'd love to hear your questions or tips.**