Merge tag 'v0.12.0' into v0.12.0-dev

41199996 · zhuwenwen · 31021d81 · 4fd9d6a8 · 41199996 · 41199996
Commit 41199996 authored Dec 13, 2025 by zhuwenwen
20 changed files
--- a/docs/cli/.nav.yml
+++ b/docs/cli/.nav.yml
@@ -5,4 +5,4 @@ nav:
  - complete.md
  - run-batch.md
  - vllm bench:
-    - bench/*.md
+    - bench/**/*.md
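The switch from `bench/*.md` to `bench/**/*.md` is what lets the navigation pick up the new pages under `docs/cli/bench/sweep/`. The following is a minimal sketch of the difference. It uses Python's `pathlib` globbing as a stand-in, which is an assumption: the docs navigation plugin may resolve globs slightly differently. The throwaway directory mirrors a few file names from this commit.

```python
import tempfile
from pathlib import Path


def build_tree(root: Path) -> None:
    """Create placeholder pages that mirror the layout touched by this commit."""
    for rel in (
        "bench/latency.md",
        "bench/serve.md",
        "bench/sweep/plot.md",
        "bench/sweep/serve_sla.md",
    ):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("# placeholder\n")


def match(root: Path, pattern: str) -> list[str]:
    """Return the sorted, root-relative paths matched by a glob pattern."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tree(root)
    flat = match(root, "bench/*.md")          # direct children only
    recursive = match(root, "bench/**/*.md")  # children and nested pages

print(flat)
print(recursive)
```

With `pathlib`, the flat pattern stops at the `bench/` directory, while the recursive one also reaches the `sweep/` subdirectory.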
--- a/docs/cli/bench/latency.md
+++ b/docs/cli/bench/latency.md
@@ -4,6 +4,6 @@
 --8<-- "docs/cli/json_tip.inc.md"
-## Options
+## Arguments
---8<-- "docs/argparse/bench_latency.md"
+--8<-- "docs/argparse/bench_latency.inc.md"
--- a/docs/cli/bench/serve.md
+++ b/docs/cli/bench/serve.md
@@ -4,6 +4,6 @@
 --8<-- "docs/cli/json_tip.inc.md"
-## Options
+## Arguments
---8<-- "docs/argparse/bench_serve.md"
+--8<-- "docs/argparse/bench_serve.inc.md"
--- a/docs/cli/bench/sweep/plot.md
+++ b/docs/cli/bench/sweep/plot.md
+# vllm bench sweep plot
+## JSON CLI Arguments
+--8<-- "docs/cli/json_tip.inc.md"
+## Arguments
+--8<-- "docs/argparse/bench_sweep_plot.inc.md"
--- a/docs/cli/bench/sweep/plot_pareto.md
+++ b/docs/cli/bench/sweep/plot_pareto.md
+# vllm bench sweep plot_pareto
+## JSON CLI Arguments
+--8<-- "docs/cli/json_tip.inc.md"
+## Arguments
+--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"
--- a/docs/cli/bench/sweep/serve.md
+++ b/docs/cli/bench/sweep/serve.md
+# vllm bench sweep serve
+## JSON CLI Arguments
+--8<-- "docs/cli/json_tip.inc.md"
+## Arguments
+--8<-- "docs/argparse/bench_sweep_serve.inc.md"
--- a/docs/cli/bench/sweep/serve_sla.md
+++ b/docs/cli/bench/sweep/serve_sla.md
+# vllm bench sweep serve_sla
+## JSON CLI Arguments
+--8<-- "docs/cli/json_tip.inc.md"
+## Arguments
+--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"
--- a/docs/cli/bench/throughput.md
+++ b/docs/cli/bench/throughput.md
@@ -4,6 +4,6 @@
 --8<-- "docs/cli/json_tip.inc.md"
-## Options
+## Arguments
---8<-- "docs/argparse/bench_throughput.md"
+--8<-- "docs/argparse/bench_throughput.inc.md"
--- a/docs/cli/chat.md
+++ b/docs/cli/chat.md
 # vllm chat
-## Options
+## Arguments
---8<-- "docs/argparse/chat.md"
+--8<-- "docs/argparse/chat.inc.md"
--- a/docs/cli/complete.md
+++ b/docs/cli/complete.md
 # vllm complete
-## Options
+## Arguments
---8<-- "docs/argparse/complete.md"
+--8<-- "docs/argparse/complete.inc.md"
--- a/docs/cli/run-batch.md
+++ b/docs/cli/run-batch.md
@@ -4,6 +4,6 @@
 --8<-- "docs/cli/json_tip.inc.md"
-## Options
+## Arguments
---8<-- "docs/argparse/run-batch.md"
+--8<-- "docs/argparse/run-batch.inc.md"
--- a/docs/cli/serve.md
+++ b/docs/cli/serve.md
@@ -4,6 +4,6 @@
 --8<-- "docs/cli/json_tip.inc.md"
-## Options
+## Arguments
---8<-- "docs/argparse/serve.md"
+--8<-- "docs/argparse/serve.inc.md"
--- a/docs/community/meetups.md
+++ b/docs/community/meetups.md
 # Meetups
-We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
+We host regular meetups around the world. At each meetup, the vLLM team shares project updates, and guest speakers from industry share their experience and insights.
+## Upcoming Meetups
+Stay tuned for upcoming meetups! Follow us on [Twitter/X](https://x.com/vllm_project), join our [Slack](https://slack.vllm.ai), and follow vLLM on [Luma](https://luma.com/vLLM-Meetups) to get notified about new events.
+## Past Meetups
+Below you'll find slides and recordings from our previous meetups:
+- [vLLM Bangkok Meetup](https://luma.com/v0f647nv), November 21st 2025. [[Slides]](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing)
+- [vLLM Zurich Meetup](https://luma.com/0gls27kb), November 6th 2025. [[Slides]](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) [[Recording]](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
+- [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w), November 1st 2025. [[Slides]](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link)
+- [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg), October 25th 2025. [[Slides]](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6)
+- [vLLM Toronto Meetup](https://luma.com/e80e0ymm), September 25th 2025. [[Slides]](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing)
 - [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ), August 30th 2025. [[Slides]](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA)
 - [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet), August 27th 2025. [[Slides]](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing)
 - [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg), August 23rd 2025. [[Slides]](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH)
@@ -22,4 +35,12 @@ We host regular meetups in San Francisco Bay Area every 2 months. We will share
 - [The second vLLM meetup](https://lu.ma/ygxbpzhl), with IBM Research, January 31st 2024. [[Slides]](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing) [[Video (vLLM Update)]](https://youtu.be/Y0C-DUvEnZQ) [[Video (IBM Research & torch.compile)]](https://youtu.be/m0dMtFLI-dg)
 - [The first vLLM meetup](https://lu.ma/first-vllm-meetup), with a16z, October 5th 2023. [[Slides]](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing)
-We are always looking for speakers and sponsors at San Francisco Bay Area and potentially other locations. If you are interested in speaking or sponsoring, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu).
+## Get Involved
+**Want to host or speak at a vLLM meetup?** We're always looking for speakers and sponsors for our meetups. Let us know if you want to:
+- Share your vLLM feature, use case, project extension, or deployment experience
+- Host a meetup in your city
+- Sponsor an event
+Please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu).
--- a/docs/community/sponsors.md
+++ b/docs/community/sponsors.md
@@ -34,6 +34,7 @@ Compute Resources:
 - Trainy
 - UC Berkeley
 - UC San Diego
+- Volcengine
 Slack Sponsor: Anyscale

--- a/docs/configuration/README.md
+++ b/docs/configuration/README.md
@@ -4,6 +4,6 @@ This section lists the most common options for running vLLM.
 There are three main levels of configuration, from highest priority to lowest priority:
-- [Request parameters][completions-api] and [input arguments][sampling-params]
+- [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
 - [Engine arguments](./engine_args.md)
 - [Environment variables](./env_vars.md)
--- a/docs/configuration/conserving_memory.md
+++ b/docs/configuration/conserving_memory.md
@@ -11,8 +11,7 @@ The following code splits the model across 2 GPUs.
 ```python
 from vllm import LLM
-llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
-          tensor_parallel_size=2)
+llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
 ```
 !!! warning
@@ -24,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
 !!! note
    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
-    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+    You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
 ## Quantization
@@ -43,30 +42,25 @@ and the maximum batch size (`max_num_seqs` option).
 ```python
 from vllm import LLM
-llm = LLM(model="adept/fuyu-8b",
-          max_model_len=2048,
-          max_num_seqs=2)
+llm = LLM(model="adept/fuyu-8b", max_model_len=2048, max_num_seqs=2)
 ```
 ## Reduce CUDA Graphs
 By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
-!!! warning
-    CUDA graph capture takes up more memory in V1 than in V0.
 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
 ??? code
    ```python
    from vllm import LLM
-    from vllm.config import CompilationConfig, CompilationLevel
+    from vllm.config import CompilationConfig, CompilationMode
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        compilation_config=CompilationConfig(
-            level=CompilationLevel.PIECEWISE,
+            mode=CompilationMode.VLLM_COMPILE,
            # By default, it goes up to max_num_seqs
            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
        ),
@@ -78,8 +72,7 @@ You can disable graph capturing completely via the `enforce_eager` flag:
 ```python
 from vllm import LLM
-llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
-          enforce_eager=True)
+llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
 ```
 ## Adjust cache size
@@ -97,8 +90,10 @@ You can allow a smaller number of multi-modal items per prompt to reduce the mem
 from vllm import LLM
 # Accept up to 3 images and 1 video per prompt
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          limit_mm_per_prompt={"image": 3, "video": 1})
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    limit_mm_per_prompt={"image": 3, "video": 1},
+)
 ```
 You can go a step further and disable unused modalities completely by setting its limit to zero.
@@ -108,8 +103,10 @@ For example, if your application only accepts image input, there is no need to a
 from vllm import LLM
 # Accept any number of images but no videos
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          limit_mm_per_prompt={"video": 0})
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    limit_mm_per_prompt={"video": 0},
+)
 ```
 You can even run a multi-modal model for text-only inference:
@@ -118,10 +115,52 @@ You can even run a multi-modal model for text-only inference:
 from vllm import LLM
 # Don't accept images. Just text.
-llm = LLM(model="google/gemma-3-27b-it",
-          limit_mm_per_prompt={"image": 0})
+llm = LLM(
+    model="google/gemma-3-27b-it",
+    limit_mm_per_prompt={"image": 0},
+)
 ```
+### Configurable options
+`limit_mm_per_prompt` also accepts configurable options per modality. In the configurable form, you still specify `count`, and you may optionally provide size hints that control how vLLM profiles and reserves memory for your multi‑modal inputs. This helps you tune memory for the actual media you expect, instead of the model’s absolute maxima.
+Configurable options by modality:
+- `image`: `{"count": int, "width": int, "height": int}`
+- `video`: `{"count": int, "num_frames": int, "width": int, "height": int}`
+- `audio`: `{"count": int, "length": int}`
+Details can be found in [`ImageDummyOptions`][vllm.config.multimodal.ImageDummyOptions], [`VideoDummyOptions`][vllm.config.multimodal.VideoDummyOptions], and [`AudioDummyOptions`][vllm.config.multimodal.AudioDummyOptions].
+Examples:
+```python
+from vllm import LLM
+# Up to 5 images per prompt, profile with 512x512.
+# Up to 1 video per prompt, profile with 32 frames at 640x640.
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    limit_mm_per_prompt={
+        "image": {"count": 5, "width": 512, "height": 512},
+        "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
+    },
+)
+```
+For backward compatibility, passing an integer works as before and is interpreted as `{"count": <int>}`. For example:
+- `limit_mm_per_prompt={"image": 5}` is equivalent to `limit_mm_per_prompt={"image": {"count": 5}}`
+- You can mix formats: `limit_mm_per_prompt={"image": 5, "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640}}`
+!!! note
+    - The size hints affect memory profiling only. They shape the dummy inputs used to compute reserved activation sizes. They do not change how inputs are actually processed at inference time.
+    - If a hint exceeds what the model can accept, vLLM clamps it to the model's effective maximum and may log a warning.
+!!! warning
+    These size hints currently only affect activation memory profiling. Encoder cache size is determined by the actual inputs at runtime and is not limited by these hints.
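The backward-compatibility rule described in this hunk (a bare integer is read as `{"count": <int>}`, and the two forms can be mixed) can be illustrated with a small sketch. `normalize_limits` is a hypothetical helper written for this note, not a vLLM function, and it does not reproduce vLLM's own validation or clamping.

```python
def normalize_limits(limits: dict) -> dict:
    """Expand integer limits into the configurable dict form.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    normalized = {}
    for modality, value in limits.items():
        if isinstance(value, int):
            # Legacy form: a bare integer means "count only, no size hints".
            normalized[modality] = {"count": value}
        else:
            if "count" not in value:
                raise ValueError(f"{modality!r} options must include 'count'")
            # Copy so the caller's dict is left untouched.
            normalized[modality] = dict(value)
    return normalized


mixed = {
    "image": 5,
    "video": {"count": 1, "num_frames": 32, "width": 640, "height": 640},
}
print(normalize_limits(mixed))
```

Under this reading, `{"image": 5}` and `{"image": {"count": 5}}` normalize to the same value.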
 ## Multi-modal processor arguments
 For certain models, you can adjust the multi-modal processor arguments to
@@ -133,14 +172,14 @@ Here are some examples:
 from vllm import LLM
 # Available for Qwen2-VL series models
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_kwargs={
-              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-          })
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    mm_processor_kwargs={"max_pixels": 768 * 768},  # Default is 1280 * 28 * 28
+)
 # Available for InternVL series models
-llm = LLM(model="OpenGVLab/InternVL2-2B",
-          mm_processor_kwargs={
-              "max_dynamic_patch": 4,  # Default is 12
-          })
+llm = LLM(
+    model="OpenGVLab/InternVL2-2B",
+    mm_processor_kwargs={"max_dynamic_patch": 4},  # Default is 12
+)
 ```
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
@@ -7,8 +7,6 @@ vLLM uses the following environment variables to configure the system:
    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-??? code
-    ```python
-    --8<-- "vllm/envs.py:env-vars-definition"
-    ```
+```python
+--8<-- "vllm/envs.py:env-vars-definition"
+```
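To see why a Kubernetes service named `vllm` is risky, it helps to list the variable names Kubernetes injects for a service. The sketch below follows the naming scheme in the linked Kubernetes documentation (service name upper-cased, dashes turned into underscores). The function names, the port `8000`, and the alternative service name `llm-server` are illustrative assumptions.

```python
def k8s_service_env_names(service_name: str, port: int) -> list[str]:
    """Names of the environment variables Kubernetes injects for a TCP service."""
    prefix = service_name.upper().replace("-", "_")
    tcp = f"{prefix}_PORT_{port}_TCP"
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        tcp,
        f"{tcp}_PROTO",
        f"{tcp}_PORT",
        f"{tcp}_ADDR",
    ]


def in_vllm_namespace(names: list[str]) -> list[str]:
    """Keep only the names that fall into vLLM's `VLLM_` prefix."""
    return [name for name in names if name.startswith("VLLM_")]


risky = in_vllm_namespace(k8s_service_env_names("vllm", 8000))
safe = in_vllm_namespace(k8s_service_env_names("llm-server", 8000))
print(risky)
print(safe)
```

Every variable generated for the `vllm` service lands in the `VLLM_` namespace, while a differently named service produces none.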
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -27,15 +27,11 @@ You can monitor the number of preemption requests through Prometheus metrics exp
 In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
-[](){ #chunked-prefill }
 ## Chunked Prefill
 Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
-In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
+In V1, **chunked prefill is enabled by default whenever possible**. With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
-With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
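The scheduling policy described above can be sketched as a toy, single-step scheduler. This is a simplified model under stated assumptions, not vLLM's scheduler: each decode request is assumed to consume one token of the budget per step, and `Request` and `schedule_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    remaining_prefill: int  # prompt tokens that still need to be processed


def schedule_step(
    decodes: list[str],
    prefills: list[Request],
    max_num_batched_tokens: int,
) -> list[tuple[str, int]]:
    """Return (request name, tokens scheduled) pairs for one step."""
    budget = max_num_batched_tokens
    batch = []
    # 1. Decode requests go first; each takes one token of the budget.
    for name in decodes:
        if budget == 0:
            break
        batch.append((name, 1))
        budget -= 1
    # 2. Leftover budget goes to pending prefills, chunking any that do not fit.
    for req in prefills:
        if budget == 0:
            break
        chunk = min(req.remaining_prefill, budget)
        batch.append((req.name, chunk))
        req.remaining_prefill -= chunk
        budget -= chunk
    return batch


pending = [Request("p1", remaining_prefill=10)]
step = schedule_step(["d1", "d2"], pending, max_num_batched_tokens=8)
print(step, pending[0].remaining_prefill)
```

In this run the two decodes use 2 tokens of the budget, the prefill gets a 6-token chunk, and its remaining 4 tokens wait for a later step.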
 This policy has two benefits:
@@ -100,7 +96,7 @@ from vllm import LLM
 llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
-    pipeline_parallel_size=2
+    pipeline_parallel_size=2,
 )
 ```
@@ -174,14 +170,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
 Known supported models (with corresponding benchmarks):
-- dots_ocr (<gh-pr:25466>)
-- GLM-4.1V or above (<gh-pr:23168>)
-- InternVL (<gh-pr:23909>)
-- Kimi-VL (<gh-pr:23817>)
-- Llama4 (<gh-pr:18368>)
-- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>)
-- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>)
-- Step3 (<gh-pr:22697>)
+- dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
+- GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
+- InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
+- Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
+- Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
+- MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
+- Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
+- Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
 ## Input Processing
@@ -257,18 +253,24 @@ Examples:
 ```python
 # Use a larger cache
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_cache_gb=8)
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    mm_processor_cache_gb=8,
+)
 # Use a shared-memory based IPC cache
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          tensor_parallel_size=2,
-          mm_processor_cache_type="shm",
-          mm_processor_cache_gb=8)
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    tensor_parallel_size=2,
+    mm_processor_cache_type="shm",
+    mm_processor_cache_gb=8,
+)
 # Disable the cache
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_cache_gb=0)
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    mm_processor_cache_gb=0,
+)
 ```
 ### Cache Placement

--- a/docs/configuration/serve_args.md
+++ b/docs/configuration/serve_args.md
@@ -5,7 +5,7 @@ The `vllm serve` command is used to launch the OpenAI-compatible server.
 ## CLI Arguments
 The `vllm serve` command is used to launch the OpenAI-compatible server.
-To see the available options, take a look at the [CLI Reference](../cli/README.md#options)!
+To see the available options, take a look at the [CLI Reference](../cli/README.md)!
 ## Configuration file

--- a/docs/configuration/tpu.md
+++ b/docs/configuration/tpu.md
-# TPU Optimization Tips
-This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload.
-## Get started
-Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
-### TPU workload sizing
-When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed.
-The following colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you:
-- KV cache size requirement per token and per request
-- TPU/GPU memory consumed by the model weights
-- TPU/GPU memory allocated for the KV cache
-- Maximum \# of requests you can approximately set (--max-num-seqs)
-This approach serves as a general rule of thumb.
-#### Latency-throughput tradeoff
-As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency.
-`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request.
-Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload.
-In a similar way, `--max-num-batch-tokens` can be adjusted down to improve latency, or adjusted up to improve throughput.
-#### Compilation and Caching
-Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.
-To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can range significantly, anywhere from a few minutes to an hour depending on the size of the model and context length used.
-Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs.
-Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling).
-#### Reducing compilation time
-This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`.
-### Optimize based on your data
-#### max-model-len vs. most-model-len
-![most_model_len](../assets/design/tpu/most_model_len.png)
-If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most-model-len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
-For example, 1% requests are 32k length and 99% requests are 2k length. You can pass 32k into `--max-model-len 32768` and use `VLLM_TPU_MOST_MODEL_LEN=2048`.
-The requests get subdivided into max-model-len and most-model-len categories, for the latter category, you can gain better performance since the server can process more requests at a time.
-#### Padding
-For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128 (e.g., 128, 256, etc.)
-The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are 2 ways to pad the requests:
-1. the default exponential padding (pad to the nearest power of 2)
-2. bucket padding (pad to the nearest linearly increasing bucket).
-When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`.
-For example, max_model_len=512, padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512].
-The fewer tokens you pad, the less unnecessary computation TPU does, the better performance you can get. For example, if num_tokens=300, with exponential padding, you pad to 512, with the bucket_padding above, you pad to 320.
-However, you need to be careful to choose the padding gap. If the gap is too small, it means the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graph. Too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding.
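The padding arithmetic in the removed TPU guide can be reproduced with a short sketch. The bucket rule below is inferred from the worked example in that text (doubling from 16 up to the gap, then stepping by the gap), so treat it as an assumption rather than the actual implementation. All function names are hypothetical, and the sketch assumes `16 <= padding_gap <= max_model_len`.

```python
from bisect import bisect_left


def bucket_sizes(max_model_len: int, padding_gap: int, min_size: int = 16) -> list[int]:
    """Padding buckets: double from min_size up to the gap, then step by the gap."""
    buckets = []
    size = min_size
    while size < padding_gap:        # doubling phase: 16, 32, ...
        buckets.append(size)
        size *= 2
    size = padding_gap
    while size < max_model_len:      # linear phase: gap, 2 * gap, ...
        buckets.append(size)
        size += padding_gap
    buckets.append(max_model_len)    # the last bucket is always max_model_len
    return buckets


def pad_to_bucket(num_tokens: int, buckets: list[int]) -> int:
    """Smallest bucket that can hold num_tokens."""
    return buckets[bisect_left(buckets, num_tokens)]


def pad_exponential(num_tokens: int) -> int:
    """Smallest power of two that is >= num_tokens."""
    return 1 << (num_tokens - 1).bit_length()


buckets = bucket_sizes(max_model_len=512, padding_gap=64)
print(buckets)
print(pad_exponential(300), pad_to_bucket(300, buckets))
```

For a 300-token request, exponential padding adds 212 padding tokens, while the bucket scheme with a gap of 64 adds 20.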
-#### Quantization
-If possible, use the precision that matches the chip’s hardware acceleration:
-- v5e has int4/int8 hardware acceleration in the MXU
-- v6e has int4/int8 hardware acceleration in the MXU
-Supported quantized formats and features in vLLM on TPU [Jul '25]:
-- INT8 W8A8
-- INT8 W8A16
-- FP8 KV cache
-- [WIP] FP8 W8A8
-- [WIP] AWQ
-- [WIP] FP4 W4A8
-#### Parallelization
-Don't set TP to be less than the number of chips on a single-host deployment.
-Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types).
-### Tune your workloads
-Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
-### Future Topics We'll Cover
-#### Profiling
-The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.
-#### SPMD
-More details to come.
-**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.**