Merge tag 'v0.10.0' into v0.10.0-dev

Commit 711aa9d5 authored Jul 30, 2025 by zhuwenwen
20 changed files
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
---
+# Structured Outputs
-title: Structured Outputs
---
-[](){ #structured-outputs }
 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -21,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
 Structured outputs are supported by default in the OpenAI-Compatible Server. You
 may choose to specify the backend to use by setting the
@@ -33,7 +30,7 @@ text.
 Now let's see an example for each of the cases, starting with the `guided_choice`, as it's the easiest one:
-??? Code
+??? code
    ```python
    from openai import OpenAI
@@ -55,7 +52,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
 The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
-??? Code
+??? code
    ```python
    completion = client.chat.completions.create(
@@ -79,7 +76,7 @@ For this we can use the `guided_json` parameter in two different ways:
 The next example shows how to use the `guided_json` parameter with a Pydantic model:
-??? Code
+??? code
    ```python
    from pydantic import BaseModel
@@ -127,7 +124,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use it to define a specific format of simplified SQL queries:
-??? Code
+??? code
    ```python
    simplified_sql_grammar = """
@@ -157,7 +154,7 @@ As an example, we can use to define a specific format of simplified SQL queries:
    print(completion.choices[0].message.content)
    ```
-See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
+See also: [full example](../examples/online_serving/structured_outputs.md)
 ## Reasoning Outputs
@@ -169,7 +166,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
 Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
-??? Code
+??? code
    ```python
    from pydantic import BaseModel
@@ -200,7 +197,7 @@ Note that you can use reasoning with any provided structured outputs feature. Th
    print("content: ", completion.choices[0].message.content)
    ```
-See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
+See also: [full example](../examples/online_serving/structured_outputs.md)
 ## Experimental Automatic Parsing (OpenAI API)
@@ -212,7 +209,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
 Here is a simple example demonstrating how to get structured output using Pydantic models:
-??? Code
+??? code
    ```python
    from pydantic import BaseModel
@@ -248,7 +245,7 @@ Age: 28
 Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
-??? Code
+??? code
    ```python
    from typing import List
@@ -308,7 +305,7 @@ These parameters can be used in the same way as the parameters from the Online
 Serving examples above. One example for the usage of the `choice` parameter is
 shown below:
-??? Code
+??? code
    ```python
    from vllm import LLM, SamplingParams
@@ -325,4 +322,4 @@ shown below:
    print(outputs[0].outputs[0].text)
    ```
-See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
+See also: [full example](../examples/online_serving/structured_outputs.md)
--- a/docs/features/tool_calling.md
+++ b/docs/features/tool_calling.md
 # Tool Calling
-vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`) and `none` options for the `tool_choice` field in the chat completion API.
+vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`), and `none` options for the `tool_choice` field in the chat completion API.
 ## Quickstart
-Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3 tool calling chat template from the vLLM examples directory:
+Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the `llama3_json` tool calling chat template from the vLLM examples directory:
 ```bash
 vllm serve meta-llama/Llama-3.1-8B-Instruct \
@@ -13,9 +13,9 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --chat-template examples/tool_chat_template_llama3.1_json.jinja
 ```
-Next, make a request to the model that should result in it using the available tools:
+Next, make a request that triggers the model to use the available tools:
-??? Code
+??? code
    ```python
    from openai import OpenAI
@@ -73,7 +73,7 @@ This example demonstrates:
 You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
-Remember that it's the callers responsibility to:
+Remember that it's the caller's responsibility to:
 1. Define appropriate tools in the request
 2. Include relevant context in the chat messages
@@ -84,7 +84,7 @@ For more advanced usage, including parallel tool calls and different model-speci
 ## Named Function Calling
 vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
-enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
+enabled by default and will work with any supported model. You are guaranteed a validly-parsable function call - not a
 high-quality one.
 vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
@@ -95,7 +95,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha
 ## Required Function Calling
-vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine.
+vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The guided decoding features for `tool_choice='required'` (such as JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends is on the [roadmap](../usage/v1_guide.md#features) for the V1 engine.
 When `tool_choice='required'` is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
@@ -103,24 +103,22 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m
 vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
-By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option.
+However, when `tool_choice='none'` is specified, vLLM still includes tool definitions in the prompt.
-Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`.
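The `tool_choice` values described above (`auto`, `required`, `none`, and a named function) can be sketched as a small helper that builds the value sent in a chat completion request. This is only an illustration: the helper name `make_tool_choice` and the `"named"` mode label are hypothetical and not part of vLLM or the OpenAI client.

```python
from typing import Any, Optional


def make_tool_choice(mode: str, function_name: Optional[str] = None) -> Any:
    """Return a value for the `tool_choice` field of a chat completion request.

    `mode` is a hypothetical label used only in this sketch:
    "auto", "required", "none", or "named" (named function calling).
    """
    if mode in ("auto", "required", "none"):
        # These options are passed through as plain strings.
        return mode
    if mode == "named":
        if not function_name:
            raise ValueError("named function calling needs a function name")
        # Same shape as the named function calling example in the docs.
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unsupported tool_choice mode: {mode!r}")


print(make_tool_choice("required"))
print(make_tool_choice("named", "get_weather"))
```

The returned value would be passed as `tool_choice=...` alongside `tools=[...]` when calling `client.chat.completions.create`.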
 ## Automatic Function Calling
 To enable this feature, you should set the following flags:
-* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it
+* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it
 deems appropriate.
 * `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
-will continue to be added in the future, and also can register your own tool parsers in the `--tool-parser-plugin`.
+will continue to be added in the future. You can also register your own tool parsers with `--tool-parser-plugin`.
 * `--tool-parser-plugin` -- **optional** tool parser plugin used to register user-defined tool parsers into vLLM. The registered tool parser name can be specified in `--tool-call-parser`.
-* `--chat-template` -- **optional** for auto tool choice. the path to the chat template which handles `tool`-role messages and `assistant`-role messages
+* `--chat-template` -- **optional** for auto tool choice. It's the path to the chat template which handles `tool`-role messages and `assistant`-role messages
 that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their
 `tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat
 template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates)
-from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json)
+from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json).
 If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
@@ -132,7 +130,7 @@ All Nous Research Hermes-series models newer than Hermes 2 Pro should be support
 * `NousResearch/Hermes-2-Theta-*`
 * `NousResearch/Hermes-3-*`
-_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
+_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality and capabilities due to the merge
 step in their creation_.
 Flags: `--tool-call-parser hermes`
@@ -148,13 +146,13 @@ Known issues:
 1. Mistral 7B struggles to generate parallel tool calls correctly.
 2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
-much shorter than what vLLM generates. Since an exception is thrown when this condition
+   much shorter than what vLLM generates. Since an exception is thrown when this condition
-is not met, the following additional chat templates are provided:
+   is not met, the following additional chat templates are provided:
-* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
+    * <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
-it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
+      it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
-* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
+    * <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
-when tools are provided, that results in much better reliability when working with parallel tool calling.
+      when tools are provided, that results in much better reliability when working with parallel tool calling.
 Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`
@@ -168,17 +166,17 @@ All Llama 3.1, 3.2 and 4 models should be supported.
 * `meta-llama/Llama-3.2-*`
 * `meta-llama/Llama-4-*`
-The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
+The tool calling that is supported is the [JSON-based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for Llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
 Other tool calling formats, like the built-in Python tool calling or custom tool calling, are not supported.
 Known issues:
-1. Parallel tool calls are not supported for llama 3, but it is supported in llama 4 models.
+1. Parallel tool calls are not supported for Llama 3, but they are supported in Llama 4 models.
-2. The model can generate parameters with a wrong format, such as generating
+2. The model can generate parameters in an incorrect format, such as generating
   an array serialized as string instead of an array.
-VLLM provides two JSON based chat templates for Llama 3.1 and 3.2:
+vLLM provides two JSON-based chat templates for Llama 3.1 and 3.2:
 * <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1
 models, but tweaked so that it works better with vLLM.
@@ -187,7 +185,8 @@ images.
 Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
-VLLM also provides a pythonic and JSON based chat template for Llama 4, but pythonic tool calling is recommended:
+vLLM also provides a pythonic and a JSON-based chat template for Llama 4, but pythonic tool calling is recommended:
 * <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
 For Llama 4 models, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
@@ -198,21 +197,21 @@ Supported models:
 * `ibm-granite/granite-3.0-8b-instruct`
-Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
+    Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
-<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Huggingface. Parallel function calls are supported.
+    <gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Hugging Face. Parallel function calls are supported.
 * `ibm-granite/granite-3.1-8b-instruct`
-Recommended flags: `--tool-call-parser granite`
+    Recommended flags: `--tool-call-parser granite`
-The chat template from Huggingface can be used directly. Parallel function calls are supported.
+    The chat template from Hugging Face can be used directly. Parallel function calls are supported.
 * `ibm-granite/granite-20b-functioncalling`
-Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
+    Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
-<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
+    <gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
 ### InternLM Models (`internlm`)
@@ -248,10 +247,12 @@ The xLAM tool parser is designed to support models that generate tool calls in v
 Parallel function calls are supported, and the parser can effectively separate text content from tool calls.
 Supported models:
 * Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r`
 * Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r`
 Flags:
 * For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja`
 * For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja`
@@ -268,10 +269,10 @@ Flags: `--tool-call-parser hermes`
 Supported models:
-* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
+* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
-* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
+* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
-Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax.jinja`
+Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja`
 ### DeepSeek-V3 Models (`deepseek_v3`)
@@ -282,6 +283,25 @@ Supported models:
 Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
+### Kimi-K2 Models (`kimi_k2`)
+Supported models:
+* `moonshotai/Kimi-K2-Instruct`
+Flags: `--tool-call-parser kimi_k2`
+### Hunyuan Models (`hunyuan_a13b`)
+Supported models:
+* `tencent/Hunyuan-A13B-Instruct` (The chat template is already included in the Hugging Face model files.)
+Flags:
+* For non-reasoning: `--tool-call-parser hunyuan_a13b`
+* For reasoning: `--tool-call-parser hunyuan_a13b --reasoning-parser hunyuan_a13b --enable_reasoning`
 ### Models with Pythonic Tool Calls (`pythonic`)
 A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
@@ -299,28 +319,25 @@ Limitations:
 Example supported models:
-* `meta-llama/Llama-3.2-1B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
+* `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
-* `meta-llama/Llama-3.2-3B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
+* `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
 * `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
 * `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
-* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
+* `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
-* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
+* `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
 Flags: `--tool-call-parser pythonic --chat-template {see_above}`
---
+!!! warning
-**WARNING**
+    Llama's smaller models frequently fail to emit tool calls in the correct format. Results may vary depending on the model.
-Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary.
---
-## How to write a tool parser plugin
+## How to Write a Tool Parser Plugin
 A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.
 Here is a summary of a plugin file:
-??? Code
+??? code
    ```python

--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
---
+# Installation
-title: Installation
---
-[](){ #installation-index }
 vLLM supports the following hardware platforms:

--- a/docs/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -76,80 +76,62 @@ Currently, there are no pre-built CPU wheels.
 ### Build image from source
-??? Commands
+=== "Intel/AMD x86"
-    ```bash
-    docker build -f docker/Dockerfile.cpu \
-            --tag vllm-cpu-env \
-            --target vllm-openai .
-    # Launching OpenAI server
-    docker run --rm \
-                --privileged=true \
-                --shm-size=4g \
-                -p 8000:8000 \
-                -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-                -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
-                vllm-cpu-env \
-                --model=meta-llama/Llama-3.2-1B-Instruct \
-                --dtype=bfloat16 \
-                other vLLM OpenAI server arguments
-    ```
-!!! tip
+    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
-    For ARM or Apple silicon, use `docker/Dockerfile.arm`
+=== "ARM AArch64"
-!!! tip
+    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
-    For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
-## Supported features
+=== "Apple silicon"
-vLLM CPU backend supports the following vLLM features:
+    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
- Tensor Parallel
+=== "IBM Z (S390X)"
- Model Quantization (`INT8 W8A8, AWQ, GPTQ`)
+    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
- Chunked-prefill
- Prefix-caching
- FP8-E5M2 KV cache
 ## Related runtime environment variables
 - `VLLM_CPU_KVCACHE_SPACE`: specify the KV cache size (e.g., `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB of space for the KV cache). A larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the user's memory management pattern. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. By setting to `all`, the OpenMP threads of each rank uses all CPU cores available on the system. Default value is `auto`.
+- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. It can be set to CPU id lists or to `auto` (the default). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank0 are bound to CPU cores 0-31, and the OpenMP threads of rank1 are bound to CPU cores 32-63. When set to `auto`, each rank's OpenMP threads are bound to the CPU cores of its corresponding NUMA node.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `0`.
+- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `None`. If the value is not set and `auto` thread binding is used, no CPU is reserved when `world_size == 1`, and 1 CPU per rank is reserved when `world_size > 1`.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
+- `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
- `VLLM_CPU_SGL_KERNEL` (Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
+- `VLLM_CPU_SGL_KERNEL` (x86 only, Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
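To make the `VLLM_CPU_OMP_THREADS_BIND` format concrete, here is a minimal sketch of how a value such as `0-31|32-63` maps to per-rank core lists. This is not vLLM's own parser: the function names are hypothetical, and support for comma-separated ids (e.g. `0-3,8`) is an assumption of this sketch.

```python
from typing import List, Union


def parse_cpu_id_list(spec: str) -> List[int]:
    """Expand a CPU id list such as "0-3,8" into [0, 1, 2, 3, 8].

    Comma-separated entries are an assumption made for this sketch.
    """
    cores: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            cores.extend(range(int(start), int(end) + 1))  # inclusive range
        else:
            cores.append(int(part))
    return cores


def parse_threads_bind(value: str) -> Union[str, List[List[int]]]:
    """Return "auto", or one list of CPU ids per rank ("|" separates ranks)."""
    if value == "auto":
        return "auto"
    return [parse_cpu_id_list(rank_spec) for rank_spec in value.split("|")]


per_rank = parse_threads_bind("0-31|32-63")
# Two ranks, 32 cores for rank0, and rank1 spanning cores 32..63.
print(len(per_rank), len(per_rank[0]), per_rank[1][0], per_rank[1][-1])
```

Each inner list corresponds to one tensor parallel rank, so the number of `|`-separated groups should match the number of ranks being launched.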
-## Performance tips
+## FAQ
- We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run:
+### Which `dtype` should be used?
-```bash
+- Currently vLLM CPU uses the model's default settings as `dtype`. However, due to unstable float16 support in torch CPU, it is recommended to explicitly set `dtype=bfloat16` if there are any performance or accuracy problems.
-sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
-find / -name *libtcmalloc* # find the dynamic link library path
-export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
-python examples/offline_inference/basic/basic.py # run vLLM
-```
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
+### How to launch a vLLM service on CPU?
+- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPU 31 for the framework and use CPU 0-30 for inference threads:
 ```bash
 export VLLM_CPU_KVCACHE_SPACE=40
-export VLLM_CPU_OMP_THREADS_BIND=0-29
+export VLLM_CPU_OMP_THREADS_BIND=0-30
-vllm serve facebook/opt-125m
+vllm serve facebook/opt-125m --dtype=bfloat16
 ```
 or using default auto thread binding:
 ```bash
 export VLLM_CPU_KVCACHE_SPACE=40
-export VLLM_CPU_NUM_OF_RESERVED_CPU=2
+export VLLM_CPU_NUM_OF_RESERVED_CPU=1
-vllm serve facebook/opt-125m
+vllm serve facebook/opt-125m --dtype=bfloat16
 ```
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND` or using auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
+Note: it is recommended to manually reserve 1 CPU for the vLLM front-end process when `world_size == 1`.
+### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
+- The default `auto` thread binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and 1 CPU per rank is reserved for other vLLM components when `world_size > 1`. If you encounter performance problems or unexpected binding behaviours, try binding threads manually as follows.
+- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
-??? Commands
+??? console "Commands"
    ```console
    $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
@@ -178,34 +160,36 @@ vllm serve facebook/opt-125m
    $ python examples/offline_inference/basic/basic.py
    ```
- If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.
+- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and enabling tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank are on the same NUMA node to avoid cross-NUMA-node memory access.
-## Other considerations
+### How to decide `VLLM_CPU_KVCACHE_SPACE`?
- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.
+  - This value is 4GB by default. A larger space can support more concurrent requests and longer context lengths. However, users should take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`. If it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
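The per-rank memory rule above can be checked with simple arithmetic before launching. The sketch below is a rough estimate only: the function names and the example numbers are hypothetical, and it ignores any memory used beyond the weight shard and the KV cache.

```python
def rank_memory_gib(weight_shard_gib: float, kvcache_space_gib: float) -> float:
    """Memory used by one TP rank: its weight shard plus its KV cache space."""
    return weight_shard_gib + kvcache_space_gib


def fits_numa_node(weight_shard_gib: float, kvcache_space_gib: float,
                   numa_node_capacity_gib: float) -> bool:
    """True if one rank's estimated memory stays within a single NUMA node."""
    return rank_memory_gib(weight_shard_gib, kvcache_space_gib) <= numa_node_capacity_gib


# Hypothetical numbers: a 7 GiB weight shard per rank and VLLM_CPU_KVCACHE_SPACE=40.
print(rank_memory_gib(7, 40))     # 47
print(fits_numa_node(7, 40, 64))  # True: 47 GiB fits in a 64 GiB node
print(fits_numa_node(7, 40, 32))  # False: the worker would risk an OOM kill
```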
- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.
+### How to do performance tuning for vLLM CPU?
- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, Tensor Parallel is a option for better performance.
+First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
-  - Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
+Inference batch size is an important parameter for performance. A larger batch usually provides higher throughput, while a smaller batch provides lower latency. Tuning the max batch size, starting from the default value, to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
-    ```bash
+- `--max-num-batched-tokens` defines the limit on the number of tokens in a single batch and has more impact on first-token performance. The default value is set as:
-    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
+    - Offline Inference: `4096 * world_size`
-        vllm serve meta-llama/Llama-2-7b-chat-hf \
+    - Online Serving: `2048 * world_size`
-        -tp=2 \
+- `--max-num-seqs` defines the limit on the number of sequences in a single batch and has more impact on output-token performance. The default value is set as:
-        --distributed-executor-backend mp
+    - Offline Inference: `256 * world_size`
-    ```
+    - Online Serving: `128 * world_size`
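The defaults listed above scale linearly with `world_size`. As a minimal sketch (the helper name `default_batch_limits` is hypothetical and only restates the numbers from the list; it is not a vLLM API), the starting point for tuning can be computed like this:

```python
def default_batch_limits(world_size: int, online_serving: bool) -> dict:
    """Restate the documented CPU defaults for a given world size."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Per-rank defaults taken from the list above.
    tokens_per_rank = 2048 if online_serving else 4096
    seqs_per_rank = 128 if online_serving else 256
    return {
        "max_num_batched_tokens": tokens_per_rank * world_size,
        "max_num_seqs": seqs_per_rank * world_size,
    }


# Example: online serving across two ranks.
limits = default_batch_limits(world_size=2, online_serving=True)
```

From these values, raise or lower `--max-num-batched-tokens` and `--max-num-seqs` depending on whether throughput or latency matters more for the workload.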
-    or using default auto thread binding:
+vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more details on tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommended to use TP and PP together if there are enough CPU sockets and memory nodes.
-    ```bash
+### Which quantization configs does vLLM CPU support?
-    VLLM_CPU_KVCACHE_SPACE=40 \
-        vllm serve meta-llama/Llama-2-7b-chat-hf \
+  - vLLM CPU supports the following quantizations:
-        -tp=2 \
+    - AWQ (x86 only)
-        --distributed-executor-backend mp
+    - GPTQ (x86 only)
-    ```
+    - compressed-tensor INT8 W8A8 (x86, s390x)
-  - For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
+### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
-  - Meanwhile, users should also take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, TP worker will be killed due to out-of-memory.
+  - Both of them require the `amx` CPU flag.
+    - `VLLM_CPU_MOE_PREPACK` can provide better performance for MoE models.
+    - `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.
--- a/docs/getting_started/installation/cpu/apple.inc.md
+++ b/docs/getting_started/installation/cpu/apple.inc.md
@@ -35,28 +35,24 @@ pip install -e .
 !!! note
    On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
-#### Troubleshooting
+!!! example "Troubleshooting"
+    If the build fails with errors like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your
-If the build has error like the following snippet where standard C++ headers cannot be found, try to remove and reinstall your
+    [Command Line Tools for Xcode](https://developer.apple.com/download/all/).
-[Command Line Tools for Xcode](https://developer.apple.com/download/all/).
+    ```text
-```text
+    [...] fatal error: 'map' file not found
-[...] fatal error: 'map' file not found
+            1 | #include <map>
-          1 | #include <map>
+                |          ^~~~~
-            |          ^~~~~
+        1 error generated.
-      1 error generated.
+        [2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
-      [2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
+    [...] fatal error: 'cstddef' file not found
-[...] fatal error: 'cstddef' file not found
+            10 | #include <cstddef>
-         10 | #include <cstddef>
+                |          ^~~~~~~~~
-            |          ^~~~~~~~~
+        1 error generated.
-      1 error generated.
+    ```
-```
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 # --8<-- [end:pre-built-images]

--- a/docs/getting_started/installation/cpu/arm.inc.md
+++ b/docs/getting_started/installation/cpu/arm.inc.md
@@ -28,14 +28,26 @@ ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
 Testing has been conducted on AWS Graviton3 instances for compatibility.
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
+```bash
+docker build -f docker/Dockerfile.arm \
+        --tag vllm-cpu-env .
+# Launching OpenAI server
+docker run --rm \
+            --privileged=true \
+            --shm-size=4g \
+            -p 8000:8000 \
+            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+            -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+            vllm-cpu-env \
+            --model=meta-llama/Llama-3.2-1B-Instruct \
+            --dtype=bfloat16 \
+            other vLLM OpenAI server arguments
+```
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
--- a/docs/getting_started/installation/cpu/build.inc.md
+++ b/docs/getting_started/installation/cpu/build.inc.md
@@ -2,7 +2,7 @@ First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as
 ```bash
 sudo apt-get update  -y
-sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
+sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 ```
@@ -17,7 +17,7 @@ Third, install Python packages for vLLM CPU backend building:
 ```bash
 pip install --upgrade pip
-pip install "cmake>=3.26.1" wheel packaging ninja "setuptools-scm>=8" numpy
+pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
 pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 ```
@@ -33,4 +33,7 @@ If you want to develop vllm, install it in editable mode instead.
 VLLM_TARGET_DEVICE=cpu python setup.py develop
 ```
+!!! note
+    If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
 # --8<-- [end:extra-information]
--- a/docs/getting_started/installation/cpu/s390x.inc.md
+++ b/docs/getting_started/installation/cpu/s390x.inc.md
@@ -56,14 +56,28 @@ Execute the following commands to build and install vLLM from the source.
 ```
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
+```bash
+docker build -f docker/Dockerfile.s390x \
+        --tag vllm-cpu-env .
+# Launching OpenAI server
+docker run --rm \
+            --privileged=true \
+            --shm-size=4g \
+            -p 8000:8000 \
+            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+            -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+            vllm-cpu-env \
+            --model=meta-llama/Llama-3.2-1B-Instruct \
+            --dtype=float \
+            other vLLM OpenAI server arguments
+```
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
--- a/docs/getting_started/installation/cpu/x86.inc.md
+++ b/docs/getting_started/installation/cpu/x86.inc.md
 # --8<-- [start:installation]
-vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
+vLLM supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
-!!! warning
-    There are no pre-built wheels or images for this device, so you must build vLLM from source.
 # --8<-- [end:installation]
 # --8<-- [start:requirements]
 - OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
+- CPU flags: `avx512f`, `avx512_bf16` (Optional), `avx512_vnni` (Optional)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
 !!! tip
-    [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
+    Use `lscpu` to check the CPU flags.
 # --8<-- [end:requirements]
 # --8<-- [start:set-up-using-python]
@@ -26,21 +22,37 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
 --8<-- "docs/getting_started/installation/cpu/build.inc.md"
-!!! note
-    - AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
-    - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
-See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+[https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+!!! warning
+    If the pre-built images are deployed on machines that only support `avx512f`, an `Illegal instruction` error may be raised. It is recommended to build images for these machines with `--build-arg VLLM_CPU_AVX512BF16=false` and `--build-arg VLLM_CPU_AVX512VNNI=false`.
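To decide which build arguments suit a given machine, the CPU flags can be inspected programmatically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the Linux `/proc/cpuinfo` layout, where a `flags` line lists the supported features. It maps the `avx512_bf16` and `avx512_vnni` flags to the two build arguments named above.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Extract the CPU flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def suggested_build_args(flags: set[str]) -> dict:
    """Enable each optional ISA extension only if the host reports it."""
    return {
        "VLLM_CPU_AVX512BF16": "true" if "avx512_bf16" in flags else "false",
        "VLLM_CPU_AVX512VNNI": "true" if "avx512_vnni" in flags else "false",
    }


# Made-up sample text; on a real host, read the contents of /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx512f avx512_vnni\n"
build_args = suggested_build_args(parse_cpu_flags(sample))
```

On a real machine, `pathlib.Path("/proc/cpuinfo").read_text()` can replace the sample string; `lscpu` reports the same flags.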
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
+```bash
+docker build -f docker/Dockerfile.cpu \
+        --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
+        --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
+        --tag vllm-cpu-env \
+        --target vllm-openai .
+# Launching OpenAI server
+docker run --rm \
+            --privileged=true \
+            --shm-size=4g \
+            -p 8000:8000 \
+            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+            -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+            vllm-cpu-env \
+            --model=meta-llama/Llama-3.2-1B-Instruct \
+            --dtype=bfloat16 \
+            other vLLM OpenAI server arguments
+```
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
--- a/docs/getting_started/installation/google_tpu.md
+++ b/docs/getting_started/installation/google_tpu.md
@@ -37,7 +37,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
 - Google Cloud TPU VM
 - TPU versions: v6e, v5e, v5p, v4
- Python: 3.10 or newer
+- Python: 3.11 or newer
 ### Provision Cloud TPUs
@@ -117,7 +117,7 @@ source ~/.bashrc
 Create and activate a Conda environment for vLLM:
 ```bash
-conda create -n vllm python=3.10 -y
+conda create -n vllm python=3.12 -y
 conda activate vllm
 ```

--- a/docs/getting_started/installation/gpu.md
+++ b/docs/getting_started/installation/gpu.md
@@ -46,11 +46,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "AMD ROCm"
-    There is no extra information on creating a new Python environment for this device.
+    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python"
 === "Intel XPU"
-    There is no extra information on creating a new Python environment for this device.
+    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python"
 ### Pre-built wheels

--- a/docs/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/getting_started/installation/gpu/cuda.inc.md
@@ -232,9 +232,6 @@ pip install -e .
 ```
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
@@ -261,4 +258,3 @@ See [deployment-docker-build-image-from-source][deployment-docker-build-image-fr
 See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
 # --8<-- [end:supported-features]
-# --8<-- [end:extra-information]
--- a/docs/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/getting_started/installation/gpu/rocm.inc.md
@@ -2,6 +2,9 @@
 vLLM supports AMD GPUs with ROCm 6.3.
+!!! tip
+    [Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
 !!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
@@ -14,6 +17,8 @@ vLLM supports AMD GPUs with ROCm 6.3.
 # --8<-- [end:requirements]
 # --8<-- [start:set-up-using-python]
+There is no extra information on creating a new Python environment for this device.
 # --8<-- [end:set-up-using-python]
 # --8<-- [start:pre-built-wheels]
@@ -90,7 +95,7 @@ Currently, there are no pre-built ROCm wheels.
 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
-    ??? Commands
+    ??? console "Commands"
        ```bash
        pip install --upgrade pip
@@ -123,9 +128,7 @@ Currently, there are no pre-built ROCm wheels.
    - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
      For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
-## Set up using Docker (Recommended)
+# --8<-- [end:build-wheel-from-source]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
@@ -203,7 +206,7 @@ DOCKER_BUILDKIT=1 docker build \
 To run the above docker image `vllm-rocm`, use the below command:
-??? Command
+??? console "Command"
    ```bash
    docker run -it \
@@ -227,4 +230,3 @@ Where the `<path/to/model>` is the location where the model is stored, for examp
 See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
 # --8<-- [end:supported-features]
-# --8<-- [end:extra-information]
--- a/docs/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/getting_started/installation/gpu/xpu.inc.md
@@ -14,6 +14,8 @@ vLLM initially supports basic model inference and serving on Intel GPU platform.
 # --8<-- [end:requirements]
 # --8<-- [start:set-up-using-python]
+There is no extra information on creating a new Python environment for this device.
 # --8<-- [end:set-up-using-python]
 # --8<-- [start:pre-built-wheels]
@@ -43,9 +45,6 @@ VLLM_TARGET_DEVICE=xpu python setup.py install
      type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
 # --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
-# --8<-- [end:set-up-using-docker]
 # --8<-- [start:pre-built-images]
 Currently, there are no pre-built XPU images.
@@ -81,4 +80,8 @@ python -m vllm.entrypoints.openai.api_server \
 By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
 # --8<-- [end:supported-features]
-# --8<-- [end:extra-information]
+# --8<-- [start:distributed-backend]
+The XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as its distributed backend, since torch 2.8 supports **xccl** as a built-in backend for XPU.
+# --8<-- [end:distributed-backend]
--- a/docs/getting_started/installation/intel_gaudi.md
+++ b/docs/getting_started/installation/intel_gaudi.md
@@ -28,7 +28,7 @@ To verify that the Intel Gaudi software was correctly installed, run:
 hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
 apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
 pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
-pip list | grep neural # verify that neural_compressor is installed
+pip list | grep neural # verify that neural_compressor_pt is installed
 ```
 Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
@@ -109,8 +109,8 @@ docker run \
 ### Supported features
- [Offline inference][offline-inference]
+- [Offline inference](../../serving/offline_inference.md)
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
+- Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -120,12 +120,13 @@ docker run \
 - Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
  for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
+- INC quantization
 ### Unsupported features
 - Beam search
 - LoRA adapters
- Quantization
+- AWQ quantization
 - Prefill chunking (mixed-batch inferencing)
 ### Supported configurations
@@ -133,36 +134,20 @@ docker run \
 The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
+| Model | TP Size| dtype | Sampling |
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+|-------|--------|--------|----------|
-  datatype with random or greedy sampling
+| [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | 1, 2, 8 | BF16 | Random / Greedy |
- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
+| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 1, 2, 8 | BF16 | Random / Greedy |
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+| [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | 1, 2, 8 | BF16 | Random / Greedy |
-  datatype with random or greedy sampling
+| [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
+| [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 1, 2, 8 | BF16 | Random / Greedy |
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+| [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy |
-  datatype with random or greedy sampling
+| [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) | 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
+| [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 8 | BF16 | Random / Greedy |
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+| [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 8 | BF16 | Random / Greedy |
-  datatype with random or greedy sampling
+| [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B)
+| [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | 8 | BF16 | Random / Greedy |
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
+| [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 8 | BF16 | Random / Greedy |
-  datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
-  on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
-  datatype with random or greedy sampling
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
-  with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
 ## Performance tuning
@@ -237,7 +222,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
 Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
-??? Logs
+??? console "Logs"
    ```text
    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
@@ -286,7 +271,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
 Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
-??? Logs
+??? console "Logs"
    ```text
    INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]

--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
---
+# Quickstart
-title: Quickstart
---
-[](){ #quickstart }
 This guide will help you quickly get started with vLLM to perform:
@@ -43,7 +40,7 @@ uv pip install vllm --torch-backend=auto
 ```
 !!! note
-    For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM.
+    For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
 [](){ #quickstart-offline }
@@ -77,7 +74,7 @@ prompts = [
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 ```
-The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
+The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
 ```python
 llm = LLM(model="facebook/opt-125m")
@@ -147,7 +144,7 @@ curl http://localhost:8000/v1/completions \
 Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
-??? Code
+??? code
    ```python
    from openai import OpenAI
@@ -186,7 +183,7 @@ curl http://localhost:8000/v1/chat/completions \
 Alternatively, you can use the `openai` Python package:
-??? Code
+??? code
    ```python
    from openai import OpenAI

--- a/docs/mkdocs/hooks/generate_argparse.py
+++ b/docs/mkdocs/hooks/generate_argparse.py
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import logging
+import sys
+from argparse import SUPPRESS, HelpFormatter
+from pathlib import Path
+from typing import Literal
+from unittest.mock import MagicMock, patch
+ROOT_DIR = Path(__file__).parent.parent.parent.parent
+ARGPARSE_DOC_DIR = ROOT_DIR / "docs/argparse"
+sys.path.insert(0, str(ROOT_DIR))
+sys.modules["aiohttp"] = MagicMock()
+sys.modules["blake3"] = MagicMock()
+sys.modules["vllm._C"] = MagicMock()
+from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs  # noqa: E402
+from vllm.entrypoints.openai.cli_args import make_arg_parser  # noqa: E402
+from vllm.utils import FlexibleArgumentParser  # noqa: E402
+logger = logging.getLogger("mkdocs")
+class MarkdownFormatter(HelpFormatter):
+    """Custom formatter that generates markdown for argument groups."""
+    def __init__(self, prog, starting_heading_level=3):
+        super().__init__(prog,
+                         max_help_position=float('inf'),
+                         width=float('inf'))
+        self._section_heading_prefix = "#" * starting_heading_level
+        self._argument_heading_prefix = "#" * (starting_heading_level + 1)
+        self._markdown_output = []
+    def start_section(self, heading):
+        if heading not in {"positional arguments", "options"}:
+            heading_md = f"\n{self._section_heading_prefix} {heading}\n\n"
+            self._markdown_output.append(heading_md)
+    def end_section(self):
+        pass
+    def add_text(self, text):
+        if text:
+            self._markdown_output.append(f"{text.strip()}\n\n")
+    def add_usage(self, usage, actions, groups, prefix=None):
+        pass
+    def add_arguments(self, actions):
+        for action in actions:
+            if (len(action.option_strings) == 0
+                    or "--help" in action.option_strings):
+                continue
+            option_strings = f'`{"`, `".join(action.option_strings)}`'
+            heading_md = f"{self._argument_heading_prefix} {option_strings}\n\n"
+            self._markdown_output.append(heading_md)
+            if choices := action.choices:
+                choices = f'`{"`, `".join(str(c) for c in choices)}`'
+                self._markdown_output.append(
+                    f"Possible choices: {choices}\n\n")
+            self._markdown_output.append(f"{action.help}\n\n")
+            if (default := action.default) != SUPPRESS:
+                self._markdown_output.append(f"Default: `{default}`\n\n")
+    def format_help(self):
+        """Return the formatted help as markdown."""
+        return "".join(self._markdown_output)
+def create_parser(cls, **kwargs) -> FlexibleArgumentParser:
+    """Create a parser for the given class with markdown formatting.
+    Args:
+        cls: The class to create a parser for
+        **kwargs: Additional keyword arguments to pass to `cls.add_cli_args`.
+    Returns:
+        FlexibleArgumentParser: A parser with markdown formatting for the class.
+    """
+    parser = FlexibleArgumentParser()
+    parser.formatter_class = MarkdownFormatter
+    with patch("vllm.config.DeviceConfig.__post_init__"):
+        return cls.add_cli_args(parser, **kwargs)
+def create_serve_parser() -> FlexibleArgumentParser:
+    """Create a parser for the serve command with markdown formatting."""
+    parser = FlexibleArgumentParser()
+    parser.formatter_class = lambda prog: MarkdownFormatter(
+        prog, starting_heading_level=4)
+    return make_arg_parser(parser)
+def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+    logger.info("Generating argparse documentation")
+    logger.debug("Root directory: %s", ROOT_DIR.resolve())
+    logger.debug("Output directory: %s", ARGPARSE_DOC_DIR.resolve())
+    # Create the ARGPARSE_DOC_DIR if it doesn't exist
+    if not ARGPARSE_DOC_DIR.exists():
+        ARGPARSE_DOC_DIR.mkdir(parents=True)
+    # Create parsers to document
+    parsers = {
+        "engine_args": create_parser(EngineArgs),
+        "async_engine_args": create_parser(AsyncEngineArgs,
+                                           async_args_only=True),
+        "serve": create_serve_parser(),
+    }
+    # Generate documentation for each parser
+    for stem, parser in parsers.items():
+        doc_path = ARGPARSE_DOC_DIR / f"{stem}.md"
+        with open(doc_path, "w") as f:
+            f.write(parser.format_help())
+        logger.info("Argparse generated: %s", doc_path.relative_to(ROOT_DIR))
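The `MarkdownFormatter` hook above works by overriding the `argparse.HelpFormatter` methods that `ArgumentParser.format_help()` calls and collecting Markdown instead of column-aligned text. The following is a reduced sketch of that idea using only the standard library: `MiniMarkdownFormatter` and `build_demo_parser` are hypothetical names, and a plain `ArgumentParser` stands in for vLLM's `FlexibleArgumentParser`.

```python
import argparse


class MiniMarkdownFormatter(argparse.HelpFormatter):
    """Collects Markdown instead of the usual column-aligned help text."""

    def __init__(self, prog, heading_level=3):
        super().__init__(prog)
        self._group_prefix = "#" * heading_level
        self._arg_prefix = "#" * (heading_level + 1)
        self._chunks = []

    def start_section(self, heading):
        # Skip argparse's built-in group titles; keep user-defined groups.
        builtin = {"positional arguments", "options", "optional arguments"}
        if heading not in builtin:
            self._chunks.append(f"\n{self._group_prefix} {heading}\n\n")

    def end_section(self):
        pass

    def add_usage(self, usage, actions, groups, prefix=None):
        pass

    def add_text(self, text):
        if text:
            self._chunks.append(f"{text.strip()}\n\n")

    def add_arguments(self, actions):
        for action in actions:
            # Positional arguments and --help are not documented.
            if not action.option_strings or "--help" in action.option_strings:
                continue
            flags = ", ".join(f"`{s}`" for s in action.option_strings)
            self._chunks.append(f"{self._arg_prefix} {flags}\n\n")
            if action.choices:
                choices = ", ".join(f"`{c}`" for c in action.choices)
                self._chunks.append(f"Possible choices: {choices}\n\n")
            self._chunks.append(f"{action.help}\n\n")
            if action.default != argparse.SUPPRESS:
                self._chunks.append(f"Default: `{action.default}`\n\n")

    def format_help(self):
        return "".join(self._chunks)


def build_demo_parser():
    parser = argparse.ArgumentParser(prog="demo",
                                     formatter_class=MiniMarkdownFormatter)
    group = parser.add_argument_group("Model")
    group.add_argument("--dtype", choices=["float", "bfloat16"],
                       default="bfloat16", help="Weight data type.")
    return parser


markdown = build_demo_parser().format_help()
```

Each user-defined argument group becomes a heading and each option a sub-heading, which is the shape the hook writes into `docs/argparse/*.md`.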
--- a/docs/mkdocs/hooks/generate_examples.py
+++ b/docs/mkdocs/hooks/generate_examples.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import itertools
+import logging
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Literal
 import regex as re
+logger = logging.getLogger("mkdocs")
 ROOT_DIR = Path(__file__).parent.parent.parent.parent
 ROOT_DIR_RELATIVE = '../../../../..'
 EXAMPLE_DIR = ROOT_DIR / "examples"
 EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
-print(ROOT_DIR.resolve())
-print(EXAMPLE_DIR.resolve())
-print(EXAMPLE_DOC_DIR.resolve())
 def fix_case(text: str) -> str:
@@ -135,6 +135,11 @@ class Example:
 def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+    logger.info("Generating example documentation")
+    logger.debug("Root directory: %s", ROOT_DIR.resolve())
+    logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
+    logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
    # Create the EXAMPLE_DOC_DIR if it doesn't exist
    if not EXAMPLE_DOC_DIR.exists():
        EXAMPLE_DOC_DIR.mkdir(parents=True)
@@ -156,8 +161,8 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
    for example in sorted(examples, key=lambda e: e.path.stem):
        example_name = f"{example.path.stem}.md"
        doc_path = EXAMPLE_DOC_DIR / example.category / example_name
-        print(doc_path)
        if not doc_path.parent.exists():
            doc_path.parent.mkdir(parents=True)
        with open(doc_path, "w+") as f:
            f.write(example.generate())
+        logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
--- a/docs/mkdocs/hooks/url_schemes.py
+++ b/docs/mkdocs/hooks/url_schemes.py
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+"""
+This is basically a port of MyST parser’s external URL resolution mechanism
+(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
+to work with MkDocs.
+It allows Markdown authors to use GitHub shorthand links like:
+  - [Text](gh-issue:123)
+  - <gh-pr:456>
+  - [File](gh-file:path/to/file.py#L10)
+These are automatically rewritten into fully qualified GitHub URLs pointing to
+issues, pull requests, files, directories, or projects in the
+`vllm-project/vllm` repository.
+The goal is to simplify cross-referencing common GitHub resources
+in project docs.
+"""
 import regex as re
 from mkdocs.config.defaults import MkDocsConfig
 from mkdocs.structure.files import Files
@@ -7,11 +26,42 @@ from mkdocs.structure.pages import Page
 def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
-                     files: Files):
+                     files: Files) -> str:
+    """
+    Custom MkDocs plugin hook to rewrite special GitHub reference links
+    in Markdown.
+    This function scans the given Markdown content for specially formatted
+    GitHub shorthand links, such as:
+      - `[Link text](gh-issue:123)`
+      - `<gh-pr:456>`
+    It rewrites them into fully qualified GitHub URLs with GitHub icons:
+      - `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
+      - `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
+    Supported shorthand types:
+      - `gh-issue`
+      - `gh-pr`
+      - `gh-project`
+      - `gh-dir`
+      - `gh-file`
+    Args:
+        markdown (str): The raw Markdown content of the page.
+        page (Page): The MkDocs page object being processed.
+        config (MkDocsConfig): The MkDocs site configuration.
+        files (Files): The collection of files in the MkDocs build.
+    Returns:
+        str: The updated Markdown content with GitHub shorthand links replaced.
+    """
    gh_icon = ":octicons-mark-github-16:"
    gh_url = "https://github.com"
    repo_url = f"{gh_url}/vllm-project/vllm"
    org_url = f"{gh_url}/orgs/vllm-project"
+    # Mapping of shorthand types to their corresponding GitHub base URLs
    urls = {
        "issue": f"{repo_url}/issues",
        "pr": f"{repo_url}/pull",
@@ -19,6 +69,8 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
        "dir": f"{repo_url}/tree/main",
        "file": f"{repo_url}/blob/main",
    }
+    # Default title prefixes for auto links
    titles = {
        "issue": "Issue #",
        "pr": "Pull Request #",
@@ -27,11 +79,19 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
        "file": "",
    }
+    # Regular expression to match GitHub shorthand links
    scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
    inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
    auto_link = re.compile(f"<{scheme}>")
    def replace_inline_link(match: re.Match) -> str:
+        """
+        Replaces a matched inline-style GitHub shorthand link
+        with a full Markdown link.
+        Example:
+            [My issue](gh-issue:123) → [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
+        """
        url = f'{urls[match.group("type")]}/{match.group("path")}'
        if fragment := match.group("fragment"):
            url += f"#{fragment}"
@@ -39,6 +99,13 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
        return f'[{gh_icon} {match.group("title")}]({url})'
    def replace_auto_link(match: re.Match) -> str:
+        """
+        Replaces a matched autolink-style GitHub shorthand
+        with a full Markdown link.
+        Example:
+            <gh-pr:456> → [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
+        """
        type = match.group("type")
        path = match.group("path")
        title = f"{titles[type]}{path}"
@@ -48,6 +115,7 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
        return f"[{gh_icon} {title}]({url})"
+    # Replace both inline and autolinks
    markdown = inline_link.sub(replace_inline_link, markdown)
    markdown = auto_link.sub(replace_auto_link, markdown)
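Review note: the rewriting described in the docstrings above can be exercised outside MkDocs with a small standalone sketch. It reuses the regular expressions shown in the diff but makes several assumptions: it uses the standard-library `re` module rather than `regex`, it reproduces only the `issue`, `pr`, and `file` mappings (the diff elides the rest), and the `rewrite` and `_url` helpers are hypothetical names. The auto-link URL construction is elided in the diff, so the version here is a guess that mirrors the inline case.

```python
import re

GH_ICON = ":octicons-mark-github-16:"
REPO_URL = "https://github.com/vllm-project/vllm"

# Subset of the mappings visible in the diff; other entries are elided there.
URLS = {
    "issue": f"{REPO_URL}/issues",
    "pr": f"{REPO_URL}/pull",
    "file": f"{REPO_URL}/blob/main",
}
TITLES = {"issue": "Issue #", "pr": "Pull Request #", "file": ""}

# Same patterns as in the hook, compiled with the standard-library re module.
SCHEME = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
INLINE_LINK = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + SCHEME + r"\)")
AUTO_LINK = re.compile(f"<{SCHEME}>")


def _url(match: re.Match) -> str:
    """Build the full GitHub URL, appending the fragment if one was matched."""
    url = f'{URLS[match.group("type")]}/{match.group("path")}'
    if fragment := match.group("fragment"):
        url += f"#{fragment}"
    return url


def rewrite(markdown: str) -> str:
    """Rewrite inline links first, then autolinks, as the hook does."""

    def inline(match: re.Match) -> str:
        return f'[{GH_ICON} {match.group("title")}]({_url(match)})'

    def auto(match: re.Match) -> str:
        # Assumption: the title is the default prefix followed by the path.
        title = f'{TITLES[match.group("type")]}{match.group("path")}'
        return f"[{GH_ICON} {title}]({_url(match)})"

    markdown = INLINE_LINK.sub(inline, markdown)
    return AUTO_LINK.sub(auto, markdown)


issue_link = rewrite("[My issue](gh-issue:123)")
pr_link = rewrite("<gh-pr:456>")
file_link = rewrite("[File](gh-file:path/to/file.py#L10)")
```

Because both the `type` and `path` groups are lazy, the optional `#fragment` group gets the first chance to consume a trailing anchor, so `path/to/file.py#L10` splits into a path and a fragment rather than being swallowed whole by `path`.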

--- a/docs/mkdocs/overrides/partials/toc-item.html
+++ b/docs/mkdocs/overrides/partials/toc-item.html
+<!-- Enables the use of toc_depth in document frontmatter https://github.com/squidfunk/mkdocs-material/issues/4827#issuecomment-1869812019 -->
+<li class="md-nav__item">
+    <a href="{{ toc_item.url }}" class="md-nav__link">
+      <span class="md-ellipsis">
+        {{ toc_item.title }}
+      </span>
+    </a>
+    <!-- Table of contents list -->
+    {% if toc_item.children %}
+      <nav class="md-nav" aria-label="{{ toc_item.title | striptags }}">
+        <ul class="md-nav__list">
+          {% for toc_item in toc_item.children %}
+          {% if not page.meta.toc_depth or toc_item.level <= page.meta.toc_depth %}
+            {% include "partials/toc-item.html" %}
+          {% endif %}
+          {% endfor %}
+        </ul>
+      </nav>
+    {% endif %}
+  </li>
\ No newline at end of file