Commit 711aa9d5 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.10.0' into v0.10.0-dev

parents 751c492c 6d8d0a24
--- # Structured Outputs
title: Structured Outputs
---
[](){ #structured-outputs }
vLLM supports the generation of structured outputs using vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or [xgrammar](https://github.com/mlc-ai/xgrammar) or
...@@ -21,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters: ...@@ -21,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar. - `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text. - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page. You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the may choose to specify the backend to use by setting the
...@@ -33,7 +30,7 @@ text. ...@@ -33,7 +30,7 @@ text.
Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one: Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one:
??? Code ??? code
```python ```python
from openai import OpenAI from openai import OpenAI
...@@ -55,7 +52,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic ...@@ -55,7 +52,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template: The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
??? Code ??? code
```python ```python
completion = client.chat.completions.create( completion = client.chat.completions.create(
...@@ -79,7 +76,7 @@ For this we can use the `guided_json` parameter in two different ways: ...@@ -79,7 +76,7 @@ For this we can use the `guided_json` parameter in two different ways:
The next example shows how to use the `guided_json` parameter with a Pydantic model: The next example shows how to use the `guided_json` parameter with a Pydantic model:
??? Code ??? code
```python ```python
from pydantic import BaseModel from pydantic import BaseModel
...@@ -127,7 +124,7 @@ difficult to use, but it´s really powerful. It allows us to define complete ...@@ -127,7 +124,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
languages like SQL queries. It works by using a context free EBNF grammar. languages like SQL queries. It works by using a context free EBNF grammar.
As an example, we can use to define a specific format of simplified SQL queries: As an example, we can use to define a specific format of simplified SQL queries:
??? Code ??? code
```python ```python
simplified_sql_grammar = """ simplified_sql_grammar = """
...@@ -157,7 +154,7 @@ As an example, we can use to define a specific format of simplified SQL queries: ...@@ -157,7 +154,7 @@ As an example, we can use to define a specific format of simplified SQL queries:
print(completion.choices[0].message.content) print(completion.choices[0].message.content)
``` ```
See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html) See also: [full example](../examples/online_serving/structured_outputs.md)
## Reasoning Outputs ## Reasoning Outputs
...@@ -169,7 +166,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r ...@@ -169,7 +166,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema: Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
??? Code ??? code
```python ```python
from pydantic import BaseModel from pydantic import BaseModel
...@@ -200,7 +197,7 @@ Note that you can use reasoning with any provided structured outputs feature. Th ...@@ -200,7 +197,7 @@ Note that you can use reasoning with any provided structured outputs feature. Th
print("content: ", completion.choices[0].message.content) print("content: ", completion.choices[0].message.content)
``` ```
See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html) See also: [full example](../examples/online_serving/structured_outputs.md)
## Experimental Automatic Parsing (OpenAI API) ## Experimental Automatic Parsing (OpenAI API)
...@@ -212,7 +209,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3. ...@@ -212,7 +209,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
Here is a simple example demonstrating how to get structured output using Pydantic models: Here is a simple example demonstrating how to get structured output using Pydantic models:
??? Code ??? code
```python ```python
from pydantic import BaseModel from pydantic import BaseModel
...@@ -248,7 +245,7 @@ Age: 28 ...@@ -248,7 +245,7 @@ Age: 28
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution: Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
??? Code ??? code
```python ```python
from typing import List from typing import List
...@@ -308,7 +305,7 @@ These parameters can be used in the same way as the parameters from the Online ...@@ -308,7 +305,7 @@ These parameters can be used in the same way as the parameters from the Online
Serving examples above. One example for the usage of the `choice` parameter is Serving examples above. One example for the usage of the `choice` parameter is
shown below: shown below:
??? Code ??? code
```python ```python
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
...@@ -325,4 +322,4 @@ shown below: ...@@ -325,4 +322,4 @@ shown below:
print(outputs[0].outputs[0].text) print(outputs[0].outputs[0].text)
``` ```
See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html) See also: [full example](../examples/online_serving/structured_outputs.md)
# Tool Calling # Tool Calling
vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`) and `none` options for the `tool_choice` field in the chat completion API. vLLM currently supports named function calling, as well as the `auto`, `required` (as of `vllm>=0.8.3`), and `none` options for the `tool_choice` field in the chat completion API.
## Quickstart ## Quickstart
Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the llama3 tool calling chat template from the vLLM examples directory: Start the server with tool calling enabled. This example uses Meta's Llama 3.1 8B model, so we need to use the `llama3_json` tool calling chat template from the vLLM examples directory:
```bash ```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \ vllm serve meta-llama/Llama-3.1-8B-Instruct \
...@@ -13,9 +13,9 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \ ...@@ -13,9 +13,9 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
--chat-template examples/tool_chat_template_llama3.1_json.jinja --chat-template examples/tool_chat_template_llama3.1_json.jinja
``` ```
Next, make a request to the model that should result in it using the available tools: Next, make a request that triggers the model to use the available tools:
??? Code ??? code
```python ```python
from openai import OpenAI from openai import OpenAI
...@@ -73,7 +73,7 @@ This example demonstrates: ...@@ -73,7 +73,7 @@ This example demonstrates:
You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests. You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
Remember that it's the callers responsibility to: Remember that it's the caller's responsibility to:
1. Define appropriate tools in the request 1. Define appropriate tools in the request
2. Include relevant context in the chat messages 2. Include relevant context in the chat messages
...@@ -84,7 +84,7 @@ For more advanced usage, including parallel tool calls and different model-speci ...@@ -84,7 +84,7 @@ For more advanced usage, including parallel tool calls and different model-speci
## Named Function Calling ## Named Function Calling
vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a enabled by default and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one. high-quality one.
vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter. vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
...@@ -95,7 +95,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha ...@@ -95,7 +95,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha
## Required Function Calling ## Required Function Calling
vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine. vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The guided decoding features for `tool_choice='required'` (such as JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](../usage/v1_guide.md#features) for the V1 engine.
When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter. When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
...@@ -103,24 +103,22 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m ...@@ -103,24 +103,22 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m
vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request. vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option. However, when `tool_choice='none'` is specified, vLLM includes tool definitions from the prompt.
Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`.
## Automatic Function Calling ## Automatic Function Calling
To enable this feature, you should set the following flags: To enable this feature, you should set the following flags:
* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it * `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it
deems appropriate. deems appropriate.
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers * `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
will continue to be added in the future, and also can register your own tool parsers in the `--tool-parser-plugin`. will continue to be added in the future. You can also register your own tool parsers in the `--tool-parser-plugin`.
* `--tool-parser-plugin` -- **optional** tool parser plugin used to register user defined tool parsers into vllm, the registered tool parser name can be specified in `--tool-call-parser`. * `--tool-parser-plugin` -- **optional** tool parser plugin used to register user defined tool parsers into vllm, the registered tool parser name can be specified in `--tool-call-parser`.
* `--chat-template` -- **optional** for auto tool choice. the path to the chat template which handles `tool`-role messages and `assistant`-role messages * `--chat-template` -- **optional** for auto tool choice. It's the path to the chat template which handles `tool`-role messages and `assistant`-role messages
that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their that contain previously generated tool calls. Hermes, Mistral and Llama models have tool-compatible chat templates in their
`tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat `tokenizer_config.json` files, but you can specify a custom template. This argument can be set to `tool_use` if your model has a tool use-specific chat
template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates) template configured in the `tokenizer_config.json`. In this case, it will be used per the `transformers` specification. More on this [here](https://huggingface.co/docs/transformers/en/chat_templating#why-do-some-models-have-multiple-templates)
from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json) from HuggingFace; and you can find an example of this in a `tokenizer_config.json` [here](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B/blob/main/tokenizer_config.json).
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template! If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
...@@ -132,7 +130,7 @@ All Nous Research Hermes-series models newer than Hermes 2 Pro should be support ...@@ -132,7 +130,7 @@ All Nous Research Hermes-series models newer than Hermes 2 Pro should be support
* `NousResearch/Hermes-2-Theta-*` * `NousResearch/Hermes-2-Theta-*`
* `NousResearch/Hermes-3-*` * `NousResearch/Hermes-3-*`
_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge _Note that the Hermes 2 **Theta** models are known to have degraded tool call quality and capabilities due to the merge
step in their creation_. step in their creation_.
Flags: `--tool-call-parser hermes` Flags: `--tool-call-parser hermes`
...@@ -148,13 +146,13 @@ Known issues: ...@@ -148,13 +146,13 @@ Known issues:
1. Mistral 7B struggles to generate parallel tool calls correctly. 1. Mistral 7B struggles to generate parallel tool calls correctly.
2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is 2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
much shorter than what vLLM generates. Since an exception is thrown when this condition much shorter than what vLLM generates. Since an exception is thrown when this condition
is not met, the following additional chat templates are provided: is not met, the following additional chat templates are provided:
* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that * <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits) it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt * <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
when tools are provided, that results in much better reliability when working with parallel tool calling. when tools are provided, that results in much better reliability when working with parallel tool calling.
Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja` Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`
...@@ -168,17 +166,17 @@ All Llama 3.1, 3.2 and 4 models should be supported. ...@@ -168,17 +166,17 @@ All Llama 3.1, 3.2 and 4 models should be supported.
* `meta-llama/Llama-3.2-*` * `meta-llama/Llama-3.2-*`
* `meta-llama/Llama-4-*` * `meta-llama/Llama-4-*`
The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for llama 4 models, it is recommended to use the `llama4_pythonic` tool parser. The tool calling that is supported is the [JSON-based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for Llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
Other tool calling formats like the built in python tool calling or custom tool calling are not supported. Other tool calling formats like the built in python tool calling or custom tool calling are not supported.
Known issues: Known issues:
1. Parallel tool calls are not supported for llama 3, but it is supported in llama 4 models. 1. Parallel tool calls are not supported for Llama 3, but it is supported in Llama 4 models.
2. The model can generate parameters with a wrong format, such as generating 2. The model can generate parameters in an incorrect format, such as generating
an array serialized as string instead of an array. an array serialized as string instead of an array.
VLLM provides two JSON based chat templates for Llama 3.1 and 3.2: VLLM provides two JSON-based chat templates for Llama 3.1 and 3.2:
* <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1 * <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1
models, but tweaked so that it works better with vLLM. models, but tweaked so that it works better with vLLM.
...@@ -187,7 +185,8 @@ images. ...@@ -187,7 +185,8 @@ images.
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}` Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
VLLM also provides a pythonic and JSON based chat template for Llama 4, but pythonic tool calling is recommended: VLLM also provides a pythonic and JSON-based chat template for Llama 4, but pythonic tool calling is recommended:
* <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models. * <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`. For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
...@@ -198,21 +197,21 @@ Supported models: ...@@ -198,21 +197,21 @@ Supported models:
* `ibm-granite/granite-3.0-8b-instruct` * `ibm-granite/granite-3.0-8b-instruct`
Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja` Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Huggingface. Parallel function calls are supported. <gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Hugging Face. Parallel function calls are supported.
* `ibm-granite/granite-3.1-8b-instruct` * `ibm-granite/granite-3.1-8b-instruct`
Recommended flags: `--tool-call-parser granite` Recommended flags: `--tool-call-parser granite`
The chat template from Huggingface can be used directly. Parallel function calls are supported. The chat template from Huggingface can be used directly. Parallel function calls are supported.
* `ibm-granite/granite-20b-functioncalling` * `ibm-granite/granite-20b-functioncalling`
Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja` Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported. <gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`) ### InternLM Models (`internlm`)
...@@ -248,10 +247,12 @@ The xLAM tool parser is designed to support models that generate tool calls in v ...@@ -248,10 +247,12 @@ The xLAM tool parser is designed to support models that generate tool calls in v
Parallel function calls are supported, and the parser can effectively separate text content from tool calls. Parallel function calls are supported, and the parser can effectively separate text content from tool calls.
Supported models: Supported models:
* Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r` * Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r`
* Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r` * Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r`
Flags: Flags:
* For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja` * For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja`
* For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja` * For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja`
...@@ -268,10 +269,10 @@ Flags: `--tool-call-parser hermes` ...@@ -268,10 +269,10 @@ Flags: `--tool-call-parser hermes`
Supported models: Supported models:
* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>) * `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>) * `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax.jinja` Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja`
### DeepSeek-V3 Models (`deepseek_v3`) ### DeepSeek-V3 Models (`deepseek_v3`)
...@@ -282,6 +283,25 @@ Supported models: ...@@ -282,6 +283,25 @@ Supported models:
Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}` Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
### Kimi-K2 Models (`kimi_k2`)
Supported models:
* `moonshotai/Kimi-K2-Instruct`
Flags: `--tool-call-parser kimi_k2`
### Hunyuan Models (`hunyuan_a13b`)
Supported models:
* `tencent/Hunyuan-A13B-Instruct` (The chat template is already included in the Hugging Face model files.)
Flags:
* For non-reasoning: `--tool-call-parser hunyuan_a13b`
* For reasoning: `--tool-call-parser hunyuan_a13b --reasoning-parser hunyuan_a13b --enable_reasoning`
### Models with Pythonic Tool Calls (`pythonic`) ### Models with Pythonic Tool Calls (`pythonic`)
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models. A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
...@@ -299,28 +319,25 @@ Limitations: ...@@ -299,28 +319,25 @@ Limitations:
Example supported models: Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>) * `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>) * `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>) * `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>) * `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>) * `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct`\* (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>) * `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
Flags: `--tool-call-parser pythonic --chat-template {see_above}` Flags: `--tool-call-parser pythonic --chat-template {see_above}`
--- !!! warning
**WARNING** Llama's smaller models frequently fail to emit tool calls in the correct format. Results may vary depending on the model.
Llama's smaller models frequently fail to emit tool calls in the correct format. Your mileage may vary.
---
## How to write a tool parser plugin ## How to Write a Tool Parser Plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>. A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.
Here is a summary of a plugin file: Here is a summary of a plugin file:
??? Code ??? code
```python ```python
......
--- # Installation
title: Installation
---
[](){ #installation-index }
vLLM supports the following hardware platforms: vLLM supports the following hardware platforms:
......
...@@ -76,80 +76,62 @@ Currently, there are no pre-built CPU wheels. ...@@ -76,80 +76,62 @@ Currently, there are no pre-built CPU wheels.
### Build image from source ### Build image from source
??? Commands === "Intel/AMD x86"
```bash
docker build -f docker/Dockerfile.cpu \
--tag vllm-cpu-env \
--target vllm-openai .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
```
!!! tip --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
For ARM or Apple silicon, use `docker/Dockerfile.arm`
=== "ARM AArch64"
!!! tip --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
## Supported features === "Apple silicon"
vLLM CPU backend supports the following vLLM features: --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
- Tensor Parallel === "IBM Z (S390X)"
- Model Quantization (`INT8 W8A8, AWQ, GPTQ`) --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
- Chunked-prefill
- Prefix-caching
- FP8-E5M2 KV cache
## Related runtime environment variables ## Related runtime environment variables
- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`. - `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`.
- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. By setting to `all`, the OpenMP threads of each rank uses all CPU cores available on the system. Default value is `auto`. - `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads, can be set as CPU id lists or `auto` (by default). For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node respectively.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `0`. - `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `None`. If the value is not set and use `auto` thread binding, no CPU will be reserved for `world_size == 1`, 1 CPU per rank will be reserved for `world_size > 1`.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False). - `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
- `VLLM_CPU_SGL_KERNEL` (Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False). - `VLLM_CPU_SGL_KERNEL` (x86 only, Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
## Performance tips ## FAQ
- We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run: ### Which `dtype` should be used?
```bash - Currently vLLM CPU uses model default settings as `dtype`. However, due to unstable float16 support in torch CPU, it is recommended to explicitly set `dtype=bfloat16` if there are any performance or accuracy problem.
sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
find / -name *libtcmalloc* # find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
python examples/offline_inference/basic/basic.py # run vLLM
```
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP: ### How to launch a vLLM service on CPU?
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 31 for the framework and using CPU 0-30 for inference threads:
```bash ```bash
export VLLM_CPU_KVCACHE_SPACE=40 export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29 export VLLM_CPU_OMP_THREADS_BIND=0-30
vllm serve facebook/opt-125m vllm serve facebook/opt-125m --dtype=bfloat16
``` ```
or using default auto thread binding: or using default auto thread binding:
```bash ```bash
export VLLM_CPU_KVCACHE_SPACE=40 export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2 export VLLM_CPU_NUM_OF_RESERVED_CPU=1
vllm serve facebook/opt-125m vllm serve facebook/opt-125m --dtype=bfloat16
``` ```
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND` or using auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores: Note, it is recommended to manually reserve 1 CPU for vLLM front-end process when `world_size == 1`.
### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread will be bound to a dedicated physical core respectively, threads of each rank will be bound to a same NUMA node respectively, and 1 CPU per rank will be reserved for other vLLM components when `world_size > 1`. If have any performance problems or unexpected binding behaviours, please try to bind threads as following.
- On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
??? Commands ??? console "Commands"
```console ```console
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
...@@ -178,34 +160,36 @@ vllm serve facebook/opt-125m ...@@ -178,34 +160,36 @@ vllm serve facebook/opt-125m
$ python examples/offline_inference/basic/basic.py $ python examples/offline_inference/basic/basic.py
``` ```
- If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access. - When deploy vLLM CPU backend on a multi-socket machine with NUMA and enable tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. So be aware to set CPU cores of a single rank on a same NUMA node to avoid cross NUMA node memory access.
## Other considerations ### How to decide `VLLM_CPU_KVCACHE_SPACE`?
- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance. - This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance. ### How to do performance tuning for vLLM CPU?
- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, Tensor Parallel is a option for better performance. First of all, please make sure the thread-binding and KV cache space are properly set and take effect. You can check the thread-binding by running a vLLM benchmark and observing CPU cores usage via `htop`.
- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving: Inference batch size is a important parameter for the performance. Larger batch usually provides higher throughput, smaller batch provides lower latency. Tuning max batch size starts from default value to balance throughput and latency is an effective way to improve vLLM CPU performance on specific platforms. There are two important related parameters in vLLM:
```bash - `--max-num-batched-tokens`, defines the limit of token numbers in a single batch, has more impacts on the first token performance. The default value is set as:
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \ - Offline Inference: `4096 * world_size`
vllm serve meta-llama/Llama-2-7b-chat-hf \ - Online Serving: `2048 * world_size`
-tp=2 \ - `--max-num-seqs`, defines the limit of sequence numbers in a single batch, has more impacts on the output token performance.
--distributed-executor-backend mp - Offline Inference: `256 * world_size`
``` - Online Serving: `128 * world_size`
or using default auto thread binding: vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage multiple CPU sockets and memory nodes. For more detials of tuning TP and PP, please refer to [Optimization and Tuning](../../configuration/optimization.md). For vLLM CPU, it is recommend to use TP and PP togther if there are enough CPU sockets and memory nodes.
```bash ### Which quantization configs does vLLM CPU support?
VLLM_CPU_KVCACHE_SPACE=40 \
vllm serve meta-llama/Llama-2-7b-chat-hf \ - vLLM CPU supports quantizations:
-tp=2 \ - AWQ (x86 only)
--distributed-executor-backend mp - GPTQ (x86 only)
``` - compressed-tensor INT8 W8A8 (x86, s390x)
- For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node. ### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
- Meanwhile, users should also take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, TP worker will be killed due to out-of-memory. - Both of them requires `amx` CPU flag.
- `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models
- `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios.
...@@ -35,28 +35,24 @@ pip install -e . ...@@ -35,28 +35,24 @@ pip install -e .
!!! note !!! note
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device. On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
#### Troubleshooting !!! example "Troubleshooting"
If the build has error like the following snippet where standard C++ headers cannot be found, try to remove and reinstall your
If the build has error like the following snippet where standard C++ headers cannot be found, try to remove and reinstall your [Command Line Tools for Xcode](https://developer.apple.com/download/all/).
[Command Line Tools for Xcode](https://developer.apple.com/download/all/).
```text
```text [...] fatal error: 'map' file not found
[...] fatal error: 'map' file not found 1 | #include <map>
1 | #include <map> | ^~~~~
| ^~~~~ 1 error generated.
1 error generated. [2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
[2/8] Building CXX object CMakeFiles/_C.dir/csrc/cpu/pos_encoding.cpp.o
[...] fatal error: 'cstddef' file not found
[...] fatal error: 'cstddef' file not found 10 | #include <cstddef>
10 | #include <cstddef> | ^~~~~~~~~
| ^~~~~~~~~ 1 error generated.
1 error generated. ```
```
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images] # --8<-- [end:pre-built-images]
......
...@@ -28,14 +28,26 @@ ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes. ...@@ -28,14 +28,26 @@ ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
Testing has been conducted on AWS Graviton3 instances for compatibility. Testing has been conducted on AWS Graviton3 instances for compatibility.
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images] # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source] # --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.arm \
--tag vllm-cpu-env .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
```
# --8<-- [end:build-image-from-source] # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information] # --8<-- [start:extra-information]
# --8<-- [end:extra-information] # --8<-- [end:extra-information]
...@@ -2,7 +2,7 @@ First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as ...@@ -2,7 +2,7 @@ First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as
```bash ```bash
sudo apt-get update -y sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
``` ```
...@@ -17,7 +17,7 @@ Third, install Python packages for vLLM CPU backend building: ...@@ -17,7 +17,7 @@ Third, install Python packages for vLLM CPU backend building:
```bash ```bash
pip install --upgrade pip pip install --upgrade pip
pip install "cmake>=3.26.1" wheel packaging ninja "setuptools-scm>=8" numpy pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
``` ```
...@@ -33,4 +33,7 @@ If you want to develop vllm, install it in editable mode instead. ...@@ -33,4 +33,7 @@ If you want to develop vllm, install it in editable mode instead.
VLLM_TARGET_DEVICE=cpu python setup.py develop VLLM_TARGET_DEVICE=cpu python setup.py develop
``` ```
!!! note
If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
# --8<-- [end:extra-information] # --8<-- [end:extra-information]
...@@ -56,14 +56,28 @@ Execute the following commands to build and install vLLM from the source. ...@@ -56,14 +56,28 @@ Execute the following commands to build and install vLLM from the source.
``` ```
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
# --8<-- [end:pre-built-images] # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source] # --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.s390x \
--tag vllm-cpu-env .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=float \
other vLLM OpenAI server arguments
```
# --8<-- [end:build-image-from-source] # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information] # --8<-- [start:extra-information]
# --8<-- [end:extra-information] # --8<-- [end:extra-information]
# --8<-- [start:installation] # --8<-- [start:installation]
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation] # --8<-- [end:installation]
# --8<-- [start:requirements] # --8<-- [start:requirements]
- OS: Linux - OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended) - CPU flags: `avx512f`, `avx512_bf16` (Optional), `avx512_vnni` (Optional)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
!!! tip !!! tip
[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware. Use `lscpu` to check the CPU flags.
# --8<-- [end:requirements] # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] # --8<-- [start:set-up-using-python]
...@@ -26,21 +22,37 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform, ...@@ -26,21 +22,37 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
--8<-- "docs/getting_started/installation/cpu/build.inc.md" --8<-- "docs/getting_started/installation/cpu/build.inc.md"
!!! note
- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo) [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
!!! warning
If deploying the pre-built images on machines only contain `avx512f`, `Illegal instruction` error may be raised. It is recommended to build images for these machines with `--build-arg VLLM_CPU_AVX512BF16=false` and `--build-arg VLLM_CPU_AVX512VNNI=false`.
# --8<-- [end:pre-built-images] # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source] # --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=false (default)|true \
--build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
--tag vllm-cpu-env \
--target vllm-openai .
# Launching OpenAI server
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \
other vLLM OpenAI server arguments
```
# --8<-- [end:build-image-from-source] # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information] # --8<-- [start:extra-information]
# --8<-- [end:extra-information] # --8<-- [end:extra-information]
...@@ -37,7 +37,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp ...@@ -37,7 +37,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
- Google Cloud TPU VM - Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4 - TPU versions: v6e, v5e, v5p, v4
- Python: 3.10 or newer - Python: 3.11 or newer
### Provision Cloud TPUs ### Provision Cloud TPUs
...@@ -117,7 +117,7 @@ source ~/.bashrc ...@@ -117,7 +117,7 @@ source ~/.bashrc
Create and activate a Conda environment for vLLM: Create and activate a Conda environment for vLLM:
```bash ```bash
conda create -n vllm python=3.10 -y conda create -n vllm python=3.12 -y
conda activate vllm conda activate vllm
``` ```
......
...@@ -46,11 +46,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -46,11 +46,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "AMD ROCm" === "AMD ROCm"
There is no extra information on creating a new Python environment for this device. --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python"
=== "Intel XPU" === "Intel XPU"
There is no extra information on creating a new Python environment for this device. --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python"
### Pre-built wheels ### Pre-built wheels
......
...@@ -232,9 +232,6 @@ pip install -e . ...@@ -232,9 +232,6 @@ pip install -e .
``` ```
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image. See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
...@@ -261,4 +258,3 @@ See [deployment-docker-build-image-from-source][deployment-docker-build-image-fr ...@@ -261,4 +258,3 @@ See [deployment-docker-build-image-from-source][deployment-docker-build-image-fr
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information. See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
# --8<-- [end:supported-features] # --8<-- [end:supported-features]
# --8<-- [end:extra-information]
...@@ -2,6 +2,9 @@ ...@@ -2,6 +2,9 @@
vLLM supports AMD GPUs with ROCm 6.3. vLLM supports AMD GPUs with ROCm 6.3.
!!! tip
[Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
!!! warning !!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source. There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
...@@ -14,6 +17,8 @@ vLLM supports AMD GPUs with ROCm 6.3. ...@@ -14,6 +17,8 @@ vLLM supports AMD GPUs with ROCm 6.3.
# --8<-- [end:requirements] # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] # --8<-- [start:set-up-using-python]
There is no extra information on creating a new Python environment for this device.
# --8<-- [end:set-up-using-python] # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] # --8<-- [start:pre-built-wheels]
...@@ -90,7 +95,7 @@ Currently, there are no pre-built ROCm wheels. ...@@ -90,7 +95,7 @@ Currently, there are no pre-built ROCm wheels.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps: 4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
??? Commands ??? console "Commands"
```bash ```bash
pip install --upgrade pip pip install --upgrade pip
...@@ -123,9 +128,7 @@ Currently, there are no pre-built ROCm wheels. ...@@ -123,9 +128,7 @@ Currently, there are no pre-built ROCm wheels.
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level. - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization). For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
## Set up using Docker (Recommended) # --8<-- [end:build-wheel-from-source]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
...@@ -203,7 +206,7 @@ DOCKER_BUILDKIT=1 docker build \ ...@@ -203,7 +206,7 @@ DOCKER_BUILDKIT=1 docker build \
To run the above docker image `vllm-rocm`, use the below command: To run the above docker image `vllm-rocm`, use the below command:
??? Command ??? console "Command"
```bash ```bash
docker run -it \ docker run -it \
...@@ -227,4 +230,3 @@ Where the `<path/to/model>` is the location where the model is stored, for examp ...@@ -227,4 +230,3 @@ Where the `<path/to/model>` is the location where the model is stored, for examp
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information. See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
# --8<-- [end:supported-features] # --8<-- [end:supported-features]
# --8<-- [end:extra-information]
...@@ -14,6 +14,8 @@ vLLM initially supports basic model inference and serving on Intel GPU platform. ...@@ -14,6 +14,8 @@ vLLM initially supports basic model inference and serving on Intel GPU platform.
# --8<-- [end:requirements] # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] # --8<-- [start:set-up-using-python]
There is no extra information on creating a new Python environment for this device.
# --8<-- [end:set-up-using-python] # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] # --8<-- [start:pre-built-wheels]
...@@ -43,9 +45,6 @@ VLLM_TARGET_DEVICE=xpu python setup.py install ...@@ -43,9 +45,6 @@ VLLM_TARGET_DEVICE=xpu python setup.py install
type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet. type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
Currently, there are no pre-built XPU images. Currently, there are no pre-built XPU images.
...@@ -81,4 +80,8 @@ python -m vllm.entrypoints.openai.api_server \ ...@@ -81,4 +80,8 @@ python -m vllm.entrypoints.openai.api_server \
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script. By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
# --8<-- [end:supported-features] # --8<-- [end:supported-features]
# --8<-- [end:extra-information] # --8<-- [start:distributed-backend]
XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as distributed backend, since torch 2.8 supports **xccl** as built-in backend for XPU.
# --8<-- [end:distributed-backend]
...@@ -28,7 +28,7 @@ To verify that the Intel Gaudi software was correctly installed, run: ...@@ -28,7 +28,7 @@ To verify that the Intel Gaudi software was correctly installed, run:
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
pip list | grep neural # verify that neural_compressor is installed pip list | grep neural # verify that neural_compressor_pt is installed
``` ```
Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade) Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
...@@ -109,8 +109,8 @@ docker run \ ...@@ -109,8 +109,8 @@ docker run \
### Supported features ### Supported features
- [Offline inference][offline-inference] - [Offline inference](../../serving/offline_inference.md)
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server] - Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
- HPU autodetection - no need to manually select device within vLLM - HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops, - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
...@@ -120,12 +120,13 @@ docker run \ ...@@ -120,12 +120,13 @@ docker run \
- Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) - Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
for accelerating low-batch latency and throughput for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi) - Attention with Linear Biases (ALiBi)
- INC quantization
### Unsupported features ### Unsupported features
- Beam search - Beam search
- LoRA adapters - LoRA adapters
- Quantization - AWQ quantization
- Prefill chunking (mixed-batch inferencing) - Prefill chunking (mixed-batch inferencing)
### Supported configurations ### Supported configurations
...@@ -133,36 +134,20 @@ docker run \ ...@@ -133,36 +134,20 @@ docker run \
The following configurations have been validated to function with The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work. Gaudi2 devices. Configurations that are not listed may or may not work.
- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | Model | TP Size| dtype | Sampling |
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 |-------|--------|--------|----------|
datatype with random or greedy sampling | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) | 1, 2, 8 | BF16 | Random / Greedy |
- [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 1, 2, 8 | BF16 | Random / Greedy |
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 | [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | 1, 2, 8 | BF16 | Random / Greedy |
datatype with random or greedy sampling | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 1, 2, 8 | BF16 | Random / Greedy |
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 | [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) | 1, 2, 8 | BF16 | Random / Greedy |
datatype with random or greedy sampling | [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) | 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | 8 | BF16 | Random / Greedy |
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 | [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | 8 | BF16 | Random / Greedy |
datatype with random or greedy sampling | [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | 8 | BF16 | Random / Greedy |
- [meta-llama/Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | 8 | BF16 | Random / Greedy |
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 | [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | 8 | BF16 | Random / Greedy |
datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16
datatype with random or greedy sampling
- [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling
## Performance tuning ## Performance tuning
...@@ -237,7 +222,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come ...@@ -237,7 +222,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup: Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
??? Logs ??? console "Logs"
```text ```text
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
...@@ -286,7 +271,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi ...@@ -286,7 +271,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released): Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
??? Logs ??? console "Logs"
```text ```text
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024] INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
......
--- # Quickstart
title: Quickstart
---
[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform: This guide will help you quickly get started with vLLM to perform:
...@@ -43,7 +40,7 @@ uv pip install vllm --torch-backend=auto ...@@ -43,7 +40,7 @@ uv pip install vllm --torch-backend=auto
``` ```
!!! note !!! note
For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM. For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline } [](){ #quickstart-offline }
...@@ -77,7 +74,7 @@ prompts = [ ...@@ -77,7 +74,7 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95) sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
``` ```
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models]. The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
```python ```python
llm = LLM(model="facebook/opt-125m") llm = LLM(model="facebook/opt-125m")
...@@ -147,7 +144,7 @@ curl http://localhost:8000/v1/completions \ ...@@ -147,7 +144,7 @@ curl http://localhost:8000/v1/completions \
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package: Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
??? Code ??? code
```python ```python
from openai import OpenAI from openai import OpenAI
...@@ -186,7 +183,7 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -186,7 +183,7 @@ curl http://localhost:8000/v1/chat/completions \
Alternatively, you can use the `openai` Python package: Alternatively, you can use the `openai` Python package:
??? Code ??? code
```python ```python
from openai import OpenAI from openai import OpenAI
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import logging
import sys
from argparse import SUPPRESS, HelpFormatter
from pathlib import Path
from typing import Literal
from unittest.mock import MagicMock, patch
ROOT_DIR = Path(__file__).parent.parent.parent.parent
ARGPARSE_DOC_DIR = ROOT_DIR / "docs/argparse"
sys.path.insert(0, str(ROOT_DIR))
sys.modules["aiohttp"] = MagicMock()
sys.modules["blake3"] = MagicMock()
sys.modules["vllm._C"] = MagicMock()
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs # noqa: E402
from vllm.entrypoints.openai.cli_args import make_arg_parser # noqa: E402
from vllm.utils import FlexibleArgumentParser # noqa: E402
logger = logging.getLogger("mkdocs")
class MarkdownFormatter(HelpFormatter):
"""Custom formatter that generates markdown for argument groups."""
def __init__(self, prog, starting_heading_level=3):
super().__init__(prog,
max_help_position=float('inf'),
width=float('inf'))
self._section_heading_prefix = "#" * starting_heading_level
self._argument_heading_prefix = "#" * (starting_heading_level + 1)
self._markdown_output = []
def start_section(self, heading):
if heading not in {"positional arguments", "options"}:
heading_md = f"\n{self._section_heading_prefix} {heading}\n\n"
self._markdown_output.append(heading_md)
def end_section(self):
pass
def add_text(self, text):
if text:
self._markdown_output.append(f"{text.strip()}\n\n")
def add_usage(self, usage, actions, groups, prefix=None):
pass
def add_arguments(self, actions):
for action in actions:
if (len(action.option_strings) == 0
or "--help" in action.option_strings):
continue
option_strings = f'`{"`, `".join(action.option_strings)}`'
heading_md = f"{self._argument_heading_prefix} {option_strings}\n\n"
self._markdown_output.append(heading_md)
if choices := action.choices:
choices = f'`{"`, `".join(str(c) for c in choices)}`'
self._markdown_output.append(
f"Possible choices: {choices}\n\n")
self._markdown_output.append(f"{action.help}\n\n")
if (default := action.default) != SUPPRESS:
self._markdown_output.append(f"Default: `{default}`\n\n")
def format_help(self):
"""Return the formatted help as markdown."""
return "".join(self._markdown_output)
def create_parser(cls, **kwargs) -> FlexibleArgumentParser:
"""Create a parser for the given class with markdown formatting.
Args:
cls: The class to create a parser for
**kwargs: Additional keyword arguments to pass to `cls.add_cli_args`.
Returns:
FlexibleArgumentParser: A parser with markdown formatting for the class.
"""
parser = FlexibleArgumentParser()
parser.formatter_class = MarkdownFormatter
with patch("vllm.config.DeviceConfig.__post_init__"):
return cls.add_cli_args(parser, **kwargs)
def create_serve_parser() -> FlexibleArgumentParser:
"""Create a parser for the serve command with markdown formatting."""
parser = FlexibleArgumentParser()
parser.formatter_class = lambda prog: MarkdownFormatter(
prog, starting_heading_level=4)
return make_arg_parser(parser)
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
logger.info("Generating argparse documentation")
logger.debug("Root directory: %s", ROOT_DIR.resolve())
logger.debug("Output directory: %s", ARGPARSE_DOC_DIR.resolve())
# Create the ARGPARSE_DOC_DIR if it doesn't exist
if not ARGPARSE_DOC_DIR.exists():
ARGPARSE_DOC_DIR.mkdir(parents=True)
# Create parsers to document
parsers = {
"engine_args": create_parser(EngineArgs),
"async_engine_args": create_parser(AsyncEngineArgs,
async_args_only=True),
"serve": create_serve_parser(),
}
# Generate documentation for each parser
for stem, parser in parsers.items():
doc_path = ARGPARSE_DOC_DIR / f"{stem}.md"
with open(doc_path, "w") as f:
f.write(parser.format_help())
logger.info("Argparse generated: %s", doc_path.relative_to(ROOT_DIR))
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools import itertools
import logging
from dataclasses import dataclass, field from dataclasses import dataclass, field
from pathlib import Path from pathlib import Path
from typing import Literal from typing import Literal
import regex as re import regex as re
logger = logging.getLogger("mkdocs")
ROOT_DIR = Path(__file__).parent.parent.parent.parent ROOT_DIR = Path(__file__).parent.parent.parent.parent
ROOT_DIR_RELATIVE = '../../../../..' ROOT_DIR_RELATIVE = '../../../../..'
EXAMPLE_DIR = ROOT_DIR / "examples" EXAMPLE_DIR = ROOT_DIR / "examples"
EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples" EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
print(ROOT_DIR.resolve())
print(EXAMPLE_DIR.resolve())
print(EXAMPLE_DOC_DIR.resolve())
def fix_case(text: str) -> str: def fix_case(text: str) -> str:
...@@ -135,6 +135,11 @@ class Example: ...@@ -135,6 +135,11 @@ class Example:
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
logger.info("Generating example documentation")
logger.debug("Root directory: %s", ROOT_DIR.resolve())
logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
# Create the EXAMPLE_DOC_DIR if it doesn't exist # Create the EXAMPLE_DOC_DIR if it doesn't exist
if not EXAMPLE_DOC_DIR.exists(): if not EXAMPLE_DOC_DIR.exists():
EXAMPLE_DOC_DIR.mkdir(parents=True) EXAMPLE_DOC_DIR.mkdir(parents=True)
...@@ -156,8 +161,8 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): ...@@ -156,8 +161,8 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
for example in sorted(examples, key=lambda e: e.path.stem): for example in sorted(examples, key=lambda e: e.path.stem):
example_name = f"{example.path.stem}.md" example_name = f"{example.path.stem}.md"
doc_path = EXAMPLE_DOC_DIR / example.category / example_name doc_path = EXAMPLE_DOC_DIR / example.category / example_name
print(doc_path)
if not doc_path.parent.exists(): if not doc_path.parent.exists():
doc_path.parent.mkdir(parents=True) doc_path.parent.mkdir(parents=True)
with open(doc_path, "w+") as f: with open(doc_path, "w+") as f:
f.write(example.generate()) f.write(example.generate())
logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This is basically a port of MyST parser’s external URL resolution mechanism
(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
to work with MkDocs.
It allows Markdown authors to use GitHub shorthand links like:
- [Text](gh-issue:123)
- <gh-pr:456>
- [File](gh-file:path/to/file.py#L10)
These are automatically rewritten into fully qualified GitHub URLs pointing to
issues, pull requests, files, directories, or projects in the
`vllm-project/vllm` repository.
The goal is to simplify cross-referencing common GitHub resources
in project docs.
"""
import regex as re import regex as re
from mkdocs.config.defaults import MkDocsConfig from mkdocs.config.defaults import MkDocsConfig
from mkdocs.structure.files import Files from mkdocs.structure.files import Files
...@@ -7,11 +26,42 @@ from mkdocs.structure.pages import Page ...@@ -7,11 +26,42 @@ from mkdocs.structure.pages import Page
def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig, def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
files: Files): files: Files) -> str:
"""
Custom MkDocs plugin hook to rewrite special GitHub reference links
in Markdown.
This function scans the given Markdown content for specially formatted
GitHub shorthand links, such as:
- `[Link text](gh-issue:123)`
- `<gh-pr:456>`
And rewrites them into fully-qualified GitHub URLs with GitHub icons:
- `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
- `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
Supported shorthand types:
- `gh-issue`
- `gh-pr`
- `gh-project`
- `gh-dir`
- `gh-file`
Args:
markdown (str): The raw Markdown content of the page.
page (Page): The MkDocs page object being processed.
config (MkDocsConfig): The MkDocs site configuration.
files (Files): The collection of files in the MkDocs build.
Returns:
str: The updated Markdown content with GitHub shorthand links replaced.
"""
gh_icon = ":octicons-mark-github-16:" gh_icon = ":octicons-mark-github-16:"
gh_url = "https://github.com" gh_url = "https://github.com"
repo_url = f"{gh_url}/vllm-project/vllm" repo_url = f"{gh_url}/vllm-project/vllm"
org_url = f"{gh_url}/orgs/vllm-project" org_url = f"{gh_url}/orgs/vllm-project"
# Mapping of shorthand types to their corresponding GitHub base URLs
urls = { urls = {
"issue": f"{repo_url}/issues", "issue": f"{repo_url}/issues",
"pr": f"{repo_url}/pull", "pr": f"{repo_url}/pull",
...@@ -19,6 +69,8 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig, ...@@ -19,6 +69,8 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
"dir": f"{repo_url}/tree/main", "dir": f"{repo_url}/tree/main",
"file": f"{repo_url}/blob/main", "file": f"{repo_url}/blob/main",
} }
# Default title prefixes for auto links
titles = { titles = {
"issue": "Issue #", "issue": "Issue #",
"pr": "Pull Request #", "pr": "Pull Request #",
...@@ -27,11 +79,19 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig, ...@@ -27,11 +79,19 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
"file": "", "file": "",
} }
# Regular expression to match GitHub shorthand links
scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?" scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)") inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
auto_link = re.compile(f"<{scheme}>") auto_link = re.compile(f"<{scheme}>")
def replace_inline_link(match: re.Match) -> str: def replace_inline_link(match: re.Match) -> str:
"""
Replaces a matched inline-style GitHub shorthand link
with a full Markdown link.
Example:
[My issue](gh-issue:123) → [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
"""
url = f'{urls[match.group("type")]}/{match.group("path")}' url = f'{urls[match.group("type")]}/{match.group("path")}'
if fragment := match.group("fragment"): if fragment := match.group("fragment"):
url += f"#{fragment}" url += f"#{fragment}"
...@@ -39,6 +99,13 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig, ...@@ -39,6 +99,13 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
return f'[{gh_icon} {match.group("title")}]({url})' return f'[{gh_icon} {match.group("title")}]({url})'
def replace_auto_link(match: re.Match) -> str: def replace_auto_link(match: re.Match) -> str:
"""
Replaces a matched autolink-style GitHub shorthand
with a full Markdown link.
Example:
<gh-pr:456> → [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
"""
type = match.group("type") type = match.group("type")
path = match.group("path") path = match.group("path")
title = f"{titles[type]}{path}" title = f"{titles[type]}{path}"
...@@ -48,6 +115,7 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig, ...@@ -48,6 +115,7 @@ def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
return f"[{gh_icon} {title}]({url})" return f"[{gh_icon} {title}]({url})"
# Replace both inline and autolinks
markdown = inline_link.sub(replace_inline_link, markdown) markdown = inline_link.sub(replace_inline_link, markdown)
markdown = auto_link.sub(replace_auto_link, markdown) markdown = auto_link.sub(replace_auto_link, markdown)
......
<!-- Enables the use of toc_depth in document frontmatter https://github.com/squidfunk/mkdocs-material/issues/4827#issuecomment-1869812019 -->
<li class="md-nav__item">
<a href="{{ toc_item.url }}" class="md-nav__link">
<span class="md-ellipsis">
{{ toc_item.title }}
</span>
</a>
<!-- Table of contents list -->
{% if toc_item.children %}
<nav class="md-nav" aria-label="{{ toc_item.title | striptags }}">
<ul class="md-nav__list">
{% for toc_item in toc_item.children %}
{% if not page.meta.toc_depth or toc_item.level <= page.meta.toc_depth %}
{% include "partials/toc-item.html" %}
{% endif %}
{% endfor %}
</ul>
</nav>
{% endif %}
</li>
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment