Commit 4eabe123 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

parents 45840cd2 58738772
(quantized-kvcache)= ---
title: Quantized KV Cache
# Quantized KV Cache ---
[](){ #quantized-kvcache }
## FP8 KV Cache ## FP8 KV Cache
......
(quark)= ---
title: AMD QUARK
# AMD QUARK ---
[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/), throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
...@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check ...@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check
for further details. Here we use FP8 per-tensor quantization on weight, activation, for further details. Here we use FP8 per-tensor quantization on weight, activation,
kv-cache and the quantization algorithm is AutoSmoothQuant. kv-cache and the quantization algorithm is AutoSmoothQuant.
:::{note} !!! note
Note the quantization algorithm needs a JSON config file and the config file is located in Note the quantization algorithm needs a JSON config file and the config file is located in
[Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html), [Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
under the directory `examples/torch/language_modeling/llm_ptq/models`. For example, under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
AutoSmoothQuant config file for Llama is AutoSmoothQuant config file for Llama is
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`. `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
:::
```python ```python
from quark.torch.quantization import (Config, QuantizationConfig, from quark.torch.quantization import (Config, QuantizationConfig,
......
---
title: Supported Hardware
---
[](){ #quantization-supported-hardware }
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
!!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
...@@ -7,7 +7,9 @@ We recommend installing the latest torchao nightly with ...@@ -7,7 +7,9 @@ We recommend installing the latest torchao nightly with
```console ```console
# Install the latest TorchAO nightly build # Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.) # Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install --pre torchao>=10.0.0 --index-url https://download.pytorch.org/whl/nightly/cu126 pip install \
--pre torchao>=10.0.0 \
--index-url https://download.pytorch.org/whl/nightly/cu126
``` ```
## Quantizing HuggingFace Models ## Quantizing HuggingFace Models
...@@ -20,7 +22,12 @@ from torchao.quantization import Int8WeightOnlyConfig ...@@ -20,7 +22,12 @@ from torchao.quantization import Int8WeightOnlyConfig
model_name = "meta-llama/Meta-Llama-3-8B" model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = TorchAoConfig(Int8WeightOnlyConfig()) quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config) quantized_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?" input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
......
(reasoning-outputs)= ---
title: Reasoning Outputs
# Reasoning Outputs ---
[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions. vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
...@@ -17,17 +18,17 @@ vLLM currently supports the following reasoning models: ...@@ -17,17 +18,17 @@ vLLM currently supports the following reasoning models:
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ | | [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ | | [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
:::{note} !!! note
IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`. IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`. The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
:::
## Quickstart ## Quickstart
To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output. To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
```bash ```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--reasoning-parser deepseek_r1
``` ```
Next, make a request to the model that should return the reasoning content in the response. Next, make a request to the model that should return the reasoning content in the response.
...@@ -167,12 +168,10 @@ client = OpenAI( ...@@ -167,12 +168,10 @@ client = OpenAI(
models = client.models.list() models = client.models.list()
model = models.data[0].id model = models.data[0].id
class People(BaseModel): class People(BaseModel):
name: str name: str
age: int age: int
json_schema = People.model_json_schema() json_schema = People.model_json_schema()
prompt = ("Generate a JSON with the name and age of one random person.") prompt = ("Generate a JSON with the name and age of one random person.")
......
(spec-decode)= ---
title: Speculative Decoding
---
[](){ #spec-decode }
# Speculative Decoding !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
:::{warning} !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
:::
:::{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
:::
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM. This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
...@@ -46,14 +45,18 @@ for output in outputs: ...@@ -46,14 +45,18 @@ for output in outputs:
To perform the same with an online mode launch the server: To perform the same with an online mode launch the server:
```bash ```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \ python -m vllm.entrypoints.openai.api_server \
--seed 42 -tp 1 --gpu_memory_utilization 0.8 \ --host 0.0.0.0 \
--port 8000 \
--model facebook/opt-6.7b \
--seed 42 \
-tp 1 \
--gpu_memory_utilization 0.8 \
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}' --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
``` ```
:::{warning} !!! warning
Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now. Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
:::
Then use a client: Then use a client:
...@@ -172,7 +175,7 @@ A variety of speculative models of this type are available on HF hub: ...@@ -172,7 +175,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models ## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](<gh-file:examples/offline_inference/eagle.py>). an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
```python ```python
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
...@@ -255,7 +258,7 @@ speculative decoding, breaking down the guarantees into three key areas: ...@@ -255,7 +258,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability** 3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors: can occur due to following factors:
...@@ -264,7 +267,7 @@ can occur due to following factors: ...@@ -264,7 +267,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability. due to non-deterministic behavior in batched operations or numerical instability.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
## Resources for vLLM contributors ## Resources for vLLM contributors
......
(structured-outputs)= ---
title: Structured Outputs
# Structured Outputs ---
[](){ #structured-outputs }
vLLM supports the generation of structured outputs using vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or [xgrammar](https://github.com/mlc-ai/xgrammar) or
...@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters: ...@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar. - `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text. - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page. You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the may choose to specify the backend to use by setting the
...@@ -83,13 +84,11 @@ class CarType(str, Enum): ...@@ -83,13 +84,11 @@ class CarType(str, Enum):
truck = "Truck" truck = "Truck"
coupe = "Coupe" coupe = "Coupe"
class CarDescription(BaseModel): class CarDescription(BaseModel):
brand: str brand: str
model: str model: str
car_type: CarType car_type: CarType
json_schema = CarDescription.model_json_schema() json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create( completion = client.chat.completions.create(
...@@ -105,11 +104,10 @@ completion = client.chat.completions.create( ...@@ -105,11 +104,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content) print(completion.choices[0].message.content)
``` ```
:::{tip} !!! tip
While not strictly necessary, normally it´s better to indicate in the prompt the While not strictly necessary, normally it´s better to indicate in the prompt the
JSON schema and how the fields should be populated. This can improve the JSON schema and how the fields should be populated. This can improve the
results notably in most cases. results notably in most cases.
:::
Finally we have the `guided_grammar` option, which is probably the most Finally we have the `guided_grammar` option, which is probably the most
difficult to use, but it´s really powerful. It allows us to define complete difficult to use, but it´s really powerful. It allows us to define complete
...@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant ...@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant
from pydantic import BaseModel from pydantic import BaseModel
from openai import OpenAI from openai import OpenAI
class Info(BaseModel): class Info(BaseModel):
name: str name: str
age: int age: int
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse( completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct", model="meta-llama/Llama-3.1-8B-Instruct",
...@@ -199,17 +195,14 @@ from typing import List ...@@ -199,17 +195,14 @@ from typing import List
from pydantic import BaseModel from pydantic import BaseModel
from openai import OpenAI from openai import OpenAI
class Step(BaseModel): class Step(BaseModel):
explanation: str explanation: str
output: str output: str
class MathResponse(BaseModel): class MathResponse(BaseModel):
steps: list[Step] steps: list[Step]
final_answer: str final_answer: str
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse( completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct", model="meta-llama/Llama-3.1-8B-Instruct",
......
...@@ -93,7 +93,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha ...@@ -93,7 +93,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha
## Required Function Calling ## Required Function Calling
vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#feature-model) for the V1 engine. vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine.
When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter. When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
...@@ -158,13 +158,13 @@ All Llama 3.1, 3.2 and 4 models should be supported. ...@@ -158,13 +158,13 @@ All Llama 3.1, 3.2 and 4 models should be supported.
* `meta-llama/Llama-3.2-*` * `meta-llama/Llama-3.2-*`
* `meta-llama/Llama-4-*` * `meta-llama/Llama-4-*`
The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. The tool calling that is supported is the [JSON based tool calling](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#json-based-tool-calling). For [pythonic tool calling](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#zero-shot-function-calling) introduced by the Llama-3.2 models, see the `pythonic` tool parser below. As for llama 4 models, it is recommended to use the `llama4_pythonic` tool parser.
Other tool calling formats like the built in python tool calling or custom tool calling are not supported. Other tool calling formats like the built in python tool calling or custom tool calling are not supported.
Known issues: Known issues:
1. Parallel tool calls are not supported. 1. Parallel tool calls are not supported for llama 3, but it is supported in llama 4 models.
2. The model can generate parameters with a wrong format, such as generating 2. The model can generate parameters with a wrong format, such as generating
an array serialized as string instead of an array. an array serialized as string instead of an array.
...@@ -177,11 +177,10 @@ images. ...@@ -177,11 +177,10 @@ images.
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}` Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
VLLM also provides a JSON based chat template for Llama 4: VLLM also provides a pythonic and JSON based chat template for Llama 4, but pythonic tool calling is recommended:
* <gh-file:examples/tool_chat_template_llama4_json.jinja> - this is based on the "official" chat template for the Llama 4 * <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
models, but tweaked so that it works better with vLLM.
For Llama 4 use `--tool-call-parser llama4_json examples/tool_chat_template_llama4_json.jinja`. For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
#### IBM Granite #### IBM Granite
...@@ -323,7 +322,6 @@ class ExampleToolParser(ToolParser): ...@@ -323,7 +322,6 @@ class ExampleToolParser(ToolParser):
tool_calls=[], tool_calls=[],
content=text) content=text)
``` ```
Then you can use this plugin in the command line like this. Then you can use this plugin in the command line like this.
......
nav:
- README.md
- gpu.md
- cpu.md
- ai_accelerator.md
\ No newline at end of file
---
title: Installation
---
[](){ #installation-index }
vLLM supports the following hardware platforms:
- [GPU](gpu.md)
- [NVIDIA CUDA](gpu.md#nvidia-cuda)
- [AMD ROCm](gpu.md#amd-rocm)
- [Intel XPU](gpu.md#intel-xpu)
- [CPU](cpu.md)
- [Intel/AMD x86](cpu.md#intelamd-x86)
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Other AI accelerators](ai_accelerator.md)
- [Google TPU](ai_accelerator.md#google-tpu)
- [Intel Gaudi](ai_accelerator.md#intel-gaudi)
- [AWS Neuron](ai_accelerator.md#aws-neuron)
# Other AI accelerators
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
## Requirements
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
## Configure a new environment
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
## Set up using Python
### Pre-built wheels
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
### Build wheel from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
### Build image from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
## Extra information
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
# Installation # --8<-- [start:installation]
This tab provides instructions on running vLLM with Intel Gaudi devices. This tab provides instructions on running vLLM with Intel Gaudi devices.
:::{attention} !!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source. There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Ubuntu 22.04 LTS - OS: Ubuntu 22.04 LTS
- Python: 3.10 - Python: 3.10
...@@ -45,16 +45,27 @@ Use the following commands to run a Docker image: ...@@ -45,16 +45,27 @@ Use the following commands to run a Docker image:
```console ```console
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest docker run \
-it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--ipc=host \
vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
``` ```
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Intel Gaudi wheels. Currently, there are no pre-built Intel Gaudi wheels.
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
To build and install vLLM from source, run: To build and install vLLM from source, run:
...@@ -75,29 +86,39 @@ pip install -r requirements/hpu.txt ...@@ -75,29 +86,39 @@ pip install -r requirements/hpu.txt
python setup.py develop python setup.py develop
``` ```
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Intel Gaudi images. Currently, there are no pre-built Intel Gaudi images.
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```console ```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env . docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env docker run \
-it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--rm vllm-hpu-env
``` ```
:::{tip} !!! tip
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered. If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
:::
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
## Supported features ## Supported features
- [Offline inference](#offline-inference) - [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server](#openai-compatible-server) - Online serving via [OpenAI-Compatible Server][openai-compatible-server]
- HPU autodetection - no need to manually select device within vLLM - HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops, - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
...@@ -157,41 +178,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work. ...@@ -157,41 +178,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag. Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
:::{list-table} vLLM execution modes | `PT_HPU_LAZY_MODE` | `enforce_eager` | execution mode |
:widths: 25 25 50 |----------------------|-------------------|--------------------|
:header-rows: 1 | 0 | 0 | torch.compile |
| 0 | 1 | PyTorch eager mode |
- * `PT_HPU_LAZY_MODE` | 1 | 0 | HPU Graphs |
* `enforce_eager` <figcaption>vLLM execution modes</figcaption>
* execution mode
- * 0 !!! warning
* 0 In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
* torch.compile
- * 0 [](){ #gaudi-bucketing-mechanism }
* 1
* PyTorch eager mode
- * 1
* 0
* HPU Graphs
- * 1
* 1
* PyTorch lazy mode
:::
:::{warning}
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
:::
(gaudi-bucketing-mechanism)=
### Bucketing mechanism ### Bucketing mechanism
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution. Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - `batch_size` and `sequence_length`. In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - `batch_size` and `sequence_length`.
:::{note} !!! note
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase. Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
:::
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup: Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
...@@ -224,15 +229,13 @@ min = 128, step = 128, max = 512 ...@@ -224,15 +229,13 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket. In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
:::{warning} !!! warning
If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario. If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
:::
As an example, if a request of 3 sequences, with max sequence length of 412 comes in to an idle vLLM server, it will be padded executed as `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After prefill stage, it will be executed as `(4, 512)` decode bucket and will continue as that bucket until either batch dimension changes (due to request being finished) - in which case it will become a `(2, 512)` bucket, or context length increases above 512 tokens, in which case it will become `(4, 640)` bucket. As an example, if a request of 3 sequences, with max sequence length of 412 comes in to an idle vLLM server, it will be padded executed as `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After prefill stage, it will be executed as `(4, 512)` decode bucket and will continue as that bucket until either batch dimension changes (due to request being finished) - in which case it will become a `(2, 512)` bucket, or context length increases above 512 tokens, in which case it will become `(4, 640)` bucket.
:::{note} !!! note
Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests. Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
:::
### Warmup ### Warmup
...@@ -252,11 +255,10 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size ...@@ -252,11 +255,10 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
``` ```
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations. This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
:::{tip} !!! tip
Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment. Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
:::
### HPU Graph capture ### HPU Graph capture
...@@ -271,9 +273,8 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil ...@@ -271,9 +273,8 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints. Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints.
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs. Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
:::{note} !!! note
`gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory. `gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
:::
User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented: User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
...@@ -282,9 +283,8 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec ...@@ -282,9 +283,8 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in `min_tokens` strategy. When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in `min_tokens` strategy.
:::{note} !!! note
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below. `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
:::
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released): Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
...@@ -401,3 +401,4 @@ the below: ...@@ -401,3 +401,4 @@ the below:
higher batches. You can do that by adding `--enforce-eager` flag to higher batches. You can do that by adding `--enforce-eager` flag to
server (for online serving), or by passing `enforce_eager=True` server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference). argument to LLM constructor (for offline inference).
# --8<-- [end:extra-information]
# Installation # --8<-- [start:installation]
vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching. vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
Paged Attention and Chunked Prefill are currently in development and will be available soon. Paged Attention and Chunked Prefill are currently in development and will be available soon.
Data types currently supported in Neuron SDK are FP16 and BF16. Data types currently supported in Neuron SDK are FP16 and BF16.
:::{attention} !!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source. There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux - OS: Linux
- Python: 3.9 -- 3.11 - Python: 3.9 -- 3.11
...@@ -38,7 +38,8 @@ The installation of drivers and tools wouldn't be necessary, if [Deep Learning A ...@@ -38,7 +38,8 @@ The installation of drivers and tools wouldn't be necessary, if [Deep Learning A
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add - wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB \
| sudo apt-key add -
# Update OS packages # Update OS packages
sudo apt-get update -y sudo apt-get update -y
...@@ -63,17 +64,19 @@ sudo apt-get install aws-neuronx-tools=2.* -y ...@@ -63,17 +64,19 @@ sudo apt-get install aws-neuronx-tools=2.* -y
export PATH=/opt/aws/neuron/bin:$PATH export PATH=/opt/aws/neuron/bin:$PATH
``` ```
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Neuron wheels. Currently, there are no pre-built Neuron wheels.
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
:::{note} !!! note
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel. The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
:::
Following instructions are applicable to Neuron SDK 2.16 and beyond. Following instructions are applicable to Neuron SDK 2.16 and beyond.
...@@ -94,12 +97,17 @@ source aws_neuron_venv_pytorch/bin/activate ...@@ -94,12 +97,17 @@ source aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel # Install Jupyter notebook kernel
pip install ipykernel pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)" python3.10 -m ipykernel install \
--user \
--name aws_neuron_venv_pytorch \
--display-name "Python (torch-neuronx)"
pip install jupyter notebook pip install jupyter notebook
pip install environment_kernels pip install environment_kernels
# Set pip repository pointing to the Neuron repository # Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com python -m pip config set \
global.extra-index-url \
https://pip.repos.neuron.amazonaws.com
# Install wget, awscli # Install wget, awscli
python -m pip install wget python -m pip install wget
...@@ -122,18 +130,23 @@ VLLM_TARGET_DEVICE="neuron" pip install . ...@@ -122,18 +130,23 @@ VLLM_TARGET_DEVICE="neuron" pip install .
If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed. If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed.
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Neuron images. Currently, there are no pre-built Neuron images.
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
See <project:#deployment-docker-build-image-from-source> for instructions on building the Docker image. See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile. Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
There is no extra information for this device. There is no extra information for this device.
# --8<-- [end:extra-information]
# Installation # --8<-- [start:installation]
Tensor Processing Units (TPUs) are Google's custom-developed application-specific Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
...@@ -30,11 +30,11 @@ For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tp ...@@ -30,11 +30,11 @@ For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tp
You may need additional persistent storage for your TPU VMs. For more You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options). information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options).
:::{attention} !!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source. There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- Google Cloud TPU VM - Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4 - TPU versions: v6e, v5e, v5p, v4
...@@ -51,10 +51,9 @@ When you request queued resources, the request is added to a queue maintained by ...@@ -51,10 +51,9 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use. assigned to your Google Cloud project for your immediate exclusive use.
:::{note} !!! note
In all of the following commands, replace the ALL CAPS parameter names with In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information. appropriate values. See the parameter descriptions table for more information.
:::
### Provision Cloud TPUs with GKE ### Provision Cloud TPUs with GKE
...@@ -79,33 +78,15 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \ ...@@ -79,33 +78,15 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT --service-account SERVICE_ACCOUNT
``` ```
:::{list-table} Parameter descriptions | Parameter name | Description |
:header-rows: 1 |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
- * Parameter name | TPU_NAME | The user-assigned name of the TPU which is created when the queued |
* Description | PROJECT_ID | Your Google Cloud project |
- * QUEUED_RESOURCE_ID | ZONE | The GCP zone where you want to create your Cloud TPU. The value you use |
* The user-assigned ID of the queued resource request. | ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example |
- * TPU_NAME | RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
* The user-assigned name of the TPU which is created when the queued <figcaption>Parameter descriptions</figcaption>
resource request is allocated.
- * PROJECT_ID
* Your Google Cloud project
- * ZONE
* The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
- * ACCELERATOR_TYPE
* The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information,
see [TPU versions](https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions).
- * RUNTIME_VERSION
* The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes).
- * SERVICE_ACCOUNT
* The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
:::
Connect to your TPU using SSH: Connect to your TPU using SSH:
...@@ -113,13 +94,16 @@ Connect to your TPU using SSH: ...@@ -113,13 +94,16 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
``` ```
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built TPU wheels. Currently, there are no pre-built TPU wheels.
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
Install Miniconda: Install Miniconda:
...@@ -161,13 +145,16 @@ Run the setup script: ...@@ -161,13 +145,16 @@ Run the setup script:
VLLM_TARGET_DEVICE="tpu" python -m pip install -e . VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
``` ```
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
See <project:#deployment-docker-pre-built-image> for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`. See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support. You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
...@@ -182,31 +169,30 @@ Run the Docker image with the following command: ...@@ -182,31 +169,30 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu docker run --privileged --net host --shm-size=16G -it vllm-tpu
``` ```
:::{note} !!! note
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default). cached in the disk (in `VLLM_XLA_CACHE_PATH` or `~/.cache/vllm/xla_cache` by default).
:::
:::{tip} !!! tip
If you encounter the following error: If you encounter the following error:
```console ```console
from torch._C import * # noqa: F403 from torch._C import * # noqa: F403
ImportError: libopenblas.so.0: cannot open shared object file: No such ImportError: libopenblas.so.0: cannot open shared object file: No such
file or directory file or directory
``` ```
Install OpenBLAS with the following command:
```console Install OpenBLAS with the following command:
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
::: ```console
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
There is no extra information for this device. There is no extra information for this device.
# --8<-- [end:extra-information]
...@@ -2,107 +2,47 @@ ...@@ -2,107 +2,47 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions: vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
:::::{tab-set} === "Intel/AMD x86"
:sync-group: device
::::{tab-item} Intel/AMD x86 --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
:selected:
:sync: x86
:::{include} cpu/x86.inc.md === "ARM AArch64"
:start-after: "# Installation"
:end-before: "## Requirements"
:::
:::: --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
::::{tab-item} ARM AArch64 === "Apple silicon"
:sync: arm
:::{include} cpu/arm.inc.md --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
:start-after: "# Installation"
:end-before: "## Requirements"
:::
:::: === "IBM Z (S390X)"
::::{tab-item} Apple silicon --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
:sync: apple
:::{include} cpu/apple.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} IBM Z (S390X)
:sync: s390x
:::{include} cpu/s390x.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements ## Requirements
- Python: 3.9 -- 3.12 - Python: 3.9 -- 3.12
:::::{tab-set} === "Intel/AMD x86"
:sync-group: device
::::{tab-item} Intel/AMD x86
:sync: x86
:::{include} cpu/x86.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} ARM AArch64
:sync: arm
:::{include} cpu/arm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
:::: --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
::::{tab-item} Apple silicon === "ARM AArch64"
:sync: apple
:::{include} cpu/apple.inc.md --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
:::: === "Apple silicon"
::::{tab-item} IBM Z (S390X) --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
:sync: s390x
:::{include} cpu/s390x.inc.md === "IBM Z (S390X)"
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
:::: --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
:::::
## Set up using Python ## Set up using Python
### Create a new Python environment ### Create a new Python environment
:::{include} python_env_setup.inc.md --8<-- "docs/getting_started/installation/python_env_setup.inc.md"
:::
### Pre-built wheels ### Pre-built wheels
...@@ -110,69 +50,29 @@ Currently, there are no pre-built CPU wheels. ...@@ -110,69 +50,29 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source ### Build wheel from source
:::::{tab-set} === "Intel/AMD x86"
:sync-group: device
::::{tab-item} Intel/AMD x86
:sync: x86
:::{include} cpu/x86.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} ARM AArch64
:sync: arm
:::{include} cpu/arm.inc.md --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
:::: === "ARM AArch64"
::::{tab-item} Apple silicon --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
:sync: apple
:::{include} cpu/apple.inc.md === "Apple silicon"
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
:::: --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
::::{tab-item} IBM Z (s390x) === "IBM Z (s390x)"
:sync: s390x
:::{include} cpu/s390x.inc.md --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker ## Set up using Docker
### Pre-built images ### Pre-built images
:::::{tab-set} === "Intel/AMD x86"
:sync-group: device
::::{tab-item} Intel/AMD x86
:sync: x86
:::{include} cpu/x86.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::: --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
### Build image from source ### Build image from source
...@@ -192,13 +92,11 @@ $ docker run --rm \ ...@@ -192,13 +92,11 @@ $ docker run --rm \
other vLLM OpenAI server arguments other vLLM OpenAI server arguments
``` ```
::::{tip} !!! tip
For ARM or Apple silicon, use `docker/Dockerfile.arm` For ARM or Apple silicon, use `docker/Dockerfile.arm`
::::
::::{tip} !!! tip
For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float` For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
::::
## Supported features ## Supported features
......
# Installation # --8<-- [start:installation]
vLLM has experimental support for macOS with Apple silicon. For now, users shall build from the source vLLM to natively run on macOS. vLLM has experimental support for macOS with Apple silicon. For now, users shall build from the source vLLM to natively run on macOS.
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes. Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
:::{attention} !!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source. There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: `macOS Sonoma` or later - OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools - SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0` - Compiler: `Apple Clang >= 15.0.0`
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source. After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
...@@ -29,9 +32,8 @@ pip install -r requirements/cpu.txt ...@@ -29,9 +32,8 @@ pip install -r requirements/cpu.txt
pip install -e . pip install -e .
``` ```
:::{note} !!! note
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device. On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
:::
#### Troubleshooting #### Troubleshooting
...@@ -51,10 +53,15 @@ If the build has error like the following snippet where standard C++ headers can ...@@ -51,10 +53,15 @@ If the build has error like the following snippet where standard C++ headers can
1 error generated. 1 error generated.
``` ```
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
# Installation # --8<-- [start:installation]
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes. ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
:::{attention} !!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source. There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux - OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended) - Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required - Instruction Set Architecture (ISA): NEON support is required
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
:::{include} cpu/build.inc.md --8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
:::
Testing has been conducted on AWS Graviton3 instances for compatibility. Testing has been conducted on AWS Graviton3 instances for compatibility.
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
...@@ -32,3 +32,5 @@ If you want to develop vllm, install it in editable mode instead. ...@@ -32,3 +32,5 @@ If you want to develop vllm, install it in editable mode instead.
```console ```console
VLLM_TARGET_DEVICE=cpu python setup.py develop VLLM_TARGET_DEVICE=cpu python setup.py develop
``` ```
# --8<-- [end:extra-information]
# Installation # --8<-- [start:installation]
vLLM has experimental support for s390x architecture on IBM Z platform. For now, users shall build from the vLLM source to natively run on IBM Z platform. vLLM has experimental support for s390x architecture on IBM Z platform. For now, users shall build from the vLLM source to natively run on IBM Z platform.
Currently the CPU implementation for s390x architecture supports FP32 datatype only. Currently the CPU implementation for s390x architecture supports FP32 datatype only.
:::{attention} !!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source. There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements # --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: `Linux` - OS: `Linux`
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools - SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above. - Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Build install python packages: `pyarrow`, `torch` and `torchvision` - Build install python packages: `pyarrow`, `torch` and `torchvision`
## Set up using Python # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
### Pre-built wheels # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Build wheel from source # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4: Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4:
...@@ -39,9 +42,8 @@ curl https://sh.rustup.rs -sSf | sh -s -- -y && \ ...@@ -39,9 +42,8 @@ curl https://sh.rustup.rs -sSf | sh -s -- -y && \
Execute the following commands to build and install vLLM from the source. Execute the following commands to build and install vLLM from the source.
::::{tip} !!! tip
Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM. Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM.
::::
```console ```console
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
...@@ -53,10 +55,15 @@ Please build the following dependencies, `torchvision`, `pyarrow` from the sourc ...@@ -53,10 +55,15 @@ Please build the following dependencies, `torchvision`, `pyarrow` from the sourc
pip install dist/*.whl pip install dist/*.whl
``` ```
## Set up using Docker # --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
### Pre-built images # --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Build image from source # --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
## Extra information # --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
# --8<-- [start:installation]
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
!!! tip
[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md"
!!! note
- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
# --8<-- [end:extra-information]
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment