Result: Getting the weather for San Francisco, CA in fahrenheit...
Result: Getting the weather for San Francisco, CA in fahrenheit...
```
```
This example demonstrates:
This example demonstrates:
- Setting up the server with tool calling enabled
- Defining an actual function to handle tool calls
* Setting up the server with tool calling enabled
- Making a request with `tool_choice="auto"`
* Defining an actual function to handle tool calls
- Handling the structured response and executing the corresponding function
* Making a request with `tool_choice="auto"`
* Handling the structured response and executing the corresponding function
You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
Remember that it's the callers responsibility to:
Remember that it's the callers responsibility to:
1. Define appropriate tools in the request
1. Define appropriate tools in the request
2. Include relevant context in the chat messages
2. Include relevant context in the chat messages
3. Handle the tool calls in your application logic
3. Handle the tool calls in your application logic
...
@@ -77,6 +80,7 @@ Remember that it's the callers responsibility to:
...
@@ -77,6 +80,7 @@ Remember that it's the callers responsibility to:
For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.
For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.
## Named Function Calling
## Named Function Calling
vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one.
high-quality one.
...
@@ -87,10 +91,10 @@ For best results, we recommend ensuring that the expected output format / schema
...
@@ -87,10 +91,10 @@ For best results, we recommend ensuring that the expected output format / schema
To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
## Automatic Function Calling
## Automatic Function Calling
To enable this feature, you should set the following flags:
To enable this feature, you should set the following flags:
* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it
* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it
deems appropriate.
deems appropriate.
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
...
@@ -104,28 +108,28 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso
...
@@ -104,28 +108,28 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
### Hermes Models (`hermes`)
### Hermes Models (`hermes`)
All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
* `NousResearch/Hermes-2-Pro-*`
* `NousResearch/Hermes-2-Pro-*`
* `NousResearch/Hermes-2-Theta-*`
* `NousResearch/Hermes-2-Theta-*`
* `NousResearch/Hermes-3-*`
* `NousResearch/Hermes-3-*`
_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`)
### InternLM Models (`internlm`)
Supported models:
Supported models:
* `internlm/internlm2_5-7b-chat` (confirmed)
* `internlm/internlm2_5-7b-chat` (confirmed)
* Additional internlm2.5 function-calling models are compatible as well
* Additional internlm2.5 function-calling models are compatible as well
Known issues:
Known issues:
* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.
* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:
As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:
* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
* Llama's smaller models struggle to use tools effectively.
* Llama's smaller models struggle to use tools effectively.
Example supported models:
Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
...
@@ -231,7 +239,6 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.
...
@@ -231,7 +239,6 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.
---
---
## How to write a tool parser plugin
## How to write a tool parser plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
...
@@ -284,7 +291,8 @@ class ExampleToolParser(ToolParser):
...
@@ -284,7 +291,8 @@ class ExampleToolParser(ToolParser):
```
```
Then you can use this plugin in the command line like this.
Then you can use this plugin in the command line like this.
```
```console
--enable-auto-tool-choice \
--enable-auto-tool-choice \
--tool-parser-plugin <absolute path of the plugin file>
--tool-parser-plugin <absolute path of the plugin file>
@@ -18,25 +18,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
...
@@ -18,25 +18,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
- First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
- First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
```console
```console
$export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_KVCACHE_SPACE=40
$export VLLM_CPU_OMP_THREADS_BIND=0-29
export VLLM_CPU_OMP_THREADS_BIND=0-29
$vllm serve facebook/opt-125m
vllm serve facebook/opt-125m
```
```
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
@@ -17,9 +17,9 @@ vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) bin
...
@@ -17,9 +17,9 @@ vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) bin
You can create a new Python environment using `conda`:
You can create a new Python environment using `conda`:
```console
```console
$# (Recommended) Create a new conda environment.
#(Recommended) Create a new conda environment.
$conda create -n myenv python=3.12 -y
conda create -n myenv python=3.12 -y
$conda activate myenv
conda activate myenv
```
```
```{note}
```{note}
...
@@ -29,9 +29,9 @@ $ conda activate myenv
...
@@ -29,9 +29,9 @@ $ conda activate myenv
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
```console
```console
$# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
#(Recommended) Create a new uv environment. Use `--seed` to install`pip` and `setuptools`in the environment.
$uv venv myenv --python 3.12 --seed
uv venv myenv --python 3.12 --seed
$source myenv/bin/activate
source myenv/bin/activate
```
```
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
...
@@ -43,18 +43,18 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
...
@@ -43,18 +43,18 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
You can install vLLM using either `pip` or `uv pip`:
You can install vLLM using either `pip` or `uv pip`:
```console
```console
$# Install vLLM with CUDA 12.1.
#Install vLLM with CUDA 12.1.
$pip install vllm # If you are using pip.
pip install vllm #If you are using pip.
$uv pip install vllm # If you are using uv.
uv pip install vllm #If you are using uv.
```
```
As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:
As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
```console
```console
$export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd #use full commit hash from the main branch
Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
...
@@ -85,14 +85,14 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p
...
@@ -85,14 +85,14 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p
Another way to install the latest code is to use `uv`:
Another way to install the latest code is to use `uv`:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
```console
```console
$export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 #use full commit hash from the main branch
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
...
@@ -102,8 +102,8 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
...
@@ -102,8 +102,8 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
Another way to access the latest code is to use the docker images:
Another way to access the latest code is to use the docker images:
```console
```console
$export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd #use full commit hash from the main branch
These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days.
These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days.
...
@@ -121,18 +121,18 @@ The latest code can contain bugs and may not be stable. Please use it with cauti
...
@@ -121,18 +121,18 @@ The latest code can contain bugs and may not be stable. Please use it with cauti
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
This will download the latest nightly wheel from https://wheels.vllm.ai/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl and use the compiled libraries from there in the installation.
This will download the [latest nightly wheel](https://wheels.vllm.ai/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl) and use the compiled libraries from there in the installation.
The `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable can be used instead of `VLLM_USE_PRECOMPILED` to specify a custom path or URL to the wheel file. For example, to use the [0.6.1.post1 PyPi wheel](https://pypi.org/project/vllm/#files):
The `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable can be used instead of `VLLM_USE_PRECOMPILED` to specify a custom path or URL to the wheel file. For example, to use the [0.6.1.post1 PyPi wheel](https://pypi.org/project/vllm/#files):
@@ -196,8 +196,8 @@ To avoid your system being overloaded, you can limit the number of compilation j
...
@@ -196,8 +196,8 @@ To avoid your system being overloaded, you can limit the number of compilation j
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
```console
```console
$export MAX_JOBS=6
export MAX_JOBS=6
$pip install-e .
pip install -e .
```
```
This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory.
This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory.
...
@@ -206,22 +206,22 @@ A side effect is a much slower build process.
...
@@ -206,22 +206,22 @@ A side effect is a much slower build process.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
```console
```console
$# Use `--ipc=host` to make sure the shared memory is large enough.
#Use `--ipc=host` to make sure the shared memory is large enough.
$docker run --gpus all -it--rm--ipc=host nvcr.io/nvidia/pytorch:23.10-py3
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```
```
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
```console
```console
$export CUDA_HOME=/usr/local/cuda
export CUDA_HOME=/usr/local/cuda
$export PATH="${CUDA_HOME}/bin:$PATH"
export PATH="${CUDA_HOME}/bin:$PATH"
```
```
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
```console
```console
$nvcc --version# verify that nvcc is in your PATH
nvcc --version #verify that nvcc is in your PATH
$${CUDA_HOME}/bin/nvcc --version# verify that nvcc is in your CUDA_HOME
${CUDA_HOME}/bin/nvcc --version# verify that nvcc is in your CUDA_HOME
```
```
### Unsupported OS build
### Unsupported OS build
...
@@ -231,6 +231,6 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
...
@@ -231,6 +231,6 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0`, `rocm/pytorch-nightly`.
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0`, `rocm/pytorch-nightly`.
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/)
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/)
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
```
```
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
Alternatively, wheels intended for vLLM use can be accessed under the releases.
Alternatively, wheels intended for vLLM use can be accessed under the releases.
For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
```{tip}
```{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
```
```
```{tip}
```{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
@@ -37,10 +37,10 @@ If you're observing the following error: `docker: Error response from daemon: Un
...
@@ -37,10 +37,10 @@ If you're observing the following error: `docker: Error response from daemon: Un
To verify that the Intel Gaudi software was correctly installed, run:
To verify that the Intel Gaudi software was correctly installed, run:
```console
```console
$hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
hl-smi #verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
apt list --installed | grep habana #verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
pip list | grep habana #verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$pip list | grep neural # verify that neural_compressor is installed
pip list | grep neural #verify that neural_compressor is installed
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to vLLM main repo. To install latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to vLLM main repo. To install latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
@@ -181,7 +181,7 @@ Bucketing allows us to reduce the number of required graphs significantly, but i
...
@@ -181,7 +181,7 @@ Bucketing allows us to reduce the number of required graphs significantly, but i
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
@@ -224,7 +224,7 @@ Bucketing is transparent to a client -- padding in sequence length dimension is
...
@@ -224,7 +224,7 @@ Bucketing is transparent to a client -- padding in sequence length dimension is
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:
```
```text
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
...
@@ -273,7 +273,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
...
@@ -273,7 +273,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
-[Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html).
-[Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html).
XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
```console
```console
$python -m vllm.entrypoints.openai.api_server \
python -m vllm.entrypoints.openai.api_server \
$ --model=facebook/opt-13b \
--model=facebook/opt-13b \
$ --dtype=bfloat16 \
--dtype=bfloat16 \
$ --device=xpu \
--device=xpu \
$ --max_model_len=1024 \
--max_model_len=1024 \
$ --distributed-executor-backend=ray \
--distributed-executor-backend=ray \
$ --pipeline-parallel-size=2 \
--pipeline-parallel-size=2 \
$ -tp=8
-tp=8
```
```
By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
@@ -19,17 +19,17 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
...
@@ -19,17 +19,17 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console
```console
$uv venv myenv --python 3.12 --seed
uv venv myenv --python 3.12 --seed
$source myenv/bin/activate
source myenv/bin/activate
$uv pip install vllm
uv pip install vllm
```
```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
```console
```console
$conda create -n myenv python=3.12 -y
conda create -n myenv python=3.12 -y
$conda activate myenv
conda activate myenv
$pip install vllm
pip install vllm
```
```
```{note}
```{note}
...
@@ -94,7 +94,7 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
...
@@ -94,7 +94,7 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console
```console
$vllm serve Qwen/Qwen2.5-1.5B-Instruct
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
```
```{note}
```{note}
...
@@ -105,7 +105,7 @@ You can learn about overriding it [here](#chat-template).
...
@@ -105,7 +105,7 @@ You can learn about overriding it [here](#chat-template).
This server can be queried in the same format as OpenAI API. For example, to list the models:
This server can be queried in the same format as OpenAI API. For example, to list the models:
```console
```console
$curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/models
```
```
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
...
@@ -115,14 +115,14 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
...
@@ -115,14 +115,14 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
Once your server is started, you can query the model with input prompts:
Once your server is started, you can query the model with input prompts:
```console
```console
$curl http://localhost:8000/v1/completions \
curl http://localhost:8000/v1/completions \
$ -H"Content-Type: application/json"\
-H "Content-Type: application/json" \
$ -d'{
-d '{
$"model": "Qwen/Qwen2.5-1.5B-Instruct",
"model": "Qwen/Qwen2.5-1.5B-Instruct",
$"prompt": "San Francisco is a",
"prompt": "San Francisco is a",
$"max_tokens": 7,
"max_tokens": 7,
$"temperature": 0
"temperature": 0
$}'
}'
```
```
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
...
@@ -151,15 +151,15 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
...
@@ -151,15 +151,15 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
```console
```console
$curl http://localhost:8000/v1/chat/completions \
curl http://localhost:8000/v1/chat/completions \
$ -H"Content-Type: application/json"\
-H "Content-Type: application/json" \
$ -d'{
-d '{
$"model": "Qwen/Qwen2.5-1.5B-Instruct",
"model": "Qwen/Qwen2.5-1.5B-Instruct",
$"messages": [
"messages": [
${"role": "system", "content": "You are a helpful assistant."},
{"role": "system", "content": "You are a helpful assistant."},
${"role": "user", "content": "Who won the world series in 2020?"}
{"role": "user", "content": "Who won the world series in 2020?"}
$]
]
$}'
}'
```
```
Alternatively, you can use the `openai` Python package:
Alternatively, you can use the `openai` Python package:
@@ -48,6 +48,7 @@ If vLLM crashes and the error trace captures it somewhere around `self.graph.rep
...
@@ -48,6 +48,7 @@ If vLLM crashes and the error trace captures it somewhere around `self.graph.rep
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
(troubleshooting-incorrect-hardware-driver)=
(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
@@ -45,7 +45,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
...
@@ -45,7 +45,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
```shell
```shell
$ export VLLM_USE_MODELSCOPE=True
export VLLM_USE_MODELSCOPE=True
```
```
And use with `trust_remote_code=True`.
And use with `trust_remote_code=True`.
...
@@ -820,19 +820,22 @@ The following table lists those that are tested in vLLM.
...
@@ -820,19 +820,22 @@ The following table lists those that are tested in vLLM.
_________________
_________________
# Model Support Policy
## Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
1.**Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
1.**Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
2.**Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
2.**Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
```{tip}
```{tip}
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
```
```
3.**Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
3.**Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
4.**Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
4.**Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
5.**Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
5.**Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
...
@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:
available again. When this occurs, the following warning is printed:
```
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a")
...
@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a")
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
```console
```console
$vllm serve facebook/opt-13b \
vllm serve facebook/opt-13b \
$--tensor-parallel-size 4
--tensor-parallel-size 4
```
```
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console
```console
$vllm serve gpt2 \
vllm serve gpt2 \
$--tensor-parallel-size 4 \
--tensor-parallel-size 4 \
$--pipeline-parallel-size 2
--pipeline-parallel-size 2
```
```
## Running vLLM on multiple nodes
## Running vLLM on multiple nodes
...
@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have
...
@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have
Pick a node as the head node, and run the following command:
Pick a node as the head node, and run the following command:
```console
```console
$bash run_cluster.sh \
bash run_cluster.sh \
$ vllm/vllm-openai \
vllm/vllm-openai \
$ ip_of_head_node \
ip_of_head_node \
$ --head\
--head \
$ /path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node
```
```
On the rest of the worker nodes, run the following command:
On the rest of the worker nodes, run the following command:
```console
```console
$bash run_cluster.sh \
bash run_cluster.sh \
$ vllm/vllm-openai \
vllm/vllm-openai \
$ ip_of_head_node \
ip_of_head_node \
$ --worker\
--worker \
$ /path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node
```
```
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
...
@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
...
@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console
```console
$vllm serve /path/to/the/model/in/the/container \
vllm serve /path/to/the/model/in/the/container \
$--tensor-parallel-size 8 \
--tensor-parallel-size 8 \
$--pipeline-parallel-size 2
--pipeline-parallel-size 2
```
```
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
```console
```console
$vllm serve /path/to/the/model/in/the/container \
vllm serve /path/to/the/model/in/the/container \
$--tensor-parallel-size 16
--tensor-parallel-size 16
```
```
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.