Commit 43f3d9e6 authored by Rafael Vasquez, committed by GitHub

[CI/Build] Add markdown linter (#11857)


Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
parent b25cfab9
...@@ -55,21 +55,24 @@ print(f"Result: {get_weather(**json.loads(tool_call.arguments))}") ...@@ -55,21 +55,24 @@ print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")
``` ```
Example output: Example output:
```
```text
Function called: get_weather Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"} Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit... Result: Getting the weather for San Francisco, CA in fahrenheit...
``` ```
This example demonstrates: This example demonstrates:
- Setting up the server with tool calling enabled
- Defining an actual function to handle tool calls * Setting up the server with tool calling enabled
- Making a request with `tool_choice="auto"` * Defining an actual function to handle tool calls
- Handling the structured response and executing the corresponding function * Making a request with `tool_choice="auto"`
* Handling the structured response and executing the corresponding function
You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests. You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.
Remember that it's the caller's responsibility to: Remember that it's the caller's responsibility to:
1. Define appropriate tools in the request 1. Define appropriate tools in the request
2. Include relevant context in the chat messages 2. Include relevant context in the chat messages
3. Handle the tool calls in your application logic 3. Handle the tool calls in your application logic
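
The third responsibility, handling tool calls in application logic, can be sketched as a small dispatcher. This is a minimal illustration, not part of the commit: `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical names, and `get_weather` is a stand-in that mirrors the example output shown earlier.

```python
import json


def get_weather(location: str, unit: str) -> str:
    """Stand-in tool implementation that mirrors the example output."""
    return f"Getting the weather for {location} in {unit}..."


# Hypothetical registry mapping tool names to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name: str, arguments: str) -> str:
    """Look up a tool by name and call it with JSON-encoded arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool requested by the model: {name}")
    try:
        kwargs = json.loads(arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Tool arguments are not valid JSON: {arguments!r}") from exc
    return TOOL_REGISTRY[name](**kwargs)


result = dispatch_tool_call(
    "get_weather",
    '{"location": "San Francisco, CA", "unit": "fahrenheit"}',
)
print(f"Result: {result}")
```

Rejecting unknown tool names and malformed JSON explicitly keeps a bad generation from turning into an opaque `TypeError` deeper in the application.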
...@@ -77,20 +80,21 @@ Remember that it's the callers responsibility to: ...@@ -77,20 +80,21 @@ Remember that it's the callers responsibility to:
For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below. For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.
## Named Function Calling ## Named Function Calling
vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one. high-quality one.
vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter. vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
For best results, we recommend ensuring that the expected output format / schema is specified in the prompt to ensure that the model's intended generation is aligned with the schema that it's being forced to generate by the guided decoding backend. For best results, we recommend ensuring that the expected output format / schema is specified in the prompt to ensure that the model's intended generation is aligned with the schema that it's being forced to generate by the guided decoding backend.
To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request. specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
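
As a rough sketch of the request shape described above, the helper below assembles a chat completion payload that names one tool in `tool_choice`. The helper name `build_named_tool_request`, the model string `my-model`, and the `get_weather` schema are illustrative assumptions, not taken from the commit.

```python
import json


def build_named_tool_request(model: str, user_message: str,
                             tool_name: str, tools: list) -> dict:
    """Build a chat completion payload that forces one named tool."""
    known_names = [tool["function"]["name"] for tool in tools]
    if tool_name not in known_names:
        raise ValueError(f"{tool_name!r} is not defined in tools: {known_names}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        # Named function calling: select exactly one tool by name.
        "tool_choice": {"type": "function", "function": {"name": tool_name}},
    }


# Illustrative tool definition using a JSON schema for the parameters.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
}

payload = build_named_tool_request(
    "my-model",  # placeholder model name
    "What's the weather like in San Francisco?",
    "get_weather",
    [weather_tool],
)
print(json.dumps(payload["tool_choice"]))
```

Checking the name against `tools` before sending catches a mismatch on the client side rather than relying on the server to report it.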
## Automatic Function Calling ## Automatic Function Calling
To enable this feature, you should set the following flags: To enable this feature, you should set the following flags:
* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. Tells vLLM that you want to enable the model to generate its own tool calls when it * `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. Tells vLLM that you want to enable the model to generate its own tool calls when it
deems appropriate. deems appropriate.
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers * `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
...@@ -104,28 +108,28 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso ...@@ -104,28 +108,28 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso
If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template! If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!
### Hermes Models (`hermes`) ### Hermes Models (`hermes`)
All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported. All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.
* `NousResearch/Hermes-2-Pro-*` * `NousResearch/Hermes-2-Pro-*`
* `NousResearch/Hermes-2-Theta-*` * `NousResearch/Hermes-2-Theta-*`
* `NousResearch/Hermes-3-*` * `NousResearch/Hermes-3-*`
_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge _Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
step in their creation_. step in their creation_.
Flags: `--tool-call-parser hermes` Flags: `--tool-call-parser hermes`
### Mistral Models (`mistral`) ### Mistral Models (`mistral`)
Supported models: Supported models:
* `mistralai/Mistral-7B-Instruct-v0.3` (confirmed) * `mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
* Additional Mistral function-calling models are compatible as well. * Additional Mistral function-calling models are compatible as well.
Known issues: Known issues:
1. Mistral 7B struggles to generate parallel tool calls correctly. 1. Mistral 7B struggles to generate parallel tool calls correctly.
2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is 2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
much shorter than what vLLM generates. Since an exception is thrown when this condition much shorter than what vLLM generates. Since an exception is thrown when this condition
...@@ -136,13 +140,12 @@ it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated ...@@ -136,13 +140,12 @@ it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated
* `examples/tool_chat_template_mistral_parallel.jinja` - this is a "better" version that adds a tool-use system prompt * `examples/tool_chat_template_mistral_parallel.jinja` - this is a "better" version that adds a tool-use system prompt
when tools are provided, that results in much better reliability when working with parallel tool calling. when tools are provided, that results in much better reliability when working with parallel tool calling.
Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja` Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`
### Llama Models (`llama3_json`) ### Llama Models (`llama3_json`)
Supported models: Supported models:
* `meta-llama/Meta-Llama-3.1-8B-Instruct` * `meta-llama/Meta-Llama-3.1-8B-Instruct`
* `meta-llama/Meta-Llama-3.1-70B-Instruct` * `meta-llama/Meta-Llama-3.1-70B-Instruct`
* `meta-llama/Meta-Llama-3.1-405B-Instruct` * `meta-llama/Meta-Llama-3.1-405B-Instruct`
...@@ -152,6 +155,7 @@ The tool calling that is supported is the [JSON based tool calling](https://llam ...@@ -152,6 +155,7 @@ The tool calling that is supported is the [JSON based tool calling](https://llam
Other tool calling formats, such as the built-in Python tool calling or custom tool calling, are not supported. Other tool calling formats, such as the built-in Python tool calling or custom tool calling, are not supported.
Known issues: Known issues:
1. Parallel tool calls are not supported. 1. Parallel tool calls are not supported.
2. The model can generate parameters with a wrong format, such as generating 2. The model can generate parameters with a wrong format, such as generating
an array serialized as a string instead of an array. an array serialized as a string instead of an array.
...@@ -164,6 +168,7 @@ Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool ...@@ -164,6 +168,7 @@ Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool
#### IBM Granite #### IBM Granite
Supported models: Supported models:
* `ibm-granite/granite-3.0-8b-instruct` * `ibm-granite/granite-3.0-8b-instruct`
Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja` Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
...@@ -182,42 +187,45 @@ Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/t ...@@ -182,42 +187,45 @@ Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/t
`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Hugging Face, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported. `examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Hugging Face, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`) ### InternLM Models (`internlm`)
Supported models: Supported models:
* `internlm/internlm2_5-7b-chat` (confirmed) * `internlm/internlm2_5-7b-chat` (confirmed)
* Additional internlm2.5 function-calling models are compatible as well * Additional internlm2.5 function-calling models are compatible as well
Known issues: Known issues:
* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model. * Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.
Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja` Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja`
### Jamba Models (`jamba`) ### Jamba Models (`jamba`)
AI21's Jamba-1.5 models are supported. AI21's Jamba-1.5 models are supported.
* `ai21labs/AI21-Jamba-1.5-Mini` * `ai21labs/AI21-Jamba-1.5-Mini`
* `ai21labs/AI21-Jamba-1.5-Large` * `ai21labs/AI21-Jamba-1.5-Large`
Flags: `--tool-call-parser jamba` Flags: `--tool-call-parser jamba`
### Models with Pythonic Tool Calls (`pythonic`) ### Models with Pythonic Tool Calls (`pythonic`)
A growing number of models output a Python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models. A growing number of models output a Python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
As a concrete example, these models may look up the weather in San Francisco and Seattle by generating: As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:
```python ```python
[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')] [get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
``` ```
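
To make the list format above concrete, here is a minimal sketch that turns such a string into name/argument pairs with the standard `ast` module. It is an illustration only and not vLLM's `pythonic` parser; the function name `parse_pythonic_tool_calls` is hypothetical, and the sketch handles only keyword arguments with literal values.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list:
    """Parse "[f(a=1), g(b='x')]" into a list of name/arguments dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("each list element must be a simple function call")
        if node.args:
            raise ValueError("positional arguments are not handled in this sketch")
        arguments = {}
        for keyword in node.keywords:
            if keyword.arg is None:
                raise ValueError("**kwargs expansion is not handled in this sketch")
            # literal_eval never executes code; it only accepts literals.
            arguments[keyword.arg] = ast.literal_eval(keyword.value)
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


generated = (
    "[get_weather(city='San Francisco', metric='celsius'), "
    "get_weather(city='Seattle', metric='celsius')]"
)
for call in parse_pythonic_tool_calls(generated):
    print(call["name"], call["arguments"])
```

Because the whole output is one list, parallel tool calls come out as multiple list elements without any extra delimiter convention.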
Limitations: Limitations:
* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.) * The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
* Llama's smaller models struggle to use tools effectively. * Llama's smaller models struggle to use tools effectively.
Example supported models: Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`) * `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`) * `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`) * `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
...@@ -231,7 +239,6 @@ Llama's smaller models frequently fail to emit tool calls in the correct format. ...@@ -231,7 +239,6 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.
--- ---
## How to write a tool parser plugin ## How to write a tool parser plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py. A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
...@@ -284,7 +291,8 @@ class ExampleToolParser(ToolParser): ...@@ -284,7 +291,8 @@ class ExampleToolParser(ToolParser):
``` ```
Then you can use this plugin in the command line like this. Then you can use this plugin in the command line like this.
```
```console
--enable-auto-tool-choice \ --enable-auto-tool-choice \
--tool-parser-plugin <absolute path of the plugin file> \ --tool-parser-plugin <absolute path of the plugin file> \
--tool-call-parser example \ --tool-call-parser example \
......
...@@ -30,7 +30,7 @@ changes in batch size, or batch expansion in speculative decoding. These batchin ...@@ -30,7 +30,7 @@ changes in batch size, or batch expansion in speculative decoding. These batchin
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely. different tokens being sampled. Once a different token is sampled, further divergence is likely.
**Mitigation Strategies** ## Mitigation Strategies
- For improved stability and reduced variance, use `float32`. Note that this will require more memory. - For improved stability and reduced variance, use `float32`. Note that this will require more memory.
- If using `bfloat16`, switching to `float16` can also help. - If using `bfloat16`, switching to `float16` can also help.
......
...@@ -18,25 +18,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes. ...@@ -18,25 +18,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
After installing Xcode and the Command Line Tools, which include Apple Clang, run the following commands to build and install vLLM from source. After installing Xcode and the Command Line Tools, which include Apple Clang, run the following commands to build and install vLLM from source.
``` ```console
$ git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
$ cd vllm cd vllm
$ pip install -r requirements-cpu.txt pip install -r requirements-cpu.txt
$ pip install -e . pip install -e .
``` ```
```{note} ```{note}
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device. On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
``` ```
## Troubleshooting ## Troubleshooting
If the build fails with an error like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your If the build fails with an error like the following snippet, where standard C++ headers cannot be found, try removing and reinstalling your
[Command Line Tools for Xcode](https://developer.apple.com/download/all/). [Command Line Tools for Xcode](https://developer.apple.com/download/all/).
``` ```text
[...] fatal error: 'map' file not found [...] fatal error: 'map' file not found
1 | #include <map> 1 | #include <map>
| ^~~~~ | ^~~~~
...@@ -48,4 +46,3 @@ If the build has error like the following snippet where standard C++ headers can ...@@ -48,4 +46,3 @@ If the build has error like the following snippet where standard C++ headers can
| ^~~~~~~~~ | ^~~~~~~~~
1 error generated. 1 error generated.
``` ```
...@@ -32,13 +32,13 @@ Table of contents: ...@@ -32,13 +32,13 @@ Table of contents:
## Quick start using Dockerfile ## Quick start using Dockerfile
```console ```console
$ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g . docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
$ docker run -it \ docker run -it \
--rm \ --rm \
--network=host \ --network=host \
--cpuset-cpus=<cpu-id-list, optional> \ --cpuset-cpus=<cpu-id-list, optional> \
--cpuset-mems=<memory-node, optional> \ --cpuset-mems=<memory-node, optional> \
vllm-cpu-env vllm-cpu-env
``` ```
(build-cpu-backend-from-source)= (build-cpu-backend-from-source)=
...@@ -48,23 +48,23 @@ $ docker run -it \ ...@@ -48,23 +48,23 @@ $ docker run -it \
- First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run: - First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
```console ```console
$ sudo apt-get update -y sudo apt-get update -y
$ sudo apt-get install -y gcc-12 g++-12 libnuma-dev sudo apt-get install -y gcc-12 g++-12 libnuma-dev
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
``` ```
- Second, install Python packages for vLLM CPU backend building: - Second, install Python packages for vLLM CPU backend building:
```console ```console
$ pip install --upgrade pip pip install --upgrade pip
$ pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
$ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
``` ```
- Finally, build and install vLLM CPU backend: - Finally, build and install vLLM CPU backend:
```console ```console
$ VLLM_TARGET_DEVICE=cpu python setup.py install VLLM_TARGET_DEVICE=cpu python setup.py install
``` ```
```{note} ```{note}
...@@ -92,18 +92,18 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install ...@@ -92,18 +92,18 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install
- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run: - We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:
```console ```console
$ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
$ find / -name "*libtcmalloc*" # find the dynamic link library path find / -name "*libtcmalloc*" # find the dynamic link library path
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
$ python examples/offline_inference/basic.py # run vLLM python examples/offline_inference/basic.py # run vLLM
``` ```
- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP: - When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:
```console ```console
$ export VLLM_CPU_KVCACHE_SPACE=40 export VLLM_CPU_KVCACHE_SPACE=40
$ export VLLM_CPU_OMP_THREADS_BIND=0-29 export VLLM_CPU_OMP_THREADS_BIND=0-29
$ vllm serve facebook/opt-125m vllm serve facebook/opt-125m
``` ```
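
The arithmetic behind the `0-29` value above can be written as a tiny helper. This is a sketch under two assumptions that are not stated in the source: physical cores are numbered contiguously from 0, and the reserved cores are the highest-numbered ones. The function name `omp_bind_range` is hypothetical.

```python
def omp_bind_range(physical_cores: int, reserved: int = 2) -> str:
    """Return a "first-last" core range that leaves `reserved` cores free."""
    if reserved < 0:
        raise ValueError("reserved must not be negative")
    if reserved >= physical_cores:
        raise ValueError("at least one core must remain for OpenMP threads")
    last_core = physical_cores - reserved - 1
    return f"0-{last_core}"


# 32 physical cores, 2 reserved for the serving framework.
print(f"VLLM_CPU_OMP_THREADS_BIND={omp_bind_range(32, reserved=2)}")
```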
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores: - If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
...@@ -148,7 +148,7 @@ $ python examples/offline_inference/basic.py ...@@ -148,7 +148,7 @@ $ python examples/offline_inference/basic.py
- Using Tensor Parallel for a latency-constrained deployment: following the GPU backend design, Megatron-LM's parallel algorithm is used to shard the model based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example script that enables Tensor Parallel = 2 for serving: - Using Tensor Parallel for a latency-constrained deployment: following the GPU backend design, Megatron-LM's parallel algorithm is used to shard the model based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inference. In general, each NUMA node is treated as one GPU card. Below is an example script that enables Tensor Parallel = 2 for serving:
```console ```console
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
``` ```
- Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer to dispatch requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. The Anyscale Ray project also provides LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is an example of setting up scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). - Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer to dispatch requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. The Anyscale Ray project also provides LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is an example of setting up scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
...@@ -17,9 +17,9 @@ vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) bin ...@@ -17,9 +17,9 @@ vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) bin
You can create a new Python environment using `conda`: You can create a new Python environment using `conda`:
```console ```console
$ # (Recommended) Create a new conda environment. # (Recommended) Create a new conda environment.
$ conda create -n myenv python=3.12 -y conda create -n myenv python=3.12 -y
$ conda activate myenv conda activate myenv
``` ```
```{note} ```{note}
...@@ -29,9 +29,9 @@ $ conda activate myenv ...@@ -29,9 +29,9 @@ $ conda activate myenv
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command: Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
```console ```console
$ # (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment. # (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
$ uv venv myenv --python 3.12 --seed uv venv myenv --python 3.12 --seed
$ source myenv/bin/activate source myenv/bin/activate
``` ```
In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations. In order to be performant, vLLM has to compile many CUDA kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
...@@ -43,18 +43,18 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I ...@@ -43,18 +43,18 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
You can install vLLM using either `pip` or `uv pip`: You can install vLLM using either `pip` or `uv pip`:
```console ```console
$ # Install vLLM with CUDA 12.1. # Install vLLM with CUDA 12.1.
$ pip install vllm # If you are using pip. pip install vllm # If you are using pip.
$ uv pip install vllm # If you are using uv. uv pip install vllm # If you are using uv.
``` ```
As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions: As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:
```console ```console
$ # Install vLLM with CUDA 11.8. # Install vLLM with CUDA 11.8.
$ export VLLM_VERSION=0.6.1.post1 export VLLM_VERSION=0.6.1.post1
$ export PYTHON_VERSION=310 export PYTHON_VERSION=310
$ pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
``` ```
(install-the-latest-code)= (install-the-latest-code)=
...@@ -66,7 +66,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe ...@@ -66,7 +66,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
### Install the latest code using `pip` ### Install the latest code using `pip`
```console ```console
$ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
``` ```
`--pre` is required for `pip` to consider pre-released versions. `--pre` is required for `pip` to consider pre-released versions.
...@@ -74,8 +74,8 @@ $ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly ...@@ -74,8 +74,8 @@ $ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), a limitation of `pip` means you have to specify the full URL of the wheel file, embedding the commit hash in the URL: If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), a limitation of `pip` means you have to specify the full URL of the wheel file, embedding the commit hash in the URL:
```console ```console
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
$ pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
``` ```
Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before. Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
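
To illustrate how the tags in the wheel file name encode this compatibility, the sketch below splits a name of the form `name-version-python-abi-platform.whl` and checks an interpreter version against a `cp38`/`abi3` pair. It is a simplified illustration with hypothetical helper names; real installers apply the full tag-matching rules, which this sketch does not attempt.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel file name: {filename}")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def abi3_supports(python_tag: str, abi_tag: str, interpreter: tuple) -> bool:
    """Check whether a CPython (major, minor) can load a cpXY-abi3 wheel."""
    if abi_tag != "abi3" or not python_tag.startswith("cp"):
        raise ValueError("this sketch only handles CPython abi3 wheels")
    minimum = (int(python_tag[2]), int(python_tag[3:]))
    # abi3 wheels work on the minimum version and later minors of that major.
    return interpreter[0] == minimum[0] and interpreter >= minimum


tags = parse_wheel_filename("vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl")
print(tags["python_tag"], tags["abi_tag"], tags["platform_tag"])
print(abi3_supports(tags["python_tag"], tags["abi_tag"], (3, 12)))
```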
@@ -85,14 +85,14 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p

Another way to install the latest code is to use `uv`:

```console
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```

If you want to access the wheels for previous commits (e.g. to bisect a behavior change or a performance regression), you can specify the commit hash in the URL:

```console
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
```

The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
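The difference between the two resolution strategies can be sketched with a toy model. This is not how either tool is implemented: the function names, index contents and version tuples below are made up for illustration, and real tools compare versions according to PEP 440 rather than as plain tuples.

```python
# Toy model of the two strategies described above (all data is illustrative).
from typing import Optional

Version = tuple[int, ...]
Index = dict[str, list[Version]]


def pick_pip_style(package: str, indexes: list[Index]) -> Optional[Version]:
    """Pool candidates from every index and take the highest version."""
    candidates = [v for index in indexes for v in index.get(package, [])]
    return max(candidates) if candidates else None


def pick_uv_style(package: str, extra_indexes: list[Index],
                  default_index: Index) -> Optional[Version]:
    """Use the first index that has the package, extra indexes first."""
    for index in [*extra_indexes, default_index]:
        versions = index.get(package)
        if versions:
            return max(versions)
    return None


default_index = {"vllm": [(0, 6, 5), (0, 6, 6)]}  # stand-in for released versions
commit_index = {"vllm": [(0, 6, 5, 99)]}          # stand-in for an older commit build

print(pick_pip_style("vllm", [commit_index, default_index]))  # (0, 6, 6)
print(pick_uv_style("vllm", [commit_index], default_index))   # (0, 6, 5, 99)
```

In the pooled strategy the release always wins because it has the highest version, while the first-match strategy returns the commit build from the extra index.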
@@ -102,8 +102,8 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb

Another way to access the latest code is to use the docker images:

```console
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
```

These docker images are used for CI and testing only and are not intended for production use. They expire after several days.
@@ -121,18 +121,18 @@ The latest code can contain bugs and may not be stable. Please use it with cauti

If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:

```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
```

This will download the [latest nightly wheel](https://wheels.vllm.ai/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl) and use the compiled libraries from there in the installation.

The `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable can be used instead of `VLLM_USE_PRECOMPILED` to specify a custom path or URL to the wheel file. For example, to use the [0.6.3.post1 PyPI wheel](https://pypi.org/project/vllm/#files):

```console
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/4a/4c/ee65ba33467a4c0de350ce29fbae39b9d0e7fcd887cc756fa993654d1228/vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
```

You can find more information about vLLM's wheels [above](#install-the-latest-code).
@@ -147,9 +147,9 @@ It is recommended to use the same commit ID for the source code as the vLLM whee

If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:

```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

```{tip}
@@ -172,11 +172,11 @@ There are scenarios where the PyTorch dependency cannot be easily installed via

To build vLLM using an existing PyTorch installation:

```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -e . --no-build-isolation
```

#### Use the local cutlass for compilation
@@ -185,9 +185,9 @@ Currently, before starting the build process, vLLM fetches cutlass code from Git

To achieve this, set the environment variable `VLLM_CUTLASS_SRC_DIR` to point to your local cutlass directory.

```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
```

#### Troubleshooting
@@ -196,8 +196,8 @@ To avoid your system being overloaded, you can limit the number of compilation j
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:

```console
export MAX_JOBS=6
pip install -e .
```

This is especially useful when you are building on less powerful machines. For example, WSL [assigns only 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory.
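One way to choose a value is to cap the job count by both CPU cores and available memory. The sketch below is only a heuristic: the helper name `suggest_max_jobs` is hypothetical, and the default of 4 GiB per compilation job is an assumption, not a figure from vLLM.

```python
import os


def suggest_max_jobs(cpu_count: int, memory_gib: float, gib_per_job: float = 4.0) -> int:
    """Suggest a MAX_JOBS value (hypothetical heuristic).

    gib_per_job is an assumed memory budget per compilation job; tune it for
    your machine. The result is never lower than 1.
    """
    by_memory = int(memory_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


# Example: a machine with 16 GiB of memory visible to the build.
jobs = suggest_max_jobs(os.cpu_count() or 1, memory_gib=16)
print(f"export MAX_JOBS={jobs}")
```

On a memory-constrained setup such as the WSL case above, the memory term dominates and the suggestion drops toward `1`.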
@@ -206,22 +206,22 @@ A side effect is a much slower build process.

Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.

```console
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```

If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:

```console
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
```

Here is a sanity check to verify that the CUDA Toolkit is correctly installed:

```console
nvcc --version # verify that nvcc is in your PATH
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
```
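The same two checks can be scripted, for instance before kicking off a long build. This is a minimal sketch, assuming a standard toolkit layout with `nvcc` under `bin`; the helper name `check_cuda_toolkit` is hypothetical, and it only checks that `nvcc` can be found, not that it runs.

```python
import os
import shutil
from pathlib import Path


def check_cuda_toolkit(environ=os.environ, which=shutil.which) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    cuda_home = environ.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif not (Path(cuda_home) / "bin" / "nvcc").is_file():
        problems.append(f"no nvcc found under {cuda_home}/bin")
    if which("nvcc") is None:
        problems.append("nvcc is not on PATH")
    return problems


for problem in check_cuda_toolkit():
    print(problem)
```

The `environ` and `which` parameters exist only so the function can be exercised without a real CUDA installation.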
### Unsupported OS build
@@ -231,6 +231,6 @@ vLLM can fully run only on Linux but for development purposes, you can still bui

Simply set the `VLLM_TARGET_DEVICE` environment variable to `empty` before installing:

```console
export VLLM_TARGET_DEVICE=empty
pip install -e .
```
@@ -47,13 +47,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt

To build vLLM on ROCm 6.2 for MI200 and MI300 series, you can use the default:

```console
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
```

To build vLLM on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should specify `BUILD_FA` as below:

```console
DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
```

To run the above docker image `vllm-rocm`, use the below command:
@@ -83,81 +83,81 @@ Where the `<path/to/model>` is the location where the model is stored, for examp

- [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
- [PyTorch](https://pytorch.org/)

For installing PyTorch, you can start from a fresh docker image, e.g., `rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0` or `rocm/pytorch-nightly`.

Alternatively, you can install PyTorch using PyTorch wheels. See the installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/).

1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)

    Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)

    ```console
    python3 -m pip install ninja cmake wheel pybind11
    pip uninstall -y triton
    git clone https://github.com/OpenAI/triton.git
    cd triton
    git checkout e192dba
    cd python
    pip3 install .
    cd ../..
    ```

    ```{note}
    - If you see an HTTP error while downloading packages during the Triton build, please try again, as the error is intermittent.
    ```

2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)

    Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
    Alternatively, wheels intended for vLLM use can be accessed under the releases.

    For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.

    ```console
    git clone https://github.com/ROCm/flash-attention.git
    cd flash-attention
    git checkout 3cea2fb
    git submodule update --init
    GPU_ARCHS="gfx90a" python3 setup.py install
    cd ..
    ```

    ```{note}
    - You might need to downgrade `ninja` to version 1.10 if it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
    ```

3. Build vLLM. For example, vLLM on ROCm 6.2 can be built with the following steps:

    ```bash
    pip install --upgrade pip

    # Install PyTorch
    pip uninstall torch -y
    pip install --no-cache-dir --pre torch==2.6.0.dev20241024 --index-url https://download.pytorch.org/whl/nightly/rocm6.2

    # Build & install AMD SMI
    pip install /opt/rocm/share/amd_smi

    # Install dependencies
    pip install --upgrade numba scipy huggingface-hub[cli]
    pip install "numpy<2"
    pip install -r requirements-rocm.txt

    # Build vLLM for MI210/MI250/MI300.
    export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
    python3 setup.py develop
    ```

    This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

```{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, set `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off Triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
```

```{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
  For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
```
@@ -22,8 +22,8 @@ Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optim

### Quick start using Dockerfile

```console
docker build -f Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
```

```{tip}
@@ -37,10 +37,10 @@ If you're observing the following error: `docker: Error response from daemon: Un

To verify that the Intel Gaudi software was correctly installed, run:

```console
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
pip list | grep neural # verify that neural_compressor is installed
```
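The two `pip list` checks can also be automated. The following sketch scans `pip list` output for the Python packages named in the comments above; the helper names are hypothetical, and the sample text (including its version numbers) is a placeholder rather than real output from a Gaudi machine.

```python
import re

# Package names taken from the comments in the commands above.
REQUIRED = (
    "habana-torch-plugin",
    "habana-torch-dataloader",
    "habana-pyhlml",
    "habana-media-loader",
    "neural_compressor",
)


def _normalize(name: str) -> str:
    """Treat '-', '_' and '.' as equivalent and ignore case."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_packages(pip_list_output: str, required=REQUIRED) -> list[str]:
    """Return the required packages that do not appear in `pip list` output."""
    installed = set()
    for line in pip_list_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # "<name> <version>" rows; header rows are harmless
            installed.add(_normalize(parts[0]))
    return [name for name in required if _normalize(name) not in installed]


# Placeholder sample; versions are invented for illustration.
sample = """\
Package             Version
------------------- -------
habana-torch-plugin 0.0.0
habana_pyhlml       0.0.0
neural-compressor   0.0.0
"""
print(missing_packages(sample))
```

To run it against a real environment, pass in the captured output of `python -m pip list`, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.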
Refer to [Intel Gaudi Software Stack
@@ -57,8 +57,8 @@ for more details.

Use the following commands to run a Docker image:

```console
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```

#### Build and Install vLLM
@@ -66,18 +66,18 @@ $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_

To build and install vLLM from source, run:

```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
python setup.py develop
```

Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork), and we periodically upstream them to the vLLM main repo. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:

```console
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
python setup.py develop
```

## Supported Features
@@ -181,7 +181,7 @@ Bucketing allows us to reduce the number of required graphs significantly, but i

Bucketing ranges are determined by three parameters: `min`, `step` and `max`. They can be set separately for the prompt and decode phases, and for the batch size and sequence length dimensions. These parameters can be observed in the logs during vLLM startup:

```text
INFO 08-01 21:37:59 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-01 21:37:59 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
@@ -192,7 +192,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 1

Example (with ramp-up)

```text
min = 2, step = 32, max = 64
=> ramp_up = (2, 4, 8, 16)
=> stable = (32, 64)
@@ -201,7 +201,7 @@ min = 2, step = 32, max = 64

Example (without ramp-up)

```text
min = 128, step = 128, max = 512
=> ramp_up = ()
=> stable = (128, 256, 384, 512)
@@ -224,7 +224,7 @@ Bucketing is transparent to a client -- padding in sequence length dimension is

Warmup is an optional but highly recommended step that runs before the vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs so that no graph compilation overhead is incurred within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

```text
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
@@ -273,7 +273,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi

Each of the steps described above is logged by the vLLM server as follows (negative values correspond to memory being released):

```text
INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
...@@ -349,19 +349,19 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi ...@@ -349,19 +349,19 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- Default values: - Default values:
- Prompt: - Prompt:
: - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1` - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
- batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`): `min(max_num_seqs, 32)` - batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
- batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `min(max_num_seqs, 64)` - batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `min(max_num_seqs, 64)`
- sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size` - sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size`
- sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size` - sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size`
- sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len` - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
- Decode: - Decode:
: - batch size min (`VLLM_DECODE_BS_BUCKET_MIN`): `1` - batch size min (`VLLM_DECODE_BS_BUCKET_MIN`): `1`
- batch size step (`VLLM_DECODE_BS_BUCKET_STEP`): `min(max_num_seqs, 32)` - batch size step (`VLLM_DECODE_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
- batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs` - batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs`
- sequence length min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size` - sequence length min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
- sequence length step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size` - sequence length step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
- sequence length max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)` - sequence length max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)`
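As a rough illustration of how a `(min, step, max)` triple could expand into the prompt bucket list shown in the log lines above, the sketch below assumes a ramp-up rule: double from `min` until `step` is reached, then advance linearly by `step`. The helper name `warmup_range` and the sample settings (`max_num_seqs=4`, `block_size=128`, `max_model_len=1024`) are assumptions chosen to reproduce the 24 logged prompt buckets; the actual logic in `hpu_model_runner.py` may differ.

```python
from itertools import product


def warmup_range(bucket_min: int, bucket_step: int, bucket_max: int) -> list[int]:
    """Expand (min, step, max) into bucket boundaries.

    Assumed rule: double from `bucket_min` while below `bucket_step`,
    then advance linearly by `bucket_step` up to `bucket_max`.
    """
    if bucket_min <= 0 or bucket_step <= 0 or bucket_min > bucket_step:
        raise ValueError("expected 0 < min <= step")
    values = []
    value = bucket_min
    while value < bucket_step and value <= bucket_max:  # ramp-up phase
        values.append(value)
        value *= 2
    value = bucket_step
    while value <= bucket_max:  # linear phase
        values.append(value)
        value += bucket_step
    return values


# Hypothetical settings, picked to match the logged prompt buckets.
max_num_seqs, block_size, max_model_len = 4, 128, 1024

# Defaults from the list above: step = min(max_num_seqs, 32), max = min(max_num_seqs, 64).
batch_sizes = warmup_range(1, min(max_num_seqs, 32), min(max_num_seqs, 64))
seq_lens = warmup_range(block_size, block_size, max_model_len)
prompt_buckets = list(product(batch_sizes, seq_lens))

print(len(prompt_buckets))  # 24
```

With these inputs the batch sizes come out as `[1, 2, 4]` and the sequence lengths as multiples of 128 up to 1024, which lines up with the `(batch size, sequence length)` pairs in the log.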
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution: Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
......
...@@ -123,10 +123,10 @@ python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torch ...@@ -123,10 +123,10 @@ python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torch
Once neuronx-cc and transformers-neuronx packages are installed, we will be able to install vllm as follows: Once neuronx-cc and transformers-neuronx packages are installed, we will be able to install vllm as follows:
```console ```console
$ git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
$ cd vllm cd vllm
$ pip install -U -r requirements-neuron.txt pip install -U -r requirements-neuron.txt
$ VLLM_TARGET_DEVICE="neuron" pip install . VLLM_TARGET_DEVICE="neuron" pip install .
``` ```
If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed. If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed.
...@@ -27,8 +27,8 @@ vLLM powered by OpenVINO supports all LLM models from [vLLM supported models lis ...@@ -27,8 +27,8 @@ vLLM powered by OpenVINO supports all LLM models from [vLLM supported models lis
## Quick start using Dockerfile ## Quick start using Dockerfile
```console ```console
$ docker build -f Dockerfile.openvino -t vllm-openvino-env . docker build -f Dockerfile.openvino -t vllm-openvino-env .
$ docker run -it --rm vllm-openvino-env docker run -it --rm vllm-openvino-env
``` ```
(install-openvino-backend-from-source)= (install-openvino-backend-from-source)=
...@@ -38,21 +38,21 @@ $ docker run -it --rm vllm-openvino-env ...@@ -38,21 +38,21 @@ $ docker run -it --rm vllm-openvino-env
- First, install Python. For example, on Ubuntu 22.04, you can run: - First, install Python. For example, on Ubuntu 22.04, you can run:
```console ```console
$ sudo apt-get update -y sudo apt-get update -y
$ sudo apt-get install python3 sudo apt-get install python3
``` ```
- Second, install the prerequisites for the vLLM OpenVINO backend installation: - Second, install the prerequisites for the vLLM OpenVINO backend installation:
```console ```console
$ pip install --upgrade pip pip install --upgrade pip
$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
``` ```
- Finally, install vLLM with OpenVINO backend: - Finally, install vLLM with OpenVINO backend:
```console ```console
$ PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v . PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v .
``` ```
- [Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html). - [Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html).
......
...@@ -156,14 +156,14 @@ For more information about using TPUs with GKE, see ...@@ -156,14 +156,14 @@ For more information about using TPUs with GKE, see
You can use <gh-file:Dockerfile.tpu> to build a Docker image with TPU support. You can use <gh-file:Dockerfile.tpu> to build a Docker image with TPU support.
```console ```console
$ docker build -f Dockerfile.tpu -t vllm-tpu . docker build -f Dockerfile.tpu -t vllm-tpu .
``` ```
Run the Docker image with the following command: Run the Docker image with the following command:
```console ```console
$ # Make sure to add `--privileged --net host --shm-size=16G`. # Make sure to add `--privileged --net host --shm-size=16G`.
$ docker run --privileged --net host --shm-size=16G -it vllm-tpu docker run --privileged --net host --shm-size=16G -it vllm-tpu
``` ```
```{note} ```{note}
......
...@@ -40,15 +40,15 @@ $ docker run -it \ ...@@ -40,15 +40,15 @@ $ docker run -it \
- Second, install Python packages for vLLM XPU backend building: - Second, install Python packages for vLLM XPU backend building:
```console ```console
$ source /opt/intel/oneapi/setvars.sh source /opt/intel/oneapi/setvars.sh
$ pip install --upgrade pip pip install --upgrade pip
$ pip install -v -r requirements-xpu.txt pip install -v -r requirements-xpu.txt
``` ```
- Finally, build and install vLLM XPU backend: - Finally, build and install vLLM XPU backend:
```console ```console
$ VLLM_TARGET_DEVICE=xpu python setup.py install VLLM_TARGET_DEVICE=xpu python setup.py install
``` ```
```{note} ```{note}
...@@ -61,14 +61,14 @@ $ VLLM_TARGET_DEVICE=xpu python setup.py install ...@@ -61,14 +61,14 @@ $ VLLM_TARGET_DEVICE=xpu python setup.py install
XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following: XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following:
```console ```console
$ python -m vllm.entrypoints.openai.api_server \ python -m vllm.entrypoints.openai.api_server \
$ --model=facebook/opt-13b \ --model=facebook/opt-13b \
$ --dtype=bfloat16 \ --dtype=bfloat16 \
$ --device=xpu \ --device=xpu \
$ --max_model_len=1024 \ --max_model_len=1024 \
$ --distributed-executor-backend=ray \ --distributed-executor-backend=ray \
$ --pipeline-parallel-size=2 \ --pipeline-parallel-size=2 \
$ -tp=8 -tp=8
``` ```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script. By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
...@@ -19,17 +19,17 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/ ...@@ -19,17 +19,17 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands: It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console ```console
$ uv venv myenv --python 3.12 --seed uv venv myenv --python 3.12 --seed
$ source myenv/bin/activate source myenv/bin/activate
$ uv pip install vllm uv pip install vllm
``` ```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
```console ```console
$ conda create -n myenv python=3.12 -y conda create -n myenv python=3.12 -y
$ conda activate myenv conda activate myenv
$ pip install vllm pip install vllm
``` ```
```{note} ```{note}
...@@ -94,7 +94,7 @@ By default, it starts the server at `http://localhost:8000`. You can specify the ...@@ -94,7 +94,7 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model: Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console ```console
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct vllm serve Qwen/Qwen2.5-1.5B-Instruct
``` ```
```{note} ```{note}
...@@ -105,7 +105,7 @@ You can learn about overriding it [here](#chat-template). ...@@ -105,7 +105,7 @@ You can learn about overriding it [here](#chat-template).
This server can be queried in the same format as OpenAI API. For example, to list the models: This server can be queried in the same format as OpenAI API. For example, to list the models:
```console ```console
$ curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header. You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
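A minimal sketch of what sending the key might look like from a client, assuming the server expects it as a bearer token in the `Authorization` header (the convention used by OpenAI-compatible clients). The key value `token-abc123` and the helper name `build_models_request` are placeholders; the request is only constructed here, not sent.

```python
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for /v1/models that carries the API key."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        # Assumption: the key is passed as a bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
    )


request = build_models_request("http://localhost:8000", "token-abc123")
print(request.full_url)
print(request.get_header("Authorization"))
```

Once the server is running, passing `request` to `urllib.request.urlopen` would perform the call.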
...@@ -115,14 +115,14 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` ...@@ -115,14 +115,14 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
Once your server is started, you can query the model with input prompts: Once your server is started, you can query the model with input prompts:
```console ```console
$ curl http://localhost:8000/v1/completions \ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \ -H "Content-Type: application/json" \
$ -d '{ -d '{
$ "model": "Qwen/Qwen2.5-1.5B-Instruct", "model": "Qwen/Qwen2.5-1.5B-Instruct",
$ "prompt": "San Francisco is a", "prompt": "San Francisco is a",
$ "max_tokens": 7, "max_tokens": 7,
$ "temperature": 0 "temperature": 0
$ }' }'
``` ```
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package: Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
...@@ -151,15 +151,15 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter ...@@ -151,15 +151,15 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model: You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
```console ```console
$ curl http://localhost:8000/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
$ -H "Content-Type: application/json" \ -H "Content-Type: application/json" \
$ -d '{ -d '{
$ "model": "Qwen/Qwen2.5-1.5B-Instruct", "model": "Qwen/Qwen2.5-1.5B-Instruct",
$ "messages": [ "messages": [
$ {"role": "system", "content": "You are a helpful assistant."}, {"role": "system", "content": "You are a helpful assistant."},
$ {"role": "user", "content": "Who won the world series in 2020?"} {"role": "user", "content": "Who won the world series in 2020?"}
$ ] ]
$ }' }'
``` ```
Alternatively, you can use the `openai` Python package: Alternatively, you can use the `openai` Python package:
......
...@@ -48,6 +48,7 @@ If vLLM crashes and the error trace captures it somewhere around `self.graph.rep ...@@ -48,6 +48,7 @@ If vLLM crashes and the error trace captures it somewhere around `self.graph.rep
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error. To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
(troubleshooting-incorrect-hardware-driver)= (troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver ## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly. If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
...@@ -118,13 +119,13 @@ dist.destroy_process_group() ...@@ -118,13 +119,13 @@ dist.destroy_process_group()
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use: If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
```console ```console
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
``` ```
If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run: If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
```console ```console
$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
``` ```
If the script runs successfully, you should see the message `sanity check is successful!`. If the script runs successfully, you should see the message `sanity check is successful!`.
...@@ -141,6 +142,7 @@ Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup ...@@ -141,6 +142,7 @@ Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup
``` ```
(troubleshooting-python-multiprocessing)= (troubleshooting-python-multiprocessing)=
## Python multiprocessing ## Python multiprocessing
### `RuntimeError` Exception ### `RuntimeError` Exception
......
# Welcome to vLLM! # Welcome to vLLM
```{figure} ./assets/logos/vllm-logo-text-light.png ```{figure} ./assets/logos/vllm-logo-text-light.png
:align: center :align: center
...@@ -186,7 +186,7 @@ community/meetups ...@@ -186,7 +186,7 @@ community/meetups
community/sponsors community/sponsors
``` ```
# Indices and tables ## Indices and tables
- {ref}`genindex` - {ref}`genindex`
- {ref}`modindex` - {ref}`modindex`
...@@ -9,25 +9,25 @@ vLLM supports loading weights in Safetensors format using the Run:ai Model Strea ...@@ -9,25 +9,25 @@ vLLM supports loading weights in Safetensors format using the Run:ai Model Strea
You first need to install the vLLM RunAI optional dependency: You first need to install the vLLM RunAI optional dependency:
```console ```console
$ pip3 install vllm[runai] pip3 install vllm[runai]
``` ```
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag: To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
``` ```
To run a model from an AWS S3 object store, run: To run a model from an AWS S3 object store, run:
```console ```console
$ vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
``` ```
To run a model from an S3-compatible object store, run: To run a model from an S3-compatible object store, run:
```console ```console
$ RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
``` ```
## Tunable parameters ## Tunable parameters
...@@ -38,14 +38,14 @@ You can tune `concurrency` that controls the level of concurrency and number of ...@@ -38,14 +38,14 @@ You can tune `concurrency` that controls the level of concurrency and number of
For reading from S3, it will be the number of client instances the host is opening to the S3 server. For reading from S3, it will be the number of client instances the host is opening to the S3 server.
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}' vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
``` ```
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size. You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit). You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}' vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
``` ```
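The `memory_limit` value in the command above is a byte count; `5368709120` corresponds to 5 GiB. A small sketch for building the `--model-loader-extra-config` JSON from a size in GiB follows. The helper name `extra_config` is hypothetical, and only the `concurrency` and `memory_limit` keys shown in this section are used.

```python
import json
from typing import Optional

GIB = 1024 ** 3  # bytes per GiB


def extra_config(memory_limit_gib: Optional[int] = None,
                 concurrency: Optional[int] = None) -> str:
    """Serialize the loader options as compact JSON for the command line."""
    config = {}
    if concurrency is not None:
        config["concurrency"] = concurrency
    if memory_limit_gib is not None:
        config["memory_limit"] = memory_limit_gib * GIB  # convert GiB to bytes
    return json.dumps(config, separators=(",", ":"))


print(extra_config(memory_limit_gib=5))  # {"memory_limit":5368709120}
```

The resulting string can be passed in single quotes after `--model-loader-extra-config`, as in the commands above.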
```{note} ```{note}
......
...@@ -45,7 +45,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project ...@@ -45,7 +45,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable: To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
```shell ```shell
$ export VLLM_USE_MODELSCOPE=True export VLLM_USE_MODELSCOPE=True
``` ```
And use with `trust_remote_code=True`. And use with `trust_remote_code=True`.
...@@ -820,19 +820,22 @@ The following table lists those that are tested in vLLM. ...@@ -820,19 +820,22 @@ The following table lists those that are tested in vLLM.
_________________ _________________
# Model Support Policy ## Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support: At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated! 1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results. 2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
```{tip} ```{tip}
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs. When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
``` ```
3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback. 3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use. 4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement. 5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem. Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
......
...@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w ...@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed: available again. When this occurs, the following warning is printed:
``` ```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1 WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
``` ```
......
...@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a") ...@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a")
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs: To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
```console ```console
$ vllm serve facebook/opt-13b \ vllm serve facebook/opt-13b \
$ --tensor-parallel-size 4 --tensor-parallel-size 4
``` ```
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism: You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console ```console
$ vllm serve gpt2 \ vllm serve gpt2 \
$ --tensor-parallel-size 4 \ --tensor-parallel-size 4 \
$ --pipeline-parallel-size 2 --pipeline-parallel-size 2
``` ```
## Running vLLM on multiple nodes ## Running vLLM on multiple nodes
...@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have ...@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have
Pick a node as the head node, and run the following command: Pick a node as the head node, and run the following command:
```console ```console
$ bash run_cluster.sh \ bash run_cluster.sh \
$ vllm/vllm-openai \ vllm/vllm-openai \
$ ip_of_head_node \ ip_of_head_node \
$ --head \ --head \
$ /path/to/the/huggingface/home/in/this/node /path/to/the/huggingface/home/in/this/node
``` ```
On the rest of the worker nodes, run the following command: On the rest of the worker nodes, run the following command:
```console ```console
$ bash run_cluster.sh \ bash run_cluster.sh \
$ vllm/vllm-openai \ vllm/vllm-openai \
$ ip_of_head_node \ ip_of_head_node \
$ --worker \ --worker \
$ /path/to/the/huggingface/home/in/this/node /path/to/the/huggingface/home/in/this/node
``` ```
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct. Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
...@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container, ...@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2: After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console ```console
$ vllm serve /path/to/the/model/in/the/container \ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 8 \ --tensor-parallel-size 8 \
$ --pipeline-parallel-size 2 --pipeline-parallel-size 2
``` ```
You can also use tensor parallel without pipeline parallel: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16: You can also use tensor parallel without pipeline parallel: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
```console
-$ vllm serve /path/to/the/model/in/the/container \
-$     --tensor-parallel-size 16
+vllm serve /path/to/the/model/in/the/container \
+    --tensor-parallel-size 16
```
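The two layouts described above follow directly from the cluster shape. As a rough illustration (not part of this change), the arithmetic can be sketched in Python. The helper names `parallel_sizes` and `serve_args` are hypothetical; only the `--tensor-parallel-size` and `--pipeline-parallel-size` flags come from the text.

```python
def parallel_sizes(num_nodes: int, gpus_per_node: int,
                   pipeline: bool = True) -> tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size).

    Hypothetical helper: with pipeline=True it follows the common practice
    from the text (TP = GPUs per node, PP = number of nodes); otherwise it
    spreads tensor parallelism over every GPU in the cluster.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    if pipeline:
        return gpus_per_node, num_nodes
    return num_nodes * gpus_per_node, 1


def serve_args(model_path: str, tp: int, pp: int) -> list[str]:
    """Build the `vllm serve` argument list for the chosen sizes."""
    args = ["vllm", "serve", model_path, "--tensor-parallel-size", str(tp)]
    if pp > 1:
        # Only pass the pipeline flag when pipeline parallelism is in use.
        args += ["--pipeline-parallel-size", str(pp)]
    return args
```

For the 2-node, 8-GPU-per-node cluster in the text, `parallel_sizes(2, 8)` gives the first layout and `parallel_sizes(2, 8, pipeline=False)` gives the second.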
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
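When the trace output is long, scanning it by eye is tedious. A minimal sketch of automating the check described above, assuming the logs have already been captured as text lines: the two marker substrings are the ones quoted in the text, while the function names and labels are hypothetical.

```python
from collections import Counter

# Substrings quoted in the text above; the labels are illustrative only.
MARKERS = {
    "[send] via NET/IB/GDRDMA": "infiniband_gdrdma",
    "[send] via NET/Socket": "tcp_socket",
}


def count_transports(log_lines):
    """Count how many log lines mention each known transport marker."""
    counts = Counter()
    for line in log_lines:
        for marker, label in MARKERS.items():
            if marker in line:
                counts[label] += 1
    return counts


def summarize(counts):
    """Turn the counts into a one-line verdict."""
    if counts["tcp_socket"]:
        # Any TCP socket traffic is worth flagging for cross-node runs.
        return "raw TCP sockets in use: inefficient for cross-node tensor parallel"
    if counts["infiniband_gdrdma"]:
        return "Infiniband with GPU-Direct RDMA in use"
    return "no known transport marker found"
```

Both functions accept any iterable of strings, so an open file handle for a captured log can be passed to `count_transports` directly.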
......
@@ -7,7 +7,7 @@ vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain
To install LangChain, run
```console
-$ pip install langchain langchain_community -q
+pip install langchain langchain_community -q
```
To run inference on a single or multiple GPUs, use `VLLM` class from `langchain`.
......
@@ -7,7 +7,7 @@ vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index
To install LlamaIndex, run
```console
-$ pip install llama-index-llms-vllm -q
+pip install llama-index-llms-vllm -q
```
To run inference on a single or multiple GPUs, use `Vllm` class from `llamaindex`.
......