@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
...
@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
```bash
pip install llmcompressor
pip install llmcompressor
```
```
...
@@ -81,7 +81,7 @@ Since simple RTN does not require data for weight quantization and the activatio
...
@@ -81,7 +81,7 @@ Since simple RTN does not require data for weight quantization and the activatio
Install `vllm` and `lm-evaluation-harness` for evaluation:
Install `vllm` and `lm-evaluation-harness` for evaluation:
```console
```bash
pip install vllm lm-eval==0.4.4
pip install vllm lm-eval==0.4.4
```
```
...
@@ -99,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
...
@@ -99,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
!!! note
!!! note
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
```console
```bash
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
@@ -21,7 +21,7 @@ for more details on this and other advanced features.
...
@@ -21,7 +21,7 @@ for more details on this and other advanced features.
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
```console
```bash
pip install-U gptqmodel --no-build-isolation-v
pip install-U gptqmodel --no-build-isolation-v
```
```
...
@@ -60,7 +60,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
...
@@ -60,7 +60,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
@@ -26,7 +26,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
...
@@ -26,7 +26,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running
- Once inside your instance, activate the pre-installed virtual environment for inference by running
find / -name*libtcmalloc*# find the dynamic link library path
find / -name*libtcmalloc*# find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD# prepend the library to LD_PRELOAD
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD# prepend the library to LD_PRELOAD
...
@@ -132,7 +132,7 @@ python examples/offline_inference/basic/basic.py # run vLLM
...
@@ -132,7 +132,7 @@ python examples/offline_inference/basic/basic.py # run vLLM
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
```console
```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
vllm serve facebook/opt-125m
...
@@ -140,7 +140,7 @@ vllm serve facebook/opt-125m
...
@@ -140,7 +140,7 @@ vllm serve facebook/opt-125m
or using default auto thread binding:
or using default auto thread binding:
```console
```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
vllm serve facebook/opt-125m
vllm serve facebook/opt-125m
...
@@ -189,7 +189,7 @@ vllm serve facebook/opt-125m
...
@@ -189,7 +189,7 @@ vllm serve facebook/opt-125m
- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
@@ -25,7 +25,7 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
...
@@ -25,7 +25,7 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
@@ -37,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in
...
@@ -37,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
```console
```bash
# Install vLLM with CUDA 11.8.
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1
export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=312
export PYTHON_VERSION=312
...
@@ -52,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
...
@@ -52,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
##### Install the latest code using `pip`
##### Install the latest code using `pip`
```console
```bash
pip install-U vllm \
pip install-U vllm \
--pre\
--pre\
--extra-index-url https://wheels.vllm.ai/nightly
--extra-index-url https://wheels.vllm.ai/nightly
...
@@ -62,7 +62,7 @@ pip install -U vllm \
...
@@ -62,7 +62,7 @@ pip install -U vllm \
Another way to install the latest code is to use `uv`:
Another way to install the latest code is to use `uv`:
```console
```bash
uv pip install-U vllm \
uv pip install-U vllm \
--torch-backend=auto \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
--extra-index-url https://wheels.vllm.ai/nightly
...
@@ -72,7 +72,7 @@ uv pip install -U vllm \
...
@@ -72,7 +72,7 @@ uv pip install -U vllm \
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
```console
```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
@@ -83,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p
...
@@ -83,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
```console
```bash
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm \
uv pip install vllm \
--torch-backend=auto \
--torch-backend=auto \
...
@@ -99,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
...
@@ -99,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
@@ -118,7 +118,7 @@ This command will do the following:
...
@@ -118,7 +118,7 @@ This command will do the following:
In case you see an error about wheel not found when running the above command, it might be because the commit you based on in the main branch was just merged and the wheel is being built. In this case, you can wait for around an hour to try again, or manually assign the previous commit in the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
In case you see an error about wheel not found when running the above command, it might be because the commit you based on in the main branch was just merged and the wheel is being built. In this case, you can wait for around an hour to try again, or manually assign the previous commit in the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
```console
```bash
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.
To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.
To avoid your system being overloaded, you can limit the number of compilation jobs
To avoid your system being overloaded, you can limit the number of compilation jobs
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
```console
```bash
export MAX_JOBS=6
export MAX_JOBS=6
pip install-e .
pip install-e .
```
```
...
@@ -194,7 +194,7 @@ A side effect is a much slower build process.
...
@@ -194,7 +194,7 @@ A side effect is a much slower build process.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
```console
```bash
# Use `--ipc=host` to make sure the shared memory is large enough.
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run \
docker run \
--gpus all \
--gpus all \
...
@@ -205,14 +205,14 @@ docker run \
...
@@ -205,14 +205,14 @@ docker run \
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
```console
```bash
export CUDA_HOME=/usr/local/cuda
export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH"
export PATH="${CUDA_HOME}/bin:$PATH"
```
```
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
```console
```bash
nvcc --version# verify that nvcc is in your PATH
nvcc --version# verify that nvcc is in your PATH
${CUDA_HOME}/bin/nvcc --version# verify that nvcc is in your CUDA_HOME
${CUDA_HOME}/bin/nvcc --version# verify that nvcc is in your CUDA_HOME
```
```
...
@@ -223,7 +223,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
...
@@ -223,7 +223,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
```console
```bash
export VLLM_TARGET_DEVICE=empty
export VLLM_TARGET_DEVICE=empty
pip install-e .
pip install-e .
```
```
...
@@ -238,7 +238,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
...
@@ -238,7 +238,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
Another way to access the latest code is to use the docker images:
Another way to access the latest code is to use the docker images:
```console
```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
@@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels.
...
@@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels.
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
```console
```bash
python3 -m pip install ninja cmake wheel pybind11
python3 -m pip install ninja cmake wheel pybind11
pip uninstall -y triton
pip uninstall -y triton
git clone https://github.com/OpenAI/triton.git
git clone https://github.com/OpenAI/triton.git
...
@@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels.
...
@@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
@@ -148,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows.
...
@@ -148,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows.
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```console
```json
{
{
"features":{
"features":{
"buildkit":true
"buildkit":true
...
@@ -158,7 +158,7 @@ It is important that the user kicks off the docker build using buildkit. Either
...
@@ -158,7 +158,7 @@ It is important that the user kicks off the docker build using buildkit. Either
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```console
```bash
{
{
"features": {
"features": {
"buildkit": true
"buildkit": true
...
@@ -187,13 +187,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
...
@@ -187,13 +187,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default: