Commit 99324e25 authored by zhuwenwen

Merge tag 'v0.9.2' into v0.9.2-ori

parents cc7f22a8 a5dd03c1
# AWS Neuron

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This page describes how to set up your environment to run vLLM on Neuron.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

## Requirements

- OS: Linux
- Python: 3.9 or newer
@@ -21,57 +20,53 @@

### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to follow this
[quick start guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami) using the Neuron Deep Learning AMI (Amazon machine image).

- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running

```bash
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
```

Refer to the [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)
for alternative setup instructions including using Docker and manually installing dependencies.

!!! note
    NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
## Set up using Python

### Pre-built wheels

Currently, there are no pre-built Neuron wheels.

### Build wheel from source

To build and install vLLM from source, run:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
```
AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
available on vLLM V0. Please utilize the AWS Fork for the following features:

- Llama-3.2 multi-modal support
- Multi-node distributed inference

Refer to [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)
for more details and usage examples.

To install the AWS Neuron fork, run the following:

```bash
git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
cd upstreaming-to-vllm
pip install -r requirements/neuron.txt
```

@@ -80,75 +75,73 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .

Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardware is not tested.
## Set up using Docker

### Pre-built images

Currently, there are no pre-built Neuron images.

### Build image from source

See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.

Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.

## Extra information
[](){ #feature-support-through-nxd-inference-backend }

### Feature support through NxD Inference backend

The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
to perform most of the heavy lifting, which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.

To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include

```python
override_neuron_config={
    "enable_bucketing":False,
}
```

or when launching vLLM from the CLI, pass

```bash
--override-neuron-config "{\"enable_bucketing\":false}"
```

Alternatively, you can directly call the NxDI library to trace and compile your model, then load the pre-compiled artifacts
(via the `NEURON_COMPILED_ARTIFACTS` environment variable) in vLLM to run inference workloads.
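
Hand-escaping the JSON for the CLI form is easy to get wrong. As a minimal sketch, the same dictionary can be rendered into a shell-safe argument with the standard library. The helper name `to_cli_flag` is hypothetical and not part of vLLM; only the `--override-neuron-config` flag and the `enable_bucketing` key come from the text above.

```python
import json
import shlex

# Only `enable_bucketing` appears in the documentation above; other keys
# would follow the same pattern.
override_neuron_config = {"enable_bucketing": False}


def to_cli_flag(config: dict) -> str:
    """Render a config dict as a shell-safe --override-neuron-config argument.

    Hypothetical helper for illustration; vLLM does not ship this function.
    """
    # json.dumps converts Python False to JSON false; compact separators
    # keep the payload free of spaces.
    payload = json.dumps(config, separators=(",", ":"))
    # shlex.quote wraps the payload in single quotes so the shell passes
    # the double quotes through unchanged.
    return f"--override-neuron-config {shlex.quote(payload)}"


print(to_cli_flag(override_neuron_config))
```

The single-quoted result is equivalent to the backslash-escaped form shown above once the shell has parsed it.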
### Known limitations

- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
  for how to convert pretrained EAGLE model checkpoints to be compatible with NxDI.
- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
  runtime is not currently supported. Refer to this [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py).
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
  to allocate an additional block to ensure there is enough memory for the number of lookahead slots, but since we do not have good support
  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
  implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
### Environment variables

- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
  artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
  but the directory does not exist, or the contents are invalid, Neuron will also fall back to a new compilation and store the artifacts
  under this specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
- `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
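
The lookup rules described for `NEURON_COMPILED_ARTIFACTS` can be sketched as a small decision function. This is an illustration of the documented behavior only, not the Neuron module's actual code: the function name `resolve_artifacts_dir` is hypothetical, and "invalid contents" is approximated here as "missing or empty directory", which is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Tuple


def resolve_artifacts_dir(
    model_path: str,
    unique_hash: str,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Path, bool]:
    """Return (artifacts_dir, needs_compile) following the rules described above.

    Hypothetical helper; the real validity check is internal to Neuron.
    """
    configured = env.get("NEURON_COMPILED_ARTIFACTS")
    if not configured:
        # Variable unset: compile and save under the model path.
        default_dir = Path(model_path) / "neuron-compiled-artifacts" / unique_hash
        return default_dir, True

    target = Path(configured)
    # Assumption: a usable directory is one that exists and is not empty.
    usable = target.is_dir() and any(target.iterdir())
    # If unusable, a fresh compilation is stored under the configured path.
    return target, not usable


print(resolve_artifacts_dir("/models/llama", "abc123", env={}))
```

Passing `env` explicitly keeps the sketch easy to try without touching the real process environment.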
@@ -76,21 +76,25 @@ Currently, there are no pre-built CPU wheels.

### Build image from source

??? Commands

    ```bash
    docker build -f docker/Dockerfile.cpu \
        --tag vllm-cpu-env \
        --target vllm-openai .

    # Launching OpenAI server
    docker run --rm \
        --privileged=true \
        --shm-size=4g \
        -p 8000:8000 \
        -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
        -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
        vllm-cpu-env \
        --model=meta-llama/Llama-3.2-1B-Instruct \
        --dtype=bfloat16 \
        other vLLM OpenAI server arguments
    ```

!!! tip
    For ARM or Apple silicon, use `docker/Dockerfile.arm`
@@ -114,12 +118,13 @@ vLLM CPU backend supports the following vLLM features:

- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound to CPU cores 0-31. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes: the 32 OpenMP threads of rank0 are bound to CPU cores 0-31, and the OpenMP threads of rank1 are bound to CPU cores 32-63. When set to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. When set to `all`, the OpenMP threads of each rank use all CPU cores available on the system. Default value is `auto`.
- `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when `VLLM_CPU_OMP_THREADS_BIND` is set to `auto`. Default value is `0`.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
- `VLLM_CPU_SGL_KERNEL` (Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
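
To make the `VLLM_CPU_OMP_THREADS_BIND` format concrete, the following sketch expands a value such as `0-31|32-63` into one list of core IDs per rank. It is not vLLM's own parser: `parse_omp_threads_bind` is a hypothetical helper, and accepting comma-separated items is an assumption, since the description above only shows ranges.

```python
from typing import List


def parse_omp_threads_bind(value: str) -> List[List[int]]:
    """Expand e.g. "0-31|32-63" into one list of CPU core IDs per rank.

    Hypothetical helper for illustration only.
    """
    if value in ("auto", "all"):
        # These modes are resolved by vLLM at runtime, not by this sketch.
        raise ValueError(f"{value!r} has no explicit core list to expand")

    ranks: List[List[int]] = []
    for rank_spec in value.split("|"):  # "|" separates tensor parallel ranks
        cores: List[int] = []
        # Assumption: comma-separated items are allowed within one rank.
        for item in rank_spec.split(","):
            if "-" in item:
                first, last = (int(part) for part in item.split("-"))
                cores.extend(range(first, last + 1))  # ranges are inclusive
            else:
                cores.append(int(item))
        ranks.append(cores)
    return ranks


ranks = parse_omp_threads_bind("0-31|32-63")
print(len(ranks), [len(cores) for cores in ranks])
```

For `0-31|32-63` this yields two ranks with 32 cores each, matching the description of two tensor parallel processes.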
## Performance tips

- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:

```bash
sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
find / -name '*libtcmalloc*' # find the dynamic link library path
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
```

@@ -128,7 +133,7 @@ python examples/offline_inference/basic/basic.py # run vLLM

- When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
```

@@ -136,7 +141,7 @@ vllm serve facebook/opt-125m

or using default auto thread binding:

```bash
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_NUM_OF_RESERVED_CPU=2
vllm serve facebook/opt-125m
```

@@ -144,32 +149,34 @@ vllm serve facebook/opt-125m
- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND`, or to rely on the auto thread binding feature, which is the default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:

??? Commands

    ```console
    $ lscpu -e # check the mapping between logical CPU cores and physical CPU cores

    # The "CPU" column means the logical CPU core IDs, and the "CORE" column means the physical core IDs. On this platform, two logical cores are sharing one physical core.
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ
    0    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    1    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    2    0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    3    0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    4    0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    5    0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    6    0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    7    0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000
    8    0      0    0 0:0:0:0          yes 2401.0000 800.0000  800.000
    9    0      0    1 1:1:1:0          yes 2401.0000 800.0000  800.000
    10   0      0    2 2:2:2:0          yes 2401.0000 800.0000  800.000
    11   0      0    3 3:3:3:0          yes 2401.0000 800.0000  800.000
    12   0      0    4 4:4:4:0          yes 2401.0000 800.0000  800.000
    13   0      0    5 5:5:5:0          yes 2401.0000 800.0000  800.000
    14   0      0    6 6:6:6:0          yes 2401.0000 800.0000  800.000
    15   0      0    7 7:7:7:0          yes 2401.0000 800.0000  800.000

    # On this platform, it is recommended to only bind OpenMP threads on logical CPU cores 0-7 or 8-15
    $ export VLLM_CPU_OMP_THREADS_BIND=0-7
    $ python examples/offline_inference/basic/basic.py
    ```

- If using vLLM CPU backend on a multi-socket machine with NUMA, be sure to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.
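
The hyper-threading and NUMA tips can be combined in a short script that reads `lscpu -e` style output and keeps one logical CPU per physical core, grouped by NUMA node. This is a sketch under assumptions: `one_cpu_per_core` is a hypothetical helper, the sample rows are generated to mirror the 16-CPU listing above, and the parser assumes the `CPU`, `NODE`, and `CORE` columns hold integers (real systems may print `-` for `NODE`).

```python
from collections import defaultdict
from typing import Dict, List

# Rebuild the listing shown above: logical CPUs i and i + 8 share core i.
HEADER = "CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ"
ROWS = [
    f"{cpu} 0 0 {cpu % 8} {cpu % 8}:{cpu % 8}:{cpu % 8}:0 yes 2401.0000 800.0000 800.000"
    for cpu in range(16)
]
sample = "\n".join([HEADER, *ROWS])


def one_cpu_per_core(lscpu_text: str) -> Dict[int, List[int]]:
    """Map NUMA node -> logical CPUs, keeping the first logical CPU of each core.

    Hypothetical helper for illustration only.
    """
    lines = lscpu_text.strip().splitlines()
    header = lines[0].split()
    cpu_i, node_i, core_i = (header.index(name) for name in ("CPU", "NODE", "CORE"))

    seen = set()
    per_node: Dict[int, List[int]] = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        cpu, node, core = int(fields[cpu_i]), int(fields[node_i]), int(fields[core_i])
        if (node, core) in seen:
            continue  # a sibling hyper-thread of a core already selected
        seen.add((node, core))
        per_node[node].append(cpu)
    return dict(per_node)


print(one_cpu_per_core(sample))
```

On the sample platform this selects logical CPUs 0 through 7 on node 0, which corresponds to the `VLLM_CPU_OMP_THREADS_BIND=0-7` recommendation above.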
@@ -183,14 +190,20 @@ $ python examples/offline_inference/basic/basic.py

- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:

```bash
VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
    vllm serve meta-llama/Llama-2-7b-chat-hf \
    -tp=2 \
    --distributed-executor-backend mp
```

or using default auto thread binding:

```bash
VLLM_CPU_KVCACHE_SPACE=40 \
    vllm serve meta-llama/Llama-2-7b-chat-hf \
    -tp=2 \
    --distributed-executor-backend mp
```

- For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee that the threads in the list belong to the same NUMA node.
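
The same-NUMA-node requirement can be checked before launching the server. The sketch below is illustrative only: `ranks_crossing_numa` and `expand` are hypothetical helpers, the topology dictionary is made up (cores 0-31 on node 0, cores 32-63 on node 1), and only `first-last` ranges are handled, as in the examples above.

```python
from typing import Dict, List


def expand(spec: str) -> List[int]:
    """Expand a single inclusive range such as "0-31" (or a lone ID) into core IDs."""
    first, _, last = spec.partition("-")
    return list(range(int(first), int(last or first) + 1))


def ranks_crossing_numa(bind: str, cpu_to_node: Dict[int, int]) -> List[int]:
    """Return the indices of ranks whose cores span more than one NUMA node."""
    bad = []
    for rank, spec in enumerate(bind.split("|")):
        nodes = {cpu_to_node[cpu] for cpu in expand(spec)}
        if len(nodes) > 1:
            bad.append(rank)
    return bad


# Hypothetical topology: cores 0-31 on node 0, cores 32-63 on node 1.
topology = {cpu: cpu // 32 for cpu in range(64)}

print(ranks_crossing_numa("0-31|32-63", topology))  # no rank crosses a node
print(ranks_crossing_numa("0-47|48-63", topology))  # rank 0 spans both nodes
```

In practice the `cpu_to_node` mapping would come from a tool such as `lscpu -e` rather than being hard-coded.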
......
@@ -25,11 +25,11 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.

After installation of Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/cpu.txt
pip install -e .
```

!!! note
......
@@ -23,7 +23,7 @@ ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.

# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]

--8<-- "docs/getting_started/installation/cpu/build.inc.md"

Testing has been conducted on AWS Graviton3 instances for compatibility.
......
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```

@@ -8,14 +8,14 @@ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /

Second, clone the vLLM project:

```bash
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```

Third, install the Python packages needed to build the vLLM CPU backend:

```bash
pip install --upgrade pip
pip install "cmake>=3.26.1" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```

@@ -23,13 +23,13 @@ pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorc

Finally, build and install the vLLM CPU backend:

```bash
VLLM_TARGET_DEVICE=cpu python setup.py install
```

If you want to develop vLLM, install it in editable mode instead.

```bash
VLLM_TARGET_DEVICE=cpu python setup.py develop
```
......
...@@ -26,7 +26,7 @@ Currently the CPU implementation for s390x architecture supports FP32 datatype o ...@@ -26,7 +26,7 @@ Currently the CPU implementation for s390x architecture supports FP32 datatype o
Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4: Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4:
```console ```bash
dnf install -y \ dnf install -y \
which procps findutils tar vim git gcc g++ make patch make cython zlib-devel \ which procps findutils tar vim git gcc g++ make patch make cython zlib-devel \
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \ libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
...@@ -35,7 +35,7 @@ dnf install -y \ ...@@ -35,7 +35,7 @@ dnf install -y \
Install Rust (>=1.80), which is needed to install the `outlines-core` and `uvloop` Python packages. Install Rust (>=1.80), which is needed to install the `outlines-core` and `uvloop` Python packages.
```console ```bash
curl https://sh.rustup.rs -sSf | sh -s -- -y && \ curl https://sh.rustup.rs -sSf | sh -s -- -y && \
. "$HOME/.cargo/env" . "$HOME/.cargo/env"
``` ```
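Before building, it can help to confirm that the installed toolchain meets the `rust>=1.80` requirement. The following is a minimal sketch and not part of the vLLM documentation: `version_ge` is a hypothetical helper name, and the comparison assumes GNU `sort` with the `-V` (version sort) option.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query rustc if it is installed; `rustc --version` prints e.g. "rustc 1.80.0 (...)".
if command -v rustc >/dev/null 2>&1; then
    installed="$(rustc --version | awk '{print $2}')"
    if version_ge "$installed" "1.80"; then
        echo "rustc $installed satisfies rust>=1.80"
    else
        echo "rustc $installed is older than 1.80" >&2
    fi
fi
```

If the check reports an older compiler, updating the toolchain before installing the Python requirements avoids a failure partway through the build.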
...@@ -45,7 +45,7 @@ Execute the following commands to build and install vLLM from the source. ...@@ -45,7 +45,7 @@ Execute the following commands to build and install vLLM from the source.
!!! tip !!! tip
Build the `torchvision` and `pyarrow` dependencies from source before building vLLM. Build the `torchvision` and `pyarrow` dependencies from source before building vLLM.
```console ```bash
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
pip install -v \ pip install -v \
--extra-index-url https://download.pytorch.org/whl/nightly/cpu \ --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
......
...@@ -24,7 +24,7 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform, ...@@ -24,7 +24,7 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
# --8<-- [end:pre-built-wheels] # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] # --8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/cpu/build.inc.md" --8<-- "docs/getting_started/installation/cpu/build.inc.md"
!!! note !!! note
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which improves performance over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16. - AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which improves performance over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
......
# --8<-- [start:installation] # Google TPU
Tensor Processing Units (TPUs) are Google's custom-developed application-specific Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
...@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp ...@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
!!! warning !!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source. There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
# --8<-- [end:installation] ## Requirements
# --8<-- [start:requirements]
- Google Cloud TPU VM - Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4 - TPU versions: v6e, v5e, v5p, v4
...@@ -58,9 +57,10 @@ assigned to your Google Cloud project for your immediate exclusive use. ...@@ -58,9 +57,10 @@ assigned to your Google Cloud project for your immediate exclusive use.
### Provision Cloud TPUs with GKE ### Provision Cloud TPUs with GKE
For more information about using TPUs with GKE, see: For more information about using TPUs with GKE, see:
- <https://cloud.google.com/kubernetes-engine/docs/how-to/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus> - [About TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/tpus)
- <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus> - [Deploy TPU workloads in GKE Standard](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus)
- [Plan for TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus)
## Configure a new environment ## Configure a new environment
...@@ -68,42 +68,43 @@ For more information about using TPUs with GKE, see: ...@@ -68,42 +68,43 @@ For more information about using TPUs with GKE, see:
Create a TPU v5e with 4 TPU chips: Create a TPU v5e with 4 TPU chips:
```console ```bash
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \ --node-id TPU_NAME \
--project PROJECT_ID \ --project PROJECT_ID \
--zone ZONE \ --zone ZONE \
--accelerator-type ACCELERATOR_TYPE \ --accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \ --runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT --service-account SERVICE_ACCOUNT
``` ```
| Parameter name | Description | | Parameter name | Description |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. | | QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
| TPU_NAME | The user-assigned name of the TPU which is created when the queued | | TPU_NAME | The user-assigned name of the TPU which is created when the queued resource request is allocated. |
| PROJECT_ID | Your Google Cloud project | | PROJECT_ID | Your Google Cloud project |
| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use | | ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones] |
| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example | | ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). | | RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images]. |
<figcaption>Parameter descriptions</figcaption> | SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` |
Connect to your TPU using SSH: Connect to your TPU VM using SSH:
```bash ```bash
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
``` ```
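The `queued-resources create` command above has seven placeholders, and it is easy to leave one unfilled. As a rough sketch (the helper name `tpu_create_cmd` is hypothetical and not part of `gcloud`), the function below only prints the command for review and refuses to do so when a value is missing. It uses the same flags as the command above and nothing else.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the queued-resource create command.
# Arguments, in order: QUEUED_RESOURCE_ID TPU_NAME PROJECT_ID ZONE
#                      ACCELERATOR_TYPE RUNTIME_VERSION SERVICE_ACCOUNT
tpu_create_cmd() {
    local arg
    if [ "$#" -ne 7 ]; then
        echo "expected 7 arguments, got $#" >&2
        return 1
    fi
    # Reject empty values so a forgotten placeholder is caught early.
    for arg in "$@"; do
        if [ -z "$arg" ]; then
            echo "empty argument" >&2
            return 1
        fi
    done
    printf 'gcloud alpha compute tpus queued-resources create %s' "$1"
    printf ' --node-id %s --project %s --zone %s' "$2" "$3" "$4"
    printf ' --accelerator-type %s --runtime-version %s' "$5" "$6"
    printf ' --service-account %s\n' "$7"
}
```

Printing the command first keeps the helper free of side effects. Once the output looks right, it can be pasted into the shell.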
# --8<-- [end:requirements] [TPU versions]: https://cloud.google.com/tpu/docs/runtimes
# --8<-- [start:set-up-using-python] [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones
# --8<-- [end:set-up-using-python] ## Set up using Python
# --8<-- [start:pre-built-wheels]
### Pre-built wheels
Currently, there are no pre-built TPU wheels. Currently, there are no pre-built TPU wheels.
# --8<-- [end:pre-built-wheels] ### Build wheel from source
# --8<-- [start:build-wheel-from-source]
Install Miniconda: Install Miniconda:
...@@ -136,7 +137,7 @@ Install build dependencies: ...@@ -136,7 +137,7 @@ Install build dependencies:
```bash ```bash
pip install -r requirements/tpu.txt pip install -r requirements/tpu.txt
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
``` ```
Run the setup script: Run the setup script:
...@@ -145,26 +146,23 @@ Run the setup script: ...@@ -145,26 +146,23 @@ Run the setup script:
VLLM_TARGET_DEVICE="tpu" python -m pip install -e . VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
``` ```
# --8<-- [end:build-wheel-from-source] ## Set up using Docker
# --8<-- [start:set-up-using-docker]
# --8<-- [end:set-up-using-docker] ### Pre-built images
# --8<-- [start:pre-built-images]
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`. See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
# --8<-- [end:pre-built-images] ### Build image from source
# --8<-- [start:build-image-from-source]
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support. You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
```console ```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu . docker build -f docker/Dockerfile.tpu -t vllm-tpu .
``` ```
Run the Docker image with the following command: Run the Docker image with the following command:
```console ```bash
# Make sure to add `--privileged --net host --shm-size=16G`. # Make sure to add `--privileged --net host --shm-size=16G`.
docker run --privileged --net host --shm-size=16G -it vllm-tpu docker run --privileged --net host --shm-size=16G -it vllm-tpu
``` ```
...@@ -187,12 +185,6 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu ...@@ -187,12 +185,6 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
Install OpenBLAS with the following command: Install OpenBLAS with the following command:
```console ```bash
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
``` ```
# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
There is no extra information for this device.
# --8<-- [end:extra-information]
...@@ -42,7 +42,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -42,7 +42,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:create-a-new-python-environment" --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:set-up-using-python"
=== "AMD ROCm" === "AMD ROCm"
......
...@@ -10,8 +10,6 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries. ...@@ -10,8 +10,6 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [end:requirements] # --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] # --8<-- [start:set-up-using-python]
### Create a new Python environment
!!! note !!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details. PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
...@@ -24,7 +22,7 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I ...@@ -24,7 +22,7 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
You can install vLLM using either `pip` or `uv pip`: You can install vLLM using either `pip` or `uv pip`:
```console ```bash
# Install vLLM with CUDA 12.8. # Install vLLM with CUDA 12.8.
# If you are using pip. # If you are using pip.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128 pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
...@@ -39,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in ...@@ -39,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions: As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
```console ```bash
# Install vLLM with CUDA 11.8. # Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1 export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=312 export PYTHON_VERSION=312
...@@ -54,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe ...@@ -54,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
##### Install the latest code using `pip` ##### Install the latest code using `pip`
```console ```bash
pip install -U vllm \ pip install -U vllm \
--pre \ --pre \
--extra-index-url https://wheels.vllm.ai/nightly --extra-index-url https://wheels.vllm.ai/nightly
...@@ -64,7 +62,7 @@ pip install -U vllm \ ...@@ -64,7 +62,7 @@ pip install -U vllm \
Another way to install the latest code is to use `uv`: Another way to install the latest code is to use `uv`:
```console ```bash
uv pip install -U vllm \ uv pip install -U vllm \
--torch-backend=auto \ --torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly --extra-index-url https://wheels.vllm.ai/nightly
...@@ -74,7 +72,7 @@ uv pip install -U vllm \ ...@@ -74,7 +72,7 @@ uv pip install -U vllm \
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL: If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
```console ```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
``` ```
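Because the commit hash is embedded in the URL, a truncated or mistyped hash only surfaces as a download error. The sketch below wraps the URL pattern shown above in a hypothetical helper, `vllm_wheel_url`, which checks that its argument looks like a full 40-character lowercase hexadecimal hash before printing the URL. The validation rule is an assumption about what "full commit hash" means here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the per-commit wheel URL from a full commit hash.
vllm_wheel_url() {
    # Reject empty input and anything containing non-hex characters.
    case "$1" in
        ""|*[!0-9a-f]*)
            echo "not a lowercase hex commit hash: '$1'" >&2
            return 1
            ;;
    esac
    # Abbreviated hashes will not match a wheel directory, so require all 40 characters.
    if [ "${#1}" -ne 40 ]; then
        echo "expected a full 40-character hash, got ${#1} characters" >&2
        return 1
    fi
    echo "https://wheels.vllm.ai/$1/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
}
```

With the helper defined, the install step becomes `pip install "$(vllm_wheel_url "$VLLM_COMMIT")"`.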
...@@ -85,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p ...@@ -85,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL: If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
```console ```bash
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm \ uv pip install vllm \
--torch-backend=auto \ --torch-backend=auto \
...@@ -101,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb ...@@ -101,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM: If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
```console ```bash
git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
cd vllm cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable . VLLM_USE_PRECOMPILED=1 pip install --editable .
...@@ -120,7 +118,7 @@ This command will do the following: ...@@ -120,7 +118,7 @@ This command will do the following:
If the above command fails with a wheel-not-found error, the commit you are based on in the main branch may have just been merged, with its wheel still being built. In that case, wait about an hour and try again, or manually point the installation at the previous commit using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable. If the above command fails with a wheel-not-found error, the commit you are based on in the main branch may have just been merged, with its wheel still being built. In that case, wait about an hour and try again, or manually point the installation at the previous commit using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
```console ```bash
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable . pip install --editable .
...@@ -136,7 +134,7 @@ You can find more information about vLLM's wheels in [install-the-latest-code][i ...@@ -136,7 +134,7 @@ You can find more information about vLLM's wheels in [install-the-latest-code][i
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes: If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
```console ```bash
git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
cd vllm cd vllm
pip install -e . pip install -e .
...@@ -153,6 +151,9 @@ pip install -e . ...@@ -153,6 +151,9 @@ pip install -e .
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments. [sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`. The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
!!! note "Faster Kernel Development"
For frequent C++/CUDA kernel changes, after the initial `pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code.
##### Use an existing PyTorch installation ##### Use an existing PyTorch installation
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.: There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
...@@ -162,7 +163,7 @@ There are scenarios where the PyTorch dependency cannot be easily installed via ...@@ -162,7 +163,7 @@ There are scenarios where the PyTorch dependency cannot be easily installed via
To build vLLM using an existing PyTorch installation: To build vLLM using an existing PyTorch installation:
```console ```bash
git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
cd vllm cd vllm
python use_existing_torch.py python use_existing_torch.py
...@@ -175,7 +176,7 @@ pip install --no-build-isolation -e . ...@@ -175,7 +176,7 @@ pip install --no-build-isolation -e .
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead. Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory. To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.
```console ```bash
git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
cd vllm cd vllm
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e . VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
...@@ -186,7 +187,7 @@ VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e . ...@@ -186,7 +187,7 @@ VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
To avoid your system being overloaded, you can limit the number of compilation jobs To avoid your system being overloaded, you can limit the number of compilation jobs
to be run simultaneously, via the environment variable `MAX_JOBS`. For example: to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
```console ```bash
export MAX_JOBS=6 export MAX_JOBS=6
pip install -e . pip install -e .
``` ```
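A fixed value such as `6` may not suit every machine. The following sketch derives `MAX_JOBS` from the core count and the installed memory. The helper name `pick_max_jobs` and the default of 4 GiB of RAM per compilation job are assumptions made for illustration, not figures from the vLLM documentation.

```bash
#!/usr/bin/env bash
# Hypothetical heuristic: one job per core, capped so that each job has
# roughly per_job_gib GiB of RAM (third argument, default 4), never below 1.
pick_max_jobs() {
    local cores="$1" mem_gib="$2" per_job_gib="${3:-4}"
    local by_mem=$(( mem_gib / per_job_gib ))
    local jobs=$(( cores < by_mem ? cores : by_mem ))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# On Linux, nproc reports the core count and /proc/meminfo reports MemTotal in kB.
if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
    mem_gib="$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)"
    MAX_JOBS="$(pick_max_jobs "$(nproc)" "$mem_gib")"
    export MAX_JOBS
    echo "MAX_JOBS=$MAX_JOBS"
fi
```

Raise or lower the per-job memory figure if the build still exhausts memory or leaves cores idle.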
...@@ -196,7 +197,7 @@ A side effect is a much slower build process. ...@@ -196,7 +197,7 @@ A side effect is a much slower build process.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image. Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
```console ```bash
# Use `--ipc=host` to make sure the shared memory is large enough. # Use `--ipc=host` to make sure the shared memory is large enough.
docker run \ docker run \
--gpus all \ --gpus all \
...@@ -207,14 +208,14 @@ docker run \ ...@@ -207,14 +208,14 @@ docker run \
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.: If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
```console ```bash
export CUDA_HOME=/usr/local/cuda export CUDA_HOME=/usr/local/cuda
export PATH="${CUDA_HOME}/bin:$PATH" export PATH="${CUDA_HOME}/bin:$PATH"
``` ```
Here is a sanity check to verify that the CUDA Toolkit is correctly installed: Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
```console ```bash
nvcc --version # verify that nvcc is in your PATH nvcc --version # verify that nvcc is in your PATH
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
``` ```
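The two commands above can succeed while still pointing at different toolkits, for example when an older `nvcc` appears earlier in `PATH`. The sketch below combines both checks in a hypothetical helper, `check_cuda_toolkit`. It compares paths as plain strings, so a symlinked `CUDA_HOME` (such as `/usr/local/cuda`) is treated as different from its target.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if nvcc is on PATH and the nvcc found
# there is the one under $CUDA_HOME/bin. Paths are compared as strings,
# so symlinks are not resolved.
check_cuda_toolkit() {
    local cuda_home="${CUDA_HOME:-}"
    local on_path
    if [ -z "$cuda_home" ]; then
        echo "CUDA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$cuda_home/bin/nvcc" ]; then
        echo "no executable nvcc under $cuda_home/bin" >&2
        return 1
    fi
    on_path="$(command -v nvcc)" || { echo "nvcc is not on PATH" >&2; return 1; }
    if [ "$on_path" != "$cuda_home/bin/nvcc" ]; then
        echo "PATH resolves nvcc to $on_path, not $cuda_home/bin/nvcc" >&2
        return 1
    fi
    echo "nvcc OK: $on_path"
}
```

Run `check_cuda_toolkit` after exporting `CUDA_HOME` and `PATH`. A non-zero exit status indicates which of the two settings needs attention.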
...@@ -225,7 +226,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui ...@@ -225,7 +226,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
Set the `VLLM_TARGET_DEVICE` environment variable to `empty` before installing: Set the `VLLM_TARGET_DEVICE` environment variable to `empty` before installing:
```console ```bash
export VLLM_TARGET_DEVICE=empty export VLLM_TARGET_DEVICE=empty
pip install -e . pip install -e .
``` ```
...@@ -240,7 +241,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i ...@@ -240,7 +241,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
Another way to access the latest code is to use the docker images: Another way to access the latest code is to use the docker images:
```console ```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT} docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
``` ```
...@@ -254,7 +255,10 @@ The latest code can contain bugs and may not be stable. Please use it with cauti ...@@ -254,7 +255,10 @@ The latest code can contain bugs and may not be stable. Please use it with cauti
See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image. See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
## Supported features # --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information. See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
# --8<-- [end:supported-features]
# --8<-- [end:extra-information] # --8<-- [end:extra-information]
...@@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels. ...@@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels.
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example: Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
```console ```bash
# Install PyTorch # Install PyTorch
$ pip uninstall torch -y pip uninstall torch -y
$ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3 pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
``` ```
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton) 1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md) Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
```console ```bash
python3 -m pip install ninja cmake wheel pybind11 python3 -m pip install ninja cmake wheel pybind11
pip uninstall -y triton pip uninstall -y triton
git clone https://github.com/OpenAI/triton.git git clone https://github.com/OpenAI/triton.git
...@@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels. ...@@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`. For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
```console ```bash
git clone https://github.com/ROCm/flash-attention.git git clone https://github.com/ROCm/flash-attention.git
cd flash-attention cd flash-attention
git checkout b7d29fb git checkout b7d29fb
...@@ -76,7 +76,7 @@ Currently, there are no pre-built ROCm wheels. ...@@ -76,7 +76,7 @@ Currently, there are no pre-built ROCm wheels.
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps: 3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
```console ```bash
python3 -m pip uninstall -y aiter python3 -m pip uninstall -y aiter
git clone --recursive https://github.com/ROCm/aiter.git git clone --recursive https://github.com/ROCm/aiter.git
cd aiter cd aiter
...@@ -90,24 +90,26 @@ Currently, there are no pre-built ROCm wheels. ...@@ -90,24 +90,26 @@ Currently, there are no pre-built ROCm wheels.
4. Build vLLM. For example, vLLM on ROCm 6.3 can be built with the following steps: 4. Build vLLM. For example, vLLM on ROCm 6.3 can be built with the following steps:
```bash ??? Commands
pip install --upgrade pip
# Build & install AMD SMI ```bash
pip install /opt/rocm/share/amd_smi pip install --upgrade pip
# Install dependencies # Build & install AMD SMI
pip install --upgrade numba \ pip install /opt/rocm/share/amd_smi
scipy \
huggingface-hub[cli,hf_transfer] \
setuptools_scm
pip install "numpy<2"
pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300. # Install dependencies
export PYTORCH_ROCM_ARCH="gfx90a;gfx942" pip install --upgrade numba \
python3 setup.py develop scipy \
``` huggingface-hub[cli,hf_transfer] \
setuptools_scm
pip install "numpy<2"
pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
python3 setup.py develop
```
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation. This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
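The `PYTORCH_ROCM_ARCH` value in the build step is a semicolon-separated list of gfx architectures, and the earlier `rocminfo |grep gfx` hint shows where the local values come from. The sketch below chains the two: a hypothetical helper, `gfx_arch_list`, reads `rocminfo` output on stdin and prints the unique architectures in that format. It assumes architecture names appear in the output as `gfx` followed by hexadecimal digits, and it covers only the GPUs in the local machine, whereas the documented example targets several GPU families at once.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read rocminfo output on stdin and print the unique
# gfx architecture names joined with ';'.
gfx_arch_list() {
    grep -oE 'gfx[0-9a-f]+' | sort -u | paste -sd ';' -
}

# Only query the hardware when rocminfo is available.
if command -v rocminfo >/dev/null 2>&1; then
    PYTORCH_ROCM_ARCH="$(rocminfo | gfx_arch_list)"
    export PYTORCH_ROCM_ARCH
    echo "PYTORCH_ROCM_ARCH=$PYTORCH_ROCM_ARCH"
fi
```

Building only for the detected architectures can shorten the build, at the cost of a build that does not run on other GPU families.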
...@@ -146,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows. ...@@ -146,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows.
Make sure the Docker build uses BuildKit. Either set DOCKER_BUILDKIT=1 as an environment variable when calling the `docker build` command, or enable BuildKit in the Docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: Make sure the Docker build uses BuildKit. Either set DOCKER_BUILDKIT=1 as an environment variable when calling the `docker build` command, or enable BuildKit in the Docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```console ```json
{ {
"features": { "features": {
"buildkit": true "buildkit": true
@@ -156,7 +158,7 @@ It is important that the user kicks off the docker build using buildkit. Either
To build vLLM on ROCm 6.3 for the MI200 and MI300 series, you can use the default:

```bash
DOCKER_BUILDKIT=1 docker build \
    -f docker/Dockerfile.rocm_base \
    -t rocm/vllm-dev:base .
@@ -167,7 +169,7 @@ DOCKER_BUILDKIT=1 docker build \
First, build a Docker image from <gh-file:docker/Dockerfile.rocm> and launch a Docker container from the image.
The Docker build must use BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling `docker build`, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:

```json
{
    "features": {
        "buildkit": true
@@ -185,13 +187,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
To build vLLM on ROCm 6.3 for the MI200 and MI300 series, you can use the default:

```bash
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
```

To build vLLM on ROCm 6.3 for the Radeon RX7900 series (gfx1100), pick the alternative base image:

```bash
DOCKER_BUILDKIT=1 docker build \
    --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
    -f docker/Dockerfile.rocm \
@@ -201,23 +203,28 @@ DOCKER_BUILDKIT=1 docker build \
To run the above Docker image `vllm-rocm`, use the following command:

??? Command

    ```bash
    docker run -it \
        --network=host \
        --group-add=video \
        --ipc=host \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --device /dev/kfd \
        --device /dev/dri \
        -v <path/to/model>:/app/model \
        vllm-rocm \
        bash
    ```

Here, `<path/to/model>` is the location where the model is stored, for example, the weights for llama2 or llama3 models.
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]

See the [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.

# --8<-- [end:supported-features]
# --8<-- [end:extra-information]
@@ -22,10 +22,10 @@ Currently, there are no pre-built XPU wheels.

# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]

- First, install the required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel OneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.0 or later.
- Second, install the Python packages needed to build the vLLM XPU backend:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install --upgrade pip
@@ -34,7 +34,7 @@ pip install -v -r requirements/xpu.txt
- Then, build and install the vLLM XPU backend:

```bash
VLLM_TARGET_DEVICE=xpu python setup.py install
```
@@ -53,9 +53,9 @@ Currently, there are no pre-built XPU images.

# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]

```bash
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
docker run -it \
    --rm \
    --network=host \
    --device /dev/dri \
@@ -63,11 +63,12 @@ $ docker run -it \
    vllm-xpu-env
```
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]

The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. Ray is required as the distributed runtime backend. For example, a reference execution looks like the following:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model=facebook/opt-13b \
    --dtype=bfloat16 \
@@ -78,4 +79,6 @@ python -m vllm.entrypoints.openai.api_server \
```

By default, a Ray instance is launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution; refer to the <gh-file:examples/online_serving/run_cluster.sh> helper script.

# --8<-- [end:supported-features]
# --8<-- [end:extra-information]
# Intel Gaudi

This page provides instructions on running vLLM with Intel Gaudi devices.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

## Requirements

- OS: Ubuntu 22.04 LTS
- Python: 3.10
@@ -25,7 +24,7 @@ please follow the methods outlined in the
To verify that the Intel Gaudi software was correctly installed, run:

```bash
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
@@ -43,7 +42,7 @@ for more details.
Use the following commands to run a Docker image:

```bash
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run \
  -it \
@@ -56,20 +55,17 @@ docker run \
  vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built Intel Gaudi wheels.

### Build wheel from source

To build and install vLLM from source, run:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/hpu.txt
@@ -78,7 +74,7 @@ python setup.py develop
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and periodically upstreamed to the vLLM main repo. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:

```bash
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
@@ -86,18 +82,15 @@ pip install -r requirements/hpu.txt
python setup.py develop
```

## Set up using Docker

### Pre-built images

Currently, there are no pre-built Intel Gaudi images.

### Build image from source

```bash
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
docker run \
  -it \
@@ -112,13 +105,12 @@ docker run \
!!! tip
    If you see the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, refer to the "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure the `habana-container-runtime` package is installed and that the `habana` container runtime is registered.

## Extra information
# --8<-- [start:extra-information]
### Supported features

- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -129,14 +121,14 @@ docker run \
  for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi)

### Unsupported features

- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)

### Supported configurations

The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -183,7 +175,6 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
| 0 | 0 | torch.compile |
| 0 | 1 | PyTorch eager mode |
| 1 | 0 | HPU Graphs |

!!! warning
    In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in upcoming releases. For the best performance in 1.18.0, use HPU Graphs or PyTorch lazy mode.
@@ -207,9 +198,14 @@ INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, ma
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
```
| Parameter      | Description |
|----------------|-------------|
| `min`          | Determines the lowest value of the bucket. |
| `step`         | Determines the interval between buckets. |
| `max`          | Determines the upper bound of the bucket. |
| Ramp-up phase  | A special handling phase applied between `min` and `step`:<br/>- `min` is multiplied by consecutive powers of two until `step` is reached.<br/>- Minimizes resource wastage for small batch sizes.<br/>- Allows larger padding for larger batches. |

Example (with ramp-up):

```text
min = 2, step = 32, max = 64
@@ -218,7 +214,7 @@ min = 2, step = 32, max = 64
=> buckets = ramp_up + stable => (2, 4, 8, 16, 32, 64)
```
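The ramp-up rule can also be sketched in Python. This is an illustrative reimplementation for a single dimension, not vLLM's actual code: the function name `generate_buckets` and its argument names are hypothetical, and it assumes the stable phase simply advances from `step` to `max` in increments of `step`.

```python
def generate_buckets(bucket_min: int, bucket_step: int, bucket_max: int) -> list[int]:
    """Hypothetical helper: return the bucket boundaries for one dimension."""
    if bucket_min < 1 or bucket_step < 1:
        raise ValueError("bucket_min and bucket_step must be positive")

    buckets = []

    # Ramp-up phase: double `bucket_min` until `bucket_step` is reached
    # (never exceeding `bucket_max`).
    value = bucket_min
    while value < bucket_step and value <= bucket_max:
        buckets.append(value)
        value *= 2

    # Stable phase: advance in fixed increments of `bucket_step` up to `bucket_max`.
    value = bucket_step
    while value <= bucket_max:
        buckets.append(value)
        value += bucket_step

    return buckets


print(generate_buckets(2, 32, 64))     # [2, 4, 8, 16, 32, 64]
print(generate_buckets(128, 128, 512)) # [128, 256, 384, 512]
```

The first call reproduces the ramp-up example above. In the second call `min` equals `step`, so the ramp-up phase is empty and only the stable phase contributes.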
Example (without ramp-up):

```text
min = 128, step = 128, max = 512
@@ -241,19 +237,21 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Warmup is an optional but highly recommended step that occurs before the vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs so that no graph compilation overhead is incurred within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

??? Logs

    ```text
    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:79.16 GiB
    INFO 08-01 22:26:47 hpu_model_runner.py:1066] [Warmup][Prompt][2/24] batch_size:4 seq_len:896 free_mem:55.43 GiB
    INFO 08-01 22:26:48 hpu_model_runner.py:1066] [Warmup][Prompt][3/24] batch_size:4 seq_len:768 free_mem:55.43 GiB
    ...
    INFO 08-01 22:26:59 hpu_model_runner.py:1066] [Warmup][Prompt][24/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][1/48] batch_size:4 seq_len:2048 free_mem:55.43 GiB
    INFO 08-01 22:27:00 hpu_model_runner.py:1066] [Warmup][Decode][2/48] batch_size:4 seq_len:1920 free_mem:55.43 GiB
    INFO 08-01 22:27:01 hpu_model_runner.py:1066] [Warmup][Decode][3/48] batch_size:4 seq_len:1792 free_mem:55.43 GiB
    ...
    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size:2 seq_len:128 free_mem:55.43 GiB
    INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
    ```
This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later, skipping further graph compilations.
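As a rough cross-check of the step counts in the warmup log, the number of warmup passes equals the number of (batch size, sequence length) combinations. The sketch below assumes the bucket boundaries implied by the logs (batch sizes 1, 2, 4; sequence lengths in steps of 128 up to 1024 for prompt and 2048 for decode). The variable names are illustrative and the enumeration order is not meant to match the order vLLM uses.

```python
from itertools import product

# Bucket boundaries as they appear in the logged bucket lists (assumed here).
batch_sizes = [1, 2, 4]
prompt_seq_lens = list(range(128, 1024 + 1, 128))  # 128, 256, ..., 1024
decode_seq_lens = list(range(128, 2048 + 1, 128))  # 128, 256, ..., 2048

# Every (batch_size, seq_len) pair is one bucket, and one warmup pass.
prompt_buckets = list(product(batch_sizes, prompt_seq_lens))
decode_buckets = list(product(batch_sizes, decode_seq_lens))

print(len(prompt_buckets), len(decode_buckets))  # 24 48
```

The counts match the `[Warmup][Prompt][n/24]` and `[Warmup][Decode][n/48]` progress markers in the log.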
@@ -288,37 +286,39 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
Each described step is logged by the vLLM server, as follows (negative values correspond to memory being released):

??? Logs

    ```text
    INFO 08-02 17:37:44 hpu_model_runner.py:493] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 4], seq:[128, 128, 1024]
    INFO 08-02 17:37:44 hpu_model_runner.py:499] Generated 24 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
    INFO 08-02 17:37:44 hpu_model_runner.py:504] Decode bucket config (min, step, max_warmup) bs:[1, 128, 4], seq:[128, 128, 2048]
    INFO 08-02 17:37:44 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
    INFO 08-02 17:37:52 hpu_model_runner.py:430] Pre-loading model weights on hpu:0 took 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
    INFO 08-02 17:37:52 hpu_model_runner.py:438] Wrapping in HPU Graph took 0 B of device memory (14.97 GiB/94.62 GiB used) and -252 KiB of host memory (475.2 GiB/1007 GiB used)
    INFO 08-02 17:37:52 hpu_model_runner.py:442] Loading model weights took in total 14.97 GiB of device memory (14.97 GiB/94.62 GiB used) and 2.95 GiB of host memory (475.2 GiB/1007 GiB used)
    INFO 08-02 17:37:54 hpu_worker.py:134] Model profiling run took 504 MiB of device memory (15.46 GiB/94.62 GiB used) and 180.9 MiB of host memory (475.4 GiB/1007 GiB used)
    INFO 08-02 17:37:54 hpu_worker.py:158] Free device memory: 79.16 GiB, 39.58 GiB usable (gpu_memory_utilization=0.5), 15.83 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 23.75 GiB reserved for KV cache
    INFO 08-02 17:37:54 hpu_executor.py:85] # HPU blocks: 1519, # CPU blocks: 0
    INFO 08-02 17:37:54 hpu_worker.py:190] Initializing cache engine took 23.73 GiB of device memory (39.2 GiB/94.62 GiB used) and -1.238 MiB of host memory (475.4 GiB/1007 GiB used)
    INFO 08-02 17:37:54 hpu_model_runner.py:1066] [Warmup][Prompt][1/24] batch_size:4 seq_len:1024 free_mem:55.43 GiB
    ...
    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
    INFO 08-02 17:38:22 hpu_model_runner.py:1159] Using 15.85 GiB/55.43 GiB of free device memory for HPUGraphs, 7.923 GiB for prompt and 7.923 GiB for decode (VLLM_GRAPH_PROMPT_RATIO=0.3)
    INFO 08-02 17:38:22 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][1/24] batch_size:1 seq_len:128 free_mem:55.43 GiB
    ...
    INFO 08-02 17:38:26 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][11/24] batch_size:1 seq_len:896 free_mem:48.77 GiB
    INFO 08-02 17:38:27 hpu_model_runner.py:1066] [Warmup][Graph/Decode][1/48] batch_size:4 seq_len:128 free_mem:47.51 GiB
    ...
    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Decode][48/48] batch_size:1 seq_len:2048 free_mem:47.35 GiB
    INFO 08-02 17:38:41 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][12/24] batch_size:4 seq_len:256 free_mem:47.35 GiB
    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][13/24] batch_size:2 seq_len:512 free_mem:45.91 GiB
    INFO 08-02 17:38:42 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][14/24] batch_size:1 seq_len:1024 free_mem:44.48 GiB
    INFO 08-02 17:38:43 hpu_model_runner.py:1066] [Warmup][Graph/Prompt][15/24] batch_size:2 seq_len:640 free_mem:43.03 GiB
    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Prompt captured:15 (62.5%) used_mem:14.03 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (4, 128), (4, 256)]
    INFO 08-02 17:38:43 hpu_model_runner.py:1128] Graph/Decode captured:48 (100.0%) used_mem:161.9 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
    INFO 08-02 17:38:43 hpu_model_runner.py:1206] Warmup finished in 49 secs, allocated 14.19 GiB of device memory
    INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of device memory (53.39 GiB/94.62 GiB used) and 57.86 MiB of host memory (475.4 GiB/1007 GiB used)
    ```
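The memory split reported by `hpu_worker.py:158` in the log can be reproduced with a little arithmetic. The sketch below is an interpretation inferred from the logged numbers, not code taken from vLLM: it assumes the HPU Graph reservation is a fraction of the usable memory and that the remainder goes to the KV cache. The function name `split_device_memory` is hypothetical.

```python
def split_device_memory(free_gib: float,
                        gpu_memory_utilization: float,
                        graph_reserved_mem: float) -> tuple[float, float, float]:
    """Hypothetical helper: return (usable, graphs, kv_cache) in GiB."""
    # Only a fraction of the free device memory is made usable.
    usable = free_gib * gpu_memory_utilization
    # Assumption: VLLM_GRAPH_RESERVED_MEM is a fraction of the usable memory.
    graphs = usable * graph_reserved_mem
    # Assumption: whatever remains is reserved for the KV cache.
    kv_cache = usable - graphs
    return usable, graphs, kv_cache


# Values taken from the log line above.
usable, graphs, kv_cache = split_device_memory(79.16, 0.5, 0.4)
print(f"{usable:.2f} GiB usable, {graphs:.2f} GiB for HPUGraphs, {kv_cache:.2f} GiB for KV cache")
```

With the logged inputs this yields 39.58 GiB usable, 15.83 GiB for HPU Graphs, and 23.75 GiB for the KV cache, matching the log line.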
### Recommended vLLM Parameters ### Recommended vLLM Parameters
...@@ -354,28 +354,28 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi ...@@ -354,28 +354,28 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- `VLLM_{phase}_{dim}_BUCKET_{param}` - a collection of 12 environment variables that configure the ranges of the bucketing mechanism - `VLLM_{phase}_{dim}_BUCKET_{param}` - a collection of 12 environment variables that configure the ranges of the bucketing mechanism
* `{phase}` is either `PROMPT` or `DECODE` * `{phase}` is either `PROMPT` or `DECODE`
* `{dim}` is either `BS`, `SEQ` or `BLOCK` * `{dim}` is either `BS`, `SEQ` or `BLOCK`
* `{param}` is either `MIN`, `STEP` or `MAX` * `{param}` is either `MIN`, `STEP` or `MAX`
* Default values: * Default values:
| `{phase}` | Parameter | Env Variable | Value Expression |
|-----------|-----------|--------------|------------------|
| Prompt | Batch size min | `VLLM_PROMPT_BS_BUCKET_MIN` | `1` |
| Prompt | Batch size step | `VLLM_PROMPT_BS_BUCKET_STEP` | `min(max_num_seqs, 32)` |
| Prompt | Batch size max | `VLLM_PROMPT_BS_BUCKET_MAX` | `min(max_num_seqs, 64)` |
| Prompt | Sequence length min | `VLLM_PROMPT_SEQ_BUCKET_MIN` | `block_size` |
| Prompt | Sequence length step | `VLLM_PROMPT_SEQ_BUCKET_STEP` | `block_size` |
| Prompt | Sequence length max | `VLLM_PROMPT_SEQ_BUCKET_MAX` | `max_model_len` |
| Decode | Batch size min | `VLLM_DECODE_BS_BUCKET_MIN` | `1` |
| Decode | Batch size step | `VLLM_DECODE_BS_BUCKET_STEP` | `min(max_num_seqs, 32)` |
| Decode | Batch size max | `VLLM_DECODE_BS_BUCKET_MAX` | `max_num_seqs` |
| Decode | Sequence length min | `VLLM_DECODE_BLOCK_BUCKET_MIN` | `block_size` |
| Decode | Sequence length step | `VLLM_DECODE_BLOCK_BUCKET_STEP` | `block_size` |
| Decode | Sequence length max | `VLLM_DECODE_BLOCK_BUCKET_MAX` | `max(128, (max_num_seqs*max_model_len)/block_size)` |
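
As a rough illustration, the default expressions listed above can be evaluated for a concrete configuration. This is a minimal sketch, not vLLM's implementation: the helper name `default_bucket_settings` is hypothetical, the example values (`max_num_seqs=4`, `block_size=128`, `max_model_len=2048`) are arbitrary, and the use of integer division for the decode block maximum is an assumption (the documented expression uses `/`).

```python
def default_bucket_settings(max_num_seqs: int, block_size: int, max_model_len: int) -> dict:
    """Evaluate the documented default expressions.

    Hypothetical helper for illustration only; it is not part of vLLM.
    """
    return {
        # Prompt phase
        "VLLM_PROMPT_BS_BUCKET_MIN": 1,
        "VLLM_PROMPT_BS_BUCKET_STEP": min(max_num_seqs, 32),
        "VLLM_PROMPT_BS_BUCKET_MAX": min(max_num_seqs, 64),
        "VLLM_PROMPT_SEQ_BUCKET_MIN": block_size,
        "VLLM_PROMPT_SEQ_BUCKET_STEP": block_size,
        "VLLM_PROMPT_SEQ_BUCKET_MAX": max_model_len,
        # Decode phase
        "VLLM_DECODE_BS_BUCKET_MIN": 1,
        "VLLM_DECODE_BS_BUCKET_STEP": min(max_num_seqs, 32),
        "VLLM_DECODE_BS_BUCKET_MAX": max_num_seqs,
        "VLLM_DECODE_BLOCK_BUCKET_MIN": block_size,
        "VLLM_DECODE_BLOCK_BUCKET_STEP": block_size,
        # Assumption: integer division, since a block count should be whole.
        "VLLM_DECODE_BLOCK_BUCKET_MAX": max(128, (max_num_seqs * max_model_len) // block_size),
    }


settings = default_bucket_settings(max_num_seqs=4, block_size=128, max_model_len=2048)
for name, value in settings.items():
    print(f"{name}={value}")
```

Printing the values in `NAME=value` form makes it easy to compare them with any overrides exported in the environment before launching the server.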
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution: Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
...@@ -401,4 +401,3 @@ the below: ...@@ -401,4 +401,3 @@ the below:
higher batches. You can do that by adding the `--enforce-eager` flag to the higher batches. You can do that by adding the `--enforce-eager` flag to the
server (for online serving), or by passing the `enforce_eager=True` server (for online serving), or by passing the `enforce_eager=True`
argument to the LLM constructor (for offline inference). argument to the LLM constructor (for offline inference).
# --8<-- [end:extra-information]
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands: It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console ```bash
uv venv --python 3.12 --seed uv venv --python 3.12 --seed
source .venv/bin/activate source .venv/bin/activate
``` ```
...@@ -19,7 +19,7 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/ ...@@ -19,7 +19,7 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands: It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console ```bash
uv venv --python 3.12 --seed uv venv --python 3.12 --seed
source .venv/bin/activate source .venv/bin/activate
uv pip install vllm --torch-backend=auto uv pip install vllm --torch-backend=auto
...@@ -29,13 +29,13 @@ uv pip install vllm --torch-backend=auto ...@@ -29,13 +29,13 @@ uv pip install vllm --torch-backend=auto
Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment: Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
```console ```bash
uv run --with vllm vllm --help uv run --with vllm vllm --help
``` ```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment. You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
```console ```bash
conda create -n myenv python=3.12 -y conda create -n myenv python=3.12 -y
conda activate myenv conda activate myenv
pip install --upgrade uv pip install --upgrade uv
...@@ -61,7 +61,8 @@ from vllm import LLM, SamplingParams ...@@ -61,7 +61,8 @@ from vllm import LLM, SamplingParams
``` ```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params]. The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
!!! warning
!!! important
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified. By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance. However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
...@@ -109,21 +110,21 @@ By default, it starts the server at `http://localhost:8000`. You can specify the ...@@ -109,21 +110,21 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model: Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console ```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct vllm serve Qwen/Qwen2.5-1.5B-Instruct
``` ```
!!! note !!! note
By default, the server uses a predefined chat template stored in the tokenizer. By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template]. You can learn about overriding it [here][chat-template].
!!! warning !!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator. By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server. To disable this behavior, please pass `--generation-config vllm` when launching the server.
This server can be queried in the same format as OpenAI API. For example, to list the models: This server can be queried in the same format as OpenAI API. For example, to list the models:
```console ```bash
curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
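
The same model-listing request can also be prepared with Python's standard library. The following is a minimal sketch under stated assumptions: the helper name `build_models_request` is hypothetical, the base URL is the default address mentioned in this document, and sending an API key as an OpenAI-style `Authorization: Bearer` header is an assumption about how the server expects it.

```python
import urllib.request
from typing import Optional


def build_models_request(
    base_url: str = "http://localhost:8000/v1",
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a GET request for the /models endpoint.

    Hypothetical helper for illustration; not part of vLLM or the openai package.
    """
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the key is passed as an OpenAI-style bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(f"{base_url}/models", headers=headers, method="GET")


request = build_models_request()
print(request.get_method(), request.full_url)

# To actually send it (requires a running server):
#   import json
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Building the request separately from sending it keeps the sketch runnable without a live server.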
...@@ -133,7 +134,7 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` ...@@ -133,7 +134,7 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
Once your server is started, you can query the model with input prompts: Once your server is started, you can query the model with input prompts:
```console ```bash
curl http://localhost:8000/v1/completions \ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
...@@ -146,20 +147,22 @@ curl http://localhost:8000/v1/completions \ ...@@ -146,20 +147,22 @@ curl http://localhost:8000/v1/completions \
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package: Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
```python ??? Code
from openai import OpenAI
```python
# Modify OpenAI's API key and API base to use vLLM's API server. from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1" # Modify OpenAI's API key and API base to use vLLM's API server.
client = OpenAI( openai_api_key = "EMPTY"
api_key=openai_api_key, openai_api_base = "http://localhost:8000/v1"
base_url=openai_api_base, client = OpenAI(
) api_key=openai_api_key,
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", base_url=openai_api_base,
prompt="San Francisco is a") )
print("Completion result:", completion) completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
``` prompt="San Francisco is a")
print("Completion result:", completion)
```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py> A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
...@@ -169,7 +172,7 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter ...@@ -169,7 +172,7 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model: You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
```console ```bash
curl http://localhost:8000/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
...@@ -183,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \ ...@@ -183,26 +186,28 @@ curl http://localhost:8000/v1/chat/completions \
Alternatively, you can use the `openai` Python package: Alternatively, you can use the `openai` Python package:
```python ??? Code
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server. ```python
openai_api_key = "EMPTY" from openai import OpenAI
openai_api_base = "http://localhost:8000/v1" # Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
client = OpenAI( openai_api_base = "http://localhost:8000/v1"
api_key=openai_api_key,
base_url=openai_api_base, client = OpenAI(
) api_key=openai_api_key,
base_url=openai_api_base,
chat_response = client.chat.completions.create( )
model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[ chat_response = client.chat.completions.create(
{"role": "system", "content": "You are a helpful assistant."}, model="Qwen/Qwen2.5-1.5B-Instruct",
{"role": "user", "content": "Tell me a joke."}, messages=[
] {"role": "system", "content": "You are a helpful assistant."},
) {"role": "user", "content": "Tell me a joke."},
print("Chat response:", chat_response) ]
``` )
print("Chat response:", chat_response)
```
## On Attention Backends ## On Attention Backends
......
/**
* edit_and_feedback.js
*
* Enhances MkDocs Material docs pages by:
*
* 1. Adding a "Question? Give us feedback" link
* below the "Edit" button.
*
* - The link opens a GitHub issue with a template,
* auto-filled with the current page URL and path.
*
* 2. Ensuring the edit button opens in a new tab
* with target="_blank" and rel="noopener".
*/
document.addEventListener("DOMContentLoaded", function () {
const url = window.location.href;
const page = document.body.dataset.mdUrl || location.pathname;
const feedbackLink = document.createElement("a");
feedbackLink.href = `https://github.com/vllm-project/vllm/issues/new?template=100-documentation.yml&title=${encodeURIComponent(
`[Docs] Feedback for \`${page}\``
)}&body=${encodeURIComponent(`📄 **Reference:**\n${url}\n\n📝 **Feedback:**\n_Your response_`)}`;
feedbackLink.target = "_blank";
feedbackLink.rel = "noopener";
feedbackLink.title = "Provide feedback";
feedbackLink.className = "md-content__button";
feedbackLink.innerHTML = `
<svg
xmlns="http://www.w3.org/2000/svg"
height="24px"
viewBox="0 -960 960 960"
width="24px"
fill="currentColor"
>
<path d="M280-280h280v-80H280v80Zm0-160h400v-80H280v80Zm0-160h400v-80H280v80Zm-80 480q-33 0-56.5-23.5T120-200v-560q0-33 23.5-56.5T200-840h560q33 0 56.5 23.5T840-760v560q0 33-23.5 56.5T760-120H200Zm0-80h560v-560H200v560Zm0-560v560-560Z"/>
</svg>
`;
const editButton = document.querySelector('.md-content__button[href*="edit"]');
if (editButton && editButton.parentNode) {
editButton.insertAdjacentElement("beforebegin", feedbackLink);
editButton.setAttribute("target", "_blank");
editButton.setAttribute("rel", "noopener");
}
});
/**
* slack_and_forum.js
*
* Adds a custom Slack and Forum button to the MkDocs Material header.
*
*/
window.addEventListener('DOMContentLoaded', () => {
const headerInner = document.querySelector('.md-header__inner');
if (headerInner) {
const slackButton = document.createElement('button');
slackButton.className = 'slack-button';
slackButton.title = 'Join us on Slack';
slackButton.style.border = 'none';
slackButton.style.background = 'transparent';
slackButton.style.cursor = 'pointer';
slackButton.innerHTML = `
<img src="https://a.slack-edge.com/80588/marketing/img/icons/icon_slack_hash_colored.png"
style="height: 1.1rem;"
alt="Slack">
`;
slackButton.addEventListener('click', () => {
window.open('https://slack.vllm.ai', '_blank', 'noopener');
});
const forumButton = document.createElement('button');
forumButton.className = 'forum-button';
forumButton.title = 'Join the Forum';
forumButton.style.border = 'none';
forumButton.style.background = 'transparent';
forumButton.style.cursor = 'pointer';
forumButton.innerHTML = `
<svg
xmlns="http://www.w3.org/2000/svg"
viewBox="0 -960 960 960"
fill="currentColor"
>
<path d="M817.85-198.15 698.46-317.54H320q-24.48 0-41.47-16.99T261.54-376v-11.69h424.61q25.39 0 43.47-18.08 18.07-18.08 18.07-43.46v-268.92h11.69q24.48 0 41.47 16.99 17 16.99 17 41.47v461.54ZM179.08-434.69l66.84-66.85h363.31q10.77 0 17.69-6.92 6.93-6.92 6.93-17.69v-246.77q0-10.77-6.93-17.7-6.92-6.92-17.69-6.92H203.69q-10.77 0-17.69 6.92-6.92 6.93-6.92 17.7v338.23Zm-36.93 89.46v-427.69q0-25.39 18.08-43.46 18.08-18.08 43.46-18.08h405.54q25.39 0 43.46 18.08 18.08 18.07 18.08 43.46v246.77q0 25.38-18.08 43.46-18.07 18.07-43.46 18.07H261.54L142.15-345.23Zm36.93-180.92V-797.54v271.39Z"/>
</svg>
`;
forumButton.addEventListener('click', () => {
window.open('https://discuss.vllm.ai/', '_blank', 'noopener');
});
const githubSource = document.querySelector('.md-header__source');
if (githubSource) {
githubSource.parentNode.insertBefore(slackButton, githubSource.nextSibling);
githubSource.parentNode.insertBefore(forumButton, slackButton.nextSibling);
}
}
});
...@@ -34,3 +34,112 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link . ...@@ -34,3 +34,112 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
color: rgba(255, 255, 255, 0.75) !important; color: rgba(255, 255, 255, 0.75) !important;
font-weight: 700; font-weight: 700;
} }
/* Custom admonitions */
:root {
--md-admonition-icon--announcement: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M3.25 9a.75.75 0 0 1 .75.75c0 2.142.456 3.828.733 4.653a.122.122 0 0 0 .05.064.212.212 0 0 0 .117.033h1.31c.085 0 .18-.042.258-.152a.45.45 0 0 0 .075-.366A16.743 16.743 0 0 1 6 9.75a.75.75 0 0 1 1.5 0c0 1.588.25 2.926.494 3.85.293 1.113-.504 2.4-1.783 2.4H4.9c-.686 0-1.35-.41-1.589-1.12A16.4 16.4 0 0 1 2.5 9.75.75.75 0 0 1 3.25 9Z"></path><path d="M0 6a4 4 0 0 1 4-4h2.75a.75.75 0 0 1 .75.75v6.5a.75.75 0 0 1-.75.75H4a4 4 0 0 1-4-4Zm4-2.5a2.5 2.5 0 1 0 0 5h2v-5Z"></path><path d="M15.59.082A.75.75 0 0 1 16 .75v10.5a.75.75 0 0 1-1.189.608l-.002-.001h.001l-.014-.01a5.775 5.775 0 0 0-.422-.25 10.63 10.63 0 0 0-1.469-.64C11.576 10.484 9.536 10 6.75 10a.75.75 0 0 1 0-1.5c2.964 0 5.174.516 6.658 1.043.423.151.787.302 1.092.443V2.014c-.305.14-.669.292-1.092.443C11.924 2.984 9.713 3.5 6.75 3.5a.75.75 0 0 1 0-1.5c2.786 0 4.826-.484 6.155-.957.665-.236 1.154-.47 1.47-.64.144-.077.284-.161.421-.25l.014-.01a.75.75 0 0 1 .78-.061Z"></path></svg>');
--md-admonition-icon--important: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M4.47.22A.749.749 0 0 1 5 0h6c.199 0 .389.079.53.22l4.25 4.25c.141.14.22.331.22.53v6a.749.749 0 0 1-.22.53l-4.25 4.25A.749.749 0 0 1 11 16H5a.749.749 0 0 1-.53-.22L.22 11.53A.749.749 0 0 1 0 11V5c0-.199.079-.389.22-.53Zm.84 1.28L1.5 5.31v5.38l3.81 3.81h5.38l3.81-3.81V5.31L10.69 1.5ZM8 4a.75.75 0 0 1 .75.75v3.5a.75.75 0 0 1-1.5 0v-3.5A.75.75 0 0 1 8 4Zm0 8a1 1 0 1 1 0-2 1 1 0 0 1 0 2Z"></path></svg>');
}
.md-typeset .admonition.announcement,
.md-typeset details.announcement {
border-color: rgb(255, 110, 66);
}
.md-typeset .admonition.important,
.md-typeset details.important {
border-color: rgb(239, 85, 82);
}
.md-typeset .announcement > .admonition-title,
.md-typeset .announcement > summary {
background-color: rgb(255, 110, 66, 0.1);
}
.md-typeset .important > .admonition-title,
.md-typeset .important > summary {
background-color: rgb(239, 85, 82, 0.1);
}
.md-typeset .announcement > .admonition-title::before,
.md-typeset .announcement > summary::before {
background-color: rgb(239, 85, 82);
-webkit-mask-image: var(--md-admonition-icon--announcement);
mask-image: var(--md-admonition-icon--announcement);
}
.md-typeset .important > .admonition-title::before,
.md-typeset .important > summary::before {
background-color: rgb(239, 85, 82);
-webkit-mask-image: var(--md-admonition-icon--important);
mask-image: var(--md-admonition-icon--important);
}
/* Make label fully visible on hover */
.md-content__button[href*="edit"]:hover::after {
opacity: 1;
}
/* Hide edit button on generated docs/examples pages */
@media (min-width: 960px) {
.md-content__button[href*="docs/examples/"] {
display: none !important;
}
}
.md-content__button-wrapper {
position: absolute;
top: 0.6rem;
right: 0.8rem;
display: flex;
flex-direction: row;
align-items: center;
gap: 0.4rem;
z-index: 1;
}
.md-content__button-wrapper a {
display: inline-flex;
align-items: center;
justify-content: center;
height: 24px;
width: 24px;
color: var(--md-default-fg-color);
text-decoration: none;
}
.md-content__button-wrapper a:hover {
color: var(--md-accent-fg-color);
}
/* Slack and Forum css */
.slack-button,
.forum-button {
display: inline-flex;
align-items: center;
justify-content: center;
margin-left: 0.4rem;
height: 24px;
}
.slack-button img {
height: 18px;
filter: none !important;
}
.slack-button:hover,
.forum-button:hover {
opacity: 0.7;
}
.forum-button svg {
height: 28px;
opacity: 0.9;
transform: translateY(2px);
}
/* For logo css */
[data-md-color-scheme="default"] .logo-dark {
display: none;
}
[data-md-color-scheme="slate"] .logo-light {
display: none;
}
...@@ -9,27 +9,27 @@ Further reading can be found in [Run:ai Model Streamer Documentation](https://gi ...@@ -9,27 +9,27 @@ Further reading can be found in [Run:ai Model Streamer Documentation](https://gi
vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer. vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer.
You first need to install the vLLM Run:ai optional dependency: You first need to install the vLLM Run:ai optional dependency:
```console ```bash
pip3 install vllm[runai] pip3 install vllm[runai]
``` ```
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag: To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console ```bash
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer --load-format runai_streamer
``` ```
To run a model from an AWS S3 object store, run: To run a model from an AWS S3 object store, run:
```console ```bash
vllm serve s3://core-llm/Llama-3-8b \ vllm serve s3://core-llm/Llama-3-8b \
--load-format runai_streamer --load-format runai_streamer
``` ```
To run a model from an S3-compatible object store, run: To run a model from an S3-compatible object store, run:
```console ```bash
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \ RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
AWS_EC2_METADATA_DISABLED=true \ AWS_EC2_METADATA_DISABLED=true \
AWS_ENDPOINT_URL=https://storage.googleapis.com \ AWS_ENDPOINT_URL=https://storage.googleapis.com \
...@@ -44,7 +44,7 @@ You can tune parameters using `--model-loader-extra-config`: ...@@ -44,7 +44,7 @@ You can tune parameters using `--model-loader-extra-config`:
You can tune `concurrency` that controls the level of concurrency and number of OS threads reading tensors from the file to the CPU buffer. You can tune `concurrency` that controls the level of concurrency and number of OS threads reading tensors from the file to the CPU buffer.
For reading from S3, it will be the number of client instances the host is opening to the S3 server. For reading from S3, it will be the number of client instances the host is opening to the S3 server.
```console ```bash
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer \ --load-format runai_streamer \
--model-loader-extra-config '{"concurrency":16}' --model-loader-extra-config '{"concurrency":16}'
...@@ -53,7 +53,7 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \ ...@@ -53,7 +53,7 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size. You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit). You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
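
Because `--model-loader-extra-config` takes a JSON string, it can help to generate the value rather than type it by hand. The sketch below is illustrative only: the helper name `extra_config_arg` is hypothetical, and it assumes `memory_limit` is expressed in bytes, which is consistent with a value such as `5368709120` (5 × 1024³).

```python
import json
import shlex
from typing import Optional

GIB = 1024 ** 3  # bytes per GiB


def extra_config_arg(
    concurrency: Optional[int] = None,
    memory_limit_gib: Optional[int] = None,
) -> str:
    """Return a shell-quoted JSON value for --model-loader-extra-config.

    Hypothetical helper for illustration; not part of vLLM.
    """
    config = {}
    if concurrency is not None:
        config["concurrency"] = concurrency
    if memory_limit_gib is not None:
        # Assumption: the limit is given in bytes.
        config["memory_limit"] = memory_limit_gib * GIB
    # Compact separators keep the JSON free of spaces; shlex.quote adds shell quoting.
    return shlex.quote(json.dumps(config, separators=(",", ":")))


print("--model-loader-extra-config", extra_config_arg(memory_limit_gib=5))
```

Generating the value this way avoids miscounting digits when converting a limit in GiB to bytes.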
```console ```bash
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer \ --load-format runai_streamer \
--model-loader-extra-config '{"memory_limit":5368709120}' --model-loader-extra-config '{"memory_limit":5368709120}'
...@@ -66,13 +66,13 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \ ...@@ -66,13 +66,13 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
vLLM also supports loading sharded models using Run:ai Model Streamer. This is particularly useful for large models that are split across multiple files. To use this feature, use the `--load-format runai_streamer_sharded` flag: vLLM also supports loading sharded models using Run:ai Model Streamer. This is particularly useful for large models that are split across multiple files. To use this feature, use the `--load-format runai_streamer_sharded` flag:
```console ```bash
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
``` ```
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`: The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
```console ```bash
vllm serve /path/to/sharded/model \ vllm serve /path/to/sharded/model \
--load-format runai_streamer_sharded \ --load-format runai_streamer_sharded \
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}' --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
...@@ -82,7 +82,7 @@ To create sharded model files, you can use the script provided in <gh-file:examp ...@@ -82,7 +82,7 @@ To create sharded model files, you can use the script provided in <gh-file:examp
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way: The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
```console ```bash
vllm serve /path/to/sharded/model \ vllm serve /path/to/sharded/model \
--load-format runai_streamer_sharded \ --load-format runai_streamer_sharded \
--model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}' --model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
......
...@@ -51,7 +51,7 @@ for output in outputs: ...@@ -51,7 +51,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
!!! warning !!! important
By default, vLLM will use sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified. By default, vLLM will use sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance. However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
...@@ -81,39 +81,41 @@ The [chat][vllm.LLM.chat] method implements chat functionality on top of [genera ...@@ -81,39 +81,41 @@ The [chat][vllm.LLM.chat] method implements chat functionality on top of [genera
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt. and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
!!! warning !!! important
In general, only instruction-tuned models have a chat template. In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation. Base models may perform poorly as they are not trained to respond to the chat conversation.
```python ??? Code
from vllm import LLM
```python
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") from vllm import LLM
conversation = [
{ llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
"role": "system", conversation = [
"content": "You are a helpful assistant" {
}, "role": "system",
{ "content": "You are a helpful assistant"
"role": "user", },
"content": "Hello" {
}, "role": "user",
{ "content": "Hello"
"role": "assistant", },
"content": "Hello! How can I assist you today?" {
}, "role": "assistant",
{ "content": "Hello! How can I assist you today?"
"role": "user", },
"content": "Write an essay about the importance of higher education.", {
}, "role": "user",
] "content": "Write an essay about the importance of higher education.",
outputs = llm.chat(conversation) },
]
for output in outputs: outputs = llm.chat(conversation)
prompt = output.prompt
generated_text = output.outputs[0].text for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") prompt = output.prompt
``` generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py> A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
...@@ -132,7 +134,7 @@ outputs = llm.chat(conversation, chat_template=custom_template) ...@@ -132,7 +134,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Serving ## Online Serving
Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text. - [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template. - [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.