Commit dd6a3a02, authored by Harry Mellor, committed by GitHub

[Doc] Convert docs to use colon fences (#12471)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent a7e3eba6
...@@ -47,10 +47,10 @@ When you request queued resources, the request is added to a queue maintained by ...@@ -47,10 +47,10 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use. assigned to your Google Cloud project for your immediate exclusive use.
```{note} :::{note}
In all of the following commands, replace the ALL CAPS parameter names with In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information. appropriate values. See the parameter descriptions table for more information.
``` :::
### Provision Cloud TPUs with GKE ### Provision Cloud TPUs with GKE
...@@ -75,33 +75,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \ ...@@ -75,33 +75,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT --service-account SERVICE_ACCOUNT
``` ```
```{list-table} Parameter descriptions :::{list-table} Parameter descriptions
:header-rows: 1 :header-rows: 1
* - Parameter name - * Parameter name
- Description * Description
* - QUEUED_RESOURCE_ID - * QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request. * The user-assigned ID of the queued resource request.
* - TPU_NAME - * TPU_NAME
- The user-assigned name of the TPU which is created when the queued * The user-assigned name of the TPU which is created when the queued
resource request is allocated. resource request is allocated.
* - PROJECT_ID - * PROJECT_ID
- Your Google Cloud project * Your Google Cloud project
* - ZONE - * ZONE
- The GCP zone where you want to create your Cloud TPU. The value you use * The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see depends on the version of TPUs you are using. For more information, see
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_ `TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
* - ACCELERATOR_TYPE - * ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, for example * The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information, `v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_. see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION - * RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_. * The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT - * SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM * The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example: Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
``` :::
Connect to your TPU using SSH: Connect to your TPU using SSH:
...@@ -178,15 +178,15 @@ Run the Docker image with the following command: ...@@ -178,15 +178,15 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu docker run --privileged --net host --shm-size=16G -it vllm-tpu
``` ```
```{note} :::{note}
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached on disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default). cached on disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
``` :::
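The cache location named in the note can be illustrated with a short sketch. It assumes the environment variable simply overrides the default directory; `resolve_xla_cache_path` is a hypothetical helper written for this illustration and is not part of vLLM.

```python
import os


def resolve_xla_cache_path(env=None):
    """Return the directory where compiled XLA graphs would be cached.

    Hypothetical helper: assumes VLLM_XLA_CACHE_PATH, when set, replaces
    the default of ~/.cache/vllm/xla_cache described in the note.
    """
    env = os.environ if env is None else env
    default = os.path.join("~", ".cache", "vllm", "xla_cache")
    # expanduser only rewrites a leading "~"; absolute paths pass through.
    return os.path.expanduser(env.get("VLLM_XLA_CACHE_PATH", default))
```

Pointing the variable at a persistent volume before `docker run` would let later containers reuse the cached graphs, provided the directory is actually mounted into the container.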
````{tip} :::{tip}
If you encounter the following error: If you encounter the following error:
```console ```console
...@@ -198,9 +198,10 @@ file or directory ...@@ -198,9 +198,10 @@ file or directory
Install OpenBLAS with the following command: Install OpenBLAS with the following command:
```console ```console
$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
``` ```
````
:::
## Extra information ## Extra information
......
...@@ -25,9 +25,9 @@ pip install -r requirements-cpu.txt ...@@ -25,9 +25,9 @@ pip install -r requirements-cpu.txt
pip install -e . pip install -e .
``` ```
```{note} :::{note}
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device. On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
``` :::
#### Troubleshooting #### Troubleshooting
......
...@@ -2,86 +2,86 @@ ...@@ -2,86 +2,86 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions: vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} x86 ::::{tab-item} x86
:sync: x86 :sync: x86
```{include} x86.inc.md :::{include} x86.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} ARM ::::
::::{tab-item} ARM
:sync: arm :sync: arm
```{include} arm.inc.md :::{include} arm.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} Apple silicon ::::
::::{tab-item} Apple silicon
:sync: apple :sync: apple
```{include} apple.inc.md :::{include} apple.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::: ::::
:::::
## Requirements ## Requirements
- Python: 3.9 -- 3.12 - Python: 3.9 -- 3.12
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} x86 ::::{tab-item} x86
:sync: x86 :sync: x86
```{include} x86.inc.md :::{include} x86.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} ARM ::::
::::{tab-item} ARM
:sync: arm :sync: arm
```{include} arm.inc.md :::{include} arm.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} Apple silicon ::::
::::{tab-item} Apple silicon
:sync: apple :sync: apple
```{include} apple.inc.md :::{include} apple.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::: ::::
:::::
## Set up using Python ## Set up using Python
### Create a new Python environment ### Create a new Python environment
```{include} ../python_env_setup.inc.md :::{include} ../python_env_setup.inc.md
``` :::
### Pre-built wheels ### Pre-built wheels
...@@ -89,41 +89,41 @@ Currently, there are no pre-built CPU wheels. ...@@ -89,41 +89,41 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source ### Build wheel from source
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} x86 ::::{tab-item} x86
:sync: x86 :sync: x86
```{include} x86.inc.md :::{include} x86.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} ARM ::::
::::{tab-item} ARM
:sync: arm :sync: arm
```{include} arm.inc.md :::{include} arm.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} Apple silicon ::::
::::{tab-item} Apple silicon
:sync: apple :sync: apple
```{include} apple.inc.md :::{include} apple.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::: ::::
:::::
## Set up using Docker ## Set up using Docker
### Pre-built images ### Pre-built images
...@@ -142,9 +142,9 @@ $ docker run -it \ ...@@ -142,9 +142,9 @@ $ docker run -it \
vllm-cpu-env vllm-cpu-env
``` ```
:::{tip} ::::{tip}
For ARM or Apple silicon, use `Dockerfile.arm` For ARM or Apple silicon, use `Dockerfile.arm`
::: ::::
## Supported features ## Supported features
......
...@@ -17,10 +17,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform, ...@@ -17,10 +17,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
:::{include} build.inc.md :::{include} build.inc.md
::: :::
```{note} :::{note}
- AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16. - AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
- To force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building. - To force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
``` :::
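To see ahead of time whether a host would qualify, the CPU flags can be inspected directly. The sketch below is not the build script's own logic; it assumes a Linux host where `/proc/cpuinfo` reports the feature as `avx512_bf16`, and `has_avx512_bf16` is a hypothetical helper.

```python
def has_avx512_bf16(cpuinfo_text):
    """Return True if a 'flags' line in cpuinfo text lists avx512_bf16."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The first flags line is enough; cores normally report the same set.
            return "avx512_bf16" in value.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(has_avx512_bf16(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this platform")
```

If the check prints `False` on the build machine but the target machine supports the extension, that is the cross-compilation case where `VLLM_CPU_AVX512BF16=1` applies.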
## Set up using Docker ## Set up using Docker
......
...@@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries. ...@@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
### Create a new Python environment ### Create a new Python environment
```{note} :::{note}
PyTorch installed via `conda` statically links the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details. PyTorch installed via `conda` statically links the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
``` :::
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations. In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
...@@ -100,10 +100,10 @@ pip install --editable . ...@@ -100,10 +100,10 @@ pip install --editable .
You can find more information about vLLM's wheels in <project:#install-the-latest-code>. You can find more information about vLLM's wheels in <project:#install-the-latest-code>.
```{note} :::{note}
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors. There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel. It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to <project:#install-the-latest-code> for instructions on how to install a specified wheel.
``` :::
#### Full build (with compilation) #### Full build (with compilation)
...@@ -115,7 +115,7 @@ cd vllm ...@@ -115,7 +115,7 @@ cd vllm
pip install -e . pip install -e .
``` ```
```{tip} :::{tip}
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results. Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` . For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
...@@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used ...@@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments. [sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`. The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
``` :::
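As a rough sketch, the `sccache` variables listed in the tip can be collected into an environment for a build subprocess. The values are copied from the tip; `build_env` is a hypothetical helper, and whether the build actually uses `sccache` still depends on the binary being installed and discoverable.

```python
import os

# Values taken from the tip above.
SCCACHE_SETTINGS = {
    "SCCACHE_BUCKET": "vllm-build-sccache",
    "SCCACHE_REGION": "us-west-2",
    "SCCACHE_S3_NO_CREDENTIALS": "1",
    "SCCACHE_IDLE_TIMEOUT": "0",
}


def build_env(base=None):
    """Return a copy of the base environment with the sccache settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(SCCACHE_SETTINGS)
    return env
```

The result could then be passed along, for example `subprocess.run(["pip", "install", "-e", "."], env=build_env())`, so that the current shell's environment is left unchanged.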
##### Use an existing PyTorch installation ##### Use an existing PyTorch installation
......
...@@ -2,299 +2,299 @@ ...@@ -2,299 +2,299 @@
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions: vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::: ::::
:::::
## Requirements ## Requirements
- OS: Linux - OS: Linux
- Python: 3.9 -- 3.12 - Python: 3.9 -- 3.12
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::: ::::
:::::
## Set up using Python ## Set up using Python
### Create a new Python environment ### Create a new Python environment
```{include} ../python_env_setup.inc.md :::{include} ../python_env_setup.inc.md
``` :::
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "## Create a new Python environment" :start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels" :end-before: "### Pre-built wheels"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
There is no extra information on creating a new Python environment for this device. There is no extra information on creating a new Python environment for this device.
::: ::::
:::{tab-item} XPU ::::{tab-item} XPU
:sync: xpu :sync: xpu
There is no extra information on creating a new Python environment for this device. There is no extra information on creating a new Python environment for this device.
:::
:::: ::::
:::::
### Pre-built wheels ### Pre-built wheels
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::: ::::
:::::
(build-from-source)= (build-from-source)=
### Build wheel from source ### Build wheel from source
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::: ::::
:::::
## Set up using Docker ## Set up using Docker
### Pre-built images ### Pre-built images
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::: ::::
:::::
### Build image from source ### Build image from source
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Supported features" :end-before: "## Supported features"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Supported features" :end-before: "## Supported features"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Supported features" :end-before: "## Supported features"
```
::: :::
:::: ::::
:::::
## Supported features ## Supported features
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} CUDA ::::{tab-item} CUDA
:sync: cuda :sync: cuda
```{include} cuda.inc.md :::{include} cuda.inc.md
:start-after: "## Supported features" :start-after: "## Supported features"
```
::: :::
:::{tab-item} ROCm ::::
::::{tab-item} ROCm
:sync: rocm :sync: rocm
```{include} rocm.inc.md :::{include} rocm.inc.md
:start-after: "## Supported features" :start-after: "## Supported features"
```
::: :::
:::{tab-item} XPU ::::
::::{tab-item} XPU
:sync: xpu :sync: xpu
```{include} xpu.inc.md :::{include} xpu.inc.md
:start-after: "## Supported features" :start-after: "## Supported features"
```
::: :::
:::: ::::
:::::
...@@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels. ...@@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels.
However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator. docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
```{tip} :::{tip}
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html) Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image. for instructions on how to use this prebuilt docker image.
``` :::
### Build wheel from source ### Build wheel from source
...@@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image. ...@@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image.
cd ../.. cd ../..
``` ```
```{note} :::{note}
- If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent. If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
``` :::
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile) 2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
...@@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image. ...@@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image.
cd .. cd ..
``` ```
```{note} :::{note}
- You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`) You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
``` :::
3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps: 3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
...@@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image. ...@@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation. This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
```{tip} <!--- pyml disable-num-lines 5 ul-indent-->
:::{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers. - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support. - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention. - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version. - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
``` :::
```{tip} :::{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level. - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization). For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
``` :::
## Set up using Docker ## Set up using Docker
......
...@@ -30,10 +30,10 @@ pip install -v -r requirements-xpu.txt ...@@ -30,10 +30,10 @@ pip install -v -r requirements-xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install VLLM_TARGET_DEVICE=xpu python setup.py install
``` ```
```{note} :::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data - FP16 is the default data type in the current XPU backend. The BF16 data
type will be supported in the future. type will be supported in the future.
``` :::
## Set up using Docker ## Set up using Docker
......
...@@ -4,10 +4,10 @@ ...@@ -4,10 +4,10 @@
vLLM supports the following hardware platforms: vLLM supports the following hardware platforms:
```{toctree} :::{toctree}
:maxdepth: 1 :maxdepth: 1
gpu/index gpu/index
cpu/index cpu/index
ai_accelerator/index ai_accelerator/index
``` :::
...@@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y ...@@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y
conda activate myenv conda activate myenv
``` ```
```{note} :::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create a Python environment rather than to install packages. [PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create a Python environment rather than to install packages.
``` :::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command: Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
......
...@@ -32,9 +32,9 @@ conda activate myenv ...@@ -32,9 +32,9 @@ conda activate myenv
pip install vllm pip install vllm
``` ```
```{note} :::{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM. For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
``` :::
(quickstart-offline)= (quickstart-offline)=
...@@ -69,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model]( ...@@ -69,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
llm = LLM(model="facebook/opt-125m") llm = LLM(model="facebook/opt-125m")
``` ```
```{note} :::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine. By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
``` :::
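The ordering requirement in the note (set the variable before the engine is initialized) can be sketched as follows. The value `"True"` is an assumption, since the note only says the variable must be set; `enable_modelscope` is a hypothetical helper, and the vLLM lines are left as comments so the snippet runs without vLLM installed.

```python
import os


def enable_modelscope(env=None):
    """Set VLLM_USE_MODELSCOPE in the given mapping (os.environ by default)."""
    env = os.environ if env is None else env
    env["VLLM_USE_MODELSCOPE"] = "True"  # assumed value; the note does not specify one
    return env


enable_modelscope()

# Only after the variable is set would the engine be created, for example:
# from vllm import LLM
# llm = LLM(model=...)  # the model ID would need to exist on ModelScope
```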
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens. Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
...@@ -97,10 +97,10 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru ...@@ -97,10 +97,10 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
vllm serve Qwen/Qwen2.5-1.5B-Instruct vllm serve Qwen/Qwen2.5-1.5B-Instruct
``` ```
```{note} :::{note}
By default, the server uses a predefined chat template stored in the tokenizer. By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template). You can learn about overriding it [here](#chat-template).
``` :::
This server can be queried in the same format as the OpenAI API. For example, to list the models: This server can be queried in the same format as the OpenAI API. For example, to list the models:
......
...@@ -4,9 +4,9 @@ ...@@ -4,9 +4,9 @@
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
```{note} :::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated. Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
``` :::
## Hangs downloading a model ## Hangs downloading a model
...@@ -18,9 +18,9 @@ It's recommended to download the model first using the [huggingface-cli](https:/ ...@@ -18,9 +18,9 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow. If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It is better to store the model on a local disk. Additionally, check the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory. It is better to store the model on a local disk. Additionally, check the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
```{note} :::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck. To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
``` :::
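One way to apply the note above is to build the serve command with and without `--load-format dummy` and compare startup times. The sketch below only assembles the command line; `serve_command` is a hypothetical helper and the model name is just an example.

```python
import shlex
from typing import List


def serve_command(model: str, skip_weights: bool = False) -> List[str]:
    """Build a `vllm serve` command, optionally skipping real weight loading."""
    cmd = ["vllm", "serve", model]
    if skip_weights:
        # Dummy weights take model downloading and loading out of the picture.
        cmd += ["--load-format", "dummy"]
    return cmd


print(shlex.join(serve_command("facebook/opt-125m", skip_weights=True)))
```

If startup is fast with dummy weights and slow without them, model downloading or loading is the likely bottleneck.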
## Out of memory ## Out of memory
...@@ -132,14 +132,14 @@ If the script runs successfully, you should see the message `sanity check is suc ...@@ -132,14 +132,14 @@ If the script runs successfully, you should see the message `sanity check is suc
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully. If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
```{note} :::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments: A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`. - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`. - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes. Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
``` :::
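The per-node commands in the note above differ only in `--node-rank`, so they can be generated rather than typed by hand. This is a small sketch; `torchrun_commands` is a hypothetical helper, and `$MASTER_ADDR` is left for the shell to expand on each node.

```python
from typing import List


def torchrun_commands(
    nnodes: int, nproc_per_node: int, script: str = "test.py"
) -> List[str]:
    """Return one torchrun command per node, indexed by node rank."""
    return [
        f"NCCL_DEBUG=TRACE torchrun --nnodes {nnodes} "
        f"--nproc-per-node={nproc_per_node} --node-rank {rank} "
        f"--master_addr $MASTER_ADDR {script}"
        for rank in range(nnodes)
    ]


for rank, command in enumerate(torchrun_commands(nnodes=2, nproc_per_node=2)):
    print(f"node {rank}: {command}")
```

Run the command printed for rank 0 on the first node and the one for rank 1 on the second node.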
(troubleshooting-python-multiprocessing)= (troubleshooting-python-multiprocessing)=
......
# Welcome to vLLM # Welcome to vLLM
```{figure} ./assets/logos/vllm-logo-text-light.png :::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center :align: center
:alt: vLLM :alt: vLLM
:class: no-scaled-link :class: no-scaled-link
:width: 60% :width: 60%
``` :::
```{raw} html :::{raw} html
<p style="text-align:center"> <p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone <strong>Easy, fast, and cheap LLM serving for everyone
</strong> </strong>
...@@ -19,7 +19,7 @@ ...@@ -19,7 +19,7 @@
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a> <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a> <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p> </p>
``` :::
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is a fast and easy-to-use library for LLM inference and serving.
...@@ -58,7 +58,7 @@ For more information, check out the following: ...@@ -58,7 +58,7 @@ For more information, check out the following:
% How to start using vLLM? % How to start using vLLM?
```{toctree} :::{toctree}
:caption: Getting Started :caption: Getting Started
:maxdepth: 1 :maxdepth: 1
...@@ -67,11 +67,11 @@ getting_started/quickstart ...@@ -67,11 +67,11 @@ getting_started/quickstart
getting_started/examples/examples_index getting_started/examples/examples_index
getting_started/troubleshooting getting_started/troubleshooting
getting_started/faq getting_started/faq
``` :::
% What does vLLM support? % What does vLLM support?
```{toctree} :::{toctree}
:caption: Models :caption: Models
:maxdepth: 1 :maxdepth: 1
...@@ -79,11 +79,11 @@ models/generative_models ...@@ -79,11 +79,11 @@ models/generative_models
models/pooling_models models/pooling_models
models/supported_models models/supported_models
models/extensions/index models/extensions/index
``` :::
% Additional capabilities % Additional capabilities
```{toctree} :::{toctree}
:caption: Features :caption: Features
:maxdepth: 1 :maxdepth: 1
...@@ -96,11 +96,11 @@ features/automatic_prefix_caching ...@@ -96,11 +96,11 @@ features/automatic_prefix_caching
features/disagg_prefill features/disagg_prefill
features/spec_decode features/spec_decode
features/compatibility_matrix features/compatibility_matrix
``` :::
% Details about running vLLM % Details about running vLLM
```{toctree} :::{toctree}
:caption: Inference and Serving :caption: Inference and Serving
:maxdepth: 1 :maxdepth: 1
...@@ -113,11 +113,11 @@ serving/engine_args ...@@ -113,11 +113,11 @@ serving/engine_args
serving/env_vars serving/env_vars
serving/usage_stats serving/usage_stats
serving/integrations/index serving/integrations/index
``` :::
% Scaling up vLLM for production % Scaling up vLLM for production
```{toctree} :::{toctree}
:caption: Deployment :caption: Deployment
:maxdepth: 1 :maxdepth: 1
...@@ -126,21 +126,21 @@ deployment/k8s ...@@ -126,21 +126,21 @@ deployment/k8s
deployment/nginx deployment/nginx
deployment/frameworks/index deployment/frameworks/index
deployment/integrations/index deployment/integrations/index
``` :::
% Making the most out of vLLM % Making the most out of vLLM
```{toctree} :::{toctree}
:caption: Performance :caption: Performance
:maxdepth: 1 :maxdepth: 1
performance/optimization performance/optimization
performance/benchmarks performance/benchmarks
``` :::
% Explanation of vLLM internals % Explanation of vLLM internals
```{toctree} :::{toctree}
:caption: Design Documents :caption: Design Documents
:maxdepth: 2 :maxdepth: 2
...@@ -151,11 +151,11 @@ design/kernel/paged_attention ...@@ -151,11 +151,11 @@ design/kernel/paged_attention
design/mm_processing design/mm_processing
design/automatic_prefix_caching design/automatic_prefix_caching
design/multiprocessing design/multiprocessing
``` :::
% How to contribute to the vLLM project % How to contribute to the vLLM project
```{toctree} :::{toctree}
:caption: Developer Guide :caption: Developer Guide
:maxdepth: 2 :maxdepth: 2
...@@ -164,11 +164,11 @@ contributing/profiling/profiling_index ...@@ -164,11 +164,11 @@ contributing/profiling/profiling_index
contributing/dockerfile/dockerfile contributing/dockerfile/dockerfile
contributing/model/index contributing/model/index
contributing/vulnerability_management contributing/vulnerability_management
``` :::
% Technical API specifications % Technical API specifications
```{toctree} :::{toctree}
:caption: API Reference :caption: API Reference
:maxdepth: 2 :maxdepth: 2
...@@ -177,18 +177,18 @@ api/engine/index ...@@ -177,18 +177,18 @@ api/engine/index
api/inference_params api/inference_params
api/multimodal/index api/multimodal/index
api/model/index api/model/index
``` :::
% Latest news and acknowledgements % Latest news and acknowledgements
```{toctree} :::{toctree}
:caption: Community :caption: Community
:maxdepth: 1 :maxdepth: 1
community/blog community/blog
community/meetups community/meetups
community/sponsors community/sponsors
``` :::
## Indices and tables ## Indices and tables
......
# Built-in Extensions # Built-in Extensions
```{toctree} :::{toctree}
:maxdepth: 1 :maxdepth: 1
runai_model_streamer runai_model_streamer
tensorizer tensorizer
``` :::
...@@ -48,6 +48,6 @@ You can read further about CPU buffer memory limiting [here](https://github.com/ ...@@ -48,6 +48,6 @@ You can read further about CPU buffer memory limiting [here](https://github.com/
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}' vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
``` ```
```{note} :::{note}
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md). For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
``` :::
...@@ -11,6 +11,6 @@ For more information on CoreWeave's Tensorizer, please refer to ...@@ -11,6 +11,6 @@ For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html). the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
```{note} :::{note}
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`. Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
``` :::
...@@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas ...@@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt. and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
```{important} :::{important}
In general, only instruction-tuned models have a chat template. In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation. Base models may perform poorly as they are not trained to respond to the chat conversation.
``` :::
```python ```python
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
......
...@@ -8,54 +8,54 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo ...@@ -8,54 +8,54 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo
These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them. before returning them.
```{note} :::{note}
We currently support pooling models primarily as a matter of convenience. We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much. pooling models as they only work on the generation or decode stage, so performance may not improve as much.
``` :::
For pooling models, we support the following `--task` options. For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states: The selected option sets the default pooler used to extract the final hidden states:
```{list-table} :::{list-table}
:widths: 50 25 25 25 :widths: 50 25 25 25
:header-rows: 1 :header-rows: 1
* - Task - * Task
- Pooling Type * Pooling Type
- Normalization * Normalization
- Softmax * Softmax
* - Embedding (`embed`) - * Embedding (`embed`)
- `LAST` * `LAST`
- ✅︎ * ✅︎
- *
* - Classification (`classify`) - * Classification (`classify`)
- `LAST` * `LAST`
- *
- ✅︎ * ✅︎
* - Sentence Pair Scoring (`score`) - * Sentence Pair Scoring (`score`)
- \* * \*
- \* * \*
- \* * \*
* - Reward Modeling (`reward`) - * Reward Modeling (`reward`)
- `ALL` * `ALL`
- *
- *
``` :::
\*The default pooler is always defined by the model. \*The default pooler is always defined by the model.
```{note} :::{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table. If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
``` :::
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`). we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
```{tip} :::{tip}
You can customize the model's pooling method via the `--override-pooler-config` option, You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults. which takes priority over both the model's and Sentence Transformers's defaults.
``` :::
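As a rough illustration of the tip above, the option can be passed as a JSON string on the command line. The key names `pooling_type` and `normalize` and the value `MEAN` are assumptions made for this sketch and are not confirmed by the text above; `override_pooler_args` is a hypothetical helper.

```python
import json
import shlex


def override_pooler_args(pooling_type: str, normalize: bool) -> str:
    """Format a pooler override as a shell-safe command-line fragment."""
    # Assumed keys; check the pooler configuration of your vLLM version.
    config = {"pooling_type": pooling_type, "normalize": normalize}
    return "--override-pooler-config " + shlex.quote(json.dumps(config))


print(override_pooler_args("MEAN", normalize=True))
```

Quoting the JSON with `shlex.quote` keeps the braces and double quotes intact when the fragment is pasted into a shell.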
## Offline Inference ## Offline Inference
...@@ -111,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p ...@@ -111,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p
It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html). It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
These types of models serve as rerankers between candidate query-document pairs in RAG systems. These types of models serve as rerankers between candidate query-document pairs in RAG systems.
```{note} :::{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG. vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain). To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
``` :::
```python ```python
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score") llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
......
This diff is collapsed.
...@@ -14,9 +14,9 @@ In short, you should increase the number of GPUs and the number of nodes until y ...@@ -14,9 +14,9 @@ In short, you should increase the number of GPUs and the number of nodes until y
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough. After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
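The estimate above is a single multiplication. The sketch below reads the block count from a log line of the form quoted above and multiplies it by the block size of 16; both function names are hypothetical.

```python
import re
from typing import Optional

BLOCK_SIZE = 16  # tokens per block, as stated above


def parse_gpu_blocks(log_line: str) -> Optional[int]:
    """Extract the block count from a line such as '# GPU blocks: 790'."""
    match = re.search(r"# GPU blocks: (\d+)", log_line)
    return int(match.group(1)) if match else None


def max_servable_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Rough upper bound on tokens that fit in the GPU blocks."""
    return num_gpu_blocks * block_size


blocks = parse_gpu_blocks("# GPU blocks: 790")
if blocks is not None:
    print(max_servable_tokens(blocks))  # 790 * 16 = 12640
```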
```{note} :::{note}
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs. There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
``` :::
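The edge case in the note above can be written as a small decision rule. This sketch assumes that "divides the model size evenly" can be approximated by checking whether the number of attention heads is divisible by the number of GPUs. That criterion is an assumption made for illustration, and `choose_parallel_sizes` is a hypothetical helper.

```python
from typing import Tuple


def choose_parallel_sizes(num_gpus: int, num_attention_heads: int) -> Tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size) for one node."""
    if num_attention_heads % num_gpus == 0:
        # Even split: use tensor parallelism across all GPUs.
        return num_gpus, 1
    # Uneven split: fall back to pipeline parallelism, which splits by layers.
    return 1, num_gpus


tp, pp = choose_parallel_sizes(num_gpus=6, num_attention_heads=32)
print(tp, pp)
# The result would then be passed to the engine, for example:
# LLM(model=..., tensor_parallel_size=tp, pipeline_parallel_size=pp)
```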
## Running vLLM on a single node ## Running vLLM on a single node
...@@ -94,12 +94,12 @@ vllm serve /path/to/the/model/in/the/container \ ...@@ -94,12 +94,12 @@ vllm serve /path/to/the/model/in/the/container \
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient. To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
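The log check described above can be automated with a plain substring search. The sketch below looks only for the two markers quoted in the paragraph; `nccl_transport` and its return labels are hypothetical, and real logs may contain other transport lines that this does not cover.

```python
def nccl_transport(log_text: str) -> str:
    """Classify the cross-node transport seen in NCCL_DEBUG=TRACE output."""
    # Report the slow path first: any TCP socket line is worth investigating.
    if "[send] via NET/Socket" in log_text:
        return "tcp-socket"
    if "[send] via NET/IB/GDRDMA" in log_text:
        return "infiniband-gdrdma"
    return "unknown"


sample = "node0:123 [0] NCCL INFO Channel 00 : 0[0] -> 1[0] [send] via NET/Socket/0"
print(nccl_transport(sample))
```

Checking for the socket marker first means a log that mixes both transports is still flagged as the inefficient case.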
```{warning} :::{warning}
After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information. After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
``` :::
```{warning} :::{warning}
Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes. Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model. When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
``` :::