Commit 4ffd6e89 authored by Harry Mellor, committed by GitHub
[Docs] Reduce custom syntax used in docs (#27009)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 965c5f49
...@@ -4,19 +4,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C ...@@ -4,19 +4,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86" === "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation" --8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"
=== "ARM AArch64" === "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation" --8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"
=== "Apple silicon" === "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation" --8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"
=== "IBM Z (S390X)" === "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation" --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
## Requirements ## Requirements
...@@ -24,19 +24,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C ...@@ -24,19 +24,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86" === "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements" --8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"
=== "ARM AArch64" === "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements" --8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"
=== "Apple silicon" === "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements" --8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"
=== "IBM Z (S390X)" === "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements" --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"
## Set up using Python ## Set up using Python
...@@ -52,19 +52,19 @@ Currently, there are no pre-built CPU wheels. ...@@ -52,19 +52,19 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86" === "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"
=== "ARM AArch64" === "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"
=== "Apple silicon" === "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"
=== "IBM Z (s390x)" === "IBM Z (s390x)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"
## Set up using Docker ## Set up using Docker
...@@ -72,24 +72,24 @@ Currently, there are no pre-built CPU wheels. ...@@ -72,24 +72,24 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86" === "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images" --8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
### Build image from source ### Build image from source
=== "Intel/AMD x86" === "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"
=== "ARM AArch64" === "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "Apple silicon" === "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "IBM Z (S390X)" === "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"
## Related runtime environment variables ## Related runtime environment variables
......
...@@ -157,7 +157,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i ...@@ -157,7 +157,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
### Build image from source ### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support. You can use [docker/Dockerfile.tpu](../../../docker/Dockerfile.tpu) to build a Docker image with TPU support.
```bash ```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu . docker build -f docker/Dockerfile.tpu -t vllm-tpu .
......
...@@ -11,7 +11,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries. ...@@ -11,7 +11,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [start:set-up-using-python] # --8<-- [start:set-up-using-python]
!!! note !!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details. PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations. In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
......
...@@ -4,15 +4,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -4,15 +4,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:installation"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:installation"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
## Requirements ## Requirements
...@@ -24,15 +24,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -24,15 +24,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:requirements"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:requirements"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
## Set up using Python ## Set up using Python
...@@ -42,29 +42,29 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -42,29 +42,29 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:set-up-using-python" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:set-up-using-python"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:set-up-using-python"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
### Pre-built wheels ### Pre-built wheels
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-wheels"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-wheels"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
[](){ #build-from-source } [](){ #build-from-source }
...@@ -72,15 +72,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -72,15 +72,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-wheel-from-source"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-wheel-from-source"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
## Set up using Docker ## Set up using Docker
...@@ -88,40 +88,40 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -88,40 +88,40 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-images"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
### Build image from source ### Build image from source
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-image-from-source"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-image-from-source"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
## Supported features ## Supported features
=== "NVIDIA CUDA" === "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features" --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:supported-features"
=== "AMD ROCm" === "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features" --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:supported-features"
=== "Intel XPU" === "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
...@@ -146,7 +146,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO ...@@ -146,7 +146,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO
#### (Optional) Build an image with ROCm software stack #### (Optional) Build an image with ROCm software stack
Build a docker image from <gh-file:docker/Dockerfile.rocm_base> which setup ROCm software stack needed by the vLLM. Build a docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base) which setup ROCm software stack needed by the vLLM.
**This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.** **This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.**
If you choose to build this rocm_base image yourself, the steps are as follows. If you choose to build this rocm_base image yourself, the steps are as follows.
...@@ -170,7 +170,7 @@ DOCKER_BUILDKIT=1 docker build \ ...@@ -170,7 +170,7 @@ DOCKER_BUILDKIT=1 docker build \
#### Build an image with vLLM #### Build an image with vLLM
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image. First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```bash ```bash
...@@ -181,10 +181,10 @@ It is important that the user kicks off the docker build using buildkit. Either ...@@ -181,10 +181,10 @@ It is important that the user kicks off the docker build using buildkit. Either
} }
``` ```
<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches. [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments: It provides flexibility to customize the build of docker image using the following arguments:
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:docker/Dockerfile.rocm_base> - `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base)
- `ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image - `ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image
Their values can be passed in when running `docker build` with `--build-arg` options. Their values can be passed in when running `docker build` with `--build-arg` options.
......
...@@ -75,7 +75,7 @@ vllm serve facebook/opt-13b \ ...@@ -75,7 +75,7 @@ vllm serve facebook/opt-13b \
-tp=8 -tp=8
``` ```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script. By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
# --8<-- [end:supported-features] # --8<-- [end:supported-features]
# --8<-- [start:distributed-backend] # --8<-- [start:distributed-backend]
......
...@@ -46,7 +46,7 @@ uv pip install vllm --torch-backend=auto ...@@ -46,7 +46,7 @@ uv pip install vllm --torch-backend=auto
## Offline Batched Inference ## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py> With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]: The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
...@@ -201,7 +201,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep ...@@ -201,7 +201,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
print("Completion result:", completion) print("Completion result:", completion)
``` ```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py> A more detailed client example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### OpenAI Chat Completions API with vLLM ### OpenAI Chat Completions API with vLLM
...@@ -253,4 +253,4 @@ Currently, vLLM supports multiple backends for efficient Attention computation a ...@@ -253,4 +253,4 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`. If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
!!! warning !!! warning
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it. There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
...@@ -137,13 +137,20 @@ class Example: ...@@ -137,13 +137,20 @@ class Example:
gh_file = (self.main_file.parent / relative_path).resolve() gh_file = (self.main_file.parent / relative_path).resolve()
gh_file = gh_file.relative_to(ROOT_DIR) gh_file = gh_file.relative_to(ROOT_DIR)
return f"[{link_text}](gh-file:{gh_file})" # Make GitHub URL
url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
gh_url = f"{url}/{gh_file}"
return f"[{link_text}]({gh_url})"
return re.sub(link_pattern, replace_link, content) return re.sub(link_pattern, replace_link, content)
def generate(self) -> str: def generate(self) -> str:
content = f"# {self.title}\n\n" content = f"# {self.title}\n\n"
content += f"Source <gh-file:{self.path.relative_to(ROOT_DIR)}>.\n\n" url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
content += f"Source <{url}/{self.path.relative_to(ROOT_DIR)}>.\n\n"
# Use long code fence to avoid issues with # Use long code fence to avoid issues with
# included files containing code fences too # included files containing code fences too
......
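The hunk above chooses between `tree/main` and `blob/main` depending on whether the example path is a directory. Below is a minimal, standalone sketch of that rule. The helper names `github_url` and `demo` are hypothetical (they are not part of the commit), and the temporary directory stands in for the repository root.

```python
import tempfile
from pathlib import Path

REPO_URL = "https://github.com/vllm-project/vllm"


def github_url(path: Path, root: Path) -> str:
    """Build a GitHub URL for `path`, which must live under `root`.

    Hypothetical helper: GitHub serves directories under `tree/<ref>`
    and files under `blob/<ref>`, so the slug depends on the path type.
    """
    slug = "tree/main" if path.is_dir() else "blob/main"
    return f"{REPO_URL}/{slug}/{path.relative_to(root).as_posix()}"


def demo() -> tuple[str, str]:
    """Create a throwaway tree and return (directory URL, file URL)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        example_dir = root / "examples" / "offline_inference"
        example_dir.mkdir(parents=True)
        script = example_dir / "basic.py"
        script.touch()
        return github_url(example_dir, root), github_url(script, root)


if __name__ == "__main__":
    for url in demo():
        print(url)
```

Because the slug is decided by `Path.is_dir()`, the path has to exist on disk when the URL is built; a missing path would be treated as a file.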
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
""" """
This is basically a port of MyST parser’s external URL resolution mechanism MkDocs hook to enable the following links to render correctly:
(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
to work with MkDocs.
It allows Markdown authors to use GitHub shorthand links like: - Relative file links outside of the `docs/` directory, e.g.:
- [Text](../some_file.py)
- [Text](gh-issue:123) - [Directory](../../some_directory/)
- <gh-pr:456> - GitHub URLs for issues, pull requests, and projects, e.g.:
- [File](gh-file:path/to/file.py#L10) - Adds GitHub icon before links
- Replaces raw links with descriptive text,
These are automatically rewritten into fully qualified GitHub URLs pointing to e.g. <...pull/123> -> [Pull Request #123](.../pull/123)
issues, pull requests, files, directories, or projects in the - Works for external repos too by including the `owner/repo` in the link title
`vllm-project/vllm` repository.
The goal is to simplify cross-referencing common GitHub resources The goal is to simplify cross-referencing common GitHub resources
in project docs. in project docs.
""" """
from pathlib import Path
import regex as re import regex as re
from mkdocs.config.defaults import MkDocsConfig from mkdocs.config.defaults import MkDocsConfig
from mkdocs.structure.files import Files from mkdocs.structure.files import Files
from mkdocs.structure.pages import Page from mkdocs.structure.pages import Page
ROOT_DIR = Path(__file__).parent.parent.parent.parent.resolve()
DOC_DIR = ROOT_DIR / "docs"
gh_icon = ":octicons-mark-github-16:"
# Regex pieces
TITLE = r"(?P<title>[^\[\]<>]+?)"
REPO = r"(?P<repo>.+?/.+?)"
TYPE = r"(?P<type>issues|pull|projects)"
NUMBER = r"(?P<number>\d+)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?"
URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"
RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)"
# Common titles to use for GitHub links when none is provided in the link.
TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
# Regex to match GitHub issue, PR, and project links with optional titles.
github_link = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")
# Regex to match relative file links with optional titles.
relative_link = re.compile(rf"\[{TITLE}\]\({RELATIVE}\)")
def on_page_markdown( def on_page_markdown(
markdown: str, *, page: Page, config: MkDocsConfig, files: Files markdown: str, *, page: Page, config: MkDocsConfig, files: Files
) -> str: ) -> str:
""" def replace_relative_link(match: re.Match) -> str:
Custom MkDocs plugin hook to rewrite special GitHub reference links """Replace relative file links with URLs if they point outside the docs dir."""
in Markdown. title = match.group("title")
This function scans the given Markdown content for specially formatted
GitHub shorthand links, such as:
- `[Link text](gh-issue:123)`
- `<gh-pr:456>`
And rewrites them into fully-qualified GitHub URLs with GitHub icons:
- `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
- `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
Supported shorthand types:
- `gh-issue`
- `gh-pr`
- `gh-project`
- `gh-dir`
- `gh-file`
Args:
markdown (str): The raw Markdown content of the page.
page (Page): The MkDocs page object being processed.
config (MkDocsConfig): The MkDocs site configuration.
files (Files): The collection of files in the MkDocs build.
Returns:
str: The updated Markdown content with GitHub shorthand links replaced.
"""
gh_icon = ":octicons-mark-github-16:"
gh_url = "https://github.com"
repo_url = f"{gh_url}/vllm-project/vllm"
org_url = f"{gh_url}/orgs/vllm-project"
# Mapping of shorthand types to their corresponding GitHub base URLs
urls = {
"issue": f"{repo_url}/issues",
"pr": f"{repo_url}/pull",
"project": f"{org_url}/projects",
"dir": f"{repo_url}/tree/main",
"file": f"{repo_url}/blob/main",
}
# Default title prefixes for auto links
titles = {
"issue": "Issue #",
"pr": "Pull Request #",
"project": "Project #",
"dir": "",
"file": "",
}
# Regular expression to match GitHub shorthand links
scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
auto_link = re.compile(f"<{scheme}>")
def replace_inline_link(match: re.Match) -> str:
"""
Replaces a matched inline-style GitHub shorthand link
with a full Markdown link.
Example:
[My issue](gh-issue:123) → [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
"""
url = f"{urls[match.group('type')]}/{match.group('path')}"
if fragment := match.group("fragment"):
url += f"#{fragment}"
return f"[{gh_icon} {match.group('title')}]({url})"
def replace_auto_link(match: re.Match) -> str:
"""
Replaces a matched autolink-style GitHub shorthand
with a full Markdown link.
Example:
<gh-pr:456> → [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
"""
type = match.group("type")
path = match.group("path") path = match.group("path")
title = f"{titles[type]}{path}" path = (Path(page.file.abs_src_path).parent / path).resolve()
url = f"{urls[type]}/{path}"
if fragment := match.group("fragment"): # Check if the path exists and is outside the docs dir
url += f"#{fragment}" if not path.exists() or path.is_relative_to(DOC_DIR):
return match.group(0)
# Files and directories have different URL schemes on GitHub
slug = "tree/main" if path.is_dir() else "blob/main"
path = path.relative_to(ROOT_DIR)
url = f"https://github.com/vllm-project/vllm/{slug}/{path}"
return f"[{gh_icon} {title}]({url})" return f"[{gh_icon} {title}]({url})"
# Replace both inline and autolinks def replace_github_link(match: re.Match) -> str:
markdown = inline_link.sub(replace_inline_link, markdown) """Replace GitHub issue, PR, and project links with enhanced Markdown links."""
markdown = auto_link.sub(replace_auto_link, markdown) repo = match.group("repo")
type = match.group("type")
number = match.group("number")
# Title and fragment could be None
title = match.group("title") or ""
fragment = match.group("fragment") or ""
# Use default titles for raw links
if not title:
title = TITLES[type]
if "vllm-project" not in repo:
title += repo
title += f"#{number}"
url = f"https://github.com/{repo}/{type}/{number}{fragment}"
return f"[{gh_icon} {title}]({url})"
markdown = relative_link.sub(replace_relative_link, markdown)
markdown = github_link.sub(replace_github_link, markdown)
if "interface" in str(page.file.abs_src_path):
print(markdown)
return markdown return markdown
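To illustrate what the new `github_link` pattern does, here is a standalone sketch that copies the regex pieces from the new side of the diff and applies them outside of MkDocs. It is an approximation under two assumptions: the standard-library `re` module is swapped in for the `regex` package the hook imports, and the `rewrite` wrapper is a hypothetical convenience function that does not exist in the commit.

```python
import re  # assumption: stdlib `re` instead of the `regex` package used by the hook

GH_ICON = ":octicons-mark-github-16:"

# Regex pieces, copied from the new side of the diff
TITLE = r"(?P<title>[^\[\]<>]+?)"
REPO = r"(?P<repo>.+?/.+?)"
TYPE = r"(?P<type>issues|pull|projects)"
NUMBER = r"(?P<number>\d+)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?"
URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"

# Default titles for raw links that carry no link text
TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}

# Matches both `[title](url)` and `<url>` forms
GITHUB_LINK = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")


def replace_github_link(match: re.Match) -> str:
    """Turn a matched GitHub link into an icon-prefixed Markdown link."""
    repo = match.group("repo")
    kind = match.group("type")
    number = match.group("number")
    # Title and fragment are None when absent from the match
    title = match.group("title") or ""
    fragment = match.group("fragment") or ""
    if not title:
        title = TITLES[kind]
        # External repositories get `owner/repo` in the generated title
        if "vllm-project" not in repo:
            title += repo
        title += f"#{number}"
    url = f"https://github.com/{repo}/{kind}/{number}{fragment}"
    return f"[{GH_ICON} {title}]({url})"


def rewrite(markdown: str) -> str:
    """Hypothetical wrapper: apply the GitHub link rewrite to a Markdown string."""
    return GITHUB_LINK.sub(replace_github_link, markdown)


if __name__ == "__main__":
    print(rewrite("See <https://github.com/vllm-project/vllm/issues/8420> for more details."))
```

Raw autolinks receive a generated title such as `Issue #8420`, while links that already have text keep it and only gain the icon prefix.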
...@@ -82,7 +82,7 @@ vllm serve /path/to/sharded/model \ ...@@ -82,7 +82,7 @@ vllm serve /path/to/sharded/model \
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}' --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
``` ```
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader. To create sharded model files, you can use the script provided in [examples/offline_inference/save_sharded_state.py](../../../examples/offline_inference/save_sharded_state.py). This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way: The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
......
...@@ -59,7 +59,7 @@ for output in outputs: ...@@ -59,7 +59,7 @@ for output in outputs:
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified. By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance. However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py> A code example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### `LLM.beam_search` ### `LLM.beam_search`
...@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc ...@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py> A code example can be found here: [examples/offline_inference/basic/chat.py](../../examples/offline_inference/basic/chat.py)
If the model doesn't have a chat template or you want to specify another one, If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template: you can explicitly pass a chat template:
......
...@@ -9,7 +9,7 @@ before returning them. ...@@ -9,7 +9,7 @@ before returning them.
!!! note !!! note
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly. We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions! We are now planning to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
## Configuration ## Configuration
...@@ -98,7 +98,7 @@ embeds = output.outputs.embedding ...@@ -98,7 +98,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})") print(f"Embeddings: {embeds!r} (size={len(embeds)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py> A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)
### `LLM.classify` ### `LLM.classify`
...@@ -115,7 +115,7 @@ probs = output.outputs.probs ...@@ -115,7 +115,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})") print(f"Class Probabilities: {probs!r} (size={len(probs)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py> A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)
### `LLM.score` ### `LLM.score`
...@@ -139,7 +139,7 @@ score = output.outputs.score ...@@ -139,7 +139,7 @@ score = output.outputs.score
print(f"Score: {score}") print(f"Score: {score}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/score.py> A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
### `LLM.reward` ### `LLM.reward`
...@@ -156,7 +156,7 @@ data = output.outputs.data ...@@ -156,7 +156,7 @@ data = output.outputs.data
print(f"Data: {data!r}") print(f"Data: {data!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py> A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)
### `LLM.encode` ### `LLM.encode`
...@@ -234,7 +234,7 @@ outputs = llm.embed( ...@@ -234,7 +234,7 @@ outputs = llm.embed(
print(outputs[0].outputs) print(outputs[0].outputs)
``` ```
A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py> A code example can be found here: [examples/offline_inference/pooling/embed_matryoshka_fy.py](../../examples/offline_inference/pooling/embed_matryoshka_fy.py)
### Online Inference ### Online Inference
...@@ -264,4 +264,4 @@ Expected output: ...@@ -264,4 +264,4 @@ Expected output:
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}} {"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
``` ```
An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py> An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)
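As a rough illustration of consuming a response shaped like the expected output above, the sketch below parses the JSON and reports the length of each returned embedding. The payload here is a shortened, made-up stand-in (four values rather than the real vector), and `embedding_dims` is a hypothetical helper name, not part of vLLM or the OpenAI client.

```python
import json


def embedding_dims(response_text: str) -> list[int]:
    """Return the length of each embedding in an Embeddings-API-style response."""
    payload = json.loads(response_text)
    return [len(item["embedding"]) for item in payload["data"]]


# Illustrative payload only: same field layout as the expected output,
# but with a shortened embedding vector.
sample = json.dumps(
    {
        "object": "list",
        "model": "jinaai/jina-embeddings-v3",
        "data": [
            {
                "index": 0,
                "object": "embedding",
                "embedding": [-0.3828125, -0.1357421875, 0.03759765625, 0.125],
            }
        ],
    }
)

print(embedding_dims(sample))
```

A check like this can be a quick way to confirm that a requested Matryoshka dimension was actually applied by the server.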
...@@ -9,7 +9,7 @@ Alongside each architecture, we include some popular models that use it. ...@@ -9,7 +9,7 @@ Alongside each architecture, we include some popular models that use it.
### vLLM ### vLLM
If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>. If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models]. These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
...@@ -116,7 +116,7 @@ Here is what happens in the background when this model is loaded: ...@@ -116,7 +116,7 @@ Here is what happens in the background when this model is loaded:
1. The config is loaded. 1. The config is loaded.
2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`. 2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
3. `MyModel` is loaded into one of the Transformers backend classes in <gh-file:vllm/model_executor/models/transformers.py> which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used. 3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers.py](../../vllm/model_executor/models/transformers.py) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
That's it! That's it!
...@@ -543,7 +543,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A ...@@ -543,7 +543,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
``` ```
!!! note !!! note
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/pooling/qwen3_reranker.py>. Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/offline_inference/pooling/qwen3_reranker.py](../../examples/offline_inference/pooling/qwen3_reranker.py).
```bash ```bash
vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
...@@ -581,7 +581,7 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode) ...@@ -581,7 +581,7 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | | | `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
!!! note !!! note
Named Entity Recognition (NER) usage, please refer to <gh-file:examples/offline_inference/pooling/ner.py>, <gh-file:examples/online_serving/pooling/ner_client.py>. Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py), [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models } [](){ #supported-mm-models }
...@@ -776,7 +776,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th ...@@ -776,7 +776,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th
!!! note !!! note
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now. The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630> For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
!!! warning !!! warning
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1. Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
...@@ -856,5 +856,5 @@ We have the following levels of testing for models: ...@@ -856,5 +856,5 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test. 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test. 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test. 3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](../../tests) and [examples](../../examples) for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category. 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
...@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th ...@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently. In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see <gh-file:examples/offline_inference/data_parallel.py>. This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing. There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing.
......
...@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md). ...@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication ## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to <gh-file:examples/online_serving/run_cluster.sh>, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>. After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request ## No available node types can fulfill resource request
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in <gh-file:examples/online_serving/run_cluster.sh> (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>. The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
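The per-node environment settings described in the two sections above could be assembled along these lines. This is only a sketch: `docker_env_args` is a hypothetical helper, the node names and IP addresses are placeholders, and it assumes the extra arguments are appended to the `run_cluster.sh` invocation as `-e KEY=VALUE` pairs, as the text suggests.

```python
def docker_env_args(env: dict[str, str]) -> list[str]:
    """Turn a mapping of environment variables into Docker-style `-e KEY=VALUE` arguments."""
    args: list[str] = []
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    return args


# Placeholder node names and addresses; VLLM_HOST_IP must differ on each node.
nodes = {"head": "10.0.0.1", "worker-1": "10.0.0.2"}

for name, ip in nodes.items():
    extra = docker_env_args({"VLLM_HOST_IP": ip, "NCCL_SOCKET_IFNAME": "eth0"})
    print(name, " ".join(extra))
```

Generating the arguments per node keeps the shared settings identical everywhere while still giving each node its own `VLLM_HOST_IP`.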
## Ray observability ## Ray observability
......
...@@ -8,9 +8,9 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen ...@@ -8,9 +8,9 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future: Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels). 1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation). 2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). 3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
### Backend Selection Guide ### Backend Selection Guide
...@@ -195,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok ...@@ -195,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
### Setup Steps ### Setup Steps
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](gh-file:tools/install_nixl_from_source_ubuntu.py) script. 1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'` 2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
......
...@@ -92,7 +92,7 @@ and all chat requests will error. ...@@ -92,7 +92,7 @@ and all chat requests will error.
vllm serve <model> --chat-template ./path-to-chat-template.jinja vllm serve <model> --chat-template ./path-to-chat-template.jinja
``` ```
vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory. vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below: both a `type` and a `text` field. An example is provided below:
...@@ -181,7 +181,7 @@ with `--enable-request-id-headers`. ...@@ -181,7 +181,7 @@ with `--enable-request-id-headers`.
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions); Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/openai_completion_client.py> Code example: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
#### Extra parameters #### Extra parameters
...@@ -214,7 +214,7 @@ see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more inf ...@@ -214,7 +214,7 @@ see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more inf
- *Note: `image_url.detail` parameter is not supported.* - *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py> Code example: [examples/online_serving/openai_chat_completion_client.py](../../examples/online_serving/openai_chat_completion_client.py)
#### Extra parameters #### Extra parameters
...@@ -241,7 +241,7 @@ The following extra parameters are supported: ...@@ -241,7 +241,7 @@ The following extra parameters are supported:
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings); Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it. you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py> Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api]) If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations: which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
...@@ -289,7 +289,7 @@ and passing a list of `messages` in the request. Refer to the examples below for ...@@ -289,7 +289,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
to run this model in embedding mode instead of text generation mode. to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja> and can be found here: [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja)
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library: Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
...@@ -336,13 +336,13 @@ and passing a list of `messages` in the request. Refer to the examples below for ...@@ -336,13 +336,13 @@ and passing a list of `messages` in the request. Refer to the examples below for
Like with VLM2Vec, we have to explicitly pass `--runner pooling`. Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja> by a custom chat template: [examples/template_dse_qwen2_vl.jinja](../../examples/template_dse_qwen2_vl.jinja)
!!! important !!! important
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details. example below for details.
Full example: <gh-file:examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py> Full example: [examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py](../../examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py)
#### Extra parameters #### Extra parameters
...@@ -379,7 +379,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai ...@@ -379,7 +379,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
!!! note !!! note
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`. To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_transcription_client.py> Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)
#### API Enforced Limits #### API Enforced Limits
...@@ -496,7 +496,7 @@ Please mind that the popular `openai/whisper-large-v3-turbo` model does not supp ...@@ -496,7 +496,7 @@ Please mind that the popular `openai/whisper-large-v3-turbo` model does not supp
!!! note !!! note
To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`. To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_translation_client.py> Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)
#### Extra Parameters #### Extra Parameters
...@@ -530,7 +530,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_ ...@@ -530,7 +530,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats. The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/online_serving/pooling/openai_pooling_client.py> Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
[](){ #classification-api } [](){ #classification-api }
...@@ -540,7 +540,7 @@ Our Classification API directly supports Hugging Face sequence-classification mo ...@@ -540,7 +540,7 @@ Our Classification API directly supports Hugging Face sequence-classification mo
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities. We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: <gh-file:examples/online_serving/pooling/openai_classification_client.py> Code example: [examples/online_serving/pooling/openai_classification_client.py](../../examples/online_serving/pooling/openai_classification_client.py)
#### Example Requests #### Example Requests
...@@ -658,7 +658,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent ...@@ -658,7 +658,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html). You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py> Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
#### Single inference #### Single inference
...@@ -839,7 +839,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including ...@@ -839,7 +839,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
print("Scoring output:", response_json["data"][0]["score"]) print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"]) print("Scoring output:", response_json["data"][1]["score"])
``` ```
Full example: <gh-file:examples/online_serving/openai_cross_encoder_score_for_multimodal.py> Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
#### Extra parameters #### Extra parameters
...@@ -871,7 +871,7 @@ endpoints are compatible with both [Jina AI's re-rank API interface](https://jin ...@@ -871,7 +871,7 @@ endpoints are compatible with both [Jina AI's re-rank API interface](https://jin
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with [Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools. popular open-source tools.
Code example: <gh-file:examples/online_serving/pooling/jinaai_rerank_client.py> Code example: [examples/online_serving/pooling/jinaai_rerank_client.py](../../examples/online_serving/pooling/jinaai_rerank_client.py)
#### Example Request #### Example Request
...@@ -949,6 +949,6 @@ Key capabilities: ...@@ -949,6 +949,6 @@ Key capabilities:
- Scales from a single GPU to a multi-node cluster without code changes. - Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics. - Provides observability and autoscaling policies through Ray dashboards and metrics.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: <gh-file:examples/online_serving/ray_serve_deepseek.py>. The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).
Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html). Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html).
...@@ -72,7 +72,7 @@ For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.htm ...@@ -72,7 +72,7 @@ For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.htm
### Ray cluster setup with containers ### Ray cluster setup with containers
The helper script <gh-file:examples/online_serving/run_cluster.sh> starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command. The helper script [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
Choose one node as the head node and run: Choose one node as the head node and run:
...@@ -132,7 +132,7 @@ vllm serve /path/to/the/model/in/the/container \ ...@@ -132,7 +132,7 @@ vllm serve /path/to/the/model/in/the/container \
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
<gh-file:examples/online_serving/run_cluster.sh> helper script. [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
Contact your system administrator for more information about the required flags. Contact your system administrator for more information about the required flags.
## Enabling GPUDirect RDMA ## Enabling GPUDirect RDMA
......