Unverified Commit 4ffd6e89 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

[Docs] Reduce custom syntax used in docs (#27009)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent 965c5f49
......@@ -4,19 +4,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
## Requirements
......@@ -24,19 +24,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"
## Set up using Python
......@@ -52,19 +52,19 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"
=== "IBM Z (s390x)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"
## Set up using Docker
......@@ -72,24 +72,24 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
### Build image from source
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"
## Related runtime environment variables
......
......@@ -157,7 +157,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
You can use [docker/Dockerfile.tpu](../../../docker/Dockerfile.tpu) to build a Docker image with TPU support.
```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
......
......@@ -11,7 +11,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [start:set-up-using-python]
!!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
......
......@@ -4,15 +4,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:installation"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:installation"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
## Requirements
......@@ -24,15 +24,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:requirements"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:requirements"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
## Set up using Python
......@@ -42,29 +42,29 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:set-up-using-python"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:set-up-using-python"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
### Pre-built wheels
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-wheels"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-wheels"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
[](){ #build-from-source }
......@@ -72,15 +72,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-wheel-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-wheel-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
## Set up using Docker
......@@ -88,40 +88,40 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-images"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
### Build image from source
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-image-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-image-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
## Supported features
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:supported-features"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:supported-features"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
......@@ -146,7 +146,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO
#### (Optional) Build an image with ROCm software stack
Build a docker image from <gh-file:docker/Dockerfile.rocm_base> which setup ROCm software stack needed by the vLLM.
Build a docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base) which setup ROCm software stack needed by the vLLM.
**This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.**
If you choose to build this rocm_base image yourself, the steps are as follows.
......@@ -170,7 +170,7 @@ DOCKER_BUILDKIT=1 docker build \
#### Build an image with vLLM
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```bash
......@@ -181,10 +181,10 @@ It is important that the user kicks off the docker build using buildkit. Either
}
```
<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
[docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments:
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:docker/Dockerfile.rocm_base>
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base)
- `ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image
Their values can be passed in when running `docker build` with `--build-arg` options.
......
......@@ -75,7 +75,7 @@ vllm serve facebook/opt-13b \
-tp=8
```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
# --8<-- [end:supported-features]
# --8<-- [start:distributed-backend]
......
......@@ -46,7 +46,7 @@ uv pip install vllm --torch-backend=auto
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py>
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
......@@ -201,7 +201,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
print("Completion result:", completion)
```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
A more detailed client example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### OpenAI Chat Completions API with vLLM
......@@ -253,4 +253,4 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
!!! warning
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it.
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
......@@ -137,13 +137,20 @@ class Example:
gh_file = (self.main_file.parent / relative_path).resolve()
gh_file = gh_file.relative_to(ROOT_DIR)
return f"[{link_text}](gh-file:{gh_file})"
# Make GitHub URL
url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
gh_url = f"{url}/{gh_file}"
return f"[{link_text}]({gh_url})"
return re.sub(link_pattern, replace_link, content)
def generate(self) -> str:
content = f"# {self.title}\n\n"
content += f"Source <gh-file:{self.path.relative_to(ROOT_DIR)}>.\n\n"
url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
content += f"Source <{url}/{self.path.relative_to(ROOT_DIR)}>.\n\n"
# Use long code fence to avoid issues with
# included files containing code fences too
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This is basically a port of MyST parser’s external URL resolution mechanism
(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
to work with MkDocs.
MkDocs hook to enable the following links to render correctly:
It allows Markdown authors to use GitHub shorthand links like:
- [Text](gh-issue:123)
- <gh-pr:456>
- [File](gh-file:path/to/file.py#L10)
These are automatically rewritten into fully qualified GitHub URLs pointing to
issues, pull requests, files, directories, or projects in the
`vllm-project/vllm` repository.
- Relative file links outside of the `docs/` directory, e.g.:
- [Text](../some_file.py)
- [Directory](../../some_directory/)
- GitHub URLs for issues, pull requests, and projects, e.g.:
- Adds GitHub icon before links
- Replaces raw links with descriptive text,
e.g. <...pull/123> -> [Pull Request #123](.../pull/123)
- Works for external repos too by including the `owner/repo` in the link title
The goal is to simplify cross-referencing common GitHub resources
in project docs.
"""
from pathlib import Path
import regex as re
from mkdocs.config.defaults import MkDocsConfig
from mkdocs.structure.files import Files
from mkdocs.structure.pages import Page
ROOT_DIR = Path(__file__).parent.parent.parent.parent.resolve()
DOC_DIR = ROOT_DIR / "docs"
gh_icon = ":octicons-mark-github-16:"
# Regex pieces
TITLE = r"(?P<title>[^\[\]<>]+?)"
REPO = r"(?P<repo>.+?/.+?)"
TYPE = r"(?P<type>issues|pull|projects)"
NUMBER = r"(?P<number>\d+)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?"
URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"
RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)"
# Common titles to use for GitHub links when none is provided in the link.
TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
# Regex to match GitHub issue, PR, and project links with optional titles.
github_link = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")
# Regex to match relative file links with optional titles.
relative_link = re.compile(rf"\[{TITLE}\]\({RELATIVE}\)")
def on_page_markdown(
markdown: str, *, page: Page, config: MkDocsConfig, files: Files
) -> str:
"""
Custom MkDocs plugin hook to rewrite special GitHub reference links
in Markdown.
This function scans the given Markdown content for specially formatted
GitHub shorthand links, such as:
- `[Link text](gh-issue:123)`
- `<gh-pr:456>`
And rewrites them into fully-qualified GitHub URLs with GitHub icons:
- `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
- `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
Supported shorthand types:
- `gh-issue`
- `gh-pr`
- `gh-project`
- `gh-dir`
- `gh-file`
Args:
markdown (str): The raw Markdown content of the page.
page (Page): The MkDocs page object being processed.
config (MkDocsConfig): The MkDocs site configuration.
files (Files): The collection of files in the MkDocs build.
Returns:
str: The updated Markdown content with GitHub shorthand links replaced.
"""
gh_icon = ":octicons-mark-github-16:"
gh_url = "https://github.com"
repo_url = f"{gh_url}/vllm-project/vllm"
org_url = f"{gh_url}/orgs/vllm-project"
# Mapping of shorthand types to their corresponding GitHub base URLs
urls = {
"issue": f"{repo_url}/issues",
"pr": f"{repo_url}/pull",
"project": f"{org_url}/projects",
"dir": f"{repo_url}/tree/main",
"file": f"{repo_url}/blob/main",
}
# Default title prefixes for auto links
titles = {
"issue": "Issue #",
"pr": "Pull Request #",
"project": "Project #",
"dir": "",
"file": "",
}
# Regular expression to match GitHub shorthand links
scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
auto_link = re.compile(f"<{scheme}>")
def replace_inline_link(match: re.Match) -> str:
"""
Replaces a matched inline-style GitHub shorthand link
with a full Markdown link.
Example:
[My issue](gh-issue:123) → [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
"""
url = f"{urls[match.group('type')]}/{match.group('path')}"
if fragment := match.group("fragment"):
url += f"#{fragment}"
return f"[{gh_icon} {match.group('title')}]({url})"
def replace_auto_link(match: re.Match) -> str:
"""
Replaces a matched autolink-style GitHub shorthand
with a full Markdown link.
Example:
<gh-pr:456> → [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
"""
type = match.group("type")
def replace_relative_link(match: re.Match) -> str:
"""Replace relative file links with URLs if they point outside the docs dir."""
title = match.group("title")
path = match.group("path")
title = f"{titles[type]}{path}"
url = f"{urls[type]}/{path}"
if fragment := match.group("fragment"):
url += f"#{fragment}"
path = (Path(page.file.abs_src_path).parent / path).resolve()
# Check if the path exists and is outside the docs dir
if not path.exists() or path.is_relative_to(DOC_DIR):
return match.group(0)
# Files and directories have different URL schemes on GitHub
slug = "tree/main" if path.is_dir() else "blob/main"
path = path.relative_to(ROOT_DIR)
url = f"https://github.com/vllm-project/vllm/{slug}/{path}"
return f"[{gh_icon} {title}]({url})"
# Replace both inline and autolinks
markdown = inline_link.sub(replace_inline_link, markdown)
markdown = auto_link.sub(replace_auto_link, markdown)
def replace_github_link(match: re.Match) -> str:
"""Replace GitHub issue, PR, and project links with enhanced Markdown links."""
repo = match.group("repo")
type = match.group("type")
number = match.group("number")
# Title and fragment could be None
title = match.group("title") or ""
fragment = match.group("fragment") or ""
# Use default titles for raw links
if not title:
title = TITLES[type]
if "vllm-project" not in repo:
title += repo
title += f"#{number}"
url = f"https://github.com/{repo}/{type}/{number}{fragment}"
return f"[{gh_icon} {title}]({url})"
markdown = relative_link.sub(replace_relative_link, markdown)
markdown = github_link.sub(replace_github_link, markdown)
if "interface" in str(page.file.abs_src_path):
print(markdown)
return markdown
......@@ -82,7 +82,7 @@ vllm serve /path/to/sharded/model \
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
```
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
To create sharded model files, you can use the script provided in [examples/offline_inference/save_sharded_state.py](../../../examples/offline_inference/save_sharded_state.py). This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
......
......@@ -59,7 +59,7 @@ for output in outputs:
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
A code example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### `LLM.beam_search`
......@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
A code example can be found here: [examples/offline_inference/basic/chat.py](../../examples/offline_inference/basic/chat.py)
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:
......
......@@ -9,7 +9,7 @@ before returning them.
!!! note
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!
We are now planning to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
## Configuration
......@@ -98,7 +98,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>
A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)
### `LLM.classify`
......@@ -115,7 +115,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>
A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)
### `LLM.score`
......@@ -139,7 +139,7 @@ score = output.outputs.score
print(f"Score: {score}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
### `LLM.reward`
......@@ -156,7 +156,7 @@ data = output.outputs.data
print(f"Data: {data!r}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py>
A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)
### `LLM.encode`
......@@ -234,7 +234,7 @@ outputs = llm.embed(
print(outputs[0].outputs)
```
A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py>
A code example can be found here: [examples/offline_inference/pooling/embed_matryoshka_fy.py](../../examples/offline_inference/pooling/embed_matryoshka_fy.py)
### Online Inference
......@@ -264,4 +264,4 @@ Expected output:
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```
An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py>
An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)
......@@ -9,7 +9,7 @@ Alongside each architecture, we include some popular models that use it.
### vLLM
If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
......@@ -116,7 +116,7 @@ Here is what happens in the background when this model is loaded:
1. The config is loaded.
2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
3. `MyModel` is loaded into one of the Transformers backend classes in <gh-file:vllm/model_executor/models/transformers.py> which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers.py](../../vllm/model_executor/models/transformers.py) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
That's it!
......@@ -543,7 +543,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
```
!!! note
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/pooling/qwen3_reranker.py>.
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/offline_inference/pooling/qwen3_reranker.py](../../examples/offline_inference/pooling/qwen3_reranker.py).
```bash
vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
......@@ -581,7 +581,7 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
!!! note
Named Entity Recognition (NER) usage, please refer to <gh-file:examples/offline_inference/pooling/ner.py>, <gh-file:examples/online_serving/pooling/ner_client.py>.
Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py), [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models }
......@@ -776,7 +776,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th
!!! note
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
!!! warning
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
......@@ -856,5 +856,5 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](../../tests) and [examples](../../examples) for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
......@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see <gh-file:examples/offline_inference/data_parallel.py>.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing.
......
......@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to <gh-file:examples/online_serving/run_cluster.sh>, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>.
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in <gh-file:examples/online_serving/run_cluster.sh> (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>.
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
## Ray observability
......
......@@ -8,9 +8,9 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
### Backend Selection Guide
......@@ -195,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
### Setup Steps
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](gh-file:tools/install_nixl_from_source_ubuntu.py) script.
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
......
......@@ -92,7 +92,7 @@ and all chat requests will error.
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```
vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.
vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:
......@@ -181,7 +181,7 @@ with `--enable-request-id-headers`.
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/openai_completion_client.py>
Code example: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
#### Extra parameters
......@@ -214,7 +214,7 @@ see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more inf
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
Code example: [examples/online_serving/openai_chat_completion_client.py](../../examples/online_serving/openai_chat_completion_client.py)
#### Extra parameters
......@@ -241,7 +241,7 @@ The following extra parameters are supported:
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
......@@ -289,7 +289,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja>
and can be found here: [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja)
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
......@@ -336,13 +336,13 @@ and passing a list of `messages` in the request. Refer to the examples below for
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
by a custom chat template: [examples/template_dse_qwen2_vl.jinja](../../examples/template_dse_qwen2_vl.jinja)
!!! important
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
Full example: <gh-file:examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py>
Full example: [examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py](../../examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py)
#### Extra parameters
......@@ -379,7 +379,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
!!! note
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)
#### API Enforced Limits
......@@ -496,7 +496,7 @@ Please mind that the popular `openai/whisper-large-v3-turbo` model does not supp
!!! note
To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_translation_client.py>
Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)
#### Extra Parameters
......@@ -530,7 +530,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/online_serving/pooling/openai_pooling_client.py>
Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
[](){ #classification-api }
......@@ -540,7 +540,7 @@ Our Classification API directly supports Hugging Face sequence-classification mo
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: <gh-file:examples/online_serving/pooling/openai_classification_client.py>
Code example: [examples/online_serving/pooling/openai_classification_client.py](../../examples/online_serving/pooling/openai_classification_client.py)
#### Example Requests
......@@ -658,7 +658,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
#### Single inference
......@@ -839,7 +839,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
```
Full example: <gh-file:examples/online_serving/openai_cross_encoder_score_for_multimodal.py>
Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
#### Extra parameters
......@@ -871,7 +871,7 @@ endpoints are compatible with both [Jina AI's re-rank API interface](https://jin
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.
Code example: <gh-file:examples/online_serving/pooling/jinaai_rerank_client.py>
Code example: [examples/online_serving/pooling/jinaai_rerank_client.py](../../examples/online_serving/pooling/jinaai_rerank_client.py)
#### Example Request
......@@ -949,6 +949,6 @@ Key capabilities:
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: <gh-file:examples/online_serving/ray_serve_deepseek.py>.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).
Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html).
......@@ -72,7 +72,7 @@ For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.htm
### Ray cluster setup with containers
The helper script <gh-file:examples/online_serving/run_cluster.sh> starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
The helper script [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
Choose one node as the head node and run:
......@@ -132,7 +132,7 @@ vllm serve /path/to/the/model/in/the/container \
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
<gh-file:examples/online_serving/run_cluster.sh> helper script.
[examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
Contact your system administrator for more information about the required flags.
## Enabling GPUDirect RDMA
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment