Commit 711aa9d5 authored by zhuwenwen

Merge tag 'v0.10.0' into v0.10.0-dev

parents 751c492c 6d8d0a24

docs/assets/deployment/open_webui.png (image diff: 67.7 KB | 57.2 KB)

---
toc_depth: 4
---
# vLLM CLI Guide # vLLM CLI Guide
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
...@@ -16,7 +20,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch} ...@@ -16,7 +20,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
Start the vLLM OpenAI Compatible API server. Start the vLLM OpenAI Compatible API server.
??? Examples ??? console "Examples"
```bash ```bash
# Start with a model # Start with a model
...@@ -37,8 +41,15 @@ Start the vLLM OpenAI Compatible API server. ...@@ -37,8 +41,15 @@ Start the vLLM OpenAI Compatible API server.
# To search by keyword # To search by keyword
vllm serve --help=max vllm serve --help=max
# To view full help with pager (less/more)
vllm serve --help=page
``` ```
### Options
--8<-- "docs/argparse/serve.md"
## chat ## chat
Generate chat completions via the running API server. Generate chat completions via the running API server.
......
--- # Contact Us
title: Contact Us
---
[](){ #contactus }
--8<-- "README.md:contact-us" --8<-- "README.md:contact-us"
--- # Meetups
title: Meetups
---
[](){ #meetups }
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
......
...@@ -33,7 +33,7 @@ Quantized models take less memory at the cost of lower precision. ...@@ -33,7 +33,7 @@ Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI)) Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration. and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details. Dynamic quantization is also supported via the `quantization` option -- see [here](../features/quantization/README.md) for more details.
## Context length and batch size ## Context length and batch size
...@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me ...@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage: You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
??? Code ??? code
```python ```python
from vllm import LLM from vllm import LLM
...@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory. ...@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples: Here are some examples:
??? Code ??? code
```python ```python
from vllm import LLM from vllm import LLM
......
--- ---
title: Engine Arguments toc_depth: 3
--- ---
[](){ #engine-args }
# Engine Arguments
Engine arguments control the behavior of the vLLM engine. Engine arguments control the behavior of the vLLM engine.
- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class. - For [offline inference](../serving/offline_inference.md), they are part of the arguments to [LLM][vllm.LLM] class.
- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`. - For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.
The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs], are a combination of the configuration classes defined in [vllm.config][]. Therefore, if you are interested in developer documentation, we recommend looking at these configuration classes as they are the source of truth for types, defaults and docstrings.
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments. ## `EngineArgs`
However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented. --8<-- "docs/argparse/engine_args.md"
For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config. ## `AsyncEngineArgs`
!!! note --8<-- "docs/argparse/async_engine_args.md"
Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine] which is used for online serving. These can be found by running `vllm serve --help`
...@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system: ...@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
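
To make the naming clash concrete, here is a minimal sketch of the variable names Kubernetes injects for a Service, following the rules in the linked Kubernetes documentation (Service name upper-cased, dashes converted to underscores). The helper names and the port `8000` are illustrative assumptions, not part of vLLM or Kubernetes.

```python
def k8s_service_env_names(service_name: str, port: int) -> list[str]:
    """Hypothetical helper: env var names Kubernetes injects for a Service."""
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        # Docker-links-compatible variables also set by Kubernetes
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_TCP",
        f"{prefix}_PORT_{port}_TCP_PROTO",
        f"{prefix}_PORT_{port}_TCP_PORT",
        f"{prefix}_PORT_{port}_TCP_ADDR",
    ]


def clashes_with_vllm(service_name: str, port: int = 8000) -> list[str]:
    """Return the injected names that fall under vLLM's `VLLM_` prefix."""
    names = k8s_service_env_names(service_name, port)
    return [name for name in names if name.startswith("VLLM_")]


print(clashes_with_vllm("vllm"))        # every injected name uses the VLLM_ prefix
print(clashes_with_vllm("llm-server"))  # no clash with a different Service name
```

Renaming the Service is enough to keep the injected variables out of the `VLLM_` namespace.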
??? Code ??? code
```python ```python
--8<-- "vllm/envs.py:env-vars-definition" --8<-- "vllm/envs.py:env-vars-definition"
......
...@@ -14,10 +14,10 @@ For example: ...@@ -14,10 +14,10 @@ For example:
```python ```python
from vllm import LLM from vllm import LLM
model = LLM( llm = LLM(
model="cerebras/Cerebras-GPT-1.3B", model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2 hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
) )
``` ```
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM. Our [list of supported models](../models/supported_models.md) shows the model architectures that are recognized by vLLM.
--- # Server Arguments
title: Server Arguments
---
[](){ #serve-args }
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments ## CLI Arguments
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
To see the available CLI arguments, run `vllm serve --help`! To see the available options, take a look at the [CLI Reference](../cli/README.md#options)!
## Configuration file ## Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file. You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above][serve-args]. The argument names must be the long form of those outlined [above](serve_args.md).
For example: For example:
......
...@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo ...@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo
## Testing ## Testing
??? note "Commands" ??? console "Commands"
```bash ```bash
pip install -r requirements/dev.txt pip install -r requirements/dev.txt
......
--- # Benchmark Suites
title: Benchmark Suites
---
[](){ #benchmarks }
vLLM contains two sets of benchmarks: vLLM contains two sets of benchmarks:
......
...@@ -6,9 +6,9 @@ the failure? ...@@ -6,9 +6,9 @@ the failure?
- Check the dashboard of current CI test failures: - Check the dashboard of current CI test failures:
👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20) 👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)
- If your failure **is already listed**, it's likely unrelated to your PR. - If your failure **is already listed**, it's likely unrelated to your PR.
Help fixing it is always welcome! Help fixing it is always welcome!
- Leave comments with links to additional instances of the failure. - Leave comments with links to additional instances of the failure.
- React with a 👍 to signal how many are affected. - React with a 👍 to signal how many are affected.
- If your failure **is not listed**, you should **file an issue**. - If your failure **is not listed**, you should **file an issue**.
...@@ -19,25 +19,25 @@ the failure? ...@@ -19,25 +19,25 @@ the failure?
👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml) 👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)
- **Use this title format:** - **Use this title format:**
``` ```
[CI Failure]: failing-test-job - regex/matching/failing:test [CI Failure]: failing-test-job - regex/matching/failing:test
``` ```
- **For the environment field:** - **For the environment field:**
``` ```
Still failing on main as of commit abcdef123 Still failing on main as of commit abcdef123
``` ```
- **In the description, include failing tests:** - **In the description, include failing tests:**
``` ```
FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description FAILED failing/test.py:failing_test2 - Failure description
https://github.com/orgs/vllm-project/projects/20 https://github.com/orgs/vllm-project/projects/20
https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml
FAILED failing/test.py:failing_test3 - Failure description FAILED failing/test.py:failing_test3 - Failure description
``` ```
- **Attach logs** (collapsible section example): - **Attach logs** (collapsible section example):
...@@ -45,17 +45,17 @@ the failure? ...@@ -45,17 +45,17 @@ the failure?
<summary>Logs:</summary> <summary>Logs:</summary>
```text ```text
ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data
--- Logging error --- --- Logging error ---
Traceback (most recent call last): Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model
return self.model_executor.execute_model(scheduler_output) return self.model_executor.execute_model(scheduler_output)
... ...
FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description FAILED failing/test.py:failing_test2 - Failure description
FAILED failing/test.py:failing_test3 - Failure description FAILED failing/test.py:failing_test3 - Failure description
``` ```
</details> </details>
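
The title and description formats above are easy to get slightly wrong by hand. The following is a minimal sketch that formats them from a job name and a mapping of failing tests; the function names are hypothetical and not part of any vLLM tooling.

```python
def ci_failure_title(job: str, test_pattern: str) -> str:
    """Format an issue title as `[CI Failure]: <job> - <test pattern>`."""
    return f"[CI Failure]: {job} - {test_pattern}"


def failing_test_lines(failures: dict[str, str]) -> str:
    """Format one `FAILED <test> - <description>` line per failing test."""
    return "\n".join(
        f"FAILED {test} - {reason}" for test, reason in failures.items()
    )


print(ci_failure_title("failing-test-job", "regex/matching/failing:test"))
print(
    failing_test_lines(
        {
            "failing/test.py:failing_test1": "Failure description",
            "failing/test.py:failing_test2": "Failure description",
        }
    )
)
```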
## Logs Wrangling ## Logs Wrangling
...@@ -78,7 +78,7 @@ tail -525 ci_build.log | wl-copy ...@@ -78,7 +78,7 @@ tail -525 ci_build.log | wl-copy
## Investigating a CI Test Failure ## Investigating a CI Test Failure
1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main) 1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)
2. Bisect to find the first build that shows the issue. 2. Bisect to find the first build that shows the issue.
3. Add your findings to the GitHub issue. 3. Add your findings to the GitHub issue.
4. If you find a strong candidate PR, mention it in the issue and ping contributors. 4. If you find a strong candidate PR, mention it in the issue and ping contributors.
...@@ -97,9 +97,9 @@ CI test failures may be flaky. Use a bash loop to run repeatedly: ...@@ -97,9 +97,9 @@ CI test failures may be flaky. Use a bash loop to run repeatedly:
If you submit a PR to fix a CI failure: If you submit a PR to fix a CI failure:
- Link the PR to the issue: - Link the PR to the issue:
Add `Closes #12345` to the PR description. Add `Closes #12345` to the PR description.
- Add the `ci-failure` label: - Add the `ci-failure` label:
This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20). This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).
## Other Resources ## Other Resources
......
--- # Update PyTorch version on vLLM OSS CI/CD
title: Update PyTorch version on vLLM OSS CI/CD
---
vLLM's current policy is to always use the latest PyTorch stable vLLM's current policy is to always use the latest PyTorch stable
release in CI/CD. It is standard practice to submit a PR to update the release in CI/CD. It is standard practice to submit a PR to update the
PyTorch version as early as possible when a new [PyTorch stable PyTorch version as early as possible when a new [PyTorch stable
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available. release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
This process is non-trivial due to the gap between PyTorch This process is non-trivial due to the gap between PyTorch
releases. Using [#16859](https://github.com/vllm-project/vllm/pull/16859) as releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
an example, this document outlines common steps to achieve this update along with update along with a list of potential issues and how to address them.
a list of potential issues and how to address them.
## Test PyTorch release candidates (RCs) ## Test PyTorch release candidates (RCs)
...@@ -19,11 +16,12 @@ by waiting for the next release or by implementing hacky workarounds in vLLM. ...@@ -19,11 +16,12 @@ by waiting for the next release or by implementing hacky workarounds in vLLM.
The better solution is to test vLLM with PyTorch release candidates (RC) to ensure The better solution is to test vLLM with PyTorch release candidates (RC) to ensure
compatibility before each release. compatibility before each release.
PyTorch release candidates can be downloaded from PyTorch test index at https://download.pytorch.org/whl/test. PyTorch release candidates can be downloaded from [PyTorch test index](https://download.pytorch.org/whl/test).
For example, torch2.7.0+cu12.8 RC can be installed using the following command: For example, `torch2.7.0+cu12.8` RC can be installed using the following command:
``` ```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu128 uv pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/test/cu128
``` ```
When the final RC is ready for testing, it will be announced to the community When the final RC is ready for testing, it will be announced to the community
...@@ -31,13 +29,28 @@ on the [PyTorch dev-discuss forum](https://dev-discuss.pytorch.org/c/release-ann ...@@ -31,13 +29,28 @@ on the [PyTorch dev-discuss forum](https://dev-discuss.pytorch.org/c/release-ann
After this announcement, we can begin testing vLLM integration by drafting a pull request After this announcement, we can begin testing vLLM integration by drafting a pull request
following this 3-step process: following this 3-step process:
1. Update requirements files in https://github.com/vllm-project/vllm/tree/main/requirements 1. Update [requirements files](https://github.com/vllm-project/vllm/tree/main/requirements)
to point to the new releases for torch, torchvision, and torchaudio. to point to the new releases for `torch`, `torchvision`, and `torchaudio`.
2. Use `--extra-index-url https://download.pytorch.org/whl/test/<PLATFORM>` to
get the final release candidates' wheels. Some common platforms are `cpu`, `cu128`, 2. Use the following option to get the final release candidates' wheels. Some common platforms are `cpu`, `cu128`, and `rocm6.2.4`.
and `rocm6.2.4`.
3. As vLLM uses uv, make sure that `unsafe-best-match` strategy is set either ```bash
via `UV_INDEX_STRATEGY` env variable or via `--index-strategy unsafe-best-match`. --extra-index-url https://download.pytorch.org/whl/test/<PLATFORM>
```
3. Since vLLM uses `uv`, ensure the following index strategy is applied:
- Via environment variable:
```bash
export UV_INDEX_STRATEGY=unsafe-best-match
```
- Or via CLI flag:
```bash
--index-strategy unsafe-best-match
```
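
Steps 2 and 3 can be combined into a single install command. The sketch below only assembles the argument list from the flags quoted above; the function name `rc_install_command` is a hypothetical helper, and the command is printed rather than executed.

```python
import shlex

# Test index for PyTorch release candidates, as referenced above.
PYTORCH_TEST_INDEX = "https://download.pytorch.org/whl/test"


def rc_install_command(platform: str) -> list[str]:
    """Build a `uv pip install` command for RC wheels (hypothetical helper)."""
    return [
        "uv", "pip", "install",
        "torch", "torchvision", "torchaudio",
        # Step 2: pull the release candidates from the test index.
        "--extra-index-url", f"{PYTORCH_TEST_INDEX}/{platform}",
        # Step 3: let uv pick the best match across all indexes.
        "--index-strategy", "unsafe-best-match",
    ]


print(shlex.join(rc_install_command("cu128")))
```

Keeping the command as a list makes it safe to pass to `subprocess.run` without shell quoting issues, should you want to run it from a script.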
If failures are found in the pull request, raise them as issues on vLLM and If failures are found in the pull request, raise them as issues on vLLM and
cc the PyTorch release team to initiate discussion on how to address them. cc the PyTorch release team to initiate discussion on how to address them.
...@@ -45,20 +58,25 @@ cc the PyTorch release team to initiate discussion on how to address them. ...@@ -45,20 +58,25 @@ cc the PyTorch release team to initiate discussion on how to address them.
## Update CUDA version ## Update CUDA version
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example,
torch2.7.0+cu12.6) is uploaded to PyPI. However, vLLM may require a different CUDA version, `torch2.7.0+cu12.6`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
such as 12.8 for Blackwell support. such as 12.8 for Blackwell support.
This complicates the process as we cannot use the out-of-the-box This complicates the process as we cannot use the out-of-the-box
`pip install torch torchvision torchaudio` command. The solution is to use `pip install torch torchvision torchaudio` command. The solution is to use
`--extra-index-url` in vLLM's Dockerfiles. `--extra-index-url` in vLLM's Dockerfiles.
1. Use `--extra-index-url https://download.pytorch.org/whl/cu128` to install torch+cu128. - Important indexes at the moment include:
2. Other important indexes at the moment include:
1. CPU ‒ https://download.pytorch.org/whl/cpu | Platform | `--extra-index-url` |
2. ROCm ‒ https://download.pytorch.org/whl/rocm6.2.4 and https://download.pytorch.org/whl/rocm6.3 |----------|-----------------|
3. XPU ‒ https://download.pytorch.org/whl/xpu | CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
3. Update .buildkite/release-pipeline.yaml and .buildkite/scripts/upload-wheels.sh to | CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested | ROCm 6.2 | [https://download.pytorch.org/whl/rocm6.2.4](https://download.pytorch.org/whl/rocm6.2.4) |
on CI. | ROCm 6.3 | [https://download.pytorch.org/whl/rocm6.3](https://download.pytorch.org/whl/rocm6.3) |
| XPU | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
- Update the below files to match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested on CI.
- `.buildkite/release-pipeline.yaml`
- `.buildkite/scripts/upload-wheels.sh`
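
For scripting, the index table above can be kept as a small lookup. This is a sketch only: the dictionary keys and the function name `extra_index_args` are illustrative choices, and the URLs are copied from the table.

```python
# Platform keys are illustrative; URLs come from the table above.
PYTORCH_INDEXES = {
    "cu128": "https://download.pytorch.org/whl/cu128",
    "cpu": "https://download.pytorch.org/whl/cpu",
    "rocm6.2.4": "https://download.pytorch.org/whl/rocm6.2.4",
    "rocm6.3": "https://download.pytorch.org/whl/rocm6.3",
    "xpu": "https://download.pytorch.org/whl/xpu",
}


def extra_index_args(platform: str) -> list[str]:
    """Return the `--extra-index-url` arguments for a known platform."""
    try:
        url = PYTORCH_INDEXES[platform]
    except KeyError:
        known = ", ".join(sorted(PYTORCH_INDEXES))
        raise ValueError(f"unknown platform {platform!r}; known: {known}") from None
    return ["--extra-index-url", url]


print(extra_index_args("cu128"))
```

Failing loudly on an unknown platform avoids silently falling back to the default PyPI wheel, which may target a different CUDA version.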
## Address long vLLM build time ## Address long vLLM build time
...@@ -68,8 +86,8 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod ...@@ -68,8 +86,8 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
it doesn't populate the cache, so re-running it to warm up the cache it doesn't populate the cache, so re-running it to warm up the cache
is ineffective. is ineffective.
While ongoing efforts like [#17419](https://github.com/vllm-project/vllm/issues/17419) While ongoing efforts like [#17419](gh-issue:17419)
address the long build time at its source, the current workaround is to set VLLM_CI_BRANCH address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`) to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
when manually triggering a build on Buildkite. This branch accomplishes two things: when manually triggering a build on Buildkite. This branch accomplishes two things:
...@@ -89,17 +107,18 @@ releases (which would take too much time), they can be built from ...@@ -89,17 +107,18 @@ releases (which would take too much time), they can be built from
source to unblock the update process. source to unblock the update process.
### FlashInfer ### FlashInfer
Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
```bash ```bash
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
export FLASHINFER_ENABLE_SM90=1 export FLASHINFER_ENABLE_SM90=1
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1" uv pip install --system \
--no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
``` ```
One caveat is that building FlashInfer from source adds approximately 30 One caveat is that building FlashInfer from source adds approximately 30
minutes to the vLLM build time. Therefore, it's preferable to cache the wheel in a minutes to the vLLM build time. Therefore, it's preferable to cache the wheel in a
public location for immediate installation, such as https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl. For future releases, contact the PyTorch release public location for immediate installation, such as [this FlashInfer wheel link](https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl). For future releases, contact the PyTorch release
team if you want to get the package published there. team if you want to get the package published there.
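
Note that the wheel file name contains a `+` (the local version separator), which appears percent-encoded as `%2B` in the link above. A minimal sketch of building that URL with the standard library, using only the file name and location from this section:

```python
from urllib.parse import quote

BASE_URL = "https://download.pytorch.org/whl/cu128/flashinfer"
WHEEL_NAME = "flashinfer_python-0.2.6.post1+cu128torch2.7-cp39-abi3-linux_x86_64.whl"

# quote() percent-encodes the "+" in the local version as "%2B".
wheel_url = f"{BASE_URL}/{quote(WHEEL_NAME)}"
print(wheel_url)
```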
### xFormers ### xFormers
...@@ -107,13 +126,15 @@ Similar to FlashInfer, here is how to build and install xFormers from source: ...@@ -107,13 +126,15 @@ Similar to FlashInfer, here is how to build and install xFormers from source:
```bash ```bash
export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX' export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30" MAX_JOBS=16 uv pip install --system \
--no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
``` ```
### Mamba ### Mamba
```bash ```bash
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4" uv pip install --system \
--no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
``` ```
### causal-conv1d ### causal-conv1d
...@@ -128,7 +149,6 @@ Rather than attempting to update all vLLM platforms in a single pull request, it ...@@ -128,7 +149,6 @@ Rather than attempting to update all vLLM platforms in a single pull request, it
to handle some platforms separately. The separation of requirements and Dockerfiles to handle some platforms separately. The separation of requirements and Dockerfiles
for different platforms in vLLM CI/CD allows us to selectively choose for different platforms in vLLM CI/CD allows us to selectively choose
which platforms to update. For instance, updating XPU requires the corresponding which platforms to update. For instance, updating XPU requires the corresponding
release from https://github.com/intel/intel-extension-for-pytorch by Intel. release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
While https://github.com/vllm-project/vllm/pull/16859 updated vLLM to PyTorch While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
2.7.0 on CPU, CUDA, and ROCm, https://github.com/vllm-project/vllm/pull/17444 <gh-pr:17444> completed the update for XPU.
completed the update for XPU.
# Dockerfile # Dockerfile
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM. We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here][deployment-docker]. More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
......
...@@ -84,6 +84,7 @@ Below is an example of what the generated `CMakeUserPresets.json` might look lik ...@@ -84,6 +84,7 @@ Below is an example of what the generated `CMakeUserPresets.json` might look lik
``` ```
**What do the various configurations mean?** **What do the various configurations mean?**
- `CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically. - `CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
- `CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default. - `CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
- `VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable. - `VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
...@@ -98,16 +99,16 @@ Once your `CMakeUserPresets.json` is configured: ...@@ -98,16 +99,16 @@ Once your `CMakeUserPresets.json` is configured:
1. **Initialize the CMake build environment:** 1. **Initialize the CMake build environment:**
This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir` This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`
```console ```console
cmake --preset release cmake --preset release
``` ```
2. **Build and install the vLLM components:** 2. **Build and install the vLLM components:**
This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation. This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
```console ```console
cmake --build --preset release --target install cmake --build --preset release --target install
``` ```
3. **Make changes and repeat!** 3. **Make changes and repeat!**
Now you start using your editable install of vLLM, testing and making changes as needed. If you need to build again to update based on changes, simply run the CMake command again to build only the affected files. Now you start using your editable install of vLLM, testing and making changes as needed. If you need to build again to update based on changes, simply run the CMake command again to build only the affected files.
......
--- # Summary
title: Summary
---
[](){ #new-model }
!!! important !!! important
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first! Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance. vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/compatibility_matrix.md) to optimize their performance.
The complexity of integrating a model into vLLM depends heavily on the model's architecture. The complexity of integrating a model into vLLM depends heavily on the model's architecture.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM. The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
......
--- # Basic Model
title: Basic Model
---
[](){ #new-model-basic }
This guide walks you through the steps to implement a basic vLLM model. This guide walks you through the steps to implement a basic vLLM model.
...@@ -27,7 +24,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons ...@@ -27,7 +24,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons
The initialization code should look like this: The initialization code should look like this:
??? Code ??? code
```python ```python
from torch import nn from torch import nn
...@@ -76,6 +73,8 @@ def forward( ...@@ -76,6 +73,8 @@ def forward(
self, self,
input_ids: torch.Tensor, input_ids: torch.Tensor,
positions: torch.Tensor, positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor: ) -> torch.Tensor:
... ...
``` ```
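The two optional arguments are usually consumed at the very top of `forward`. The sketch below is a plain-Python stand-in for that dispatch, not vLLM's actual implementation: it uses lists instead of `torch.Tensor`, a hypothetical `is_first_rank` flag in place of a pipeline-parallel group lookup, and the hypothetical helper names `pick_hidden_states` and `toy_embed`.

```python
from typing import Callable, Optional

Vector = list[float]


def pick_hidden_states(
    input_ids: list[int],
    embed: Callable[[list[int]], list[Vector]],
    intermediate_tensors: Optional[dict[str, list[Vector]]] = None,
    inputs_embeds: Optional[list[Vector]] = None,
    is_first_rank: bool = True,  # hypothetical flag, for illustration only
) -> list[Vector]:
    """Choose the starting hidden states for a decoder forward pass."""
    if not is_first_rank:
        # Later pipeline stages continue from the previous stage's output.
        if intermediate_tensors is None:
            raise ValueError("later pipeline stages need intermediate_tensors")
        return intermediate_tensors["hidden_states"]
    if inputs_embeds is not None:
        # Precomputed embeddings (e.g. merged multi-modal ones) take priority.
        return inputs_embeds
    # Otherwise fall back to embedding the token IDs.
    return embed(input_ids)


def toy_embed(ids: list[int]) -> list[Vector]:
    """Hypothetical embedding lookup: one 2-dimensional vector per token."""
    return [[float(i), 0.0] for i in ids]
```

In this sketch, `inputs_embeds` only takes effect on the first stage, and `intermediate_tensors` is only read on later stages, which is why both can default to `None`.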
...@@ -108,7 +107,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a ...@@ -108,7 +107,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model ## 5. Register your model
See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM. See [this page](registration.md) for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions ## Frequently Asked Questions
......
--- # Multi-Modal Support
title: Multi-Modal Support
---
[](){ #supports-multimodal }
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs]. This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
## 1. Update the base vLLM model ## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic]. It is assumed that you have already implemented the model in vLLM according to [these steps](basic.md).
Further update the model as follows: Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model. - Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
??? Code ??? code
```python ```python
class YourModelForImage2Seq(nn.Module): class YourModelForImage2Seq(nn.Module):
...@@ -41,7 +38,7 @@ Further update the model as follows: ...@@ -41,7 +38,7 @@ Further update the model as follows:
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs. - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
??? Code ??? code
```python ```python
class YourModelForImage2Seq(nn.Module): class YourModelForImage2Seq(nn.Module):
...@@ -71,7 +68,7 @@ Further update the model as follows: ...@@ -71,7 +68,7 @@ Further update the model as follows:
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings. - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
??? Code ??? code
```python ```python
from .utils import merge_multimodal_embeddings from .utils import merge_multimodal_embeddings
...@@ -155,7 +152,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -155,7 +152,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `LlavaForConditionalGeneration`: Looking at the code of HF's `LlavaForConditionalGeneration`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
...@@ -179,7 +176,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -179,7 +176,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of placeholder feature tokens per image is `image_features.shape[1]`. The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method: `image_features` is calculated inside the `get_image_features` method:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
...@@ -217,7 +214,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -217,7 +214,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`: To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257 # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
...@@ -244,7 +241,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -244,7 +241,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Overall, the number of placeholder feature tokens for an image can be calculated as: Overall, the number of placeholder feature tokens for an image can be calculated as:
??? Code ??? code
```python ```python
def get_num_image_tokens( def get_num_image_tokens(
...@@ -269,7 +266,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -269,7 +266,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
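As a worked check of the `get_num_image_tokens` calculation, the arithmetic can be reproduced without loading any model. This is a minimal sketch: the function name `clip_num_image_tokens` is hypothetical, and the values `image_size=336` and `patch_size=14` are assumed (they match a common CLIP ViT-L/14 configuration, but read them from your model's config in practice).

```python
def clip_num_image_tokens(
    image_size: int,
    patch_size: int,
    feature_select_strategy: str = "default",
) -> int:
    """Count placeholder tokens for a CLIP-style vision tower."""
    # One embedding per non-overlapping patch, plus one class (CLS) embedding.
    num_patches = (image_size // patch_size) ** 2
    num_positions = num_patches + 1

    if feature_select_strategy == "default":
        # The "default" strategy drops the CLS embedding.
        return num_positions - 1
    if feature_select_strategy == "full":
        return num_positions
    raise ValueError(f"unknown strategy: {feature_select_strategy!r}")


# Assumed config values: 336 // 14 = 24 patches per side, 24 * 24 = 576.
print(clip_num_image_tokens(336, 14))          # 576
print(clip_num_image_tokens(336, 14, "full"))  # 577
```

Only the vision tower's configured `image_size` and `patch_size` enter the formula; the dimensions of the input image do not.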
Notice that the number of image tokens doesn't depend on the image width and height. Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data: We can simply use a dummy `image_size` to calculate the multimodal profiling data:
??? Code ??? code
```python ```python
# NOTE: In actuality, this is usually implemented as part of the # NOTE: In actuality, this is usually implemented as part of the
...@@ -314,7 +311,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -314,7 +311,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `FuyuForCausalLM`: Looking at the code of HF's `FuyuForCausalLM`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
...@@ -344,7 +341,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -344,7 +341,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`, In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata. returning the dimensions after resizing (but before padding) as metadata.
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
...@@ -382,7 +379,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -382,7 +379,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata: In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
...@@ -420,7 +417,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -420,7 +417,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`: The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
...@@ -457,7 +454,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -457,7 +454,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
For the multimodal image profiling data, the logic is very similar to LLaVA: For the multimodal image profiling data, the logic is very similar to LLaVA:
??? Code ??? code
```python ```python
def get_dummy_mm_data( def get_dummy_mm_data(
...@@ -483,7 +480,7 @@ Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.proce ...@@ -483,7 +480,7 @@ Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.proce
to fill in the missing details about HF processing. to fill in the missing details about HF processing.
!!! info !!! info
[Multi-Modal Data Processing][mm-processing] [Multi-Modal Data Processing](../../design/mm_processing.md)
### Multi-modal fields ### Multi-modal fields
...@@ -546,7 +543,7 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -546,7 +543,7 @@ return a schema of the tensors outputted by the HF processor that are related to
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA, In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]: we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
??? Code ??? code
```python ```python
def _call_hf_processor( def _call_hf_processor(
...@@ -623,7 +620,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -623,7 +620,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`). It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows: Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
??? Code ??? code
```python ```python
def _get_prompt_updates( def _get_prompt_updates(
...@@ -668,7 +665,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -668,7 +665,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
We define a helper function to return `ncols` and `nrows` directly: We define a helper function to return `ncols` and `nrows` directly:
??? Code ??? code
```python ```python
def get_image_feature_grid_size( def get_image_feature_grid_size(
...@@ -698,7 +695,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -698,7 +695,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Based on this, we can initially define our replacement tokens as: Based on this, we can initially define our replacement tokens as:
??? Code ??? code
```python ```python
def get_replacement(item_idx: int): def get_replacement(item_idx: int):
...@@ -718,7 +715,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -718,7 +715,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
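To make the row-by-row layout concrete, here is a small standalone sketch of this first-attempt replacement. It assumes 30x30 patches and ignores the resize step that the real processor applies to large images; `grid_size`, `row_major_replacement`, and the token strings `"<img>"` and `"<nl>"` are hypothetical stand-ins, not Fuyu's actual tokens.

```python
import math


def grid_size(width: int, height: int, patch: int = 30) -> tuple[int, int]:
    """Return (ncols, nrows) of patches covering the image (assumed patch size)."""
    return math.ceil(width / patch), math.ceil(height / patch)


def row_major_replacement(
    ncols: int,
    nrows: int,
    image_token: str = "<img>",   # hypothetical placeholder token
    newline_token: str = "<nl>",  # hypothetical row separator token
) -> list[str]:
    """Emit one image token per patch, ending each patch row with a newline."""
    return ([image_token] * ncols + [newline_token]) * nrows


ncols, nrows = grid_size(300, 180)
tokens = row_major_replacement(ncols, nrows)
print(ncols, nrows, len(tokens))  # 10 6 66
```

Each row contributes `ncols` image tokens plus one separator, so the replacement has `(ncols + 1) * nrows` tokens in total.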
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called, However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the prompt: a BOS token (`<s>`) is also added to the prompt:
??? Code ??? code
```python ```python
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435 # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
...@@ -745,7 +742,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -745,7 +742,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
To assign the vision embeddings to only the image tokens, instead of a string To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]: you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
??? Code ??? code
```python ```python
hf_config = self.info.get_hf_config() hf_config = self.info.get_hf_config()
...@@ -772,7 +769,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -772,7 +769,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt, Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string: we can search for it to conduct the replacement at the start of the string:
??? Code ??? code
```python ```python
def _get_prompt_updates( def _get_prompt_updates(
...@@ -819,7 +816,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -819,7 +816,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2), After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3), [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4), and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor <vllm.multimodal.registry.MultiModalRegistry.register_processor>` decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.processing.MultiModalRegistry.register_processor]
to register them to the multi-modal registry: to register them to the multi-modal registry:
```diff ```diff
...@@ -846,7 +843,7 @@ Examples: ...@@ -846,7 +843,7 @@ Examples:
### Handling prompt updates unrelated to multi-modal data ### Handling prompt updates unrelated to multi-modal data
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing]. [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](../../design/mm_processing.md).
Examples: Examples:
......
--- # Registering a Model
title: Registering a Model
---
[](){ #new-model-registration }
vLLM relies on a model registry to determine how to run each model. vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here][supported-models]. A list of pre-registered architectures can be found [here](../../models/supported_models.md).
If your model is not on this list, you must register it to vLLM. If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so. This page provides detailed instructions on how to do so.
...@@ -14,16 +11,16 @@ This page provides detailed instructions on how to do so. ...@@ -14,16 +11,16 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source]. To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model. This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory. After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM. Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models][supported-models] to promote your model! Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important !!! important
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models ## Out-of-tree models
You can load an external model [using a plugin][plugin-system] without modifying the vLLM codebase. You can load an external model [using a plugin](../../design/plugin_system.md) without modifying the vLLM codebase.
To register the model, use the following code: To register the model, use the following code:
...@@ -51,4 +48,4 @@ def register(): ...@@ -51,4 +48,4 @@ def register():
!!! important !!! important
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface. If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
Read more about that [here][supports-multimodal]. Read more about that [here](multimodal.md).
--- # Unit Testing
title: Unit Testing
---
[](){ #new-model-tests }
This page explains how to write unit tests to verify the implementation of your model. This page explains how to write unit tests to verify the implementation of your model.
......