Unverified Commit 4ffd6e89 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

[Docs] Reduce custom syntax used in docs (#27009)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent 965c5f49
...@@ -23,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2) ...@@ -23,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
!!! note !!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Quantization ## Quantization
......
...@@ -174,14 +174,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u ...@@ -174,14 +174,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
Known supported models (with corresponding benchmarks): Known supported models (with corresponding benchmarks):
- dots_ocr (<gh-pr:25466>) - dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
- GLM-4.1V or above (<gh-pr:23168>) - GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
- InternVL (<gh-pr:23909>) - InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
- Kimi-VL (<gh-pr:23817>) - Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
- Llama4 (<gh-pr:18368>) - Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>) - MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>) - Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
- Step3 (<gh-pr:22697>) - Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
## Input Processing ## Input Processing
......
...@@ -96,7 +96,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ ...@@ -96,7 +96,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ
### Tune your workloads ### Tune your workloads
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case. Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
### Future Topics We'll Cover ### Future Topics We'll Cover
......
...@@ -22,7 +22,7 @@ Unsure on where to start? Check out the following links for tasks to work on: ...@@ -22,7 +22,7 @@ Unsure on where to start? Check out the following links for tasks to work on:
## License ## License
See <gh-file:LICENSE>. See [LICENSE](../../LICENSE).
## Developing ## Developing
...@@ -54,7 +54,7 @@ For more details about installing from source and installing for other hardware, ...@@ -54,7 +54,7 @@ For more details about installing from source and installing for other hardware,
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations. For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
!!! tip !!! tip
vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](gh-file:docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12. vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](../../docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment. Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
...@@ -88,7 +88,7 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit. ...@@ -88,7 +88,7 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
### Documentation ### Documentation
MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, <gh-file:mkdocs.yaml>. MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, [mkdocs.yaml](../../mkdocs.yaml).
Get started with: Get started with:
...@@ -152,7 +152,7 @@ pytest -s -v tests/test_logger.py ...@@ -152,7 +152,7 @@ pytest -s -v tests/test_logger.py
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! important !!! important
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability). If you discover a security vulnerability, please follow the instructions [here](../../SECURITY.md).
## Pull Requests & Code Reviews ## Pull Requests & Code Reviews
...@@ -162,7 +162,7 @@ code quality and improve the efficiency of the review process. ...@@ -162,7 +162,7 @@ code quality and improve the efficiency of the review process.
### DCO and Signed-off-by ### DCO and Signed-off-by
When contributing changes to this project, you must agree to the <gh-file:DCO>. When contributing changes to this project, you must agree to the [DCO](../../DCO).
Commits must include a `Signed-off-by:` header which certifies agreement with Commits must include a `Signed-off-by:` header which certifies agreement with
the terms of the DCO. the terms of the DCO.
......
...@@ -822,7 +822,7 @@ you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backen ...@@ -822,7 +822,7 @@ you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backen
- CLIP: `--backend openai-embeddings-clip` - CLIP: `--backend openai-embeddings-clip`
- VLM2Vec: `--backend openai-embeddings-vlm2vec` - VLM2Vec: `--backend openai-embeddings-vlm2vec`
For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format. For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it. You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings. For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
...@@ -962,7 +962,7 @@ For more results visualization, check the [visualizing the results](https://gith ...@@ -962,7 +962,7 @@ For more results visualization, check the [visualizing the results](https://gith
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm). The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md). More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
### Continuous Benchmarking ### Continuous Benchmarking
...@@ -996,4 +996,4 @@ These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lm ...@@ -996,4 +996,4 @@ These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lm
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html). The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md). More information on the nightly benchmarks and their parameters can be found [here](../../.buildkite/nightly-benchmarks/nightly-descriptions.md).
...@@ -64,7 +64,7 @@ Download the full log file from Buildkite locally. ...@@ -64,7 +64,7 @@ Download the full log file from Buildkite locally.
Strip timestamps and colorization: Strip timestamps and colorization:
<gh-file:.buildkite/scripts/ci-clean-log.sh> [.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
```bash ```bash
./ci-clean-log.sh ci.log ./ci-clean-log.sh ci.log
...@@ -87,7 +87,7 @@ tail -525 ci_build.log | wl-copy ...@@ -87,7 +87,7 @@ tail -525 ci_build.log | wl-copy
CI test failures may be flaky. Use a bash loop to run repeatedly: CI test failures may be flaky. Use a bash loop to run repeatedly:
<gh-file:.buildkite/scripts/rerun-test.sh> [.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
```bash ```bash
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp] ./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]
......
...@@ -5,7 +5,7 @@ release in CI/CD. It is standard practice to submit a PR to update the ...@@ -5,7 +5,7 @@ release in CI/CD. It is standard practice to submit a PR to update the
PyTorch version as early as possible when a new [PyTorch stable PyTorch version as early as possible when a new [PyTorch stable
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available. release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
This process is non-trivial due to the gap between PyTorch This process is non-trivial due to the gap between PyTorch
releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this releases. Using <https://github.com/vllm-project/vllm/pull/16859> as an example, this document outlines common steps to achieve this
update along with a list of potential issues and how to address them. update along with a list of potential issues and how to address them.
## Test PyTorch release candidates (RCs) ## Test PyTorch release candidates (RCs)
...@@ -85,7 +85,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod ...@@ -85,7 +85,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
it doesn't populate the cache, so re-running it to warm up the cache it doesn't populate the cache, so re-running it to warm up the cache
is ineffective. is ineffective.
While ongoing efforts like [#17419](gh-issue:17419) While ongoing efforts like <https://github.com/vllm-project/vllm/issues/17419>
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH` address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`) to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
when manually triggering a build on Buildkite. This branch accomplishes two things: when manually triggering a build on Buildkite. This branch accomplishes two things:
...@@ -138,5 +138,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi ...@@ -138,5 +138,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
for different platforms in vLLM CI/CD allows us to selectively choose for different platforms in vLLM CI/CD allows us to selectively choose
which platforms to update. For instance, updating XPU requires the corresponding which platforms to update. For instance, updating XPU requires the corresponding
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel. release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm, While <https://github.com/vllm-project/vllm/pull/16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
<gh-pr:17444> completed the update for XPU. <https://github.com/vllm-project/vllm/pull/17444> completed the update for XPU.
# Dockerfile # Dockerfile
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM. We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../deployment/docker.md). More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes: Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
......
...@@ -5,7 +5,7 @@ This guide walks you through the steps to implement a basic vLLM model. ...@@ -5,7 +5,7 @@ This guide walks you through the steps to implement a basic vLLM model.
## 1. Bring your model code ## 1. Bring your model code
First, clone the PyTorch model code from the source repository. First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from For instance, vLLM's [OPT model](../../../vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file. HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
!!! warning !!! warning
...@@ -83,7 +83,7 @@ def forward( ...@@ -83,7 +83,7 @@ def forward(
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings. Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM. If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples. For reference, check out our [Llama implementation](../../../vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out [vllm/model_executor/models](../../../vllm/model_executor/models) for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support ## 3. (Optional) Implement tensor parallelism and quantization support
...@@ -130,22 +130,22 @@ We consider 3 different scenarios: ...@@ -130,22 +130,22 @@ We consider 3 different scenarios:
2. Models that combine Mamba layers (either Mamba-1 or Mamba-2) together with attention layers. 2. Models that combine Mamba layers (either Mamba-1 or Mamba-2) together with attention layers.
3. Models that combine Mamba-like mechanisms (e.g., Linear Attention, ShortConv) together with attention layers. 3. Models that combine Mamba-like mechanisms (e.g., Linear Attention, ShortConv) together with attention layers.
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference. For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](../../../vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](../../../vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config. The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes. For the mamba layers themselves, please use the [`MambaMixer`](../../../vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](../../../vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations. Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
V0-only classes and code will be removed in the very near future. V0-only classes and code will be removed in the very near future.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in <gh-file:vllm/model_executor/models/config.py> to ensure that the runtime defaults are optimized. The model should also be added to the `MODELS_CONFIG_MAP` dictionary in [vllm/model_executor/models/config.py](../../../vllm/model_executor/models/config.py) to ensure that the runtime defaults are optimized.
For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together). For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](../../../vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](../../../vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
These models should follow the same instructions as case (1), but they should inherit protocol `IsHybrid` (instead of `IsAttentionFree`) and it is *not* necessary to add them to the `MODELS_CONFIG_MAP` (their runtime defaults will be inferred from the protocol). These models should follow the same instructions as case (1), but they should inherit protocol `IsHybrid` (instead of `IsAttentionFree`) and it is *not* necessary to add them to the `MODELS_CONFIG_MAP` (their runtime defaults will be inferred from the protocol).
For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](gh-file:vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](gh-file:vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively. For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](../../../vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](../../../vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
Please follow the same guidelines as case (2) for implementing these models. Please follow the same guidelines as case (2) for implementing these models.
We use "mamba-like" to refer to layers that posses a state that is updated in-place, rather than being appended-to (like KV cache for attention). We use "mamba-like" to refer to layers that posses a state that is updated in-place, rather than being appended-to (like KV cache for attention).
For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype`, `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`. For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype`, `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`.
It is also necessary to implement the "attention meta-data" class which handles the meta-data that is common across all layers. It is also necessary to implement the "attention meta-data" class which handles the meta-data that is common across all layers.
Please see [`LinearAttentionMetadata`](gh-file:vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](gh-file:v1/attention/backends/short_conv_attn.py) for examples of this. Please see [`LinearAttentionMetadata`](../../../vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](../../../vllm/v1/attention/backends/short_conv_attn.py) for examples of this.
Finally, if one wants to support torch compile and CUDA graphs, it necessary to wrap the call to the mamba-like layer inside a custom op and register it. Finally, if one wants to support torch compile and CUDA graphs, it necessary to wrap the call to the mamba-like layer inside a custom op and register it.
Please see the calls to `direct_register_custom_op` in <gh-file:vllm/model_executor/models/minimax_text_01.py> or <gh-file:vllm/model_executor/layers/mamba/short_conv.py> for examples of this. Please see the calls to `direct_register_custom_op` in [vllm/model_executor/models/minimax_text_01.py](../../../vllm/model_executor/models/minimax_text_01.py) or [vllm/model_executor/layers/mamba/short_conv.py](../../../vllm/model_executor/layers/mamba/short_conv.py) for examples of this.
The new custom op should then be added to the list `_attention_ops` in <gh-file:vllm/config/compilation.py> to ensure that piecewise CUDA graphs works as intended. The new custom op should then be added to the list `_attention_ops` in [vllm/config/compilation.py](../../../vllm/config/compilation.py) to ensure that piecewise CUDA graphs works as intended.
...@@ -507,7 +507,7 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -507,7 +507,7 @@ return a schema of the tensors outputted by the HF processor that are related to
``` ```
!!! note !!! note
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports Our [actual code](../../../vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument. pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
=== "With postprocessing: Fuyu" === "With postprocessing: Fuyu"
...@@ -569,7 +569,7 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -569,7 +569,7 @@ return a schema of the tensors outputted by the HF processor that are related to
``` ```
!!! note !!! note
Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling Our [actual code](../../../vllm/model_executor/models/fuyu.py) has special handling
for text-only inputs to prevent unnecessary warnings from HF processor. for text-only inputs to prevent unnecessary warnings from HF processor.
!!! note !!! note
...@@ -828,8 +828,8 @@ Some HF processors directly insert feature tokens without replacing anything in ...@@ -828,8 +828,8 @@ Some HF processors directly insert feature tokens without replacing anything in
Examples: Examples:
- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py> - BLIP-2 (insert at start of prompt): [vllm/model_executor/models/blip2.py](../../../vllm/model_executor/models/blip2.py)
- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py> - Molmo (insert after `<|endoftext|>` token): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Handling prompt updates unrelated to multi-modal data ### Handling prompt updates unrelated to multi-modal data
...@@ -837,9 +837,9 @@ Examples: ...@@ -837,9 +837,9 @@ Examples:
Examples: Examples:
- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py> - Chameleon (appends `sep_token`): [vllm/model_executor/models/chameleon.py](../../../vllm/model_executor/models/chameleon.py)
- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py> - Fuyu (appends `boa_token`): [vllm/model_executor/models/fuyu.py](../../../vllm/model_executor/models/fuyu.py)
- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py> - Molmo (applies chat template which is not defined elsewhere): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Custom HF processor ### Custom HF processor
...@@ -847,6 +847,6 @@ Some models don't define an HF processor class on HF Hub. In that case, you can ...@@ -847,6 +847,6 @@ Some models don't define an HF processor class on HF Hub. In that case, you can
Examples: Examples:
- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py> - DeepSeek-VL2: [vllm/model_executor/models/deepseek_vl2.py](../../../vllm/model_executor/models/deepseek_vl2.py)
- InternVL: <gh-file:vllm/model_executor/models/internvl.py> - InternVL: [vllm/model_executor/models/internvl.py](../../../vllm/model_executor/models/internvl.py)
- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py> - Qwen-VL: [vllm/model_executor/models/qwen_vl.py](../../../vllm/model_executor/models/qwen_vl.py)
...@@ -11,8 +11,8 @@ This page provides detailed instructions on how to do so. ...@@ -11,8 +11,8 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source]. To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model. This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory. After you have implemented your model (see [tutorial](basic.md)), put it into the [vllm/model_executor/models](../../../vllm/model_executor/models) directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM. Then, add your model class to `_VLLM_MODELS` in [vllm/model_executor/models/registry.py](../../../vllm/model_executor/models/registry.py) so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model! Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important !!! important
......
...@@ -9,7 +9,7 @@ Without them, the CI for your PR will fail. ...@@ -9,7 +9,7 @@ Without them, the CI for your PR will fail.
### Model loading ### Model loading
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>. Include an example HuggingFace repository for your model in [tests/models/registry.py](../../../tests/models/registry.py).
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM. This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! important !!! important
...@@ -26,18 +26,18 @@ Passing these tests provides more confidence that your implementation is correct ...@@ -26,18 +26,18 @@ Passing these tests provides more confidence that your implementation is correct
### Model correctness ### Model correctness
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of <gh-dir:tests/models>. These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of [tests/models](../../../tests/models).
#### Generative models #### Generative models
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>: For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in [tests/models/utils.py](../../../tests/models/utils.py):
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF. - Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa. - Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
#### Pooling models #### Pooling models
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in <gh-file:tests/models/utils.py>. For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
[](){ #mm-processing-tests } [](){ #mm-processing-tests }
...@@ -45,7 +45,7 @@ For [pooling models](../../models/pooling_models.md), we simply check the cosine ...@@ -45,7 +45,7 @@ For [pooling models](../../models/pooling_models.md), we simply check the cosine
#### Common tests #### Common tests
Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> verifies that the following input combinations result in the same outputs: Adding your model to [tests/models/multimodal/processing/test_common.py](../../../tests/models/multimodal/processing/test_common.py) verifies that the following input combinations result in the same outputs:
- Text + multi-modal data - Text + multi-modal data
- Tokens + multi-modal data - Tokens + multi-modal data
...@@ -54,6 +54,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> ...@@ -54,6 +54,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py>
#### Model-specific tests #### Model-specific tests
You can add a new file under <gh-dir:tests/models/multimodal/processing> to run tests that only apply to your model. You can add a new file under [tests/models/multimodal/processing](../../../tests/models/multimodal/processing) to run tests that only apply to your model.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in <gh-file:tests/models/multimodal/processing/test_phi3v.py>. For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in [tests/models/multimodal/processing/test_phi3v.py](../../../tests/models/multimodal/processing/test_phi3v.py).
...@@ -248,9 +248,9 @@ No extra registration is required beyond having your model class available via t ...@@ -248,9 +248,9 @@ No extra registration is required beyond having your model class available via t
## Examples in-tree ## Examples in-tree
- Whisper encoder–decoder (audio-only): <gh-file:vllm/model_executor/models/whisper.py> - Whisper encoder–decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): <gh-file:vllm/model_executor/models/voxtral.py> - Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py)
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py> - Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API ## Test with the API
...@@ -278,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints ( ...@@ -278,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
http://localhost:8000/v1/audio/translations http://localhost:8000/v1/audio/translations
``` ```
Or check out more examples in <gh-file:examples/online_serving>. Or check out more examples in [examples/online_serving](../../../examples/online_serving).
!!! note !!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking. - If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
......
...@@ -33,7 +33,7 @@ Traces can be visualized using <https://ui.perfetto.dev/>. ...@@ -33,7 +33,7 @@ Traces can be visualized using <https://ui.perfetto.dev/>.
#### Offline Inference #### Offline Inference
Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example. Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py) for an example.
#### OpenAI Server #### OpenAI Server
......
...@@ -37,7 +37,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af ...@@ -37,7 +37,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
memory to share data between processes under the hood, particularly for tensor parallel inference. memory to share data between processes under the hood, particularly for tensor parallel inference.
!!! note !!! note
Optional dependencies are not included in order to avoid licensing issues (e.g. <gh-issue:8030>). Optional dependencies are not included in order to avoid licensing issues (e.g. <https://github.com/vllm-project/vllm/issues/8030>).
If you need to use those dependencies (having accepted the license terms), If you need to use those dependencies (having accepted the license terms),
create a custom Dockerfile on top of the base image with an extra layer that installs them: create a custom Dockerfile on top of the base image with an extra layer that installs them:
...@@ -66,7 +66,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af ...@@ -66,7 +66,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
## Building vLLM's Docker Image from Source ## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM: You can build and run vLLM from source via the provided [docker/Dockerfile](../../docker/Dockerfile). To build vLLM:
```bash ```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
......
...@@ -5,7 +5,7 @@ ...@@ -5,7 +5,7 @@
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray. [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like <gh-file:examples/online_serving/run_cluster.sh>. without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like [examples/online_serving/run_cluster.sh](../../../examples/online_serving/run_cluster.sh).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm). When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).
......
...@@ -36,7 +36,7 @@ pip install -U vllm \ ...@@ -36,7 +36,7 @@ pip install -U vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
``` ```
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py> 1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_langchain.py](../../../examples/online_serving/retrieval_augmented_generation_with_langchain.py)
1. Run the script 1. Run the script
...@@ -74,7 +74,7 @@ pip install vllm \ ...@@ -74,7 +74,7 @@ pip install vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
``` ```
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py> 1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_llamaindex.py](../../../examples/online_serving/retrieval_augmented_generation_with_llamaindex.py)
1. Run the script: 1. Run the script:
......
...@@ -20,7 +20,7 @@ pip install vllm streamlit openai ...@@ -20,7 +20,7 @@ pip install vllm streamlit openai
vllm serve Qwen/Qwen1.5-0.5B-Chat vllm serve Qwen/Qwen1.5-0.5B-Chat
``` ```
1. Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py> 1. Use the script: [examples/online_serving/streamlit_openai_chatbot_webserver.py](../../../examples/online_serving/streamlit_openai_chatbot_webserver.py)
1. Start the streamlit web UI and start to chat: 1. Start the streamlit web UI and start to chat:
......
...@@ -49,7 +49,7 @@ Here is a sample of `LLM` class usage: ...@@ -49,7 +49,7 @@ Here is a sample of `LLM` class usage:
More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs. More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>. The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](../../vllm/entrypoints/llm.py).
### OpenAI-Compatible API Server ### OpenAI-Compatible API Server
...@@ -60,7 +60,7 @@ This server can be started using the `vllm serve` command. ...@@ -60,7 +60,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model> vllm serve <model>
``` ```
The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>. The code for the `vllm` CLI can be found in [vllm/entrypoints/cli/main.py](../../vllm/entrypoints/cli/main.py).
Sometimes you may see the API server entrypoint used directly instead of via the Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example: `vllm` CLI command. For example:
...@@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model <model> ...@@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
`python -m vllm.entrypoints.openai.api_server` is deprecated `python -m vllm.entrypoints.openai.api_server` is deprecated
and may become unsupported in a future release. and may become unsupported in a future release.
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>. That code can be found in [vllm/entrypoints/openai/api_server.py](../../vllm/entrypoints/openai/api_server.py).
More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document. More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
...@@ -101,7 +101,7 @@ processing. ...@@ -101,7 +101,7 @@ processing.
- **Output Processing**: Processes the outputs generated by the model, decoding the - **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text. token IDs from a language model into human-readable text.
The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>. The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py](../../vllm/engine/llm_engine.py).
### AsyncLLMEngine ### AsyncLLMEngine
...@@ -111,9 +111,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it ...@@ -111,9 +111,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients. can handle multiple concurrent requests and stream outputs to clients.
The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>. API server that serves as a simpler example in [vllm/entrypoints/api_server.py](../../vllm/entrypoints/api_server.py).
The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>. The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py](../../vllm/engine/async_llm_engine.py).
## Worker ## Worker
......
...@@ -17,7 +17,7 @@ In this document we will discuss the: ...@@ -17,7 +17,7 @@ In this document we will discuss the:
In this document, we refer to pure decode (`max_query_len=1`) or speculative decode (`max_query_len =1+num_spec_tokens`) as **uniform decode** batches, and the opposite would be **non-uniform** batches (i.e., prefill or mixed prefill-decode batches). In this document, we refer to pure decode (`max_query_len=1`) or speculative decode (`max_query_len =1+num_spec_tokens`) as **uniform decode** batches, and the opposite would be **non-uniform** batches (i.e., prefill or mixed prefill-decode batches).
!!! note !!! note
The following contents are mostly based on the last commit of <gh-pr:20059>. The following contents are mostly based on the last commit of <https://github.com/vllm-project/vllm/pull/20059>.
## Motivation ## Motivation
...@@ -92,7 +92,7 @@ where `num_tokens` can be the padded token length, and `uniform_decode` is deter ...@@ -92,7 +92,7 @@ where `num_tokens` can be the padded token length, and `uniform_decode` is deter
The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item. We are safe to exclude items like `uniform_query_len` because it is a constant at runtime for a certain setup currently. For example, it should be either `1` for a commonly pure decode or `1+num_spec_tokens` for a validation phase of speculative decode. The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item. We are safe to exclude items like `uniform_query_len` because it is a constant at runtime for a certain setup currently. For example, it should be either `1` for a commonly pure decode or `1+num_spec_tokens` for a validation phase of speculative decode.
!!! note !!! note
The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<gh-pr:23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs). The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<https://github.com/vllm-project/vllm/pull/23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs).
### `CudagraphDispatcher` ### `CudagraphDispatcher`
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment