Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

Commit 4eabe123 authored May 28, 2025 by zhuwenwen
20 changed files
--- a/docs/source/assets/kernel/q_vecs.png
+++ b/docs/source/assets/kernel/q_vecs.png
--- a/docs/source/assets/kernel/query.png
+++ b/docs/source/assets/kernel/query.png
--- a/docs/source/assets/kernel/v_vec.png
+++ b/docs/source/assets/kernel/v_vec.png
--- a/docs/source/assets/kernel/value.png
+++ b/docs/source/assets/kernel/value.png
--- a/docs/source/assets/logos/vllm-logo-only-light.ico
+++ b/docs/source/assets/logos/vllm-logo-only-light.ico
--- a/docs/source/assets/logos/vllm-logo-only-light.png
+++ b/docs/source/assets/logos/vllm-logo-only-light.png
--- a/docs/source/assets/logos/vllm-logo-text-dark.png
+++ b/docs/source/assets/logos/vllm-logo-text-dark.png
--- a/docs/source/assets/logos/vllm-logo-text-light.png
+++ b/docs/source/assets/logos/vllm-logo-text-light.png
--- a/docs/source/community/meetups.md
+++ b/docs/source/community/meetups.md
-(meetups)=
+---
+title: Meetups
-# vLLM Meetups
+---
+[](){ #meetups }
 We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:

--- a/docs/source/community/sponsors.md
+++ b/docs/source/community/sponsors.md
--- a/docs/configuration/README.md
+++ b/docs/configuration/README.md
+# Configuration Options
+This section lists the most common options for running vLLM.
+There are three main levels of configuration, from highest priority to lowest priority:
+- [Request parameters][completions-api] and [input arguments][sampling-params]
+- [Engine arguments](./engine_args.md)
+- [Environment variables](./env_vars.md)
--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
-(offline-inference)=
+# Conserving Memory
-# Offline Inference
-You can run vLLM in your own code on a list of prompts.
-The offline API is based on the {class}`~vllm.LLM` class.
-To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
-For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
-and runs it in vLLM using the default configuration.
-```python
-from vllm import LLM
-llm = LLM(model="facebook/opt-125m")
-```
-After initializing the `LLM` instance, you can perform model inference using various APIs.
-The available APIs depend on the type of model that is being run:
-- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
-- [Pooling models](#pooling-models) output their hidden states directly.
-Please refer to the above pages for more details about each API.
-:::{seealso}
-[API Reference](#offline-inference-api)
-:::
-(configuration-options)=
-## Configuration Options
-This section lists the most common options for running the vLLM engine.
-For a full list, refer to the <project:#configuration> page.
-(model-resolution)=
-### Model resolution
-vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
-and finding the corresponding implementation that is registered to vLLM.
-Nevertheless, our model resolution may fail for the following reasons:
-- The `config.json` of the model repository lacks the `architectures` field.
-- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
-- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
-To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
-For example:
-```python
-from vllm import LLM
-model = LLM(
-    model="cerebras/Cerebras-GPT-1.3B",
-    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
-)
-```
-Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
-(reducing-memory-usage)=
-### Reducing memory usage
 Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
-#### Tensor Parallelism (TP)
+## Tensor Parallelism (TP)
 Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
@@ -80,29 +15,27 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
 ```
-:::{important}
+!!! warning
-To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
+    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
-before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
-To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
+    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-:::
-:::{note}
+!!! note
-With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
-You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-:::
-#### Quantization
+## Quantization
 Quantized models take less memory at the cost of lower precision.
 Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
 and used directly without extra configuration.
-Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
-#### Context length and batch size
+## Context length and batch size
 You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
 and the maximum batch size (`max_num_seqs` option).
@@ -115,13 +48,12 @@ llm = LLM(model="adept/fuyu-8b",
          max_num_seqs=2)
 ```
-#### Reduce CUDA Graphs
+## Reduce CUDA Graphs
 By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
-:::{important}
+!!! warning
-CUDA graph capture takes up more memory in V1 than in V0.
+    CUDA graph capture takes up more memory in V1 than in V0.
-:::
 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
@@ -148,14 +80,14 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enforce_eager=True)
 ```
-#### Adjust cache size
+## Adjust cache size
 If you run out of CPU RAM, try the following options:
 - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
-#### Multi-modal input limits
+## Multi-modal input limits
 You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
@@ -188,7 +120,7 @@ llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
 ```
-#### Multi-modal processor arguments
+## Multi-modal processor arguments
 For certain models, you can adjust the multi-modal processor arguments to
 reduce the size of the processed multi-modal inputs, which in turn saves memory.
@@ -210,8 +142,3 @@ llm = LLM(model="OpenGVLab/InternVL2-2B",
              "max_dynamic_patch": 4,  # Default is 12
          })
 ```
-### Performance optimization and tuning
-You can potentially improve the performance of vLLM by finetuning various options.
-Please refer to [this guide](#optimization-and-tuning) for more details.
--- a/docs/configuration/engine_args.md
+++ b/docs/configuration/engine_args.md
+---
+title: Engine Arguments
+---
+[](){ #engine-args }
+Engine arguments control the behavior of the vLLM engine.
+- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
+- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
+However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented.
+For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config.
+!!! note
+    Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine] which is used for online serving. These can be found by running `vllm serve --help`
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
+# Environment Variables
+vLLM uses the following environment variables to configure the system:
+!!! warning
+    Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
+    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
+```python
+--8<-- "vllm/envs.py:env-vars-definition"
+```
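The Kubernetes warning in `env_vars.md` can be illustrated with a small sketch. It assumes the Docker-links-style naming that Kubernetes documents for service environment variables (service name upper-cased, dashes turned into underscores). The helper names, the example port `8000`, and the service name `llm-server` are hypothetical and are not part of vLLM or Kubernetes.

```python
# Names taken from the warning above; vLLM reads these for its internal usage.
VLLM_INTERNAL_VARS = {"VLLM_PORT", "VLLM_HOST_IP"}


def kubernetes_service_env_names(service_name, port=8000):
    """Approximate the variable names Kubernetes injects for one service."""
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_TCP",
    ]


def colliding_names(service_name):
    """Return injected names that shadow a variable vLLM itself reads."""
    return sorted(set(kubernetes_service_env_names(service_name)) & VLLM_INTERNAL_VARS)


print(colliding_names("vllm"))        # a service named `vllm` injects VLLM_PORT
print(colliding_names("llm-server"))  # a different name avoids the clash
```

Under these assumptions, a service named `vllm` receives an injected `VLLM_PORT`, which is the clash the warning describes.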
--- a/docs/configuration/model_resolution.md
+++ b/docs/configuration/model_resolution.md
+# Model Resolution
+vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
+and finding the corresponding implementation that is registered to vLLM.
+Nevertheless, our model resolution may fail for the following reasons:
+- The `config.json` of the model repository lacks the `architectures` field.
+- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
+- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
+To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
+For example:
+```python
+from vllm import LLM
+model = LLM(
+    model="cerebras/Cerebras-GPT-1.3B",
+    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
+)
+```
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
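As a rough illustration of the resolution logic that `model_resolution.md` describes, the sketch below merges overrides into a `config.json` dictionary and looks the result up in a registry. The `REGISTRY` contents and the `resolve_architecture` helper are hypothetical stand-ins; vLLM's actual registry and lookup code may differ.

```python
# Hypothetical registry mapping architecture names to implementations.
# The real registry lives inside vLLM; these two entries are placeholders.
REGISTRY = {
    "GPT2LMHeadModel": "gpt2-implementation",
    "LlamaForCausalLM": "llama-implementation",
}


def resolve_architecture(config, overrides=None, registry=REGISTRY):
    """Return the first architecture name that the registry recognizes."""
    # Overrides replace matching keys of config.json, mirroring what the
    # `hf_overrides` option is described as doing.
    merged = {**config, **(overrides or {})}
    for name in merged.get("architectures") or []:
        if name in registry:
            return name
    raise ValueError("no registered architecture found; pass an explicit override")


# A config.json without the `architectures` field resolves only with an override.
print(resolve_architecture({}, {"architectures": ["GPT2LMHeadModel"]}))
```

The first failure case in the list (a missing `architectures` field) corresponds to the `ValueError` branch when no override is supplied.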
--- a/docs/source/performance/optimization.md
+++ b/docs/source/performance/optimization.md
-(optimization-and-tuning)=
 # Optimization and Tuning
 This guide covers optimization strategies and performance tuning for vLLM V1.
@@ -26,7 +24,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp
 In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
-(chunked-prefill)=
+[](){ #chunked-prefill }
 ## Chunked Prefill

--- a/docs/source/serving/serve_args.md
+++ b/docs/source/serving/serve_args.md
-(serve-args)=
+---
+title: Server Arguments
-# Server Arguments
+---
+[](){ #serve-args }
 The `vllm serve` command is used to launch the OpenAI-compatible server.
 ## CLI Arguments
-The following are all arguments available from the `vllm serve` command:
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+To see the available CLI arguments, run `vllm serve --help`!
-<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
-```{eval-rst}
-.. argparse::
-    :module: vllm.entrypoints.openai.cli_args
-    :func: create_parser_for_docs
-    :prog: vllm serve
-    :nodefaultconst:
-    :markdownhelp:
-```
 ## Configuration file
 You can load CLI arguments via a [YAML](https://yaml.org/) config file.
-The argument names must be the long form of those outlined [above](#serve-args).
+The argument names must be the long form of those outlined [above][serve-args].
 For example:
@@ -40,8 +32,7 @@ To use the above config file:
 vllm serve --config config.yaml
 ```
-:::{note}
+!!! note
-In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
+    In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
-The order of priorities is `command line > config file values > defaults`.
+    The order of priorities is `command line > config file values > defaults`.
-e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
+    e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
-:::
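The precedence rule in the note above (`command line > config file values > defaults`) can be sketched with a layered lookup. This is only an illustration of the ordering, not how `vllm serve` parses its arguments; the `effective_args` helper and the example values are hypothetical.

```python
from collections import ChainMap


def effective_args(cli_args, config_file_args, defaults):
    """Merge three argument layers; earlier layers win on conflicting keys."""
    # ChainMap searches its maps left to right, so the command line is
    # consulted first, then the config file, then the defaults.
    return dict(ChainMap(cli_args, config_file_args, defaults))


merged = effective_args(
    cli_args={"model": "SOME_MODEL"},
    config_file_args={"model": "model-from-config", "port": 9000},
    defaults={"port": 8000, "host": "0.0.0.0"},
)
print(merged["model"], merged["port"], merged["host"])
```

Here `model` comes from the command line, `port` from the config file, and `host` from the defaults, matching the stated order of priorities.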
--- a/docs/source/contributing/overview.md
+++ b/docs/source/contributing/overview.md
@@ -27,7 +27,21 @@ See <gh-file:LICENSE>.
 ## Developing
 Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
-Check out the [building from source](#build-from-source) documentation for details.
+Check out the [building from source][build-from-source] documentation for details.
+### Building the docs
+Install the dependencies:
+```bash
+pip install -r requirements/docs.txt
+```
+Start the autoreloading MkDocs server:
+```bash
+mkdocs serve
+```
 ## Testing
@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
 pytest tests/
 ```
-:::{tip}
+!!! tip
-Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
+    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
-Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
+    Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::
-:::{note}
+!!! note
-Currently, the repository is not fully checked by `mypy`.
+    Currently, the repository is not fully checked by `mypy`.
-:::
-:::{note}
+!!! note
-Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
+    Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
-platform to run unit tests locally, rely on the continuous integration system to run the tests for
+    platform to run unit tests locally, rely on the continuous integration system to run the tests for
-now.
+    now.
-:::
 ## Issues
 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-:::{important}
+!!! warning
-If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
+    If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
-:::
 ## Pull Requests & Code Reviews
@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
 - `[Misc]` for PRs that do not fit the above categories. Please use this
  sparingly.
-:::{note}
+!!! note
-If the PR spans more than one category, please include all relevant prefixes.
+    If the PR spans more than one category, please include all relevant prefixes.
-:::
 ### Code Quality
@@ -121,9 +130,8 @@ The PR needs to meet the following code quality standards:
  understand the code.
 - Include sufficient tests to ensure the project stays correct and robust. This
  includes both unit tests and integration tests.
-- Please add documentation to `docs/source/` if the PR modifies the
+- Please add documentation to `docs/` if the PR modifies the user-facing behaviors of vLLM.
-  user-facing behaviors of vLLM. It helps vLLM users understand and utilize the
+  It helps vLLM users understand and utilize the new features or changes.
-  new features or changes.
 ### Adding or Changing Kernels

--- a/docs/source/performance/benchmarks.md
+++ b/docs/source/performance/benchmarks.md
-(benchmarks)=
+---
+title: Benchmark Suites
-# Benchmark Suites
+---
+[](){ #benchmarks }
 vLLM contains two sets of benchmarks:
-- [Performance benchmarks](#performance-benchmarks)
+- [Performance benchmarks][performance-benchmarks]
-- [Nightly benchmarks](#nightly-benchmarks)
+- [Nightly benchmarks][nightly-benchmarks]
-(performance-benchmarks)=
+[](){ #performance-benchmarks }
 ## Performance Benchmarks
@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo
 More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
-(nightly-benchmarks)=
+[](){ #nightly-benchmarks }
 ## Nightly Benchmarks

--- a/docs/source/contributing/deprecation_policy.md
+++ b/docs/source/contributing/deprecation_policy.md