Commit 8d75f22e authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.13.0rc1' into v0.13.0rc1-ori

parents ce888aa4 7d80c73d
......@@ -32,14 +32,14 @@ Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-Cpn
## 2 Usage Example
The current reference pathway is **SharedStorageConnector**.
The current reference pathway is **ExampleConnector**.
Below ready-to-run scripts shows the workflow:
1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`
`examples/online_serving/disaggregated_encoder/disagg_1e1pd_example.sh`
1 Encoder instance + 1 Prefill instance + 1 Decode instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh`
`examples/online_serving/disaggregated_encoder/disagg_1e1p1d_example.sh`
---
......
......@@ -21,14 +21,14 @@ Please refer to [examples/online_serving/disaggregated_prefill.sh](../../example
Now supports 5 types of connectors:
- **SharedStorageConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of SharedStorageConnector disaggregated prefilling.
- **ExampleConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of ExampleConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to [examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to [examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MultiConnector**: take advantage of the kv_connector_extra_config: dict[str, Any] already present in KVTransferConfig to stash all the connectors we want in an ordered list of kwargs.such as:
```bash
--kv-transfer-config '{"kv_connector":"MultiConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"NixlConnector","kv_role":"kv_both"},{"kv_connector":"SharedStorageConnector","kv_role":"kv_both","kv_connector_extra_config":{"shared_storage_path":"local_storage"}}]}}'
--kv-transfer-config '{"kv_connector":"MultiConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"NixlConnector","kv_role":"kv_both"},{"kv_connector":"ExampleConnector","kv_role":"kv_both","kv_connector_extra_config":{"shared_storage_path":"local_storage"}}]}}'
```
For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
......
# MooncakeConnector Usage Guide
## About Mooncake
Mooncake aims to enhance the inference efficiency of large language models (LLMs), especially in slow object storage environments, by constructing a multi-level caching pool on high-speed interconnected DRAM/SSD resources. Compared to traditional caching systems, Mooncake utilizes (GPUDirect) RDMA technology to transfer data directly in a zero-copy manner, while maximizing the use of multi-NIC resources on a single machine.
For more details about Mooncake, please refer to [Mooncake project](https://github.com/kvcache-ai/Mooncake) and [Mooncake documents](https://kvcache-ai.github.io/Mooncake/).
## Prerequisites
### Installation
Install mooncake through pip: `uv pip install mooncake-transfer-engine`.
Refer to [Mooncake official repository](https://github.com/kvcache-ai/Mooncake) for more installation instructions
## Usage
### Prefiller Node (192.168.0.2)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```
### Decoder Node (192.168.0.3)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8020 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
### Proxy
```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-host 192.168.0.2 --prefiller-port 8010 --decoder-host 192.168.0.3 --decoder-port 8020
```
> NOTE: The Mooncake Connector currently uses the proxy from nixl_integration. This will be replaced with a self-developed proxy in the future.
Now you can send requests to the proxy server through port 8000.
## Environment Variables
- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for Mooncake bootstrap server
- Default: 8998
- Required only for prefiller instances
- Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
- For TP/DP deployments, each worker's port on a node is computed as: base_port + dp_rank * tp_size + tp_rank
- Used for the decoder notifying the prefiller
- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
- Default: 480
- If a request is aborted and the decoder has not yet notified the prefiller, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
## KV Role Options
- **kv_producer**: For prefiller instances that generate KV caches
- **kv_consumer**: For decoder instances that consume KV caches from prefiller
- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.
......@@ -443,7 +443,9 @@ For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embedd
print(generated_text)
```
#### Audio Embeddings
For Qwen3-VL, the `image_embeds` should contain both the base image embedding and deepstack features.
#### Audio Embedding Inputs
You can pass pre-computed audio embeddings similar to image embeddings:
......@@ -795,14 +797,12 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
??? code
```python
from vllm.utils.serial_utils import tensor2base64
image_embedding = torch.load(...)
grid_thw = torch.load(...) # Required by Qwen/Qwen2-VL-2B-Instruct
buffer = io.BytesIO()
torch.save(image_embedding, buffer)
buffer.seek(0)
binary_data = buffer.read()
base64_image_embedding = base64.b64encode(binary_data).decode('utf-8')
base64_image_embedding = tensor2base64(image_embedding)
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
......@@ -892,5 +892,11 @@ For Online Serving, you can also skip sending media if you expect cache hits wit
```
!!! note
Only one message can contain `{"type": "image_embeds"}`.
Multiple messages can now contain `{"type": "image_embeds"}`, enabling you to pass multiple image embeddings in a single request (similar to regular images). The number of embeddings is limited by `--limit-mm-per-prompt`.
**Important**: The embedding shape format differs based on the number of embeddings:
- **Single embedding**: 3D tensor of shape `(1, feature_size, hidden_size)`
- **Multiple embeddings**: List of 2D tensors, each of shape `(feature_size, hidden_size)`
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
......@@ -146,6 +146,8 @@ python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
--decoder-ports 8000 8000
```
For multi-host DP deployment, only need to provide the host/port of the head instances.
### KV Role Options
- **kv_producer**: For prefiller instances that generate KV caches
......
......@@ -14,7 +14,7 @@ Contents:
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
- [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [NVIDIA Model Optimizer](modelopt.md)
- [AMD Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)
......
# NVIDIA TensorRT Model Optimizer
# NVIDIA Model Optimizer
The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.
The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.
We recommend installing the library with:
......@@ -10,7 +10,7 @@ pip install nvidia-modelopt
## Quantizing HuggingFace Models with PTQ
You can quantize HuggingFace models using the example scripts provided in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
You can quantize HuggingFace models using the example scripts provided in the Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
Below is an example showing how to quantize a model using modelopt's PTQ API:
......
......@@ -18,6 +18,7 @@ vLLM currently supports the following reasoning models:
| [ERNIE-4.5-VL series](https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT) | `ernie45` | `json`, `regex` | ❌ |
| [ERNIE-4.5-21B-A3B-Thinking](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking) | `ernie45` | `json`, `regex` | ✅ |
| [GLM-4.5 series](https://huggingface.co/collections/zai-org/glm-45-687c621d34bda8c9e4bf503b) | `glm45` | `json`, `regex` | ✅ |
| [Holo2 series](https://huggingface.co/collections/Hcompany/holo2) | `holo2` | `json`, `regex` | ✅ |
| [Hunyuan A13B series](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | `hunyuan_a13b` | `json`, `regex` | ✅ |
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
| [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) | `minimax_m2_append_think` | `json`, `regex` | ✅ |
......@@ -28,6 +29,7 @@ vLLM currently supports the following reasoning models:
IBM Granite 3.2 and DeepSeek-V3.1 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
DeepSeek-V3.1 tool calling is supported in non-thinking mode.
Holo2 reasoning is enabled by default. To disable it, you must also pass `thinking=False` in your `chat_template_kwargs`.
## Quickstart
......@@ -297,6 +299,9 @@ Additionally, to enable structured output, you'll need to create a new `Reasoner
def is_reasoning_end(self, input_ids: list[int]) -> bool:
return self.end_token_id in input_ids
def is_reasoning_end_streaming(self, input_ids: list[int], delta_ids: list[int]) -> bool:
return self.end_token_id in delta_token_ids
...
```
......
......@@ -376,6 +376,19 @@ Supported models:
Flags: `--tool-call-parser olmo3`
### Gigachat 3 Models (`gigachat3`)
Use chat template from the Hugging Face model files.
Supported models:
* `ai-sage/GigaChat3-702B-A36B-preview`
* `ai-sage/GigaChat3-702B-A36B-preview-bf16`
* `ai-sage/GigaChat3-10B-A1.8B`
* `ai-sage/GigaChat3-10B-A1.8B-bf16`
Flags: `--tool-call-parser gigachat3`
### Models with Pythonic Tool Calls (`pythonic`)
A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.
......
......@@ -4,9 +4,6 @@ vLLM has experimental support for macOS with Apple Silicon. For now, users must
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
......@@ -20,6 +17,8 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Apple silicon CPU wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
......@@ -78,6 +77,8 @@ uv pip install -e .
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Arm silicon CPU images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
......
# --8<-- [start:installation]
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
vLLM offers basic model inferencing and serving on Arm CPU platform, with support NEON, data types FP32, FP16 and BF16.
# --8<-- [end:installation]
# --8<-- [start:requirements]
......@@ -20,6 +15,23 @@ ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels contain pre-compiled C++ binaries.
Please replace `<version>` in the commands below with a specific version string (e.g., `0.11.2`).
```bash
uv pip install --pre vllm==<version>+cpu --extra-index-url https://wheels.vllm.ai/<version>%2Bcpu/
```
??? console "pip"
```bash
pip install --pre vllm==<version>+cpu --extra-index-url https://wheels.vllm.ai/<version>%2Bcpu/
```
The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
!!! note
Nightly wheels are currently unsupported for this architecture. (e.g. to bisect the behavior change, performance regression).
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
......@@ -69,6 +81,8 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built Arm CPU images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```bash
......
......@@ -46,11 +46,25 @@ vLLM is a Python library that supports the following CPU variants. Select your C
### Pre-built wheels
Please refer to the instructions for [pre-built wheels on GPU](./gpu.md#pre-built-wheels).
When specifying the index URL, please make sure to use the `cpu` variant subdirectory.
For example, the nightly build index is: `https://wheels.vllm.ai/nightly/cpu/`.
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-wheels"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-wheels"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-wheels"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-wheels"
### Build wheel from source
#### Set up using Python-only build (without compilation) {#python-only-build}
......@@ -87,6 +101,18 @@ VLLM_USE_PRECOMPILED=1 VLLM_PRECOMPILED_WHEEL_VARIANT=cpu VLLM_TARGET_DEVICE=cpu
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:pre-built-images"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:pre-built-images"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:pre-built-images"
### Build image from source
=== "Intel/AMD x86"
......
......@@ -4,9 +4,6 @@ vLLM has experimental support for s390x architecture on IBM Z platform. For now,
Currently, the CPU implementation for s390x architecture supports FP32 datatype only.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
......@@ -21,6 +18,8 @@ Currently, the CPU implementation for s390x architecture supports FP32 datatype
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built IBM Z CPU wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
......@@ -69,6 +68,8 @@ Execute the following commands to build and install vLLM from source.
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
Currently, there are no pre-built IBM Z CPU images.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
......
......@@ -17,6 +17,8 @@ vLLM supports basic model inferencing and serving on x86 CPU platform, with data
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built x86 CPU wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
......
......@@ -5,9 +5,6 @@ vLLM supports AMD GPUs with ROCm 6.3 or above, and torch 2.8.0 and above.
!!! tip
[Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
!!! warning
There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
# --8<-- [end:installation]
# --8<-- [start:requirements]
......
......@@ -2,9 +2,6 @@
vLLM initially supports basic model inference and serving on Intel GPU platform.
!!! warning
There are no pre-built wheels for this device, so you need build vLLM from source. Or you can use pre-built images which are based on vLLM released versions.
# --8<-- [end:installation]
# --8<-- [start:requirements]
......
# Collaboration Policy
This page outlines how vLLM collaborates with model providers, hardware vendors, and other stakeholders.
## Adding New Major Features
Anyone can contribute to vLLM. For major features, submit an RFC (request for comments) first. To submit an RFC, create an [issue](https://github.com/vllm-project/vllm/issues/new/choose) and select the `RFC` template.
RFCs are similar to design docs that discuss the motivation, problem solved, alternatives considered, and proposed change.
Once you submit the RFC, please post it in the #contributors channel in vLLM Slack, and loop in area owners and committers for feedback.
For high-interest features, the committers nominate a person to help with the RFC process and PR review. This makes sure someone is guiding you through the process. It is reflected as the "assignee" field in the RFC issue.
If the assignee and lead maintainers find the feature to be contentious, the maintainer team aims to make decisions quickly after learning the details from everyone. This involves assigning a committer as the DRI (Directly Responsible Individual) to make the decision and shepherd the code contribution process.
For features that you intend to maintain, please feel free to add yourself in [`mergify.yml`](https://github.com/vllm-project/vllm/blob/main/.github/mergify.yml) to receive notifications and auto-assignment when the PRs touching the feature you are maintaining. Over time, the ownership will be evaluated and updated through the committers nomination and voting process.
## Adding New Models
If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly.
The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers.
Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
Once we establish the connection between the vLLM team and model provider:
- The vLLM team learns the model architecture and relevant changes, then plans which area owners to involve and what features to include.
- The vLLM team creates a private communication channel (currently a Slack channel in the vLLM workspace) and a private fork within the vllm-project organization. The model provider team can invite others to the channel and repo.
- Third parties like compute providers, hosted inference providers, hardware vendors, and other organizations often work with both the model provider and vLLM on model releases. We establish direct communication (with permission) or three-way communication as needed.
The vLLM team works with model providers on features, integrations, and release timelines. We work to meet release timelines, but engineering challenges like feature development, model accuracy alignment, and optimizations can cause delays.
The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request.
The vLLM team collaborates on marketing and promotional efforts for model releases. model providers can use vLLM's trademark and logo in publications and materials.
## Adding New Hardware
vLLM is designed as a platform for frontier model architectures and high-performance accelerators.
For new hardware, follow the [hardware plugin](../design/plugin_system.md) system to add support.
Use the platform plugin system to add hardware support.
As hardware gains popularity, we help endorse it in our documentation and marketing materials.
The vLLM GitHub organization can host hardware plugin repositories, especially for collaborative efforts among companies.
We rarely add new hardware to vLLM directly. Instead, we make existing hardware platforms modular to keep the vLLM core hardware-agnostic.
# Committers
This document lists the current committers of the vLLM project and the core areas they maintain.
Committers have write access to the vLLM repository and are responsible for reviewing and merging PRs.
You can also refer to the [CODEOWNERS](https://github.com/vllm-project/vllm/blob/main/.github/CODEOWNERS) file for concrete file-level ownership and reviewers. Both this documents and the CODEOWNERS file are living documents and they complement each other.
## Active Committers
We try to summarize each committer's role in vLLM in a few words. In general, vLLM committers cover a wide range of areas and help each other in the maintenance process.
Please refer to the later section about Area Owners for exact component ownership details.
Sorted alphabetically by GitHub handle:
- [@22quinn](https://github.com/22quinn): RL API
- [@aarnphm](https://github.com/aarnphm): Structured output
- [@alexm-redhat](https://github.com/alexm-redhat): Performance
- [@ApostaC](https://github.com/ApostaC): Connectors, offloading
- [@benchislett](https://github.com/benchislett): Engine core and spec decode
- [@bigPYJ1151](https://github.com/bigPYJ1151): Intel CPU/XPU integration
- [@chaunceyjiang](https://github.com/chaunceyjiang): Tool use and reasoning parser
- [@DarkLight1337](https://github.com/DarkLight1337): Multimodality, API server
- [@esmeetu](https://github.com/esmeetu): developer marketing, community
- [@gshtras](https://github.com/gshtras): AMD integration
- [@heheda12345](https://github.com/heheda12345): Hybrid memory allocator
- [@hmellor](https://github.com/hmellor): Hugging Face integration, documentation
- [@houseroad](https://github.com/houseroad): Engine core and Llama models
- [@Isotr0py](https://github.com/Isotr0py): Multimodality, new model support
- [@jeejeelee](https://github.com/jeejeelee): LoRA, new model support
- [@jikunshang](https://github.com/jikunshang): Intel CPU/XPU integration
- [@khluu](https://github.com/khluu): CI infrastructure
- [@KuntaiDu](https://github.com/KuntaiDu): KV Connector
- [@LucasWilkinson](https://github.com/LucasWilkinson): Kernels and performance
- [@luccafong](https://github.com/luccafong): Llama models, speculative decoding, distributed
- [@markmc](https://github.com/markmc): Observability
- [@mgoin](https://github.com/mgoin): Quantization and performance
- [@NickLucche](https://github.com/NickLucche): KV connector
- [@njhill](https://github.com/njhill): Distributed, API server, engine core
- [@noooop](https://github.com/noooop): Pooling models
- [@patrickvonplaten](https://github.com/patrickvonplaten): Mistral models, new model support
- [@pavanimajety](https://github.com/pavanimajety): NVIDIA GPU integration
- [@ProExpertProg](https://github.com/ProExpertProg): Compilation, startup UX
- [@robertgshaw2-redhat](https://github.com/robertgshaw2-redhat): Core, distributed, disagg
- [@ruisearch42](https://github.com/ruisearch42): Pipeline parallelism, Ray Support
- [@russellb](https://github.com/russellb): Structured output, engine core, security
- [@sighingnow](https://github.com/sighingnow): Qwen models, new model support
- [@simon-mo](https://github.com/simon-mo): Project lead, API entrypoints, community
- [@tdoublep](https://github.com/tdoublep): State space models
- [@tjtanaa](https://github.com/tjtanaa): AMD GPU integration
- [@tlrmchlsmth](https://github.com/tlrmchlsmth): Kernels and performance, distributed, disagg
- [@WoosukKwon](https://github.com/WoosukKwon): Project lead, engine core
- [@yaochengji](https://github.com/yaochengji): TPU integration
- [@yeqcharlotte](https://github.com/yeqcharlotte): Benchmark, Llama models
- [@yewentao256](https://github.com/yewentao256): Kernels and performance
- [@Yikun](https://github.com/Yikun): Pluggable hardware interface
- [@youkaichao](https://github.com/youkaichao): Project lead, distributed, compile, community
- [@ywang96](https://github.com/ywang96): Multimodality, benchmarks
- [@zhuohan123](https://github.com/zhuohan123): Project lead, RL integration, numerics
- [@zou3519](https://github.com/zou3519): Compilation
### Emeritus Committers
Committers who have contributed to vLLM significantly in the past (thank you!) but no longer active:
- [@andoorve](https://github.com/andoorve): Pipeline parallelism
- [@cadedaniel](https://github.com/cadedaniel): Speculative decoding
- [@comaniac](https://github.com/comaniac): KV cache management, pipeline parallelism
- [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU): Speculative decoding
- [@pcmoritz](https://github.com/pcmoritz): MoE
- [@rkooo567](https://github.com/rkooo567): Chunked prefill
- [@sroy745](https://github.com/sroy745): Speculative decoding
- [@Yard1](https://github.com/Yard1): kernels and performance
- [@zhisbug](https://github.com/zhisbug): Arctic models, distributed
## Area Owners
This section breaks down the active committers by vLLM components and lists the area owners.
If you have PRs touching the area, please feel free to ping the area owner for review.
### Engine Core
- Scheduler: the core vLLM engine loop scheduling requests to next batch
- @WoosukKwon, @robertgshaw2-redhat, @njhill, @heheda12345
- KV Cache Manager: memory management layer within scheduler maintaining KV cache logical block data
- @heheda12345, @WoosukKwon
- AsyncLLM: the zmq based protocol hosting engine core and making it accessible for entrypoints
- @robertgshaw2-redhat, @njhill, @russellb
- ModelRunner, Executor, Worker: the abstractions for engine wrapping model implementation
- @WoosukKwon, @tlrmchlsmth, @heheda12345, @LucasWilkinson, @ProExpertProg
- KV Connector: Connector interface and implementation for KV cache offload and transfer
- @robertgshaw2-redhat, @njhill, @KuntaiDu, @NickLucche, @ApostaC
- Distributed, Parallelism, Process Management: Process launchers managing each worker, and assign them to the right DP/TP/PP/EP ranks
- @youkaichao, @njhill, @WoosukKwon, @ruisearch42
- Collectives: the usage of nccl and other communication libraries/kernels
- @tlrmchlsmth, @youkaichao
- Multimodality engine and memory management: core scheduling and memory management concerning vision, audio, and video inputs.
- @ywang96, @DarkLight1337
### Model Implementations
- Model Interface: The `nn.Module` interface and implementation for various models
- @zhuohan123, @mgoin, @simon-mo, @houseroad, @ywang96 (multimodality), @jeejeelee (lora)
- Logits Processors / Sampler: The provided sampler class and pluggable logits processors
- @njhill, @houseroad, @22quinn
- Custom Layers: Utility layers in vLLM such as rotary embedding and rms norms
- @ProExpertProg
- Attention: Attention interface for paged attention
- @WoosukKwon, @LucasWilkinson, @heheda12345
- FusedMoE: FusedMoE kernel, Modular kernel framework, EPLB
- @tlrmchlsmth
- Quantization: Various quantization config, weight loading, and kernel.
- @mgoin, @Isotr0py, @yewentao256
- Custom quantized GEMM kernels (cutlass_scaled_mm, marlin, machete)
- @tlrmchlsmth, @LucasWilkinson
- Multi-modal Input Processing: Components that load and process image/video/audio data into feature tensors
- @DarkLight1337, @ywang96, @Isotr0py
- torch compile: The torch.compile integration in vLLM, custom passes & transformations
- @ProExpertProg, @zou3519, @youkaichao
- State space models: The state space models implementation in vLLM
- @tdoublep, @tlrmchlsmth
- Reasoning and tool calling parsers
- @chaunceyjiang, @aarnphm
### Entrypoints
- LLM Class: The LLM class for offline inference
- @DarkLight1337
- API Server: The OpenAI-compatible API server
- @DarkLight1337, @njhill, @aarnphm, @simon-mo, @heheda12345 (Responses API)
- Batch Runner: The OpenAI-compatible batch runner
- @simon-mo
### Features
- Spec Decode: Covers model definition, attention, sampler, and scheduler related to n-grams, EAGLE, and MTP.
- @WoosukKwon, @benchislett, @luccafong
- Structured Output: The structured output implementation
- @russellb, @aarnphm
- RL: The RL related features such as collective rpc, sleep mode, etc.
- @youkaichao, @zhuohan123, @22quinn
- LoRA: @jeejeelee
- Observability: Metrics and Logging
- @markmc, @robertgshaw2-redhat, @simon-mo
### Code Base
- Config: Configuration registration and parsing
- @hmellor
- Documentation: @hmellor, @DarkLight1337, @simon-mo
- Benchmarks: @ywang96, @simon-mo
- CI, Build, Release Process: @khluu, @njhill, @simon-mo
- Security: @russellb
### External Kernels Integration
- FlashAttention: @LucasWilkinson
- FlashInfer: @LucasWilkinson, @mgoin, @WoosukKwon
- Blackwell Kernels: @mgoin, @yewentao256
- DeepEP/DeepGEMM/pplx: @mgoin, @yewentao256
### Integrations
- Hugging Face: @hmellor, @Isotr0py
- Ray: @ruisearch42
- NIXL: @robertgshaw2-redhat, @NickLucche
### Collaboration with Model Vendors
- gpt-oss: @heheda12345, @simon-mo, @zhuohan123
- Llama: @luccafong
- Qwen: @sighingnow
- Mistral: @patrickvonplaten
### Hardware
- Plugin Interface: @youkaichao, @Yikun
- NVIDIA GPU: @pavanimajety
- AMD GPU: @gshtras, @tjtanaa
- Intel CPU/GPU: @jikunshang, @bigPYJ1151
- Google TPU: @yaochengji
### Ecosystem Projects
- Ascend NPU: [@wangxiyuan](https://github.com/wangxiyuan) and [see more details](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html#maintainers)
- Intel Gaudi HPU [@xuechendi](https://github.com/xuechendi) and [@kzawora-intel](https://github.com/kzawora-intel)
# Governance Process
vLLM's success comes from our strong open source community. We favor informal, meritocratic norms over formal policies. This document clarifies our governance philosophy and practices.
## Values
vLLM aims to be the fastest and easiest-to-use LLM inference and serving engine. We stay current with advances, enable innovation, and support diverse models, modalities, and hardware.
### Design Values
1. **Top performance**: System performance is our top priority. We monitor overheads, optimize kernels, and publish benchmarks. We never leave performance on the table.
2. **Ease of use**: vLLM must be simple to install, configure, and operate. We provide clear documentation, fast startup, clean logs, helpful error messages, and monitoring guides. Many users fork our code or study it deeply, so we keep it readable and modular.
3. **Wide coverage**: vLLM supports frontier models and high-performance accelerators. We make it easy to add new models and hardware. vLLM + PyTorch form a simple interface that avoids complexity.
4. **Production ready**: vLLM runs 24/7 in production. It must be easy to operate and monitor for health issues.
5. **Extensibility**: vLLM serves as fundamental LLM infrastructure. Our codebase cannot cover every use case, so we design for easy forking and customization.
### Collaboration Values
1. **Tightly Knit and Fast-Moving**: Our maintainer team is aligned on vision, philosophy, and roadmap. We work closely to unblock each other and move quickly.
2. **Individual Merit**: No one buys their way into governance. Committer status belongs to individuals, not companies. We reward contribution, maintenance, and project stewardship.
## Project Maintainers
Maintainers form a hierarchy based on sustained, high-quality contributions and alignment with our design philosophy.
### Core Maintainers
Core Maintainers function like a project planning and decision making committee. In other convention, they might be called a Technical Steering Committee (TSC). In vLLM vocabulary, they are often known as "Project Leads". They meet weekly to coordinate roadmap priorities and allocate engineering resources. Current active leads: @WoosukKwon, @zhuohan123, @simon-mo, @youkaichao, @robertshaw2-redhat, @tlrmchlsmth, @mgoin, @njhill, @ywang96, @houseroad, @yeqcharlotte, @ApostaC
The responsibilities of the core maintainers are:
* Author quarterly roadmap and responsible for each development effort.
* Making major changes to the technical direction or scope of vLLM and vLLM projects.
* Defining the project's release strategy.
* Work with model providers, hardware vendors, and key users of vLLM to ensure the project is on the right track.
### Lead Maintainers
While Core maintainers assume the day-to-day responsibilities of the project, Lead maintainers are responsible for the overall direction and strategy of the project. A committee of @WoosukKwon, @zhuohan123, @simon-mo, and @youkaichao currently shares this role with divided responsibilities.
The responsibilities of the lead maintainers are:
* Making decisions where consensus among core maintainers cannot be reached.
* Adopting changes to the project's technical governance.
* Organizing the voting process for new committers.
### Committers and Area Owners
Committers have write access and merge rights. They typically have deep expertise in specific areas and help the community.
The responsibilities of the committers are:
* Reviewing PRs and providing feedback.
* Addressing issues and questions from the community.
* Own specific areas of the codebase and development efforts: reviewing PRs, addressing issues, answering questions, improving documentation.
Specially, committers are almost all area owners. They author subsystems, review PRs, refactor code, monitor tests, and ensure compatibility with other areas. All area owners are committers with deep expertise in that area, but not all committers own areas.
For a full list of committers and their respective areas, see the [committers](./committers.md) page.
#### Nomination Process
Any committer can nominate candidates via our private mailing list:
1. **Nominate**: Any committer may nominate a candidate by email to the private maintainers’ list, citing evidence mapped to the pre‑existing standards with links to PRs, reviews, RFCs, issues, benchmarks, and adoption evidence.
2. **Vote**: The lead maintainers will group voices support or concerns. Shared concerns can stop the process. The vote typically last 3 working days. For concerns, committers group discuss the clear criteria for such person to be nominated again. The lead maintainers will make the final decision.
3. **Confirm**: The lead maintainers send invitation, update CODEOWNERS, assign permissions, add to communications channels (mailing list and Slack).
Committership is highly selective and merit based. The selection criteria requires:
* **Area expertise**: leading design/implementation of core subsystems, material performance or reliability improvements adopted project‑wide, or accepted RFCs that shape technical direction.
* **Sustained contributions**: high‑quality merged contributions and reviews across releases, responsiveness to feedback, and stewardship of code health.
* **Community leadership**: mentoring contributors, triaging issues, improving docs, and elevating project standards.
To further illustrate, a committer typically satisfies at least two of the following accomplishment patterns:
* Author of an accepted RFC or design that materially shaped project direction
* Measurable, widely adopted performance or reliability improvement in core paths
* Long‑term ownership of a subsystem with demonstrable quality and stability gains
* Significant cross‑project compatibility or ecosystem enablement work (models, hardware, tooling)
While there isn't a quantitative bar, past committers have:
* Submitted approximately 30+ PRs of substantial quality and scope
* Provided high-quality reviews of approximately 10+ substantial external contributor PRs
* Addressed multiple issues and questions from the community in issues/forums/Slack
* Led concentrated efforts on RFCs and their implementation, or significant performance or reliability improvements adopted project‑wide
### Working Groups
vLLM runs informal working groups such as CI, CI infrastructure, torch compile, and startup UX. These can be loosely tracked via `#sig-` (or `#feat-`) channels in vLLM Slack. Some groups have regular sync meetings.
### Advisory Board
vLLM project leads consult with an informal advisory board that is composed of model providers, hardware vendors, and ecosystem partners. This manifests as a collaboration channel in Slack and frequent communications.
## Process
### Project Roadmap
Project Leads publish quarterly roadmaps as GitHub issues. These clarify current priorities. Unlisted topics aren't excluded but may get less review attention. See [https://roadmap.vllm.ai/](https://roadmap.vllm.ai/).
### Decision Making
We make technical decisions in Slack and GitHub using RFCs and design docs. Discussion may happen elsewhere, but we maintain public records of significant changes: problem statements, rationale, and alternatives considered.
### Merging Code
Contributors and maintainers often collaborate closely on code changes, especially within organizations or specific areas. Maintainers should give others appropriate review opportunities based on change significance.
PRs requires at least one committer review and approval. If the code is covered by CODEOWNERS, the PR should be reviewed by the CODEOWNERS. There are cases where the code is trivial or hotfix, the PR can be merged by the lead maintainers directly.
In case where CI didn't pass due to the failure is not related to the PR, the PR can be merged by the lead maintainers using "force merge" option that overrides the CI checks.
### Slack
Contributors are encouraged to join `#pr-reviews` and `#contributors` channels.
There are `#sig-` and `#feat-` channels for discussion and coordination around specific topics.
The project maintainer group also uses a private channel for high-bandwidth collaboration.
### Meetings
We hold weekly contributor syncs with standup-style updates on progress, blockers, and plans. You can refer to the notes [standup.vllm.ai](https://standup.vllm.ai) for joining instructions.
......@@ -2,7 +2,8 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools
import logging
from dataclasses import dataclass, field
from dataclasses import dataclass
from functools import cached_property
from pathlib import Path
from typing import Literal
......@@ -16,13 +17,18 @@ EXAMPLE_DIR = ROOT_DIR / "examples"
EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
def fix_case(text: str) -> str:
def title(text: str) -> str:
# Default title case
text = text.replace("_", " ").replace("/", " - ").title()
# Custom substitutions
subs = {
"io": "IO",
"api": "API",
"cli": "CLI",
"cpu": "CPU",
"llm": "LLM",
"mae": "MAE",
"ner": "NER",
"tpu": "TPU",
"gguf": "GGUF",
"lora": "LoRA",
......@@ -48,71 +54,65 @@ class Example:
Attributes:
path (Path): The path to the main directory or file.
category (str): The category of the document.
main_file (Path): The main file in the directory.
other_files (list[Path]): list of other files in the directory.
title (str): The title of the document.
Properties::
main_file() -> Path | None: Determines the main file in the given path.
other_files() -> list[Path]: Determines other files in the directory excluding
the main file.
title() -> str: Determines the title of the document.
Methods:
__post_init__(): Initializes the main_file, other_files, and title attributes.
determine_main_file() -> Path: Determines the main file in the given path.
determine_other_files() -> list[Path]: Determines other files in the directory excluding the main file.
determine_title() -> str: Determines the title of the document.
generate() -> str: Generates the documentation content.
""" # noqa: E501
"""
path: Path
category: str = None
main_file: Path = field(init=False)
other_files: list[Path] = field(init=False)
title: str = field(init=False)
def __post_init__(self):
self.main_file = self.determine_main_file()
self.other_files = self.determine_other_files()
self.title = self.determine_title()
@property
def is_code(self) -> bool:
return self.main_file.suffix != ".md"
def determine_main_file(self) -> Path:
"""
Determines the main file in the given path.
If the path is a file, it returns the path itself. Otherwise, it searches
for Markdown files (*.md) in the directory and returns the first one found.
Returns:
Path: The main file path, either the original path if it's a file or the first
Markdown file found in the directory.
Raises:
IndexError: If no Markdown files are found in the directory.
""" # noqa: E501
return self.path if self.path.is_file() else list(self.path.glob("*.md")).pop()
def determine_other_files(self) -> list[Path]:
"""
Determine other files in the directory excluding the main file.
category: str
This method checks if the given path is a file. If it is, it returns an empty list.
Otherwise, it recursively searches through the directory and returns a list of all
files that are not the main file.
@cached_property
def main_file(self) -> Path | None:
"""Determines the main file in the given path.
Returns:
list[Path]: A list of Path objects representing the other files in the directory.
""" # noqa: E501
If path is a file, it returns the path itself. If path is a directory, it
searches for Markdown files (*.md) in the directory and returns the first one
found. If no Markdown files are found, it returns None."""
# Single file example
if self.path.is_file():
return self.path
# Multi file example with a README
if md_paths := list(self.path.glob("*.md")):
return md_paths[0]
# Multi file example without a README
return None
@cached_property
def other_files(self) -> list[Path]:
"""Determine other files in the directory excluding the main file.
If path is a file, it returns an empty list. Otherwise, it returns every file
in the directory except the main file in a list."""
# Single file example
if self.path.is_file():
return []
# Multi file example
is_other_file = lambda file: file.is_file() and file != self.main_file
return [file for file in self.path.rglob("*") if is_other_file(file)]
def determine_title(self) -> str:
if not self.is_code:
# Specify encoding for building on Windows
with open(self.main_file, encoding="utf-8") as f:
first_line = f.readline().strip()
match = re.match(r"^#\s+(?P<title>.+)$", first_line)
if match:
return match.group("title")
return fix_case(self.path.stem.replace("_", " ").title())
return sorted(file for file in self.path.rglob("*") if is_other_file(file))
@cached_property
def is_code(self) -> bool:
return self.main_file is not None and self.main_file.suffix != ".md"
@cached_property
def title(self) -> str:
# Generate title from filename if no main md file found
if self.main_file is None or self.is_code:
return title(self.path.stem)
# Specify encoding for building on Windows
with open(self.main_file, encoding="utf-8") as f:
first_line = f.readline().strip()
match = re.match(r"^#\s+(?P<title>.+)$", first_line)
if match:
return match.group("title")
raise ValueError(f"Title not found in {self.main_file}")
def fix_relative_links(self, content: str) -> str:
"""
......@@ -156,24 +156,35 @@ class Example:
# included files containing code fences too
code_fence = "``````"
if self.is_code:
content += (
f"{code_fence}{self.main_file.suffix[1:]}\n"
f'--8<-- "{self.main_file}"\n'
f"{code_fence}\n"
)
if self.main_file is not None:
# Single file example or multi file example with a README
if self.is_code:
content += (
f"{code_fence}{self.main_file.suffix[1:]}\n"
f'--8<-- "{self.main_file}"\n'
f"{code_fence}\n"
)
else:
with open(self.main_file, encoding="utf-8") as f:
# Skip the title from md snippets as it's been included above
main_content = f.readlines()[1:]
content += self.fix_relative_links("".join(main_content))
content += "\n"
else:
with open(self.main_file) as f:
# Skip the title from md snippets as it's been included above
main_content = f.readlines()[1:]
content += self.fix_relative_links("".join(main_content))
content += "\n"
# Multi file example without a README
for file in self.other_files:
file_title = title(str(file.relative_to(self.path).with_suffix("")))
content += f"## {file_title}\n\n"
content += (
f'{code_fence}{file.suffix[1:]}\n--8<-- "{file}"\n{code_fence}\n\n'
)
return content
if not self.other_files:
return content
content += "## Example materials\n\n"
for file in sorted(self.other_files):
for file in self.other_files:
content += f'??? abstract "{file.relative_to(self.path)}"\n'
if file.suffix != ".md":
content += f" {code_fence}{file.suffix[1:]}\n"
......@@ -200,11 +211,13 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
glob_patterns = ["*.py", "*.md", "*.sh"]
# Find categorised examples
for category in categories:
logger.info("Processing category: %s", category.stem)
globs = [category.glob(pattern) for pattern in glob_patterns]
for path in itertools.chain(*globs):
examples.append(Example(path, category.stem))
# Find examples in subdirectories
for path in category.glob("*/*.md"):
globs = [category.glob(f"*/{pattern}") for pattern in glob_patterns]
for path in itertools.chain(*globs):
examples.append(Example(path.parent, category.stem))
# Generate the example documentation
......@@ -217,3 +230,4 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
with open(doc_path, "w+", encoding="utf-8") as f:
f.write(example.generate())
logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
logger.info("Total examples generated: %d", len(examples))
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment