Commit 7e63ef82 authored by zhuwenwen

Merge tag 'v0.14.0' into v0.14.0-dev

parents 8cbcac5d b17039bc
...@@ -17,6 +17,16 @@ The E4M3 format offers higher precision compared to E5M2. However, due to its sm ...@@ -17,6 +17,16 @@ The E4M3 format offers higher precision compared to E5M2. However, due to its sm
For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling factors of a finer granularity (e.g. per-channel). For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling factors of a finer granularity (e.g. per-channel).
### How FP8 KV Cache Works
The FP8 KV cache implementation follows this workflow:
1. **Storage**: Key and Value tensors are quantized to FP8 format using scaling factors before being stored in the KV cache
2. **Retrieval**: When needed for attention computation, cached KV tensors are dequantized back to higher precision (FP16/BF16)
3. **Attention**: The attention-value multiplication (softmax output × V) is performed using the dequantized higher-precision V tensor
This means the final attention computation operates on dequantized values, not FP8 tensors. The quantization reduces memory usage during storage but maintains computation accuracy by using higher precision during the actual attention operations.
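
The store/retrieve round trip above can be sketched in plain Python. This is an illustration only, not vLLM's kernel code: `round_to_e4m3`, `per_tensor_scale`, `quantize`, and `dequantize` are hypothetical helper names, the E4M3 rounding is simulated in software, and the sketch assumes the finite-only E4M3 variant (maximum value 448) with a single per-tensor scale.

```python
import math

E4M3_MAX = 448.0      # assumed largest finite E4M3 value (finite-only variant)
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 number
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in (simulated) E4M3."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)                      # saturate instead of overflowing
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)  # floor(log2(mag)), clamped for subnormals
    step = 2.0 ** (exp - MANTISSA_BITS)              # spacing of representable values
    return sign * min(round(mag / step) * step, E4M3_MAX)


def per_tensor_scale(values):
    """One scalar scale for the whole tensor: map the largest magnitude to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Storage step: divide by the scale, then round to FP8."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Retrieval step: multiply by the scale to return to higher precision."""
    return [q * scale for q in quantized]


values = [0.5, -1.25, 3.0, 0.01]
scale = per_tensor_scale(values)
restored = dequantize(quantize(values, scale), scale)
for original, approx in zip(values, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

With three mantissa bits, the relative rounding error for values in the normal range stays within 2^-4 (6.25%), which is why attention is computed on the dequantized values rather than directly on FP8 tensors.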
### Performance Impact ### Performance Impact
The current FP8 KV cache implementation primarily benefits throughput by allowing approximately double the amount of space for KV cache allocation. This enables either: The current FP8 KV cache implementation primarily benefits throughput by allowing approximately double the amount of space for KV cache allocation. This enables either:
......
...@@ -20,7 +20,7 @@ for more installation details. ...@@ -20,7 +20,7 @@ for more installation details.
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation: Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```bash ```bash
pip install vllm git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api] pip install vllm "lm-eval[api]>=0.4.9.2"
``` ```
## Quantization Process ## Quantization Process
......
...@@ -204,6 +204,42 @@ The reasoning content is also available when both tool calling and the reasoning ...@@ -204,6 +204,42 @@ The reasoning content is also available when both tool calling and the reasoning
For more examples, please refer to [examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py](../../examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py). For more examples, please refer to [examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py](../../examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py).
## Server-Level Default Chat Template Kwargs
You can set default `chat_template_kwargs` at the server level using the `--default-chat-template-kwargs` CLI argument. This is useful for configuring reasoning behavior across all requests without requiring clients to specify it in each request.
### Disabling Thinking Mode by Default
For models like Qwen3 where thinking is enabled by default, you can disable it server-wide:
```bash
vllm serve Qwen/Qwen3-8B \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}'
```
### Enabling Thinking Mode by Default
For models like IBM Granite 3.2 or DeepSeek-V3.1 where thinking is disabled by default, you can enable it server-wide:
```bash
vllm serve ibm-granite/granite-3.2-2b-instruct \
--reasoning-parser granite \
--default-chat-template-kwargs '{"thinking": true}'
```
### Request-Level Override
Request-level `chat_template_kwargs` always take priority over server defaults. For example, if the server is started with `enable_thinking=false`, a client can still enable it for a specific request:
```python
response = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"enable_thinking": True}} # Overrides server default
)
```
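
The precedence rule can be pictured as a plain dictionary merge in which request-level keys win. The sketch below is illustrative only and is not vLLM's implementation; `resolve_chat_template_kwargs` is a hypothetical helper name.

```python
def resolve_chat_template_kwargs(server_defaults, request_kwargs=None):
    """Merge server defaults with per-request kwargs; request values take priority."""
    merged = dict(server_defaults or {})   # start from the server-level defaults
    merged.update(request_kwargs or {})    # request-level keys override them
    return merged


server_defaults = {"enable_thinking": False}

# No request-level kwargs: the server default applies.
print(resolve_chat_template_kwargs(server_defaults))

# The request overrides the server default for this call only.
print(resolve_chat_template_kwargs(server_defaults, {"enable_thinking": True}))
```

Keys that a request does not set keep their server-level value, so clients only need to send the settings they want to change.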
## Limitations ## Limitations
- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`). - The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
......
...@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s ...@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
## Speculating using MLP speculators ## Speculating using MLP speculators
The following code configures vLLM to use speculative decoding where proposals are generated by The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that conditioning draft predictions on both context vectors and sampled tokens. draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124). [this technical report](https://arxiv.org/abs/2404.19124).
......
...@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with ...@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with
some options. A full set of options is available in the `vllm serve --help` some options. A full set of options is available in the `vllm serve --help`
text. text.
Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one: Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:
??? code ??? code
...@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti ...@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
``` ```
!!! tip !!! tip
While not strictly necessary, normally it´s better to indicate in the prompt the While not strictly necessary, normally it's better to indicate in the prompt the
JSON schema and how the fields should be populated. This can improve the JSON schema and how the fields should be populated. This can improve the
results notably in most cases. results notably in most cases.
Finally we have the `grammar` option, which is probably the most Finally we have the `grammar` option, which is probably the most
difficult to use, but it´s really powerful. It allows us to define complete difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context free EBNF grammar. languages like SQL queries. It works by using a context free EBNF grammar.
As an example, we can use to define a specific format of simplified SQL queries: As an example, we can use to define a specific format of simplified SQL queries:
...@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving ...@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
## Offline Inference ## Offline Inference
Offline inference allows for the same types of structured outputs. Offline inference allows for the same types of structured outputs.
To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`. To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
The main available options inside `StructuredOutputsParams` are: The main available options inside `StructuredOutputsParams` are:
- `json` - `json`
......
...@@ -317,6 +317,15 @@ Supported models: ...@@ -317,6 +317,15 @@ Supported models:
Flags: `--tool-call-parser deepseek_v31 --chat-template {see_above}` Flags: `--tool-call-parser deepseek_v31 --chat-template {see_above}`
### OpenAI OSS Models (`openai`)
Supported models:
* `openai/gpt-oss-20b`
* `openai/gpt-oss-120b`
Flags: `--tool-call-parser openai`
### Kimi-K2 Models (`kimi_k2`) ### Kimi-K2 Models (`kimi_k2`)
Supported models: Supported models:
...@@ -352,15 +361,46 @@ Supported models: ...@@ -352,15 +361,46 @@ Supported models:
* `zai-org/GLM-4.5` * `zai-org/GLM-4.5`
* `zai-org/GLM-4.5-Air` * `zai-org/GLM-4.5-Air`
* `zai-org/GLM-4.6` * `zai-org/GLM-4.6`
* `zai-org/GLM-4.6-Air`
Flags: `--tool-call-parser glm45` Flags: `--tool-call-parser glm45`
### GLM-4.7 Models (`glm47`)
Supported models:
* `zai-org/GLM-4.7`
Flags: `--tool-call-parser glm47`
### FunctionGemma Models (`functiongemma`)
Google's FunctionGemma is a lightweight (270M parameter) model specifically designed for function calling.
It's built on Gemma 3 and optimized for edge deployment on devices like laptops and phones.
Supported models:
* `google/functiongemma-270m-it`
FunctionGemma uses a unique output format with `<start_function_call>` and `<end_function_call>` tags:
```text
<start_function_call>call:get_weather{location:<escape>London<escape>}<end_function_call>
```
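
To make the format concrete, here is a minimal parsing sketch for the example above. It is not the `functiongemma` tool-call parser shipped with vLLM: `parse_function_calls` is a hypothetical helper, and it assumes every argument is a string wrapped in `<escape>` markers, as in the example. Other argument types are not handled.

```python
import re

# One complete call: <start_function_call>call:NAME{ARGS}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.DOTALL
)
# One string argument: key:<escape>value<escape>
ARG_RE = re.compile(r"(\w+):<escape>(.*?)<escape>", re.DOTALL)


def parse_function_calls(text):
    """Extract function names and string arguments from model output."""
    calls = []
    for name, body in CALL_RE.findall(text):
        arguments = dict(ARG_RE.findall(body))
        calls.append({"name": name, "arguments": arguments})
    return calls


output = (
    "<start_function_call>call:get_weather"
    "{location:<escape>London<escape>}<end_function_call>"
)
print(parse_function_calls(output))
```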
The model is designed to be fine-tuned for specific function-calling tasks for best results.
Flags: `--tool-call-parser functiongemma --chat-template examples/tool_chat_template_functiongemma.jinja`
!!! note
FunctionGemma is intended to be fine-tuned for your specific function-calling task.
The base model provides general function calling capabilities, but best results
are achieved with task-specific fine-tuning. See Google's [FunctionGemma documentation](https://ai.google.dev/gemma/docs/functiongemma) for fine-tuning guides.
### Qwen3-Coder Models (`qwen3_xml`) ### Qwen3-Coder Models (`qwen3_xml`)
Supported models: Supported models:
* `Qwen/Qwen3-480B-A35B-Instruct` * `Qwen/Qwen3-Coder-480B-A35B-Instruct`
* `Qwen/Qwen3-Coder-30B-A3B-Instruct` * `Qwen/Qwen3-Coder-30B-A3B-Instruct`
Flags: `--tool-call-parser qwen3_xml` Flags: `--tool-call-parser qwen3_xml`
......
...@@ -14,16 +14,6 @@ vLLM supports the following hardware platforms: ...@@ -14,16 +14,6 @@ vLLM supports the following hardware platforms:
## Hardware Plugins ## Hardware Plugins
The backends below live **outside** the main `vllm` repository and follow the vLLM supports third-party hardware plugins that live **outside** the main `vllm` repository. These follow the [Hardware-Pluggable RFC](../../design/plugin_system.md).
[Hardware-Pluggable RFC](../../design/plugin_system.md).
| Accelerator | PyPI / package | Repository | A list of all supported hardware can be found on the [vllm.ai website](https://vllm.ai/#hardware). If you want to add new hardware, please contact us on [Slack](https://slack.vllm.ai/) or [Email](mailto:collaboration@vllm.ai).
|-------------|----------------|------------|
| Google TPU | `tpu-inference` | <https://github.com/vllm-project/tpu-inference> |
| Ascend NPU | `vllm-ascend` | <https://github.com/vllm-project/vllm-ascend> |
| Intel Gaudi (HPU) | N/A, install from source | <https://github.com/vllm-project/vllm-gaudi> |
| MetaX MACA GPU | N/A, install from source | <https://github.com/MetaX-MACA/vLLM-metax> |
| Rebellions ATOM / REBEL NPU | `vllm-rbln` | <https://github.com/rebellions-sw/vllm-rbln> |
| IBM Spyre AIU | `vllm-spyre` | <https://github.com/vllm-project/vllm-spyre> |
| Cambricon MLU | `vllm-mlu` | <https://github.com/Cambricon/vllm-mlu> |
| Baidu Kunlun XPU | N/A, install from source | <https://github.com/baidu/vLLM-Kunlun> |
...@@ -4,6 +4,9 @@ vLLM has experimental support for macOS with Apple Silicon. For now, users must ...@@ -4,6 +4,9 @@ vLLM has experimental support for macOS with Apple Silicon. For now, users must
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes. Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
!!! tip "GPU-Accelerated Inference with vLLM-Metal"
For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.
# --8<-- [end:installation] # --8<-- [end:installation]
# --8<-- [start:requirements] # --8<-- [start:requirements]
......
# --8<-- [start:installation] # --8<-- [start:installation]
vLLM offers basic model inferencing and serving on Arm CPU platform, with support NEON, data types FP32, FP16 and BF16. vLLM offers basic model inferencing and serving on Arm CPU platform, with support for NEON, data types FP32, FP16 and BF16.
# --8<-- [end:installation] # --8<-- [end:installation]
# --8<-- [start:requirements] # --8<-- [start:requirements]
...@@ -19,12 +19,26 @@ Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels c ...@@ -19,12 +19,26 @@ Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels c
```bash ```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_VERSION}/cpu uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl
``` ```
??? console "pip" ??? console "pip"
```bash ```bash
pip install vllm==${VLLM_VERSION}+cpu --extra-index-url https://wheels.vllm.ai/${VLLM_VERSION}/cpu pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl
```
!!! warning "set `LD_PRELOAD`"
Before using vLLM CPU installed via wheels, make sure TCMalloc is installed and added to `LD_PRELOAD`:
```bash
# install TCMalloc
sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4
# manually find the path
sudo find / -iname '*libtcmalloc_minimal.so.4'
TC_PATH=...
# add them to LD_PRELOAD
export LD_PRELOAD="$TC_PATH:$LD_PRELOAD"
``` ```
The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
...@@ -37,7 +51,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe ...@@ -37,7 +51,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
To install from nightly index, run: To install from nightly index, run:
```bash ```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index
``` ```
??? console "pip (there's a caveat)" ??? console "pip (there's a caveat)"
...@@ -56,7 +70,7 @@ If you want to access the wheels for previous commits (e.g. to bisect the behavi ...@@ -56,7 +70,7 @@ If you want to access the wheels for previous commits (e.g. to bisect the behavi
```bash ```bash
export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index
``` ```
# --8<-- [end:pre-built-wheels] # --8<-- [end:pre-built-wheels]
...@@ -105,6 +119,20 @@ VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation ...@@ -105,6 +119,20 @@ VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
Testing has been conducted on AWS Graviton3 instances for compatibility. Testing has been conducted on AWS Graviton3 instances for compatibility.
!!! warning "set `LD_PRELOAD`"
Before using vLLM CPU installed via wheels, make sure TCMalloc is installed and added to `LD_PRELOAD`:
```bash
# install TCMalloc
sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4
# manually find the path
sudo find / -iname '*libtcmalloc_minimal.so.4'
TC_PATH=...
# add them to LD_PRELOAD
export LD_PRELOAD="$TC_PATH:$LD_PRELOAD"
```
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
......
...@@ -18,6 +18,12 @@ vLLM is a Python library that supports the following CPU variants. Select your C ...@@ -18,6 +18,12 @@ vLLM is a Python library that supports the following CPU variants. Select your C
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation" --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
## Technical Discussions
The main discussions happen in the `#sig-cpu` channel of [vLLM Slack](https://slack.vllm.ai/).
When opening a GitHub issue about the CPU backend, please add `[CPU Backend]` to the title so that it is labeled with `cpu` for better awareness.
## Requirements ## Requirements
- Python: 3.10 -- 3.13 - Python: 3.10 -- 3.13
...@@ -166,13 +172,13 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe ...@@ -166,13 +172,13 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
### What are supported models on CPU? ### What are supported models on CPU?
For the full and up-to-date list of models validated on CPU platforms, please see the official documentation: [Supported Models on CPU](https://docs.vllm.ai/en/latest/models/hardware_supported_models/cpu) For the full and up-to-date list of models validated on CPU platforms, please see the official documentation: [Supported Models on CPU](../../models/hardware_supported_models/cpu.md)
### How to find benchmark configuration examples for supported CPU models? ### How to find benchmark configuration examples for supported CPU models?
For any model listed under [Supported Models on CPU](https://docs.vllm.ai/en/latest/models/hardware_supported_models/cpu), optimized runtime configurations are provided in the vLLM Benchmark Suite’s CPU test cases, defined in [cpu test cases](https://github.com/vllm-project/vllm/blob/main/.buildkite/performance-benchmarks/tests/serving-tests-cpu.json) For any model listed under [Supported Models on CPU](../../models/hardware_supported_models/cpu.md), optimized runtime configurations are provided in the vLLM Benchmark Suite’s CPU test cases, defined in [cpu test cases](../../../.buildkite/performance-benchmarks/tests/serving-tests-cpu.json)
For details on how these optimized configurations are determined, see: [performance-benchmark-details](https://github.com/vllm-project/vllm/tree/main/.buildkite/performance-benchmarks#performance-benchmark-details). For details on how these optimized configurations are determined, see: [performance-benchmark-details](../../../.buildkite/performance-benchmarks/README.md#performance-benchmark-details).
To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](https://docs.vllm.ai/en/latest/contributing/benchmarks/#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment. To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](../../benchmarking/dashboard.md#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment.
Below is an example command to benchmark all CPU-supported models using optimized configurations. Below is an example command to benchmark all CPU-supported models using optimized configurations.
...@@ -258,11 +264,6 @@ vLLM CPU supports data parallel (DP), tensor parallel (TP) and pipeline parallel ...@@ -258,11 +264,6 @@ vLLM CPU supports data parallel (DP), tensor parallel (TP) and pipeline parallel
- GPTQ (x86 only) - GPTQ (x86 only)
- compressed-tensor INT8 W8A8 (x86, s390x) - compressed-tensor INT8 W8A8 (x86, s390x)
### (x86 only) What is the purpose of `VLLM_CPU_SGL_KERNEL`?
- Both of them require `amx` CPU flag.
- `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.
### Why do I see `get_mempolicy: Operation not permitted` when running in Docker? ### Why do I see `get_mempolicy: Operation not permitted` when running in Docker?
In some container environments (like Docker), NUMA-related syscalls used by vLLM (e.g., `get_mempolicy`, `migrate_pages`) are blocked/denied in the runtime's default seccomp/capabilities settings. This may lead to warnings like `get_mempolicy: Operation not permitted`. Functionality is not affected, but NUMA memory binding/migration optimizations may not take effect and performance can be suboptimal. In some container environments (like Docker), NUMA-related syscalls used by vLLM (e.g., `get_mempolicy`, `migrate_pages`) are blocked/denied in the runtime's default seccomp/capabilities settings. This may lead to warnings like `get_mempolicy: Operation not permitted`. Functionality is not affected, but NUMA memory binding/migration optimizations may not take effect and performance can be suboptimal.
......
...@@ -17,7 +17,51 @@ vLLM supports basic model inferencing and serving on x86 CPU platform, with data ...@@ -17,7 +17,51 @@ vLLM supports basic model inferencing and serving on x86 CPU platform, with data
# --8<-- [end:set-up-using-python] # --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] # --8<-- [start:pre-built-wheels]
Currently, there are no pre-built x86 CPU wheels. Pre-built vLLM wheels for x86 with AVX512 are available since version 0.13.0. To install release wheels:
```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
# use uv
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
```
??? console "pip"
```bash
# use pip
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cpu
```
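
The release wheel URLs used above follow a fixed pattern, so they can be assembled programmatically. The sketch below only reproduces the URL patterns shown in these commands. `release_wheel_url` and `cpu_index_url` are hypothetical helper names, and the sketch does not check that a wheel actually exists for a given version.

```python
SUPPORTED_ARCHS = ("x86_64", "aarch64")


def release_wheel_url(version, arch):
    """Build the GitHub release URL for a CPU wheel, following the pattern above."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"unsupported architecture: {arch}")
    if version.startswith("v"):
        version = version[1:]          # same effect as `sed 's/^v//'`
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{version}/vllm-{version}+cpu-cp38-abi3-manylinux_2_35_{arch}.whl"
    )


def cpu_index_url(ref="nightly"):
    """Build the extra index URL for nightly wheels or a full commit hash."""
    return f"https://wheels.vllm.ai/{ref}/cpu"


print(release_wheel_url("v0.14.0", "x86_64"))
print(cpu_index_url())
print(cpu_index_url("730bd35378bf2a5b56b6d3a45be28b3092d26519"))
```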
!!! warning "set `LD_PRELOAD`"
Before using vLLM CPU installed via wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
```bash
# install TCMalloc, Intel OpenMP is installed with vLLM CPU
sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4
# manually find the path
sudo find / -iname '*libtcmalloc_minimal.so.4'
sudo find / -iname '*libiomp5.so'
TC_PATH=...
IOMP_PATH=...
# add them to LD_PRELOAD
export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
```
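
As an alternative to the manual `find` step, the lookup can be scripted. The sketch below is a hypothetical helper, not part of vLLM: `find_library` and `build_ld_preload` are illustrative names, and the default search roots are assumptions about where a distribution or Python environment might place the libraries. Adjust them for your system.

```python
import fnmatch
import os
import sys

# Assumed search locations; your system may use different ones.
DEFAULT_ROOTS = ("/usr/lib", "/usr/lib64", "/usr/local/lib", sys.prefix)


def find_library(pattern, roots=DEFAULT_ROOTS):
    """Return the first file under `roots` whose name matches `pattern`, or None."""
    for root in roots:
        # os.walk silently skips missing or unreadable directories.
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                if fnmatch.fnmatch(filename, pattern):
                    return os.path.join(dirpath, filename)
    return None


def build_ld_preload(paths, existing=""):
    """Join the found library paths, keeping any existing LD_PRELOAD value last."""
    parts = [p for p in paths if p]
    if existing:
        parts.append(existing)
    return ":".join(parts)


tc_path = find_library("libtcmalloc_minimal.so.4*")
iomp_path = find_library("libiomp5.so")
print(build_ld_preload([tc_path, iomp_path], os.environ.get("LD_PRELOAD", "")))
```

The printed value can then be exported as `LD_PRELOAD` in the shell that launches vLLM. Libraries that are not found are left out rather than guessed.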
**Install the latest code**
To install the wheel built from the latest main branch:
```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
```
**Install specific revisions**
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
```bash
export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit hash from the main branch
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
```
# --8<-- [end:pre-built-wheels] # --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] # --8<-- [start:build-wheel-from-source]
...@@ -26,10 +70,12 @@ Install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the def ...@@ -26,10 +70,12 @@ Install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the def
```bash ```bash
sudo apt-get update -y sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
``` ```
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
Clone the vLLM project: Clone the vLLM project:
```bash ```bash
...@@ -82,6 +128,22 @@ uv pip install dist/*.whl ...@@ -82,6 +128,22 @@ uv pip install dist/*.whl
pip install dist/*.whl pip install dist/*.whl
``` ```
!!! warning "set `LD_PRELOAD`"
Before using vLLM CPU installed via wheels, make sure TCMalloc and Intel OpenMP are installed and added to `LD_PRELOAD`:
```bash
# install TCMalloc, Intel OpenMP is installed with vLLM CPU
sudo apt-get install -y --no-install-recommends libtcmalloc-minimal4
# manually find the path
sudo find / -iname '*libtcmalloc_minimal.so.4'
sudo find / -iname '*libiomp5.so'
TC_PATH=...
IOMP_PATH=...
# add them to LD_PRELOAD
export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
```
!!! example "Troubleshooting" !!! example "Troubleshooting"
- **NumPy ≥2.0 error**: Downgrade using `pip install "numpy<2.0"`. - **NumPy ≥2.0 error**: Downgrade using `pip install "numpy<2.0"`.
- **CMake picks up CUDA**: Add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON` to prevent CUDA detection during CPU builds, even if CUDA is installed. - **CMake picks up CUDA**: Add `CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON` to prevent CUDA detection during CPU builds, even if CUDA is installed.
...@@ -95,7 +157,6 @@ uv pip install dist/*.whl ...@@ -95,7 +157,6 @@ uv pip install dist/*.whl
"torch==X.Y.Z+cpu" # <------- "torch==X.Y.Z+cpu" # <-------
] ]
``` ```
- If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
# --8<-- [end:build-wheel-from-source] # --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] # --8<-- [start:pre-built-images]
...@@ -112,6 +173,7 @@ uv pip install dist/*.whl ...@@ -112,6 +173,7 @@ uv pip install dist/*.whl
docker build -f docker/Dockerfile.cpu \ docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=false (default)|true \ --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
--build-arg VLLM_CPU_AVX512VNNI=false (default)|true \ --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
--build-arg VLLM_CPU_AMXBF16=false|true (default) \
--build-arg VLLM_CPU_DISABLE_AVX512=false (default)|true \ --build-arg VLLM_CPU_DISABLE_AVX512=false (default)|true \
--tag vllm-cpu-env \ --tag vllm-cpu-env \
--target vllm-openai . --target vllm-openai .
...@@ -123,9 +185,8 @@ docker run --rm \ ...@@ -123,9 +185,8 @@ docker run --rm \
--shm-size=4g \ --shm-size=4g \
-p 8000:8000 \ -p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \ -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
vllm-cpu-env \ vllm-cpu-env \
--model=meta-llama/Llama-3.2-1B-Instruct \ meta-llama/Llama-3.2-1B-Instruct \
--dtype=bfloat16 \ --dtype=bfloat16 \
other vLLM OpenAI server arguments other vLLM OpenAI server arguments
``` ```
......
...@@ -98,9 +98,24 @@ Currently, there are no pre-built ROCm wheels. ...@@ -98,9 +98,24 @@ Currently, there are no pre-built ROCm wheels.
!!! note !!! note
- You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose. - You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
- The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). - The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
4. Build vLLM. For example, vLLM on ROCM 7.0 can be built with the following steps:
4. If you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps:
```bash
git clone https://github.com/ROCm/mori.git
cd mori
git checkout $MORI_BRANCH_OR_COMMIT
git submodule sync; git submodule update --init --recursive
MORI_GPU_ARCHS="gfx942;gfx950" python3 install .
```
!!! note
- You will need to configure `$MORI_BRANCH_OR_COMMIT` for your purpose.
- The validated `$MORI_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
5. Build vLLM. For example, vLLM on ROCM 7.0 can be built with the following steps:
???+ console "Commands" ???+ console "Commands"
......
On NVIDIA CUDA only, it's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands: It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
```bash ```bash
uv venv --python 3.12 --seed uv venv --python 3.12 --seed
......
...@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform: ...@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform:
For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/). For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
!!! note !!! note
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM. For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.
## Offline Batched Inference ## Offline Batched Inference
......
...@@ -18,7 +18,7 @@ For features that you intend to maintain, please feel free to add yourself in [` ...@@ -18,7 +18,7 @@ For features that you intend to maintain, please feel free to add yourself in [`
If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly. If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly.
The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers. The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers.
Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate. Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. Model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
Once we establish the connection between the vLLM team and model provider: Once we establish the connection between the vLLM team and model provider:
...@@ -30,7 +30,7 @@ The vLLM team works with model providers on features, integrations, and release ...@@ -30,7 +30,7 @@ The vLLM team works with model providers on features, integrations, and release
The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request. The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request.
The vLLM team collaborates on marketing and promotional efforts for model releases. model providers can use vLLM's trademark and logo in publications and materials. The vLLM team collaborates on marketing and promotional efforts for model releases. Model providers can use vLLM's trademark and logo in publications and materials.
## Adding New Hardware ## Adding New Hardware
......
...@@ -181,3 +181,4 @@ If you have PRs touching the area, please feel free to ping the area owner for r ...@@ -181,3 +181,4 @@ If you have PRs touching the area, please feel free to ping the area owner for r
- Ascend NPU: [@wangxiyuan](https://github.com/wangxiyuan) and [see more details](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html#maintainers) - Ascend NPU: [@wangxiyuan](https://github.com/wangxiyuan) and [see more details](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html#maintainers)
- Intel Gaudi HPU [@xuechendi](https://github.com/xuechendi) and [@kzawora-intel](https://github.com/kzawora-intel) - Intel Gaudi HPU [@xuechendi](https://github.com/xuechendi) and [@kzawora-intel](https://github.com/kzawora-intel)
- Semantic Router: [@xunzhuo](https://github.com/xunzhuo), [@rootfs](https://github.com/rootfs) and [see more details](https://vllm-semantic-router.com/community/team)
...@@ -17,7 +17,7 @@ from pydantic_core import core_schema ...@@ -17,7 +17,7 @@ from pydantic_core import core_schema
logger = logging.getLogger("mkdocs") logger = logging.getLogger("mkdocs")
ROOT_DIR = Path(__file__).parent.parent.parent.parent ROOT_DIR = Path(__file__).parent.parent.parent.parent
ARGPARSE_DOC_DIR = ROOT_DIR / "docs/argparse" ARGPARSE_DOC_DIR = ROOT_DIR / "docs/generated/argparse"
sys.path.insert(0, str(ROOT_DIR)) sys.path.insert(0, str(ROOT_DIR))
...@@ -92,6 +92,7 @@ def auto_mock(module_name: str, attr: str, max_mocks: int = 100): ...@@ -92,6 +92,7 @@ def auto_mock(module_name: str, attr: str, max_mocks: int = 100):
bench_latency = auto_mock("vllm.benchmarks", "latency") bench_latency = auto_mock("vllm.benchmarks", "latency")
bench_mm_processor = auto_mock("vllm.benchmarks", "mm_processor")
bench_serve = auto_mock("vllm.benchmarks", "serve") bench_serve = auto_mock("vllm.benchmarks", "serve")
bench_sweep_plot = auto_mock("vllm.benchmarks.sweep.plot", "SweepPlotArgs") bench_sweep_plot = auto_mock("vllm.benchmarks.sweep.plot", "SweepPlotArgs")
bench_sweep_plot_pareto = auto_mock( bench_sweep_plot_pareto = auto_mock(
...@@ -222,6 +223,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): ...@@ -222,6 +223,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
"run-batch": create_parser(openai_run_batch.make_arg_parser), "run-batch": create_parser(openai_run_batch.make_arg_parser),
# Benchmark CLI # Benchmark CLI
"bench_latency": create_parser(bench_latency.add_cli_args), "bench_latency": create_parser(bench_latency.add_cli_args),
"bench_mm_processor": create_parser(bench_mm_processor.add_cli_args),
"bench_serve": create_parser(bench_serve.add_cli_args), "bench_serve": create_parser(bench_serve.add_cli_args),
"bench_sweep_plot": create_parser(bench_sweep_plot.add_cli_args), "bench_sweep_plot": create_parser(bench_sweep_plot.add_cli_args),
"bench_sweep_plot_pareto": create_parser(bench_sweep_plot_pareto.add_cli_args), "bench_sweep_plot_pareto": create_parser(bench_sweep_plot_pareto.add_cli_args),
......
...@@ -13,14 +13,14 @@ GENERATED_METRICS_DIR = DOCS_DIR / "generated" / "metrics" ...@@ -13,14 +13,14 @@ GENERATED_METRICS_DIR = DOCS_DIR / "generated" / "metrics"
# Files to scan for metric definitions - each will generate a separate table # Files to scan for metric definitions - each will generate a separate table
METRIC_SOURCE_FILES = [ METRIC_SOURCE_FILES = [
{"path": "vllm/v1/metrics/loggers.py", "output": "general.md"}, {"path": "vllm/v1/metrics/loggers.py", "output": "general.inc.md"},
{ {
"path": "vllm/v1/spec_decode/metrics.py", "path": "vllm/v1/spec_decode/metrics.py",
"output": "spec_decode.md", "output": "spec_decode.inc.md",
}, },
{ {
"path": "vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", "path": "vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py",
"output": "nixl_connector.md", "output": "nixl_connector.inc.md",
}, },
] ]
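As a reading aid for the hunk above, here is a minimal sketch of how each entry's `output` name would resolve to a file under `GENERATED_METRICS_DIR` after the rename to `.inc.md`. The `DOCS_DIR` value and the `output_paths` helper are assumptions for illustration and are not part of the diff; the diff also does not state why the suffix changed.

```python
from pathlib import Path

# Assumption: DOCS_DIR is a plain relative "docs" directory for this sketch.
DOCS_DIR = Path("docs")
GENERATED_METRICS_DIR = DOCS_DIR / "generated" / "metrics"

# Entries as they appear on the new side of the diff.
METRIC_SOURCE_FILES = [
    {"path": "vllm/v1/metrics/loggers.py", "output": "general.inc.md"},
    {"path": "vllm/v1/spec_decode/metrics.py", "output": "spec_decode.inc.md"},
    {
        "path": "vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py",
        "output": "nixl_connector.inc.md",
    },
]


def output_paths(entries, out_dir):
    """Hypothetical helper: map each source file to its generated table path."""
    return {entry["path"]: out_dir / entry["output"] for entry in entries}


for source, target in output_paths(METRIC_SOURCE_FILES, GENERATED_METRICS_DIR).items():
    print(f"{source} -> {target}")
```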
......
...@@ -34,9 +34,10 @@ TITLE = r"(?P<title>[^\[\]<>]+?)" ...@@ -34,9 +34,10 @@ TITLE = r"(?P<title>[^\[\]<>]+?)"
REPO = r"(?P<repo>.+?/.+?)" REPO = r"(?P<repo>.+?/.+?)"
TYPE = r"(?P<type>issues|pull|projects)" TYPE = r"(?P<type>issues|pull|projects)"
NUMBER = r"(?P<number>\d+)" NUMBER = r"(?P<number>\d+)"
PATH = r"(?P<path>[^\s]+?)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?" FRAGMENT = r"(?P<fragment>#[^\s]+)?"
URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}" URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"
RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)" RELATIVE = rf"(?!(https?|ftp)://|#){PATH}{FRAGMENT}"
# Common titles to use for GitHub links when none is provided in the link. # Common titles to use for GitHub links when none is provided in the link.
TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "} TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
...@@ -55,6 +56,7 @@ def on_page_markdown( ...@@ -55,6 +56,7 @@ def on_page_markdown(
title = match.group("title") title = match.group("title")
path = match.group("path") path = match.group("path")
path = (Path(page.file.abs_src_path).parent / path).resolve() path = (Path(page.file.abs_src_path).parent / path).resolve()
fragment = match.group("fragment") or ""
# Check if the path exists and is outside the docs dir # Check if the path exists and is outside the docs dir
if not path.exists() or path.is_relative_to(DOC_DIR): if not path.exists() or path.is_relative_to(DOC_DIR):
...@@ -64,7 +66,7 @@ def on_page_markdown( ...@@ -64,7 +66,7 @@ def on_page_markdown(
slug = "tree/main" if path.is_dir() else "blob/main" slug = "tree/main" if path.is_dir() else "blob/main"
path = path.relative_to(ROOT_DIR) path = path.relative_to(ROOT_DIR)
url = f"https://github.com/vllm-project/vllm/{slug}/{path}" url = f"https://github.com/vllm-project/vllm/{slug}/{path}{fragment}"
return f"[{gh_icon} {title}]({url})" return f"[{gh_icon} {title}]({url})"
def replace_github_link(match: re.Match) -> str: def replace_github_link(match: re.Match) -> str:
...@@ -88,8 +90,4 @@ def on_page_markdown( ...@@ -88,8 +90,4 @@ def on_page_markdown(
markdown = relative_link.sub(replace_relative_link, markdown) markdown = relative_link.sub(replace_relative_link, markdown)
markdown = github_link.sub(replace_github_link, markdown) markdown = github_link.sub(replace_github_link, markdown)
if "interface" in str(page.file.abs_src_path):
print(markdown)
return markdown return markdown
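The effect of splitting `FRAGMENT` out of `RELATIVE` can be sketched as follows. With the old pattern, a trailing `#...` is captured as part of `path`, so a later existence check on that path would likely fail; with the new pattern the fragment lands in its own group and can be re-appended to the generated URL. The `\[title\](link)` wrapper and the `split_link` helper are assumptions for illustration, since the diff does not show how `RELATIVE` is compiled.

```python
import re

TITLE = r"(?P<title>[^\[\]<>]+?)"
PATH = r"(?P<path>[^\s]+?)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?"

# Old and new definitions, copied from the two sides of the diff.
OLD_RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)"
NEW_RELATIVE = rf"(?!(https?|ftp)://|#){PATH}{FRAGMENT}"

# Assumption: the hook matches Markdown links of the form [title](link).
old_link = re.compile(rf"\[{TITLE}\]\({OLD_RELATIVE}\)")
new_link = re.compile(rf"\[{TITLE}\]\({NEW_RELATIVE}\)")


def split_link(pattern, markdown):
    """Hypothetical helper: return (path, fragment) for the first link, or None."""
    match = pattern.search(markdown)
    if match is None:
        return None
    groups = match.groupdict()
    # Mirrors the diff's `match.group("fragment") or ""` for the new pattern.
    return groups["path"], groups.get("fragment") or ""


link = "[config](../../vllm/config.py#L10)"
print(split_link(old_link, link))  # fragment is folded into the path
print(split_link(new_link, link))  # path and fragment are separated
```

Absolute URLs and pure in-page anchors are still rejected by the negative lookahead in both versions, so only relative links are rewritten.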
Loading Model weights with fastsafetensors Loading model weights with fastsafetensors
=================================================================== ===================================================================
Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details. Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
......