Commit 4eabe123 authored by zhuwenwen

Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

parents 45840cd2 58738772
(meetups)= ---
title: Meetups
# vLLM Meetups ---
[](){ #meetups }
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
......
# Configuration Options
This section lists the most common options for running vLLM.
There are three main levels of configuration, from highest priority to lowest priority:
- [Request parameters][completions-api] and [input arguments][sampling-params]
- [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md)
(offline-inference)= # Conserving Memory
# Offline Inference
You can run vLLM in your own code on a list of prompts.
The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.
```python
from vllm import LLM
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.
Please refer to the above pages for more details about each API.
:::{seealso}
[API Reference](#offline-inference-api)
:::
(configuration-options)=
## Configuration Options
This section lists the most common options for running the vLLM engine.
For a full list, refer to the <project:#configuration> page.
(model-resolution)=
### Model resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
from vllm import LLM
model = LLM(
model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
)
```
Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
(reducing-memory-usage)=
### Reducing memory usage
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
#### Tensor Parallelism (TP) ## Tensor Parallelism (TP)
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
...@@ -80,29 +15,27 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", ...@@ -80,29 +15,27 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2) tensor_parallel_size=2)
``` ```
:::{important} !!! warning
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`) To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
:::
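The warning above can be sketched as follows: set `CUDA_VISIBLE_DEVICES` before anything touches CUDA, then create the engine. This is a minimal sketch, not an official recipe. `select_gpus` and `build_llm` are hypothetical helper names, and the GPU indices `0` and `1` are assumptions about the machine.

```python
import os


def select_gpus(device_ids):
    """Restrict this process to the given GPU indices (hypothetical helper)."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Run this before any code that initializes CUDA.
visible = select_gpus([0, 1])


def build_llm():
    # Imported lazily so that the environment variable is already in place.
    from vllm import LLM

    return LLM(model="ibm-granite/granite-3.1-8b-instruct",
               tensor_parallel_size=2)
```

Calling `build_llm()` afterwards creates the engine on the two selected devices without any call to `torch.cuda.set_device`.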
:::{note} !!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
:::
#### Quantization ## Quantization
Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details. Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
#### Context length and batch size ## Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
...@@ -115,13 +48,12 @@ llm = LLM(model="adept/fuyu-8b", ...@@ -115,13 +48,12 @@ llm = LLM(model="adept/fuyu-8b",
max_num_seqs=2) max_num_seqs=2)
``` ```
#### Reduce CUDA Graphs ## Reduce CUDA Graphs
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
:::{important} !!! warning
CUDA graph capture takes up more memory in V1 than in V0.
:::
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
...@@ -148,14 +80,14 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", ...@@ -148,14 +80,14 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
enforce_eager=True) enforce_eager=True)
``` ```
#### Adjust cache size ## Adjust cache size
If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
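As a rough illustration of the two options above, the variables can be set from Python before vLLM is started. The value `2` is an arbitrary example, and the sketch assumes both variables take a size in GiB, as the defaults above suggest.

```python
import os

# Illustrative values only; pick sizes that fit your machine.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"  # multi-modal input cache
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "2"   # KV cache on the CPU backend

# Collect the settings so they can be logged or inspected.
cache_settings = {
    name: os.environ[name]
    for name in ("VLLM_MM_INPUT_CACHE_GIB", "VLLM_CPU_KVCACHE_SPACE")
}
```

The same effect can be had by exporting the variables in the shell that launches vLLM.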
#### Multi-modal input limits ## Multi-modal input limits
You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
...@@ -188,7 +120,7 @@ llm = LLM(model="google/gemma-3-27b-it", ...@@ -188,7 +120,7 @@ llm = LLM(model="google/gemma-3-27b-it",
limit_mm_per_prompt={"image": 0}) limit_mm_per_prompt={"image": 0})
``` ```
#### Multi-modal processor arguments ## Multi-modal processor arguments
For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.
...@@ -210,8 +142,3 @@ llm = LLM(model="OpenGVLab/InternVL2-2B", ...@@ -210,8 +142,3 @@ llm = LLM(model="OpenGVLab/InternVL2-2B",
"max_dynamic_patch": 4, # Default is 12 "max_dynamic_patch": 4, # Default is 12
}) })
``` ```
### Performance optimization and tuning
You can potentially improve the performance of vLLM by finetuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.
---
title: Engine Arguments
---
[](){ #engine-args }
Engine arguments control the behavior of the vLLM engine.
- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented.
For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config.
!!! note
Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine] which is used for online serving. These can be found by running `vllm serve --help`
# Environment Variables
vLLM uses the following environment variables to configure the system:
!!! warning
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```python
--8<-- "vllm/envs.py:env-vars-definition"
```
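To make the Kubernetes warning above concrete, the sketch below reproduces the naming scheme that Kubernetes uses for service environment variables: the service name is upper-cased and dashes become underscores. `k8s_service_env` is a hypothetical helper written for this illustration, and the cluster IP and port are made-up values.

```python
def k8s_service_env(service_name, cluster_ip, port):
    """Return the variables Kubernetes would inject for a TCP service."""
    prefix = service_name.upper().replace("-", "_")
    url = f"tcp://{cluster_ip}:{port}"
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
        f"{prefix}_PORT": url,
        f"{prefix}_PORT_{port}_TCP": url,
        f"{prefix}_PORT_{port}_TCP_PROTO": "tcp",
        f"{prefix}_PORT_{port}_TCP_PORT": str(port),
        f"{prefix}_PORT_{port}_TCP_ADDR": cluster_ip,
    }


# A service named "vllm" yields names that all start with VLLM_.
injected = k8s_service_env("vllm", "10.0.0.11", 8000)
clashes_with_vllm_port = "VLLM_PORT" in injected
```

Here `VLLM_PORT` would hold a `tcp://` URL rather than a port number, which is why naming the service `vllm` is discouraged.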
# Model Resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
from vllm import LLM
model = LLM(
model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
)
```
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
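Before reaching for `hf_overrides`, it can help to check whether a locally downloaded `config.json` declares `architectures` at all. The following is a minimal sketch using only the standard library. `read_architectures` is a hypothetical helper, not part of vLLM, and the temporary file stands in for a real model repository.

```python
import json
import tempfile
from pathlib import Path


def read_architectures(config_path):
    """Return the `architectures` list from a config.json, or None if absent."""
    config = json.loads(Path(config_path).read_text())
    architectures = config.get("architectures")
    return architectures if architectures else None


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"

    # A config without the field: model resolution has nothing to go on.
    path.write_text(json.dumps({"model_type": "gpt2"}))
    missing = read_architectures(path)

    # A config with the field set.
    path.write_text(json.dumps({"architectures": ["GPT2LMHeadModel"]}))
    present = read_architectures(path)
```

When the helper returns `None`, passing `hf_overrides={"architectures": [...]}` as shown above supplies the missing value.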
(optimization-and-tuning)=
# Optimization and Tuning
This guide covers optimization strategies and performance tuning for vLLM V1.
...@@ -26,7 +24,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp ...@@ -26,7 +24,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
(chunked-prefill)= [](){ #chunked-prefill }
## Chunked Prefill
......
(serve-args)= ---
title: Server Arguments
# Server Arguments ---
[](){ #serve-args }
The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments
The following are all arguments available from the `vllm serve` command: The `vllm serve` command is used to launch the OpenAI-compatible server.
To see the available CLI arguments, run `vllm serve --help`!
<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst}
.. argparse::
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
:nodefaultconst:
:markdownhelp:
```
## Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above](#serve-args). The argument names must be the long form of those outlined [above][serve-args].
For example:
...@@ -40,8 +32,7 @@ To use the above config file: ...@@ -40,8 +32,7 @@ To use the above config file:
vllm serve --config config.yaml vllm serve --config config.yaml
``` ```
:::{note} !!! note
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`.
e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
:::
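The priority order in the note above can be modelled as a layered dictionary merge. This is an illustrative sketch of the precedence rule only, not vLLM's actual argument parser. `resolve_args` is a hypothetical helper and the values are made up.

```python
def resolve_args(defaults, config_file, command_line):
    """Merge three layers so that later layers override earlier ones."""
    merged = dict(defaults)
    merged.update(config_file)   # config file values override defaults
    merged.update(command_line)  # command line overrides everything
    return merged


resolved = resolve_args(
    defaults={"model": None, "port": 8000},
    config_file={"model": "model-from-config", "port": 9000},
    command_line={"model": "SOME_MODEL"},
)
```

In this example `resolved["model"]` is `"SOME_MODEL"` because the command line wins, while `resolved["port"]` is `9000` because only the config file overrides the default.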
...@@ -27,7 +27,21 @@ See <gh-file:LICENSE>. ...@@ -27,7 +27,21 @@ See <gh-file:LICENSE>.
## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source](#build-from-source) documentation for details. Check out the [building from source][build-from-source] documentation for details.
### Building the docs
Install the dependencies:
```bash
pip install -r requirements/docs.txt
```
Start the autoreloading MkDocs server:
```bash
mkdocs serve
```
## Testing
...@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files ...@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
pytest tests/ pytest tests/
``` ```
:::{tip} !!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
:::
:::{note} !!! note
Currently, the repository is not fully checked by `mypy`.
:::
:::{note} !!! note
Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
platform to run unit tests locally, rely on the continuous integration system to run the tests for
now.
:::
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
:::{important} !!! warning
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
:::
## Pull Requests & Code Reviews
...@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following: ...@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.
:::{note} !!! note
If the PR spans more than one category, please include all relevant prefixes.
:::
### Code Quality
...@@ -121,9 +130,8 @@ The PR needs to meet the following code quality standards: ...@@ -121,9 +130,8 @@ The PR needs to meet the following code quality standards:
understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This
includes both unit tests and integration tests.
- Please add documentation to `docs/source/` if the PR modifies the - Please add documentation to `docs/` if the PR modifies the user-facing behaviors of vLLM.
user-facing behaviors of vLLM. It helps vLLM users understand and utilize the It helps vLLM users understand and utilize the new features or changes.
new features or changes.
### Adding or Changing Kernels
......
(benchmarks)= ---
title: Benchmark Suites
# Benchmark Suites ---
[](){ #benchmarks }
vLLM contains two sets of benchmarks:
- [Performance benchmarks](#performance-benchmarks) - [Performance benchmarks][performance-benchmarks]
- [Nightly benchmarks](#nightly-benchmarks) - [Nightly benchmarks][nightly-benchmarks]
(performance-benchmarks)= [](){ #performance-benchmarks }
## Performance Benchmarks
...@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo ...@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
(nightly-benchmarks)= [](){ #nightly-benchmarks }
## Nightly Benchmarks
......