# vLLM

We recommend trying [vLLM](https://github.com/vllm-project/vllm) for deploying Qwen. 
It is simple to use and fast, with state-of-the-art serving throughput, efficient management of attention key-value memory via PagedAttention, continuous batching of incoming requests, optimized CUDA kernels, and more. 
To learn more about vLLM, please refer to the [paper](https://arxiv.org/abs/2309.06180) and [documentation](https://docs.vllm.ai/).

## Environment Setup

By default, you can install `vllm` with pip in a clean environment:

```shell
pip install "vllm>=0.8.4"
```

Please note that the prebuilt `vllm` has strict dependencies on `torch` and its CUDA version.
Check the installation notes in the official documentation ([link](https://docs.vllm.ai/en/latest/getting_started/installation.html)) for more help.
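
As a quick sanity check after installation, you can confirm which versions were resolved (a minimal sketch; the exact version strings depend on your environment):

```python
import torch
import vllm

# Report the installed vLLM version and the CUDA version torch was built against.
print("vllm:", vllm.__version__)
print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```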

## API Service

It is easy to build an OpenAI-compatible API service with vLLM, which can be deployed as a server implementing the OpenAI API protocol.
By default, it starts the server at `http://localhost:8000`. 
You can specify the address with `--host` and `--port` arguments. 
Run the command as shown below:
```shell
vllm serve Qwen/Qwen3-8B
```

By default, if the model argument does not point to a valid local directory, the model files will be downloaded from the Hugging Face Hub.
To download models from ModelScope instead, set the following environment variable before running the above command:
```shell
export VLLM_USE_MODELSCOPE=true
```

For distributed inference with tensor parallelism, it is as simple as
```shell
vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4
```
The above command uses tensor parallelism across 4 GPUs.
Change the number of GPUs according to your setup.

### Basic Usage

Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:

::::{tab-set}

:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
```
:::

:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:

```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k is a vLLM extension, not part of the OpenAI schema
    max_tokens=32768,
)
print("Chat response:", chat_response)
```
:::
::::

:::{tip}
`vllm` will use the sampling parameters from the `generation_config.json` in the model files.

While the default sampling parameters work most of the time in thinking mode,
it is recommended to adjust the sampling parameters according to your application,
and to always pass them explicitly to the API.
:::


### Thinking & Non-Thinking Modes

Qwen3 models think before responding.
This behaviour can be controlled by either the hard switch, which disables thinking completely, or the soft switch, where the model follows the user's instruction on whether it should think (an example of the soft switch is shown at the end of this subsection).

The hard switch is available in vLLM through the following configuration of the API call.
To disable thinking, use

::::{tab-set}

:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "max_tokens": 8192,
  "presence_penalty": 1.5,
  "chat_template_kwargs": {"enable_thinking": false}
}'
```
:::

:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:

```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    # top_k and chat_template_kwargs are vLLM extensions, passed via extra_body
    extra_body={"top_k": 20, "chat_template_kwargs": {"enable_thinking": False}},
)
print("Chat response:", chat_response)
```
:::
::::

:::{tip}
It is recommended to set sampling parameters differently for thinking and non-thinking modes.
:::
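
The soft switch needs no server-side configuration: Qwen3 follows `/think` and `/no_think` directives placed in user messages, with the most recent directive taking effect in multi-turn conversations. A minimal sketch using the `openai` SDK:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Appending /no_think to the user message disables thinking for this turn;
# /think in a later turn re-enables it.
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models. /no_think"},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={"top_k": 20},
)
print("Chat response:", chat_response)
```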

### Parsing Thinking Content

vLLM supports parsing the thinking content from the model generation into structured messages:
```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```

The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model.

:::{note}
Please note that this feature is not OpenAI API compatible.
:::
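
On the client side, the extra field can be read from the response message. A minimal sketch using the `openai` SDK (the field is only populated when the server runs with the reasoning parser enabled):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
message = chat_response.choices[0].message
# reasoning_content is a vLLM extension and not part of the OpenAI schema
print("Thinking:", message.reasoning_content)
print("Answer:", message.content)
```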

### Parsing Tool Calls

vLLM supports parsing the tool calling content from the model generation into structured messages:
```shell
vllm serve Qwen/Qwen3-8B --enable-auto-tool-choice --tool-call-parser hermes
```

For more information, please refer to [our guide on Function Calling](../framework/function_call.md#vllm).

:::{note}
As of vLLM 0.8.4, parsing the thinking content and the tool calls from the model generation at the same time is not supported.
:::
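
To illustrate the client side, here is a minimal sketch with a hypothetical `get_current_weather` tool; the tool name and schema are illustrative only:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Hypothetical tool definition in the OpenAI function-calling format
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    tools=tools,
)
# With --enable-auto-tool-choice, parsed tool calls arrive as structured objects
for tool_call in chat_response.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)
```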

### Structured/JSON Output

vLLM supports structured/JSON output. 
Please refer to [vLLM's documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#extra-parameters-for-chat-api) for the `guided_json` parameter.
In addition, it is recommended to instruct the model to generate the specific format in the system message or in your prompt.
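
For instance, a minimal sketch that constrains the output to a hypothetical two-field schema, passed through `extra_body`:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Hypothetical schema for illustration
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "system", "content": "Answer with a JSON object matching the requested schema."},
        {"role": "user", "content": "Name one well-known large language model and its release year."},
    ],
    extra_body={"guided_json": json_schema},
)
print(chat_response.choices[0].message.content)
```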


### Serving Quantized Models

Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.

The commands for serving these models are the same as for the original models, except for the change in model name:
```shell
# For the FP8 quantized model
vllm serve Qwen/Qwen3-8B-FP8

# For the AWQ quantized model
vllm serve Qwen/Qwen3-8B-AWQ
```

:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability 8.9 or higher, that is, Ada Lovelace, Hopper, and later GPUs.

FP8 models will also run on GPUs with compute capability 8.0 or higher (Ampere and later) as weight-only W8A16, utilizing FP8 Marlin.
:::

:::{important}
As of vLLM 0.8.4, there are compatibility issues between `vllm` and the Qwen3 FP8 checkpoints.
As a quick fix, make the following changes to the file `vllm/vllm/model_executor/layers/linear.py`:
```python
# these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
...
shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
shard_size = self._get_shard_size_mapping(loaded_shard_id)
# add the following code
if isinstance(param, BlockQuantScaleParameter):
    weight_block_size = self.quant_method.quant_config.weight_block_size
    block_n, _ = weight_block_size[0], weight_block_size[1]
    shard_offset = (shard_offset + block_n - 1) // block_n
    shard_size = (shard_size + block_n - 1) // block_n
# end of the modification
param.load_qkv_weight(loaded_weight=loaded_weight,
                        num_heads=self.num_kv_head_replicas,
                        shard_id=loaded_shard_id,
                        shard_offset=shard_offset,
                        shard_size=shard_size)
...
```
:::


### Context Length

The context length for Qwen3 models in pretraining is up to 32,768 tokens.
To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied.
We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

vLLM supports YaRN, which can be configured as
```shell
vllm serve Qwen/Qwen3-8B --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

:::{note}
vLLM implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required. 
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0. 
:::

:::{note}
The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by vLLM if `--max-model-len` is not specified.
This allocation reserves 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN, as it may degrade model performance.
:::

## Python Library

vLLM can also be used directly as a Python library, which is convenient for offline batch inference but lacks some API-only features, such as parsing the model generation into structured messages.

The following shows the basic usage of vLLM as a library:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Configure the sampling parameters (for thinking mode)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)

# Initialize the vLLM engine
llm = LLM(model="Qwen/Qwen3-8B")

# Prepare the input to the model
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
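
Since the reasoning parser is an API-only feature, in library mode you can separate the thinking content from the final answer yourself. A minimal sketch, assuming the default Qwen3 chat template, which wraps thinking in `<think>...</think>` tags:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split raw generated text into (thinking, answer) parts."""
    marker = "</think>"
    if marker in text:
        # Everything before the closing tag is thinking; the rest is the answer.
        thinking, _, answer = text.partition(marker)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()

# Continuing from the example above:
thinking, answer = split_thinking(generated_text)
print(f"Thinking: {thinking!r}")
print(f"Answer: {answer!r}")
```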


## FAQ

You may encounter out-of-memory (OOM) issues, which can be quite annoying.
We recommend two arguments for mitigating them; see the example after this list.

- The first is `--max-model-len`.
  Our provided default `max_position_embeddings` is `40960`, so the maximum serving length is also this value, which leads to higher memory requirements.
  Reducing it to a length appropriate for your workload often helps with OOM issues.
- The other argument to pay attention to is `--gpu-memory-utilization`.
  vLLM pre-allocates this fraction of GPU memory; by default, it is `0.9`.
  This is also why a vLLM service always appears to take up so much memory.
  If you are in eager mode (not the default), you can increase it to tackle OOM problems.
  Otherwise, CUDA graphs are used, which consume GPU memory not controlled by vLLM, and you should try lowering the value.
  If that doesn't work, try `--enforce-eager`, which may slow down inference, or reduce `--max-model-len`.
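
For example, a more memory-conservative launch might look like the following; the values are illustrative and should be tuned for your hardware and workload:

```shell
vllm serve Qwen/Qwen3-8B --max-model-len 8192 --gpu-memory-utilization 0.8
```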